
proposal: runtime: support "strict mode" run-time checks analogous to "go vet" #37681

Closed
mvdan opened this issue Mar 5, 2020 · 24 comments

@mvdan
Member

mvdan commented Mar 5, 2020

Background

Android has had StrictMode for years, and it's very useful to catch bugs at run-time:

StrictMode is most commonly used to catch accidental disk or network access on the application's main thread, where UI operations are received and animations take place.

Go doesn't have a main thread for UI operations, so many of these sanity checks don't apply. However, a few might make sense in Go, if we can find a way to enable them appropriately.

The closest to a "strict mode" build flag that Go has today is -race, which enables the data race detector. As of Go 1.14, the flag also enables extra run-time checks, such as instrumentation to check invalid uses of unsafe pointers.

Proposed change

I propose that such "strict mode" checks should be enabled with -race. We can start by borrowing StrictMode's idea of detecting when closeable values are leaked without being closed.

Go does not have a way to detect if a value whose type implements io.Closer has been closed. Similarly, not all such values get closed automatically when finalized; such finalizers are set on a case-by-case basis, for types like net.netFD and os.File.

However, we can easily build on the existing code to catch "leaks" of such types. For example, the current finalizer for os.File is set up as follows, where a call to os.File.close removes the finalizer to avoid closing twice:

runtime.SetFinalizer(f.file, (*file).close)

We could adapt it as follows:

if race.Enabled {
    // Strict mode: instead of closing the file, report files that are
    // garbage collected without Close having been called.
    runtime.SetFinalizer(f.file, func(f *file) {
        panic("file was garbage collected without being closed: " + f.name)
    })
} else {
    // Normal behavior: close the file when it becomes unreachable.
    runtime.SetFinalizer(f.file, (*file).close)
}

Given that race.Enabled is a constant, this should add no run-time cost when the race detector isn't enabled. And, as noted above, since closing a file removes its finalizer, the panic wouldn't fire for a file that was properly closed.

Once this first check has been added and released successfully as part of a stable Go release, more issues could be filed to add more checks. For example, some potential ideas:

  • Catch non-blocked goroutines that were running when the main goroutine finished or exited
  • Catch when a goroutine gets blocked forever due to a likely bug, such as sending to or receiving from a nil channel (where a bare select {} would be more explicit and less error-prone), or blocking on a channel that we know isn't reachable by any other goroutine (see the sketch after this list).
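
For illustration, a minimal sketch of the kind of permanently blocked goroutine such a check could flag (a hypothetical program, not part of the proposal):

package main

import "time"

func main() {
    var ch chan int // nil channel: never initialized with make

    go func() {
        // Receiving from a nil channel can never proceed, so this
        // goroutine blocks forever; that almost always indicates a bug.
        <-ch
    }()

    time.Sleep(100 * time.Millisecond)
    // main returns while the goroutine above is still blocked; a run-time
    // check could report it at this point.
}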

Alternatives considered

First, we considered static analysis, such as a go/analysis checker using syntax trees and type information. However, that would only catch the most basic cases, as it's impossible to statically know at what point an allocated object would become unreachable and garbage collected. If it were possible, run-time garbage collection wouldn't be needed.

Second, we considered an external tool to detect these cases at run-time. Let's briefly cover some options:

  • A fork of the standard library. Very similar to the proposed solution, but it would be hard to maintain and use.
  • A tool to instrument the user's code before compilation, such as wrapping calls to os.OpenFile. However, this solution would require a lot of complex code, and it would be difficult to catch all leaks (such as those originating in external libraries).
  • A change in the standard library, guarded by its own build tag like +build strictmode. Very similar to the proposed solution, given that the race detector uses +build race. However, the big advantage of reusing -race is that it's well understood and widely used already, so we don't need to teach Go users about new flags or build tags they should remember. We assume this is the same reason why -d=checkptr was added to -race.
  • A tool to wrap running the binary with strace/dtrace, like leakcheck. While it can work to catch leaked files via syscalls, it won't work for other checks such as goroutines blocked forever. Depending on strace/dtrace would also add external dependencies, vastly increase complexity, and probably mean higher or less predictable run-time costs.
  • A library to manually insert run-time checks, such as goleak. While it seems possible to catch simple scenarios of "goroutine leaks" by looking at stack traces, catching other kinds of bugs such as leaked files seems near impossible to do in a correct and portable way. Moreover, this would require manually and carefully instrumenting one's code. (A rough usage sketch follows this list.)
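
For comparison, a typical goleak-based test looks roughly like this (a minimal sketch; goleak.VerifyNone is go.uber.org/goleak's documented entry point, the surrounding test is hypothetical):

package mypkg_test

import (
    "testing"

    "go.uber.org/goleak"
)

func TestNoLeakedGoroutines(t *testing.T) {
    // Fails the test if any unexpected goroutines are still running when
    // the test returns, based on inspecting goroutine stack traces.
    defer goleak.VerifyNone(t)

    // ... exercise code that is expected to clean up its own goroutines ...
}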
@beoran

beoran commented Mar 5, 2020

A small remark: Go programs do have a main thread, and on Android and other platforms we have to use runtime.LockOSThread to be sure that we are rendering, etc., on that main thread. But for the rest, extra checks are a useful idea.

@ianlancetaylor
Contributor

CC @bradfitz

@ianlancetaylor
Contributor

Another existing example of -race as StrictMode is that -race injects randomness into the goroutine scheduler. See the comment for randomizeScheduler at https://golang.org/src/runtime/proc.go?#L4933 .

@beoran

beoran commented Mar 5, 2020

This is a great idea, but it seems the flag name -race is already something of a misnomer at this point. Sorry to bikeshed, but should it perhaps be renamed to -strict?

@mvdan
Member Author

mvdan commented Mar 6, 2020

@beoran it's true that we've piggybacked on -race for similar features. I'm not sure if deprecating the name in favor of another one, or replacing it entirely, is worth the pain it will cause the users. In any case, the name is already somewhat misleading today, so this proposal doesn't introduce the problem. The renaming of the flag should be filed as a separate issue.

@bcmills
Contributor

bcmills commented Mar 11, 2020

Thinking about this some more: I think the “strict mode” check should be separate from -race.

-race currently has (or arguably should have) zero false-positives. Any issue it reports is a program behavior disallowed by the spec, and may cause the program to crash or corrupt memory depending on the platform and implementation.

In contrast, the “strict mode” checks would indicate something that is usually a problem — such as a memory or file-descriptor leak — but that (for example) might not cause incorrect behavior especially in a short-running program.

I don't want the “strict mode” checks to undermine users' confidence in -race reports, so I would prefer to keep them separate — just as we do for vet and lint.

@rsc
Contributor

rsc commented Apr 8, 2020

100% agree with Bryan. Also I'd like to see a clearer definition of what 'strict mode' means before we add a separate mode. So far it seems like just forgetting to close files?

@mvdan
Member Author

mvdan commented Apr 8, 2020

I have come to agree with Bryan too, after thinking about it for a few weeks. I think I should amend the proposal a bit, but it isn't clear what the new knob would be. Simply a standard around a strict build tag? Should that also be exposed via a flag similar to go build -race, like -strict?

So far it seems like just forgetting to close files?

For the first cut, yes. In the original issue I list a couple of other goroutine-related ideas, under "more ideas". I didn't want to spend too much effort here, because it feels like we could bikeshed for a long time on what extra checks would or would not be a good idea. If we can agree on the high-level definition, I don't think the first cut of checks is that relevant.

To me, the rough definition would be: the equivalent of vet, but at run-time instead of via static analysis. That is, program behavior that seems suspicious, and probably means a bug or unintended behavior of some sort. The key difference here is "suspicious" and "probably"; there is a possibility for false positives, which isn't the case for -race. For example, for some programs it might be okay to rely on the *os.File finalizer to close files.
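
A hypothetical example of such a program, included only to illustrate the possible false positive:

package main

import (
    "fmt"
    "io"
    "os"
)

// readConfig opens a file for read and deliberately never closes it,
// relying on the *os.File finalizer (or process exit) to release the
// descriptor. A strict-mode check would flag this, even though the
// short-lived program arguably behaves correctly.
func readConfig(path string) (string, error) {
    f, err := os.Open(path)
    if err != nil {
        return "", err
    }
    data, err := io.ReadAll(f)
    return string(data), err
}

func main() {
    cfg, err := readConfig("config.txt")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Print(cfg)
}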

@rsc rsc added this to Incoming in Proposals (old) Apr 15, 2020
@rsc rsc changed the title Proposal: add "strict mode" run-time checks to -race proposal: runtime: add "strict mode" run-time checks to -race Apr 15, 2020
@rsc
Contributor

rsc commented Apr 15, 2020

Randomizing the scheduler really was about races.
Checkptr is not about races and maybe was a mistake, but it's there now.
Let's not continue to put non-race things into race mode.

Even with checkptr, right now -race is 100% accurate: if it crashes, you have a serious problem that needs fixing. That would not be true of forgetting to close one fd at the start of your program. The right things are happening anyway - the program is not incorrect.

This sounds kind of like -lint to me, not -strict. I'm not super excited about linters in general. They tell me that my program is not what someone else wants it to be, even though it's correct. This is why I asked what else might go in. If it really is -lint, let's put our time into other things. It seems counterproductive to open the runtime to bikeshedding about linting.

It's also worth asking whether the things in -strict/-lint are expected to be fast. There is more expensive cgo checking we could do, but it slows down the program quite a lot.

@rsc rsc moved this from Incoming to Active in Proposals (old) Apr 15, 2020
@bcmills
Contributor

bcmills commented Apr 15, 2020

The three cases that @mvdan described originally — unclosed files, permanently-blocked goroutines, and abandoned runnable goroutines that aren't relevant to the program's output — are all what I would describe as “leaks”: they all cause the program to consume unneeded resources for an unbounded amount of time. (The resource being leaked varies: it may be file descriptors, or kernel resources, or in-process memory, or CPU cycles.)

So perhaps the thing to propose is a more narrowly defined “leak sanitizer”, which would cause programs to report detectably-leaked resources, and to which libraries could add their own best-effort leak detection (perhaps via a well-known build tag such as leak).

@rsc
Contributor

rsc commented Apr 22, 2020

For every one of those three categories, I can give a completely reasonable, correct program that I'd be annoyed to have reported and "sanitized" for me. I define a lint check as something that's not about correctness. These are exactly lint checks by that definition. I don't believe we should start doing lint checks in the Go runtime. That path leads to no end of bikeshedding about exactly which checks to add, and grumbling from users like me who have perfectly fine programs getting flagged.

Note that goroutine leaks show up in the pprof goroutine profile and runaway goroutines show up in pprof cpu profiles. It might be interesting to somehow flag finalized allocations in the heap profile.
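
For reference, the goroutine profile can already be inspected without new tooling; a minimal sketch using the standard runtime/pprof API:

package main

import (
    "os"
    "runtime/pprof"
)

func main() {
    // ... run the program's real work here ...

    // Dump all goroutine stacks to stderr; leaked or permanently blocked
    // goroutines show up here (debug=1 groups identical stacks together).
    pprof.Lookup("goroutine").WriteTo(os.Stderr, 1)
}

In a long-running server, the same profile is served over HTTP by importing net/http/pprof.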

@bcmills
Contributor

bcmills commented Apr 22, 2020

For every one of those three categories, I can give a completely reasonable, correct program that I'd be annoyed to have reported and "sanitized" for me. I define a lint check as something that's not about correctness. These are exactly lint checks by that definition.

I agree with that definition of a “lint check”.

However, I disagree that the existence of false-positives for these checks implies that they are “not about correctness”. A program that fails due to the system OOM killer (or due to running out of file descriptors) fails nonetheless. (Personally, I would generally consider a program with an unbounded goroutine or file-descriptor leak to be “incorrect”, and a program with a bounded leak to be “correct but inefficient”.)

So I would say that the leak checks for these categories are about correctness, but have a nonzero rate of false-positives. Moreover, those false-positives should be rare and easy to avoid by writing clearer code — for example, by making the lifetimes of goroutines and open files correlate more directly to lexical blocks and/or explicit calls in the program.
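
A hypothetical sketch of that style, in which neither the open file nor the worker goroutine can outlive the function that created them:

package mypkg

import (
    "bufio"
    "os"
    "sync"
)

// processLines ties both lifetimes to the lexical block: the file is
// closed by defer, and wg.Wait guarantees the worker goroutine has
// exited before the function returns, so neither can leak.
func processLines(path string, handle func(string)) error {
    f, err := os.Open(path)
    if err != nil {
        return err
    }
    defer f.Close()

    lines := make(chan string)
    var wg sync.WaitGroup
    wg.Add(1)
    go func() {
        defer wg.Done()
        for line := range lines {
            handle(line)
        }
    }()

    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        lines <- scanner.Text()
    }
    close(lines)
    wg.Wait()
    return scanner.Err()
}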

We already have a compile-time tool that detects correctness issues with a small-but-nonzero rate of false-positives, but that tool is vet, not lint. (The compile-time tool that detects correctness issues with a zero rate of false-positives is the compiler itself.)

I suspect that what @mvdan intends here is a run-time analogue to vet: a tool with a small-but-nonzero false-positive rate, for which the false-positives can be easily avoided by writing clearer code.


That is: we currently have three compile-time tools, but only two of the three corresponding run-time tools:

False-positives?       Static tool                           Dynamic tool
none                   compile                               -race
few, easily avoided    vet                                   (this proposal)
diagnostic             golint, other linters, -gcflags=-m    pprof, -bench, cover

@mvdan mvdan changed the title proposal: runtime: add "strict mode" run-time checks to -race proposal: runtime: support "strict mode" run-time checks analogous to "go vet" Apr 23, 2020
@mvdan
Member Author

mvdan commented Apr 23, 2020

I suspect that what @mvdan intends here is a run-time analogue to vet: a tool with a small-but-nonzero false-positive rate, for which the false-positives can be easily avoided by writing clearer code.

Correct - this is what I intended to say in an earlier comment, by "the equivalent of vet, but at run-time".

Let's not continue to put non-race things into race mode.

@bcmills already argued about doing this elsewhere, like a leak build tag, and I already agreed with him in an earlier comment. I've now retitled the issue, as we all at least agree on not touching -race now.

I'm convinced this should be part of the toolchain for two reasons:

  • Some leaks can be easily detected from the standard library (e.g. runtime for goroutine leaks, and os for files). Some might be possible via pprof or tracing, but those require extra machinery and overhead.
  • A standard would be great so that different third-party modules can agree on a knob to turn on "run-time checks analogous to vet", even if those are only around detecting leaks.

@rsc
Contributor

rsc commented Apr 29, 2020

One important difference between static checks and runtime checks is that static checks can be scoped to one package. I can decide to run go vet on my code and not worry about the fact that go vet complains about stuff in your code, even if I depend on your code. At runtime that's a lot harder. So if I write a popular library that isn't "strict" the result is I'm going to get a steady stream of bug reports telling me I should make it "strict". It's not optional in the same way as go vet.

The retitling still says the word "strict" but you mentioned a leak tag. Is this now scoped to be only about reporting leaks? If so probably the next thing to do is to enumerate the leaks we care about. And again we've used custom pprof profiles for leaks in the past and maybe that's the right answer here. Then the aggregate info can be inspected instead of reporting every single one. (An O(1) leak is not usually a leak; only O(n) leaks are.)
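
For context, a custom pprof profile of this kind can be built with runtime/pprof's Profile API; a hypothetical sketch (the package name and File wrapper are made up):

package trackedfile

import (
    "os"
    "runtime/pprof"
)

// openFiles records every file that has been opened but not yet closed,
// keyed by the stack at Open time; `go tool pprof` can then aggregate
// the outstanding (possibly leaked) files by call site.
var openFiles = pprof.NewProfile("example.com/trackedfile.open")

// File wraps *os.File so that Open and Close keep the profile in sync.
type File struct{ *os.File }

func Open(name string) (*File, error) {
    f, err := os.Open(name)
    if err != nil {
        return nil, err
    }
    tf := &File{f}
    openFiles.Add(tf, 1) // skip=1: attribute the entry to Open's caller
    return tf, nil
}

func (f *File) Close() error {
    openFiles.Remove(f)
    return f.File.Close()
}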

@josharian
Contributor

One possible example of such a check is #8606 (comment)

@bcmills
Contributor

bcmills commented Apr 29, 2020

@josharian, I think that one really is more in -race's wheelhouse. (When two variants of behavior are both allowed as “correct” by the spec, the compiler and/or runtime is free to choose antagonistically, and the resulting panics would not be false-positives.)

@bcmills
Contributor

bcmills commented Apr 29, 2020

if I write a popular library that isn't "strict" the result is I'm going to get a steady stream of bug reports telling me I should make it "strict". It's not optional in the same way as go vet.

That's true, but if you write a popular library that leaks resources (even O(1) resources), you're likely to receive the same bug reports from performance-sensitive users. (Witness, for example, a few recent CLs from @bradfitz making package initialization lazy in order to prune out O(1) “leaks” induced by otherwise-unused compile-time dependencies.)

@rsc
Contributor

rsc commented May 6, 2020

Even if this is a -leak mode, there's still a big difference between a user asking for a performance optimization (you can say no) and a user reporting that your package doesn't work with some standard (implicitly blessed) tool.

I'm still not 100% sure exactly what's being proposed, but it seems like too heavy a change - it will cause work for everyone to clean up code that was allowed as correct before.

(In contrast to the race detector, there are very few harmless races but plenty of harmless O(1) leaks, like the occasional os.File opened for read that the GC closes.)

@rsc
Contributor

rsc commented May 20, 2020

Does anyone want to take another stab at saying precisely what is being proposed here?
The discussion so far seems like it is trending toward a likely decline (too heavy a change, ends up being a language change in disguise), but I want to leave a chance to clarify if I'm misunderstanding the proposal. Thanks.

@rsc
Contributor

rsc commented May 27, 2020

Based on the discussion above, this proposal seems like a likely decline.

@rsc rsc moved this from Active to Likely Decline in Proposals (old) May 27, 2020
@mvdan
Member Author

mvdan commented May 28, 2020

I still think there's value in such checks, and I disagree that leaks are generally not bugs. The fact that most people don't notice them, and that the GC tends to clean up after a subset of them like open files, doesn't really mean they're not a problem.

Right now, if one has such leaks in a running Go program, they can be hard to find. I disagree that tools like pprof or trace can be a good solution here. For example, I routinely work with Go servers that have tens of thousands of goroutines. Sifting through those in the hope of finding a stuck goroutine feels backwards. In comparison, something like -tags leak could spot them immediately.

I think my latest thoughts are well summarized by @bcmills in #37681 (comment). I don't feel like adding to what he said, because it feels like the discussion would continue going in circles.

In hindsight, maybe I should have spent more time figuring out what qualifies as a "leak" in Go, before filing a proposal to implement a solution. But I also want to emphasize that this proposal isn't just about leaks. For example, see Josh's #37681 (comment). Unless we are OK with adding more of those to -race?

@bcmills
Contributor

bcmills commented May 29, 2020

Since @mvdan is deferring to my comment, I'll take a stab at a precise statement of what I would like the concrete proposal to be:


Define and document a well-known build tag (perhaps leak) that causes the Go runtime and packages in the Go standard library to make a best effort to report program behaviors that:

  • almost always indicate leaks of finite resources that:
    • are permanent
    • or are not (directly or indirectly) reclaimed by the runtime's garbage collector,
    • or do not affect the runtime's GC pacing enough to ensure a rate of collection proportional to the rate of exhaustion of that resource,
  • and are usually straightforward to avoid by a small change to the program, such as closing a channel, selecting on a channel, or invoking a Close method,
  • and can be attributed to an unambiguous point of creation (such as a New or Open function or a go statement).

If the runtime changes for such a mode require something that cannot be achieved with just build tags, also add a corresponding -leak flag to cmd/go. (However, I doubt that such a flag would be required.)

Also encourage authors of third-party packages to support the same build tag, with the same meaning, in their own packages.
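
To make that concrete, a third-party package could wire up such a tag roughly like this (a hypothetical sketch: "leak" is only the tag name suggested above, the Conn type and its Close method are assumed to be defined elsewhere in the package, and a mirror file guarded by //go:build !leak would define both functions as no-ops):

//go:build leak

package mypool

import "runtime"

// reportIfLeaked is called whenever a Conn is created. If the Conn later
// becomes unreachable without Close having been called, its finalizer
// reports the leak on a best-effort basis.
func reportIfLeaked(c *Conn) {
    runtime.SetFinalizer(c, func(c *Conn) {
        panic("mypool: Conn garbage collected without being closed")
    })
}

// clearLeakReport is called from Close, so properly closed Conns are
// never reported.
func clearLeakReport(c *Conn) {
    runtime.SetFinalizer(c, nil)
}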

@bcmills
Contributor

bcmills commented May 29, 2020

See also #6705, particularly #6705 (comment) and #6705 (comment).

(However, note that the leak tag proposed here would also be usable for general binaries — such as canary instances or debug builds — not just tests proper as proposed in that issue.)

@rsc
Contributor

rsc commented Jun 3, 2020

Thanks for the attempt to state the proposal clearly, @bcmills. I think that basically matches our discussion earlier. In particular, as I noted on Apr 29, O(1) leaks that will be GC'ed don't need to be fixed, and O(N) leaks show up in profiles.

No change in consensus since marking likely decline, so declined.

@rsc rsc closed this as completed Jun 3, 2020
@rsc rsc moved this from Likely Decline to Declined in Proposals (old) Jun 3, 2020
@golang golang locked and limited conversation to collaborators Jun 3, 2021