Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime/debug: add SetCrashOutput(file *os.File) #42888

Closed
egonelbre opened this issue Nov 30, 2020 · 128 comments
Closed

runtime/debug: add SetCrashOutput(file *os.File) #42888

egonelbre opened this issue Nov 30, 2020 · 128 comments

Comments

@egonelbre
Copy link
Contributor

Currently there are difficulties with nicely capturing crash output from Go programs. Crashes print to stderr, however, usually both stdout and stderr are used by programs for other purposes. While it's not wrong to output it to stderr, it mixes two outputs together making later separation more difficult.

Capturing crashes is useful for post-mortem debugging and sending reports. Of course, due to unknown crashed program state using a custom callback is understandably dangerous.

A simpler alternative would be to allow changing file descriptor where to print crashes. In the simplest form it could look like:

package main
import "runtime/debug"

func init(){
	debug.SetCrashOutputFD(3)
}

This would allow passing in a fd for separate file, pipe to another "crash monitor" process or connect to an external server. Of course there's a slight possibility of write taking a long time, when passing in a pipe/connection slowing down the crashing of the program.

With regards to naming it could also be SetCrashFD, SetTracebackFD or SetTracebackOutputFD.

@gopherbot gopherbot added this to the Proposal milestone Nov 30, 2020
@ianlancetaylor ianlancetaylor changed the title proposal: allow setting fd for crash output proposal: runtime/debug: allow setting fd for crash output Nov 30, 2020
@ianlancetaylor ianlancetaylor added this to Incoming in Proposals (old) Nov 30, 2020
@prattmic
Copy link
Member

As a data point, the gVisor project wants to redirect panic/throw output. It works around this problem by dup'ing over stderr with the FD of the desired crash output location (https://cs.opensource.google/gvisor/gvisor/+/master:runsc/cli/main.go;l=194-198?q=dup3&ss=gvisor%2Fgvisor). stderr is otherwise unused by the program, so this works OK. This is a workable solution, but it would certainly be nicer to be able to change the destination directly.

@mvdan
Copy link
Member

mvdan commented Nov 30, 2020

Thinking outloud, the runtime prints other output besides just crashes, such as what one can enable via GODEBUG. Should that be included in this option?

@prattmic
Copy link
Member

I'd think (initially, at least), yes. I'd probably call this something like SetRuntimeFD for any runtime-generated output.

On the other hand, there are nice opportunities if a separate "fatal" FD could be provided that only includes fatal panic/throw output, plus any related prints immediately before a throw. That would allow a nice simplification of "any bytes received on this FD indicates a crash". But it means extra complexity for a rather edge case [1].

[1] I can't think of any non-fatal runtime output that isn't behind a GODEBUG option.

@egonelbre
Copy link
Contributor Author

egonelbre commented Nov 30, 2020

I would say, yes, to including other runtime output as well. If there's a need for additional separation it can be always introduced later. But, I wouldn't be also opposed to introducing it immediately either.

I think the fatal output can be detected when the fd write closes due to the program stopping.

@alexbrainman
Copy link
Member

@egonelbre how do you propose to use your API on Windows?

I agree. I wanted this feature to capture crash dumps of a Windows service.

/cc @zx2c4

Alex

@zx2c4
Copy link
Contributor

zx2c4 commented Dec 1, 2020

I've handled this until now with defer/recover in each go routine, which is really very subpar and doesn't handle all crashes.

Rather than setting an output fd, what about setting an output callback? There'd be certain restrictions on what it could do, obviously, but it could still probably accomplish most things that people want. In my case, it'd just copy the bytes to some mmap'd ring buffer log file.

@egonelbre
Copy link
Contributor Author

@alexbrainman it would behave in a similar manner that you can provide a fd that outputs to a file or a pipe. Flushing it to Event Log would require some external "crash reporting" service that reads that file or pipe. This isn't ideal, but would allow better handling than what can be currently done. Ideal scenario would be to write directly to event-log, however that would probably require specialized machinery.

@zx2c4 I believe the callback has been proposed a few times, however I cannot find the exact proposals. I think the fundamental issue is that you don't know really much about the crashed system - what's broken and what's not. e.g. maybe you got a panic during sigprof where some code holds a runtime lock "x", which your code needs and you'll deadlock when calling file.Write. This is all hypothetical of course.

@zx2c4
Copy link
Contributor

zx2c4 commented Dec 1, 2020

what's not. e.g. maybe you got a panic during sigprof where some code holds a runtime lock "x", which your code needs and you'll deadlock when calling file.Write. This is all hypothetical of course.

Right. If you're using debug.SetCrashCallback, then you are signing up for that kind of fun, and it's up to to write Go code that calls the minimal set of things needed. If the runtime can already do that for spitting to a fd, so can a custom handler function.

Ideal scenario would be to write directly to event-log, however that would probably require specialized machinery.

Writing to event log isn't too bad. It'd support the callback-based approach I mentioned.
(In my case, log writes look like this: https://git.zx2c4.com/wireguard-windows/tree/ringlogger/ringlogger.go#n105 )

@prattmic
Copy link
Member

prattmic commented Dec 1, 2020

Right. If you're using debug.SetCrashCallback, then you are signing up for that kind of fun, and it's up to to write Go code that calls the minimal set of things needed. If the runtime can already do that for spitting to a fd, so can a custom handler function.

This means no heap allocation, no stack splits, no map access (IIRC), particularly if you want this to cover runtime throws. Not impossible, but a pretty high bar to set for an API.

@zx2c4
Copy link
Contributor

zx2c4 commented Dec 1, 2020

This means no heap allocation, no stack splits, no map access (IIRC), particularly if you want this to cover runtime throws. Not impossible, but a pretty high bar to set for an API.

But also not so out of place for something in debug, right? We're offering some way to hook into runtime internals, and with that comes the responsibilities of being in the runtime. But maybe there's a better idea here:

The other way of doing this might be to have an unstable and unexported function runtime.setCrashHook, which can then be used by wrapper packages like:

//go:linkname setCrashHook runtime.setCrashHook

That then represents the lowest level. At a higher level, then the debug package can use that to implement things like debug.SetCrashFD and the Windows service package can implement things like svc.SetCrashEventlog. And then most users use these normal functions that our libraries can provide.

Meanwhile, insane users like me can dip down into runtime.setCrashHook, along with all of its dangers, for hooking it up my custom ringbuffer mmap'd logger thing, knowing full well that if anything goes wrong, this is unsupported and my fault, etc.

@ianlancetaylor
Copy link
Contributor

The restrictions on an output callback would be severe, as @prattmic says, and failure modes would be unpredictable. In Go we try to avoid that kind of deeply unsafe operation.

I think it would be better to keep this issue focused on the much safer operation of setting a file descriptor.

That said, I don't understand what would happen if the descriptor is a pipe that nothing is reading from. Or a file opened on a networked file system that is not responding. What should do in cases like that?

@zx2c4
Copy link
Contributor

zx2c4 commented Dec 2, 2020

The restrictions on an output callback would be severe, as @prattmic says, and failure modes would be unpredictable. In Go we try to avoid that kind of deeply unsafe operation.

What do you think of my proposal above of allowing this to be a general callback via go:linkname -- i.e. just for libraries? That's a mere "implementation detail", but it'd go a long way of enabling this to be extensible for the adventurous.

@prattmic
Copy link
Member

prattmic commented Dec 2, 2020

That said, I don't understand what would happen if the descriptor is a pipe that nothing is reading from. Or a file opened on a networked file system that is not responding. What should do in cases like that?

I'm not sure we need to worry about those (beyond documentation, perhaps). stderr could already be any of those kinds of descriptors, as set by the parent process, so the same problems would exist today.

@egonelbre
Copy link
Contributor Author

That said, I don't understand what would happen if the descriptor is a pipe that nothing is reading from. Or a file opened on a networked file system that is not responding. What should do in cases like that?

I guess this is the main danger with the proposal. As @prattmic mentioned, somebody could pipe stderr to somewhere that isn't being read. I didn't test it, but I think that would block in the same way.

Using a non-blocking write that drops data when the fd is not responding would be nice, however I suspect that would be difficult to implement.

It'll definitely need examples how to write the listening side.

Other than that, I don't have any further ideas.

@egonelbre
Copy link
Contributor Author

This might be still difficult due to the constraints, but if the writing is on a different thread and the previous write hasn't finished in appropriate time, the write can be dropped. The threshold could be configured via GODEBUG or similar.

@rsc
Copy link
Contributor

rsc commented Dec 2, 2020

Should we just have GODEBUG=debugfd=3?

/cc @aclements

@rsc rsc moved this from Incoming to Active in Proposals (old) Dec 2, 2020
@ianlancetaylor
Copy link
Contributor

@zx2c4 I'm not fond of documenting a callback to use with go:linkname either. I tend to think that people who operate at that level can customize their Go distribution.

@aclements
Copy link
Member

I'd be okay with GODEBUG=debugfd=3. It's not straightforward to implement, though, since we don't distinguish between runtime output and anything else printed by print(ln). It would actually be much easier to implement just for tracebacks because we could use something like the gp.writebuf redirection, just to an FD instead. But maybe that's better anyway?

I agree with @alexbrainman that I'm not sure how this would work on Windows. Internally, we treat any "fd" other than 1 or 2 as a raw Windows handle value, but I don't know if such a thing can be meaningfully passed as an environment variable. Is it possible for a parent process to create a handle and pass it to a child like this?

On the topic of a callback, in addition to the subtle and rather onerous restrictions on the callback, those restrictions can also change with new releases. We don't expose anything remotely like that right now.

@egonelbre
Copy link
Contributor Author

egonelbre commented Dec 2, 2020

I did think about GODEBUG, but I suspect that would be complicated to use in Windows services since you don't have a nice way to figure out the fd prior to starting it. The main process needs to respond to messages (https://github.com/golang/sys/blob/ef89a241ccb347aa16709cf015e91af5a08ddd33/windows/svc/example/service.go#L23).

debug.SetCrashFD allows the main process to start a "monitor process" and set the fd accordingly, while the main process can handle the "service" responsibilities.

@alexbrainman
Copy link
Member

@alexbrainman it would behave in a similar manner that you can provide a fd that outputs to a file or a pipe. Flushing it to Event Log would require some external "crash reporting" service that reads that file or pipe. This isn't ideal, but would allow better handling than what can be currently done. Ideal scenario would be to write directly to event-log, however that would probably require specialized machinery.

@egonelbre I don't understand what you are suggesting.

If I have a Windows service written in Go, how can I redirect its crash dump to a file by using debug.SetCrashOutputFD(3) ? Let's say I want crash dump file to be called c:\a.txt.

Similarly, I don't see how using debug.SetCrashOutputFD(3) would help me write to Event Log.

Maybe we can use file path instead of file descriptor? Like debug.SetCrashOutput("/tmp/a.txt").

Alex

@aclements
Copy link
Member

@alexbrainman , for writing to a file, I imagine you would open the file and then pass that handle to debug.SetCrashOutputFD (or whatever). I don't know how the event log works, but I imagine you'd start a second process with a pipe between them and the second process would be responsible for writing to the event log.

debug.SetCrashFD allows the main process to start a "monitor process" and set the fd accordingly, while the main process can handle the "service" responsibilities.

Is there a reason this can't be done the other way around? The first process is the monitor process and opens the FD, then starts the second process with the GODEBUG environment set. Does this not work for Windows services?

(I'm not necessarily opposed to having a runtime/debug function for this, just trying to understand the space of constraints. One nice thing about the environment variable is that it can catch very early panics, like during package init.)

@zx2c4
Copy link
Contributor

zx2c4 commented Dec 2, 2020

@zx2c4 I'm not fond of documenting a callback to use with go:linkname either. I tend to think that people who operate at that level can customize their Go distribution.

Yea, documenting it is probably the wrong thing to do. But having something in there akin to nanotime would be "very nice"...and would allow me to implement something for Go's Windows service package that uses it to ship things to eventlog automatically.

Anyway, I realize pushing this too hard is an loosing battle, for good reason. But if somehow the implementation happened to be structured in a way that made it coincidentally possible to go:linkname it from elsewhere, I would be very happy.

@zx2c4
Copy link
Contributor

zx2c4 commented Dec 2, 2020

Is it possible for a parent process to create a handle and pass it to a child like this?

You can make a handle inheritable, and then pass its value as an argument to the new process. The value will remain the same in the new process. Alternatively, DuplicateHandleEx allows the one process to duplicate a handle into another process.

@egonelbre
Copy link
Contributor Author

@alexbrainman, so roughly what I'm thinking is the following. However, I might also forget some internal details that could make the following difficult with Windows.

The most basic thing is to send the output to a separate file:

package main

import (
	"os"
	"runtime/debug"
)

func main() {
	f, err := os.Open("service_log.txt")
	if err != nil {
		os.Exit(1)
		return
	}

	debug.SetCrashFD(f.Fd())

	// rest of the service logic
}

The next step would be to create a watchdog sub-process:

package main

import (
	"os"
	"os/exec"
	"runtime/debug"
)

func main() {
	r, w, err := os.Pipe()
	if err != nil {
		os.Exit(1)
		return
	}

	// this could also self-start, but in a different mode
	// but we'll use a separate binary for now
	cmd := exec.Command("watchdog.exe") 
	cmd.Stdin = r
	// also set other flags here to ensure that the watchdog
	// doesn't immediately close together with the parent.
	cmd.Start()

	debug.SetCrashFD(w.Fd())

	// rest of the service logic
}

You could also create governors using ExtraFiles:

package main

import (
	"os"
	"os/exec"
	"runtime/debug"
)

func main() {
	r, w, err := os.Pipe()
	if err != nil {
		os.Exit(1)
		return
	}

	cmd := exec.Command("watchdog.exe") 
	cmd.ExtraFiles = []*os.File{w}
	cmd.Start()

	// monitor `r`
}

// The other process has `debug.SetCrashFD(3)` in init,
// with some check that `3` actually exists.

For windows, I think (named) pipes could also be used:

f, err := os.Open(`\\.\pipe\randomhexvalue`)
...
debug.SetCrashFD(f.Fd())

@hajimehoshi
Copy link
Member

Apparently, there is no plan to export some exported APIs to modify the overrideWrite, but is there other plan for this issue?

@martinrode
Copy link

What's the status on this? Proposal has been accepted but no patch yet, correct? Is anybody working on that? We have a Windows service executable which once in a while crashes, we suspect concurrent map writes, but without the traces we are pretty much lost how and where this happens. So a way to redirect that to a file would be greatly appreciated!

@egonelbre
Copy link
Contributor Author

I intended to submit a patch, but never got around to it due to various reasons. I still had it in my github inbox though. But, if you (or anyone else) wants to start working on it before I sort my things out then that's certainly welcome.

@adonovan
Copy link
Member

adonovan commented Dec 5, 2023

Through telemetry, the gopls project has recently been getting a lot of really helpful crash report information from places where we added explicit calls to bug.Reportf to assert conditions we weren't confident enough to panic on. I suspect we'd get a lot more crash reports yet, if only we could detect unhandled panics and fatals.

The idea of a runtime hook for the "recover of last resort" (like the JVM's Thread.UncaughtExceptionHandler) fills me with dread, but this proposal would enable us to develop a watchdog as a child process. The parent would call SetCrashOutput on its end of the pipe feeding the child's stdin, and the child would wait for the death of the parent then parse its input and report the crash to telemetry. Having only a textual traceback, it might be slightly impoverished in what it can report--no access to the pclntab means no function-relative line numbers---though if the parent and child are the same executable, there may be more we can do.

In short, we're interested to see this move forward.

@gopherbot
Copy link

Change https://go.dev/cl/547978 mentions this issue: runtime/debug: SetCrashOutput sets the FD for fatal panics

@adonovan adonovan self-assigned this Dec 7, 2023
@gopherbot
Copy link

Change https://go.dev/cl/548121 mentions this issue: runtime/debug: add Example use of SetCrashOutput for a watchdog

@gopherbot
Copy link

Change https://go.dev/cl/548735 mentions this issue: gopls/internal/lsp/protocol: report panics in RPC goroutines

gopherbot pushed a commit to golang/tools that referenced this issue Dec 11, 2023
This change adds recover() + bug.Reportf to the generated
dispatch routines for the server and client, so that we learn
through telemetry of unexpected panics in RPC goroutines.

Panics in other goroutines cannot yet be reported, but go1.23's
runtime/debug.SetCrashOutput feature (golang/go#42888) will
make that possible, with some extra work.

Change-Id: I893333d296096632fa47e3238a72b0ac48386375
Reviewed-on: https://go-review.googlesource.com/c/tools/+/548735
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Robert Findley <rfindley@google.com>
@gopherbot
Copy link

Change https://go.dev/cl/558256 mentions this issue: gopls: a watchdog for unexpected crashes

@gopherbot
Copy link

Change https://go.dev/cl/559503 mentions this issue: crashmonitor: a monitor for unexpected crashes

@adonovan adonovan modified the milestones: Backlog, Go1.23 Jan 31, 2024
@gopherbot
Copy link

Change https://go.dev/cl/559800 mentions this issue: runtime/debug: clarify SetCrashOutput dup behavior

@gopherbot
Copy link

Change https://go.dev/cl/559801 mentions this issue: runtime: be careful about crash FD changes during panic

gopherbot pushed a commit to golang/telemetry that referenced this issue Feb 1, 2024
This change defines a crash monitor that increments
a telemetry counter for the stack any time the Go
runtime crashes.

It relies on the runtime/debug.SetCrashOutput feature
of go1.23 (#42888); if built with an earlier
version of Go, the monitor is a no-op. This feature
causes the Go runtime to write the crash to an arbitrary
file instead of stderr.

In this case, the file is a pipe to a subprocess that
executes the application executable in monitor mode.
When the pipe is closed, the monitor parses the
program counters out of the backtrace, then uses a
the usual logic to convert it to the name of a
counter, which it then increments.

The CL includes two tests, an internal test that parses
a crash from the child process's stderr, and an
integration test that uses crashmonitor.Start and
SetCrashOutput, if available.

Updates golang/go#42888

Change-Id: Ie40f9403fa99a9e0bab3b4edc9430be5e18150d7
Reviewed-on: https://go-review.googlesource.com/c/telemetry/+/559503
Auto-Submit: Alan Donovan <adonovan@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Robert Findley <rfindley@google.com>
gopherbot pushed a commit to golang/tools that referenced this issue Feb 2, 2024
This change enables the crash monitor in gopls, so that it
increments a telemetry counter for the stack any time
the Go runtime crashes.

x/telemetry cannot be built with go1.18, which gopls still
supports (though not for long). So we must ensure that there
are no direct references to x/telemetry from gopls; all
references must go through the build-tagged non-functional
shims, which avoids the dependency.

Also, internal/telemetry can no longer depend on protocol,
to avoid a cycle via protocol -> bug -> telemetry.

Updates golang/go#42888

Change-Id: I7a57b9a93ab76a58ec6dd73f895d4b42ed29ea86
Reviewed-on: https://go-review.googlesource.com/c/tools/+/558256
Reviewed-by: Robert Findley <rfindley@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Auto-Submit: Alan Donovan <adonovan@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
gopherbot pushed a commit that referenced this issue Feb 2, 2024
Updates #42888

Change-Id: I72e408301e17b1579bbea189bed6b1a0154bd55f
Reviewed-on: https://go-review.googlesource.com/c/go/+/548121
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
ezz-no pushed a commit to ezz-no/go-ezzno that referenced this issue Feb 18, 2024
This feature makes it possible to record unhandled panics
in any goroutine through a watchdog process (e.g. the same
application forked+exec'd as a child in a special mode)
that can process the panic report, for example by sending
it to a crash-reporting system such as Go telemetry
or Sentry.

Fixes golang#42888

Change-Id: I5aa7be8f726bbc70fc650540bd1a14ab60c62ecb
Reviewed-on: https://go-review.googlesource.com/c/go/+/547978
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Alan Donovan <adonovan@google.com>
Reviewed-by: Russ Cox <rsc@golang.org>
ezz-no pushed a commit to ezz-no/go-ezzno that referenced this issue Feb 18, 2024
Updates golang#42888

Change-Id: I72e408301e17b1579bbea189bed6b1a0154bd55f
Reviewed-on: https://go-review.googlesource.com/c/go/+/548121
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
gopherbot pushed a commit that referenced this issue Feb 21, 2024
SetCrashOutput dup's the input file for safety, but I don't think that
the docs are very clear about what the caller can/should do with f. "it
does not close the previous file" is particularly confusing, as it does
close the previous FD (but not the previous passed os.File).

Expand and attempt to clarify the explanation, borrowing wording from
net.FileConn, which also dup's the input os.File.

For #42888.

Change-Id: I1c96d2dce7899e335d8f1cd464d2d9b31aeb4e5e
Reviewed-on: https://go-review.googlesource.com/c/go/+/559800
Reviewed-by: Alan Donovan <adonovan@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
gopherbot pushed a commit that referenced this issue Feb 22, 2024
There are two separate cases here:

The base case is simple: a concurrent call to SetCrashOutput while
panicking will switch the crash FD, which could cause the first half of
writes to go to the old FD, and the second half to the new FD. This
isn't a correctness problem, but would be annoying to see in practice.
Since it is easy to check for, I simply drop any changes if panicking is
already in progress.

The second case is more important: SetCrashOutput will close the old FD
after the new FD is swapped, but writeErrData has no locking around use
of the fd, so SetCrashOutput could close the FD out from under
writeErrData, causing lost writes. We handle this similarly, by not
allowing SetCrashOutput to close the old FD if a panic is in progress,
but we have to be more careful about synchronization between
writeErrData and setCrashFD to ensure that writeErrData can't observe
the old FD while setCrashFD allows close.

For #42888.

Change-Id: I7270b2cc5ea58a15ba40145b7a96d557acdfe842
Reviewed-on: https://go-review.googlesource.com/c/go/+/559801
Reviewed-by: Alan Donovan <adonovan@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Accepted
Development

No branches or pull requests