runtime: "runtime·unlock: lock count" on Go 1.10.1 when cgo is enabled #25128
/CC @ianlancetaylor
Beyond that, this sounds like memory corruption. You should try to eliminate that as a cause first.
@bcmills Unfortunately the rest of the stack trace doesn't seem to get logged, and I have no idea why. I ran tests with debug-level logging enabled and eventually all I get is that throw (sometimes just a few times, sometimes the same line gets emitted hundreds of times). Aside from the normal logging done by the node exporter itself, part of the stack trace seems to just go missing. I don't know if that's a consequence of running this under systemd, but I'm having a hard time seeing how that would cause part of the stack trace to not get captured.

Given that the same code compiles and runs fine (i.e. doesn't crash) on Go 1.9.5 but does crash on 1.10.0 and 1.10.1, I'm wondering what changes have been made to cgo between those releases that could point us in the right direction. I also confirmed that building under Go 1.10.0 or 1.10.1 with CGO_ENABLED=0 does not crash.
Maybe try searching for
Thanks. The only call I could find was in logrus:
The code is here: https://github.com/sirupsen/logrus/blob/v1.0.4/writer.go#L21-L47. I have no idea if this looks sane or not. I don't see any import statements for C in logrus. The only thing mentioning
Can anyone theorize as to what might be causing the stack trace to just not show up? It's not really helping the debugging efforts and I'm wondering if finding the reason the stack trace doesn't show up might be able to point us in the direction of the bug. For example: prometheus/node_exporter#870 (comment)
FWIW, two users (myself and one more) have been able to get stack traces. I've seen crashes a couple dozen times; not sure why this time was special. My stack trace is here: The other is here:
Timed out in state WaitingForInfo. Closing. (I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)
@gopherbot Hey now, that's rude. We've provided some more info and asked for further help on how to possibly track it down. Waiting on Go project members here.
Reopening because we did get a stack trace.
I don't see anything useful in the stack trace in a quick look. How reliable is the crash? Does it fail every time or just occasionally? Is there any way we can get a copy of the program to recreate the crash ourselves? The cause is most likely memory corruption from the C code, or invalid cgo pointer passing (see https://golang.org/cmd/cgo/#hdr-Passing_pointers), but it's hard to be sure.
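For readers unfamiliar with the pointer-passing rules linked above, here is a minimal, hypothetical example (my own illustration, not code from node_exporter) of the kind of invalid pass that the runtime's default cgocheck catches immediately. Out-of-bounds writes from C, the other cause mentioned, are much harder to spot because they corrupt memory silently and only crash later.

```go
package main

/*
static void use(void *p) {}
*/
import "C"

import "unsafe"

// node contains a Go pointer, so memory holding a node with a non-nil
// next field may not be passed to C under the cgo pointer-passing rules.
type node struct {
	next *node
}

func main() {
	n := &node{next: &node{}}
	// Invalid: n points to Go memory that itself contains a Go pointer
	// (n.next). With the default GODEBUG setting (cgocheck=1) the runtime
	// panics with "cgo argument has Go pointer to Go pointer".
	C.use(unsafe.Pointer(n))
}
```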
I can typically crash the node_exporter in about 30 minutes with 3 requests per second, but none of my crashes have provided a stack trace. I will try and do some feature bisecting this week to see if I can pin it down to a part of the code that causes it.
node_exporter-0.16.0-rc.2-go-1.10.1.tar.gz
@ianlancetaylor This binary crashes for me pretty reliably, though often taking a few days. Git repo: https://github.com/prometheus/node_exporter Option used to run it:
Change https://golang.org/cl/120416 mentions this issue:
Unfortunately, in https://github.com/prometheus/node_exporter/files/2032750/node_exporter.log, the "fatal error: schedule: holding locks" happens when a goroutine is in the final throes of exiting.

prometheus/node_exporter#870 (comment) looks like it involves the recursive panic that the CL I just sent should help with. But there's also a clear recursive malloc failure in there. That one definitely seems like memory corruption, since from the stack trace it's clear that there was no recursive malloc.

If it's possible to bisect this between Go 1.9 and Go 1.10, that could be very useful.
Currently, if lock or unlock calls throw because the g.m.lock count is corrupted, we're unlikely to get a stack trace because startpanic_m will itself attempt to acquire a lock, causing a recursive failure. Avoid this by forcing the g.m.locks count to a sane value if it's currently bad.

This might be enough to get a stack trace from #25128.

Change-Id: I52d7bd4717ffae94a821f4249585f3eb6cd5aa41
Reviewed-on: https://go-review.googlesource.com/120416
Reviewed-by: Ian Lance Taylor <iant@golang.org>
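To make the failure mode the CL describes easier to follow, here is a toy, self-contained illustration (not the actual runtime code): the error-reporting path needs the very bookkeeping that is corrupted, so unless the count is first forced back to a sane value, reporting the error fails recursively and the original stack trace is lost.

```go
package main

import "fmt"

// m stands in for the runtime's per-thread structure; locks plays the role
// of g.m.locks in the CL above.
type m struct {
	locks int
}

// lock panics if the lock count is corrupted, mirroring the
// "runtime·lock: lock count" style throw.
func (mp *m) lock() {
	if mp.locks < 0 {
		panic("lock: lock count")
	}
	mp.locks++
}

// report is the analogue of startpanic_m: it must take a lock to print, so
// it first clamps a corrupted count to a sane value; without that guard the
// lock call below would fail recursively and hide the original error.
func report(mp *m, msg string) {
	if mp.locks < 0 {
		mp.locks = 0
	}
	mp.lock()
	fmt.Println("fatal error:", msg)
}

func main() {
	mp := &m{locks: -1} // simulate a corrupted lock count
	report(mp, "unlock: lock count")
}
```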
I am in the process of doing this. It seems like it has to do with
I have provided some more (rather node_exporter-specific) analysis in prometheus/node_exporter#870. Also, this issue looks very similar to #22227 (except for the platform).
We currently do not have a "simple" reproducer as in "some tens of lines of code". However, this crashes pretty reliably for me (after ~1-10 minutes):
The error messages vary between:
Will update when I've traced down the actual Go commit which makes the difference here.
One of the nicest stack traces so far (with yet another error message):
The bisect ended by pointing to 4f34a52 as the first bad commit. Will verify in the coming days (Go at the mentioned commit definitely contains the problematic change [it crashes]; to verify it is indeed the causing change, I'll have to run some long-term test on the previous commit to make sure).
BTW,
Keep in mind that just because bisect points at a commit does not mean the commit is bad. It could be a perfectly correct commit that is just exposing an issue that already existed. I could certainly see how thrashing OS threads more often by forcing termination when a locked goroutine exits could expose other issues (e.g. memory corruption) elsewhere.
I'm no expert in the area, but that's certainly an interesting thought. node_exporter does use a fair amount of Cgo, so perhaps it's the combination of a potential bug in our Cgo code and the fact that my netlink package is killing off OS threads. I did try to set up a minimal reproduction with my (Go-only) code and wasn't able to reproduce the issue, even though the
Interesting; @mdlayher managed to stop the bug by adding a
If I had to guess, the call sequence is
So, how the heck is mstart() being called with locks set?
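To make the pattern under discussion concrete, here is a hypothetical sketch (my own, not the netlink package's actual code) of a worker goroutine that serves requests on a locked OS thread and then exits without unlocking; since Go 1.10 (CL 46038, referenced in the patch later in this thread) the runtime destroys that thread instead of returning it to the pool, which is the thread churn being discussed.

```go
package main

import "runtime"

// request is an illustrative stand-in for whatever per-thread work
// (e.g. netlink syscalls) the real worker performs.
type request struct {
	arg  int
	resp chan int
}

// worker pins itself to an OS thread and serves requests until the channel
// is closed. It never calls runtime.UnlockOSThread, so when it returns, a
// Go 1.10+ runtime terminates the underlying thread rather than reusing it.
func worker(reqs <-chan request) {
	runtime.LockOSThread()
	for r := range reqs {
		r.resp <- r.arg * 2 // placeholder for the real per-thread work
	}
	// Goroutine exits here while still locked to its thread.
}

func main() {
	reqs := make(chan request)
	go worker(reqs)

	resp := make(chan int)
	reqs <- request{arg: 21, resp: resp}
	<-resp

	close(reqs) // worker exits; its OS thread is torn down
}
```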
FWIW I'm having rather hit-and-miss results reproducing this - I had it crash in 2 seconds once, but other times it's taking 30 minutes to hours. I'm using vegeta at ~200 qps. It doesn't seem like the crashes scale linearly with load. I'm switching to ApacheBench now (which, unlike vegeta, can max out performance at ~1000 requests/sec without triggering runaway thread creation), but I'm not sure it'll make much of a difference. It would be helpful to improve reproducibility, otherwise tracking this down is going to be a huge pain. Anyone have any hints?
Interestingly, I left it all night running with ApacheBench (at ~800 qps) and it didn't crash. So either there's something different about how ApacheBench does things, or actively having requests always queued prevents the issue from reproducing.
I was able to cause a couple of interesting crashes this morning. While I can't fully rule out memory corruption at this point, the race detector-instrumented program crashed before pointing out any data races. I'm not using very much
Program: https://github.com/mdlayher/netlink/blob/master/cmd/nlstress/main.go
I'll try again with the following patch applied to see if it makes a difference if I unlock the OS thread when the syscall-handling goroutine exits.

```
✔ ~/src/github.com/mdlayher/netlink [master|…1]
12:16 $ git diff --patch
diff --git a/conn_linux.go b/conn_linux.go
index aeaa38a..5639fb5 100644
--- a/conn_linux.go
+++ b/conn_linux.go
@@ -325,6 +325,7 @@ func newSysSocket(lockThread bool) *sysSocket {
 			// the goroutine exits starting in Go 1.10:
 			// https://go-review.googlesource.com/c/go/+/46038.
 			runtime.LockOSThread()
+			defer runtime.UnlockOSThread()
 		}
 		defer wg.Done()
```

EDIT: I've been running the non-race test for two hours now with the above patch applied and have not seen any crashes. I'll try with -race.

EDIT 2: No crash for over two hours with -race either.

EDIT 3: I've applied the above patch to master in mdlayher/netlink@3d8cc9a. If you'd like to try reproducing this issue as above, just undo this patch.
From my perspective, the question is whether the bug is purely in the runtime or some interaction between the program and the runtime. In other words, if you create, lock, and kill a bunch of Ms without any cgo, unsafe calls, or syscalls, will we still see this crash? So, I'm running some tests as well. I suspect at least one other factor will be required, and this is not a pure runtime bug.
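A minimal sketch of what such a pure-runtime test might look like (an assumption on my part, not the test actually being run): repeatedly create goroutines that lock their OS thread and then exit while still locked, with no cgo, unsafe, or syscalls involved, so the runtime constantly destroys and recreates Ms.

```go
package main

import (
	"runtime"
	"sync"
)

// churnLockedThreads creates, locks, and kills n threads: each goroutine
// locks itself to an OS thread and exits while still locked, forcing the
// runtime to tear that thread down and start fresh ones later.
func churnLockedThreads(n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			runtime.LockOSThread()
			// No UnlockOSThread: the thread dies with this goroutine.
		}()
	}
	wg.Wait()
}

func main() {
	for {
		churnLockedThreads(100)
	}
}
```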
I completely forgot about this thing. At least with regards to the node exporter I've not experienced anything like this in the past 2 years. It's most likely safe to close at this point.
I have a similar issue here: https://github.com/containerd/containerd/actions/runs/3237587338/jobs/5304881990#step:9:56393
This is with go1.19.2.
Please answer these questions before submitting your issue. Thanks!
What version of Go are you using (`go version`)?

Go 1.10.1

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (`go env`)?

amd64
What did you do?
We have an application that makes use of CGO to collect kernel metrics. When we compile with Go 1.9.5, or with 1.10.1 and CGO_ENABLED=0, we see no crashes. With CGO enabled and Go 1.10.1 we see crashes.
Our test setup is documented here.
What did you expect to see?
No crashes
What did you see instead?