runtime: "runtime·unlock: lock count" on Go 1.10.1 when cgo is enabled #25128

Closed
SuperQ opened this issue Apr 27, 2018 · 29 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@SuperQ

SuperQ commented Apr 27, 2018

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

Go 1.10.1

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

amd64

What did you do?

We have an application that makes use of CGO to collect kernel metrics. When we compile with Go 1.9.5, or with 1.10.1 and CGO_ENABLED=0, we see no crashes. With CGO enabled and Go 1.10.1 we see crashes.

Our test setup is documented here.

What did you expect to see?

No crashes

What did you see instead?

fatal error: runtime·unlock: lock count
@SuperQ
Author

SuperQ commented Apr 27, 2018

/CC @ianlancetaylor

@bcmills bcmills changed the title CGO fatal error: runtime·unlock: lock count runtime: "runtime·unlock: lock count" on Go 1.10.1 when cgo is enabled Apr 27, 2018
@bcmills
Contributor

bcmills commented Apr 27, 2018

runtime·unlock: lock count is a throw. Where is the rest of the stack trace?

Beyond that, this sounds like memory corruption. You should try to eliminate that as a cause first.

  • Are your C dependencies installing signal handlers? If so, are they setting SA_ONSTACK as required by the Go runtime?
  • Are you using finalizers in your cgo wrapper? If so, you may be missing some needed calls to runtime.KeepAlive (see the sketch after this list). (I seem to recall that the garbage collector got a little more aggressive in 1.10.)
  • Does your program report any races when compiled with -race?
  • Does your program report any failures when you compile using -fsanitize=memory (for the C compiler) and -msan (for the Go compiler)?
  • Can you construct a smaller reproducer?
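
A minimal sketch of the finalizer/KeepAlive point above, with hypothetical names (not node_exporter code): a cgo wrapper whose finalizer frees C memory needs a runtime.KeepAlive after the last C call, otherwise the collector may run the finalizer while C code is still using the memory.

package main

/*
#include <stdlib.h>
#include <string.h>
*/
import "C"

import (
    "runtime"
    "unsafe"
)

// buffer is a hypothetical cgo wrapper that owns C memory and frees it
// from a finalizer once the Go value becomes unreachable.
type buffer struct {
    ptr unsafe.Pointer
}

func newBuffer(n int) *buffer {
    b := &buffer{ptr: C.malloc(C.size_t(n))}
    runtime.SetFinalizer(b, func(b *buffer) { C.free(b.ptr) })
    return b
}

// use hands the C memory to C code. Without the KeepAlive, the collector may
// decide b is unreachable as soon as b.ptr has been read, run the finalizer,
// and free the memory while C is still writing to it.
func (b *buffer) use() {
    C.memset(b.ptr, 0, 16)
    runtime.KeepAlive(b) // keep the finalizer pending until the C call returns
}

func main() {
    newBuffer(16).use()
}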

@bcmills bcmills added WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Apr 27, 2018
@bcmills bcmills added this to the Go1.10.2 milestone Apr 27, 2018
@andybons andybons modified the milestones: Go1.10.2, Go1.10.3 Apr 27, 2018
@daenney

daenney commented Apr 28, 2018

@bcmills Unfortunately the rest of the stack trace doesn't seem to get logged. I have no idea why. I ran tests with debug level logging enabled and eventually all I get is that throw (sometimes just a few times, sometimes the same line gets emitted hundreds of times). Aside from the normal logging done by the node exporter itself, part of the stack trace seems to just go missing. I don't know if that's a consequence of running this under systemd, but I'm having a hard time seeing how that would cause part of the stack trace to not get captured.

Given that the same code compiles and runs fine (i.e. doesn't crash) on Go 1.9.5 but does crash on 1.10.0 and 1.10.1, I'm wondering what changes were made to cgo between those releases that could point us in the right direction. I also confirmed that building under Go 1.10.0 and 1.10.1 with CGO_ENABLED=0 results in non-crashing builds.

@daenney

daenney commented Apr 29, 2018

  • I did a go build -race for the node exporter, no errors reported.

  • I tried a go build -msan and simply got this:

    # runtime/cgo
    gcc: error: unrecognized argument to -fsanitize= option: 'memory'
    
    $ gcc --version
    gcc (GCC) 7.3.1 20180406
    

    I tried this with both Go 1.9.5 and 1.10.1, same error. AFAIK -fsanitize=memory is only valid for clang and has no gcc equivalent. According to go env it correctly detected GCC: CC="gcc". I'm guessing the Go tooling gets confused by the fact that /usr/bin/cc is symlinked to /usr/bin/gcc on Arch, and calls that assuming it's clang, essentially passing in the wrong flags? Or it simply defaults to calling cc under the assumption that it's clang when -msan is passed? It shouldn't, though, given it correctly detected CC as gcc and CXX as g++. I would expect it to raise an error if a flag is passed that's invalid for the detected CC and CXX.

  • env CC=clang env CXX=clang++ go build -msan resulted in no errors

  • Wrt constructing a smaller reproducer, I honestly have no idea where to start. I don't even know if it's in the node exporter itself or in one of its dependencies that uses cgo. Any pointers on how one could go about tracking this down would be appreciated.

  • Given it's related to CGO, I did a quick search for "C" in the code base. The only possibly relevant hits I can find in node exporter itself are:

    • collector/timex.go
    • collector/loadavg_unix.go, though I suspect that doesn't get used in my builds since there is also a collector/loadavg_linux.go. I'm unsure what happens in this case: does _linux take precedence over _unix, or is this a game of chance?

    I can't find anything immediately obviously wrong with those collectors. All other collectors leveraging C wouldn't/shouldn't have gotten compiled since I did the builds on a Linux system:

    $ ag '"C"'
    
    timex.go
    20:import "C"
    
    cpu_openbsd.go
    30:import "C"
    
    filesystem_bsd.go
    32:import "C"
    
    loadavg_unix.go
    24:import "C"
    
    loadavg_solaris.go
    29:import "C"
    
    netdev_openbsd.go
    32:import "C"
    
    sysctl_bsd.go
    29:import "C"
    
    cpu_dragonfly.go
    82:import "C"
    
    meminfo_openbsd.go
    41:import "C"
    
    meminfo_darwin.go
    19:import "C"
    
    cpu_darwin.go
    46:import "C"
    
    netdev_darwin.go
    33:import "C"
    
    devstat_freebsd.go
    29:import "C"
    
    netdev_bsd.go
    35:import "C"
    
    interrupts_openbsd.go
    95:import "C"
    
    kvm_bsd.go
    26:import "C"
    
    devstat_dragonfly.go
    89:import "C"
    

    I did a similar search in the vendor/ directory. None of them would appear to apply to builds done on a Linux system:

    golang.org/x/sys/unix/mkerrors.sh
    331:	echo 'import "C"'
    
    github.com/lufia/iostat/iostat_darwin.go
    9:import "C"
    
    github.com/godbus/dbus/transport_unixcred_dragonfly.go
    13:import "C"
    
    github.com/godbus/dbus/transport_unixcred_freebsd.go
    15:import "C"
    

@bcmills
Contributor

bcmills commented May 8, 2018

Given it's related to CGO I did a quick search for "C" in the code base,

Maybe try searching for runtime.SetFinalizer? Given that the behavior changed between 1.9 and 1.10, a bad finalizer seems like one of the more likely causes. (I seem to recall that we had to fix several in the Google codebase during the internal 1.10 rollout.)

@daenney

daenney commented May 14, 2018

Thanks. The only call I could find was in logrus:

vendor/github.com/sirupsen/logrus/writer.go
44:	runtime.SetFinalizer(writer, writerFinalizer)

The code is here: https://github.com/sirupsen/logrus/blob/v1.0.4/writer.go#L21-L47. I have no idea if this looks sane or not. I don't see any import statements for C in logrus. The only thing mentioning runtime in Logrus is writer.go, in the import statements and that one call.

@daenney

daenney commented May 14, 2018

Can anyone theorize as to what might be causing the stack trace to just not show up? It's not really helping the debugging efforts and I'm wondering if finding the reason the stack trace doesn't show up might be able to point us in the direction of the bug.

For example: prometheus/node_exporter#870 (comment)

Well, that didn't help; the notimex variation crashed.

fatal error: runtime·unlock: lock count
fatal error: fatal: morestack on g0

And again, no other useful stack log.

@austin987

FWIW, two users (myself and one more) have been able to get stack traces. I've seen crashes a couple dozen times; not sure why this time was special.

My stack trace is here:
https://github.com/prometheus/node_exporter/files/2032750/node_exporter.log

The other is here:
prometheus/node_exporter#870 (comment)

@gopherbot

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

@daenney

daenney commented May 27, 2018

@gopherbot Hey now, that's rude. We've provided some more info and asked for further help on how to possibly track it down. Waiting on Go project members here.

@ianlancetaylor
Contributor

Reopening because we did get a stack trace.

@ianlancetaylor ianlancetaylor removed the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label May 27, 2018
@ianlancetaylor ianlancetaylor modified the milestones: Go1.10.3, Go1.11 May 27, 2018
@ianlancetaylor
Contributor

I don't see anything useful in the stack trace in a quick look.

How reliable is the crash? Does it fail every time or just occasionally?

Is there any way we can get a copy of the program to recreate the crash ourselves?

The cause is most likely memory corruption from the C code, or invalid cgo pointer passing (see https://golang.org/cmd/cgo/#hdr-Passing_pointers), but it's hard to be sure.
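
For reference, the pointer-passing rules linked above boil down to: C may be given a Go pointer for the duration of a call, the memory it points to must not itself contain Go pointers, and C must not keep the pointer after the call returns. A hedged sketch with hypothetical C helpers (not code from node_exporter or its dependencies):

package main

/*
#include <stdint.h>

static int64_t sum(const int64_t *v, int n) {
    int64_t s = 0;
    for (int i = 0; i < n; i++) s += v[i];
    return s;
}

static const int64_t *saved;
static void save(const int64_t *v) { saved = v; } // retaining v would break the rules
*/
import "C"

import "fmt"

func main() {
    vals := []int64{1, 2, 3}

    // OK: the slice's backing array contains no Go pointers and C only uses
    // it for the duration of the call.
    s := C.sum((*C.int64_t)(&vals[0]), C.int(len(vals)))
    fmt.Println(int64(s))

    // NOT OK: C would retain the Go pointer after the call returns. The
    // runtime's cgocheck machinery cannot detect retention, so this tends to
    // surface later as seemingly unrelated memory corruption.
    // C.save((*C.int64_t)(&vals[0]))
}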

@SuperQ
Author

SuperQ commented May 27, 2018

I can typically crash the node_exporter in about 30 minutes with 3 requests per second, but none of my crashes have provided a stack trace. I will try and do some feature bisecting this week to see if I can pin it down to a part of the code that causes it.

@austin987

node_exporter-0.16.0-rc.2-go-1.10.1.tar.gz

@ianlancetaylor This binary crashes for me pretty reliably, though often taking a few days.

Git repo: https://github.com/prometheus/node_exporter
Tarball for this binary: https://github.com/prometheus/node_exporter/archive/v0.16.0-rc.2.tar.gz

Option used to run it:
command_args="--collector.textfile.directory=/var/lib/node_exporter/ --collector.bonding --collector.buddyinfo --collector.ntp"

@gopherbot

Change https://golang.org/cl/120416 mentions this issue: runtime: avoid recursive panic on bad lock count

@aclements
Member

Unfortunately, in https://github.com/prometheus/node_exporter/files/2032750/node_exporter.log, the "fatal error: schedule: holding locks" happens when a goroutine is in the final throes of exiting (runtime.goexit0). The goroutine has already been marked dead, so we don't see it in the stack trace (though it probably wouldn't have any stack anyway, unless it called runtime.Goexit).

prometheus/node_exporter#870 (comment) looks like it involves the recursive panic the CL I just sent should help with. But there's also a clear recursive malloc failure in there. That one definitely seems like memory corruption, since from the stack trace it's clear that there was no recursive malloc.

If it's possible to bisect this between Go 1.9 and Go 1.10, that could be very useful.

gopherbot pushed a commit that referenced this issue Jun 22, 2018
Currently, if lock or unlock calls throw because the g.m.lock count is
corrupted, we're unlikely to get a stack trace because startpanic_m
will itself attempt to acquire a lock, causing a recursive failure.

Avoid this by forcing the g.m.locks count to a sane value if it's
currently bad.

This might be enough to get a stack trace from #25128.

Change-Id: I52d7bd4717ffae94a821f4249585f3eb6cd5aa41
Reviewed-on: https://go-review.googlesource.com/120416
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@ianlancetaylor ianlancetaylor modified the milestones: Go1.11, Go1.12 Jul 6, 2018
@hoffie

hoffie commented Jul 31, 2018

If it's possible to bisect this between Go 1.9 and Go 1.10, that could be very useful.

I am in the process of doing this. It seems like it has to do with runtime.LockOSThread (as @mdlayher pointed out).
My testing is focused on the difference between Go 1.9 and Go1.10.
Others (@chrj) still observe this problem on go1.11beta2 as well.

I have provided some more (rather node_exporter-specific) analysis in prometheus/node_exporter#870

Also, this issue looks very similar to #22227 (except for the platform).

We currently do not have a "simple" reproducer as in "some tens of lines of code". However, this crashes pretty reliably for me (after ~1-10 minutes):
GOMAXPROCS=2 ./node_exporter --no-collector.arp --no-collector.bcache --no-collector.bonding --no-collector.conntrack --no-collector.cpu --no-collector.diskstats --no-collector.edac --no-collector.entropy --no-collector.filefd --no-collector.filesystem --no-collector.hwmon --no-collector.infiniband --no-collector.ipvs --no-collector.loadavg --no-collector.mdadm --no-collector.meminfo --no-collector.netdev --no-collector.netstat --no-collector.nfs --no-collector.nfsd --no-collector.sockstat --no-collector.stat --no-collector.textfile --no-collector.time --no-collector.timex --no-collector.uname --no-collector.vmstat --no-collector.netclass --no-collector.xfs --no-collector.zfs

The error messages vary between:

  • fatal error: runtime·unlock: lock count
  • fatal error: systemstack called from unexpected goroutine
  • panic during panic
  • fatal: morestack on g0

Sometimes it's one line, sometimes it's thousands of lines of repeated error messages. Sometimes with stack traces, sometimes with cleanly formatted stack traces, sometimes with broken stack traces (interleaved with fatal error messages).

Will update when I've traced down the actual Go commit which makes the difference here.

@hoffie

hoffie commented Jul 31, 2018

One of the nicest stack traces so far (with yet another error message: fatal error: fatal error: stopm holding locks): https://gist.github.com/hoffie/ae0fab8aeac60d6283233eeccfe93823
And another one: https://gist.github.com/hoffie/3f55ea14cf977c86d086ac2c9830d55c

The bisect ended by pointing to 4f34a52 as the first bad commit. Will verify in the coming days (Go at the mentioned commit definitely contains the problematic change [it crashes]; to verify it is indeed the causing change, I'll have to run some long-term test on the previous commit to make sure).

BTW, GOMAXPROCS=1 seems to avoid this issue.

@marcan
Contributor

marcan commented Aug 3, 2018

Keep in mind that just because bisect points at a commit does not mean the commit is bad. It could be a perfectly correct commit that is just exposing an issue that already existed. I could certainly see how thrashing OS threads more often by forcing termination when a locked goroutine exits could expose other issues (e.g. memory corruption) elsewhere.

@mdlayher
Member

mdlayher commented Aug 3, 2018

I'm no expert in the area, but that's certainly an interesting thought. node_exporter does use a fair amount of Cgo, so perhaps it's the combination of a potential bug in our Cgo code and the fact that my netlink package is killing off OS threads.

I did try to set up a minimal reproduction with my (Go-only) code and wasn't able to reproduce the issue, even though the *netlink.Conn internally is killing its OS thread on close: prometheus/node_exporter#870 (comment).

@squeed

squeed commented Aug 3, 2018

Interesting; @mdlayher managed to stop the bug by adding a runtime.UnlockOSThread(). Before that, the process was just letting the g die. The bisected change is to let the associated m die when a locked g dies.

If I had to guess, the call sequence is

  • g dies
  • gogo(&_g_.m.g0.sched), which should be mstart
  • mstart
  • mstart1
  • stopm
  • throw("stopm holding locks")

So, how the heck is mstart() being called with locks set?

@marcan
Contributor

marcan commented Aug 3, 2018

FWIW I'm having rather hit and miss results reproducing - I had it crash in 2 seconds once, but other times it's taking 30 minutes to hours. I'm using vegeta at ~200qps. It doesn't seem like the crashes scale linearly with load. I'm switching to ApacheBench now (which unlike vegeta can max out performance at ~1000 requests/sec without triggering runaway thread creation) but I'm not sure it'll make much of a difference.

It would be helpful to improve reproducibility, otherwise tracking this down is going to be a huge pain. Anyone have any hints?

@marcan
Contributor

marcan commented Aug 4, 2018

Interestingly, I left it all night running with ApacheBench (at ~800 qps) and it didn't crash. So either there's something different about how ApacheBench does things, or actively having requests always queued prevents the issue from reproducing.

@mdlayher
Member

mdlayher commented Aug 6, 2018

I was able to cause a couple of interesting crashes this morning. While I can't fully rule out memory corruption at this point, the race detector-instrumented program crashed before pointing out any data races. I'm not using very much unsafe in these code paths in the package but I'll double check them.

matt@nerr-2:~$ go version
go version go1.10.3 linux/amd64
matt@nerr-2:~$ uname -a
Linux nerr-2 4.15.0-29-generic #31-Ubuntu SMP Tue Jul 17 15:39:52 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Program: https://github.com/mdlayher/netlink/blob/master/cmd/nlstress/main.go

go build + GOTRACEBACK=crash ./nlstress: https://gist.github.com/mdlayher/a46edb0cb062a961431ebaad741b5b24
go build -race + GOTRACEBACK=crash ./nlstress: https://gist.github.com/mdlayher/2079ac2ed8201f183c8de043c8882e14

I'll try again with the following patch applied to see if it makes a difference if I unlock the OS thread when the syscall-handling goroutine exits.

✔ ~/src/github.com/mdlayher/netlink [master|…1] 
12:16 $ git diff --patch
diff --git a/conn_linux.go b/conn_linux.go
index aeaa38a..5639fb5 100644
--- a/conn_linux.go
+++ b/conn_linux.go
@@ -325,6 +325,7 @@ func newSysSocket(lockThread bool) *sysSocket {
                        // the goroutine exits starting in Go 1.10:
                        // https://go-review.googlesource.com/c/go/+/46038.
                        runtime.LockOSThread()
+                       defer runtime.UnlockOSThread()
                }
 
                defer wg.Done()

EDIT: I've been running the non-race test for two hours now with the above patch applied and have not seen any crashes. I'll try with -race.

EDIT 2: No crash for over two hours with -race either.

EDIT 3: I've applied the above patch to master in mdlayher/netlink@3d8cc9a. If you'd like to try reproducing this issue as above, just undo this patch.

@squeed

squeed commented Aug 12, 2018

From my perspective, the question is whether the bug is purely in the runtime, or some interaction between the program and the runtime.

In other words, if you create, lock, and kill a bunch of Ms without any cgo, unsafe calls, or syscalls, will we still see this crash?

So, I'm running some tests as well. I suspect at least one other factor will be required, and this is not a pure runtime bug.
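
A hedged sketch of what such a cgo-free test might look like (an assumption about how to exercise that code path, not a confirmed reproducer): goroutines that lock their OS thread and exit without unlocking, so the runtime has to discard an M each time, which is the behavior introduced by the bisected change.

package main

import (
    "runtime"
    "sync"
)

// Stress the path where a goroutine exits while locked to its OS thread.
// Since https://go-review.googlesource.com/c/go/+/46038 (Go 1.10), the
// runtime terminates the wired thread instead of returning it to the pool.
// On a healthy runtime this should run to completion without incident.
func main() {
    for round := 0; round < 1000; round++ {
        var wg sync.WaitGroup
        for i := 0; i < 100; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                runtime.LockOSThread()
                // exit without UnlockOSThread: the M backing this goroutine
                // is thrown away when the goroutine dies
            }()
        }
        wg.Wait()
    }
}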

@bcmills
Contributor

bcmills commented Nov 17, 2021

@bcmills bcmills added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Nov 17, 2021
@daenney

daenney commented Nov 17, 2021

I completely forgot about this thing. At least with regards to the node exporter I've not experienced anything like this in the past 2 years. It's most likely safe to close at this point.

@bcmills bcmills closed this as completed Nov 18, 2021
@bcmills bcmills removed the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Nov 18, 2021
@cpuguy83

cpuguy83 commented Oct 12, 2022

I have a similar issue here: https://github.com/containerd/containerd/actions/runs/3237587338/jobs/5304881990#step:9:56393
This is from containerd/containerd#7513, which purposefully does not call runtime.UnlockOSThread after a unix.Unshare(unix.CLONE_FS).
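
For context, the pattern in question amounts to something like the following condensed sketch (hypothetical names, not the actual containerd code): lock the thread, unshare its filesystem attributes, and deliberately never unlock, so the runtime discards the tainted thread when the goroutine exits.

package main

import (
    "runtime"

    "golang.org/x/sys/unix"
)

// doInPrivateFS runs fn on a thread whose filesystem attributes (cwd, root)
// are unshared from the rest of the process. The thread is intentionally left
// locked so the runtime discards it when the goroutine exits, rather than
// handing a CLONE_FS-unshared thread back to the scheduler.
func doInPrivateFS(fn func() error) error {
    errCh := make(chan error, 1)
    go func() {
        runtime.LockOSThread()
        // no UnlockOSThread on purpose: see the comment above
        if err := unix.Unshare(unix.CLONE_FS); err != nil {
            errCh <- err
            return
        }
        errCh <- fn()
    }()
    return <-errCh
}

func main() {
    _ = doInPrivateFS(func() error {
        return unix.Chdir("/")
    })
}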

@cpuguy83

This is with go1.19.2.
