Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: executable compiled under Go 1.17.7 will occasionally wedge #52226

Closed
alecthomas opened this issue Apr 8, 2022 · 38 comments
Closed
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. OS-Darwin
Milestone

Comments

@alecthomas
Copy link

alecthomas commented Apr 8, 2022

What version of Go are you using (go version)?

$ go version
go version go1.17.7 darwin/amd64

Does this issue reproduce with the latest release?

I have not yet tried Go 1.17.8. Will do that.

What operating system and processor architecture are you using (go env)?

The executable in question is running on darwin amd64, and is cross-compiled from linux/amd64 running on GitHub Actions.

What did you do?

Not repeatably unfortunately, but one of our Go tools will occasionally wedge. The executable is cross-compiled to darwin/amd64 from linux/amd64 in GitHub Actions.

What did you expect to see?

My application not wedging and/or an uncorrupted stack trace that I could use to debug why it is wedging.

What did you see instead?

The following stack trace produced with CTRL-\:

fatal error: notesleep not on g0
panic during panic

goroutine 0 [idle]:
runtime: unexpected return pc for runtime.sigtramp called from 0x7ff8136dce2d
stack: frame={sp:0xc000009a60, fp:0xc000009ab0} stack=[0xc000002000,0xc00000a000)
0x000000c000009960:  0x0000000000000000  0x000000c0000099c0
0x000000c000009970:  0x0000000000000000  0x0000000000000000
0x000000c000009980:  0x000000c000000000  0x0000000000000000
0x000000c000009990:  0x0000000000000000  0x0000000000000000
0x000000c0000099a0:  0x000000c000009ee0  0x000000c000009f48
0x000000c0000099b0:  0x000000c000009a28  0x0000000001049811 <runtime.sigtrampgo+0x00000000000001b1>
0x000000c0000099c0:  0x000000c000000003  0x0000000001b57540
0x000000c0000099d0:  0x000000c0000099e8  0x0000000001b57160
0x000000c0000099e0:  0x0000000000000000  0x0000000000000000
0x000000c0000099f0:  0x0000000000000000  0x0000000000000000
0x000000c000009a00:  0x0000000000000000  0x0000000000000000
0x000000c000009a10:  0x0000000001b57160  0x000000c000009ee0
0x000000c000009a20:  0x000000c000009f48  0x000000c000009a50
0x000000c000009a30:  0x0000000001067e6e <runtime.sigtrampgo+0x000000000000002e>  0x0000000000000003
0x000000c000009a40:  0x000000c000009ee0  0x000000c000009f48
0x000000c000009a50:  0x000000c000009aa0  0x000000000106735c <runtime.sigtramp+0x000000000000003c>
0x000000c000009a60: <0x0000000000000003  0x000000c000009ee0
0x000000c000009a70:  0x000000c000009f48  0x000000c000009f48
0x000000c000009a80:  0x4e36ff976a55ba54  0x0000000001b578a0
0x000000c000009a90:  0x0000370100003800  0x0000000000003600
0x000000c000009aa0:  0x000000c000009ab0 !0x00007ff8136dce2d
0x000000c000009ab0: >0x00007ff7bfefe850  0x0000000000000000
0x000000c000009ac0:  0x0000000000000000  0x0000000000000000
0x000000c000009ad0:  0x0000000000000000  0x0200013100000085
0x000000c000009ae0:  0x00007ff7bfefe7b8  0x0000000000000104
0x000000c000009af0:  0x000000000acd8600  0x00007ff7bfefe7a8
0x000000c000009b00:  0x0000000000003600  0x0000000001b578e0
0x000000c000009b10:  0x0000370100003800  0x00007ff7bfefe850
0x000000c000009b20:  0x00007ff7bfefe7a8  0x0000000000000000
0x000000c000009b30:  0x00000000000000a0  0x0000000000000000
0x000000c000009b40:  0x0000000000000246  0x0000000000000016
0x000000c000009b50:  0x0000000001b578a0  0x0000370100003800
0x000000c000009b60:  0x0000000000003600  0x00007ff81368d506
0x000000c000009b70:  0x0000000000000247  0x0000000000000007
0x000000c000009b80:  0x0000000000000000  0x0000000000000000
0x000000c000009b90:  0x0000000000000000  0x000000000000037f
0x000000c000009ba0:  0x0000000000000000  0x0000000000000000
runtime.throw({0x1634b57, 0x0})
	/home/runner/.cache/hermit/pkg/go-1.17.7/src/runtime/panic.go:1198 +0x71
runtime.notesleep(0x0)
	/home/runner/.cache/hermit/pkg/go-1.17.7/src/runtime/lock_sema.go:169 +0x11a
runtime.(*rwmutex).rlock.func1()
	/home/runner/.cache/hermit/pkg/go-1.17.7/src/runtime/rwmutex.go:54 +0x78
runtime.(*rwmutex).rlock(0x0)
	/home/runner/.cache/hermit/pkg/go-1.17.7/src/runtime/rwmutex.go:41 +0x59
runtime.preemptM(0xc000066900)
	/home/runner/.cache/hermit/pkg/go-1.17.7/src/runtime/signal_unix.go:357 +0x29
runtime.preemptone(0x0)
	/home/runner/.cache/hermit/pkg/go-1.17.7/src/runtime/proc.go:5596 +0x5e
runtime.preemptall()
	/home/runner/.cache/hermit/pkg/go-1.17.7/src/runtime/proc.go:5558 +0x55
runtime.freezetheworld()
	/home/runner/.cache/hermit/pkg/go-1.17.7/src/runtime/proc.go:874 +0x73
runtime.startpanic_m()
	/home/runner/.cache/hermit/pkg/go-1.17.7/src/runtime/panic.go:1346 +0x16a
runtime.sighandler(0x3, 0x1b57540, 0xc0000099e8, 0x1b57160)
	/home/runner/.cache/hermit/pkg/go-1.17.7/src/runtime/signal_unix.go:620 +0x309
runtime.sigtrampgo(0x3, 0xc000009ee0, 0xc000009f48)
	/home/runner/.cache/hermit/pkg/go-1.17.7/src/runtime/signal_unix.go:460 +0x1b1
runtime.sigtrampgo(0x3, 0xc000009ee0, 0xc000009f48)
	<autogenerated>:1 +0x2e
runtime: unexpected return pc for runtime.sigtramp called from 0x7ff8136dce2d
stack: frame={sp:0xc000009a60, fp:0xc000009ab0} stack=[0xc000002000,0xc00000a000)
0x000000c000009960:  0x0000000000000000  0x000000c0000099c0
0x000000c000009970:  0x0000000000000000  0x0000000000000000
0x000000c000009980:  0x000000c000000000  0x0000000000000000
0x000000c000009990:  0x0000000000000000  0x0000000000000000
0x000000c0000099a0:  0x000000c000009ee0  0x000000c000009f48
0x000000c0000099b0:  0x000000c000009a28  0x0000000001049811 <runtime.sigtrampgo+0x00000000000001b1>
0x000000c0000099c0:  0x000000c000000003  0x0000000001b57540
0x000000c0000099d0:  0x000000c0000099e8  0x0000000001b57160
0x000000c0000099e0:  0x0000000000000000  0x0000000000000000
0x000000c0000099f0:  0x0000000000000000  0x0000000000000000
0x000000c000009a00:  0x0000000000000000  0x0000000000000000
0x000000c000009a10:  0x0000000001b57160  0x000000c000009ee0
0x000000c000009a20:  0x000000c000009f48  0x000000c000009a50
0x000000c000009a30:  0x0000000001067e6e <runtime.sigtrampgo+0x000000000000002e>  0x0000000000000003
0x000000c000009a40:  0x000000c000009ee0  0x000000c000009f48
0x000000c000009a50:  0x000000c000009aa0  0x000000000106735c <runtime.sigtramp+0x000000000000003c>
0x000000c000009a60: <0x0000000000000003  0x000000c000009ee0
0x000000c000009a70:  0x000000c000009f48  0x000000c000009f48
0x000000c000009a80:  0x4e36ff976a55ba54  0x0000000001b578a0
0x000000c000009a90:  0x0000370100003800  0x0000000000003600
0x000000c000009aa0:  0x000000c000009ab0 !0x00007ff8136dce2d
0x000000c000009ab0: >0x00007ff7bfefe850  0x0000000000000000
0x000000c000009ac0:  0x0000000000000000  0x0000000000000000
0x000000c000009ad0:  0x0000000000000000  0x0200013100000085
0x000000c000009ae0:  0x00007ff7bfefe7b8  0x0000000000000104
0x000000c000009af0:  0x000000000acd8600  0x00007ff7bfefe7a8
0x000000c000009b00:  0x0000000000003600  0x0000000001b578e0
0x000000c000009b10:  0x0000370100003800  0x00007ff7bfefe850
0x000000c000009b20:  0x00007ff7bfefe7a8  0x0000000000000000
0x000000c000009b30:  0x00000000000000a0  0x0000000000000000
0x000000c000009b40:  0x0000000000000246  0x0000000000000016
0x000000c000009b50:  0x0000000001b578a0  0x0000370100003800
0x000000c000009b60:  0x0000000000003600  0x00007ff81368d506
0x000000c000009b70:  0x0000000000000247  0x0000000000000007
0x000000c000009b80:  0x0000000000000000  0x0000000000000000
0x000000c000009b90:  0x0000000000000000  0x000000000000037f
0x000000c000009ba0:  0x0000000000000000  0x0000000000000000
runtime.sigtramp()
	/home/runner/.cache/hermit/pkg/go-1.17.7/src/runtime/sys_darwin_amd64.s:227 +0x3c

goroutine 1 [running]:
	goroutine running on other thread; stack unavailable

goroutine 35 [sleep]:
time.Sleep(0x5f5e100)
	/home/runner/.cache/hermit/pkg/go-1.17.7/src/runtime/time.go:193 +0x12e
github.com/cashapp/hermit/app.Main.func2()
	/home/runner/go/pkg/mod/github.com/cashapp/hermit@v0.18.2/app/main.go:117 +0x3d
created by github.com/cashapp/hermit/app.Main
	/home/runner/go/pkg/mod/github.com/cashapp/hermit@v0.18.2/app/main.go:115 +0x372

goroutine 36 [chan receive, 8 minutes]:
github.com/cashapp/hermit/ui.New.func2()
	/home/runner/go/pkg/mod/github.com/cashapp/hermit@v0.18.2/ui/ui.go:90 +0x4d
created by github.com/cashapp/hermit/ui.New
	/home/runner/go/pkg/mod/github.com/cashapp/hermit@v0.18.2/ui/ui.go:89 +0x23d

goroutine 19 [syscall, 8 minutes]:
os/signal.signal_recv()
	/home/runner/.cache/hermit/pkg/go-1.17.7/src/runtime/sigqueue.go:166 +0x28
os/signal.loop()
	/home/runner/.cache/hermit/pkg/go-1.17.7/src/os/signal/signal_unix.go:24 +0x19
created by os/signal.Notify.func1.1
	/home/runner/.cache/hermit/pkg/go-1.17.7/src/os/signal/signal.go:151 +0x2c
@mengzhuo mengzhuo added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Apr 8, 2022
@mengzhuo
Copy link
Contributor

mengzhuo commented Apr 8, 2022

CC @thanm @mknyszek

@mknyszek
Copy link
Contributor

mknyszek commented Apr 8, 2022

Does this reproduce with Go 1.18?

It's hard to tell what's going on here. The second panic is from a signal handler presumably invoked to crash the program, and is caused by the signal handler apparently not being set up right (or something is corrupted). It would be nice to know what's going on in goroutine 1 and setting GOTRACEBACK=crash (or calling runtime/debug.SetTraceback("crash")) might be enough to reveal that.

At first glance, I'm really not sure. More example crashes and more detail via GOTRACEBACK=crash would help. A core dump would be really useful too if you're willing to share, but there may be PII issues.

@alecthomas
Copy link
Author

alecthomas commented Apr 8, 2022

The only signal handler we have in our code is on this line in the traceback, where it's blocking on waiting for a signal. It's possible there is another signal handler being installed by a third party package, but I can't see it in any of the tracebacks provided.

I will enable GOTRACEBACK=crash

@alecthomas
Copy link
Author

But that said, doesn't the fact that we're getting to this state indicate a corruption (somehow) of the Go runtime?

@alecthomas
Copy link
Author

More information from one of our users:

hey, I have a confusing bug to report. running a hermit-managed command, I see a deadlock in the hermit process when attempting its syscall.Exec operation. attaching a debugger, the goroutine running exec seems to be stuck waiting on pre-empting signals [src]. I'm using go1.17.7 (managed by Hermit) and hermit itself is v0.18.3.

@mknyszek
Copy link
Contributor

mknyszek commented Apr 8, 2022

I mean signal handler in the Unix sense of a bunch of code that gets injected onto an OS thread when a signal gets delivered, not something that handles incoming signals in a Go application. I do mean to say that data in the runtime is corrupted. The signal handler runs on the signal stack which I believe "looks" like a g0 stack to most checks.

The exec note is interesting. #52086 might be related (also darwin/amd64). What macOS version are you running on? This may be some bad interaction between Go and new macOS versions, or even a bug in the OS. It certainly wouldn't be the first time.

@mknyszek
Copy link
Contributor

mknyszek commented Apr 8, 2022

The note about exec also made me recall #41702. Not sure if that's related exactly, but it may be worth trying to run your application with GODEBUG=asyncpreemptoff=1 which will disable preemption signals. Whether that hangs or not it will help narrow the problem space a bit.

@mknyszek
Copy link
Contributor

mknyszek commented Apr 8, 2022

Probing our GitHub issues further, does hermit use the plugin package? (See #38824, which we in theory fixed in the last release, and backported to 1.17.)

@alecthomas
Copy link
Author

Will GODEBUG=asyncpreemptoff=1 take effect if I set it from within the executable itself, or only from outside?

@alecthomas
Copy link
Author

Probing further, does hermit use the plugin package? (See #38824, which we in theory fixed in the last release, and backported to 1.17.)

No, not Hermit itself (though I can't exclude some imported package I guess).

@alecthomas
Copy link
Author

FWIW I think I have also seen this behaviour on darwin/arm64 as well.

@mknyszek
Copy link
Contributor

mknyszek commented Apr 8, 2022

Will GODEBUG=asyncpreemptoff=1 take effect if I set it from within the executable itself, or only from outside?

It's used at runtime initialization, so it needs to be set from the outside. We have a few outstanding issues about making GODEBUG settable from inside the program, but there hasn't been any movement on them.

No, not Hermit itself (though I can't exclude some imported package I guess).

Got it. Apparently #38824 could happen even if plugin was just imported and not used. (i.e. import ( _ "plugin" )). Could you check if plugin is transitively imported at any point? I think go mod why plugin is what you want but I'm not 100% sure how that interacts with standard library packages.

@mknyszek
Copy link
Contributor

mknyszek commented Apr 8, 2022

FWIW I think I have also seen this behaviour on darwin/arm64 as well.

Good to know. That correlates more strongly with this being either an OS bug, or a bad interaction with new OS versions, or an old bug on our end that really wasn't truly fixed. It could be something new, but I'm not sure we've made any major changes to how macOS syscalls work, or how we call exec on macOS (other than to fix bugs).

@alecthomas
Copy link
Author

🐚 ~/Projects/hermit $ go mod why -m plugin
# plugin
(main module does not need module plugin)

Also FWIW (and my recollection is a bit hazy at this point), but I think I only started seeing this after upgrading from Go 1.16 to Go 1.17

@alecthomas
Copy link
Author

I also tried nm hermit | grep plugin just in case, which returned nothing.

@alecthomas
Copy link
Author

Further anecdata for the Darwin-specific hypothesis: we use hermit in every single CI job running on linux/amd64 , across 100's of repo's and I've never seen or heard of a report of it wedging there.

@bwester
Copy link

bwester commented Apr 8, 2022

I was able to dump a stack trace from a wedged Hermit process with GOTRACEBACK=crash set using CTRL-\. The output is in a Github gist: [omitted]

Edit: Sorry, the wrong process's stack was dumped.

@ianlancetaylor
Copy link
Contributor

@bwester Thanks. What that stack dump shows is an instance of #23019. That is, a known problem when waiting for a child process that has started a grandchild process that is holding standard output/standard error open.

@alecthomas
Copy link
Author

Hmmm, are you sure it's the same issue? We are not waiting for child processes (in our Go code at least), we're calling syscall.Exec() directly.

@alecthomas
Copy link
Author

That's on the main path only though, we do have some other calls to exec.Command() elsewhere. Let me take a look.

@alecthomas
Copy link
Author

Actually I don't see any hermit paths in that traceback (unlike in the initial dump), so perhaps that was somehow from the sub-process that Hermit was executing.

@alecthomas
Copy link
Author

Confirmed that this is a stack trace from a wrapping Go executable (mockery), apologies. We'll try to get one from Hermit itself.

@bwester
Copy link

bwester commented Apr 11, 2022

Sorry for the earlier mistake in posting the wrong output. Full stack trace info from the correct process is here: deadlock-hermit-0.18.3-go-db7183ccf9.txt.

This time, I was able to reproduce using a locally-built Go toolchain from commit db7183c (tip as of 2022-04-08) to build Hermit at v0.18.3 (through an internal wrapper). Interestingly, I was only able to reproduce when compiling Hermit with CGO_ENABLED=0.

@bwester
Copy link

bwester commented Apr 14, 2022

I was able to use git bisect to determine when this deadlock was introduce to the Go toolchain. I was able to reproduce the deadlock at commit 05739d6 (first released in go1.14.11 and go1.15.4) but not before. That commit, "runtime: wait for preemption signals before syscall.Exec" is exactly what I see in my stack trace (linked above): the process is waiting for pending preemption signals to be received, but apparently they never come.

@ianlancetaylor
Copy link
Contributor

Thanks for tracking that down.

Why is your code calling syscall.Exec rather than using the os/exec package to start a child process? I mean, obviously syscall.Exec is supposed to work, but it's rarely used.

In delve, when looking at a wedged program, what is the value of runtime.pendingPreemptSignals?

@bwester
Copy link

bwester commented Apr 15, 2022

Why is your code calling syscall.Exec

This deadlock was discovered in Hermit, which is a virtual environment for tools. It is invoked as a symlinked shim and execs the real command's binary. exec is used to keep Hermit itself out of the process tree.

Though it was discovered in Hermit, I got the behavior to reproduce in a minimal program. Its source code is here: exec-deadlock.go. The one signal handler seems to be needed. To reproduce, compile it with CGO_ENABLED=0 and run the resulting binary in a loop. On my 2018 MacBook Pro (macOS 11.6.5), it should produce a deadlock within 100,000 iterations.

I also have a core dump (produced by dlv) from an execution of this program that has locked up. I can't post it as a gist—it's 2.3 GiB—but I can answer questions about it. A stack trace from this core is here: exec-deadlock.stack.

In delve, when looking at a wedged program, what is the value of runtime.pendingPreemptSignals?

From the core image:

(dlv) print runtime.pendingPreemptSignals
1

@bwester
Copy link

bwester commented Apr 15, 2022

Also, setting GODEBUG=asyncpreemptoff=1 makes the deadlock non-reproducible for me in my minimal program.

@ianlancetaylor
Copy link
Contributor

Thanks for the simpler recreator.

Some notes on a deadlocked instance:

pendingPreemptSignals is 1.

signalPending is 1 for one M, specifically m0. I think this M may be parked in stoplockedm.

The program is creating a new M. The new M has been added to the allm list, but newm1 is blocked trying to acquire execLock, so the new thread has not yet been created.

The sysmon thread has called retake which has decided to preempt a thread. The preemptM function is blocked trying to acquire execLock. It is possible that retake is trying to preempt the thread due to the deadlock, though. That is, this may be an effect rather than a cause.

OK, I see one possible problem. The goroutine started by ensureSigM blocks many signals including the preemption signal. If something wants to preempt it, nothing may ever decrement pendingPreemptSignals. I'll test a fix.

@ianlancetaylor
Copy link
Contributor

OK, sending a fix. Thanks.

@gopherbot
Copy link

Change https://go.dev/cl/400315 mentions this issue: runtime: don't block preemption signal in new M's or ensureSigM

@ianlancetaylor
Copy link
Contributor

@gopherbot Please open backport issues

This bug can cause rare cases of deadlock on macOS in programs that use syscall.Exec. The fix is simple.

@gopherbot
Copy link

Backport issue(s) opened: #52374 (for 1.17), #52375 (for 1.18).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

@mknyszek
Copy link
Contributor

@ianlancetaylor Nice catch! I left a comment on your CL. Paraphrasing that comment here for visibility:

Basically, I'm wondering if this is a more general problem.

For instance, should Go programs still be able to work if every thread had sigPreempt blocked? Ultimately asynchronous preemption is an optimization. Maybe there's no other way to implement the pendingPreemptSignals functionality, so enabling sigPreempt on all Go threads is fundamentally necessary. But I figured I'd ask.

@ianlancetaylor
Copy link
Contributor

The pendingPreemptSignals count is horrible and I would certainly prefer a different approach. But I haven't been able to think of one. As noted in #41702 it's working around a macOS bug.

@gopherbot
Copy link

Change https://go.dev/cl/400317 mentions this issue: [release-branch.go1.17] runtime: don't block preemption signal in new M's or ensureSigM

@gopherbot
Copy link

Change https://go.dev/cl/400318 mentions this issue: [release-branch.go1.18] runtime: don't block preemption signal in new M's or ensureSigM

@alecthomas
Copy link
Author

Thanks @ianlancetaylor, much appreciated.

@alecthomas
Copy link
Author

And thanks to @bwester for digging in and coming up with a reproducible test case.

@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Apr 16, 2022
@dmitshur dmitshur added this to the Go1.19 milestone Apr 16, 2022
alecthomas added a commit to cashapp/hermit that referenced this issue May 5, 2022
alecthomas added a commit to cashapp/hermit that referenced this issue May 5, 2022
Revert "fix: create process group"

This reverts commit 0f3b2f3.

Revert "chore: fix linting"

This reverts commit 1555512.

Revert "fix: workaround for golang/go#52226"

This reverts commit 0cd74ef.
gopherbot pushed a commit that referenced this issue May 6, 2022
… M's or ensureSigM

No test because we already have a test in the syscall package.
The issue reports 1 failure per 100,000 iterations, which is rare enough
that our builders won't catch the problem.

For #52226
Fixes #52374

Change-Id: I17633ff6cf676b6d575356186dce42cdacad0746
Reviewed-on: https://go-review.googlesource.com/c/go/+/400315
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
(cherry picked from commit e398266)
Reviewed-on: https://go-review.googlesource.com/c/go/+/400317
Reviewed-by: Austin Clements <austin@google.com>
gopherbot pushed a commit that referenced this issue May 6, 2022
… M's or ensureSigM

No test because we already have a test in the syscall package.
The issue reports 1 failure per 100,000 iterations, which is rare enough
that our builders won't catch the problem.

For #52226
Fixes #52375

Change-Id: I17633ff6cf676b6d575356186dce42cdacad0746
Reviewed-on: https://go-review.googlesource.com/c/go/+/400315
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
(cherry picked from commit e398266)
Reviewed-on: https://go-review.googlesource.com/c/go/+/400318
Reviewed-by: Austin Clements <austin@google.com>
@golang golang locked and limited conversation to collaborators Apr 16, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. OS-Darwin
Projects
None yet
Development

No branches or pull requests

7 participants