
runtime: segmentation fault on linux/amd64 #49370

Closed
mknyszek opened this issue Nov 5, 2021 · 10 comments
Labels
FrozenDueToAge, NeedsInvestigation, OS-Linux, release-blocker
Milestone
Go1.18

Comments

mknyszek commented Nov 5, 2021

There's been a slew of segmentation fault (core dumped) failures (without additional context) on various linux/amd64 builders as of 961aab2 (AFAICT).

2021-11-04T21:52:51-1e0c3b2/linux-amd64
2021-11-04T21:52:51-1e0c3b2/linux-amd64-bullseye
2021-11-04T21:52:36-8ad0a7e/linux-amd64
2021-11-04T21:52:36-8ad0a7e/linux-amd64-fedora
2021-11-04T21:52:36-8ad0a7e/linux-amd64-staticlockranking
2021-11-04T21:52:06-37634ee/linux-amd64-fedora
2021-11-04T21:33:23-71fc881/linux-amd64-unified
2021-11-04T20:01:10-9b2dd1f/linux-amd64-sid
2021-11-04T20:00:54-961aab2/linux-amd64-unified

Notably, the failure always seems to pop up in the same place in the runtime tests (around 5.2 to 5.5 seconds), probably because it's the same single test failing. I've been unable to reproduce it on my VM, or on the gomotes, despite many all.bash runs.

My guess is that this is another failure caused by a bad test that previously never had a GC run during it, but now does because of 961aab2 and the smaller minimum heap size (I have fixed 4 failures like this so far; it's not proof, but it does fit the pattern).

mknyszek added the NeedsInvestigation, OS-Linux, and release-blocker labels on Nov 5, 2021
mknyszek added this to the Go1.18 milestone on Nov 5, 2021

mknyszek commented Nov 5, 2021

I'm trying to reproduce this with 10 gomotes, ignoring other flakes and failures on the dashboard. I upgraded my tooling for this so I should be able to run it indefinitely and get back a core file when it does actually fail.

mknyszek commented Nov 5, 2021

I reproduced it but failed to produce a core.

mknyszek commented Nov 5, 2021

Reproduced again, no core file. Trying a different strategy to try to get the core file to be produced...

mknyszek commented Nov 5, 2021

Great news! I have cores. The problem is I have 97 of them, but all of them are indeed from this test run. So one of them should capture the failure. In theory.

mknyszek commented Nov 5, 2021

OK, that was easier than I thought. Backtrace:

#0  runtime.gcWriteBarrier ()
    at /usr/local/google/home/mknyszek/tmp/dlgo-linux-amd64/go/src/runtime/asm_amd64.s:1622
#1  0x00000000004726e7 in runtime.gcWriteBarrierCX ()
    at /usr/local/google/home/mknyszek/tmp/dlgo-linux-amd64/go/src/runtime/asm_amd64.s:1697
#2  0x00000000004679d4 in runtime.(*debugCallHandler).inject (h=0x657350, info=0xc000115bf0,
    ctxt=<optimized out>, gp2=<optimized out>)
    at /usr/local/google/home/mknyszek/tmp/dlgo-linux-amd64/go/src/runtime/export_debug_test.go:133
#3  0x0000000000475172 in runtime.(*debugCallHandler).inject-fm ()
    at /usr/local/google/home/mknyszek/tmp/dlgo-linux-amd64/go/src/runtime/export_debug_test.go:109
#4  0x000000000044fdc4 in runtime.sighandler (sig=5, info=<optimized out>, ctxt=<optimized out>,
    gp=0xc0000f8048) at /usr/local/google/home/mknyszek/tmp/dlgo-linux-amd64/go/src/runtime/signal_unix.go:606
#5  0x000000000044f8e5 in runtime.sigtrampgo (sig=5, info=0xc000115bf0, ctx=0xc000115ac0)
    at /usr/local/google/home/mknyszek/tmp/dlgo-linux-amd64/go/src/runtime/signal_unix.go:465
#6  0x0000000000474d8e in runtime.sigtrampgo (sig=5, info=0xc000115bf0, ctx=0xc000115ac0) at <autogenerated>:1
#7  0x00000000004741dd in runtime.sigtramp ()
    at /usr/local/google/home/mknyszek/tmp/dlgo-linux-amd64/go/src/runtime/sys_linux_amd64.s:361

Sadly, this is something I'm aware of. The debug call injection tests are riddled with write barriers that are not safe to execute. The quick fix here is to disable the GC for these tests. Notably, some of these tests call runtime.GC in the call injection handler; that's totally fine. The problem is a GC that is already executing while the handler is running.

Note also that this is a test-only failure. Real-world implementations of the debug call protocol (like in delve) don't have this problem, because the handler is implemented in another process. In the past I made an effort to remove these write barriers, but it's surprisingly difficult. There are a lot of them, and arranging for something different for all of them would be a pain.

I propose disabling the GOGC-based GC during the test to unblock the longtest builder, and then taking some time later to rewrite this entire handler to be careful not to use write barriers (and to be annotated with go:nowritebarrierrec). Another useful note: in theory the handler should already have had write barriers turned off, since it's called from a signal handler. Unfortunately it's called indirectly, so the static analysis that enforces this doesn't catch it.
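
For concreteness, a minimal sketch of what that mitigation could look like at the test level; the test name and body here are hypothetical, not the actual code in runtime's debug_test.go:

package runtime_test

import (
    "runtime/debug"
    "testing"
)

// TestDebugCallSketch is a hypothetical test illustrating the proposed fix:
// disable GOGC-triggered collections so that a GC cycle cannot start while
// the injected call handler (with its unsafe write barriers) is running.
func TestDebugCallSketch(t *testing.T) {
    // SetGCPercent returns the previous setting, so the deferred call
    // restores normal GC behavior when the test finishes.
    defer debug.SetGCPercent(debug.SetGCPercent(-1))

    // ... exercise the debug call injection protocol here ...
}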

mknyszek commented Nov 5, 2021

CC @prattmic, with whom I discussed this in the past.

@gopherbot

Change https://golang.org/cl/361896 mentions this issue: runtime: disable GC during debug call tests

@gopherbot

Change https://golang.org/cl/369751 mentions this issue: runtime: clean up redundant calls to SetGCPercent in debug_test.go

gopherbot pushed a commit that referenced this issue Dec 7, 2021
SetGCPercent(-1) is called by several tests in debug_test.go (followed
by a call to runtime.GC) due to #49370. However, startDebugCallWorker
already actually has this, just without the runtime.GC call (allowing an
in-progress GC to still mess up the test).

This CL consolidates SetGCPercent into startDebugCallWorker where
applicable.

Change-Id: Ifa12d6a911f1506e252d3ddf03004cf2ab3f4ee4
Reviewed-on: https://go-review.googlesource.com/c/go/+/369751
Trust: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
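
Roughly, the consolidation described in that commit amounts to the pattern below; the helper name is only a stand-in for the role startDebugCallWorker plays, and the body is a sketch rather than the real debug_test.go code:

package runtime_test

import (
    "runtime"
    "runtime/debug"
    "testing"
)

// startWorkerSketch stands in for a shared setup helper: the debug call
// tests go through it, so the GC can be disabled in one place instead of
// in each individual test.
func startWorkerSketch(t *testing.T) func() {
    // Turn off GOGC-based collections for the duration of the test...
    old := debug.SetGCPercent(-1)
    // ...and run one full cycle so that any GC already in progress is
    // finished before the test starts injecting calls.
    runtime.GC()
    // The caller restores the previous setting when the test is done.
    return func() { debug.SetGCPercent(old) }
}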
@gopherbot

Change https://golang.org/cl/369815 mentions this issue: runtime: fix comments on the behavior of SetGCPercent

mknyszek commented Dec 7, 2021

In investigating the root causes for #49680, #49695, and #45867, I discovered that the logic behind the fix here was erroneous. In fact, I suspect that the cause of these failures was that just one of the tests was missing a call to SetGCPercent(-1) and that the additional runtime.GC calls are actually pointless. I'm not going to pull them out for 1.18 because that risks introducing flaky tests, but I sent a CL that leaves TODOs to remove them.

gopherbot pushed a commit that referenced this issue Dec 7, 2021
Fixes for #49680, #49695, #45867, and #49370 all assumed that
SetGCPercent(-1) doesn't block until the GC's mark phase is done, but
it actually does. The cause of 3 of those 4 failures comes from the fact
that at the beginning of the sweep phase, the GC does try to preempt
every P once, and this may run concurrently with test code. In the
fourth case, the issue was likely that only *one* of the debug_test.go
tests was missing a call to SetGCPercent(-1). Just to be safe, leave a
TODO there for now to remove the extraneous runtime.GC calls, but leave
the calls in.

Updates #49680, #49695, #45867, and #49370.

Change-Id: Ibf4e64addfba18312526968bcf40f1f5d54eb3f1
Reviewed-on: https://go-review.googlesource.com/c/go/+/369815
Reviewed-by: Austin Clements <austin@google.com>
Trust: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
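
Put as code, the corrected understanding reads roughly like this; the test name is hypothetical, and the comments rather than the calls carry the point:

package runtime_test

import (
    "runtime"
    "runtime/debug"
    "testing"
)

func TestInjectionSketch(t *testing.T) {
    // SetGCPercent(-1) blocks until any in-progress mark phase has finished,
    // so no GC is marking once it returns. At the start of the following
    // sweep phase the GC still preempts every P once, which can overlap
    // with test code.
    defer debug.SetGCPercent(debug.SetGCPercent(-1))

    // Believed redundant for the reason above; the real tests keep a call
    // like this, with a TODO to remove it, just to be safe.
    runtime.GC()

    // ... run the debug call injection test body ...
}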
golang locked and limited conversation to collaborators on Jun 23, 2023