Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: frequent SIGSEGV in readgstatus in bootstrap toolchain #49686

Closed
bcmills opened this issue Nov 19, 2021 · 22 comments
Closed

runtime: frequent SIGSEGV in readgstatus in bootstrap toolchain #49686

bcmills opened this issue Nov 19, 2021 · 22 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. release-blocker
Milestone

Comments

@bcmills
Copy link
Contributor

bcmills commented Nov 19, 2021

greplogs --dashboard -md -l -e '(?ms)^fatal error: .*^runtime\.readgstatus' --since=2021-01-01

2021-11-18T22:40:42-2375b6e/linux-mips64-rtrk
2021-11-18T19:16:38-c4aae23/linux-mips64le-rtrk
2021-11-15T23:01:05-d156101/linux-mips64-rtrk
2021-11-12T16:58:34-95d0657/linux-mips64le-rtrk
2021-11-11T11:47:33-c49627e/linux-mips64le-rtrk
2021-11-10T05:08:25-17980df/linux-mips64le-rtrk
2021-11-09T22:11:33-4aa0746/linux-arm64-packet
2021-11-09T16:20:19-e90fd9a/linux-arm64-packet
2021-11-04T20:43:07-8248152/linux-arm64-packet
2021-11-02T16:12:23-599de4b/linux-arm64-packet
2021-11-01T21:27:46-036812b-f9cb33c/linux-arm64-packet
2021-10-28T16:54:29-5fce1d9/linux-arm64-packet
2021-10-26T05:02:53-ec6c004/linux-arm64-packet
2021-10-15T07:13:43-1cbec68/linux-arm64-packet
2021-10-11T22:32:23-c1b0ae4/linux-arm64-packet
2021-10-08T17:58:41-59d4e92-16a3cef/android-amd64-emu
2021-10-08T00:24:52-59d4e92-78d749f/android-amd64-emu
2021-10-08T00:18:29-59d4e92-7cef831/android-amd64-emu
2021-10-07T21:39:17-59d4e92-6b4cf2b/darwin-arm64-11_0-toothrot
2021-10-07T18:58:33-7286502/darwin-arm64-11_0-toothrot
2021-10-07T18:58:33-7286502/linux-arm64-packet
2021-10-07T18:58:33-59d4e92-7286502/android-amd64-emu
2021-10-07T18:45:53-77f2750/android-arm64-corellium
2021-10-07T18:45:53-59d4e92-77f2750/android-amd64-emu
2021-10-07T14:01:52-ecb2f23/android-386-emu
2021-10-07T03:41:40-62292e8-6f74ed0/android-amd64-emu
2021-10-07T02:30:47-62292e8-39bbf08/darwin-arm64-11_0-toothrot
2021-10-07T00:39:17-62292e8-812a33d/aix-ppc64
2021-10-06T23:01:01-62292e8-4ffa2f1/android-amd64-emu
2021-10-06T22:42:28-62292e8-4002616/darwin-arm64-11_0-toothrot
2021-10-06T21:17:08-62292e8-2e107b4/android-amd64-emu
2021-10-06T20:30:12-62292e8-4a37a1d/android-amd64-emu
2021-10-06T19:59:27-62292e8-17c513e/android-amd64-emu
2021-10-06T19:59:27-62292e8-17c513e/darwin-arm64-11_0-toothrot
2021-10-06T17:38:01-d2e5035-415f0a3/android-amd64-emu
2021-10-06T16:46:09-d2e5035-3160571/android-amd64-emu
2021-10-05T23:58:29-d2e5035-ac60900/android-amd64-emu
2021-10-05T21:25:51-6ae3afa/darwin-arm64-11_0-toothrot
2021-10-05T20:53:02-d4b1ae0-695a59b/android-amd64-emu
2021-10-05T20:35:54-d4b1ae0-990c9c6/darwin-arm64-11_0-toothrot
2021-10-05T20:15:01-d4b1ae0-0b4d499/android-amd64-emu
2021-10-05T20:15:01-d4b1ae0-0b4d499/darwin-arm64-11_0-toothrot
2021-10-05T19:28:36-d4b1ae0-7ae83c8/android-amd64-emu
2021-09-12T01:06:53-915f620-0d8a4bf/linux-arm64-packet
2021-08-31T08:41:16-3e0d083-f118d14/linux-arm64-packet
2021-08-08T18:37:08-507cc34/linux-mips64-rtrk
2021-06-25T17:31:39-d01bc57/linux-mips64-rtrk
2021-06-07T21:12:46-909dd5e/linux-arm64-packet
2021-05-26T16:11:00-bfd7798/linux-mips64le-rtrk
2021-05-21T21:07:03-3c65644/linux-mips64-rtrk
2021-05-17T18:03:56-a2c07a9/linux-mips64-rtrk
2021-05-07T18:14:25-0714010-af6123a/linux-mips64-rtrk
2021-05-07T17:43:31-b44c78b/linux-arm64-packet
2021-04-23T20:57:54-691e1b8/linux-mips64le-rtrk
2021-04-16T21:20:31-02a2ff4/linux-mips64-rtrk
2021-03-25T13:57:08-d1beb07-9f4d5c9/linux-mips64-rtrk
2021-03-17T00:17:33-77fc1ea-68f8e1a/linux-mips64le-rtrk
2021-03-12T01:47:01-a607408/linux-mips64le-rtrk
2021-03-08T20:41:06-a08adda/linux-mips64le-rtrk
2021-03-05T18:46:36-51d8d35/linux-mips64le-rtrk
2021-02-23T10:04:13-a78b0e6/linux-mips64-rtrk
2021-01-27T12:18:00-cd176b3/linux-mips64-rtrk
2021-01-19T12:49:13-d047c91/linux-arm64-packet

Note that several of these failures are on linux/arm64 and darwin/arm64, which are first-class ports.

(CC @cherrymui @jeremyfaller)

@bcmills bcmills added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker labels Nov 19, 2021
@bcmills bcmills added this to the Go1.18 milestone Nov 19, 2021
@jeremyfaller
Copy link
Contributor

Also, CC – @mknyszek

@cherrymui
Copy link
Member

I'll take a look.

@cherrymui
Copy link
Member

This looks like #43824 . The most recent ones are all failing in markroot with a nil gp, which is the same as #43824.

@mknyszek
Copy link
Contributor

I poked a little bit more and there's one thing I don't understand: why is the failing address exactly zero? Shouldn't the address be slightly offset for the atomicstatus field? Is there a nil pointer check here, before the load?

Also, the code around this is kind of fishy:

    default:
        // the rest is scanning goroutine stacks
        workCounter = &gcController.stackScanWork
        var gp *g
        if work.baseStacks <= i && i < work.baseEnd {
            // N.B. Atomic read of allglen in gcMarkRootPrepare
            // acts as a barrier to ensure that allgs must be large
            // enough to contain all relevant Gs.
            gp = allgs[i-work.baseStacks]
        } else {
            throw("markroot: bad index")
        }

        // remember when we've first observed the G blocked
        // needed only to output in traceback
        status := readgstatus(gp) // We are not in a scan state
        if (status == _Gwaiting || status == _Gsyscall) && gp.waitsince == 0 {
            gp.waitsince = work.tstart
        }

Note the unsynchronized access to allg and the N.B. above it. A lot of the recent failures appear to be on weak memory model architectures, like MIPS and arm64. The amd64 ones are also weird environments, like an Android emulator (Android amd64 is weird enough on its own). Maybe that's how a nil pointer is slipping through?

It's unclear to me whether this is a beta blocker, though. As @cherrymui pointed out, it seems old.

CC @aclements @prattmic

@cherrymui
Copy link
Member

why is the failing address exactly zero? Shouldn't the address be slightly offset for the atomicstatus field? Is there a nil pointer check here, before the load?

I looked into this a while ago. Yes, there is a nil check. At least on some architectures, atomic load instruction is not marked "fault on nil" in the compiler, so the compiler does not eliminate the nil check. We probably want to fix that (but that won't fix the issue).

@aclements
Copy link
Member

aclements commented Nov 30, 2021

I can definitely see this failing on a weak memory machine. I don't think it even requires subtlety around STW.

  1. We start a GC. gcMarkRootPrepare snapshots allglen atomically. We restart the world.
  2. Thread 1 does allgadd, which appends to allgs. Suppose this requires a grow, so thread 1 allocates a new all-nil slice, copies the old slice into the new slice (write A), and assigns the new slice to allgs (write B).
  3. Thread 2 concurrently enters markroot and reads from allgs. It reads the slice base (read A) and then the n'th value in the slice (read B).

On a weak memory machine, thread 2 could observe write B before write A (or, equivalently, the reads could be reordered), leading to thread 2 reading a nil from allgs.

(Edited to add: Actually, I don't think the reads can be reordered because they're dependent, but the writes could be.)

@aclements
Copy link
Member

I've got a couple ideas for how to fix this. Probably the most straightforward is to save a copy of allgs during gcMarkRootPrepare and use that in markroot.

@cherrymui
Copy link
Member

Does it work if we use atomicAllGIndex ? allgptr is written atomically so if it points to the new memory it should observe the writes to the array?

@aclements
Copy link
Member

Probably, though then we also have to call atomicAllG either on every markroot or save that somewhere. (In general I really don't like the atomicAllG API... I might be happier with that if atomicAllG just returned a slice.)

@gopherbot
Copy link

Change https://golang.org/cl/368134 mentions this issue: runtime: fix racy allgs access on weak memory architectures

@mknyszek
Copy link
Contributor

mknyszek commented Sep 6, 2022

2022-09-05T08:28:34-83b083e-bd5595d/linux-mips-rtrk

This looks like it may be related? It's a nil pointer dereference in readgstatus from markroot.

@mknyszek mknyszek reopened this Sep 6, 2022
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Sep 6, 2022
@cherrymui
Copy link
Member

It looks like cmd/dist crashed, which is built from the bootstrap toolchain, which is go1.17.13, which doesn't seem to include the fix above.

@bcmills
Copy link
Contributor Author

bcmills commented Sep 6, 2022

Given that it affects the bootstrap toolchain, should the fix be backported to release-branch.go1.17? (Or would we rather live with the crashes until the bootstrap version is raised again?)

@randall77
Copy link
Contributor

I think we should backport the fix.

Normally we just add a workaround to tip's toolchain to avoid bootstrap bugs, but I suspect there isn't a feasible workaround in this case.

@cherrymui
Copy link
Member

Go 1.17 is no longer supported. Not sure how we're going to backport. I guess we could cherry-pick to release-branch.go1.17. We still need to manually update the bootstrap toolchains on the builders, right?

@randall77
Copy link
Contributor

We can release another 1.17.x even though it is not supported. Didn't we do that for 1.4.x? Or maybe I am misremembering.

@rsc
Copy link
Contributor

rsc commented Sep 7, 2022

We never did an updated 1.4.x release. We only backported things to the release branch and posted new src tar files.

That said, we're in a brave new world and we can chart whatever course we think best. Since this seems to be entirely the relaxed memory model architectures, one option is to bump all those builders to Go 1.18 or Go 1.19. We'll still have plenty on Go 1.17 to keep us honest (a few builders are already bumped for other reasons, like the port being new since Go 1.17).

If people generally agree with bumping the bootstrap toolchain forward on those systems, I'm happy to do the work or advise someone who wants to learn the process.

@mknyszek
Copy link
Contributor

mknyszek commented Sep 7, 2022

Sounds like we have 3 paths forward here:

  1. Do nothing, it's not that common and we may want to update the bootstrap toolchain in the next release anyway.
  2. Backport the fix to the bootstrap compiler.
  3. Move just the relaxed memory architecture bootstrap toolchains up a version or two.

I think this just needs a decision and someone to do the work (or no one to do no work).

@mknyszek mknyszek added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Sep 7, 2022
@gopherbot gopherbot removed the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Sep 7, 2022
@mknyszek mknyszek self-assigned this Sep 7, 2022
@rsc
Copy link
Contributor

rsc commented Sep 7, 2022

I talked with Austin briefly about this. I think (2) is probably wrong; it's strictly more work than (3).
I don't think we should do (1) and leave lingering flakes (assuming flakes are still happening), so I'm inclined to (3).

@rsc
Copy link
Contributor

rsc commented Sep 7, 2022

Re "assuming flakes are still happening", taking a closer look at the history, it looks like there were many failures during run.bash originally, and that got fixed and released.

Now we have a single flake where the build faulted in readgstatus in a similar way, during bootstrap, which suggests maybe older toolchains are problematic. But it's just one flake. It might be best to wait for a couple more to really be sure that this is the problem. I did a greplogs for runtime.readgstatus, and I found this fault too, which is from a commit back in May (not too old) and was during run.bash, so not bootstrap-dependent. It's interesting because it is pprof faulting on readgstatus, while the world is ostensibly stopped.

2022-05-09T20:57:58-3f43096/openbsd-arm-jsing

Maybe there is something lurking here, or maybe not. Let's keep accumulating failures mentioning runtime.readgstatus and see what we find.

@mknyszek mknyszek modified the milestones: Go1.18, Go1.20 Sep 7, 2022
@mknyszek mknyszek added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Sep 7, 2022
@mknyszek
Copy link
Contributor

mknyszek commented Sep 7, 2022

Since we're waiting, I put this in WaitingForInfo. At the very least when it closes we'll have a signal that we might want to look back here. I'll keep it assigned to myself to make sure I get notified.

@cherrymui cherrymui changed the title runtime: frequent SIGSEGV in readgstatus runtime: frequent SIGSEGV in readgstatus in bootstrap toolchain Sep 21, 2022
@cherrymui
Copy link
Member

Moved to #55304 to not confuse the old and new bug. Closing this.

@bcmills bcmills modified the milestones: Go1.20, Go1.18 Sep 21, 2022
@bcmills bcmills removed the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Sep 21, 2022
@golang golang locked and limited conversation to collaborators Sep 21, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. release-blocker
Projects
None yet
Development

No branches or pull requests

8 participants