-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime: error message: P has cached GC work at end of mark termination #27993
Comments
On windows-386-2008: https://storage.googleapis.com/go-build-log/fe06deee/windows-386-2008_bd7ef5ad.log |
See #27993 .
…On Mon, Nov 5, 2018 at 12:39 PM Ian Lance Taylor ***@***.***> wrote:
linux-amd64:
https://storage.googleapis.com/go-build-log/64b96186/linux-amd64_ef4d4c1f.log
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#27993 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AIIkT_KHSnzEuYAJq8wmvhDbk5g59LvUks5usKH7gaJpZM4XG7zF>
.
|
Hey @griesemer in your comment #27993 (comment)
you self referenced this same issue 27993, which issue did you mean to refer to? |
@odeke-em This was just an e-mail reply - I missed that the mail already referred to this issue. Please ignore. |
Change https://golang.org/cl/149968 mentions this issue: |
Change https://golang.org/cl/149969 mentions this issue: |
Another on https://go-review.googlesource.com/c/go/+/150517 1 of 19 TryBots failed: |
For #27993. Change-Id: I20127e8a9844c2c488f38e1ab1f8f5a27a5df03e Reviewed-on: https://go-review.googlesource.com/c/149968 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Change https://golang.org/cl/150778 mentions this issue: |
This adds a debug check to throw immediately if any pointers are added to the gcWork buffer after the mark completion barrier. The intent is to catch the source of the cached GC work that occasionally produces "P has cached GC work at end of mark termination" failures. The result should be that we get "throwOnGCWork" throws instead of "P has cached GC work at end of mark termination" throws, but with useful stack traces. This should be reverted before the release. I've been unable to reproduce this issue locally, but this issue appears fairly regularly on the builders, so the intent is to catch it on the builders. This probably slows down the GC slightly. For #27993. Change-Id: I5035e14058ad313bfbd3d68c41ec05179147a85c Reviewed-on: https://go-review.googlesource.com/c/149969 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
For #27993. Change-Id: I20127e8a9844c2c488f38e1ab1f8f5a27a5df03e Reviewed-on: https://go-review.googlesource.com/c/149968 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
This adds a debug check to throw immediately if any pointers are added to the gcWork buffer after the mark completion barrier. The intent is to catch the source of the cached GC work that occasionally produces "P has cached GC work at end of mark termination" failures. The result should be that we get "throwOnGCWork" throws instead of "P has cached GC work at end of mark termination" throws, but with useful stack traces. This should be reverted before the release. I've been unable to reproduce this issue locally, but this issue appears fairly regularly on the builders, so the intent is to catch it on the builders. This probably slows down the GC slightly. For #27993. Change-Id: I5035e14058ad313bfbd3d68c41ec05179147a85c Reviewed-on: https://go-review.googlesource.com/c/149969 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
I got a little data over the weekend with the added debug code.
2018-11-23T21:21:35-c6d4939/linux-amd64-noopt:
2018-11-21T16:28:17-2a7f904/linux-386-clang:
2018-11-21T16:28:17-2a7f904/linux-mipsle:
|
It's really interesting that in three of the four failures, it did not trip the check on adding work to the gcWork. There's a pretty narrow window when we're still running the termination detection algorithm where work can still be added without triggering this check. Perhaps for debugging we should do a second round after setting In the one |
FWIW, I was able to reproduce this once out of 1,334 runs of all.bash on my linux/amd64 box.
|
@aclements how serious is this and how long do you think it will take to fix (worst question to ask I know)? We’re asking because we may delay the beta due to it so getting more insight into the scope of the issue will help us with that decision. Thanks. |
Another repro in https://storage.googleapis.com/go-build-log/f64c1069/windows-386-2008_34d5c37b.log, if it helps:
|
@aclements We're assuming we should wait on this for RC1, let us know if you feel otherwise. |
Currently it's possible for the runtime to deadlock if checkPut is called in a non-preemptible context. In this case, checkPut may spin, so it won't leave the non-preemptible context, but the thread running gcMarkDone needs to preempt all of the goroutines before it can release the checkPut spin loops. Fix this by returning from checkPut if it's called under any of the conditions that would prevent gcMarkDone from preempting it. In this case, it leaves a note behind that this happened; if the runtime does later detect left-over work it can at least indicate that it was unable to catch it in the act. For #27993. Updates #29385 (may fix it). Change-Id: Ic71c10701229febb4ddf8c104fb10e06d84b122e Reviewed-on: https://go-review.googlesource.com/c/156017 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Rick Hudson <rlh@golang.org>
Change https://golang.org/cl/156140 mentions this issue: |
@mark-rushakoff, thanks for the pointer to the influx tests. I'll see if I can reproduce it as easily using those. For reference, I had to install bzr and fetch that package in modules-mode since the HEAD of some of its dependencies is broken: mkdir /tmp/z
cd /tmp/z
echo "module m" > go.mod
go get -d github.com/influxdata/platform/http@278f679
go test -c -race github.com/influxdata/platform/http |
We still don't understand what's causing there to be remaining GC work when we enter mark termination, but in order to move forward on this issue, this CL implements a work-around for the problem. If debugCachedWork is false, this CL does a second check for remaining GC work as soon as it stops the world for mark termination. If it finds any work, it starts the world again and re-enters concurrent mark. This will increase STW time by a small amount proportional to GOMAXPROCS, but fixes a serious correctness issue. This works-around #27993. Change-Id: Ia23b85dd6c792ee8d623428bd1a3115631e387b8 Reviewed-on: https://go-review.googlesource.com/c/156140 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Michael Knyszek <mknyszek@google.com> Reviewed-by: Rick Hudson <rlh@golang.org>
CL 156140 has been submitted, which works around this issue, so I'm going to remove release-blocker. We still don't understand the root cause, however, so the issue will remain open. |
Change https://golang.org/cl/156318 mentions this issue: |
A workaround has been submitted. Updates #27993 Change-Id: Ife6443c32673b38000b90dd2efb2985db37ab773 Reviewed-on: https://go-review.googlesource.com/c/156318 Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
I think I've figured it out. I've been running the influx test with a lot of extra runtime debug logging and have been getting some interesting output. Here's an annotated subset of a failure (go-stress-20190104T162318-083562928; I've posted just the last GC cycle since the log is huge). The unmarked object is 0xc000575800, which is a _defer.
The high-level sequence of events is:
Now all of the Ps have passed the barrier, but in fact P 2 has pending work. If we enter mark termination at this point, there may even be white objects reachable from x. I'm not sure why this affects defers so disproportionately. It may simply be that defers depend more heavily on write barrier marking than most objects because of their access patterns. The workaround I committed a couple days ago will fix this problem. I'm still thinking about how to fix it "properly". I also need to go back to the proof and figure out where reality diverged from the math. I think the core of the problem is that pointers in the write barrier buffer are only "sort of" grey: they haven't been marked yet, so other Ps can't observe their grey-ness). As a result, you wind up with a black-to-"sort-of-grey" pointer, and that isn't strong enough for the proof's assumptions. |
The proof missed a case. It assumed that new GC work could only be created by consuming GC work from a local work queue. But, in fact, the write barrier allows new work to be created even if the local work queue is empty, as long as some queue is non-empty. In particular, if some queue is non-empty, that means some object is grey. Since that's just in the heap, any P can walk from that object to a white object reachable from it and mark that object using a write barrier, causing work to be added to its local queue (which may have been empty). This implicit communication of work isn't tracked by the "flushed" bit in the algorithm/proof, which is what causes the algorithm to fail. I'm still thinking about how to fix the algorithm. A stronger algorithm could ensure that all work queues were empty at the beginning of the successful completion round. I think a two-round algorithm could do this. However, there's some cost to this since the GC would be stuck in limbo for the whole second round with no work to do, but unable to enter mark termination. I think a weaker algorithm is still possible where the write barrier sets a global flag if it creates work during termination detection, and we check that flag after the ragged barrier. I haven't convinced myself that this is right yet. |
Since we have a workaround and I'm not going to fix the root cause for 1.12, I'm bumping this to 1.13. |
It looks like this may have been hitting the below issue: golang/go#27993 These changes appear to mediate it (though not really sure)
Change https://golang.org/cl/262350 mentions this issue: |
debugCachedWork and all of its dependent fields and code were added to aid in debugging issue #27993. Now that the source of the problem is known and mitigated (via the extra work check after STW in gcMarkDone), these extra checks are no longer required and simply make the code more difficult to follow. Remove it all. Updates #27993 Change-Id: I594beedd5ca61733ba9cc9eaad8f80ea92df1a0d Reviewed-on: https://go-review.googlesource.com/c/go/+/262350 Trust: Michael Pratt <mpratt@google.com> Run-TryBot: Michael Pratt <mpratt@google.com> TryBot-Result: Go Bot <gobot@golang.org> Reviewed-by: Austin Clements <austin@google.com>
I just saw this in a trybot run on OpenBSD amd64: https://storage.googleapis.com/go-build-log/e4418337/openbsd-amd64-62_9bec130b.log
greplogs shows:
2018-10-03T02:09:38-06ff477/freebsd-386-10_4
2018-10-03T17:40:17-048de7b/freebsd-386-11_2
2018-10-03T19:54:23-9dac0a8/linux-amd64-nocgo
CC @aclements @RLH
The text was updated successfully, but these errors were encountered: