runtime: fatal error: scan missed a g #16083
CC @aclements @RLH. Is this reproducible?
Is there any way we can get the input that caused the error so we can try to reproduce it? Correct me if I missed something, but it looks like the program does not use… Finally, can you reproduce it reliably? If not, how often does it show up?
There were some bugs fixed between 1.6 and 1.6.2 that might be related; can you try 1.6.2 and see whether the problem persists?
This happened only once, on 1 machine out of a few thousand. I will look for updates and come back if this happens again.
I do not use any of it.
There doesn't seem to be anything that can be done here. 1.6.2 fixes the related bugs mentioned above.
Hello again. This happened again, this time on 1.6.2. Maybe I should try…
Thanks for the additional report with 1.6.2. Is there any way that we can reproduce this problem ourselves? There is no way to catch this kind of error.
@ianlancetaylor We are accepting ~2000 metrics per second, checking whether each one matches a regexp, and processing only the valid metrics. Nothing really special, except maybe running GC manually: https://github.com/leoleovich/grafsy/blob/master/client.go#L130 (but that is in another goroutine).
Unless you are running into some sort of hard memory limit, there is really no reason to call runtime.GC manually. In order to fix this problem, we need some way to reproduce it. Is there any way we can reproduce it ourselves? Has it happened more than twice? Is there any similarity between the times that it happened? For example, did it happen on the same machine, or on different ones? Is there a chance that you could try the 1.7 beta release to see if it makes any difference?
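For anyone landing here with the same question: rather than calling runtime.GC on a timer, the collector's aggressiveness can be tuned with GOGC / debug.SetGCPercent, and debug.FreeOSMemory both forces a collection and returns memory to the OS if resident memory is the real concern. A minimal sketch (not Grafsy's code, just an illustration):

```go
package main

import (
	"runtime/debug"
	"time"
)

func main() {
	// Trigger a collection when the heap has grown 50% since the last GC
	// (the default is 100%). Equivalent to running with GOGC=50.
	debug.SetGCPercent(50)

	// If the goal of a periodic runtime.GC() call is to keep resident
	// memory down, FreeOSMemory forces a GC and returns as much memory
	// as possible to the operating system.
	go func() {
		for range time.Tick(10 * time.Minute) {
			debug.FreeOSMemory()
		}
	}()

	// ... the rest of the daemon ...
	select {}
}
```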
@ianlancetaylor This happened only on 2 hosts out of a few thousand. But on these hosts we have the most heavily loaded instances of Grafsy.
We don't know how to reproduce and we don't know what the problem is. Punting to 1.8.
What version of Go are you using (go version)? What did you see instead?
We encounter the same error; it points at go/src/runtime/malloc.go:798.
@eaglerayp Have you confirmed your program is absolutely free of data races? Are you able to provide a piece of code which reproduces the error?
@davecheney, here is more information. We are still testing now; if we reproduce this problem, I will let you know.
Please build and test your application with the race detector to confirm there are no data races in your program.
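For anyone unfamiliar with the race detector: build or test with the -race flag (go build -race, go test -race, go run -race) and exercise the binary under realistic load. A tiny self-contained example of the kind of bug it reports (hypothetical counter, unrelated to this issue):

```go
// racy.go — run with: go run -race racy.go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var n int
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				n++ // data race: two goroutines write n with no synchronization
			}
		}()
	}
	wg.Wait()
	fmt.Println(n) // the race detector reports the conflicting accesses above
}
```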
@davecheney, after testing with the race detector, the only data race comes from container/ring, which I originally thought was a safe library. By the way, below is another fatal message from the same program. Hope this is helpful.
If your program has a data race, no guarantees can be made about its operation. Please ask questions about container/ring on the mailing list. The short answer is that, like most of the standard library, container/ring is not safe for concurrent use without synchronization.
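For reference, since container/ring has no internal locking, concurrent access has to be guarded by the caller. A minimal sketch of one way to share a ring between goroutines (illustrative only, not the reporter's code):

```go
package main

import (
	"container/ring"
	"fmt"
	"sync"
)

// lockedRing wraps a *ring.Ring with a mutex so that multiple goroutines
// can safely append values and take snapshots.
type lockedRing struct {
	mu sync.Mutex
	r  *ring.Ring
}

func newLockedRing(n int) *lockedRing {
	return &lockedRing{r: ring.New(n)}
}

// Put stores v in the current slot and advances the ring.
func (l *lockedRing) Put(v interface{}) {
	l.mu.Lock()
	l.r.Value = v
	l.r = l.r.Next()
	l.mu.Unlock()
}

// Snapshot copies the ring's current contents under the lock.
func (l *lockedRing) Snapshot() []interface{} {
	l.mu.Lock()
	defer l.mu.Unlock()
	var out []interface{}
	l.r.Do(func(v interface{}) {
		if v != nil {
			out = append(out, v)
		}
	})
	return out
}

func main() {
	lr := newLockedRing(8)
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for j := 0; j < 10; j++ {
				lr.Put(fmt.Sprintf("worker %d item %d", id, j))
			}
		}(i)
	}
	wg.Wait()
	fmt.Println(len(lr.Snapshot()), "items retained")
}
```

With a wrapper like this in place, go test -race should no longer report races on the ring itself.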
Closing since there isn't a repro that doesn't involve a data race.
I have seen this as well. This program ran for about 1.5 years under Go 1.3 and was upgraded to Go 1.6 a few weeks ago without further code modifications. It is race-free (so far as the race detector indicates) and is tested with representative load. This is rare, but I've now seen it twice today and it leaves the process hung and utterly unresponsive. gdb offered little insight, but I was in a hurry as this was in production. I couldn't send it a SIGABRT. I also suspect something in the RPC framework may have caught this panic and not crashed out correctly.
Please upgrade to Go 1.7.3; 1.6 is no longer supported.
@msolo-dropbox, thanks for the report. If you can reproduce it, either on 1.6 or, preferably, on 1.7.3, please grab a core file so we can work through postmortem debugging (either by sending me the core if that's possible, or I can tell you what to look at). I'll add some extra diagnostics around this on master so that if this pops up in future versions we'll know more.
This isn't a regular panic, so there's nothing the RPC framework could have done. Fatal errors in the runtime are supposed to abort the whole program, but it looks like we hold the allglock across the throw, which can self-deadlock in tracebackothers and leave the process hung instead of exiting.
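For anyone else who hits this and wants to capture something useful for post-mortem debugging: a deferred recover cannot intercept a runtime throw like this one, but the runtime can be told to dump core when it happens. A minimal sketch, assuming a Unix system with core dumps enabled (ulimit -c unlimited):

```go
package main

import "runtime/debug"

func main() {
	// Same effect as running with GOTRACEBACK=crash: on a fatal runtime
	// error, print stacks for all goroutines (including runtime frames)
	// and then abort, leaving a core file that gdb can open.
	debug.SetTraceback("crash")

	// Note: recover() only handles panics. A runtime throw such as
	// "fatal error: scan missed a g" terminates the process regardless
	// of any deferred recover.

	// ... rest of the program ...
}
```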
Thanks. Can you update this issue when you've committed?
CL https://golang.org/cl/33339 mentions this issue.
@lmb, perfect, thanks! Interestingly, according to that dump, all g's have been scanned. This could indicate a race where a goroutine was marked scanned between when the check failed and when the panic finished. @lmb, can I get you to check another thing in the core file?
Thanks!
The core doesn't contain a thread with that function on the stack, see…
Bleh, that's unfortunate. It's actually thread 38, but you're right that GDB isn't helping here. I'll try to pull together some gdb script that can get what I need from thread 38. What version of gdb are you running? (Really recent GDB versions may make this easier.)
Okay, here's something to try.
Hopefully that works. |
OK, I'll try that first thing in the morning. I need some time to understand what the Python files do. Meanwhile, I have…
I'm having trouble working with the core dump, unfortunately. It seems like the dump contains an mmapped file, which means it's roughly 300 GiB in size. I've not been able to strip out the troublesome section (using…). I'll look into performing the modifications in place, since I have a copy of the core on another system.
Thanks.
I tested with GDB 7.9, but I think the scripts should work with earlier GDBs.
Managed to get the output: debug.txt. cleanwithpcsp.py is missing an import of struct.
@aclements Is there anything else you would like me to look into?
Thanks for going to the trouble of getting this. Unfortunately, it looks like the state I needed was already gone even at the…
Oops, fixed. Thanks. (GDB doesn't respect module boundaries between Python scripts, so I must have had something else loaded that imported struct.)
I think we'll just have to wait for this to happen in the wild with the added diagnostics in Go 1.8.
Ah! I have a theory. The condition for calling… All of this requires a logical race on the termination condition that goes something like this: …
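As a loose illustration of the kind of check-then-act race described above (a toy model, not the runtime's actual code): a coordinator that reads "queue empty" and "no busy workers" as two separate observations can declare completion even though a busy worker produced new work in between, much like a mark worker discovering new grey objects while scanning a stack.

```go
package main

import (
	"sync"
	"sync/atomic"
)

// workPool is a toy analogue of a GC work queue: workers take jobs, and
// processing a job may generate new jobs (as stack scanning produces more
// objects to mark).
type workPool struct {
	mu    sync.Mutex
	queue []int
	busy  int64 // number of workers currently processing a job
}

func (p *workPool) take() (int, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.queue) == 0 {
		return 0, false
	}
	job := p.queue[len(p.queue)-1]
	p.queue = p.queue[:len(p.queue)-1]
	atomic.AddInt64(&p.busy, 1) // the job is now in flight
	return job, true
}

func (p *workPool) push(job int) {
	p.mu.Lock()
	p.queue = append(p.queue, job)
	p.mu.Unlock()
}

func (p *workPool) finish() { atomic.AddInt64(&p.busy, -1) }

// looksDone is the flawed termination check: it reads two pieces of state at
// different instants. Between the two reads a busy worker can push new work
// and then call finish, so the coordinator observes "queue empty" and "no
// busy workers" even though work remains — the analogue of declaring mark 1
// complete while a stack scan is still producing work.
func (p *workPool) looksDone() bool {
	p.mu.Lock()
	empty := len(p.queue) == 0
	p.mu.Unlock()
	return empty && atomic.LoadInt64(&p.busy) == 0
}

func main() {
	p := &workPool{queue: []int{1, 2, 3}}
	if job, ok := p.take(); ok {
		p.push(job + 10) // processing produced new work
		p.finish()
	}
	_ = p.looksDone()
}
```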
CL https://golang.org/cl/35353 mentions this issue.
Currently we check that all roots are marked as soon as gcMarkDone decides to transition from mark 1 to mark 2. However, issue #16083 indicates that there may be a race where we try to complete mark 1 while a worker is still scanning a stack, causing the root mark check to fail. We don't yet understand this race, but as a simple mitigation, move the root check to after gcMarkDone performs a ragged barrier, which will force any remaining workers to finish their current job.

Updates #16083. This may "fix" it, but it would be better to understand and fix the underlying race.

Change-Id: I1af9ce67bd87ade7bc2a067295d79c28cd11abd2
Reviewed-on: https://go-review.googlesource.com/35353
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
I still haven't been able to work out an exact race that leads to this, but if it's at least along the lines of what I'm thinking, the commit I just pushed should mitigate the issue. And if it doesn't, we'll at least get better diagnostics out of Go 1.8. Moving to Go 1.9 so we can continue to keep track of this. I'd still like to find and fix this real race.
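To make the mitigation in that commit concrete in the same toy terms (again, only an analogy, not the runtime's implementation): before trusting the termination check, force every worker to reach a point where it holds no in-flight job, and only then re-evaluate the condition.

```go
package main

import "sync"

// raggedBarrier lets a coordinator wait until every worker has, at its own
// pace, reached a point where it is between jobs ("ragged" because workers
// need not stop at the same moment). Only after that is it safe to
// re-evaluate a termination condition like looksDone above.
type raggedBarrier struct {
	wg sync.WaitGroup
}

func newRaggedBarrier(workers int) *raggedBarrier {
	b := &raggedBarrier{}
	b.wg.Add(workers)
	return b
}

// reached is called by a worker once its current job is fully finished.
func (b *raggedBarrier) reached() { b.wg.Done() }

// wait blocks until every worker has passed the barrier.
func (b *raggedBarrier) wait() { b.wg.Wait() }

func main() {
	const workers = 4
	b := newRaggedBarrier(workers)
	for i := 0; i < workers; i++ {
		go func() {
			// ... finish whatever job is currently in hand ...
			b.reached()
			// ... carry on with more work or exit ...
		}()
	}
	b.wait()
	// No worker is mid-job here, so a termination check cannot race with
	// work that an in-flight job was about to produce.
}
```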
CL https://golang.org/cl/35678 mentions this issue.
CL https://golang.org/cl/35677 mentions this issue.
…a g" Updates #18700 (backport) Currently there are no diagnostics for mark root check during marking. Fix this by printing out the same diagnostics we print during mark termination. Also, drop the allglock before throwing. Holding that across a throw causes a self-deadlock with tracebackothers. For #16083. Change-Id: Ib605f3ae0c17e70704b31d8378274cfaa2307dc2 Reviewed-on: https://go-review.googlesource.com/35677 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
…k roots

Fixes #18700 (backport)

Currently we check that all roots are marked as soon as gcMarkDone decides to transition from mark 1 to mark 2. However, issue #16083 indicates that there may be a race where we try to complete mark 1 while a worker is still scanning a stack, causing the root mark check to fail. We don't yet understand this race, but as a simple mitigation, move the root check to after gcMarkDone performs a ragged barrier, which will force any remaining workers to finish their current job.

Updates #16083. This may "fix" it, but it would be better to understand and fix the underlying race.

Change-Id: I1af9ce67bd87ade7bc2a067295d79c28cd11abd2
Reviewed-on: https://go-review.googlesource.com/35678
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
This last happened on November 5th, so I believe this is fixed.
Kubernetes' kubelet v1.7.8 crashes with "fatal error: scan missed a g".
@sigxcpu76 This issue is closed and we believe it is fixed. The "scan missed a g" error can occur for many different reasons. Please open a new issue with instructions for how to reproduce the problem you are seeing. Thanks.
What version of Go are you using (go version)?
go version go1.6 linux/amd64
What operating system and processor architecture are you using (go env)?
What did you do? If possible, provide a recipe for reproducing the error. A complete runnable program is good. A link on play.golang.org is best.
https://github.com/leoleovich/grafsy/blob/master/metric.go#L24
It was just working as a daemon.
true or false