runtime: fatal error: scan missed a g #16083

Closed
leoleovich opened this issue Jun 16, 2016 · 52 comments

@leoleovich

  1. What version of Go are you using (go version)?
    go version go1.6 linux/amd64
  2. What operating system and processor architecture are you using (go env)?
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/opt/go"
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GO15VENDOREXPERIMENT="1"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0"
CXX="g++"
CGO_ENABLED="1"
  3. What did you do?
    If possible, provide a recipe for reproducing the error.
    A complete runnable program is good.
    A link on play.golang.org is best.
    https://github.com/leoleovich/grafsy/blob/master/metric.go#L24
    It was just working as a daemon.
  4. What did you expect to see?
    true or false
  5. What did you see instead?
Jun  4 08:53:14 storm05 grafsy[32624]: fatal error: scan missed a g
Jun  4 08:53:14 storm05 grafsy[32624]: goroutine 783212 [running]:
Jun  4 08:53:14 storm05 grafsy[32624]: runtime.throw(0x620fd0, 0xf)
Jun  4 08:53:14 storm05 grafsy[32624]: /usr/local/go/src/runtime/panic.go:530 +0x90 fp=0xc82095c2f8 sp=0xc82095c2e0
Jun  4 08:53:14 storm05 grafsy[32624]: runtime.gcMarkRootCheck()
Jun  4 08:53:14 storm05 grafsy[32624]: /usr/local/go/src/runtime/mgcmark.go:87 +0x110 fp=0xc82095c320 sp=0xc82095c2f8
Jun  4 08:53:14 storm05 grafsy[32624]: runtime.gcMarkDone()
Jun  4 08:53:14 storm05 grafsy[32624]: /usr/local/go/src/runtime/mgc.go:1066 +0xcf fp=0xc82095c340 sp=0xc82095c320
Jun  4 08:53:14 storm05 grafsy[32624]: runtime.gcAssistAlloc(0xc820bf0300)
Jun  4 08:53:14 storm05 grafsy[32624]: /usr/local/go/src/runtime/mgcmark.go:428 +0x18c fp=0xc82095c3c0 sp=0xc82095c340
Jun  4 08:53:14 storm05 grafsy[32624]: runtime.mallocgc(0x140, 0x5ec4e0, 0xc800000000, 0xc820dd97e0)
Jun  4 08:53:14 storm05 grafsy[32624]: /usr/local/go/src/runtime/malloc.go:537 +0x184 fp=0xc82095c498 sp=0xc82095c3c0
Jun  4 08:53:14 storm05 grafsy[32624]: runtime.newarray(0x5ec4e0, 0x8, 0x7225a0)
Jun  4 08:53:14 storm05 grafsy[32624]: /usr/local/go/src/runtime/malloc.go:798 +0xc9 fp=0xc82095c4d8 sp=0xc82095c498
Jun  4 08:53:14 storm05 grafsy[32624]: runtime.growslice(0x584a00, 0xc820f6bb80, 0x4, 0x4, 0x5, 0x0, 0x0, 0x0)
Jun  4 08:53:14 storm05 grafsy[32624]: /usr/local/go/src/runtime/slice.go:100 +0x2c1 fp=0xc82095c548 sp=0xc82095c4d8
Jun  4 08:53:14 storm05 grafsy[32624]: regexp/syntax.(*compiler).rune(0xc82095d838, 0xc8207c0d50, 0x1, 0x4, 0xd4, 0x0)
Jun  4 08:53:14 storm05 grafsy[32624]: /usr/local/go/src/regexp/syntax/compile.go:267 +0x3fe fp=0xc82095c608 sp=0xc82095c548
Jun  4 08:53:14 storm05 grafsy[32624]: regexp/syntax.(*compiler).compile(0xc82095d838, 0xc820bba850, 0x0)
Jun  4 08:53:14 storm05 grafsy[32624]: /usr/local/go/src/regexp/syntax/compile.go:110 +0x54b fp=0xc82095c890 sp=0xc82095c608
Jun  4 08:53:14 storm05 grafsy[32624]: regexp/syntax.(*compiler).compile(0xc82095d838, 0xc820bbaa10, 0x0)
Jun  4 08:53:14 storm05 grafsy[32624]: /usr/local/go/src/regexp/syntax/compile.go:163 +0x1b9a fp=0xc82095cb18 sp=0xc82095c890
Jun  4 08:53:14 storm05 grafsy[32624]: regexp/syntax.(*compiler).compile(0xc82095d838, 0xc820bba7e0, 0x0)
Jun  4 08:53:14 storm05 grafsy[32624]: /usr/local/go/src/regexp/syntax/compile.go:138 +0x14ca fp=0xc82095cda0 sp=0xc82095cb18
Jun  4 08:53:14 storm05 grafsy[32624]: regexp/syntax.(*compiler).compile(0xc82095d838, 0xc820bbaa80, 0x0)
Jun  4 08:53:14 storm05 grafsy[32624]: /usr/local/go/src/regexp/syntax/compile.go:154 +0x1a4d fp=0xc82095d028 sp=0xc82095cda0
Jun  4 08:53:14 storm05 grafsy[32624]: regexp/syntax.(*compiler).compile(0xc82095d838, 0xc820bba770, 0x0)
Jun  4 08:53:14 storm05 grafsy[32624]: /usr/local/go/src/regexp/syntax/compile.go:138 +0x14ca fp=0xc82095d2b0 sp=0xc82095d028
Jun  4 08:53:14 storm05 grafsy[32624]: regexp/syntax.(*compiler).compile(0xc82095d838, 0xc820bbaaf0, 0x0)
Jun  4 08:53:14 storm05 grafsy[32624]: /usr/local/go/src/regexp/syntax/compile.go:146 +0x1783 fp=0xc82095d538 sp=0xc82095d2b0
Jun  4 08:53:14 storm05 grafsy[32624]: regexp/syntax.(*compiler).compile(0xc82095d838, 0xc8209f0d20, 0x0)
Jun  4 08:53:14 storm05 grafsy[32624]: /usr/local/go/src/regexp/syntax/compile.go:156 +0x1ab3 fp=0xc82095d7c0 sp=0xc82095d538
Jun  4 08:53:14 storm05 grafsy[32624]: regexp/syntax.Compile(0xc8209f0d20, 0xc8209f0d20, 0x0, 0x0)
Jun  4 08:53:14 storm05 grafsy[32624]: /usr/local/go/src/regexp/syntax/compile.go:83 +0x189 fp=0xc82095d888 sp=0xc82095d7c0
Jun  4 08:53:14 storm05 grafsy[32624]: regexp.compile(0xc820061380, 0xcb, 0xd4, 0x51, 0x0, 0x0)
Jun  4 08:53:14 storm05 grafsy[32624]: /usr/local/go/src/regexp/regexp.go:172 +0x109 fp=0xc82095d948 sp=0xc82095d888
Jun  4 08:53:14 storm05 grafsy[32624]: regexp.Compile(0xc820061380, 0xcb, 0xc82070c900, 0x0, 0x0)
Jun  4 08:53:14 storm05 grafsy[32624]: /usr/local/go/src/regexp/regexp.go:129 +0x43 fp=0xc82095d980 sp=0xc82095d948
Jun  4 08:53:14 storm05 grafsy[32624]: regexp.MatchString(0xc820061380, 0xcb, 0xc82070c900, 0x51, 0x160, 0x0, 0x0)
Jun  4 08:53:14 storm05 grafsy[32624]: /usr/local/go/src/regexp/regexp.go:434 +0x3d fp=0xc82095d9b0 sp=0xc82095d980
Jun  4 08:53:14 storm05 grafsy[32624]: main.validateMetric(0xc82070c900, 0x51, 0xc820061380, 0xcb, 0xc82070c7e0)
Jun  4 08:53:14 storm05 grafsy[32624]: /opt/go/src/grafsy/metric.go:24 +0x3f fp=0xc82095d9f0 sp=0xc82095d9b0
Jun  4 08:53:14 storm05 grafsy[32624]: main.Server.cleanAndUseIncomingData(0xa, 0x4e20, 0xc82000e1e0, 0x11, 0xc82000e220, 0xe, 0xc82000e260, 0x1a, 0xc82000a9f0, 0xb, ...)
Jun  4 08:53:14 storm05 grafsy[32624]: /opt/go/src/grafsy/server.go:105 +0x119 fp=0xc82095db18 sp=0xc82095d9f0
Jun  4 08:53:14 storm05 grafsy[32624]: main.Server.handleRequest(0xa, 0x4e20, 0xc82000e1e0, 0x11, 0xc82000e220, 0xe, 0xc82000e260, 0x1a, 0xc82000a9f0, 0xb, ...)
Jun  4 08:53:14 storm05 grafsy[32624]: /opt/go/src/grafsy/server.go:142 +0x2c1 fp=0xc82095de20 sp=0xc82095db18
Jun  4 08:53:14 storm05 grafsy[32624]: runtime.goexit()
Jun  4 08:53:14 storm05 grafsy[32624]: /usr/local/go/src/runtime/asm_amd64.s:1998 +0x1 fp=0xc82095de28 sp=0xc82095de20
Jun  4 08:53:14 storm05 grafsy[32624]: created by main.Server.runServer
Jun  4 08:53:14 storm05 grafsy[32624]: /opt/go/src/grafsy/server.go:193 +0x597
@ianlancetaylor ianlancetaylor changed the title Go crashed on regexp.MatchString runtime: fatal error: scan missed a g Jun 16, 2016
@ianlancetaylor ianlancetaylor added this to the Go1.7Maybe milestone Jun 16, 2016
@ianlancetaylor
Contributor

CC @aclements @RLH

Is this reproducible?

@RLH
Contributor

RLH commented Jun 16, 2016

Is there any way we can get the input that caused the error so we can try
and reproduce it locally? I can see it came in from the network, so perhaps
not.

Correct me if I missed something, but it looks like the program does not use
unsafe or cgo?

Finally, can you reproduce it reliably? If not, how often does it show up?


@RLH
Contributor

RLH commented Jun 16, 2016

There were some bugs fixed between 1.6 and 1.6.2 that might be related; can
you upgrade to 1.6.2 and see if it is still reproducible?


@leoleovich
Author

This happened only once, on 1 machine out of a few thousand.
Unfortunately I do not know the input, because the program crashed and did not leave a log entry.
I only wanted to mention it because maybe you have something similar in mind.

I will look for updates and come back if this happens again.

@leoleovich
Author

leoleovich commented Jun 17, 2016

Correct me if I missed something but it looks like the program does not use
unsafe or cgo

I do not use either of them.

@RLH
Contributor

RLH commented Jun 17, 2016

There doesn't seem to be anything that can be done here. 1.6.2 fixes
similar problems, but since this one isn't reproducible there is no way to
be sure it is fixed. If it happens with 1.6.2 or 1.7, please refile.


@leoleovich
Author

Hello again.
It happened again. This time the Go version is 1.6.2.
Unfortunately, I still can't say which data caused the error.

Jun 19 13:07:54 storm02 grafsy[15438]: fatal error: scan missed a g
Jun 19 13:07:54 storm02 grafsy[15438]: goroutine 680102 [running]:
Jun 19 13:07:54 storm02 grafsy[15438]: runtime.throw(0x622370, 0xf)
Jun 19 13:07:54 storm02 grafsy[15438]: /usr/local/go/src/runtime/panic.go:547 +0x90 fp=0xc8209ad4f8 sp=0xc8209ad4e0
Jun 19 13:07:54 storm02 grafsy[15438]: runtime.gcMarkRootCheck()
Jun 19 13:07:54 storm02 grafsy[15438]: /usr/local/go/src/runtime/mgcmark.go:87 +0x110 fp=0xc8209ad520 sp=0xc8209ad4f8
Jun 19 13:07:54 storm02 grafsy[15438]: runtime.gcMarkDone()
Jun 19 13:07:54 storm02 grafsy[15438]: /usr/local/go/src/runtime/mgc.go:1067 +0xcf fp=0xc8209ad540 sp=0xc8209ad520
Jun 19 13:07:54 storm02 grafsy[15438]: runtime.gcAssistAlloc(0xc820873b00)
Jun 19 13:07:54 storm02 grafsy[15438]: /usr/local/go/src/runtime/mgcmark.go:428 +0x18c fp=0xc8209ad5c0 sp=0xc8209ad540
Jun 19 13:07:54 storm02 grafsy[15438]: runtime.mallocgc(0x8, 0x6044c0, 0x0, 0xc8208ee000)
Jun 19 13:07:54 storm02 grafsy[15438]: /usr/local/go/src/runtime/malloc.go:537 +0x184 fp=0xc8209ad698 sp=0xc8209ad5c0
Jun 19 13:07:54 storm02 grafsy[15438]: runtime.newarray(0x6044c0, 0x1, 0x0)
Jun 19 13:07:54 storm02 grafsy[15438]: /usr/local/go/src/runtime/malloc.go:798 +0xc9 fp=0xc8209ad6d8 sp=0xc8209ad698
Jun 19 13:07:54 storm02 grafsy[15438]: runtime.growslice(0x583fc0, 0x0, 0x0, 0x0, 0x1, 0x0, 0x0, 0x0)
Jun 19 13:07:54 storm02 grafsy[15438]: /usr/local/go/src/runtime/slice.go:100 +0x2c1 fp=0xc8209ad748 sp=0xc8209ad6d8
Jun 19 13:07:54 storm02 grafsy[15438]: regexp.(*Regexp).put(0xc820abe960, 0xc820302480)
Jun 19 13:07:54 storm02 grafsy[15438]: /usr/local/go/src/regexp/regexp.go:222 +0xd2 fp=0xc8209ad7a0 sp=0xc8209ad748
Jun 19 13:07:54 storm02 grafsy[15438]: regexp.(*Regexp).doExecute(0xc820abe960, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc820339260, 0x68, 0x0, 0x0, ...)
Jun 19 13:07:54 storm02 grafsy[15438]: /usr/local/go/src/regexp/exec.go:461 +0x206 fp=0xc8209ad900 sp=0xc8209ad7a0
Jun 19 13:07:54 storm02 grafsy[15438]: regexp.(*Regexp).MatchString(0xc820abe960, 0xc820339260, 0x68, 0x0)
Jun 19 13:07:54 storm02 grafsy[15438]: /usr/local/go/src/regexp/regexp.go:411 +0x67 fp=0xc8209ad970 sp=0xc8209ad900
Jun 19 13:07:54 storm02 grafsy[15438]: regexp.MatchString(0xc8200632b0, 0xcb, 0xc820339260, 0x68, 0x170, 0x0, 0x0)
Jun 19 13:07:54 storm02 grafsy[15438]: /usr/local/go/src/regexp/regexp.go:438 +0x83 fp=0xc8209ad9a0 sp=0xc8209ad970
Jun 19 13:07:54 storm02 grafsy[15438]: main.validateMetric(0xc820339260, 0x68, 0xc8200632b0, 0xcb, 0xc820338fc0)
Jun 19 13:07:54 storm02 grafsy[15438]: /opt/go/src/grafsy/metric.go:24 +0x3f fp=0xc8209ad9e0 sp=0xc8209ad9a0
Jun 19 13:07:54 storm02 grafsy[15438]: main.Server.cleanAndUseIncomingData(0xc82000a750, 0x7, 0xa, 0x2710, 0xc820012240, 0x1a, 0xc820012280, 0xe, 0xc8200122c0, 0x1a, ...)
Jun 19 13:07:54 storm02 grafsy[15438]: /opt/go/src/grafsy/server.go:106 +0x119 fp=0xc8209adb08 sp=0xc8209ad9e0
Jun 19 13:07:54 storm02 grafsy[15438]: main.Server.handleRequest(0xc82000a750, 0x7, 0xa, 0x2710, 0xc820012240, 0x1a, 0xc820012280, 0xe, 0xc8200122c0, 0x1a, ...)
Jun 19 13:07:54 storm02 grafsy[15438]: /opt/go/src/grafsy/server.go:143 +0x2c1 fp=0xc8209ade20 sp=0xc8209adb08
Jun 19 13:07:54 storm02 grafsy[15438]: runtime.goexit()
Jun 19 13:07:54 storm02 grafsy[15438]: /usr/local/go/src/runtime/asm_amd64.s:1998 +0x1 fp=0xc8209ade28 sp=0xc8209ade20
Jun 19 13:07:54 storm02 grafsy[15438]: created by main.Server.runServer
Jun 19 13:07:54 storm02 grafsy[15438]: /opt/go/src/grafsy/server.go:194 +0x597

Maybe I should try recover, or is there another way to catch it?

@ianlancetaylor
Contributor

Thanks for the additional report with 1.6.2.

Is there any way that we can reproduce this problem ourselves?

There is no way to catch this kind of error.

@leoleovich
Author

leoleovich commented Jun 21, 2016

@ianlancetaylor we are accepting ~2000 metrics per second, checking whether each one matches a regexp, and processing only the valid metrics. Nothing really special, except maybe running GC manually: https://github.com/leoleovich/grafsy/blob/master/client.go#L130 (but that is in another goroutine).
So it could just be a very ugly long string.
For example:
backend.analytics.metrics_pipeline_schematized.conversion.metricsjs-frontend.portal.en.page-ready 1 946728900
But this line did not crash the app, so it is something else...
@ianlancetaylor
Contributor

Unless you are running into some sort of hard memory limit, there is really no reason to call runtime.GC yourself. It will only slow your program down. But I don't see how that could be the cause of this problem.
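
A minimal sketch of that alternative, assuming the daemon just needs the collector to run a bit more eagerly (the value here is an arbitrary example): tune the pacer with GOGC or runtime/debug.SetGCPercent instead of forcing collections:

    package main

    import "runtime/debug"

    func main() {
        // Equivalent to running with GOGC=50: the collector triggers more eagerly
        // than the default GOGC=100. No explicit runtime.GC() calls are needed in
        // the daemon's steady state.
        debug.SetGCPercent(50)
        // ... start the daemon ...
    }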

In order to fix this problem, we need some way to reproduce it. Is there any way we can reproduce it ourselves?

Has it happened more than twice?

Is there any similarity between the times that it happened? For example, did it happen on the same machine, or on different ones?

Is there a chance that you could try the 1.7 beta release to see if it makes any difference?

@leoleovich
Author

@ianlancetaylor This happened only on 2 hosts out of a few thousand, but on these hosts we have the most loaded instances of Grafsy.
I have no idea how to reproduce it, and you say it is impossible to catch it.
You can see from the logs I provided that it happened at Jun 4 08:53:14 and at Jun 19 13:07:54 on 2 different hosts.
For now I do not have anything else to help you with, unless you give me a hint on how to catch it.

@ianlancetaylor
Contributor

We don't know how to reproduce and we don't know what the problem is. Punting to 1.8.

@ianlancetaylor ianlancetaylor modified the milestones: Go1.8, Go1.7Maybe Jun 28, 2016
@eaglerayp

What version of Go are you using (go version)?
go version go1.6 linux/amd64

What did you see instead?

fatal error: scan missed a g

goroutine 2082 [running]:
runtime.throw(0xb48370, 0xf)
    /usr/local/go/src/runtime/panic.go:530 +0x90 fp=0xc8228a14f0 sp=0xc8228a14d8
runtime.gcMarkRootCheck()
    /usr/local/go/src/runtime/mgcmark.go:87 +0x110 fp=0xc8228a1518 sp=0xc8228a14f0
runtime.gcMarkDone()
    /usr/local/go/src/runtime/mgc.go:1066 +0xcf fp=0xc8228a1538 sp=0xc8228a1518
runtime.gcAssistAlloc(0xc823098900)
    /usr/local/go/src/runtime/mgcmark.go:428 +0x18c fp=0xc8228a15b8 sp=0xc8228a1538
runtime.mallocgc(0xf0, 0x95afa0, 0xc800000000, 0x412b52)
    /usr/local/go/src/runtime/malloc.go:537 +0x184 fp=0xc8228a1690 sp=0xc8228a15b8
runtime.newarray(0x95afa0, 0xf, 0x20a)
    /usr/local/go/src/runtime/malloc.go:798 +0xc9 fp=0xc8228a16d0 sp=0xc8228a1690
runtime.makeslice(0x943b60, 0xf, 0xf, 0x0, 0x0, 0x0)
    /usr/local/go/src/runtime/slice.go:32 +0x165 fp=0xc8228a1720 sp=0xc8228a16d0
net/textproto.(*Reader).ReadMIMEHeader(0xc822b380f0, 0xc8211aa580, 0x0, 0x0)
    /usr/local/go/src/net/textproto/reader.go:475 +0x89 fp=0xc8228a1858 sp=0xc8228a1720
net/http.readRequest(0xc822734f60, 0x0, 0xc8210597a0, 0x0, 0x0)
    /usr/local/go/src/net/http/request.go:765 +0x6b7 fp=0xc8228a19d0 sp=0xc8228a1858
net/http.(*conn).readRequest(0xc820c9a200, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/http/server.go:705 +0x359 fp=0xc8228a1bc0 sp=0xc8228a19d0
net/http.(*conn).serve(0xc820c9a200)
    /usr/local/go/src/net/http/server.go:1425 +0x947 fp=0xc8228a1f88 sp=0xc8228a1bc0
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1998 +0x1 fp=0xc8228a1f90 sp=0xc8228a1f88
created by net/http.(*Server).Serve
    /usr/local/go/src/net/http/server.go:2137 +0x44e

We encountered the same error, pointing at go/src/runtime/malloc.go:798.

@davecheney
Contributor

@eaglerayp have you confirmed your program is absolutely free of data races? Are you able to provide a piece of code which reproduces the error?

@eaglerayp

eaglerayp commented Jul 29, 2016

@davecheney, here is more information.
We are running an API server as multiple pods on Kubernetes under HTTP request stress testing.
Only one pod encountered this error.
The error stack mentioned above is the full stack, so I cannot point to a more precise code block.
We are using the go-restful HTTP framework.

We are still testing now; if we manage to reproduce this problem, I will let you know.

@davecheney
Contributor

Please build and test your application with the race detector to confirm there are no data races in your program.
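
For reference, a minimal example of the kind of bug the race detector reports when the program is built or tested with -race (go build -race / go test -race):

    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        var wg sync.WaitGroup
        counter := 0
        for i := 0; i < 2; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                counter++ // unsynchronized write from two goroutines: reported by -race
            }()
        }
        wg.Wait()
        fmt.Println(counter)
    }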


@eaglerayp

@davecheney, after testing with the race detector, the only data race comes from container/ring, which I originally thought was a safe library.
As a result, does that mean that if we use container/ring we should manually add a mutex around the ring to avoid corruption, even though I don't care about the correctness of the data in the ring?

By the way, below is another fatal message from the same program. Hope this is helpful.

fatal error: scan missed a g 

goroutine 17796 [running]: 
runtime.throw(0xb48370, 0xf) 
/usr/local/go/src/runtime/panic.go:530 +0x90 fp=0xc820520ef8 sp=0xc820520ee0 
runtime.gcMarkRootCheck() 
/usr/local/go/src/runtime/mgcmark.go:87 +0x110 fp=0xc820520f20 sp=0xc820520ef8 
runtime.gcMarkDone() 
/usr/local/go/src/runtime/mgc.go:1066 +0xcf fp=0xc820520f40 sp=0xc820520f20 
runtime.gcBgMarkWorker(0xc82003b500) 
/usr/local/go/src/runtime/mgc.go:1479 +0x488 fp=0xc820520fb8 sp=0xc820520f40 
runtime.goexit() 
/usr/local/go/src/runtime/asm_amd64.s:1998 +0x1 fp=0xc820520fc0 sp=0xc820520fb8 
created by runtime.gcBgMarkStartWorkers 
/usr/local/go/src/runtime/mgc.go:1329 +0x92

@davecheney
Contributor

If your program has a data race, no guarantees about its operation can be
given. We cannot accept a bug report about a fault in the runtime unless
your program is 100% race free.

Please ask questions about container/ring on the mailing list. The short
version is unless any type in Go says it is safe for concurrent use, you
must use a mutex to serialise execution. This includes container/ring.
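
A minimal sketch of what that looks like for container/ring; the wrapper type and its methods below are made up for illustration, the point is only that every access to the ring goes through one mutex:

    package main

    import (
        "container/ring"
        "sync"
    )

    // safeRing serialises all access to the underlying ring with a single mutex.
    type safeRing struct {
        mu sync.Mutex
        r  *ring.Ring
    }

    func newSafeRing(n int) *safeRing {
        return &safeRing{r: ring.New(n)}
    }

    // Put stores a value in the current slot and advances the ring.
    func (s *safeRing) Put(v interface{}) {
        s.mu.Lock()
        defer s.mu.Unlock()
        s.r.Value = v
        s.r = s.r.Next()
    }

    // Snapshot copies the non-nil values out under the same lock.
    func (s *safeRing) Snapshot() []interface{} {
        s.mu.Lock()
        defer s.mu.Unlock()
        var out []interface{}
        s.r.Do(func(v interface{}) {
            if v != nil {
                out = append(out, v)
            }
        })
        return out
    }

    func main() {
        sr := newSafeRing(3)
        sr.Put("a")
        sr.Put("b")
        _ = sr.Snapshot()
    }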


@quentinmit
Contributor

Closing since there isn't a repro that doesn't involve a data race.

@msolo-dropbox

I have seen this as well. This program ran for about 1.5 years under Go 1.3 and was upgraded to Go 1.6 a few weeks ago without further code modifications. It is race-free (so far as the race detector indicates) and is tested with representative load. This is rare, but I've now seen it twice today and it leaves the process hung and utterly unresponsive. gdb offered little insight, but I was in a hurry as this was in production. I couldn't send it a SIGABRT. I also suspect something in the RPC framework may have caught this panic and not crashed out correctly.

fatal error: scan missed a g

goroutine 2906450 [running]:
runtime.throw(0x983c80, 0xf)
        //server/external/go1_6_2_linux_amd64_tar_gz/go/src/runtime/panic.go:547 +0x90 fp=0xc8214c2480 sp=0xc8214c2468
runtime.gcMarkRootCheck()
        //server/external/go1_6_2_linux_amd64_tar_gz/go/src/runtime/mgcmark.go:87 +0x110 fp=0xc8214c24a8 sp=0xc8214c2480
runtime.gcMarkDone()
        //server/external/go1_6_2_linux_amd64_tar_gz/go/src/runtime/mgc.go:1067 +0xcf fp=0xc8214c24c8 sp=0xc8214c24a8
runtime.gcAssistAlloc(0xc82072c000)
        //server/external/go1_6_2_linux_amd64_tar_gz/go/src/runtime/mgcmark.go:428 +0x18c fp=0xc8214c2548 sp=0xc8214c24c8
runtime.mallocgc(0xb, 0x0, 0xc800000003, 0x0)
        //server/external/go1_6_2_linux_amd64_tar_gz/go/src/runtime/malloc.go:537 +0x184 fp=0xc8214c2620 sp=0xc8214c2548
runtime.rawstring(0xb, 0x0, 0x0, 0x0, 0x0, 0x0)
        //server/external/go1_6_2_linux_amd64_tar_gz/go/src/runtime/string.go:284 +0x70 fp=0xc8214c2668 sp=0xc8214c2620
runtime.rawstringtmp(0x0, 0xb, 0x0, 0x0, 0x0, 0x0, 0x0)
        //server/external/go1_6_2_linux_amd64_tar_gz/go/src/runtime/string.go:111 +0xb7 fp=0xc8214c26a0 sp=0xc8214c2668
runtime.concatstrings(0x0, 0xc8214c2830, 0x3, 0x3, 0x0, 0x0)
        //server/external/go1_6_2_linux_amd64_tar_gz/go/src/runtime/string.go:49 +0x1bb fp=0xc8214c27d8 sp=0xc8214c26a0
runtime.concatstring3(0x0, 0x969910, 0x5, 0x9698e8, 0x1, 0xc82022ccf5, 0x5, 0x0, 0x0)
        //server/external/go1_6_2_linux_amd64_tar_gz/go/src/runtime/string.go:63 +0x6a fp=0xc8214c2828 sp=0xc8214c27d8
os.(*File).readdir(0xc8207ba470, 0xffffffffffffffff, 0xc822200500, 0x318, 0x353, 0x0, 0x0)
        //server/external/go1_6_2_linux_amd64_tar_gz/go/src/os/file_unix.go:199 +0x202 fp=0xc8214c2910 sp=0xc8214c2828
os.(*File).Readdir(0xc8207ba470, 0xffffffffffffffff, 0x0, 0x0, 0x0, 0x0, 0x0)
        //server/external/go1_6_2_linux_amd64_tar_gz/go/src/os/doc.go:119 +0x85 fp=0xc8214c2950 sp=0xc8214c2910
io/ioutil.ReadDir(0x969910, 0x5, 0x0, 0x0, 0x0, 0x0, 0x0)
        //server/external/go1_6_2_linux_amd64_tar_gz/go/src/io/ioutil/ioutil.go:105 +0xcc fp=0xc8214c29d8 sp=0xc8214c2950
dropbox/dbxinit/procfs.readAllProcStats(0x0, 0x0, 0x0, 0x0, 0x0)
        //server/go/src/dropbox/dbxinit/procfs/stat.go:56 +0x72 fp=0xc8214c2b28 sp=0xc8214c29d8
dropbox/dbxinit/procfs.GetSessionIdPids(0xa7cd, 0x0, 0x0, 0x0, 0x0, 0x0)
        //server/go/src/dropbox/dbxinit/procfs/stat.go:112 +0x50 fp=0xc8214c2bf0 sp=0xc8214c2b28
dropbox/dbxinit/proc.killSession(0xc820ee8140, 0xa7cd, 0xf, 0xc8214c2da0, 0x2, 0x10, 0x0, 0x0, 0x0, 0x0, ...)
        //server/go/src/dropbox/dbxinit/proc/kill.go:87 +0x73 fp=0xc8214c2d48 sp=0xc8214c2bf0
dropbox/dbxinit/proc.killSessionWithTimeout(0xc820ee8140, 0xa7cd, 0xf, 0x1bf08eb000, 0x1, 0x0, 0x0)
        //server/go/src/dropbox/dbxinit/proc/kill.go:61 +0x1d2 fp=0xc8214c2f10 sp=0xc8214c2d48
dropbox/dbxinit/proc.killProc(0xc820626bd0, 0x6a, 0xf, 0x1bf08eb000, 0x9d1901, 0x0, 0x0)
        //server/go/src/dropbox/dbxinit/proc/kill.go:31 +0x1f3 fp=0xc8214c2fb0 sp=0xc8214c2f10
dropbox/dbxinit/proc.KillProc(0xc820626bd0, 0x6a, 0xf, 0x1bf08eb000, 0x0, 0x0)
        //server/go/src/dropbox/dbxinit/proc/kill.go:17 +0x50 fp=0xc8214c2ff0 sp=0xc8214c2fb0
dropbox/dbxinit/subtask.(*subtask).performStop(0xc82085a500, 0x0, 0x0)
        //server/go/src/dropbox/dbxinit/subtask/subtask.go:598 +0x7a fp=0xc8214c3038 sp=0xc8214c2ff0
dropbox/dbxinit/subtask.(*subtask).stopping(0xc82085a500, 0x0, 0x0)
        //server/go/src/dropbox/dbxinit/subtask/subtask.go:617 +0x6f fp=0xc8214c3258 sp=0xc8214c3038
dropbox/dbxinit/subtask.(*subtask).handleStopping(0xc82085a500, 0x0, 0x0)
        //server/go/src/dropbox/dbxinit/subtask/subtask.go:589 +0x7f fp=0xc8214c3288 sp=0xc8214c3258
dropbox/dbxinit/subtask.(*subtask).running(0xc82085a500, 0x0, 0x0)
        //server/go/src/dropbox/dbxinit/subtask/subtask.go:576 +0x127 fp=0xc8214c33d0 sp=0xc8214c3288
dropbox/dbxinit/subtask.(*subtask).started(0xc82085a500, 0x0, 0x0)
        //server/go/src/dropbox/dbxinit/subtask/subtask.go:295 +0x354 fp=0xc8214c36f8 sp=0xc8214c33d0
dropbox/dbxinit/subtask.(*subtask).starting(0xc82085a500, 0x0, 0x0)
        //server/go/src/dropbox/dbxinit/subtask/subtask.go:257 +0x259 fp=0xc8214c3a68 sp=0xc8214c36f8
dropbox/dbxinit/subtask.(*subtask).delaying(0xc82085a500, 0x0, 0x0)
        //server/go/src/dropbox/dbxinit/subtask/subtask.go:199 +0x161 fp=0xc8214c3b40 sp=0xc8214c3a68
dropbox/dbxinit/subtask.(*subtask).Run(0xc82085a500, 0x0, 0x0)
        //server/go/src/dropbox/dbxinit/subtask/subtask.go:658 +0x27e fp=0xc8214c3da8 sp=0xc8214c3b40
dropbox/dbxinit/task.(*task).createSubtask.func1(0xc8203b8900, 0x7fd6f36546c0, 0xc82085a500)
        //server/go/src/dropbox/dbxinit/task/task.go:578 +0x7f fp=0xc8214c3f78 sp=0xc8214c3da8
runtime.goexit()
        //server/external/go1_6_2_linux_amd64_tar_gz/go/src/runtime/asm_amd64.s:1998 +0x1 fp=0xc8214c3f80 sp=0xc8214c3f78
created by dropbox/dbxinit/task.(*task).createSubtask
        //server/go/src/dropbox/dbxinit/task/task.go:581 +0x340

@davecheney
Contributor

Please upgrade to Go 1.7.3; 1.6 is no longer supported.


@aclements
Member

@msolo-dropbox, thanks for the report. If you can reproduce it, either on 1.6 or, preferably, on 1.7.3, please grab a core file so we can work through postmortem debugging (either by sending me the core if that's possible, or I can tell you what to look at).

I'll add some extra diagnostics around this on master so that if this pops up in future versions we'll know more.

I also suspect something in the RPC framework may have caught this panic and not crashed out correctly.

This isn't a regular panic, so there's nothing the RPC framework could have done. Fatal errors in the runtime are supposed to abort the whole program, but it looks like we hold the allglock around this particular throw, so when the runtime tries to print the stack traces of other goroutines for the panic message, it self-deadlocks on allglock. I'll fix that on master, too.

@msolo-dropbox

Thanks. Can you update this issue when you've committed?


@gopherbot

CL https://golang.org/cl/33339 mentions this issue.

@aclements
Member

@lmb, perfect, thanks! Interestingly, according to that dump, all g's have been scanned. This could indicate a race where a goroutine was marked scanned between when the check failed and when the panic finished.

@lmb, can I get you to check another thing in the core file?

  1. Run thread apply all bt and search for runtime.gcMarkRootCheck to find the thread ID that actually panicked and the frame number of gcMarkRootCheck itself in that thread.

  2. Run thread <thread ID> to switch to that thread and then frame <frame num> to switch to gcMarkRootCheck's frame.

  3. Paste the output of bt, info locals, info registers, and disas.

Thanks!

@lmb
Contributor

lmb commented Jan 9, 2017

The core doesn't contain a thread with that function on the stack, see
thread-bt.txt. AFAIR the process dumped the above and then entered a zombie state; we pulled the core a few hours later with the following commands:

sudo gdb --pid=6600
...
(gdb) gcore
warning: target file /proc/6600/cmdline contained unexpected null characters
warning: Memory read failed for corefile section, 1048576 bytes at 0x7ee9a508d000.
warning: Memory read failed for corefile section, 8192 bytes at 0x7ffc9dbd5000.
Saved corefile core.6600
(gdb) detach
Detaching from program: /usr/local/bin/qsdaemon, process 6600
(gdb) quit

@aclements
Member

Bleh, that's unfortunate. It's actually thread 38, but you're right that GDB isn't helping here. I'll try to pull together some gdb script that can get what I need from thread 38. What version of gdb are you running? (Really recent GDB versions may make this easier.)

@aclements
Member

Okay, here's something to try.

  1. Download https://gist.github.com/aclements/8d2d6a1d1ade4bc4fd492db6d3eb07a5 and https://gist.github.com/aclements/ad65da1167cc8b503b4ba3c8441dfce8.

  2. Open GDB with the core file and run source corewithpcsp.py and source getg.py to load the two GDB scripts downloaded above.

  3. Run thread 38, then core-with-pc-sp $getg().m.curg.sched.pc $getg().m.curg.sched.sp. This will print the name of a new core file that's like the current core file, but thread 1 now has the stack we're looking for. Yes, it's awful.

  4. Run a new GDB with the same binary and the new core file that was printed above. From here, you should be able to run bt to get runtime.gcMarkRootCheck's frame number, and frame <frame num> to switch to it.

  5. Paste the output of bt, info locals, info registers, and disas.

Hopefully that works.

@lmb
Contributor

lmb commented Jan 9, 2017

Ok, I'll try that next thing in the morning. I need some time to understand what the python files do.

Meanwhile, I have GNU gdb (GDB) 7.6.2 (Debian 7.6.2-1) and GNU gdb (Debian 7.7.1+dfsg-5) 7.7.1. The first box is the one I used for the previous output. If the scripts work with this, it's the least hassle. If all else fails I can move to a box which has the newer version.

@lmb
Contributor

lmb commented Jan 10, 2017

I'm having trouble working with the core dump, unfortunately. It seems like the dump contains an mmapped file, which means it's roughly 300 GiB in size. I've not been able to strip out the troublesome section (using objcopy or similar), and the boxes I have access to don't have enough disk space.

I'll look into performing the modifications in place, since I have a copy of the core on another system.

@aclements
Member

Thanks.

Meanwhile, I have GNU gdb (GDB) 7.6.2 (Debian 7.6.2-1) and GNU gdb (Debian 7.7.1+dfsg-5) 7.7.1. The first box is the one I used for previous output. If the scripts work with this its the least hassle. If all else fails I can move to a box which has the newer version.

I tested with GDB 7.9, but I think the scripts should work with earlier GDBs.

@lmb
Contributor

lmb commented Jan 10, 2017

Managed to get the output: debug.txt. cleanwithpcsp.py is missing an import struct I think.

@lmb
Contributor

lmb commented Jan 13, 2017

@aclements is there anything else you would like me to look into?

@aclements
Member

Managed to get the output: debug.txt.

Thanks for going to the trouble of getting this. Unfortunately, it looks like the state I needed was already gone even at the throw call.

cleanwithpcsp.py is missing an import struct I think.

Oops, fixed. Thanks. (GDB doesn't respect module boundaries between Python scripts, so I must have had something else loaded that imported struct.)

@aclements is there anything else you would like me to look into?

I think we'll just have to wait for this to happen in the wild with the added diagnostics in Go 1.8.

@aclements
Member

Ah! I have a theory.

The condition for calling gcMarkDone (from either the background worker or an assist) is atomic.Xadd(&work.nwait, +1) == work.nproc && !gcMarkWorkAvailable(nil). Among other conditions, gcMarkWorkAvailable checks that work.markrootNext < work.markrootJobs, but markrootNext is incremented when a worker starts a root job, not when it's done, so this condition is satisfied as soon as we're working on the final root jobs, even if we're not done with them. This would explain why, in the dump of all of the Gs from #16083 (comment), they were all marked as scanned: when gcMarkRootCheck failed, we were still working on scanning the final stacks, and by the time throw was able to freeze the world, we had finished those root jobs.

All of this requires a logical race on the termination condition that goes something like this:

  1. We're in the mark 1 sub-phase, there is one worker and, say, one remaining root job to scan some stack (quite plausible since stacks happen to be the last root jobs).
  2. Worker 1 finishes its work and evaluates the first half of the termination condition, atomic.Xadd(&work.nwait, +1) == work.nproc, which is true.
  3. Worker 2 starts and acquires the last stack scan root job. This increments work.markrootNext so it is now equal to work.markrootJobs. It starts working on this scan job.
  4. Worker 1 evaluates the second half of the termination condition, !gcMarkWorkAvailable(nil). Suppose the global queue is empty (also plausible, particularly with a small heap in the mark 1 sub-phase). And we know that work.markrootNext == work.markrootJobs, so the termination condition passes and worker 1 invokes gcMarkDone to finish the mark 1 sub-phase.
  5. This is the part I'm not sure about: gcMarkDone rechecks work.nwait == work.nproc, which should be false at this point, so gcMarkDone should return without doing anything.
  6. gcMarkDone invokes gcMarkRootCheck, which sees the stack that is being scanned, but hasn't been fully scanned yet, and panics.
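
A toy model of that suspected interleaving (this is not the runtime's code; the names only mirror work.markrootNext and work.markrootJobs): the "no mark work available" check sees that the last root job has been claimed, not that it has finished, so the root check can run while a stack scan is still in flight:

    package main

    import (
        "fmt"
        "sync/atomic"
        "time"
    )

    func main() {
        const markrootJobs = 1 // one remaining root job: scan some goroutine's stack
        var markrootNext int64 // advanced when a job is *claimed*
        var finished int64     // advanced when a job is actually *done*

        // Worker 2: claims the final root job and starts scanning the stack.
        go func() {
            atomic.AddInt64(&markrootNext, 1) // claim
            time.Sleep(50 * time.Millisecond) // ...still scanning...
            atomic.AddInt64(&finished, 1)     // only now is the g really scanned
        }()

        time.Sleep(10 * time.Millisecond)

        // Worker 1: its termination check only looks at claimed jobs, so it can
        // conclude that mark 1 is complete and run the root check too early.
        noWorkAvailable := atomic.LoadInt64(&markrootNext) >= markrootJobs
        if noWorkAvailable && atomic.LoadInt64(&finished) < markrootJobs {
            fmt.Println("termination condition passed while a stack scan is still in flight")
        }
    }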

@gopherbot

CL https://golang.org/cl/35353 mentions this issue.

gopherbot pushed a commit that referenced this issue Jan 18, 2017
Currently we check that all roots are marked as soon as gcMarkDone
decides to transition from mark 1 to mark 2. However, issue #16083
indicates that there may be a race where we try to complete mark 1
while a worker is still scanning a stack, causing the root mark check
to fail.

We don't yet understand this race, but as a simple mitigation, move
the root check to after gcMarkDone performs a ragged barrier, which
will force any remaining workers to finish their current job.

Updates #16083. This may "fix" it, but it would be better to
understand and fix the underlying race.

Change-Id: I1af9ce67bd87ade7bc2a067295d79c28cd11abd2
Reviewed-on: https://go-review.googlesource.com/35353
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
@aclements
Member

I still haven't been able to work out an exact race that leads to this, but if it's at least along the lines of what I'm thinking, the commit I just pushed should mitigate the issue. And if it doesn't, we'll at least get better diagnostics out of Go 1.8.

Moving to Go 1.9 so we can continue to keep track of this. I'd still like to find and fix this real race.

@aclements aclements modified the milestones: Go1.9, Go1.8Maybe Jan 18, 2017
@gopherbot

CL https://golang.org/cl/35678 mentions this issue.

@gopherbot

CL https://golang.org/cl/35677 mentions this issue.

gopherbot pushed a commit that referenced this issue Jan 25, 2017
…a g"

Updates #18700 (backport)

Currently there are no diagnostics for mark root check during marking.
Fix this by printing out the same diagnostics we print during mark
termination.

Also, drop the allglock before throwing. Holding that across a throw
causes a self-deadlock with tracebackothers.

For #16083.

Change-Id: Ib605f3ae0c17e70704b31d8378274cfaa2307dc2
Reviewed-on: https://go-review.googlesource.com/35677
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
gopherbot pushed a commit that referenced this issue Jan 25, 2017
…k roots

Fixes #18700 (backport)

Currently we check that all roots are marked as soon as gcMarkDone
decides to transition from mark 1 to mark 2. However, issue #16083
indicates that there may be a race where we try to complete mark 1
while a worker is still scanning a stack, causing the root mark check
to fail.

We don't yet understand this race, but as a simple mitigation, move
the root check to after gcMarkDone performs a ragged barrier, which
will force any remaining workers to finish their current job.

Updates #16083. This may "fix" it, but it would be better to
understand and fix the underlying race.

Change-Id: I1af9ce67bd87ade7bc2a067295d79c28cd11abd2
Reviewed-on: https://go-review.googlesource.com/35678
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@aclements
Member

This last happened on November 5th, so I believe this is fixed.

@sigxcpu76

kubernetes' kubelet v1.7.8 crashes with "fatal error: scan missed a g".

    Nov 24 04:48:48 minion-10-04 kubelet[2009]: gp 0xc421c74b60 goid 6922978 status 4100 gcscandone false gcscanvalid false
    Nov 24 04:48:48 minion-10-04 kubelet[2009]: fatal error: scan missed a g
    Nov 24 04:48:48 minion-10-04 kubelet[2009]: goroutine 6 [running]:
    Nov 24 04:48:48 minion-10-04 kubelet[2009]: runtime.throw(0x3a4c23b, 0xf)
    Nov 24 04:48:48 minion-10-04 kubelet[2009]:         /usr/local/go/src/runtime/panic.go:596 +0x95 fp=0xc4200886e0 sp=0xc4200886c0
    Nov 24 04:48:48 minion-10-04 kubelet[2009]: runtime.gcMarkRootCheck()
    Nov 24 04:48:48 minion-10-04 kubelet[2009]:         /usr/local/go/src/runtime/mgcmark.go:162 +0x22e fp=0xc420088730 sp=0xc4200886e0
    Nov 24 04:48:48 minion-10-04 kubelet[2009]: runtime.gcMarkDone()
    Nov 24 04:48:48 minion-10-04 kubelet[2009]:         /usr/local/go/src/runtime/mgc.go:1164 +0xee fp=0xc420088750 sp=0xc420088730
    Nov 24 04:48:48 minion-10-04 kubelet[2009]: runtime.gcBgMarkWorker(0xc42001e000)
    Nov 24 04:48:48 minion-10-04 kubelet[2009]:         /usr/local/go/src/runtime/mgc.go:1581 +0x303 fp=0xc4200887d8 sp=0xc420088750
    Nov 24 04:48:48 minion-10-04 kubelet[2009]: runtime.goexit()
    Nov 24 04:48:48 minion-10-04 kubelet[2009]:         /usr/local/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc4200887e0 sp=0xc4200887d8
    Nov 24 04:48:48 minion-10-04 kubelet[2009]: created by runtime.gcBgMarkStartWorkers
    Nov 24 04:48:48 minion-10-04 kubelet[2009]:         /usr/local/go/src/runtime/mgc.go:1412 +0x98
    Nov 24 04:48:48 minion-10-04 kubelet[2009]: goroutine 1 [chan receive, 11654 minutes]:
    Nov 24 04:48:48 minion-10-04 kubelet[2009]: k8s.io/kubernetes/cmd/kubelet/app.run(0xc420b4a000, 0xc4205b46e0, 0xc400000000, 0xc420b2d920)
    Nov 24 04:48:48 minion-10-04 kubelet[2009]:         /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubelet/app/server.go:635 +0x4cc
    Nov 24 04:48:48 minion-10-04 kubelet[2009]: k8s.io/kubernetes/cmd/kubelet/app.Run(0xc420b4a000, 0x0, 0x3a9db4a, 0x22)
    Nov 24 04:48:48 minion-10-04 kubelet[2009]:         /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubelet/app/server.go:294 +0x39
    Nov 24 04:48:48 minion-10-04 kubelet[2009]: main.main()
    Nov 24 04:48:48 minion-10-04 kubelet[2009]:         /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubelet/kubelet.go:55 +0x9c
    Nov 24 04:48:48 minion-10-04 kubelet[2009]: goroutine 17 [syscall, 11654 minutes, locked to thread]:
    Nov 24 04:48:48 minion-10-04 kubelet[2009]: runtime.goexit()
    Nov 24 04:48:48 minion-10-04 kubelet[2009]:         /usr/local/go/src/runtime/asm_amd64.s:2197 +0x1
    Nov 24 04:48:48 minion-10-04 kubelet[2009]: goroutine 5 [chan receive]:
    Nov 24 04:48:48 minion-10-04 kubelet[2009]: k8s.io/kubernetes/vendor/github.com/golang/glog.(*loggingT).flushDaemon(0x58dd360)
    Nov 24 04:48:48 minion-10-04 kubelet[2009]:         /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/golang/glog/glog.go:879 +0x7a
    Nov 24 04:48:48 minion-10-04 kubelet[2009]: created by k8s.io/kubernetes/vendor/github.com/golang/glog.init.1
    Nov 24 04:48:48 minion-10-04 kubelet[2009]:         /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/golang/glog/glog.go:410 +0x21d
    Nov 24 04:48:48 minion-10-04 kubelet[2009]: goroutine 40 [select, 11654 minutes]:
    Nov 24 04:48:48 minion-10-04 kubelet[2009]: k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch.func1(0xc420d48000, 0xc420018240, 0xc420efc300, 0xc420d01320, 0xc420efc2a0, 0xc420d48008)
    Nov 24 04:48:48 minion-10-04 kubelet[2009]:         /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/reflector.go:268 +0x30c
    Nov 24 04:48:48 minion-10-04 kubelet[2009]: created by k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch
    Nov 24 04:48:48 minion-10-04 kubelet[2009]:         /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/reflector.go:285 +0x864
    Nov 24 04:48:48 minion-10-04 kubelet[2009]: goroutine 130 [syscall, 11654 minutes]:
    Nov 24 04:48:48 minion-10-04 kubelet[2009]: os/signal.signal_recv(0x434a58)
    Nov 24 04:48:48 minion-10-04 kubelet[2009]:         /usr/local/go/src/runtime/sigqueue.go:116 +0x104
    Nov 24 04:48:48 minion-10-04 kubelet[2009]: os/signal.loop()
    Nov 24 04:48:48 minion-10-04 kubelet[2009]:         /usr/local/go/src/os/signal/signal_unix.go:22 +0x22
    Nov 24 04:48:48 minion-10-04 kubelet[2009]: created by os/signal.init.1
    Nov 24 04:48:48 minion-10-04 kubelet[2009]:         /usr/local/go/src/os/signal/signal_unix.go:28 +0x41

@ianlancetaylor
Contributor

@sigxcpu76 This issue is closed and we believe it is fixed. The "scan missed a g" error can occur for many different reasons. Please open a new issue with instructions for how to reproduce the problem you are seeing. Thanks.

@golang golang locked and limited conversation to collaborators Nov 24, 2018