Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: segfault with large heap #4981

Closed
alberts opened this issue Mar 5, 2013 · 17 comments
Closed

runtime: segfault with large heap #4981

alberts opened this issue Mar 5, 2013 · 17 comments
Milestone

Comments

@alberts
Copy link
Contributor

alberts commented Mar 5, 2013

What steps will reproduce the problem?

https://groups.google.com/d/topic/golang-dev/sScWVM31lm0/discussion

ready to do further debugging with some guidance

What do you see instead?

a crash

Which compiler are you using (5g, 6g, 8g, gccgo)?

6g

Which operating system are you using?

linux

Which version are you using?  (run 'go version')

tip

Please provide any additional information below.

there are runs that don't crash. in those cases, the heap never seems to go much above
400 MB.

in the cases where it crashes, the heap seems to quickly ramp up from about 400 MB to 14
GB. the underlying text being processed by the process that crashes shouldn't be
dramatically different, so this is surprising.

the crash definitely seems heap size related.

gc35(8): 3+0+0 ms, 14911 -> 14911 MB 9510 -> 9510 (68456-58946) objects, 110(5695)
handoff, 25(592) steal, 925/483/94 yields
unexpected fault address 0xc627430000

gc31(8): 2+0+0 ms, 14796 -> 14796 MB 9763 -> 9763 (68394-58631) objects, 117(6009)
handoff, 34(678) steal, 946/417/88 yields
unexpected fault address 0xc624db0000

gc32(8): 2+0+0 ms, 14815 -> 14815 MB 9867 -> 9867 (68315-58448) objects, 96(5065)
handoff, 32(699) steal, 876/432/86 yields
unexpected fault address 0xc6176b0000

gc31(8): 3+0+0 ms, 14917 -> 14917 MB 9997 -> 9997 (68312-58315) objects, 120(5185)
handoff, 34(626) steal, 863/516/121 yields
unexpected fault address 0xc6198f0000

gc33(8): 2+0+0 ms, 14643 -> 14643 MB 10156 -> 10156 (68451-58295) objects,
115(5795) handoff, 42(653) steal, 968/385/84 yields
unexpected fault address 0xc622790000

gc32(8): 2+0+0 ms, 14760 -> 14760 MB 9964 -> 9964 (68466-58502) objects, 78(3768)
handoff, 45(789) steal, 662/378/90 yields
unexpected fault address 0xc62b0f0000
@dvyukov
Copy link
Member

dvyukov commented Mar 5, 2013

Comment 2:

Can you also post stack traces? Are there any similarities?
Here is the one from your email:
fatal error: fault
[signal 0xb code=0x1 addr=0xc614c20000 pc=0x421f2f]
goroutine 353 [running]:
[fp=0x7fb6a6acdd80] runtime.throw(0xf221b2)
    /build/go/go/src/pkg/runtime/panic.c:465 +0x67
[fp=0x7fb6a6acdd98] runtime.sigpanic()
    /build/go/go/src/pkg/runtime/thread_linux.c:217 +0xe7
[fp=0x7fb6a6acdda0] runtime.memmove()
    /build/go/go/src/pkg/runtime/memmove_amd64.s:46 +0x1f
[fp=0x7fb6a6acddd8] runtime.copy()
    /build/go/go/src/pkg/runtime/slice.c:259 +0x13d
[fp=0x7fb6a6acdea8] bytes.(*Buffer).ReadFrom(0xc2004df850, 0xc2001813c0, 0xc200000280,
0xfffffc36, 0x0, ...)
    /build/go/go/src/pkg/bytes/buffer.go:162 +0x157
[fp=0x7fb6a6acdf30] io.Copy(0xc200448e10, 0xc2004df850, 0xc2001813c0, 0xc200000280, 0x0,
...)
    /build/go/go/src/pkg/io/io.go:328 +0xa3
[fp=0x7fb6a6acdf80] os/exec.func·003(0x8c1a68, 0xc20046a2a0)
    /build/go/go/src/pkg/os/exec/exec.go:207 +0x60
[fp=0x7fb6a6acdfb0] os/exec.func·004(0xc200384160)
    /build/go/go/src/pkg/os/exec/exec.go:274 +0x2a
[fp=0x7fb6a6acdfb8] runtime.goexit()
    /build/go/go/src/pkg/runtime/proc.c:1149
created by os/exec.(*Cmd).Start
    /build/go/go/src/pkg/os/exec/exec.go:275 +0x6ac

@alberts
Copy link
Contributor Author

alberts commented Mar 5, 2013

Comment 3:

So far it's always been the same stack trace.
The buffer is growing because the process I'm running inside the one that crashes is
writing a lot of logs to stdout or stderr.

@dvyukov
Copy link
Member

dvyukov commented Mar 5, 2013

Comment 4:

I have 3 blind hypotheses:
1. 4GB integer overflow somewhere (GC, slices, bufio).
2. Memory corruption in GC.
3. Memory corruption due to a data race.
The easiest thing for now would be to bisect by Go revision.

@alberts
Copy link
Contributor Author

alberts commented Mar 5, 2013

Comment 5:

Based on logs, I think it's something after 3e3f5a31adaf.
There was one successful build of 231af8ac63aa, but then the machine was off for the
weekend, so I don't know if how stable it is. Going to start there.
suspect stuff if that checks out:
278ff13eca7a
31e379ac8279
here we go...

@alberts
Copy link
Contributor Author

alberts commented Mar 5, 2013

Comment 6:

231af8ac63aa looks suspect. testing 1b4b6a694b1c now.

@alberts
Copy link
Contributor Author

alberts commented Mar 5, 2013

Comment 7:

1b4b6a694b1c is fine.
this means one of two things:
1. 231af8ac63aa changes the behaviour of the inside program significantly, causing it
log a lot than it ever has before, exposing an old bug.
2. 231af8ac63aa changed the outside program.
I'll do some more tests to try to eliminate the first option.
it seem significant that every single time it crashed, it was preceded by a GC that
didn't manage to collect *anything*
here's two more:
gc32(8): 3+0+1123 ms, 15069 -> 15069 MB 10143 -> 10143 (68405-58262) objects, 49(2387)
handoff, 45(662) steal, 595/374/105 yields
scvg32: GC forced
scvg32: inuse: 15070, idle: 1631, sys: 16702, released: 597, consumed: 16104 (MB)
unexpected fault address 0xc613e90000
scvg30: inuse: 6402, idle: 2187, sys: 8590, released: 203, consumed: 8386 (MB)
gc35(8): 3+0+0 ms, 14593 -> 14593 MB 9640 -> 9640 (68401-58761) objects, 108(5489)
handoff, 39(684) steal, 808/379/91 yields
unexpected fault address 0xc618e70000
I've tried to build a program to simulate the use of os/exec to run a bunch of programs
that produce a lot of output that is buffered, but I haven't been able to trigger this
"unable to GC anything whatsoever" condition yet.

@dvyukov
Copy link
Member

dvyukov commented Mar 5, 2013

Comment 8:

The GC collects ~1000 objects, but they seem to be small. I think the
GC happens after allocation of a huge object (few GBs), soon after the
previous GC.
Looking at these records:
gc37(8): 3+0+0 ms, 6523 -> 6523 MB 9490 -> 9489 (68285-58796) objects,
65(2826) handoff, 38(541) steal, 667/411/122 yields
gc38(8): 2+0+0 ms, 14715 -> 14715 MB 9490 -> 9490 (68287-58797)
objects, 100(5268) handoff, 39(743) steal, 809/426/86 yields
it seems that the GC happens after allocation of 8192MB object. Can it
be exactly the bytes.Buffer that we've seen in the stack trace?

@alberts
Copy link
Contributor Author

alberts commented Mar 6, 2013

Comment 9:

I have caught the crash inside gdb:
#0  runtime.memmove (to=void, fr=void, n=void) at
/build/next/go/src/pkg/runtime/memmove_amd64.s:46
#1  0x000000000041b5a2 in runtime.copy (to=void, fm=void, width=void, ret=void) at
/build/next/go/src/pkg/runtime/slice.c:230
#2  0x0000000000535e9c in bytes.(*Buffer).ReadFrom (fp=void, b=0xc20051bbd0, r=...,
n=4294966338, err=0) at /build/next/go/src/pkg/bytes/buffer.go:162
#3  0x00000000004778a4 in io.Copy (fp=void, dst=..., src=..., written=0, err=0) at
/build/next/go/src/pkg/io/io.go:328
#4  0x00000000005068a4 in os/exec.func·003 (fp=void, ~anon0="<bad dynamic type>")
at /build/next/go/src/pkg/os/exec/exec.go:207
#5  0x0000000000506928 in os/exec.func·004 (fp=void, fn={void (error *)}
0x7ff2040b2fb0) at /build/next/go/src/pkg/os/exec/exec.go:274
#6  0x0000000000416760 in ?? () at /build/next/go/src/pkg/runtime/proc.c:1126
#7  0x000000c2006e8e40 in ?? ()
#8  0x0000000000000000 in ?? ()
void stuff looks a bit suspect.
n=4294966338 looks exciting...

@alberts
Copy link
Contributor Author

alberts commented Mar 6, 2013

Comment 10:

with GOGC=off at atom's suggestion
<some output snipped>
scvg22: GC forced
scvg22: inuse: 1390, idle: 0, sys: 1390, released: 0, consumed: 1390 (MB)
scvg23: GC forced
scvg23: inuse: 2926, idle: 0, sys: 2926, released: 0, consumed: 2926 (MB)
scvg24: GC forced
scvg24: inuse: 4974, idle: 0, sys: 4974, released: 0, consumed: 4974 (MB)
scvg25: GC forced
scvg25: inuse: 4974, idle: 0, sys: 4974, released: 0, consumed: 4974 (MB)
scvg26: GC forced
scvg26: inuse: 4974, idle: 0, sys: 4974, released: 0, consumed: 4974 (MB)
scvg27: GC forced
scvg27: inuse: 9070, idle: 0, sys: 9070, released: 0, consumed: 9070 (MB)
scvg28: GC forced
scvg28: inuse: 9070, idle: 0, sys: 9070, released: 0, consumed: 9070 (MB)
scvg29: GC forced
scvg29: inuse: 9070, idle: 0, sys: 9070, released: 0, consumed: 9070 (MB)
scvg30: GC forced
scvg30: inuse: 9070, idle: 0, sys: 9070, released: 0, consumed: 9070 (MB)
scvg31: GC forced
scvg31: inuse: 9070, idle: 0, sys: 9070, released: 0, consumed: 9070 (MB)
scvg32: GC forced
scvg32: 0x0 MB released
scvg32: inuse: 9070, idle: 0, sys: 9070, released: 0, consumed: 9070 (MB)
unexpected fault address 0xc52efd0000
fatal error: fault
[signal 0xb code=0x1 addr=0xc52efd0000 pc=0x42349f]
goroutine 1 [running]:
[fp=0x7f366a568ae8] runtime.throw(0x11f326a)
    /build/next/go/src/pkg/runtime/panic.c:465 +0x67
[fp=0x7f366a568b00] runtime.sigpanic()
    /build/next/go/src/pkg/runtime/thread_linux.c:217 +0xe7
[fp=0x7f366a568b08] runtime.memmove()
    /build/next/go/src/pkg/runtime/memmove_amd64.s:46 +0x1f
[fp=0x7f366a568b40] runtime.slicebytetostring(0xc336e9b000, 0xf810c6a9, 0xfffffe00,
0xc436e9b000, 0xf810c6a9, ...)
    /build/next/go/src/pkg/runtime/zstring_linux_amd64.c:300 +0x96
[fp=0x7f366a568ba0] foo/testing/runner.RunCommandCombined(0xc200aa45e0, 0x3f00000001,
0xc236dc4bd0, 0x200076f6db, 0x408470, ...)
this function does:
out, err := command.Exec.CombinedOutput()
return string(out), err
Exec is a *os/exec.Cmd
[fp=0x7f366a568c88] foo/testing/runner.RunCommandTestCase(0xc200aa45e0, 0xc236e90a20,
0x0, 0x0, 0x0, ...)
[fp=0x7f366a568d38] foo/testing/runner.RunCommandTestSuite(0xc200aa45e0, 0xcb3eb0, 0xa,
0xc236e6bab0, 0x1, ...)
[fp=0x7f366a568e08] foo/testing/unit.(*GotestRunner).Run(0xc200371630, 0xc200aa45e0,
0x1, 0x0, 0x0, ...)
[fp=0x7f366a568ec8] foo/testing/runner.Run(0xc200357c60, 0xc200371630, 0xc2003ab5a0,
0x1a, 0xc85af0, ...)
[fp=0x7f366a568f90] main.main()
    /build/foo/cmd/testing/footest/footest.go:45 +0x3aa
[fp=0x7f366a568fb8] runtime.main()
    /build/go/src/pkg/runtime/proc.c:180 +0xab
[fp=0x7f366a568fc0] runtime.goexit()
    /build/go/src/pkg/runtime/proc.c:1131

@remyoudompheng
Copy link
Contributor

Comment 11:

Can you try a small program with string(s) ans copy(t, s) with >4GB data?

@alberts
Copy link
Contributor Author

alberts commented Mar 6, 2013

Comment 12:

I think I got it.
http://play.golang.org/p/kB0-5J8c_9
$ go build bork.go  && GOGC=off GOGCTRACE=1 /usr/bin/time -v ./bork
4883073352
5302471482
5893650854
5076787217
415263038
854870012
89900999
6255975042
unexpected fault address 0xd45cd40000
fatal error: fault
[signal 0xb code=0x1 addr=0xd45cd40000 pc=0x4181bf]
goroutine 3 [running]:
[fp=0x7f2d9bae3ee8] runtime.throw(0x448222)
/build/go/go/src/pkg/runtime/panic.c:465 +0x67
[fp=0x7f2d9bae3f00] runtime.sigpanic()
/build/go/go/src/pkg/runtime/thread_linux.c:217 +0xe7
[fp=0x7f2d9bae3f08] runtime.memmove()
/build/go/go/src/pkg/runtime/memmove_amd64.s:46 +0x1f
[fp=0x7f2d9bae3f40] runtime.copy()
/build/go/go/src/pkg/runtime/slice.c:259 +0x13d
[fp=0x7f2d9bae3fb0] main.foo()
/home/alberts/bork.go:9 +0xb8
[fp=0x7f2d9bae3fb8] main.func·001()
/home/alberts/bork.go:22 +0x18
[fp=0x7f2d9bae3fc0] runtime.goexit()
/build/go/go/src/pkg/runtime/proc.c:1151
created by main.main
/home/alberts/bork.go:24 +0x5f
the process used about 80 GB of RAM before falling over.

@remyoudompheng
Copy link
Contributor

Comment 13:

Do you really need the goroutines to reproduce? I expect a single call to foo() to fail. 
Can you try http://play.golang.org/p/9yl3-f74pA ?
Can you s/MOVLQSX/MOVQ at src/pkg/runtime/memmove_amd64.s:30 and see if it fixes the
issue?
You might need to also change the memmove signature in runtime.h to use a uintptr as
argument.

@alberts
Copy link
Contributor Author

alberts commented Mar 6, 2013

Comment 14:

your one didn't do it, but here is one that does:
http://play.golang.org/p/Oq4dHz0qLX
$ go build k.go && GOGC=off GOGCTRACE=1 ./k
unexpected fault address 0xc2e15b0000
fatal error: fault
[signal 0xb code=0x1 addr=0xc2e15b0000 pc=0x417d7f]
goroutine 1 [running]:
[fp=0x7fe698e12ef0] runtime.throw(0x43dd2a)
    /build/go/go/src/pkg/runtime/panic.c:465 +0x67
[fp=0x7fe698e12f08] runtime.sigpanic()
    /build/go/go/src/pkg/runtime/thread_linux.c:217 +0xe7
[fp=0x7fe698e12f10] runtime.memmove()
    /build/go/go/src/pkg/runtime/memmove_amd64.s:46 +0x1f
[fp=0x7fe698e12f48] runtime.copy()
    /build/go/go/src/pkg/runtime/slice.c:259 +0x13d
[fp=0x7fe698e12f90] main.main()
    /home/alberts/k.go:7 +0x92
[fp=0x7fe698e12fb8] runtime.main()
    /build/go/go/src/pkg/runtime/proc.c:182 +0xa1
[fp=0x7fe698e12fc0] runtime.goexit()
    /build/go/go/src/pkg/runtime/proc.c:1151
your change seems to fix the problem. I had to edit runtime.h too.

@bradfitz
Copy link
Contributor

bradfitz commented Mar 6, 2013

Comment 15:

Labels changed: added go1.1, removed priority-triage, go1.1maybe.

Status changed to Accepted.

@dvyukov
Copy link
Member

dvyukov commented Mar 7, 2013

Comment 16:

The issue is that length is treated as int32, right?

@alberts
Copy link
Contributor Author

alberts commented Mar 7, 2013

Comment 17:

yep

@remyoudompheng
Copy link
Contributor

Comment 18:

This issue was closed by revision 1b8f51c.

Status changed to Fixed.

@rsc rsc added this to the Go1.1 milestone Apr 14, 2015
@rsc rsc removed the go1.1 label Apr 14, 2015
@golang golang locked and limited conversation to collaborators Jun 24, 2016
This issue was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

6 participants