runtime: rare SIGBUS in runtime.handoff #16705

Closed
colincross opened this issue Aug 15, 2016 · 9 comments
Comments

@colincross
Contributor

  1. What version of Go are you using (go version)?
    go1.7rc1
  2. What operating system and processor architecture are you using (go env)?
    linux amd64
  3. What did you do?
    Android updated its go prebuilts on July 8th to go1.7rc1. Since then we have seen two SIGBUS panics in runtime.handoff on our build servers. One was when running the go compiler:
fatal error: unexpected signal during runtime execution
[signal SIGBUS: bus error code=0x2 addr=0x2ad3c661c010 pc=0x41e5e4]

goroutine 31 [running]:
runtime.throw(0x908476, 0x2a)
    prebuilts/go/linux-x86/src/runtime/panic.go:566 +0x95 fp=0xc420452e60 sp=0xc420452e40
runtime.sigpanic()
    prebuilts/go/linux-x86/src/runtime/sigpanic_unix.go:12 +0x2cc fp=0xc420452eb8 sp=0xc420452e60
runtime.handoff(0x2ad3c644c000, 0xc41ffd1729)
    prebuilts/go/linux-x86/src/runtime/mgcwork.go:422 +0x54 fp=0xc420452ef0 sp=0xc420452eb8
runtime.(*gcWork).balance(0xc420041c28)
    prebuilts/go/linux-x86/src/runtime/mgcwork.go:263 +0x54 fp=0xc420452f10 sp=0xc420452ef0
runtime.gcDrain(0xc420041c28, 0x5)
    prebuilts/go/linux-x86/src/runtime/mgcmark.go:972 +0x1ac fp=0xc420452f48 sp=0xc420452f10
runtime.gcBgMarkWorker(0xc420040a00)
    prebuilts/go/linux-x86/src/runtime/mgc.go:1453 +0x1d1 fp=0xc420452fb8 sp=0xc420452f48
runtime.goexit()
    prebuilts/go/linux-x86/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc420452fc0 sp=0xc420452fb8
created by runtime.gcBgMarkStartWorkers
    prebuilts/go/linux-x86/src/runtime/mgc.go:1342 +0x98

The other was when running our ninja manifest generator:

fatal error: unexpected signal during runtime execution
[signal SIGBUS: bus error code=0x2 addr=0x2b12d822b010 pc=0x41ec04]

runtime stack:
runtime.throw(0x72eadc, 0x2a)
    prebuilts/go/linux-x86/src/runtime/panic.go:566 +0x95
runtime.sigpanic()
    prebuilts/go/linux-x86/src/runtime/sigpanic_unix.go:12 +0x2cc
runtime.handoff(0x2b12d819b000, 0xc41fff9ed5)
    prebuilts/go/linux-x86/src/runtime/mgcwork.go:422 +0x54
runtime.(*gcWork).balance(0xc420036728)
    prebuilts/go/linux-x86/src/runtime/mgcwork.go:263 +0x54
runtime.gcDrainN(0xc420036728, 0x1bd877, 0x2b12ffffffed)
    prebuilts/go/linux-x86/src/runtime/mgcmark.go:1037 +0x195
runtime.gcAssistAlloc.func1()
    prebuilts/go/linux-x86/src/runtime/mgcmark.go:457 +0xe5
runtime.systemstack(0xc420028000)
    prebuilts/go/linux-x86/src/runtime/asm_amd64.s:298 +0x79
runtime.mstart()
    prebuilts/go/linux-x86/src/runtime/proc.go:1078

Full panics attached:
panic1.txt
panic2.txt

@bradfitz bradfitz changed the title Rare SIGBUS in runtime.handoff runtime: rare SIGBUS in runtime.handoff Aug 15, 2016
@quentinmit quentinmit added this to the Go1.7Maybe milestone Aug 15, 2016
@aclements
Member

I don't suppose you were lucky enough to get core files or any other debug info from these?

I'm rather perplexed by what could specifically cause a SIGBUS here. A SIGSEGV would mean we just followed a bad pointer, but a SIGBUS indicates that we somehow wound up in a truncated file mapping. Furthermore, the pointer has the proper 4K alignment + 16 bytes in both tracebacks, which is unlikely to happen by accident, further indicating that these were valid workbuf pointers. It almost seems like a file got mapped over our workbuf allocation, but AFAIK nothing in the standard library or commands ever mmaps a file.

@aclements
Member

More observations:

  1. In both cases, the faulting workbuf address is pretty similar to the good workbuf address passed to handoff (within 1.8MB and 576KB, respectively), further indicating that this isn't just a bogus pointer.
  2. getempty has two code paths. If getempty returned an existing buffer from the work.empty list, it would also have read b1.nobj and should have crashed earlier, so we presumably didn't take this path. Hence, we must have just persistentalloc'd the buffer. This also means the funny lock-free list code isn't involved. Also, the distance between the buffers means they came from different persistentalloc chunks.
  3. I was mistaken about the 4K alignment. workbufs are actually only 64 byte aligned (+ 16 bytes for the field offset). The 4K alignment could indicate that persistentalloc just mmap'ed this memory, but if we're persistentalloc'ing a bunch of workbufs and nothing else, I would expect half of them to be 4K aligned anyway.

@colincross
Contributor Author

Sorry, no core dumps.

@aclements
Member

@colincross, too bad. Do you know what kernel version these were running on (approximately if you don't know exactly)?

@colincross
Contributor Author

Kernel version for panic1.txt was 3.13.0-85-generic #129-Ubuntu SMP.

@colincross
Contributor Author

colincross commented Aug 16, 2016

Just got a 3rd one:

[signal SIGBUS: bus error code=0x2 addr=0x2b70f621f010 pc=0x41ec04]

runtime stack:
runtime.throw(0x742d09, 0x2a)
    prebuilts/go/linux-x86/src/runtime/panic.go:566 +0x95
runtime.sigpanic()
    prebuilts/go/linux-x86/src/runtime/sigpanic_unix.go:12 +0x2cc
runtime.handoff(0x2b70f5c69a00, 0xc41ffe3aee)
    prebuilts/go/linux-x86/src/runtime/mgcwork.go:422 +0x54
runtime.(*gcWork).balance(0xc42003fc28)
    prebuilts/go/linux-x86/src/runtime/mgcwork.go:263 +0x54
runtime.gcDrainN(0xc42003fc28, 0x1488cc, 0xc4ffffffea)
    prebuilts/go/linux-x86/src/runtime/mgcmark.go:1037 +0x195
runtime.gcAssistAlloc.func1()
    prebuilts/go/linux-x86/src/runtime/mgcmark.go:457 +0xe5
runtime.systemstack(0xc420030000)
    prebuilts/go/linux-x86/src/runtime/asm_amd64.s:298 +0x79
runtime.mstart()
    prebuilts/go/linux-x86/src/runtime/proc.go:1078

panic3.txt

@colincross
Contributor Author

Where would the 0x2xxxxxxxxxxx addresses come from in persistentalloc? Code inspection and strace locally suggest that persistentalloc addresses should be more like 0xc4xxxxxxxx.

@colincross
Contributor Author

The most recent panic was associated with a kernel BUG message, so likely not a go issue. I'll close this unless we find some evidence that go is tickling a kernel bug.

[10181793.153768] BUG: Bad page map in process soong_build pte:f000dee8f000dee8 pmd:00000000
[10181793.163021] addr:00002b70f621f010 vm_flags:08100073 anon_vma:ffff881fdd105ab0 mapping: (null) index:2b70f621f
[10181793.175267] CPU: 21 PID: 40830 Comm: soong_build Not tainted 3.13.0-85-generic #129-Ubuntu
[10181793.175268] Hardware name: Dell Inc. PowerEdge R620/0VV3F2, BIOS 2.0.19 08/29/2013
[10181793.175270] 0000000000000000 ffff881052045d40 ffffffff8172b6a7 00002b70f621f010
[10181793.175273] ffff880167eb0fc0 ffff881052045d88 ffffffff81179493 f000dee8f000dee8
[10181793.175276] 00000002b70f621f 00002b70f621f010 ffff88104cef0d88 ffff880167eb0fc0
[10181793.175279] Call Trace:
[10181793.175287] [] dump_stack+0x64/0x82
[10181793.175292] [] print_bad_pte+0x1a3/0x250
[10181793.175294] [] handle_mm_fault+0xec3/0xf10
[10181793.175299] [] ? call_rwsem_down_read_failed+0x14/0x30
[10181793.175304] [] __do_page_fault+0x184/0x570
[10181793.175306] [] ? call_rwsem_wake+0x17/0x30
[10181793.175310] [] ? up_write+0x1d/0x20
[10181793.175314] [] ? vm_mmap_pgoff+0x99/0xc0
[10181793.175316] [] do_page_fault+0x1a/0x70
[10181793.175318] [] page_fault+0x28/0x30
[10181793.175321] Disabling lock debugging due to kernel taint

@aclements
Member

> Where would the 0x2xxxxxxxxxxx addresses come from in persistentalloc? Code inspection and strace locally suggest that persistentalloc addresses should be more like 0xc4xxxxxxxx.

0xc4... is where the Go heap lives, but persistentalloc uses whatever the kernel returns for a non-MAP_FIXED mapping.

> The most recent panic was associated with a kernel BUG message, so likely not a go issue.

Huh. I was spiraling in on this being a kernel bug, but that's not an accusation I wanted to make lightly. :) Glad you found the BUG in the kernel log!

I found some other reports of page table corruption with kernels around 3.13, such as https://lkml.org/lkml/2014/4/8/173, so I suspect this has been fixed in the kernel.

If this turns out to be a real problem, I suspect we can track down the exact path through the kernel that caused this using the print_bad_pte output from your kernel log.

@golang golang locked and limited conversation to collaborators Aug 17, 2017