
runtime: panic on plan9_arm builders #42303

Closed

millerresearch opened this issue Oct 30, 2020 · 21 comments
Labels
FrozenDueToAge, NeedsInvestigation, OS-Plan9
Milestone
Backlog

Comments

@millerresearch (Contributor)

The plan9_arm builders have been getting panics on every run since the afternoon of 27 Oct. The first one seems to be https://build.golang.org/log/6a299fffd128c3ed0bfdd1c471c2ca891dee8b34 after CL 232298 was merged.

The immediate cause of the panic varies. It could be memory corruption.

@dmitshur added the NeedsInvestigation and OS-Plan9 labels Oct 30, 2020
@dmitshur (Contributor)

This may be the same as or related to #42237.

@dmitshur added this to the Backlog milestone Oct 30, 2020
@dmitshur added the arch-arm label Oct 30, 2020
@millerresearch (Contributor, Author)

> This may be the same as or related to #42237.

I don't think so. Those examples are deliberate panics because of a timeout. The plan9_arm panics are unexpected signals triggered by something going weirdly wrong in the runtime.

@millerresearch (Contributor, Author)

In some cases, the immediate cause of the panic is that _g_.m.p is nil at the top of findrunnable(). Shouldn't that be impossible?

@ianlancetaylor (Contributor)

CC @aclements @mknyszek @prattmic

@millerresearch (Contributor, Author)

Another common panic cause is that _g_.m is nil after calling acquirep() in mstart1(). Also impossible?
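For context, a minimal sketch of the invariant these two reports describe, using toy types rather than the real ones in runtime/runtime2.go: every M (OS thread) reaching this point in the scheduler is expected to have a non-nil m with an attached P, so a nil _g_.m or _g_.m.p dereference faults exactly as in the traces above.

package main

import "fmt"

// Toy stand-ins for the runtime's g, m and p structs.
type p struct{ id int }
type m struct{ p *p } // the P this thread currently owns
type g struct{ m *m } // the M this goroutine is running on

// findrunnable models the invariant at the top of the real findrunnable:
// the current g must have an M, and that M must own a P.
func findrunnable(gp *g) {
	if gp.m == nil || gp.m.p == nil {
		panic("scheduler invariant broken: _g_.m or _g_.m.p is nil")
	}
	fmt.Println("looking for work on P", gp.m.p.id)
}

func main() {
	findrunnable(&g{m: &m{p: &p{id: 0}}}) // invariant holds: no panic
}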

@prattmic (Member) commented Nov 4, 2020

Thanks for the heads up. I agree that this looks different from the failures in #42237, though perhaps it is related. Hopefully this one is reproducible on gomote.

@prattmic (Member) commented Nov 4, 2020

Does this builder have stability issues (not sure who maintains it)? After waiting 4hr 40min (!!!!) for a gomote, I tried to run some tests:

$ gomote run user-mpratt-plan9-arm-0 ./go/src/all.rc  
Building Go cmd/dist using /usr/glenda/go
Building Go toolchain1 using /usr/glenda/go.
2020/11/04 16:02:28 user-mpratt-plan9-arm-0: timeout after 10s waiting for headers for /status
2020/11/04 16:02:50 Buildlet https://farmer.golang.org:443 failed three heartbeats; final error: 502 Bad Gateway
Error running run: Error trying to execute ./go/src/all.rc: Buildlet https://farmer.golang.org:443 failed heartbeat after 86.211812ms; marking dead; err=502 Bad Gateway

And it seems to still be gone:

$ gomote ping user-mpratt-plan9-arm-0
Error running ping: 502 Bad Gateway

I notice on the build dashboard that the builds that don't have these panics contain errors like "communication error to buildlet (promoted to terminal error): network error promoted to terminal error: runTests: dist test failed: all buildlets had network errors or timeouts, yet tests remain", which sounds awfully similar.

I'm not thrilled by the prospect of waiting another 5hr to see if the next gomote will share a similar fate.

cc @golang/release

@millerresearch (Contributor, Author)

> Does this builder have stability issues

Well, that's one way of putting it. As I said above, these builders have been crashing on every run since the 27 October CL I referred to. Sometimes they panic; sometimes the corruption is so bad that the Go runtime just hangs. There isn't a full-time attendant to keep restarting them manually.

I've been putting in some time trying to debug this, and the common factor seems to be that _g_.m is nil or garbage. I don't know enough about the internals of the scheduler to understand how this could happen.

@prattmic (Member) commented Nov 4, 2020

I see at least one crash that does immediately look like a bad g.m.

Overall, this looks like general memory corruption to me, but that it is limited to runtime internals is interesting. Perhaps bad TLS somehow?

@dmitshur (Contributor) commented Nov 4, 2020

> not sure who maintains it

Builder owners and additional notes are available at https://farmer.golang.org/builders.

@millerresearch (Contributor, Author) commented Nov 5, 2020

To corroborate the circumstantial evidence implicating CL 232298, I've tried patching the current head of the master branch with the reverse of commit 8fdc79e. (It needed a bit of hand editing because of later changes in runtime/proc.go.)

With the commit reversed, all.rc completes with ALL TESTS PASSED, and no memory reference traps.

I'll run it a few more times, but it does appear that something in that commit has introduced a bug or tickled an existing bug whose effect is specific to plan9_arm.

@millerresearch (Contributor, Author)

> it does appear that something in that commit has introduced a bug

I've isolated it to the new call to wakep in wakeNetPoller. If I remove that call, the panics stop.

The ongoing investigation into #42237 seems to show it's related after all, so I'll wait to see how that one is resolved.

@prattmic (Member) commented Nov 5, 2020

Does http://golang.org/cl/267257 (merged to tip now) fix the crashes? That change was also directly fixing the startm call made by wakep from wakeNetPoller.

@millerresearch (Contributor, Author) commented Nov 6, 2020

> Does http://golang.org/cl/267257 (merged to tip now) fix the crashes?

No, it still gets memory faults.

Since the Plan 9 runtime doesn't actually have a network poller, does the wakep serve any useful purpose there? Could we just make it conditional on GOOS != "plan9"?

If I understand correctly, there's no network poller for wasm/js either. Should the wakep be skipped there too, to avoid unnecessarily starting threads?
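For reference, a standalone sketch of the conditional being suggested. The structure loosely paraphrases the wakeNetPoller of that era (simplified: the branch that interrupts an already-sleeping poller via netpollBreak is omitted, and pollerSleeping/wakep here are stand-ins, not the real runtime functions):

package main

import "runtime"

// Sketch of the guard suggested above, simplified from the shape of
// wakeNetPoller in runtime/proc.go at the time.
func wakeNetPoller(when int64) {
	if !pollerSleeping() {
		// CL 232298 added this wakep so that some thread ends up
		// waiting for the timer expiring at 'when'. The suggestion
		// is to skip it on Plan 9, which has no network poller.
		if runtime.GOOS != "plan9" {
			wakep()
		}
	}
}

// Stand-ins so the sketch compiles; the real runtime tracks this state
// in sched.lastpoll and starts an M/P pair in wakep via startm.
func pollerSleeping() bool { return false }
func wakep()               {}

func main() { wakeNetPoller(0) }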

@prattmic (Member) commented Nov 6, 2020

The netpoll delay sleep is used to wait for timers, so we still need that wakep to ensure some thread is waiting on timers.

@millerresearch (Contributor, Author)

> The netpoll delay sleep is used to wait for timers

Is this documented somewhere? Or can you point me to where in the code this happens on systems with no netpoller?

@prattmic (Member) commented Nov 6, 2020

I'm not aware of great overview docs, but this was part of @ianlancetaylor's new timer implementation last year. You can see the changes in http://golang.org/cl/171821 and its relation chain.

Notably, Plan 9's netpoll sleeps as requested, despite not performing any actual polling. findrunnable determines when the next timer expires (pollUntil), and will later ask netpoll to wait that long.
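In outline, and assuming the Go 1.15-era shape of this code path, the flow is roughly the following standalone sketch: findrunnable computes the delay until the earliest timer and hands it to netpoll, which on Plan 9 (runtime/netpoll_stub.go) amounts to an OS-level timed sleep; time.Sleep stands in for the notetsleep/semasleep wait.

package main

import (
	"fmt"
	"time"
)

// nextTimerDelay models how findrunnable turns pollUntil (the absolute
// time the earliest timer fires) into a delay for netpoll.
func nextTimerDelay(now, pollUntil int64) int64 {
	if pollUntil == 0 {
		return -1 // no timers: block until explicitly woken
	}
	if d := pollUntil - now; d > 0 {
		return d // nanoseconds until the earliest timer
	}
	return 0 // a timer is already due; don't block
}

// netpollStub models the Plan 9 netpoll: no file descriptors to poll,
// just sleep for the requested delay.
func netpollStub(delay int64) {
	if delay > 0 {
		time.Sleep(time.Duration(delay))
	}
}

func main() {
	now := time.Now().UnixNano()
	pollUntil := now + (10 * time.Millisecond).Nanoseconds()
	netpollStub(nextTimerDelay(now, pollUntil))
	fmt.Println("woke in time to run the timer")
}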

@millerresearch (Contributor, Author)

Thanks, that's helpful. I actually fixed a bug in netpoll_stub earlier this year without being fully aware of how it's used. All I know is that, empirically, removing that wakep call makes the difference between memory faults (every test run) and ALL TESTS PASSED. If the wakep is "needed to ensure some thread is waiting", why don't I see deadlocks when it's removed?

@millerresearch (Contributor, Author)

> If the wakep is "needed to ensure some thread is waiting", why don't I see deadlocks when it's removed?

The stub netpoll calls notetsleep to do an OS-level timed wait. notetsleep is being run by the m (aka thread, aka Plan 9 process, yes?) associated with g0 (the program's main goroutine, yes?). The Plan 9 implementation of notetsleep is semasleep, which blocks on a timed semaphore in the OS. So isn't g0.m already the thread which is waiting on the timer?

Before calling netpoll, the code in findrunnable explicitly checks that there's no current p associated with the running thread, and after netpoll returns there's a call to pidleget to (try to) associate a p with the thread. So while the g0.m thread is blocked on the timed semaphore, what is the point of wakep looking for a spare p and m to hook together?

There must be some detail I'm not seeing, sorry.
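To make the sequence in this question concrete, a toy model of the findrunnable tail as described above (toy types again, not the real scheduler structures):

package main

import (
	"fmt"
	"time"
)

// Toy of the findrunnable tail described above: the thread has already
// dropped its P, blocks in the netpoll stub for the timer delay, then
// tries to reacquire an idle P via pidleget.
type p struct{ id int }

var idle = []*p{{id: 1}} // stand-in for the sched.pidle list

func pidleget() *p {
	if len(idle) == 0 {
		return nil
	}
	pp := idle[len(idle)-1]
	idle = idle[:len(idle)-1]
	return pp
}

func findrunnableTail(delay time.Duration) {
	// At this point the real code has checked the M holds no P.
	time.Sleep(delay) // netpoll stub: timed semaphore wait on Plan 9
	if pp := pidleget(); pp != nil {
		fmt.Println("timer wait done; reacquired P", pp.id)
	}
}

func main() { findrunnableTail(5 * time.Millisecond) }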

@millerresearch (Contributor, Author)

The bug is not specific to ARM. It's not seen on the plan9_386 builder because that one is configured as a uniprocessor. I had earlier tried all.rc on a 386 locally and got ALL TESTS PASSED, but that was on a two-processor machine. Today I tried on a 386 with 4 CPUs and got the usual memory fault:

# flag
fatal error: unexpected signal during runtime execution
[signal sys: trap: fault read code=0x0 addr=0x0 pc=0x37ebe]

runtime stack:
runtime: unexpected return pc for runtime.findrunnable called from 0x8c468899
stack: frame={sp:0x10847d30, fp:0x10847e34} stack=[0x10837e8c,0x10847e8c)
10847cb0:  00000001  10847cdc  00030dc3 <runtime.throw+99>  10d022a0 
10847cc0:  00030f59 <runtime.fatalthrow+89>  10847cc8  00055920 <runtime.fatalthrow.func1+0>  10d022a0 
10847cd0:  00030dc3 <runtime.throw+99>  10847cdc  00030dc3 <runtime.throw+99>  10847ce0 
10847ce0:  000558b0 <runtime.throw.func1+0>  00275542  0000002a  0002d0a1 <runtime.sigpanic+1201> 
10847cf0:  00275542  0000002a  10d022a0  0003679f <runtime.mDoFixup+111> 
10847d00:  10cf23ac  003e2e01  00000000  10d022a0 
10847d10:  000358c2 <runtime.mPark+66>  10d022a0  003fb528  0003695e <runtime.stopm+142> 
10847d20:  1083aa00  00000000  10d022a0  00037ebe <runtime.findrunnable+2638> 
10847d30: <003f46fc  00000000  00000000  00000003 
10847d40:  00000001  00000000  00000000  00000000 
10847d50:  0100002c  00000001  00000000  00000000 
10847d60:  00000001  00000001  2c95ee4b  00000000 
10847d70:  00000000  00000000  00000000  00000000 
10847d80:  00000000  00000000  00000000  0000002c 
10847d90:  00000000  00000001  00000001  00000003 
10847da0:  00000003  0a03b656  00000003  ffffffff 
10847db0:  ffffffff  00000004  00000000  00000001 
10847dc0:  0002b683 <runtime.(*spanSet).pop+259>  10739a0c  00000001  00000004 
10847dd0:  00000004  00000004  00000001  00405b28 
10847de0:  00000003  00000004  00000003  10739600 
10847df0:  003e8180  000000b7  0000009f  00000000 
10847e00:  00405b28  00000000  00000000  00000000 
10847e10:  10816020  00000002  1083c000  0002bd7b <runtime.(*consistentHeapStats).release+91> 
10847e20:  1083d3ac  00000001  0000004c  00010efa <runtime.(*mspan).nextFreeIndex+74> 
10847e30: !8c468899 >ee264860  00000000  00000400 
10847e40:  00000000  00000000  00000000  ee264860 
10847e50:  0027c8e4  77132430  0000ac50 <runtime.(*mcache).nextFree+144>  00405b28 
10847e60:  00000000  0003c16f <runtime.acquirep+47>  003fb778  00405b28 
10847e70:  00012c53 <runtime.heapBitsSetType+2659>  0000b232 <runtime.mallocgc+1122>  00000000  00000000 
10847e80:  00000000  00000000  0005a801 <runtime.plan9_tsemacquire+1> 
runtime.throw(0x275542, 0x2a)
	/tmp/go/src/runtime/panic.go:1112 +0x63
runtime.sigpanic()
	/tmp/go/src/runtime/os_plan9.go:79 +0x4b1

Could someone remove the 'arch-arm' label from this issue please?

@cagedmantis removed the arch-arm label Dec 5, 2020
@gopherbot

Change https://golang.org/cl/275672 mentions this issue: runtime: skip wakep call in wakeNetPoller on Plan 9
