
runtime: panic on plan9_arm builders #42303

Closed

millerresearch opened this issue Oct 30, 2020 · 21 comments
Labels
FrozenDueToAge, NeedsInvestigation, OS-Plan9
Milestone
Backlog

Comments

@millerresearch (Contributor)

The plan9_arm builders have been getting panics on every run since the afternoon of 27 Oct. The first one seems to be https://build.golang.org/log/6a299fffd128c3ed0bfdd1c471c2ca891dee8b34 after CL 232298 was merged.

The immediate cause of the panic varies. It could be memory corruption.

@dmitshur added the NeedsInvestigation and OS-Plan9 labels Oct 30, 2020
@dmitshur (Contributor)

This may be the same as or related to #42237.

@dmitshur added this to the Backlog milestone Oct 30, 2020
@dmitshur added the arch-arm label Oct 30, 2020
@millerresearch (Contributor, Author)

> This may be the same as or related to #42237.

I don't think so. Those examples are deliberate panics because of a timeout. The plan9_arm panics are unexpected signals triggered by something going weirdly wrong in the runtime.

@millerresearch (Contributor, Author)

In some cases, the immediate cause of the panic is that _g_.m.p is nil at the top of findrunnable(). Shouldn't that be impossible?

@ianlancetaylor (Contributor)

CC @aclements @mknyszek @prattmic

@millerresearch (Contributor, Author)

Another common panic cause is that _g_.m is nil after calling acquirep() in mstart1(). Also impossible?
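For context, a minimal sketch of the invariant these two reports describe, using toy types rather than the real ones in runtime/runtime2.go: every M (OS thread) reaching this point in the scheduler is expected to have a non-nil m with an attached P, so a nil _g_.m or _g_.m.p dereference faults exactly as in the traces above.

package main

import "fmt"

// Toy stand-ins for the runtime's g, m and p structs.
type p struct{ id int }
type m struct{ p *p } // the P this thread currently owns
type g struct{ m *m } // the M this goroutine is running on

// findrunnable models the invariant at the top of the real findrunnable:
// the current g must have an M, and that M must own a P.
func findrunnable(gp *g) {
	if gp.m == nil || gp.m.p == nil {
		panic("scheduler invariant broken: _g_.m or _g_.m.p is nil")
	}
	fmt.Println("looking for work on P", gp.m.p.id)
}

func main() {
	findrunnable(&g{m: &m{p: &p{id: 0}}}) // invariant holds: no panic
}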

@prattmic (Member) commented Nov 4, 2020

Thanks for the heads up. I agree that this looks different from the failures in #42237, though perhaps it is related. Hopefully this one is reproducible on gomote.

@prattmic (Member) commented Nov 4, 2020

Does this builder have stability issues (not sure who maintains it)? After waiting 4hr 40min (!!!!) for a gomote, I tried to run some tests:

$ gomote run user-mpratt-plan9-arm-0 ./go/src/all.rc  
Building Go cmd/dist using /usr/glenda/go
Building Go toolchain1 using /usr/glenda/go.
2020/11/04 16:02:28 user-mpratt-plan9-arm-0: timeout after 10s waiting for headers for /status
2020/11/04 16:02:50 Buildlet https://farmer.golang.org:443 failed three heartbeats; final error: 502 Bad Gateway
Error running run: Error trying to execute ./go/src/all.rc: Buildlet https://farmer.golang.org:443 failed heartbeat after 86.211812ms; marking dead; err=502 Bad Gateway

And it seems to still be gone:

$ gomote ping user-mpratt-plan9-arm-0
Error running ping: 502 Bad Gateway

I notice on the build dashboard that the builds that don't have these panics contain errors like "communication error to buildlet (promoted to terminal error): network error promoted to terminal error: runTests: dist test failed: all buildlets had network errors or timeouts, yet tests remain", which sounds awfully similar.

I'm not thrilled by the prospect of waiting another 5hr to see if the next gomote will share a similar fate.

cc @golang/release

@millerresearch (Contributor, Author)

> Does this builder have stability issues

Well, that's one way of putting it. As I said above, these builders have been crashing on every run since the 27 October CL I referred to. Sometimes they panic; sometimes the corruption is so bad that the Go runtime just hangs. There isn't a full-time attendant to keep restarting them manually.

I've been putting in some time trying to debug this, and the common factor seems to be that _g_.m is nil or garbage. I don't know enough about the internals of the scheduler to understand how this could happen.

@prattmic (Member) commented Nov 4, 2020

I see at least one crash that does immediately look like a bad g.m.

Overall, this looks like general memory corruption to me, but that it is limited to runtime internals is interesting. Perhaps bad TLS somehow?

@dmitshur (Contributor) commented Nov 4, 2020

> not sure who maintains it

Builder owners and additional notes are available at https://farmer.golang.org/builders.

@millerresearch (Contributor, Author) commented Nov 5, 2020

To corroborate the circumstantial evidence implicating CL 232298, I've tried patching the current head of the master branch with the reverse of commit 8fdc79e. (It needed a bit of hand editing because of later changes in runtime/proc.go.)

With the commit reversed, all.rc completes with ALL TESTS PASSED, and no memory reference traps.

I'll run it a few more times, but it does appear that something in that commit has introduced a bug or tickled an existing bug whose effect is specific to plan9_arm.

@millerresearch (Contributor, Author)

> it does appear that something in that commit has introduced a bug

I've isolated it to the new call to wakep in wakeNetPoller. If I remove that call, the panics stop.

The ongoing investigation into #42237 seems to show it's related after all, so I'll wait to see how that one is resolved.

@prattmic (Member) commented Nov 5, 2020

Does http://golang.org/cl/267257 (merged to tip now) fix the crashes? That change was also directly fixing the startm call made by wakep from wakeNetPoller.

@millerresearch (Contributor, Author) commented Nov 6, 2020

> Does http://golang.org/cl/267257 (merged to tip now) fix the crashes?

No, it still gets memory faults.

Since the Plan 9 runtime doesn't actually have a network poller, does the wakep serve any useful purpose there? Could we just make it conditional on GOOS != "plan9"?

If I understand correctly, there's no network poller for wasm/js either. Should the wakep be skipped there too, to avoid unnecessarily starting threads?
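For reference, a standalone sketch of the conditional being suggested. The structure loosely paraphrases the wakeNetPoller of that era (simplified: the branch that interrupts an already-sleeping poller via netpollBreak is omitted, and pollerSleeping/wakep here are stand-ins, not the real runtime functions):

package main

import "runtime"

// Sketch of the guard suggested above, simplified from the shape of
// wakeNetPoller in runtime/proc.go at the time.
func wakeNetPoller(when int64) {
	if !pollerSleeping() {
		// CL 232298 added this wakep so that some thread ends up
		// waiting for the timer expiring at 'when'. The suggestion
		// is to skip it on Plan 9, which has no network poller.
		if runtime.GOOS != "plan9" {
			wakep()
		}
	}
}

// Stand-ins so the sketch compiles; the real runtime tracks this state
// in sched.lastpoll and starts an M/P pair in wakep via startm.
func pollerSleeping() bool { return false }
func wakep()               {}

func main() { wakeNetPoller(0) }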

@prattmic (Member) commented Nov 6, 2020

The netpoll delay sleep is used to wait for timers, so we still need that wakep to ensure some thread is waiting on timers.

@millerresearch (Contributor, Author)

> The netpoll delay sleep is used to wait for timers

Is this documented somewhere? Or can you point me to where in the code this happens on systems with no netpoller?

@prattmic (Member) commented Nov 6, 2020

I'm not aware of great overview docs, but this was part of @ianlancetaylor's new timer implementation last year. You can see the changes in http://golang.org/cl/171821 and its relation chain.

Notably, Plan 9's netpoll sleeps as requested, despite not performing any actual polling. findrunnable determines when the next timer expires (pollUntil), and will later ask netpoll to wait that long.
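In outline, and assuming the Go 1.15-era shape of this code path, the flow is roughly the following standalone sketch: findrunnable computes the delay until the earliest timer and hands it to netpoll, which on Plan 9 (runtime/netpoll_stub.go) amounts to an OS-level timed sleep; time.Sleep stands in for the notetsleep/semasleep wait.

package main

import (
	"fmt"
	"time"
)

// nextTimerDelay models how findrunnable turns pollUntil (the absolute
// time the earliest timer fires) into a delay for netpoll.
func nextTimerDelay(now, pollUntil int64) int64 {
	if pollUntil == 0 {
		return -1 // no timers: block until explicitly woken
	}
	if d := pollUntil - now; d > 0 {
		return d // nanoseconds until the earliest timer
	}
	return 0 // a timer is already due; don't block
}

// netpollStub models the Plan 9 netpoll: no file descriptors to poll,
// just sleep for the requested delay.
func netpollStub(delay int64) {
	if delay > 0 {
		time.Sleep(time.Duration(delay))
	}
}

func main() {
	now := time.Now().UnixNano()
	pollUntil := now + (10 * time.Millisecond).Nanoseconds()
	netpollStub(nextTimerDelay(now, pollUntil))
	fmt.Println("woke in time to run the timer")
}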

@millerresearch (Contributor, Author)

Thanks, that's helpful. I actually fixed a bug in netpoll_stub earlier this year without being fully aware of how it's used. All I know is that, empirically, removing that wakep call makes the difference between memory faults (every test run) and ALL TESTS PASSED. If the wakep is "needed to ensure some thread is waiting", why don't I see deadlocks when it's removed?

@millerresearch (Contributor, Author)

> If the wakep is "needed to ensure some thread is waiting", why don't I see deadlocks when it's removed?

The stub netpoll calls notetsleep to do an OS-level timed wait. notetsleep is being run by the m (aka thread, aka Plan 9 process, yes?) associated with g0 (the program's main goroutine, yes?). The Plan 9 implementation of notetsleep is semasleep, which blocks on a timed semaphore in the OS. So isn't g0.m already the thread which is waiting on the timer?

Before calling netpoll, the code in findrunnable explicitly checks that there's no current p associated with the running thread, and after netpoll returns there's a call to pidleget to (try to) associate a p with the thread. So while the g0.m thread is blocked on the timed semaphore, what is the point of wakep looking for a spare p and m to hook together?

There must be some detail I'm not seeing, sorry.
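To make the sequence in this question concrete, a toy model of the findrunnable tail as described above (toy types again, not the real scheduler structures):

package main

import (
	"fmt"
	"time"
)

// Toy of the findrunnable tail described above: the thread has already
// dropped its P, blocks in the netpoll stub for the timer delay, then
// tries to reacquire an idle P via pidleget.
type p struct{ id int }

var idle = []*p{{id: 1}} // stand-in for the sched.pidle list

func pidleget() *p {
	if len(idle) == 0 {
		return nil
	}
	pp := idle[len(idle)-1]
	idle = idle[:len(idle)-1]
	return pp
}

func findrunnableTail(delay time.Duration) {
	// At this point the real code has checked the M holds no P.
	time.Sleep(delay) // netpoll stub: timed semaphore wait on Plan 9
	if pp := pidleget(); pp != nil {
		fmt.Println("timer wait done; reacquired P", pp.id)
	}
}

func main() { findrunnableTail(5 * time.Millisecond) }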

@millerresearch (Contributor, Author)

The bug is not specific to ARM. It's not seen on the plan9_386 builder because that one is configured as a uniprocessor. I had earlier tried all.rc on a 386 locally and got ALL TESTS PASSED, but that was on a two-processor machine. Today I tried on a 386 with 4 CPUs and got the usual memory fault:

# flag
fatal error: unexpected signal during runtime execution
[signal sys: trap: fault read code=0x0 addr=0x0 pc=0x37ebe]

runtime stack:
runtime: unexpected return pc for runtime.findrunnable called from 0x8c468899
stack: frame={sp:0x10847d30, fp:0x10847e34} stack=[0x10837e8c,0x10847e8c)
10847cb0:  00000001  10847cdc  00030dc3 <runtime.throw+99>  10d022a0 
10847cc0:  00030f59 <runtime.fatalthrow+89>  10847cc8  00055920 <runtime.fatalthrow.func1+0>  10d022a0 
10847cd0:  00030dc3 <runtime.throw+99>  10847cdc  00030dc3 <runtime.throw+99>  10847ce0 
10847ce0:  000558b0 <runtime.throw.func1+0>  00275542  0000002a  0002d0a1 <runtime.sigpanic+1201> 
10847cf0:  00275542  0000002a  10d022a0  0003679f <runtime.mDoFixup+111> 
10847d00:  10cf23ac  003e2e01  00000000  10d022a0 
10847d10:  000358c2 <runtime.mPark+66>  10d022a0  003fb528  0003695e <runtime.stopm+142> 
10847d20:  1083aa00  00000000  10d022a0  00037ebe <runtime.findrunnable+2638> 
10847d30: <003f46fc  00000000  00000000  00000003 
10847d40:  00000001  00000000  00000000  00000000 
10847d50:  0100002c  00000001  00000000  00000000 
10847d60:  00000001  00000001  2c95ee4b  00000000 
10847d70:  00000000  00000000  00000000  00000000 
10847d80:  00000000  00000000  00000000  0000002c 
10847d90:  00000000  00000001  00000001  00000003 
10847da0:  00000003  0a03b656  00000003  ffffffff 
10847db0:  ffffffff  00000004  00000000  00000001 
10847dc0:  0002b683 <runtime.(*spanSet).pop+259>  10739a0c  00000001  00000004 
10847dd0:  00000004  00000004  00000001  00405b28 
10847de0:  00000003  00000004  00000003  10739600 
10847df0:  003e8180  000000b7  0000009f  00000000 
10847e00:  00405b28  00000000  00000000  00000000 
10847e10:  10816020  00000002  1083c000  0002bd7b <runtime.(*consistentHeapStats).release+91> 
10847e20:  1083d3ac  00000001  0000004c  00010efa <runtime.(*mspan).nextFreeIndex+74> 
10847e30: !8c468899 >ee264860  00000000  00000400 
10847e40:  00000000  00000000  00000000  ee264860 
10847e50:  0027c8e4  77132430  0000ac50 <runtime.(*mcache).nextFree+144>  00405b28 
10847e60:  00000000  0003c16f <runtime.acquirep+47>  003fb778  00405b28 
10847e70:  00012c53 <runtime.heapBitsSetType+2659>  0000b232 <runtime.mallocgc+1122>  00000000  00000000 
10847e80:  00000000  00000000  0005a801 <runtime.plan9_tsemacquire+1> 
runtime.throw(0x275542, 0x2a)
	/tmp/go/src/runtime/panic.go:1112 +0x63
runtime.sigpanic()
	/tmp/go/src/runtime/os_plan9.go:79 +0x4b1

Could someone remove the 'arch-arm' label from this issue please?

@cagedmantis removed the arch-arm label Dec 5, 2020
@gopherbot

Change https://golang.org/cl/275672 mentions this issue: runtime: skip wakep call in wakeNetPoller on Plan 9
