runtime: panic on plan9_arm builders #42303
Comments
This may be the same as or related to #42237.
I don't think so. Those examples are deliberate panics because of a timeout. The plan9_arm panics are unexpected signals triggered by something going weirdly wrong in the runtime.
In some cases, the immediate cause of the panic is that [...]
Another common panic cause is that [...]
Thanks for the heads up. I agree that this looks different from the failures in #42237, though perhaps it is related. Hopefully this one is reproducible on gomote.
Does this builder have stability issues (not sure who maintains it)? After waiting 4hr 40min (!!!!) for a gomote, I tried to run some tests:
And it seems to still be gone:
I notice on the build dashboard that the builds that don't have these panics contain errors like: [...] I'm not thrilled by the prospect of waiting another 5hr to see if the next gomote will share a similar fate. cc @golang/release
Well, that's one way of putting it. As I said above, these builders have been crashing on every run since the 27 October CL I referred to. Sometimes they panic; sometimes the corruption is so bad that the Go runtime just hangs. There isn't a full-time attendant to keep restarting them manually. I've been putting in some time trying to debug this, and the common factor seems to be that [...]
I see at least one crash that does look immediately like bad g.m:
Overall, this looks like general memory corruption to me, but it is interesting that it is limited to runtime internals. Perhaps bad TLS somehow?
Builder owners and additional notes are available at https://farmer.golang.org/builders.
To corroborate the circumstantial evidence implicating CL 232298, I've tried patching the current head of the master branch with the reverse of commit 8fdc79e. (It needed a bit of hand editing because of later changes in runtime/proc.go.) With the commit reversed, I'll run it a few more times, but it does appear that something in that commit has introduced a bug or tickled an existing bug whose effect is specific to plan9_arm.
I've isolated it to the new call to [...] The ongoing investigation into #42237 seems to show it's related after all, so I'll wait to see how that one is resolved.
Does http://golang.org/cl/267257 (merged to tip now) fix the crashes? That CL also directly fixed the startm call in wakep from wakeNetPoller.
No, it still gets memory faults. Since the Plan 9 runtime doesn't actually have a network poller, does the [...] If I understand correctly, there's no network poller for wasm/js either. Should the [...]
The netpoll delay sleep is used to wait for timers, so we still need that [...]
Is this documented somewhere? Or can you point me to where in the code this happens on systems with no netpoller?
I'm not aware of great overview docs, but this was part of @ianlancetaylor's new timers last year. You can see the changes in http://golang.org/cl/171821 and its relation chain. Notably, Plan 9's [...]
Thanks, that's helpful. I actually fixed a bug in netpoll_stub earlier this year without being fully aware of how it's used. All I know is that, empirically, removing that [...]
The stub [...] Before calling [...] There must be some detail I'm not seeing, sorry.
The bug is not specific to ARM. It's not seen on the plan9_386 builder because that's configured as a uniprocessor. I had earlier tried
Could someone remove the 'arch-arm' label from this issue please?
Change https://golang.org/cl/275672 mentions this issue: [...]
The plan9_arm builders have been getting panics on every run since the afternoon of 27 Oct. The first one seems to be https://build.golang.org/log/6a299fffd128c3ed0bfdd1c471c2ca891dee8b34 after CL 232298 was merged.
The immediate cause of the panic varies. It could be memory corruption.