
runtime: "fatal: systemstack called from unexpected goroutine" on Android #51001

Closed
bcmills opened this issue Feb 3, 2022 · 24 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Android WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.

Comments

@bcmills
Contributor

bcmills commented Feb 3, 2022

#!watchflakes
post <- builder ~ `android` && `systemstack called from unexpected goroutine`

greplogs --dashboard -md -l -e '^fatal: systemstack called from unexpected goroutine' --since=2021-01-01

2022-02-02T21:12:39-53d6a72/android-amd64-emu

##### GOMAXPROCS=2 runtime -cpu=1,2,4 -quick
fatal: systemstack called from unexpected goroutineTrap 
exitcode=133
FAIL	runtime	59.879s

2021-10-08T16:26:20-59d4e92-99c1b24/android-amd64-emu

fatal: systemstack called from unexpected goroutineSegmentation fault 
exitcode=139FAIL	golang.org/x/net/publicsuffix	3.419s

I'll also note that badsystemstackMsg seems to be missing a final newline as of CL 93659 (CC @aclements @randall77). 😅

@bcmills bcmills added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Android labels Feb 3, 2022
@bcmills bcmills added this to the Backlog milestone Feb 3, 2022
@bcmills
Contributor Author

bcmills commented May 3, 2022

This happened in a TryBot in https://storage.googleapis.com/go-build-log/f1e11825/android-amd64-emu_262486a5.log:

##### GOMAXPROCS=2 runtime -cpu=1,2,4 -quick
fatal: systemstack called from unexpected goroutineTrap 
exitcode=133
FAIL	runtime	16.177s
FAIL
2022/05/03 14:18:09 Failed: exit status 1
go tool dist: FAILED

Marking as release-blocker because this affects TryBot runs. Since android/amd64 is not a first-class port, either the underlying bug can be diagnosed and fixed, or the builder can be removed from the default TryBot set. (I'll leave that choice up to @golang/runtime to decide and implement.)

@bcmills bcmills modified the milestones: Backlog, Go1.19 May 3, 2022
@bcmills
Contributor Author

bcmills commented May 3, 2022

This may or may not be OS-specific. There is another failure in the builder logs since February, but on plan9 rather than android; it isn't obvious to me whether that is an independent bug.

greplogs -l -e 'fatal: systemstack called from unexpected goroutine' --since=2022-02-03
2022-03-05T21:20:16-e155b03-45f4544/plan9-amd64-0intro

@bcmills
Contributor Author

bcmills commented May 4, 2022

greplogs -l -e 'fatal: systemstack called from unexpected goroutine' --since=2022-03-06
2022-05-03T19:48:07-bccce90/android-arm64-corellium

@bcmills bcmills changed the title runtime: "fatal: systemstack called from unexpected goroutine" on android-amd64-emu runtime: "fatal: systemstack called from unexpected goroutine" on android/amd64 May 4, 2022
@mknyszek
Contributor

@golang/runtime This is a second-class port, but because it's a TryBot, this is a release blocker. Should we consider removing it from the TryBot set? Is it bringing us enough value?

@gopherbot

Change https://go.dev/cl/407615 mentions this issue: dashboard: remove android-amd64-emu from main go repo's TryBot set

@dmitshur
Contributor

I've mailed CL 407615 that makes android-amd64-emu a post-submit builder only (in the main repo) while investigation of this issue is underway. If submitted, this issue can be unmarked as a release-blocker for Go 1.19.

@bcmills bcmills changed the title runtime: "fatal: systemstack called from unexpected goroutine" on android/amd64 runtime: "fatal: systemstack called from unexpected goroutine" on Android May 23, 2022
@bcmills
Contributor Author

bcmills commented May 23, 2022

Curiously, this does not appear to be arch-specific: we've seen these failures on both amd64 and arm64.

greplogs -l -e 'fatal: systemstack called from unexpected goroutine' --since=2022-05-04
2022-05-20T22:30:37-2b0e457/android-arm64-corellium

@prattmic prattmic added the okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 label May 25, 2022
@prattmic
Member

prattmic commented Jun 1, 2022

The first failure shows exitcode=133. This is likely bash parlance for a process terminated by signal 5 (SIGTRAP), since 133 = 128 + 5. From man bash: "The return value of a simple command is its exit status, or 128+n if the command is terminated by signal n."

If I recall correctly, Android applies a seccomp syscall filter to (all?) processes. I wonder if we are violating this filter on the throw path, resulting in truncation of the stack trace. seccomp with mode SECCOMP_RET_TRAP sends a SIGTRAP on violation.
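The exit-status arithmetic can be checked with a small Go sketch. It applies only the bash rule quoted above (128+n for a command terminated by signal n) plus a hand-written table of Linux signal numbers; that table is an assumption of the sketch, not something the builders report. For reference when weighing the seccomp idea, the seccomp(2) man page documents SECCOMP_RET_TRAP as delivering SIGSYS (31 on amd64 and arm64), which under the same rule would show up as exitcode=159.

```go
package main

import "fmt"

// signalNames covers a few Linux signal numbers relevant to this thread.
// These values are the same on linux/amd64 and linux/arm64.
var signalNames = map[int]string{
	5:  "SIGTRAP",
	11: "SIGSEGV",
	31: "SIGSYS",
}

// decodeExitCode applies the bash convention: a command terminated by
// signal n reports an exit status of 128+n. It returns ok=false for
// statuses that do not encode a signal.
func decodeExitCode(code int) (sig int, name string, ok bool) {
	if code <= 128 || code > 255 {
		return 0, "", false
	}
	sig = code - 128
	n, known := signalNames[sig]
	if !known {
		n = fmt.Sprintf("signal %d", sig)
	}
	return sig, n, true
}

func main() {
	// Exit statuses seen in the builder logs in this issue.
	for _, code := range []int{133, 139} {
		if sig, name, ok := decodeExitCode(code); ok {
			fmt.Printf("exitcode=%d -> signal %d (%s)\n", code, sig, name)
		}
	}
}
```

Run as-is, it reports signal 5 (SIGTRAP) for exitcode=133 and signal 11 (SIGSEGV) for exitcode=139, matching the "Trap" and "Segmentation fault" messages in the logs.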

@prattmic
Member

prattmic commented Jun 1, 2022

@golang/android do you know if the Android seccomp filters apply to processes on our builders, and if so which one?

@prattmic
Member

prattmic commented Jun 6, 2022

No repros of this on 25 gomotes all weekend. I did find #53250, plus several SIGSEGVs with no further context in the runtime test, like:

##### GOMAXPROCS=2 runtime -cpu=1,2,4 -quick
Segmentation fault 
exitcode=139
FAIL	runtime	19.914s
FAIL
2022/06/05 22:34:10 Failed: exit status 1

(Some were in the standard runtime test rather than the -cpu variant.)

@aclements
Member

This isn't a first-class port, so dropping release-blocker.

@gopherbot gopherbot removed the okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 label Jun 10, 2022
@bcmills
Contributor Author

bcmills commented Jun 14, 2022

> This isn't a first-class port, so dropping release-blocker.

This port is still run as a default TryBot until/unless CL 407615 is merged. IMO known failures on TryBots should still block releases, since they still add testing noise for anyone who uses TryBots on a pending change.

@bcmills
Copy link
Contributor Author

bcmills commented Jun 14, 2022

In the interest of decoupling this issue from the Android TryBots in general, I've filed #53377 (as a release-blocker) to decide whether to remove the TryBots or fix their known failure modes.

@bcmills
Contributor Author

bcmills commented Jun 14, 2022

Summarizing the known failures with this pattern on Android:

greplogs -l -e '(?ms)\Aandroid-.*^fatal: systemstack called from unexpected goroutine'
2022-05-20T22:30:37-2b0e457/android-arm64-corellium
2022-05-03T19:48:07-bccce90/android-arm64-corellium
2022-02-02T21:12:39-53d6a72/android-amd64-emu
2021-10-08T16:26:20-59d4e92-99c1b24/android-amd64-emu

So it looks like this bug was probably introduced sometime in 2021..?
(Or else, maybe the check itself was introduced then? 😅)
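The `(?ms)\A...^` query above restricts matches to Android builders while still requiring the fatal message at the start of a line. A minimal sketch with Go's regexp package (RE2 syntax) shows how the flags combine. The sample log texts are hypothetical and assume each log begins with the builder name, which may not match exactly how greplogs presents logs.

```go
package main

import (
	"fmt"
	"regexp"
)

// triagePattern is the expression from the greplogs query above:
//   (?m) makes ^ match at the start of every line,
//   (?s) lets . match newlines, so .* can span any number of lines,
//   \A anchors "android-" to the very start of the searched text.
var triagePattern = regexp.MustCompile(`(?ms)\Aandroid-.*^fatal: systemstack called from unexpected goroutine`)

// sampleLogs are hypothetical log texts, not real builder output.
// Each assumes the log begins with the builder name.
var sampleLogs = []struct {
	name string
	text string
}{
	{"android, fatal at line start", "android-arm64-corellium\n##### runtime\nfatal: systemstack called from unexpected goroutineTrap \n"},
	{"plan9 builder", "plan9-amd64-0intro\nfatal: systemstack called from unexpected goroutine\n"},
	{"android, message mid-line", "android-amd64-emu\nok fatal: systemstack called from unexpected goroutine\n"},
}

func main() {
	for _, l := range sampleLogs {
		fmt.Printf("%-30s match=%v\n", l.name, triagePattern.MatchString(l.text))
	}
}
```

With these samples, only the first log matches: the plan9 log fails the `\A` anchor, and the third fails because `^` cannot match in the middle of a line.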

@gopherbot

Change https://go.dev/cl/412174 mentions this issue: dashboard: add known issues for android-*-emu

gopherbot pushed a commit to golang/build that referenced this issue Jun 14, 2022
Issue golang/go#42212 manifests as test timeouts, and is by far the most
frequent of these known issues.

Issue golang/go#51001 causes failures with "systemstack called from unexpected
goroutine". It seems to have been introduced sometime last year, but
it isn't clear to me whether it is a regression or an older (latent)
bug unearthed by some other change.

Issue golang/go#52724 appears to be a bug or race in the Android emulator
itself. It might require a builder image update and/or escalation to
the maintainers of the emulator proper.

Updates golang/go#53377.

Change-Id: I677915b1ff02dd02e0f14c63b0d25caf11e27a72
Reviewed-on: https://go-review.googlesource.com/c/build/+/412174
Reviewed-by: David Chase <drchase@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Bryan Mills <bcmills@google.com>
@ianlancetaylor
Contributor

Rolling forward to 1.20.

@heschi
Contributor

heschi commented Aug 29, 2022

@gopherbot

Found new dashboard test flakes for:

#!watchflakes
post <- builder ~ `android` && `systemstack called from unexpected goroutine`
2022-08-22 14:48 android-arm64-corellium go@6bdca820 runtime (log)
fatal: systemstack called from unexpected goroutine
2022-08-25 19:17 android-arm64-corellium go@f64f12f0 runtime (log)
fatal: systemstack called from unexpected goroutine
2022-09-27 18:26 android-arm64-corellium go@17078f58 runtime (log)
fatal: systemstack called from unexpected goroutine

watchflakes

@gopherbot

Found new dashboard test flakes for:

#!watchflakes
post <- builder ~ `android` && `systemstack called from unexpected goroutine`
2022-10-06 02:38 android-arm64-corellium go@2e054128 runtime (log)
fatal: systemstack called from unexpected goroutine

watchflakes

@cherrymui
Member

There seem to have been no new failures for some time.

@cherrymui cherrymui modified the milestones: Go1.20, Backlog Jan 10, 2023
@bcmills
Contributor Author

bcmills commented Jan 11, 2023

Note that the rate of testing is much lower now because of the freeze. (6 months is a good window size for checking failure rates.)

@bcmills bcmills added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Feb 3, 2023
@bcmills
Contributor Author

bcmills commented Feb 3, 2023

Still none after the tree reopened. Maybe fixed?

@gopherbot

Change https://go.dev/cl/465156 mentions this issue: dashboard: unmark known-issues with low failure rates

gopherbot pushed a commit to golang/build that referenced this issue Feb 4, 2023
I had initially added known issues fairly aggressively in order to use
them to reduce noise in 'greplogs -triage'. Now that we are using
'watchflakes' for triage, that noise reduction is no longer important
(the failures are already clustered to their respective known issues),
and having greyed-out cells on the dashboard makes new regressions too
easy to miss.

Concretely:

- golang/go#42212 is mostly specific to x/net at this point (as
  golang/go#57841)

- There have been no failures matching golang/go#51001 since October.

- golang/go#52724 has been so rare lately that we hadn't yet added a
  'watchflakes' pattern for it.

- There have been no failures matching golang/go#51443 since May.

- There have been no failures matching golang/go#53116 or
  golang/go#53093 since I enabled 'watchflakes' for the builder in
  December.

- The linux-amd64-perf builder seems to be passing consistently for
  x/benchmarks and x/tools, so there is no need to refer to
  golang/go#53538 to explain failures on it.

Change-Id: Ia16db2a23e5fa037a299f1f56fb26f1cf84521e1
Reviewed-on: https://go-review.googlesource.com/c/build/+/465156
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Run-TryBot: Bryan Mills <bcmills@google.com>
Auto-Submit: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@seankhliao seankhliao added WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. and removed WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. labels Sep 18, 2023
@gopherbot

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

@gopherbot gopherbot closed this as not planned Won't fix, can't repro, duplicate, stale Oct 18, 2023
Projects
Status: Done

10 participants