runtime: frequent TestSegv failures since 2021-10-26 #49182

Open
bcmills opened this issue Oct 27, 2021 · 40 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.

@bcmills bcmills added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker labels Oct 27, 2021
@bcmills bcmills added this to the Go1.18 milestone Oct 27, 2021
@cuonglm
Member

cuonglm commented Oct 27, 2021

I think this started with https://go-review.googlesource.com/c/go/+/339990.

CL https://go-review.googlesource.com/c/go/+/339989 causes the windows-arm64-10 builder to fail.

@bcmills
Contributor Author

bcmills commented Oct 27, 2021

Filed the windows-arm64-10 issue separately as #49188.

@gopherbot

Change https://golang.org/cl/359254 mentions this issue: runtime: disable TestSegv on darwin, illumos, solaris

gopherbot pushed a commit that referenced this issue Oct 28, 2021
CL 339990 made this test more strict, exposing pre-existing issues on
these OSes. Skip for now until they can be resolved.

Updates #49182

Change-Id: I3ac400dcd21b801bf4ab4eeb630e23b5c66ba563
Reviewed-on: https://go-review.googlesource.com/c/go/+/359254
Trust: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Bryan C. Mills <bcmills@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
@bcmills
Contributor Author

bcmills commented Oct 28, 2021

Hmm, looks like relaxing the check didn't completely solve the problem:

greplogs --dashboard -md -l -e 'FAIL: TestSegv' --since=2021-10-28T16:54:00

2021-10-28T16:54:58-6bd0e7f/linux-arm-aws

@bcmills
Contributor Author

bcmills commented Nov 2, 2021

linux-arm-aws is the only one that appears to still be flaky:

greplogs --dashboard -md -l -e '(?ms)TestSegv.*unexpectedly saw "runtime: "'

2021-11-02T20:47:30-b246873/linux-arm-aws
2021-11-02T17:01:14-62b29b0/linux-arm-aws
2021-10-28T16:54:58-6bd0e7f/linux-arm-aws
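
For readers unfamiliar with the `(?ms)` prefix in the query above: in Go's regexp syntax, `s` lets `.` match a newline, so the pattern can span from the test name to the later error message. The following is a minimal sketch of that behavior; the sample log text is made up for illustration and is not real builder output.

```go
package main

import (
	"fmt"
	"regexp"
)

// Pattern copied from the greplogs invocation in this thread.
var segvRuntime = regexp.MustCompile(`(?ms)TestSegv.*unexpectedly saw "runtime: "`)

// The same pattern without the flags, for comparison only.
var noFlags = regexp.MustCompile(`TestSegv.*unexpectedly saw "runtime: "`)

// sampleLog is a hypothetical excerpt, not copied from any builder log.
const sampleLog = "--- FAIL: TestSegv (0.01s)\n" +
	"    --- FAIL: TestSegv/Segv (0.00s)\n" +
	"        unexpectedly saw \"runtime: \" in output\n"

func main() {
	// With (?s), ".*" crosses line boundaries, so the pattern matches.
	fmt.Println(segvRuntime.MatchString(sampleLog)) // true
	// Without it, "." stops at each newline and nothing matches.
	fmt.Println(noFlags.MatchString(sampleLog)) // false
}
```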

@aclements
Member

This looks like the signal is landing while we're in the VDSO, presumably executing a kernel-provided atomic for casgstatus.

@prattmic, you looked at and fixed a few similar issues recently (e.g., 86f6bf1). Any insights on this one?

@aclements
Member

Yeah, the "unknown PC" is in the kernel-provided CAS implementation. The problem may just be that we're entirely missing the vdsoSP protection around the VDSO cas call (presumably the same thing applies to memory_barrier<>, too).
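
To illustrate the idea of vdsoSP protection for readers outside the runtime, here is a toy model in plain Go. It is only a sketch: the `m` struct, `callHelper`, and `tracebackStart` are hypothetical stand-ins, the addresses are made up, and the real runtime does this in assembly around the kernel call. The point is that the caller's PC and SP are recorded before entering kernel-provided code, so a signal handler can start unwinding from a frame it understands.

```go
package main

import "fmt"

// m is a toy stand-in for the runtime's per-thread structure.
// Only the two fields relevant to this sketch are modeled.
type m struct {
	vdsoPC uintptr // PC to start traceback from while in kernel-provided code
	vdsoSP uintptr // SP to start traceback from; 0 means "not in such a call"
}

// tracebackStart returns the PC/SP a signal handler would unwind from.
// If a protected call is in progress, it prefers the saved caller frame
// over the interrupted PC, which the unwinder cannot interpret.
func tracebackStart(mp *m, sigPC, sigSP uintptr) (pc, sp uintptr) {
	if mp.vdsoSP != 0 {
		return mp.vdsoPC, mp.vdsoSP
	}
	return sigPC, sigSP
}

// callHelper (hypothetical) wraps a call into kernel-provided code,
// saving the caller's frame first and clearing the marker afterwards.
func callHelper(mp *m, callerPC, callerSP uintptr, fn func()) {
	mp.vdsoPC, mp.vdsoSP = callerPC, callerSP
	fn()
	mp.vdsoSP = 0
}

func main() {
	mp := &m{}
	// Made-up addresses: the "signal" lands inside the helper.
	callHelper(mp, 0x401000, 0x7fff0000, func() {
		pc, sp := tracebackStart(mp, 0xffff0fc0, 0x7000)
		fmt.Printf("inside helper: start at pc=%#x sp=%#x\n", pc, sp)
	})
	pc, sp := tracebackStart(mp, 0x402000, 0x7fff0100)
	fmt.Printf("outside helper: start at pc=%#x sp=%#x\n", pc, sp)
}
```

Without the save step in `callHelper`, the handler in this model would be left with only the interrupted PC, which corresponds to the "unknown PC" symptom described above.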

@prattmic
Member

prattmic commented Nov 3, 2021

Yes, I believe that is the case. I'll get those covered as well.

@prattmic prattmic self-assigned this Nov 8, 2021
@gopherbot

Change https://golang.org/cl/362796 mentions this issue: runtime/internal/atomic: treat ARM kernel helpers as VDSO

@gopherbot

Change https://golang.org/cl/362795 mentions this issue: runtime: refactor ARM VDSO call setup to helper

@gopherbot

Change https://golang.org/cl/362977 mentions this issue: runtime: start ARM atomic kernel helper traceback in caller

@jeremyfaller jeremyfaller added the okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 label Nov 12, 2021
@jeremyfaller
Contributor

Mostly fixed. Not a Beta1 blocker as it's not a new breakage, just a stricter test.

@bcmills
Contributor Author

bcmills commented Nov 17, 2021

Looks like this is fixed on ARM and skipped on amd64, but still occasionally failing on some of the more exotic architectures.
(It's not clear to me whether the remaining failures are arch-specific bugs.)

greplogs --dashboard -md -l -e 'FAIL: TestSegv ' --since=2021-11-13

2021-11-17T04:55:12-1d004fa/linux-mips64le-mengzhuo
2021-11-13T00:23:16-39bc666/linux-riscv64-unmatched

@bcmills bcmills reopened this Nov 17, 2021
@bcmills
Contributor Author

bcmills commented Nov 17, 2021

Looks like both of those failures are in the SegvInCgo subtest.

@bcmills
Contributor Author

bcmills commented Dec 1, 2021

https://storage.googleapis.com/go-build-log/3c4f5e79/linux-riscv64-jsing_db2bc678.log (a linux-riscv64-jsing SlowBot) also failed in SegvInCgo.

@prattmic
Member

prattmic commented Dec 3, 2021

The netbsd and openbsd failures there appear to be #49209.

@mknyszek
Contributor

mknyszek commented Dec 6, 2021

Excluding the NetBSD and OpenBSD failures, all the ones that are left are in TestSegvInCgo like the two @bcmills posted about before. I don't think this is the same problem as the one @prattmic fixed earlier, because even an arm64 builder is failing there.

I'll poke at this one a bit.

@mknyszek
Contributor

mknyszek commented Dec 6, 2021

In all these cases it looks like the signal is landing in some part of the cgocall path (not all that surprising), though I'm not sure I understand why gentraceback has issues producing a traceback in these cases.

@mknyszek
Contributor

mknyszek commented Dec 6, 2021

Oh, OK. Actually, for the linux/arm64 failure, the PC looks like the PC for the branch that @prattmic added in https://go.dev/cl/362977: the top bits of the PC (in a 48-bit address space) are 0xffff. So there's some other kind of VDSO (?) call on at least this platform where there's more going on.

The other platforms' failures don't seem to look like this, however.
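
As a small illustration of the "top bits are 0xffff" observation, the sketch below extracts the top 16 bits of a 48-bit address. The function names and the sample PC values are hypothetical; this is not the check the runtime performs, only the bit arithmetic behind the remark.

```go
package main

import "fmt"

// top16Of48 returns bits 32..47 of pc, that is, the top 16 bits of an
// address in a 48-bit address space.
func top16Of48(pc uint64) uint64 {
	return (pc >> 32) & 0xffff
}

// looksLikeHighPC reports whether those top bits are all ones (0xffff).
func looksLikeHighPC(pc uint64) bool {
	return top16Of48(pc) == 0xffff
}

func main() {
	// Made-up PCs: one near the top of a 48-bit space, one ordinary.
	for _, pc := range []uint64{0x0000ffff9a3c1234, 0x0000000000456789} {
		fmt.Printf("pc=%#x top16=%#x high=%v\n", pc, top16Of48(pc), looksLikeHighPC(pc))
	}
}
```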

@bcmills
Contributor Author

bcmills commented Jan 14, 2022

The linux-mips64le-mengzhuo failures are still occurring on the builder. It's not obvious to me whether that has the same root cause as the 386 failure mode, so I filed it separately as #50605.

jproberts pushed a commit to jproberts/go that referenced this issue Jun 21, 2022
The VDSO (__kernel_vsyscall) is reachable via
asmcgocall(cgo_start_thread) on linux-386, which causes traceback to
throw.

Fixes golang#49182.
For golang#50504.

Change-Id: Idb78cb8de752203ce0ed63c2dbd2d12847338688
Reviewed-on: https://go-review.googlesource.com/c/go/+/376656
Reviewed-by: Cherry Mui <cherryyz@google.com>
Trust: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
@prattmic prattmic self-assigned this Jun 24, 2022
gopherbot pushed a commit that referenced this issue Aug 17, 2022
We have a very complex process to make VDSO calls on ARM. Create a
wrapper helper function which reduces duplication and allows for
additional calls from other packages.

vdsoCall has a few differences from the original code in
walltime/nanotime:

* It does not use R0-R3, as they are passed through as arguments to fn.
* It does not save g if g.m.gsignal.stack.lo is zero. This may occur if
it is called at startup on g0 between assigning g0.m.gsignal and setting
its stack.

For #49182

Change-Id: I51aca514b4835b71142011341d2f09125334d30f
Reviewed-on: https://go-review.googlesource.com/c/go/+/362795
Run-TryBot: Michael Pratt <mpratt@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@gopherbot

Change https://go.dev/cl/431975 mentions this issue: runtime: enable TestSegv on darwin, illumos, solaris

@bcmills bcmills reopened this May 1, 2023
@bcmills bcmills closed this as completed May 1, 2023
@bcmills bcmills reopened this May 1, 2023
@bcmills
Contributor Author

bcmills commented May 1, 2023

@golang/runtime, can this be re-triaged? The test still has a testenv.SkipFlaky call on darwin, illumos, and solaris referring to this issue — if there is some other tracking issue for the test (particularly given that darwin is a first-class platform), then the skip should be updated to refer to that issue.

@bcmills bcmills added the compiler/runtime Issues related to the Go compiler and/or runtime. label May 1, 2023
@bcmills bcmills modified the milestones: Go1.18, Backlog May 1, 2023
@gopherbot

Change https://go.dev/cl/491095 mentions this issue: runtime: add test skips for ios

gopherbot pushed a commit that referenced this issue May 3, 2023
For #59912.
For #59913.
Updates #49182.

Change-Id: I3fcdfaca3a4f7120404e7a36b4fb5f0e57dd8114
Reviewed-on: https://go-review.googlesource.com/c/go/+/491095
TryBot-Bypass: Bryan Mills <bcmills@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
Auto-Submit: Bryan Mills <bcmills@google.com>
Reviewed-by: Austin Clements <austin@google.com>
@mknyszek
Contributor

mknyszek commented May 3, 2023

Hello from triage. :) I think we didn't get to this because our incoming queue was quite busy. @prattmic will take a look at breaking it up. @cherrymui is working on a rewrite of the test that will split out the failures a bit better.

@cherrymui
Member

There are various TestSegv issues that watchflakes is tracking, e.g. #59443. I think we can close this as a dup. If this happens again, watchflakes will post on or reopen an existing issue, or open a new one.

@bcmills
Contributor Author

bcmills commented Jun 26, 2023

@cherrymui, this issue is more about the followup work needed to diagnose the existing skips, particularly on darwin and on linux/386 (which are both first-class ports):
https://cs.opensource.google/go/go/+/master:src/runtime/crash_cgo_test.go;l=667-675;drc=6dd3bfbed6f17e7789f092e96408c00c227a8b68

@bcmills bcmills reopened this Jun 26, 2023
@bcmills
Contributor Author

bcmills commented Jun 26, 2023

That is: relying on watchflakes to report this issue is not appropriate because the relevant failures are already being skipped.

@cherrymui
Member

cherrymui commented Jun 26, 2023

I think we know we are not always able to traceback from asynchronous interrupts in libc calls. I'm not sure if there is anything we can do. Or you mean we should push hard to make that work?

The linux/386 VDSO case may be fixable (or already fixed). #50504 should track that.

@bcmills
Contributor Author

bcmills commented Jun 26, 2023

> I think we know we are not always able to traceback from asynchronous interrupts in libc calls. I'm not sure if there is anything we can do. Or you mean we should push hard to make that work?

I think that if we consider that to be normal operation, we should update the test to examine the output and confirm that it is consistent with what we would expect for that case (instead of calling testenv.SkipFlaky).
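
A rough sketch of what "examine the output" could look like, as opposed to skipping outright. Everything here is an assumption for illustration: `checkSegvOutput` and its `tracebackMayFail` parameter are hypothetical names, and the expectation that the output mentions "SIGSEGV" is assumed rather than taken from this thread. Only the `"runtime: "` string comes from the failure message quoted earlier.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// checkSegvOutput is a hypothetical helper. It always requires the crash
// output to mention SIGSEGV (an assumption), and rejects "runtime: "
// diagnostics unless the platform is known to be unable to trace back
// from the interrupted code.
func checkSegvOutput(out string, tracebackMayFail bool) error {
	if !strings.Contains(out, "SIGSEGV") {
		return errors.New("output does not mention SIGSEGV")
	}
	if strings.Contains(out, "runtime: ") && !tracebackMayFail {
		return errors.New(`unexpectedly saw "runtime: " in output`)
	}
	return nil
}

func main() {
	// Made-up output resembling a crash where traceback failed.
	out := "SIGSEGV: segmentation violation\nruntime: unknown pc 0xffff0fc0\n"
	fmt.Println(checkSegvOutput(out, false)) // strict platforms: error
	fmt.Println(checkSegvOutput(out, true))  // platforms where this is expected: <nil>
}
```

The design choice in this sketch is that the relaxed platforms still run the test and still verify the crash itself, so a regression in signal delivery would not be hidden by a blanket skip.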

@cherrymui
Member

Sounds good. Thanks.

Projects
Status: In Progress