runtime/trace: test failures increased as of 2020-05-01 #38794

Closed
bcmills opened this issue May 1, 2020 · 33 comments
Labels: FrozenDueToAge, NeedsInvestigation, release-blocker
Milestone: Go1.15

Comments

@bcmills
Contributor

bcmills commented May 1, 2020

2020-05-01T15:02:37-f092be8/windows-amd64-longtest
2019-11-22T21:09:43-9f3c2b6/windows-amd64-longtest

--- FAIL: TestEventBatch (6.16s)
    --- FAIL: TestEventBatch/G=4978 (0.05s)
        trace_test.go:80: failed to parse trace: no consistent ordering of events possible
FAIL
FAIL	runtime/trace	8.257s

CC @hyangah @aclements

@bcmills added the NeedsInvestigation label May 1, 2020
@bcmills added this to the Backlog milestone May 1, 2020
@bcmills
Contributor Author

bcmills commented May 6, 2020

@bcmills modified the milestones: Backlog, Go1.15 May 6, 2020
@bcmills changed the title from "runtime/trace: TestEventBatch failures on windows-amd64-longtest builder" to "runtime/trace: TestEventBatch failures increased as of 2020-05-01" May 6, 2020
@hyangah
Contributor

hyangah commented May 7, 2020

I was thinking of skipping this test when it hits the inconsistent-ordering error, like many other execution trace tests already do because of #29707 (a minimal sketch of what that skip might look like is below). But if the failure rate is increasing, I'm afraid users will hit the same issue more frequently. Were there any changes during this cycle that could have made the timestamp information less accurate?

@aclements @mknyszek
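
For illustration, a minimal sketch of the kind of skip being suggested, in the style of other trace tests that tolerate this error. The helper runAndParseTrace is hypothetical; the real tests collect and parse traces via runtime/trace and internal/trace.

package trace_test

import (
	"errors"
	"strings"
	"testing"
)

// runAndParseTrace is a hypothetical stand-in for collecting an execution
// trace and parsing it; the real tests use runtime/trace and internal/trace.
func runAndParseTrace() error {
	return errors.New("no consistent ordering of events possible")
}

func TestTraceOrderingSketch(t *testing.T) {
	if err := runAndParseTrace(); err != nil {
		if strings.Contains(err.Error(), "no consistent ordering of events possible") {
			// Known flakiness tracked in golang.org/issue/29707.
			t.Skipf("skipping: %v", err)
		}
		t.Fatalf("failed to parse trace: %v", err)
	}
}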

@bcmills changed the title from "runtime/trace: TestEventBatch failures increased as of 2020-05-01" to "runtime/trace: test failures increased as of 2020-05-01" May 13, 2020
@bcmills
Contributor Author

bcmills commented May 13, 2020

https://storage.googleapis.com/go-build-log/2449e0ae/android-amd64-emu_518ac08d.log

--- FAIL: TestDirectSemaphoreHandoff (0.04s)
    trace_test.go:262: failed to trace the program: no consistent ordering of events possible
FAIL
exitcode=1
FAIL	cmd/trace	0.219s

@bcmills
Contributor Author

bcmills commented May 13, 2020

https://build.golang.org/log/d14354a0476763a9e3f92118417547d2859351d7

--- FAIL: TestTraceStressStartStop (0.08s)
    trace_test.go:147: failed to parse trace: no consistent ordering of events possible
FAIL
FAIL	runtime/trace	2.409s

@bcmills
Contributor Author

bcmills commented May 13, 2020

This has so far caused two consecutive SlowBot failures on https://golang.org/cl/233526.

I'm going to start a git bisect run to see if there is a specific change we can roll back to stabilize the tree.


@bcmills
Contributor Author

bcmills commented May 14, 2020

Rats, looks like I need a larger -count.

~/bisect/src$ git checkout HEAD~1
Previous HEAD position was f092be8fd8 Merge "cmd: merge branch 'dev.link' into master"
HEAD is now at e3b0e3d646 cmd/go: add 'go help buildconstraint'

~/bisect/src$ ./make.bash
Building Go cmd/dist using /usr/local/google/home/bcmills/go1.4. (go1.4-bootstrap-20170531 linux/amd64)
Building Go toolchain1 using /usr/local/google/home/bcmills/go1.4.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
Building Go toolchain3 using go_bootstrap and Go toolchain2.
Building packages and commands for linux/amd64.
---
Installed Go for linux/amd64 in /usr/local/google/home/bcmills/bisect
Installed commands in /usr/local/google/home/bcmills/bisect/bin

~/bisect/src$ ../bin/go test -count=200 -failfast -timeout=30m runtime/trace
--- FAIL: TestEventBatch (7.12s)
    --- FAIL: TestEventBatch/G=5047 (0.06s)
        trace_test.go:80: failed to parse trace: no consistent ordering of events possible
FAIL
FAIL    runtime/trace   33.080s
FAIL

Restarting bisection. 😞 At least I have a tighter upper bound now!

@bcmills
Contributor Author

bcmills commented May 14, 2020

New git bisect result with -count=256:

2491c5fd2451783e4ba6630345805de1e7761e3b is the first bad commit
commit 2491c5fd2451783e4ba6630345805de1e7761e3b
Author: Michael Anthony Knyszek <mknyszek@google.com>
Date:   Tue Nov 19 17:32:17 2019 +0000

    runtime: wake scavenger and update address on sweep done
…
    Reviewed-on: https://go-review.googlesource.com/c/go/+/207998

That was CL 207998, for #35788. (CC @mknyszek @aclements)


@bcmills
Contributor Author

bcmills commented May 14, 2020

Oh, that CL isn't too old to be the culprit after all! It was authored back in November but wasn't merged until April 30, and apparently Git+Gerrit preserve the author date from the first iteration of the commit rather than updating it on amend.

@mknyszek
Contributor

@bcmills There's a separate commit date.

Huh... I wonder if there's an edge missing in the trace somewhere. The scavenger is now sometimes awoken via sysmon, which is a fairly indirect way of unparking a goroutine.

@mknyszek
Contributor

@bcmills I forgot to say: thanks for bisecting. I'll dig into it.

@mknyszek self-assigned this May 14, 2020
@mknyszek
Contributor

Some progress: it's definitely the scavenger holding things up some of the time. I have a trace in which a GoStart event for the scavenger (which started in Gwaiting in the trace) appears before the corresponding GoUnblock event. Both are on the "frontier" for the parser, but neither can actually be consumed because they're out of order (the rest are held up for uninteresting reasons). The core problem is that, as a result, the sequence numbers for the GoStart and GoUnblock events are swapped.

I'm not totally sure how this is possible. Scavenger wake-ups are protected by a lock, and the first thing injectglist does is emit a GoUnblock event under that lock. GoStart events are only emitted right before a goroutine is about to be scheduled, after a goroutine exits a syscall and continues running, when a goroutine is starting a trace, or when GOMAXPROCS changes and the goroutine was running on a P that is being destroyed. It shouldn't be possible (or something would be really wrong) for a goroutine to start executing and then be added to some run queue, so there has to be some kind of skew in trace event emission.

Anyway, will continue to investigate. I feel close to the source of the issue, but the mechanism still eludes me.

@mknyszek
Contributor

Getting closer. It now looks more like a GoUnblock event is sometimes missing entirely when the scavenger is awoken, which surfaces as an out-of-order error.

@mknyszek
Contributor

mknyszek commented May 18, 2020

I think I figured it out (and maybe the cause of many of our recent trace ordering inconsistencies). The main problem is trace events that sysmon should emit but doesn't.

trace.enabled is switched on during a stop-the-world, but sysmon isn't stopped by the STW. sysmon can generate trace events, though, which means that if the change to trace.enabled races with sysmon unparking a goroutine, sysmon may fail to emit the trace event when it really should. The result is missing events, which surface as an "inconsistent trace order" error: from the parser's point of view, a missing event is unfortunately pretty much indistinguishable from genuinely out-of-order events.

The recent change to have the scavenger occasionally awoken by sysmon simply exacerbates this issue and makes it more likely to occur (especially in tests, where we stop the world to start tracing a lot). I've observed this with a debuglog trace of scavenger wake-ups, and it appears that sysmon is involved in the wake-up whenever there's a missing event.

Not sure what the fix is here. Stopping sysmon for STW seems like it would require a lot of work. It's tricky because switching on tracing requires the system to reach a stable state from which it can emit a consistent trace. Maybe specifically for these STWs we block until sysmon also reaches a stable state? Basically, we block (in only the trace case) until sysmon preempts itself, and we set some special flag so that sysmon knows it should preempt itself.

CC @aclements @hyangah @ianlancetaylor

Possibly related: #29707
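
To make the race described above concrete, here is a small, deterministic Go replay of the bad interleaving. This is illustrative only; traceEnabled and the string events stand in for trace.enabled and the real trace events, not the runtime's actual code.

package main

import "fmt"

// traceEnabled stands in for trace.enabled: StartTrace sets it during its
// stop-the-world, but sysmon keeps running and reads it concurrently.
var traceEnabled bool

// events stands in for the emitted trace event stream.
var events []string

// sysmonWake models sysmon waking a goroutine (e.g. the scavenger) via
// injectglist: the GoUnblock event is emitted only if tracing appears
// enabled at that instant.
func sysmonWake(g string) {
	if traceEnabled {
		events = append(events, "GoUnblock "+g)
	}
	// ...the goroutine is made runnable regardless...
}

func main() {
	// StartTrace (world stopped) records goroutines currently in _Gwaiting.
	events = append(events, "GoWaiting scavenger")

	// sysmon is not stopped by the STW; it wakes the scavenger before
	// StartTrace has set the flag, so no GoUnblock is recorded.
	sysmonWake("scavenger")

	// StartTrace now enables tracing and starts the world; the scavenger
	// runs and emits GoStart.
	traceEnabled = true
	events = append(events, "GoStart scavenger")

	// Prints [GoWaiting scavenger GoStart scavenger]: the GoUnblock is missing.
	fmt.Println(events)
}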

@mknyszek
Contributor

Not sure what the fix is here. Stopping sysmon for STW seems like it would require a lot of work. It's tricky because switching on tracing requires the system to reach a stable state from which it can emit a consistent trace. Maybe specifically for these STWs we block until sysmon also reaches a stable state? Basically, we block (in only the trace case) until sysmon preempts itself, and we set some special flag so that sysmon knows it should preempt itself.

By "a lot of work" I mean "a lot of work to make efficient." Because sysmon can sleep for up to 10ms, waiting for it could really hurt STW latency. That's why I'm suggesting we only do this for trace STWs: I don't think those have the same latency SLO as other STW events (such as for GC).

@ianlancetaylor
Contributor

Interesting result.

I would like to believe that there is a limited number of trace events that sysmon can generate. Like, maybe they can only happen in the retake function. Could we just skip retake when the world is stopped? Or could we annotate any trace events generated by retake to indicate that they come from sysmon, and that therefore the ordering might be surprising?

@mknyszek
Contributor

mknyszek commented May 18, 2020

Interesting result.

I would like to believe that there is a limited number of trace events that sysmon can generate. Like, maybe they can only happen in the retake function. Could we just skip retake when the world is stopped? Or could we annotate any trace events generated by retake to indicate that they come from sysmon, and that therefore the ordering might be surprising?

What I've been observing is that they come from (or rather, should come from) injectglist after netpoll, and from the scavenger wake (this is the big one, and what leads to this issue). The problem isn't bad ordering of events; it's that events the execution trace parser expects are not emitted at all. I'm not really sure there's much here for the parser to just ignore. When I mentioned above that sysmon is involved, I mean that I actually saw it casgstatus the scavenger to runnable in injectglist without emitting a trace event.

@ianlancetaylor
Contributor

Obviously I'm missing something, but I don't see how injectglist could fail to generate a trace event.

@mknyszek
Contributor

mknyszek commented May 18, 2020

Here's a more concrete scenario:

  1. StartTrace is called and stops the world. It hasn't yet set trace.enabled.
  2. sysmon sees that it should wake the scavenger, or netpoll indicates there are goroutines to be awoken.
  3. StartTrace emits a bunch of GoWaiting events for all the goroutines in _Gwaiting.
  4. sysmon calls injectglist but doesn't observe trace.enabled as true, so it doesn't emit any GoUnblock events.
  5. StartTrace sets trace.enabled, non-atomically (because of the non-atomic access, the ordering between 4 and 5 is even less strict).
  6. StartTrace starts the world. The goroutines which were waiting get scheduled and emit GoStart events.

We now have a trace in which goroutines go from GoWaiting -> GoStart, which is invalid: the parser expects a GoUnblock event in between, but one was never emitted. The parser tries to connect up the events, can't, and so reports that no consistent ordering is possible.
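
A toy version of the ordering check makes that failure mode visible. This is illustrative only; the real parser in internal/trace is far more involved.

package main

import (
	"errors"
	"fmt"
)

// checkOrdering requires a GoUnblock between GoWaiting and GoStart for a
// single goroutine, loosely mirroring the invariant the trace parser enforces.
func checkOrdering(events []string) error {
	state := "unknown"
	for _, ev := range events {
		switch ev {
		case "GoWaiting":
			state = "waiting"
		case "GoUnblock":
			state = "runnable"
		case "GoStart":
			if state == "waiting" {
				return errors.New("no consistent ordering of events possible")
			}
			state = "running"
		}
	}
	return nil
}

func main() {
	// The stream from the scenario above: the GoUnblock from sysmon was
	// never emitted.
	fmt.Println(checkOrdering([]string{"GoWaiting", "GoStart"}))
}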

@hyangah
Contributor

hyangah commented May 18, 2020

Thanks @mknyszek for investigating it. What will happen if we relax the trace ordering requirement, and allow GoWaiting -> GoStart transitions (and inject a dummy GoUnblock event)?
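
For illustration, one possible shape of that relaxation, written as a variant of the toy checker above. This is hypothetical, not the actual internal/trace code.

package main

import "fmt"

// relaxedOrdering tolerates a GoWaiting -> GoStart transition by injecting
// a synthetic GoUnblock, roughly the relaxation being asked about here.
func relaxedOrdering(events []string) []string {
	var out []string
	state := "unknown"
	for _, ev := range events {
		if ev == "GoStart" && state == "waiting" {
			out = append(out, "GoUnblock (synthesized)")
		}
		switch ev {
		case "GoWaiting":
			state = "waiting"
		case "GoUnblock":
			state = "runnable"
		case "GoStart":
			state = "running"
		}
		out = append(out, ev)
	}
	return out
}

func main() {
	// Prints [GoWaiting GoUnblock (synthesized) GoStart].
	fmt.Println(relaxedOrdering([]string{"GoWaiting", "GoStart"}))
}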

@hyangah
Contributor

hyangah commented May 18, 2020

cc/ @dvyukov @pjweinbgo who know better about the ordering algorithm.

@mknyszek
Contributor

@hyangah That will probably work? The GoStart events definitely won't be emitted until the world starts again, so there should at least be those two. But I'm not sure what other events sysmon could possibly emit; I'm worried there's more lurking here that I haven't observed (@ianlancetaylor's comment that retake could generate some events has me worried; I wasn't even looking there!).

@ianlancetaylor
Contributor

If the issue is specific to StartTrace, then perhaps StartTrace could suspend sysmon. And perhaps StopTrace as well. It's definitely a hack but it might avoid these issues.

(By the way, retake checks trace.enabled itself.)

@mknyszek
Contributor

If the issue is specific to StartTrace, then perhaps StartTrace could suspend sysmon. And perhaps StopTrace as well. It's definitely a hack but it might avoid these issues.

(By the way, retake checks trace.enabled itself.)

That's what I was getting at in an earlier comment. I tried to actually implement this, and it's a bit tricky. Currently sysmon may suspend itself during a STW if sched.gcwaiting is set, but that looks like a "best effort" strategy, and even then it can wake up early if a timer is about to fire. So suspending sysmon isn't trivial until I understand the motivation behind this best-effort suspension. (I traced gcwaiting all the way back to when it was introduced in the C code, but nothing in the documentation indicates why sysmon responds to it at all; if anyone has any historical perspective here, that would be very useful.)

@ianlancetaylor
Contributor

I get that it's hard to suspend sysmon during STW, but that's not what I was suggesting. I was suggesting something specific to StartTrace and possibly StopTrace. It could perhaps be as simple as acquiring some lock that sysmon also acquires each time through its loop, and only then stopping the world.
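
For illustration, a minimal sketch of that shape, using a plain sync.Mutex and simplified stand-in names; the runtime's actual fix uses its own internal locking, not this code.

package main

import (
	"sync"
	"time"
)

// sysmonLock stands in for a lock that sysmon takes around each pass of its
// loop, so StartTrace/StopTrace can exclude it while trace state changes.
var sysmonLock sync.Mutex

func sysmonLoop(stop <-chan struct{}) {
	for {
		select {
		case <-stop:
			return
		default:
		}
		sysmonLock.Lock()
		// ...retake, netpoll, wake the scavenger, etc...
		sysmonLock.Unlock()
		time.Sleep(time.Millisecond) // the real sysmon sleep is adaptive, up to 10ms
	}
}

func startTrace() {
	// Take the lock before stopping the world: this guarantees sysmon is not
	// in the middle of waking goroutines while tracing is switched on.
	sysmonLock.Lock()
	defer sysmonLock.Unlock()
	// stopTheWorld(); set up trace buffers; set trace.enabled; startTheWorld()
}

func main() {
	stop := make(chan struct{})
	go sysmonLoop(stop)
	startTrace()
	close(stop)
}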

@mknyszek
Contributor

I think we're on the same page here; I don't mean anything more than StartTrace and StopTrace either.

Even with a simple lock it's non-trivial, because you need to wait until sysmon has actually paused so that it isn't, for example, in the middle of manipulating goroutines while trace state is being manipulated. That opens up the opportunity for deadlock. For example, if you do something simple where sysmon just grabs the lock at the top of its loop, you could get unlucky: sysmon notices gcwaiting first and suspends itself (and it won't resume until a start-the-world), which is a deadlock. You can work around this by placing the lock acquisitions in careful places, or by checking whether sysmon has suspended itself (and won't wake up prematurely due to a timer), but my overall point is that it's still relatively tricky, especially since we're in a freeze.

Maybe I'm missing something and there is a simple, obviously-correct way to do this, but I tried pretty much what you suggested and it was deadlocks galore.

@ianlancetaylor
Contributor

It doesn't seem that bad to me since sysmon only suspends itself in known places, but if you tried and it failed that clearly beats my speculations.

@mknyszek
Contributor

@ianlancetaylor Sorry, you're totally right. I was overthinking it. A lock around all the things sysmon does works fine. I'll have a CL up shortly.

@gopherbot

Change https://golang.org/cl/234617 mentions this issue: runtime: synchronize StartTrace and StopTrace with sysmon

@mknyszek
Contributor

Change https://golang.org/cl/234617 mentions this issue: runtime: synchronize StartTrace and StopTrace with sysmon

After this change, no failure in TestTraceStressStartStop on linux/amd64 after 2048 consecutive runs.

@golang golang locked and limited conversation to collaborators May 21, 2021