Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime/pprof: TestMutexProfile flakiness #19139

Closed
randall77 opened this issue Feb 17, 2017 · 12 comments
Closed

runtime/pprof: TestMutexProfile flakiness #19139

randall77 opened this issue Feb 17, 2017 · 12 comments
Labels
FrozenDueToAge Testing An issue that has been verified to require only test changes, not just a test failure.
Milestone

Comments

@randall77
Copy link
Contributor

On the builders, I see TestMutexProfile occasionally failing with

--- FAIL: TestMutexProfile (0.01s)
	pprof_test.go:532: expected 6 lines, got 3 "--- mutex:\ncycles/second=2362842166\nsampling period=1"
		--- mutex:
		cycles/second=2362842166
		sampling period=1
FAIL
FAIL	runtime/pprof	1.298s

Maybe freebsd only?

@pjweinb

Some builder logs:
https://build.golang.org/log/c625cca078dac40bc5f1f50b62891c04b52308c5
https://build.golang.org/log/118de984fc0903fe93c3e677962cd7965637a2a3

@pjweinb
Copy link

pjweinb commented Feb 17, 2017 via email

@bradfitz bradfitz added OS-OpenBSD Testing An issue that has been verified to require only test changes, not just a test failure. labels Feb 17, 2017
@bradfitz bradfitz added this to the Go1.9 milestone Feb 17, 2017
@bradfitz
Copy link
Contributor

I also saw it on OpenBSD in #19146 (closed as dup of this).

@pjweinb
Copy link

pjweinb commented Feb 17, 2017 via email

@ianlancetaylor
Copy link
Contributor

@aclements has a tool that scans the builder logs for test failures.

@aclements
Copy link
Member

$ greplogs -dashboard -E "TestMutexProfile" -E "expected 6 lines" -l -md
2017-02-16T17:34:39-fc456c7/freebsd-386-gce101
2017-02-16T17:34:39-fc456c7/freebsd-amd64-race
2017-02-16T17:52:15-990124d/linux-386
2017-02-16T21:54:59-661e217/freebsd-amd64-gce93
2017-02-16T21:54:59-661e217/linux-ppc64le-buildlet
2017-02-16T22:42:36-794f1eb/freebsd-amd64-race
2017-02-17T06:16:44-708ba22/freebsd-amd64-gce93
2017-02-17T14:36:27-79f6a5c/windows-386-gce

$ greplogs -dashboard -E "TestMutexProfile" -E "expected 6 lines" -l | findflakes -paths
First observed fc456c7 16 Feb 17:34 2017 (8 commits ago)
Last observed  79f6a5c 17 Feb 14:36 2017 (1 commits ago)
29% chance failure is still happening
71% failure probability (6 of 8 commits)
Likely culprits:
   71% fc456c7 cmd/compile/internal/gc: drop unused src.XPos params in SSA builder
   20% 58d7621 runtime: run mutexevent profiling without holding semaRoot lock
No known past failures

@pjweinb
Copy link

pjweinb commented Feb 17, 2017 via email

@mdempsky
Copy link
Member

Seeing as it's also appearing on Linux and Windows, it appears OS agnostic.

@gopherbot
Copy link

CL https://golang.org/cl/37291 mentions this issue.

gopherbot pushed a commit that referenced this issue Feb 20, 2017
Flaky tests hurt productivity. Disable for now.

Updates #19139

Change-Id: I2e3040bdf0e53597a1c4f925b788e3268ea284c1
Reviewed-on: https://go-review.googlesource.com/37291
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Peter Weinberger <pjw@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@aclements
Copy link
Member

@pjweinb, any progress on this?

@aclements
Copy link
Member

one suspects 58d7621 runtime: run mutexevent profiling without holding semaRoot lock

Never trust a commit that says "Clearly safe". :)

I've confirmed that I can reproduce this fairly easily on master (738739f), but have been unable to reproduce it after locally reverting that change.

@aclements
Copy link
Member

Here's my hunch:

  1. Let G1 be the test's main goroutine. It acquires mu, starts a new goroutine G2, and blocks on the second mu.Lock().
  2. G2 reaches the mu.Unlock(), which eventually calls runtime.semrelease1. Before recording the mutex event, semrelease1 does unlock(&root.lock).
  3. We switch to G1, where the mu.Lock() is now allowed to proceed. The test returns from blockMutex and captures the mutex profile.
  4. At some point in the future, G2 resumes and records its mutex event, but it's too late.

Some evidence that this is what's going on: If you add defer runtime.GOMAXPROCS(runtime.GOMAXPROCS(1)) to the top of TestMutexProfile, the failure rate increases by a factor of 30. This could be because we can no longer complete G2 on a separate thread before G1 collects the profile. However, if you additionally add runtime.Gosched() to the end of blockMutex after it acquires the lock for the second time, the failure rate drops to 0. This could be because it's now forcing the scheduler to give G2 a change to finish up.

The good news is, if I'm right, this is purely a test bug.

@gopherbot
Copy link

CL https://golang.org/cl/45751 mentions this issue.

@golang golang locked and limited conversation to collaborators Jun 14, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge Testing An issue that has been verified to require only test changes, not just a test failure.
Projects
None yet
Development

No branches or pull requests

7 participants