
test: locklinear.go failures with "lockmany: too slow" #32986

Closed
bcmills opened this issue Jul 8, 2019 · 22 comments

Labels: FrozenDueToAge, NeedsInvestigation, release-blocker, Testing
Milestone: Go1.19


bcmills commented Jul 8, 2019

From the illumos-amd64-joyent builder, in https://build.golang.org/log/b60463ada3a04bbfdb7f714e54d819176e60a833:

##### ../test
# go run run.go -- locklinear.go
exit status 1
panic: lockmany: too slow: 1000 ops: 22.164624ms; 2000 ops: 433.387975ms

1000 42.726699ms 2000 43.00828ms (1.0X)
1000 13.237367ms 2000 40.902635ms (3.1X)
1000 9.373097ms 2000 39.993182ms (4.3X)
1000 5.415665ms 2000 29.961344ms (5.5X)
1000 22.164624ms 2000 433.387975ms (19.6X)


goroutine 1 [running]:
main.checkLinear(0x4d8124, 0x8, 0x3e8, 0x4df5f0)
	/tmp/workdir-host-illumos-amd64-joyent/go/test/locklinear.go:73 +0x590
main.main()
	/tmp/workdir-host-illumos-amd64-joyent/go/test/locklinear.go:134 +0x79
exit status 2

FAIL	locklinear.go	11.857s

See previously #19276.

bcmills added the NeedsInvestigation and OS-illumos labels Jul 8, 2019
bcmills added this to the Go1.14 milestone Jul 8, 2019
bcmills added the Testing label Jul 8, 2019
bcmills commented Oct 7, 2019

rsc commented Oct 8, 2019

Is the machine just overloaded?

rsc commented Oct 8, 2019

The fact that the timings for 1000 ops are all over the map (42ms, 13ms, 9ms, 5ms, 22ms) is a sign that the machine is just overloaded. Maybe if the test can't even get something consistent for 1000 ops it should just skip itself.

rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
bcmills changed the title from "test: locklinear.go panic in lockmany on illumos-amd64-joyent builder" to "test: locklinear.go failures with "lockmany: too slow"" Jan 7, 2021
bcmills removed the OS-illumos label Jan 7, 2021
bcmills commented Jan 7, 2021

CC @aclements

bcmills commented Jan 12, 2022

greplogs --dashboard -md -l -e 'lockmany: too slow' --since=2021-04-23

2021-09-22T15:46:33-5ee32ff/solaris-amd64-oraclerel
2021-06-04T17:21:49-3a9d906/illumos-amd64

aclements commented:

I did a large number of all.bash runs on FreeBSD for other purposes, but got a fair number of locklinear failures in the process.

On freebsd-amd64-12_2, 10 of 898 runs failed in locklinear (1.1%). On freebsd-amd64-12_3, 6 of 912 runs failed in locklinear (0.66%). Those are fairly different rates given the number of runs, which I find a little surprising, but I think they give us a sense of the ambient failure rate of this test when it's run in the context of everything else all.bash does.

bcmills commented Feb 15, 2022

greplogs --dashboard -md -l -e 'lockmany: too slow' --since=2022-01-13

2022-02-11T20:01:24-bcee121/windows-386-2008

(Note that windows/386 and linux/arm are both first-class ports.)

bcmills commented Feb 15, 2022

Given that this test won't be run by users as part of go test all, I don't think this needs to block the Go 1.18 release, but since it seems to affect first-class ports I do think we should either fix it or skip/remove it for 1.19.

(That is: we should make a decision as to whether the test is important enough to deflake. If it is, we should deflake it; otherwise, we should suppress the flakes.)

bcmills modified the milestones: Backlog, Go1.19 Feb 15, 2022
jeremyfaller removed their assignment Mar 23, 2022
heschi commented May 11, 2022

cc @golang/runtime

heschi added the okay-after-beta1 label May 11, 2022
mknyszek commented:

I think the test is useful, i.e. verifying that the asymptotic behavior of contended Mutexes is linear. Currently, when it does fail, it has to fail the check at least 5 times in a row, which at face value seems like a high bar.

Looking more closely at the test, I see it already has a few workarounds for flakiness that appear to have been added in the few months after it landed. There's an inherent flakiness to this kind of test, especially since it calls runtime.Gosched.

One thing that jumps out at me, though, is that the test doesn't ever block to clean up goroutines after each attempt, so theoretically it's possible that one failure could exacerbate subsequent ones if the problem is transient oversubscription (especially given the runtime.Gosched calls; the test creates 1000 or 2000 goroutines each time it tries). 5 is still a lot of consecutive failures for that to be the only culprit though.

Another thing that seems kind of suspicious is that lockone generates contention on locks[1] and locks[offset], but lockmany generates contention on locks[1] and locks[0]. offset is 251, which matches the size of the runtime's semaphore table; that table is indexed by the lock's 8-byte word address. Based on this, neither locks[offset] nor locks[0] collides with locks[1] in the semaphore table, which is likely intentional. However, locks[0] and locks[1] are definitely in the same cache line, whereas locks[1] and locks[offset] are not (and given that offset is prime, it's very unlikely there's some other kind of false sharing going on), so in lockmany the lock fast path will contend on cache coherency. Note that this test uses RWMutex values, which are 24 bytes in size. (This might be an oversight? The test doesn't use the read-lock functionality at all, while a Mutex is exactly 8 bytes.)
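
To make the layout concrete, here is a minimal sketch of the two contention patterns described above. This is a hypothetical simplification, not the actual test source: the helper name and loop structure are invented, and only the choice of which array elements get hammered mirrors the test.

```go
package main

import (
	"runtime"
	"sync"
)

const offset = 251 // prime; matches the runtime's semaphore-table size

// contend spawns n goroutines per lock that each lock, yield, and unlock,
// then waits for all of them to finish (heavily simplified).
func contend(n int, a, b *sync.RWMutex) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(2)
		go func() {
			defer wg.Done()
			a.Lock()
			runtime.Gosched()
			a.Unlock()
		}()
		go func() {
			defer wg.Done()
			b.Lock()
			runtime.Gosched()
			b.Unlock()
		}()
	}
	wg.Wait()
}

func main() {
	var locks [offset + 1]sync.RWMutex
	// lockone-style pair: locks[1] and locks[offset] live in different cache lines.
	contend(1000, &locks[1], &locks[offset])
	// lockmany-style pair: locks[0] and locks[1] are adjacent 24-byte RWMutexes,
	// so their lock fast paths fight over the same cache line.
	contend(1000, &locks[1], &locks[0])
}
```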

Maybe on these slow platforms there's some kind of cache coherency breakdown? It's hard for me to tell whether the 0-vs-offset difference is intentional: there aren't any comments in the review (https://go-review.googlesource.com/c/go/+/37103) pointing out the difference, the commit message doesn't discuss it, and the test file itself doesn't have any explanation.

I'm inclined to do two things to make this test more robust:

  1. Block on cleanup after obtaining the time-to-run data. Maybe also exclude the setup time from the timing call (just to narrow the scope of the test a bit more).
  2. Use offset instead of 0 in lockmany.

And then we wait to see if it fails again. Does this seem reasonable? If we don't want to perturb things too much, we can just do (2).

mknyszek commented:

I made the aforementioned changes, and they definitely help a bit. Before my changes, even on my linux/amd64 GCP instance I got results that were all over the place, forcing a number of retries. There's still often at least one retry after my changes.

But, thinking more about the test, comparing individual runs is always going to be fraught, so I instead made sure 10 runs are done apiece and ran a t-test on the results, just to see what would happen. The p-value threshold I set is 0.005; if a result doesn't meet that, I discard it and complete the test. If it does, I just make sure the 2n case's mean is within 3x of the n case's.
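
For illustration, here is a rough sketch of computing Welch's t-statistic over the two sets of timings. The helper names and the sample data in main are made up, and converting t and the degrees of freedom into a p-value needs the Student's t CDF, which is omitted from this sketch.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// meanVar returns the sample mean and unbiased sample variance, in seconds.
func meanVar(xs []time.Duration) (mean, variance float64) {
	n := float64(len(xs))
	for _, x := range xs {
		mean += x.Seconds()
	}
	mean /= n
	for _, x := range xs {
		d := x.Seconds() - mean
		variance += d * d
	}
	variance /= n - 1
	return
}

// welch returns Welch's t-statistic and its degrees of freedom for two
// independent samples with possibly unequal variances.
func welch(a, b []time.Duration) (t, df float64) {
	ma, va := meanVar(a)
	mb, vb := meanVar(b)
	sa, sb := va/float64(len(a)), vb/float64(len(b))
	t = (mb - ma) / math.Sqrt(sa+sb)
	df = (sa + sb) * (sa + sb) /
		(sa*sa/float64(len(a)-1) + sb*sb/float64(len(b)-1))
	return
}

func main() {
	// Made-up timings purely to exercise the helpers: 10 runs at n ops
	// and 10 runs at 2n ops.
	runsN := make([]time.Duration, 10)
	runs2N := make([]time.Duration, 10)
	for i := range runsN {
		runsN[i] = time.Duration(10+i) * time.Millisecond
		runs2N[i] = time.Duration(21+2*i) * time.Millisecond
	}
	t, df := welch(runsN, runs2N)
	fmt.Printf("t = %.2f, df = %.1f\n", t, df)
	// Decision rule described above: if the difference is not significant
	// at p < 0.005, discard the result and pass; otherwise require the 2n
	// mean to be within ~3x of the n mean.
}
```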

AFAICT this is more robust than before? In theory, if there's no significant difference (due to e.g. noise) or the difference is inverted for some reason, we can just discard the result entirely. We can also tune our p-value threshold until we feel comfortable with the rate of false positives (though the 0.005 threshold and the fairly lax constraints should already make that very low, and it should still detect real regressions). It does unfortunately make the test slower, but the methodology is more rigorous.

This method of testing also seems like it could be useful for other tests in this directory (like maplinear.go).

mknyszek commented Jun 1, 2022

It occurs to me that I never stated this plainly: I don't think this failure represents a real regression. I think it's very clear from all my runs that the invariant the test intends to check is generally upheld, and I also don't see any significant regression here in terms of how often it's upheld. What I understand this issue to be about is a flaky test and trying to make it less flaky.

bcmills commented Jun 1, 2022

What I understand this issue to be about is a flaky test and trying to make it less flaky.

I agree, and I think the current labels (okay-after-beta1, Testing) also reflect that.

But I also think we do need to make a decision about what information the test is providing. If these lockmany: too slow failures aren't providing us with information about a regression, then I don't want to see them in TryBot runs or post-commit test runs, and I don't think users who run all.bash will want to see them either. (Note that our installing from source instructions currently instruct users to run all.bash.)

So I do think the action here is to decide: (a) is the information provided by the test worth the cost to deflake? and (b) if so, how should we deflake it? (If the answer to (a) is “no”, then we can just delete it instead. 😅)

bcmills commented Jun 1, 2022

Looking at the test body and the most recent windows/386 failure in the comments, I see:

# go run run.go -G=3 locklinear.go
exit status 2
panic: lockmany: too slow: 1000 ops: 6.836ms; 2000 ops: 403.3203ms

1000 8.7891ms 2000 32.2265ms (3.7X)
1000 21.4844ms 2000 32.2266ms (1.5X)
1000 24.414ms 2000 10.7422ms (0.4X)
1000 19.5313ms 2000 14.6484ms (0.7X)
1000 18.5547ms 2000 16.6016ms (0.9X)
1000 7.8125ms 2000 41.9921ms (5.4X)
1000 4.8829ms 2000 26.3671ms (5.4X)
1000 6.836ms 2000 403.3203ms (59.0X)

That looks like this arbitrary-timeout cutoff:
https://cs.opensource.google/go/go/+/master:test/locklinear.go;l=68;drc=97677273532cf1a4e8b181c242d89c0be8c92bb6

In addition to that cutoff, there is the 5-consecutive-failure cutoff (which results in failure if the running time was above the timeout), and an additional “5 inversions means too noisy” cutoff (https://cs.opensource.google/go/go/+/master:test/locklinear.go;l=54-57;drc=97677273532cf1a4e8b181c242d89c0be8c92bb6).

But the test doesn't actually increase n for the failures before the fifth one, so it seems to me that the test isn't reducing its own noise and is failing if it happens to get a single unusually-slow run on the fifth-or-later attempt.

So I wonder: would it make sense to increase n on every failure instead of just the 5th-or-later (to start raising the signal-to-noise ratio earlier), and perhaps also remove or substantially increase the arbitrary 400ms cutoff (to avoid failing the test if there is a single outlier)?
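
To make that concrete, here is a hypothetical sketch of such a retry loop; the function names, the toy workload, and the 3x linearity bound are all illustrative rather than taken from the actual test:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// timeOps measures a trivial contended-lock workload at a given n; it stands
// in for the test's lockone/lockmany functions.
func timeOps(n int) time.Duration {
	var mu sync.Mutex
	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			mu.Unlock()
		}()
	}
	wg.Wait()
	return time.Since(start)
}

// checkLinearish doubles n on every failed attempt, so later attempts have a
// better signal-to-noise ratio, instead of retrying at the same n and only
// growing it from the fifth failure onward.
func checkLinearish(tries, n0 int) bool {
	n := n0
	for try := 0; try < tries; try++ {
		t1 := timeOps(n)
		t2 := timeOps(2 * n)
		if t2 < 3*t1 { // roughly linear: doubling ops less than triples time
			return true
		}
		n *= 2 // raise n on *every* failure, not just the 5th and later
	}
	return false
}

func main() {
	fmt.Println("linear enough:", checkLinearish(5, 1000))
}
```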

gopherbot commented:

Change https://go.dev/cl/409894 mentions this issue: test/locklinear.go: improve test robustness

mknyszek commented Jun 1, 2022

WDYT about my previous comment on just running a t-test (like benchstat)? I agree that the noise reduction attempts here aren't particularly rigorous. That, and removing the arbitrary cutoff (I left it in for now, but yeah it likely just makes things worse; a consistent failure due to high constant overheads would be better, so we can just fix the test). I put up https://go-review.googlesource.com/c/go/+/409894 if you'd like to take a look.

There's a ton of copied code, but the same infrastructure could come in handy for other tests. It's unfortunate that we can't really depend on x/perf here, because that would avoid all the duplication. Honestly, this is kind of a performance test, and I think we should treat it like one. That might mean pulling this out and merging it into the rest of the performance tests once we have a dashboard.

bcmills commented Jun 1, 2022

WDYT about my previous comment on just running a t-test (like benchstat)?

TBH my statistics chops were never great to begin with, and have only dulled further with time. 😅

I agree that a t-test seems like the right approach in general, although I'm not sure what an appropriate p-value would be for this kind of test.

I do worry a bit that the test could decay if some regression causes it to consistently miss the significance threshold. Maybe we could guard against that by keeping the “increasing n” loop, and either failing the test if it fails to attain significance after some reasonable number of retries, or perhaps if it fails to attain significance on, say, the -longtest builders?

aclements commented:

@mknyszek, I think you've got the right idea with a t-test here. In particular, don't have any retries, and instead just collect several samples from both the less and more contended scenarios to build up two distributions, and then make a decision based on those distributions. I haven't looked at your code closely, but your commit message describes what I think is the right approach.

I do worry that picking a p-value threshold is effectively picking how flaky we're willing to allow this test to be, but nevertheless I think this is worth a shot. It's at least a lot more principled.

Repeating the significance test if it fails sort of defeats the purpose of a significance test, so I don't think we should do that.

Increasing n might help to smooth out the results, but that's basically just baking in a mean over a larger sample period. In general, statistical analysis should do better if we give it more samples with less pre-filtering, so it has as much information as possible.

gopherbot removed the okay-after-beta1 label Jun 10, 2022
gopherbot commented:

Change https://go.dev/cl/411396 mentions this issue: sync: move lock linearity test and treat it like a performance test

gopherbot commented:

Change https://go.dev/cl/411395 mentions this issue: internal/testmath: add two-sample Welch's t-test for performance tests

gopherbot pushed a commit that referenced this issue Jun 13, 2022
This CL copies code from github.com/aclements/go-moremath/stats and
github.com/aclements/go-moremath/mathx for Welch's t-test. Several
existing tests in the Go repository check performance and scalability,
and this import is part of a move toward a more rigorous measurement of
both.

Note that the copied code is already licensed to the Go Authors, so there's
no need to worry about additional licensing considerations.

For #32986.

Change-Id: I058630fab7216d1a589bb182b69fa2231e6f5475
Reviewed-on: https://go-review.googlesource.com/c/go/+/411395
Reviewed-by: Michael Pratt <mpratt@google.com>
gopherbot pushed a commit that referenced this issue Jun 13, 2022
This change moves test/locklinear.go into the sync package tests, and
adds a bit of infrastructure, since there are other linearity-checking
tests that could benefit from it too. This infrastructure is also
different from what test/locklinear.go does: instead of trying really
hard to get at least one success, we treat this like a performance
test and look for a significant difference via a t-test.

This makes the methodology behind the tests more rigorous, and should
reduce flakiness as transient noise should produce an insignificant
result. A follow-up CL does more to make these tests even more robust.

For #32986.

Change-Id: I408c5f643962b70ea708930edb4ac9df1c6123ce
Reviewed-on: https://go-review.googlesource.com/c/go/+/411396
Reviewed-by: Michael Pratt <mpratt@google.com>
golang locked and limited conversation to collaborators Jun 13, 2023