
test: locklinear.go failures with "lockmany: too slow" #32986

Closed
bcmills opened this issue Jul 8, 2019 · 22 comments

Labels: FrozenDueToAge, NeedsInvestigation, release-blocker, Testing
Milestone: Go1.19


bcmills commented Jul 8, 2019

From the illumos-amd64-joyent builder, in https://build.golang.org/log/b60463ada3a04bbfdb7f714e54d819176e60a833:

##### ../test
# go run run.go -- locklinear.go
exit status 1
panic: lockmany: too slow: 1000 ops: 22.164624ms; 2000 ops: 433.387975ms

1000 42.726699ms 2000 43.00828ms (1.0X)
1000 13.237367ms 2000 40.902635ms (3.1X)
1000 9.373097ms 2000 39.993182ms (4.3X)
1000 5.415665ms 2000 29.961344ms (5.5X)
1000 22.164624ms 2000 433.387975ms (19.6X)


goroutine 1 [running]:
main.checkLinear(0x4d8124, 0x8, 0x3e8, 0x4df5f0)
	/tmp/workdir-host-illumos-amd64-joyent/go/test/locklinear.go:73 +0x590
main.main()
	/tmp/workdir-host-illumos-amd64-joyent/go/test/locklinear.go:134 +0x79
exit status 2

FAIL	locklinear.go	11.857s

See previously #19276.

bcmills added the NeedsInvestigation and OS-illumos labels Jul 8, 2019
bcmills added this to the Go1.14 milestone Jul 8, 2019
bcmills added the Testing label Jul 8, 2019
bcmills commented Oct 7, 2019

rsc commented Oct 8, 2019

Is the machine just overloaded?

rsc commented Oct 8, 2019

The fact that the timings for 1000 ops are all over the map (42ms, 13ms, 9ms, 5ms, 22ms) is a sign that the machine is just overloaded. Maybe if the test can't even get something consistent for 1000 ops it should just skip itself.

rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
bcmills changed the title from "test: locklinear.go panic in lockmany on illumos-amd64-joyent builder" to "test: locklinear.go failures with "lockmany: too slow"" Jan 7, 2021
bcmills removed the OS-illumos label Jan 7, 2021
bcmills commented Jan 7, 2021

CC @aclements

bcmills commented Jan 12, 2022

greplogs --dashboard -md -l -e 'lockmany: too slow' --since=2021-04-23

2021-09-22T15:46:33-5ee32ff/solaris-amd64-oraclerel
2021-06-04T17:21:49-3a9d906/illumos-amd64

aclements commented:

I did a large number of all.bash runs on FreeBSD for other purposes, but got a fair number of locklinear failures in the process.

On freebsd-amd64-12_2, 10 of 898 runs failed in locklinear (1.1%). On freebsd-amd64-12_3, 6 of 912 runs failed in locklinear (0.66%). Those are fairly different rates given the number of runs, which I find a little surprising, but I think they give us a sense of the ambient failure rate of this test when it's run in the context of everything else all.bash does.

bcmills commented Feb 15, 2022

greplogs --dashboard -md -l -e 'lockmany: too slow' --since=2022-01-13

2022-02-11T20:01:24-bcee121/windows-386-2008

(Note that windows/386 and linux/arm are both first-class ports.)

bcmills commented Feb 15, 2022

Given that this test won't be run by users as part of go test all, I don't think this needs to block the Go 1.18 release, but since it seems to affect first-class ports I do think we should either fix it or skip/remove it for 1.19.

(That is: we should make a decision as to whether the test is important enough to deflake. If it is, we should deflake it; otherwise, we should suppress the flakes.)

bcmills modified the milestones: Backlog, Go1.19 Feb 15, 2022
jeremyfaller removed their assignment Mar 23, 2022
heschi commented May 11, 2022

cc @golang/runtime

heschi added the okay-after-beta1 label May 11, 2022
mknyszek commented:

I think the test is useful, i.e. verifying that the asymptotic behavior of contended Mutexes is linear. Currently, when it does fail, it has to fail the check at least 5 times in a row, which at face value seems like a high bar.

Looking more closely at the test, I see it already has a few workarounds for flakiness that appear to have been added in the few months after it landed. There's an inherent flakiness to this kind of test, especially since it calls runtime.Gosched.

One thing that jumps out at me, though, is that the test doesn't ever block to clean up goroutines after each attempt, so theoretically it's possible that one failure could exacerbate subsequent ones if the problem is transient oversubscription (especially given the runtime.Gosched calls; the test creates 1000 or 2000 goroutines each time it tries). 5 is still a lot of consecutive failures for that to be the only culprit though.

Another thing that seems kind of suspicious is that lockone generates contention on locks[1] and locks[offset], but lockmany generates contention on locks[1] and locks[0]. offset is 251, which matches the size of the runtime's semaphore table; that table is indexed by the lock's 8-byte word address. Based on this, neither locks[offset] nor locks[0] collides with locks[1] in the semaphore table, which is likely intentional. However, locks[0] and locks[1] are definitely in the same cache line, whereas locks[1] and locks[offset] are not (and given that offset is prime, it's very unlikely there's some other kind of false sharing going on), so in lockmany the lock fast path will contend on cache coherency. Note that this test uses RWMutex values, which are 24 bytes in size. (This might be an oversight? The test doesn't use the read-lock functionality at all, while a Mutex is exactly 8 bytes.)
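
To make the layout concrete, here is a minimal sketch of the two contention patterns described above. This is a hypothetical simplification, not the actual test source: the helper name and loop structure are invented, and only the choice of which array elements get hammered mirrors the test.

```go
package main

import (
	"runtime"
	"sync"
)

const offset = 251 // prime; matches the runtime's semaphore-table size

// contend spawns n goroutines per lock that each lock, yield, and unlock,
// then waits for all of them to finish (heavily simplified).
func contend(n int, a, b *sync.RWMutex) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(2)
		go func() {
			defer wg.Done()
			a.Lock()
			runtime.Gosched()
			a.Unlock()
		}()
		go func() {
			defer wg.Done()
			b.Lock()
			runtime.Gosched()
			b.Unlock()
		}()
	}
	wg.Wait()
}

func main() {
	var locks [offset + 1]sync.RWMutex
	// lockone-style pair: locks[1] and locks[offset] live in different cache lines.
	contend(1000, &locks[1], &locks[offset])
	// lockmany-style pair: locks[0] and locks[1] are adjacent 24-byte RWMutexes,
	// so their lock fast paths fight over the same cache line.
	contend(1000, &locks[1], &locks[0])
}
```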

Maybe on these slow platforms there's some kind of cache coherency breakdown? It's hard for me to tell whether the 0-vs-offset difference is intentional: there aren't any comments in the review (https://go-review.googlesource.com/c/go/+/37103) pointing out the difference, the commit message doesn't discuss it, and the test file itself doesn't have any explanation.

I'm inclined to do two things to make this test more robust:

  1. Block on cleanup after obtaining the time-to-run data. Maybe also exclude the setup time from the timing call (just to narrow the scope of the test a bit more).
  2. Use offset instead of 0 in lockmany.

And then we wait to see if it fails again. Does this seem reasonable? If we don't want to perturb things too much, we can just do (2).

mknyszek commented:

I made the aforementioned changes, and they definitely help a bit. Before my changes, even on my linux/amd64 GCP instance I got results that were all over the place, forcing a number of retries. There's still often at least one retry after my changes.

But, thinking more about the test, comparing individual runs is always going to be fraught, so I instead made sure 10 runs are done apiece and ran a t-test on the results, just to see what would happen. The p-value threshold I set is 0.005; if a result doesn't meet that, I discard it and complete the test. If it does, I just make sure the 2n case's mean is within 3x of the n case's.
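
For illustration, here is a rough sketch of computing Welch's t-statistic over the two sets of timings. The helper names and the sample data in main are made up, and converting t and the degrees of freedom into a p-value needs the Student's t CDF, which is omitted from this sketch.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// meanVar returns the sample mean and unbiased sample variance, in seconds.
func meanVar(xs []time.Duration) (mean, variance float64) {
	n := float64(len(xs))
	for _, x := range xs {
		mean += x.Seconds()
	}
	mean /= n
	for _, x := range xs {
		d := x.Seconds() - mean
		variance += d * d
	}
	variance /= n - 1
	return
}

// welch returns Welch's t-statistic and its degrees of freedom for two
// independent samples with possibly unequal variances.
func welch(a, b []time.Duration) (t, df float64) {
	ma, va := meanVar(a)
	mb, vb := meanVar(b)
	sa, sb := va/float64(len(a)), vb/float64(len(b))
	t = (mb - ma) / math.Sqrt(sa+sb)
	df = (sa + sb) * (sa + sb) /
		(sa*sa/float64(len(a)-1) + sb*sb/float64(len(b)-1))
	return
}

func main() {
	// Made-up timings purely to exercise the helpers: 10 runs at n ops
	// and 10 runs at 2n ops.
	runsN := make([]time.Duration, 10)
	runs2N := make([]time.Duration, 10)
	for i := range runsN {
		runsN[i] = time.Duration(10+i) * time.Millisecond
		runs2N[i] = time.Duration(21+2*i) * time.Millisecond
	}
	t, df := welch(runsN, runs2N)
	fmt.Printf("t = %.2f, df = %.1f\n", t, df)
	// Decision rule described above: if the difference is not significant
	// at p < 0.005, discard the result and pass; otherwise require the 2n
	// mean to be within ~3x of the n mean.
}
```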

AFAICT this is more robust than before? In theory, if there's no significant difference (due to e.g. noise) or the difference is inverted for some reason, we can just discard the result entirely. We can also tune our p-value threshold until we feel comfortable with the rate of false positives (though the 0.005 threshold and the fairly lax constraints should already make that very low, and it should still detect real regressions). It does unfortunately make the test slower, but the methodology is more rigorous.

This method of testing also seems like it could be useful for other tests in this directory (like maplinear.go).

mknyszek commented Jun 1, 2022

It occurs to me that I never stated this plainly: I don't think this failure represents a real regression. I think it's very clear from all my runs that the invariant the test intends to check is generally upheld, and I also don't see any significant regression here in terms of how often it's upheld. What I understand this issue to be about is a flaky test and trying to make it less flaky.

bcmills commented Jun 1, 2022

What I understand this issue to be about is a flaky test and trying to make it less flaky.

I agree, and I think the current labels (okay-after-beta1, Testing) also reflect that.

But I also think we do need to make a decision about what information the test is providing. If these lockmany: too slow failures aren't providing us with information about a regression, then I don't want to see them in TryBot runs or post-commit test runs, and I don't think users who run all.bash will want to see them either. (Note that our installing from source instructions currently instruct users to run all.bash.)

So I do think the action here is to decide: (a) is the information provided by the test worth the cost to deflake? and (b) if so, how should we deflake it? (If the answer to (a) is “no”, then we can just delete it instead. 😅)

bcmills commented Jun 1, 2022

Looking at the test body and the most recent windows/386 failure in the comments, I see:

# go run run.go -G=3 locklinear.go
exit status 2
panic: lockmany: too slow: 1000 ops: 6.836ms; 2000 ops: 403.3203ms

1000 8.7891ms 2000 32.2265ms (3.7X)
1000 21.4844ms 2000 32.2266ms (1.5X)
1000 24.414ms 2000 10.7422ms (0.4X)
1000 19.5313ms 2000 14.6484ms (0.7X)
1000 18.5547ms 2000 16.6016ms (0.9X)
1000 7.8125ms 2000 41.9921ms (5.4X)
1000 4.8829ms 2000 26.3671ms (5.4X)
1000 6.836ms 2000 403.3203ms (59.0X)

That looks like this arbitrary-timeout cutoff:
https://cs.opensource.google/go/go/+/master:test/locklinear.go;l=68;drc=97677273532cf1a4e8b181c242d89c0be8c92bb6

In addition to that cutoff, there is the 5-consecutive-failure cutoff (which results in failure if the running time was above the timeout), and an additional “5 inversions means too noisy” cutoff (https://cs.opensource.google/go/go/+/master:test/locklinear.go;l=54-57;drc=97677273532cf1a4e8b181c242d89c0be8c92bb6).

But the test doesn't actually increase n for the failures before the fifth one, so it seems to me that the test isn't reducing its own noise and is failing if it happens to get a single unusually-slow run on the fifth-or-later attempt.

So I wonder: would it make sense to increase n on every failure instead of just the 5th-or-later (to start raising the signal-to-noise ratio earlier), and perhaps also remove or substantially increase the arbitrary 400ms cutoff (to avoid failing the test if there is a single outlier)?
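
To make that concrete, here is a hypothetical sketch of such a retry loop; the function names, the toy workload, and the 3x linearity bound are all illustrative rather than taken from the actual test:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// timeOps measures a trivial contended-lock workload at a given n; it stands
// in for the test's lockone/lockmany functions.
func timeOps(n int) time.Duration {
	var mu sync.Mutex
	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			mu.Unlock()
		}()
	}
	wg.Wait()
	return time.Since(start)
}

// checkLinearish doubles n on every failed attempt, so later attempts have a
// better signal-to-noise ratio, instead of retrying at the same n and only
// growing it from the fifth failure onward.
func checkLinearish(tries, n0 int) bool {
	n := n0
	for try := 0; try < tries; try++ {
		t1 := timeOps(n)
		t2 := timeOps(2 * n)
		if t2 < 3*t1 { // roughly linear: doubling ops less than triples time
			return true
		}
		n *= 2 // raise n on *every* failure, not just the 5th and later
	}
	return false
}

func main() {
	fmt.Println("linear enough:", checkLinearish(5, 1000))
}
```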

gopherbot commented:

Change https://go.dev/cl/409894 mentions this issue: test/locklinear.go: improve test robustness

mknyszek commented Jun 1, 2022

WDYT about my previous comment on just running a t-test (like benchstat)? I agree that the noise reduction attempts here aren't particularly rigorous. That, and removing the arbitrary cutoff (I left it in for now, but yeah it likely just makes things worse; a consistent failure due to high constant overheads would be better, so we can just fix the test). I put up https://go-review.googlesource.com/c/go/+/409894 if you'd like to take a look.

There's a ton of copied code, but the same infrastructure could come in handy for other tests. It's unfortunate that we can't really depend on x/perf here, because that would avoid all the duplication. Honestly, this is kind of a performance test, and I think we should treat it like one. That might mean pulling this out and merging it into the rest of the performance tests once we have a dashboard.

bcmills commented Jun 1, 2022

WDYT about my previous comment on just running a t-test (like benchstat)?

TBH my statistics chops were never great to begin with, and have only dulled further with time. 😅

I agree that a t-test seems like the right approach in general, although I'm not sure what an appropriate p-value would be for this kind of test.

I do worry a bit that the test could decay if some regression causes it to consistently miss the significance threshold. Maybe we could guard against that by keeping the “increasing n” loop, and either failing the test if it fails to attain significance after some reasonable number of retries, or perhaps if it fails to attain significance on, say, the -longtest builders?

aclements commented:

@mknyszek, I think you've got the right idea with a t-test here. In particular, don't have any retries, and instead just collect several samples from both the less and more contended scenarios to build up two distributions, and then make a decision based on those distributions. I haven't looked at your code closely, but your commit message describes what I think is the right approach.

I do worry that picking a p-value threshold is effectively picking how flaky we're willing to allow this test to be, but nevertheless I think this is worth a shot. It's at least a lot more principled.

Repeating the significance test if it fails sort of defeats the purpose of a significance test, so I don't think we should do that.

Increasing n might help to smooth out the results, but that's basically just baking in a mean over a larger sample period. In general, statistical analysis should do better if we give it more samples with less pre-filtering, so it has as much information as possible.

gopherbot removed the okay-after-beta1 label Jun 10, 2022
gopherbot commented:

Change https://go.dev/cl/411396 mentions this issue: sync: move lock linearity test and treat it like a performance test

gopherbot commented:

Change https://go.dev/cl/411395 mentions this issue: internal/testmath: add two-sample Welch's t-test for performance tests

gopherbot pushed a commit that referenced this issue Jun 13, 2022
This CL copies code from github.com/aclements/go-moremath/stats and
github.com/aclements/go-moremath/mathx for Welch's t-test. Several
existing tests in the Go repository check performance and scalability,
and this import is part of a move toward a more rigorous measurement of
both.

Note that the copied code is already licensed to the Go Authors, so there's
no need to worry about additional licensing considerations.

For #32986.

Change-Id: I058630fab7216d1a589bb182b69fa2231e6f5475
Reviewed-on: https://go-review.googlesource.com/c/go/+/411395
Reviewed-by: Michael Pratt <mpratt@google.com>
gopherbot pushed a commit that referenced this issue Jun 13, 2022
This change moves test/locklinear.go into the sync package tests, and
adds a bit of infrastructure, since there are other linearity-checking
tests that could benefit from it too. This infrastructure is also
different from what test/locklinear.go does: instead of trying really
hard to get at least one success, we treat this like a performance
test and look for a significant difference via a t-test.

This makes the methodology behind the tests more rigorous, and should
reduce flakiness as transient noise should produce an insignificant
result. A follow-up CL does more to make these tests even more robust.

For #32986.

Change-Id: I408c5f643962b70ea708930edb4ac9df1c6123ce
Reviewed-on: https://go-review.googlesource.com/c/go/+/411396
Reviewed-by: Michael Pratt <mpratt@google.com>
golang locked and limited conversation to collaborators Jun 13, 2023