sync: Pool tests flaky on arm builders #31422

Closed
bcmills opened this issue Apr 11, 2019 · 9 comments
Labels: FrozenDueToAge, NeedsFix, release-blocker, Testing

@bcmills
Contributor

bcmills commented Apr 11, 2019

Possibly related to #24640.

Samples:
https://build.golang.org/log/10c155a9635967f5b3006b6a04b6d5442ff9713a:

--- FAIL: TestPoolDequeue (0.00s)
    pool_test.go:239: popHead never succeeded
FAIL
FAIL	sync	0.864s

https://build.golang.org/log/3fbb17b4083eca9629c97f7b71879c804ecf5d0d and
https://build.golang.org/log/9f98720ace008f8c98f74c0d14049cb67b3c56f5:

##### sync -cpu=10
--- FAIL: TestPoolChain (0.00s)
    pool_test.go:239: popHead never succeeded
FAIL
FAIL	sync	0.864s
@bcmills added the NeedsInvestigation label on Apr 11, 2019
@bcmills added this to the Go1.13 milestone on Apr 11, 2019
@ianlancetaylor
Contributor

CC @aclements

@aclements
Member

I just got this once in 1,045 runs of all.bash on my linux/amd64 workstation.

--- FAIL: TestPoolChain (0.00s)
    pool_test.go:239: popHead never succeeded
FAIL
FAIL    sync    0.827s

This is certainly a theoretically possible failure, but when I wrote this test I thought the chance of hitting the bad schedule was infinitesimal. Maybe there's a more likely schedule that can cause this.

@bradfitz added the NeedsFix, release-blocker, and Testing labels on Apr 30, 2019
@gopherbot removed the NeedsInvestigation label on Apr 30, 2019
@josharian
Contributor

Lots of instances of this on arm and arm64 builders:

$ greplogs -dashboard -E popHead -l
2019-04-29T15:23:10-db1514c/linux-arm64-packet
2019-04-29T21:26:07-d5014ec/linux-arm64-packet
2019-04-29T22:17:05-ccbc9a3/linux-arm64-packet
2019-04-30T15:48:46-4ad1355/netbsd-arm-bsiegert
2019-04-30T16:59:13-f686a28/netbsd-arm-bsiegert
2019-04-30T18:40:06-62ddf7d/linux-arm64-packet
2019-04-30T19:13:43-8e4f1a7/linux-arm64-packet
2019-04-30T20:26:36-85387aa/netbsd-arm-bsiegert
2019-05-01T14:59:51-ab5cee5/netbsd-arm-bsiegert
2019-05-01T16:10:05-e56c73f/netbsd-arm-bsiegert
2019-05-01T16:53:19-f0c383b/netbsd-arm-bsiegert
2019-05-01T16:55:33-07f6894/netbsd-arm-bsiegert
2019-05-01T21:14:28-aaf40f8/netbsd-arm-bsiegert
2019-05-01T22:22:41-e5f0d14/netbsd-arm-bsiegert
2019-05-02T14:04:56-2316784/netbsd-arm-bsiegert
2019-05-02T14:44:05-19f5c23/netbsd-arm-bsiegert
2019-05-02T22:17:31-fe83731/netbsd-arm-bsiegert
2019-05-03T15:17:54-5e404b3/netbsd-arm-bsiegert
2019-05-03T15:20:15-f5c43b9/netbsd-arm-bsiegert
2019-05-03T15:20:41-2c67cdf/linux-arm64-packet
2019-05-03T18:42:04-7fcba81/linux-arm64-packet
2019-05-06T17:06:16-5003b62/netbsd-arm-bsiegert
2019-05-06T18:17:03-cc5eaf9/linux-arm64-packet
2019-05-06T20:09:58-e1f9e70/netbsd-arm-bsiegert
2019-05-06T20:57:39-a62b572/netbsd-arm-bsiegert
2019-05-06T20:59:20-f4a5ae5/netbsd-arm-bsiegert
2019-05-06T21:14:52-5c15ed6/linux-arm
2019-05-06T21:23:29-04845fe/linux-arm64-packet
2019-05-06T21:23:29-04845fe/netbsd-arm-bsiegert
2019-05-06T23:02:29-6b1ac82/netbsd-arm-bsiegert
2019-05-06T23:23:45-53374e7/linux-arm64-packet
2019-05-06T23:23:45-53374e7/netbsd-arm-bsiegert
2019-05-07T12:48:04-a88cb1d/netbsd-arm-bsiegert
2019-05-07T16:59:51-8280455/linux-arm64-packet
2019-05-07T16:59:51-8280455/netbsd-arm-bsiegert
2019-05-08T16:00:05-4cd6c3b/linux-arm64-packet
2019-05-08T16:00:05-4cd6c3b/netbsd-arm-bsiegert
2019-05-08T16:55:59-2625fef/netbsd-arm-bsiegert
2019-05-08T17:11:57-5a2da56/netbsd-arm-bsiegert
2019-05-09T00:02:34-f766b68/netbsd-arm-bsiegert
2019-05-09T16:10:22-d56199d/linux-arm64-packet
2019-05-09T17:11:16-a44c3ed/linux-arm64-packet
2019-05-09T17:49:12-50a1d89/netbsd-arm-bsiegert
2019-05-09T21:13:18-6ed2ec4/netbsd-arm-bsiegert
2019-05-09T21:13:21-1ea7644/netbsd-arm-bsiegert
2019-05-09T21:13:39-13723d4/netbsd-arm-bsiegert
2019-05-09T21:13:56-a4f5c9c/netbsd-arm-bsiegert
2019-05-10T00:14:40-4ae31dc/netbsd-arm-bsiegert
2019-05-10T14:24:43-2aa8971/netbsd-arm-bsiegert
2019-05-11T03:02:33-ce5ae2f/netbsd-arm-bsiegert
2019-05-11T23:19:40-0926701/netbsd-arm-bsiegert

@bcmills changed the title from "sync: Pool tests flaky on linux-arm64-packet builder" to "sync: Pool tests flaky on arm builders" on May 14, 2019
@dianhong01
Contributor

dianhong01 commented Jun 3, 2019

When I ran the sync Pool tests as shown below about 2000 times on an arm64 device, they all passed:
../golang/bin/go test sync -cpu=10 -c -o s1
./s1
But when I ran them like this:
../golang/bin/go test sync -cpu=10 -c -o s2
./s2 -test.short
1521 runs passed and 1378 failed.

When the all.bash script runs, the '-test.short' flag is set, which makes the run faster. In this case, '-test.short' controls the value of "N" in the test. A comment in the code says "In theory it's possible in a valid schedule for popHead to never succeed", so my guess is that N is too small for the test to pass reliably.

func testPoolDequeue(t *testing.T, d PoolDequeue) {
	const P = 10
	// In long mode, do enough pushes to wrap around the 21-bit
	// indexes.
	N := 1<<21 + 1000
	if testing.Short() {
		N = 1e3
	}
	...........

@bcmills
Contributor Author

bcmills commented Jun 26, 2019

@aclements, is this still on the radar for 1.13? Is this more likely a bug in the test, or in the Pool implementation?

@aclements
Member

Given that the long test doesn't flake, this is almost certainly a bug in the test. In the short test, there are only 100 expected PopHeads. On my linux/amd64 laptop, in 1000 runs, it gets as low as 50 successful PopHeads, but that seems to be a hard floor. It does give me pause that the failure rate is that high, since I would expect these schedules to be quite rare.

@aclements
Member

I added some logging. It looks like the time between the PushHead committing and the PopHead committing is just long enough that the racing PopTail loop can regularly succeed and drain the queue.

This means it's just the test. I'm not sure why it's so flaky on arm64 specifically, but it may be that that window is just larger because of architectural details. I'm still thinking about how to make the test less flaky. We could of course just add retries, but it would be nice to do something better.

@aclements
Member

Or we just remove the nPopHead check.

@gopherbot

Change https://golang.org/cl/183981 mentions this issue: sync: only check for successful PopHeads in long mode
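
Going only by the CL title, the idea is to gate the check on test mode. A rough sketch of that gating follows; `checkPopHead` is a hypothetical helper written for illustration, and this is not the contents of the CL. In a real test the `short` argument would presumably come from `testing.Short()`.

```go
package main

import (
	"errors"
	"fmt"
)

// checkPopHead is a hypothetical helper: it reports an error for zero
// successful PopHeads only when NOT in short mode, because short runs
// do too few operations for the count to be a reliable signal.
func checkPopHead(short bool, nPopHead int) error {
	if !short && nPopHead == 0 {
		return errors.New("popHead never succeeded")
	}
	return nil
}

func main() {
	fmt.Println(checkPopHead(true, 0))  // short mode: check skipped
	fmt.Println(checkPopHead(false, 0)) // long mode: failure reported
}
```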
