time: TestAfterQueuing fails on windows builders: #10680

alexbrainman · 2015-05-04T06:27:15Z

The TestAfterQueuing fails on windows builders recently:

http://build.golang.org/log/b9cb3c3357a4a8d874bf0669d6ee52f39f0be277
http://build.golang.org/log/62b0ebbe08b4b4b099b301a426e793286c2ed155
http://build.golang.org/log/bb3bbce19d8e09e3377629c491ca584faecc8ed0

with

--- FAIL: TestAfterQueuing (2.46s)
    sleep_test.go:235: attempt 0 failed: After(160ms) arrived at 532.1925ms, expected [150ms,360ms]
    sleep_test.go:235: attempt 1 failed: after slot 3, expected 1
    sleep_test.go:235: attempt 2 failed: after slot 3, expected 1
    sleep_test.go:239: after slot 3, expected 1
FAIL
FAIL    time    5.842s

The test does not fail for me here. But I suspect builders are CPU busy building Go and running tests - we are running both windows-386 and windows-amd64 in parallel. I added code similar to runtime/pprof/cpuHogger to TestAfterQueuing and I can make TestAfterQueuing fail on VM here (not on real hardware). I suggest we increase delta to 200 * Millisecond (It is 100 * Millisecond at this moment). That fixes my failure. It might make builders more reliable. How does that sound?

Alex

The text was updated successfully, but these errors were encountered:

minux · 2015-05-04T06:29:01Z

I think for the new style builders, the 386 and amd64 are building on different VMs.

bradfitz · 2015-05-04T13:57:29Z

Correct. Each Windows build is isolated.

alexbrainman · 2015-05-05T06:07:16Z

Yes, builders are running in separate VMs now. Thanks for correcting me @minux and @bradfitz. I am not sure why this test fails more often in VMs.

Like I mentioned before, I can break this test reliably if I add this:

diff --git a/src/time/sleep.go b/src/time/sleep.go
index e7a2ee2..673d433 100644
--- a/src/time/sleep.go
+++ b/src/time/sleep.go
@@ -97,8 +97,10 @@ func sendTime(c interface{}, seq uintptr) {
    // Used in NewTicker, dropping sends on the floor is
    // the desired behavior when the reader gets behind,
    // because the sends are periodic.
+   t := Now()
+   print("sendTime: t=", t.String(), "\n")
    select {
-   case c.(chan Time) <- Now():
+   case c.(chan Time) <- t:
    default:
    }
 }
diff --git a/src/time/sleep_test.go b/src/time/sleep_test.go
index 2fce753..508f689 100644
--- a/src/time/sleep_test.go
+++ b/src/time/sleep_test.go
@@ -220,10 +220,60 @@ func TestAfterStop(t *testing.T) {
    }
 }

+var (
+   salt1 = 0
+   salt2 = 0
+)
+
+func cpuHog1() {
+   foo := salt1
+   for i := 0; i < 1e5; i++ {
+       if foo > 0 {
+           foo *= foo
+       } else {
+           foo *= foo + 1
+       }
+   }
+   salt1 = foo
+}
+
+func cpuHog2() {
+   foo := salt2
+   for i := 0; i < 1e5; i++ {
+       if foo > 0 {
+           foo *= foo
+       } else {
+           foo *= foo + 2
+       }
+   }
+   salt2 = foo
+}
+
+func cpuHogger(f func()) {
+   // We only need to get one 100 Hz clock tick, so we've got
+   // a 25x safety buffer.
+   // But do at least 500 iterations (which should take about 100ms),
+   // otherwise TestCPUProfileMultithreaded can fail if only one
+   // thread is scheduled during the 250ms period.
+   t0 := Now()
+   for i := 0; i < 500 || Since(t0) < 5000*Millisecond; i++ {
+       f()
+   }
+}
+
 func TestAfterQueuing(t *testing.T) {
+   go func() {
+       c := make(chan int)
+       go func() {
+           cpuHogger(cpuHog1)
+           c <- 1
+       }()
+       cpuHogger(cpuHog2)
+       <-c
+   }()
    // This test flakes out on some systems,
    // so we'll try it a few times before declaring it a failure.
-   const attempts = 3
+   const attempts = 1
    err := errors.New("!=nil")
    for i := 0; i < attempts && err != nil; i++ {
        delta := 100 * Millisecond
@@ -247,7 +297,9 @@ type afterResult struct {
 }

 func await(slot int, result chan<- afterResult, ac <-chan Time) {
-   result <- afterResult{slot, <-ac}
+   t := <-ac
+   print("await: t=", t.String(), "\n")
+   result <- afterResult{slot, t}
 }

 func testAfterQueuing(t *testing.T, delta Duration) error {

the output

C:\>u:\test -test.short -test.v -test.run=TestAfterQueuing
=== RUN TestAfterQueuing
sendTime: t=2015-05-04 22:44:37.4667308 -0700 PDT
sendTime: t=2015-05-04 22:44:37.4667308 -0700 PDT
sendTime: t=2015-05-04 22:44:37.4677073 -0700 PDT
sendTime: t=2015-05-04 22:44:37.4677073 -0700 PDT
sendTime: t=2015-05-04 22:44:37.4677073 -0700 PDT
sendTime: t=2015-05-04 22:44:37.4677073 -0700 PDT
sendTime: t=2015-05-04 22:44:37.4686838 -0700 PDT
sendTime: t=2015-05-04 22:44:37.4686838 -0700 PDT
sendTime: t=2015-05-04 22:44:37.4686838 -0700 PDT
sendTime: t=2015-05-04 22:44:37.4686838 -0700 PDT
sendTime: t=2015-05-04 22:44:37.4686838 -0700 PDT
sendTime: t=2015-05-04 22:44:37.4696603 -0700 PDT
sendTime: t=2015-05-04 22:44:37.4696603 -0700 PDT
await: t=2015-05-04 22:44:37.4667308 -0700 PDT
await: t=2015-05-04 22:44:37.4686838 -0700 PDT
await: t=2015-05-04 22:44:37.4677073 -0700 PDT
await: t=2015-05-04 22:44:37.4686838 -0700 PDT
await: t=2015-05-04 22:44:37.4686838 -0700 PDT
await: t=2015-05-04 22:44:37.4686838 -0700 PDT
await: t=2015-05-04 22:44:37.4677073 -0700 PDT
await: t=2015-05-04 22:44:37.4667308 -0700 PDT
await: t=2015-05-04 22:44:37.4677073 -0700 PDT
await: t=2015-05-04 22:44:37.4686838 -0700 PDT
await: t=2015-05-04 22:44:37.4696603 -0700 PDT
await: t=2015-05-04 22:44:37.4677073 -0700 PDT
await: t=2015-05-04 22:44:37.4696603 -0700 PDT
--- FAIL: TestAfterQueuing (0.23s)
        sleep_test.go:285: attempt 0 failed: After calls returned out of order: [0 5 3 6 6 6 1 1 2 7 9 4 8]
        sleep_test.go:289: After calls returned out of order: [0 5 3 6 6 6 1 1 2 7 9 4 8]
FAIL

Timers are fired fine - I tested this by adding debug prints in runtime/time.go (not shown). But then when sendTime puts timer events in the channel (returned by time.After), the events are received (in await) out of order - see the output. I think it just the way our scheduler works when there is some runnable code or, perhaps, when in VM. I don't know why this only happens on windows. Perhaps it is to do with Windows time granularity being around 15ms.

The only thing I can think of is to increase delta further from 100ms to 200ms or even more. This makes adjusted test PASS.

Looking for suggestions. Thank you.

Alex

@dvyukov @randall77

mdempsky · 2015-05-05T06:31:36Z

OpenBSD is still flaky too, even after bumping retries up to 100ms (#9903, d12b532). It probably makes sense that for a second retry (i.e., third attempt) we can try an even longer delta period, like 500ms or a full second. Since the tests are only occasionally flaky right now, an extra conservative delta duration shouldn't affect our average testing times much.

Overall, the test as written is just inherently flaky. There's no guarantee the await goroutines will execute in the desired order. They might get scheduled across multiple CPUs or even just get scheduled out of order on a single CPU.

An alternative non-flaky way to write the test would be to select on all of the time.After channels (probably using reflect.Select) and make sure that if the runtime chooses to receive from time.After(N*delta) then we need to either have already received from all of the shorter duration timeouts or receiving from them now should not block.

alexbrainman · 2015-05-05T06:43:18Z

Thank you @mdempsky for the input. I agree with your conclusions, that it is reasonable to increase delta up to 0.5s or 1s. These flukes are exception to the rule and shouldn't increase test time too much. I will see what others say and send the change after.

Alex

randall77 · 2015-05-05T07:19:21Z

It looks like general heavy load can throw this test off. Is there any way to get this test run in isolation? Maybe in test/... somewhere instead of a Test* (which gets run in parallel with lots of other stuff?).

minux · 2015-05-05T07:24:34Z

I really don't want package specific tests to be placed under /test. Tests under /test are supposed to test the compiler and to a lesser extent, the runtime.

alexbrainman · 2015-05-05T07:31:16Z

Can we use testing.TestMain somehow?

Alex

minux · 2015-05-05T07:34:47Z

The problem is that on multiple core machines, go test -short std will build and run multiple tests in parallel. It's not about contention from tests in the same package.

bradfitz · 2015-05-05T16:30:10Z

The problem is that on multiple core machines, go test -short std
will build and run multiple tests in parallel.

I'm not sure that's happening on our builders, but regardless the test shouldn't care.

Let's focus on making the test robust and not running it in isolation.

gopherbot · 2015-05-06T18:00:31Z

CL https://golang.org/cl/9795 mentions this issue.

bradfitz closed this as completed in b4bc7b4 May 6, 2015

golang locked and limited conversation to collaborators Jun 25, 2016

gopherbot added the FrozenDueToAge label Jun 25, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

time: TestAfterQueuing fails on windows builders: #10680

time: TestAfterQueuing fails on windows builders: #10680

alexbrainman commented May 4, 2015

minux commented May 4, 2015 via email

bradfitz commented May 4, 2015

alexbrainman commented May 5, 2015

mdempsky commented May 5, 2015

alexbrainman commented May 5, 2015

randall77 commented May 5, 2015

minux commented May 5, 2015 via email

alexbrainman commented May 5, 2015

minux commented May 5, 2015 via email

bradfitz commented May 5, 2015

gopherbot commented May 6, 2015

time: TestAfterQueuing fails on windows builders: #10680

time: TestAfterQueuing fails on windows builders: #10680

Comments

alexbrainman commented May 4, 2015

minux commented May 4, 2015 via email

bradfitz commented May 4, 2015

alexbrainman commented May 5, 2015

mdempsky commented May 5, 2015

alexbrainman commented May 5, 2015

randall77 commented May 5, 2015

minux commented May 5, 2015 via email

alexbrainman commented May 5, 2015

minux commented May 5, 2015 via email

bradfitz commented May 5, 2015

gopherbot commented May 6, 2015