runtime/trace: TestTraceStress fails while unpark #11320

Closed
alexbrainman opened this issue Jun 21, 2015 · 48 comments

@alexbrainman
Member

From recent windows-amd64-2003kardianos builder log:

--- FAIL: TestTraceStress (8.30s)
    trace_test.go:226: failed to parse trace: p 1 is not running g 0 while unpark (offset 3449241, time 930298786)
FAIL
FAIL    runtime/pprof   22.016s

@dvyukov how would I debug that? I cannot reproduce the failure here.

Thank you.

Alex

@bradfitz
Contributor

/cc @aclements for triage

@bradfitz bradfitz added this to the Go1.5Maybe milestone Jun 22, 2015
@dvyukov
Member

dvyukov commented Jun 22, 2015

@alexbrainman how frequently does it happen? Please provide links to builder logs. Can't find them.
Do we have access to the builder?
My only idea is that the traceTickDiv value is too large. I will send a change, but it is a blind change; we need to assess how it affects the failure frequency.
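
For context, a sketch of the quantization concern (illustrative only - traceTickDiv's real value and use live in the runtime, and the numbers below are made up):

package main

import "fmt"

// The runtime divides raw cputicks by a constant (traceTickDiv) to
// compress trace timestamps. If the divisor is too large relative to
// how quickly events are emitted, two distinct tick readings collapse
// onto the same trace timestamp.
func main() {
    const traceTickDiv = 64 // hypothetical value, for illustration
    ticks := []uint64{128000, 128030, 128070}
    for _, t := range ticks {
        fmt.Printf("raw=%d trace_ts=%d\n", t, t/traceTickDiv)
    }
    // raw=128000 and raw=128030 both map to trace_ts=2000: two events
    // with the same timestamp.
}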

@alexbrainman
Member Author

Dmitry, the failure is on the windows-amd64-2003kardianos builder (the middle one of the 3 windows builders on the dashboard). Recently it fails on nearly every build - the failures differ slightly, but all are about trace parsing. The builder belongs to @kardianos - perhaps he can help you with testing.

The test never fails on my PCs here.

Thank you for looking into it.

Alex

@dvyukov
Member

dvyukov commented Jun 22, 2015

What is that builder? Don't see it here:
https://code.google.com/p/go-wiki/wiki/DashboardBuilders
Is it an AMD processor? Perhaps the processor or the OS is broken. We need to run the test program on the machine:
#9729 (comment)
It tests whether RDTSC behaves reasonably on the machine.

@alexbrainman
Member Author

I really don't know. All I know is that it is one of the oldest ones we have OS-wise. You won't be able to run a try-bot on it either; it is running the old builder program. Let's wait for @kardianos - I am sure he will help you test whatever you like there.

Alex

@kardianos
Contributor

I'm running it mainly to test the runtime and cgo interface for older versions of windows that we still claim to support. It runs on VirtualBox right now; the host machine is a 6-core AMD that is four or five years old.

We could disable this test in some manner, or I could send you the product code and iso (and purge it on my end), or come up with some other workaround.
Let me know how I can best help,
-Daniel


@dvyukov
Member

dvyukov commented Jun 22, 2015

@kardianos thanks, I see
Ideally, we would disable the test on this particular builder. @bradfitz, do we have such means now?

@aclements
Member

FWIW, this happened a few times back in April, but only started happening reliably a few days ago:

2015-04-21T20:50:23-e589e08/windows-amd64-2003kardianos
2015-04-22T02:50:48-87054c4/windows-amd64-2003kardianos
2015-04-22T10:35:44-5fa2d99/windows-amd64-2003kardianos
2015-06-18T22:16:16-0c247bf/windows-amd64-2003kardianos
2015-06-18T22:17:11-ccec934/windows-amd64-2003kardianos
2015-06-18T22:39:09-682ecea/windows-amd64-2003kardianos
2015-06-18T22:44:26-82020f8/windows-amd64-2003kardianos
2015-06-19T00:53:56-18d9a8d/windows-amd64-2003kardianos
2015-06-19T01:47:11-9d968cb/windows-amd64-2003kardianos
2015-06-19T06:14:38-75ce330/windows-amd64-2003kardianos
2015-06-19T19:05:01-183cc0c/windows-amd64-2003kardianos
2015-06-19T20:05:31-dc89350/windows-amd64-2003kardianos
2015-06-19T20:28:01-cc6554f/windows-amd64-2003kardianos
2015-06-20T00:52:38-79d4d6e/windows-amd64-2003kardianos
2015-06-20T10:35:38-13c44d2/windows-amd64-2003kardianos
2015-06-21T03:11:01-3cab476/windows-amd64-2003kardianos
2015-06-22T02:48:27-626188d/windows-amd64-2003kardianos

I see e72f5f6 (runtime: fix tracing of syscallexit) on the 18th. Could that have made us more susceptible to this failure, or did it just make us able to detect it?

@kardianos
Contributor

@dvyukov Last time we approached this we left it at this:
https://go-review.googlesource.com/#/c/8736/

I'll do some testing later to see if I can detect if time goes backwards reliably. Let me know if you don't think that is worthwhile.

@dvyukov
Member

dvyukov commented Jun 22, 2015

I'll do some testing later to see if I can detect if time goes backwards reliably. Let me know if you don't think that is worthwhile.

What is your plan? Do you want to detect this in test, or in runtime?

@kardianos
Contributor

It looked like Brad wanted to try to put it in the runtime, though I'm not sure about that. If it can be detected cheaply, then maybe the runtime too. I'll try to quantify the cost of detection and get back.

@bradfitz
Contributor

We don't have a way to disable tests per builder yet, nor is there anything in the environment to key off of. I filed #11346.

@kardianos
Contributor

I tried to detect "time goes backwards" with something like this:

package main

import (
    "log"
    "runtime"
    "time"
)

const timeoutSecond = 100

func nanotime() int64 {
    // return time.Now().UnixNano()
    // return runtime.Unixnano()
    return runtime.Nanotime() // exported runtime·nanotime
}

func main() {
    runtime.GOMAXPROCS(8)

    log.Print("Start Check")
    max := nanotime()
    start := time.Now()

    timeout := time.After(time.Second * timeoutSecond)
    for {
        select {
        case <-timeout:
            log.Print("timeout, no backwards time detected")
            return
        default:
            next := nanotime()
            if next < max {
                log.Printf("backwards detected: duration %v", time.Now().Sub(start))
                return
            }
            max = next
        }
    }
}

Output:

2015/06/23 05:58:56 Start Check
2015/06/23 06:00:36 timeout, no backwards time detected
2015/06/23 06:00:36 Max Diff: 29296900

But I was unable to detect time going backwards, regardless of how I got "time".

Is there a better way to do this? I'm sure I'm missing something.

@dvyukov
Member

dvyukov commented Jun 23, 2015

Cputicks do not go backwards; they are unsynchronized across different CPUs. So you would need to create a pair of threads, bind them to every possible pair of CPUs, and measure the minimal roundtrip time as perceived by each CPU; then figure out from those numbers whether the cputicks are unsynchronized. However, there can be various restrictions on binding threads to CPUs; in such cases the kernel will silently ignore your affinity mask and bind the thread to something else. It is important not to end up with two threads bound to the same CPU, because then the roundtrip would be determined by thread context-switch time, which is wrong. And, of course, the syscalls for determining affinity masks and doing the actual binding are OS-specific. This is going to be messy.
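
A rough sketch of the roundtrip idea in Go (the tick source and pinToCPU below are placeholders, not the real thing - an actual test would read RDTSC on each thread and pin with SetThreadAffinityMask or sched_setaffinity):

package main

import (
    "fmt"
    "runtime"
    "time"
)

// now stands in for a per-CPU tick source; a real test would read
// RDTSC rather than the OS clock used here to keep the sketch
// self-contained and runnable.
func now() int64 { return time.Now().UnixNano() }

// pinToCPU is a stub. Real binding is OS-specific, and the kernel
// may silently ignore the request.
func pinToCPU(cpu int) { runtime.LockOSThread() }

func main() {
    ping, pong := make(chan int64), make(chan int64)
    go func() {
        pinToCPU(1)
        for t := range ping {
            // Delta between a stamp taken on CPU 0 and one taken
            // on CPU 1 right after receiving it.
            pong <- now() - t
        }
    }()
    pinToCPU(0)
    min := int64(1) << 62
    for i := 0; i < 100000; i++ {
        ping <- now()
        if d := <-pong; d < min {
            min = d
        }
    }
    close(ping)
    // If the minimal one-way delta is ever negative (or the two
    // directions are wildly asymmetric), the CPUs' tick counters
    // are not synchronized.
    fmt.Println("min one-way delta (ns):", min)
}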

@aclements
Member

Are we sure this is an unsynchronized-TSC problem? Why would it have only started on the 18th?

@dvyukov
Member

dvyukov commented Jun 24, 2015

I used the following program to test TSC on a machine:
#9729 (comment)
But it needs to be ported to windows.

@kardianos
Contributor

Just to note, the VM has only one core allocated to it, and I pinned the VM threads to a single core with taskset -p 1 X; the pprof test is still failing. I ran the detection program on the host and it didn't detect a problem with the CPU itself.

I'm out of my depth here by a long ways. If someone wants to play around on the box that is failing I can spin up access to it, or try something myself. Let me know.

@gopherbot

CL https://golang.org/cl/11288 mentions this issue.

@kardianos
Contributor

CL 11288 doesn't prevent these two pprof tests from failing, but it does appear to reduce the occurrence of failures.

@dvyukov
Member

dvyukov commented Jun 24, 2015

@kardianos then RDTSC probably ticks at a very low frequency, or only episodically, inside the VM...

@kardianos
Contributor

@dvyukov I'm guessing that is the case. I would like to propose something like the following:

package main

import (
    "fmt"
    "time"
)

func main() {
    if nextDiffMilli() > 10 {
        fmt.Println("SKIP PPROF TESTS")
    }
}

func nextDiffMilli() int {
    last := time.Now().UnixNano()
    for {
        next := time.Now().UnixNano()
        if last != next {
            return int((next - last) / (1000 * 1000))
        }
    }
}

On my windows 7 box (HW) nextDiffMilli always returns 1. On my linux box it returns 0. On my windows XP VM it returns somewhere between 15 and 45 when unloaded; when loaded it gets worse.

@alexbrainman
Member Author

@kardianos your program prints:

C:\dev\src\foo>go run main.go
SKIP PPROF TESTS

on my windows xp. Windows XP (and maybe others) cannot do ticks better than 15ms. I expect all windows pcs have either 1ms or 15ms tick resolution. Maybe you can look for "missing ticks": do a set of measurements like yours and see if there are any variations.
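
Something along these lines would do it (a sketch; the sample count and the millisecond bucketing are arbitrary):

package main

import (
    "fmt"
    "time"
)

// Record the size of each observed clock step. On windows the steps
// should be a steady ~1ms or ~15ms; occasional much larger steps
// would point at missing ticks.
func main() {
    steps := make(map[int64]int) // step size in ms -> count
    last := time.Now().UnixNano()
    for i := 0; i < 1000; i++ {
        for {
            now := time.Now().UnixNano()
            if now != last {
                steps[(now-last)/1e6]++
                last = now
                break
            }
        }
    }
    for ms, n := range steps {
        fmt.Printf("step %dms seen %d times\n", ms, n)
    }
}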

Keep in mind that the Go profiler is not running by default - it is started on request. But when started, it runs on a separate thread with high priority (see the profileloop1 func in the runtime). Mind you, now that I am looking at this code again: it used to work like that when the runtime was in C. Now that the runtime is written in Go, I am not sure that profileloop1 isn't rescheduled onto a different thread as it runs. Even worse, the thread that profileloop1 started on (with high priority) might end up running things that should not run with high priority. @dvyukov, what do you think?

Alex

@dvyukov
Member

dvyukov commented Jun 25, 2015

@kardianos @alexbrainman wait, I am talking about RDTSC ticks, not OS time ticks. RDTSC is a processor instruction on x86; it is not affected by or dependent on the OS.

@alexbrainman
Member Author

I just created https://go-review.googlesource.com/11478 which, I thought, should be unrelated to our discussion. But now I can reliably see similar failures (on my windows/amd64 pc):

c:\dev\go\src\runtime\pprof>go test -short
--- FAIL: TestTraceStress (3.25s)
        trace_test.go:226: failed to parse trace: g 63 is not running while GoBlock (offset 2092378, time 806941183)
FAIL
exit status 1
FAIL    runtime/pprof   12.933s

c:\dev\go\src\runtime\pprof>go test -short
--- FAIL: TestTraceStress (1.78s)
        trace_test.go:226: failed to parse trace: g 81 is not running while GoBlock (offset 539994, time 612984625)
FAIL
exit status 1
FAIL    runtime/pprof   11.331s

c:\dev\go\src\runtime\pprof>go test -short
--- FAIL: TestTraceStressStartStop (1.66s)
        trace_test.go:365: failed to parse trace: p 1 is not running g 0 while unpark (offset 527904, time 408579106)
FAIL
exit status 1
FAIL    runtime/pprof   10.051s

c:\dev\go\src\runtime\pprof>
c:\dev\go\src\runtime\pprof>
c:\dev\go\src\runtime\pprof>go test -short
--- FAIL: TestTraceStressStartStop (0.98s)
        trace_test.go:365: failed to parse trace: p 5 is not running g 0 while unpark (offset 1153859, time 462090960)
FAIL
exit status 1
FAIL    runtime/pprof   10.344s

c:\dev\go\src\runtime\pprof>

Why? I guess I made a mistake somewhere, but where?

Alex

@dvyukov
Member

dvyukov commented Jun 25, 2015

Profileloop1 is not a real goroutine; it is always bound to a single thread. I would expect that your change is a no-op. Also, CPU profiling is not enabled in the trace tests. Are you sure it is that change that makes the difference for the trace tests?

@kardianos
Contributor

@dvyukov could you point me to the code that reads RDTSC ticks? I was able to find code for windows that reads the OS ticks, but not a CPU instruction.

@alexbrainman Were those failures on a WinXP box on hardware?

On WinXP, when we heavily load the CPU (as the pprof stress tests do), the per-thread time only has a resolution of 450 milliseconds, which is going to break pprof.

package main

import (
    "fmt"
    "os"
    "runtime"
    "sync"
    "time"
)

const (
    skipThresholdMilli = 100
    T                  = 8
    timeoutMilli       = 100
)

func main() {
    if skip() {
        fmt.Println("SKIP PPROF")
        os.Exit(1)
    }
}

type pass struct {
    diff    int
    sameFor int
}

func skip() bool {
    done := make(chan struct{})
    time.AfterFunc(time.Millisecond*timeoutMilli, func() {
        close(done)
    })
    wg := &sync.WaitGroup{}
    result := make(chan pass, T)
    for range [T]struct{}{} {
        wg.Add(1)
        go func() {
            defer wg.Done()
            diff, sameFor := thrash(done)
            result <- pass{diff, sameFor}
        }()
    }
    wg.Wait()
    close(result)
    for p := range result {
        if p.diff > skipThresholdMilli {
            fmt.Println(p.diff, " - but same for ", p.sameFor)
            return true
        }
    }
    return false
}

func nextDiffMilli() (int, int) {
    sameFor := 0
    last := time.Now().UnixNano()
    for {
        next := time.Now().UnixNano()
        if last != next {
            return int((next - last) / (1000 * 1000)), sameFor
        }
        sameFor++
    }
}

func thrash(done chan struct{}) (int, int) {
    runtime.LockOSThread()
    defer runtime.UnlockOSThread()

    diff := 0
    sameFor := 0
    for {
        select {
        case <-done:
            return diff, sameFor
        default:
            diff, sameFor = nextDiffMilli()
            if diff > skipThresholdMilli {
                return diff, sameFor
            }
        }
    }
}

Output:

468  - but same for  1136486
SKIP PPROF

@dvyukov
Member

dvyukov commented Jun 25, 2015

@kardianos Here is an example of how to call it:

rdtsc.go

package main

func cputicks() uint64

func main() {
    println(cputicks())
    println(cputicks())
}

rdtsc.s

TEXT ·cputicks(SB),4,$0-8
    BYTE    $0x0f; BYTE $0xae; BYTE $0xf0 // MFENCE
    RDTSC
    SHLQ    $32, DX
    ADDQ    DX, AX
    MOVQ    AX, ret+0(FP)
    RET

Put both files into a single package and build.
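
Note that the assembly is amd64-only (the $0-8 frame size and the 64-bit SHLQ/ADDQ assume GOARCH=amd64); a 386 build would need its own stub.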

@alexbrainman
Member Author

@dvyukov you are correct - runtime/pprof tests fail regardless of my changes - the test is flaky.

I am not sure if you wanted me to test this:

c:\dev\src\t>type main.go
package main

func cputicks() uint64

func main() {
        prev := cputicks()
        for i := 0; i < 20; i++ {
                next := cputicks()
                println(prev, next - prev)
                prev = next
        }
}

c:\dev\src\t>go build && t
50769142142232179 1387
50769142142233566 955755
50769142143189321 536972
50769142143726293 525042
50769142144251335 521802
50769142144773137 552395
50769142145325532 558750
50769142145884282 532502
50769142146416784 517623
50769142146934407 519588
50769142147453995 519040
50769142147973035 550398
50769142148523433 519447
50769142149042880 518907
50769142149561787 521669
50769142150083456 516896
50769142150600352 551689
50769142151152041 529162
50769142151681203 518949
50769142152200152 517519

c:\dev\src\t>

Regarding https://go-review.googlesource.com/11478: why do you say that the change is a no-op? Why is profileloop1 bound to a single thread? Are you saying that none of the functions we could potentially call from profileloop1 could result in execution being rescheduled onto a different thread?

Alex

@dvyukov
Member

dvyukov commented Jun 26, 2015

I am not sure if you wanted me to test this:

Something similar needs to be tested on the @kardianos builder. But the diff must not include println time - e.g., collect the diffs in an array first, then print.

Why profileloop1 is bound to a single thread?

AFAIR profileloop1 is not a goroutine; it is created directly with a CreateThread API call, so it always has its own thread. The scheduler preempts only real goroutines running on scheduler threads.

@alexbrainman
Member Author

Thank you Dmitry.

c:\dev\src\t>type main.go
package main

func cputicks() uint64

func main() {
        var ticks [20]uint64
        for i := range ticks {
                ticks[i] = cputicks()
        }
        for i := 0; i < len(ticks)-1; i++ {
                println(ticks[i+1] - ticks[i])
        }
}

c:\dev\src\t>go build && t
1466
1238
1234
1229
1234
1234
1233
1230
1234
1233
1234
1234
1233
1226
1233
1230
1230
1233
1226

c:\dev\src\t>

Alex

@dvyukov
Member

dvyukov commented Jun 26, 2015

@alexbrainman are those numbers from the machine with the flaky trace tests?

On my machine it prints:

88
84
72
68
64
68
68
68
64
68
68
68
64
68
68
68
64
68
68

Your numbers look quite high, but I don't see how that could make the trace tests flaky. On the contrary, too-low numbers (in particular 0) would make the trace tests flaky...

@alexbrainman
Member Author

Yes, the output shown is from my windows amd64 pc with the flaky tests. It seems I can easily reproduce them. If you have something for me to try on it, I would be happy to. Thank you.

Alex

@dvyukov
Member

dvyukov commented Jun 26, 2015

with flaky trace tests or with flaky CPU profiler tests?

@alexbrainman
Member Author

Well, the failing test is TestTraceStress (see my post from yesterday above). That's what we are trying to fix, isn't it?

Alex

@dvyukov
Member

dvyukov commented Jun 26, 2015

@alexbrainman ah, OK, I was confused by the cpu profiler change.
What is the processor? And OS version?
We need to port this program to windows:
#9729 (comment)
It will allow us to verify whether it is an issue with the hardware/OS, or whether the hardware/OS are fine and there is an issue with the tracer itself.

@kardianos
Contributor

@dvyukov When I look for low diffs between ticks on my VM, I can find them in the range of 1-40 (never zero). The host machine is always 97. This is what I used to determine this:

package main

import (
    "fmt"
    "os"
    "runtime"
    "time"
)

func cputicks() uint64

const (
    skipThreshold = 40
    timeoutMilli  = 1000
)

func main() {
    if skip() {
        fmt.Println("SKIP PPROF")
        os.Exit(1)
    }
}

func skip() bool {
    done := make(chan struct{})
    time.AfterFunc(time.Millisecond*timeoutMilli, func() {
        close(done)
    })
    min := thrash(done)
    fmt.Println(min)
    return min < skipThreshold
}

// nextTickDiff returns the difference between two consecutive distinct
// cputicks readings (in raw ticks, not milliseconds).
func nextTickDiff() uint64 {
    last := cputicks()
    for {
        next := cputicks()
        if last != next {
            return next - last
        }
    }
}

func thrash(done chan struct{}) uint64 {
    runtime.LockOSThread()
    defer runtime.UnlockOSThread()

    var diff, min uint64
    min = 100000000
    for {
        select {
        case <-done:
            return min
        default:
            diff = nextTickDiff()
            if diff < min {
                min = diff
            }
            if min < skipThreshold {
                return min
            }
        }
    }
}

@dvyukov
Member

dvyukov commented Jun 26, 2015

never zero

that's because you filter zeros out (if last != next) :)

@kardianos
Contributor

Sorry, holdover code. I had done other tests showing that it is never zero. But modified, it still never shows zero:


func nextTickDiff() uint64 {
    last := cputicks()
    next := cputicks()
    return next - last
}

@alexbrainman
Member Author

What is the processor? And OS version?

It is VMware. Inside the VM it reports an Intel Xeon CPU E5-2680 0 @ 2.70GHz and Windows 7.

We need to port this program to windows:
#9729 (comment)
It will allow to verify whether it is an issue with hardware/os, or hardware/os are ok and there is an issue with > tracer itself.

It looks like my pc has only 1 CPU. Your program seems to need multiple CPUs, doesn't it? Can I use Go?

Alex

@kardianos
Contributor

@alexbrainman On both of our setups where we can reproduce this, we have only one CPU. In both cases we never see the same RDTSC value twice in a row (and it never appears to go backwards on our single core).

What I do see, both under load and when idle, is that RDTSC has 100x variability in the time difference between ticks. On hardware I may get a diff of 100 ticks on each call, or 300. But on the VM it varies from 1-30 up to 300000 ticks.
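
For instance, something like this (assuming the cputicks assembly stub from the rdtsc.s example earlier in this thread is in the same package):

package main

import "fmt"

func cputicks() uint64 // assembly stub from the rdtsc.s above

func main() {
    const N = 1 << 20
    prev := cputicks()
    min, max := ^uint64(0), uint64(0)
    for i := 0; i < N; i++ {
        next := cputicks()
        d := next - prev
        if d < min {
            min = d
        }
        if d > max {
            max = d
        }
        prev = next
    }
    // On bare hardware min and max stay within a small factor of each
    // other; on the affected VM they differ by orders of magnitude.
    fmt.Println("min:", min, "max:", max)
}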

@dvyukov I wasn't able to find any code that relies on a calculated tick frequency; it looks like the RDTSC time is encoded directly into the event field and read directly before the events are sorted. But maybe I missed something.

@dvyukov
Member

dvyukov commented Jun 29, 2015

it never appears to go backwards on our single core

How was that proved?

@dvyukov
Member

dvyukov commented Jun 29, 2015

Run the following program while the host is loaded:

package main

func cputicks() int64

func main() {
    for {
        t0, t1 := cputicks(), cputicks()
        if t1-t0 <= 0 {
            println(t0, t1, t1-t0)
            return
        }
    }
}

@kardianos
Contributor

I've had this running for half an hour and nothing has printed or exited. I'll keep it running for a while longer, but I don't think that is the cause.


@dvyukov
Member

dvyukov commented Jun 30, 2015

@kardianos @alexbrainman OK, please dump a bad trace to a file and attach it here. I will take a look at what exactly is wrong there.

@dvyukov
Member

dvyukov commented Jul 1, 2015

From the trace that @kardianos sent me offline:

179827100 GoBlock p=2 g=11 off=936312
179827123 GoUnblock p=2 g=0 off=936315 g=11
179827146 GoStart p=2 g=11 off=936319 g=11
179827169 GoBlock p=2 g=11 off=936322
179827192 GoUnblock p=2 g=0 off=936325 g=11
179827215 GoStart p=2 g=11 off=936329 g=11
179827238 GoBlock p=2 g=11 off=936332
179827261 GoUnblock p=2 g=0 off=936335 g=11
179827284 GoStart p=2 g=11 off=936339 g=11
179827307 GoBlock p=2 g=11 off=936342
179827330 GoUnblock p=2 g=0 off=936345 g=11
179827353 GoStart p=2 g=11 off=936349 g=11
179827376 GoBlock p=2 g=11 off=936352
179827399 GoStart p=2 g=11 off=936359 g=11 
179827399 GoUnblock p=2 g=0 off=936355 g=11

The last two events happen at the same time.

Please check if https://go-review.googlesource.com/#/c/11834/ fixes tests.
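
In miniature, the problem and the fix look like this (a sketch, not the parser's actual data structures): an unstable sort gives no guarantee about the relative order of equal-timestamp events, while a stable sort keeps the batch (FIFO) order, which is the causal order here.

package main

import (
    "fmt"
    "sort"
)

type event struct {
    ts   int64
    name string
}

type byTime []event

func (e byTime) Len() int           { return len(e) }
func (e byTime) Less(i, j int) bool { return e[i].ts < e[j].ts }
func (e byTime) Swap(i, j int)      { e[i], e[j] = e[j], e[i] }

func main() {
    // Batches are dumped in FIFO order, so the input order is the
    // causal order: the GoUnblock was written before the GoStart.
    evs := byTime{
        {179827376, "GoBlock"},
        {179827399, "GoUnblock"},
        {179827399, "GoStart"},
    }
    // sort.Sort makes no promise about the relative order of the two
    // equal-timestamp events and may emit GoStart before GoUnblock;
    // sort.Stable preserves their input order.
    sort.Stable(evs)
    for _, e := range evs {
        fmt.Println(e.ts, e.name)
    }
}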

@gopherbot

CL https://golang.org/cl/11834 mentions this issue.

dvyukov added a commit that referenced this issue Jul 2, 2015
On some VMs two events can happen at the same time. For example:
179827399 GoStart p=2 g=11 off=936359 g=11
179827399 GoUnblock p=2 g=0 off=936355 g=11
If we do non-stable sort, the events can be reordered making the trace inconsistent.
Do stable sort instead.

Batches are dumped in FIFO order, so if these same-time events are split into
separate batches, stable sort still works.

Events on different CPUs go into different batches and can be reordered.
But the intention is that causally-related events on different CPUs
will have larger (non-zero) time diff.

Update #11320

Change-Id: Id1df96af41dff68ea1782ab4b23d5afd63b890c9
Reviewed-on: https://go-review.googlesource.com/11834
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@kardianos
Contributor

CL 11834 fixes this issue on my VM.

@alexbrainman
Member Author

CL 11834 fixes this issue on my pc too. Thank you @dvyukov and @kardianos for seeing it through to the end.

Alex

@mikioh mikioh modified the milestones: Go1.5, Go1.5Maybe Jul 2, 2015
@mikioh mikioh changed the title runtime/debug: TestTraceStress fails while unpark runtime/trace: TestTraceStress fails while unpark Jul 31, 2015
@golang golang locked and limited conversation to collaborators Aug 5, 2016