Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime/trace: TestTraceStressStartStop is unreliable #10476

Closed
davecheney opened this issue Apr 16, 2015 · 17 comments
Closed

runtime/trace: TestTraceStressStartStop is unreliable #10476

davecheney opened this issue Apr 16, 2015 · 17 comments
Milestone

Comments

@davecheney
Copy link
Contributor

pprof.TestTraceStressStartStop is unreliable and causes many builds to be marked as failing.

--- FAIL: TestTraceStressStartStop (0.03s)
trace_test.go:363: failed to parse trace: g 5 is not waiting during syscall exit (offset 357, time 0)
FAIL
FAIL runtime/pprof 8.453s

@davecheney davecheney added this to the Go1.5 milestone Apr 16, 2015
@davecheney davecheney changed the title runtime: TestTraceStressStartStop is unreliable runtime/pprof: TestTraceStressStartStop is unreliable Apr 16, 2015
@davecheney
Copy link
Contributor Author

@bradfitz
Copy link
Contributor

Dup of #9729

@rsc
Copy link
Contributor

rsc commented Apr 17, 2015

Unduplicated from #9729 and assigning Dmitriy. This failure seems to be particularly tied to system calls. The old failure was due to running when the world was stopped. We fixed that. I believe this new failure is due to a bug in:

commit 089d363
Author: Dmitry Vyukov dvyukov@google.com
AuthorDate: Fri Apr 10 18:32:40 2015 +0300
Commit: Dmitry Vyukov dvyukov@google.com
CommitDate: Fri Apr 10 17:39:06 2015 +0000

runtime: fix tracing of syscall exit

Fix tracing of syscall exit after:
https://go-review.googlesource.com/#/c/7504/

Change-Id: Idcde2aa826d2b9a05d0a90a80242b6bfa78846ab
Reviewed-on: https://go-review.googlesource.com/8728
Reviewed-by: Rick Hudson <rlh@golang.org>
Run-TryBot: Dmitry Vyukov <dvyukov@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>

davecheney added a commit that referenced this issue Apr 17, 2015
Updates #10476

Change-Id: Ic4414f669104905c6004835be5cf0fa873553ea6
Reviewed-on: https://go-review.googlesource.com/8962
Reviewed-by: Russ Cox <rsc@golang.org>
@gopherbot
Copy link

CL https://golang.org/cl/9132 mentions this issue.

@mewmew
Copy link
Contributor

mewmew commented May 3, 2015

FYI, reproduced on rev 6f42b61.

$ go version
go version devel +6f42b61 Sun May 3 16:10:40 2015 +0000 linux/amd64
[~/go/src/runtime/pprof]$ go test
--- FAIL: TestTraceSymbolize (22.61s)
    trace_stack_test.go:130: failed to parse trace: p 0 is running before start (offset 669, time 6666204163)
--- FAIL: TestTraceStress (15.24s)
    trace_test.go:224: failed to parse trace: g 82 is not runnable before start (offset 1353327, time 50453959)
--- FAIL: TestTraceFutileWakeup (0.32s)
    trace_test.go:432: failed to parse trace: g 162 is not runnable before start (offset 390, time 155268)
FAIL
exit status 1
FAIL    runtime/pprof   104.363s

@davecheney
Copy link
Contributor Author

Robert. Is this running inside a VM ? If not, what hardware ?

On Mon, May 4, 2015 at 4:17 AM, Robin Eklind notifications@github.com
wrote:

FYI, reproduced on rev 6f42b61
6f42b61
.

$ go version
go version devel +6f42b61 Sun May 3 16:10:40 2015 +0000 linux/amd64

[~/go/src/runtime/pprof]$ go test
--- FAIL: TestTraceSymbolize (22.61s)
trace_stack_test.go:130: failed to parse trace: p 0 is running before start (offset 669, time 6666204163)
--- FAIL: TestTraceStress (15.24s)
trace_test.go:224: failed to parse trace: g 82 is not runnable before start (offset 1353327, time 50453959)
--- FAIL: TestTraceFutileWakeup (0.32s)
trace_test.go:432: failed to parse trace: g 162 is not runnable before start (offset 390, time 155268)
FAIL
exit status 1
FAIL runtime/pprof 104.363s


Reply to this email directly or view it on GitHub
#10476 (comment).

@mewmew
Copy link
Contributor

mewmew commented May 3, 2015

Robert. Is this running inside a VM ? If not, what hardware ?

Running on real (albeit old) hardware. Including the hardware specs below. P.S. my name is Robin :) Let me know if you need any more info.

[~]$ uname -a
Linux x61s 4.0.1-1-ARCH #1 SMP PREEMPT Wed Apr 29 12:00:26 CEST 2015 x86_64 GNU/Linux
[~]$ cat /proc/cpuinfo 
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 15
model name  : Intel(R) Core(TM)2 Duo CPU     L7500  @ 1.60GHz
stepping    : 11
microcode   : 0xba
cpu MHz     : 1200.000
cache size  : 4096 KB
physical id : 0
siblings    : 2
core id     : 0
cpu cores   : 2
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 10
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm ida dtherm tpr_shadow vnmi flexpriority
bugs        :
bogomips    : 3193.85
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 6
model       : 15
model name  : Intel(R) Core(TM)2 Duo CPU     L7500  @ 1.60GHz
stepping    : 11
microcode   : 0xba
cpu MHz     : 1200.000
cache size  : 4096 KB
physical id : 0
siblings    : 2
core id     : 1
cpu cores   : 2
apicid      : 1
initial apicid  : 1
fpu     : yes
fpu_exception   : yes
cpuid level : 10
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm ida dtherm tpr_shadow vnmi flexpriority
bugs        :
bogomips    : 3193.85
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

@mewmew
Copy link
Contributor

mewmew commented May 3, 2015

I applied the changeset in CL https://golang.org/cl/9132, but got the same result:

--- FAIL: TestTraceStress (3.16s)
    trace_test.go:224: failed to parse trace: g 66 is not runnable before start (offset 193346, time 51468022)
--- FAIL: TestTraceStressStartStop (0.06s)
    trace_test.go:363: failed to parse trace: g 65 is not runnable before start (offset 247, time 21113)
--- FAIL: TestTraceFutileWakeup (0.50s)
    trace_test.go:427: failed to parse trace: g 119 is not waiting before unpark (offset 6266, time 2007994)
FAIL
FAIL    runtime/pprof   11.239s

@davecheney
Copy link
Contributor Author

That explains it. Sadly on those processors the TSC counter is not coherent
across cores and is affected by power scaling. The only way I know how to
make the TSC reliable on that cpu is to pin it to a core, which isn't easy
from user space, and really isn't easy without root.

On Mon, May 4, 2015 at 8:40 AM, Robin Eklind notifications@github.com
wrote:

Robert. Is this running inside a VM ? If not, what hardware ?

Running on real (albeit old) hardware. Including the hardware specs below.
P.S. my name is Robin :) Let me know if you need any more info.

[~]$ uname -a
Linux x61s 4.0.1-1-ARCH #1 SMP PREEMPT Wed Apr 29 12:00:26 CEST 2015 x86_64 GNU/Linux

[~]$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 Duo CPU L7500 @ 1.60GHz
stepping : 11
microcode : 0xba
cpu MHz : 1200.000
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm
constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64
monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm ida dtherm
tpr_shadow vnmi flexpriority
bugs :
bogomips : 3193.85
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 Duo CPU L7500 @ 1.60GHz
stepping : 11
microcode : 0xba
cpu MHz : 1200.000
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm
constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64
monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm ida dtherm
tpr_shadow vnmi flexpriority
bugs :
bogomips : 3193.85
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:


Reply to this email directly or view it on GitHub
#10476 (comment).

@minux
Copy link
Member

minux commented May 4, 2015

On Sun, May 3, 2015 at 7:05 PM, Dave Cheney notifications@github.com
wrote:

That explains it. Sadly on those processors the TSC counter is not coherent
across cores and is affected by power scaling. The only way I know how to
make the TSC reliable on that cpu is to pin it to a core, which isn't easy
from user space, and really isn't easy without root.

you can use taskset(1) to pin a process to a set of cores, without root.

@capnm
Copy link

capnm commented Jul 2, 2015

That explains it. Sadly on those processors the TSC counter is not coherent across cores and is
affected by power scaling. The only way I know how to

Not sure if it's being ignored as unfortunate (see #9729), in other case FYI the tests still break the build:
../bin/go version
go version devel +d16c7f8 Thu Jul 2 03:43:34 2015 +0000 linux/amd64

./all.bash

ok      runtime/debug   0.019s
--- FAIL: TestTraceSymbolize (0.01s)
    trace_stack_test.go:130: failed to parse trace: p 1 is running before start (offset 2418, time 313770)
--- FAIL: TestTraceStress (4.56s)
    trace_test.go:226: failed to parse trace: g 69 is not runnable before start (offset 128950, time 47964275)
--- FAIL: TestTraceStressStartStop (0.16s)
    trace_test.go:365: failed to parse trace: EvFrequency contains invalid frequency -2605703297 at offset 0x247
--- FAIL: TestTraceFutileWakeup (0.72s)
    trace_test.go:433: failed to parse trace: g 77 is not running while GoCreate (offset 13821, time 49249)
FAIL
FAIL    runtime/pprof   12.817s

@dvyukov
Copy link
Member

dvyukov commented Jul 2, 2015

@capnm Looking at the EvFrequency failure, it seems that RDTSC goes back on your machine.

@capnm
Copy link

capnm commented Jul 2, 2015

@dvyukov I believe that's impossible ;-)
your test from #9729 :
./tsc-sync-test
1-0: BROKEN 307937275582285 -> 307937068157247 (diff=207425038)
1-0: BROKEN 307938257532281 -> 307938050106672 (diff=207425609)
1-0: BROKEN 307939439698702 -> 307939232270310 (diff=207428392)

@dvyukov
Copy link
Member

dvyukov commented Jul 2, 2015

@capnm the test shows that it happens, though

@capnm
Copy link

capnm commented Jul 2, 2015

@dvyukov If I understand that correctly, RDTSC still gets on both cores forward, the trouble seams to be, that those CPU's just lose the synchronisation after waking up from cpu-throttling?

@dvyukov
Copy link
Member

dvyukov commented Jul 2, 2015

Yes.
Or maybe RDTSC on different CPUs were not synchronized during startup.

@capnm
Copy link

capnm commented Jul 2, 2015

Even after a sync the difference increases slightly over time (~5µs/sec) so I guess the first core 'sleeps' a bit longer.
I think, at least the tests could be skipped and pprof should resign with some suitable error message?

@mikioh mikioh changed the title runtime/pprof: TestTraceStressStartStop is unreliable runtime/trace: TestTraceStressStartStop is unreliable Jul 31, 2015
@golang golang locked and limited conversation to collaborators Aug 5, 2016
@rsc rsc unassigned dvyukov Jun 23, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

8 participants