
runtime/pprof: CPU profiles incorrect on kernels with bad profile signal accounting #13405

Closed
mdempsky opened this issue Nov 25, 2015 · 17 comments

@mdempsky commented Nov 25, 2015:

On build.golang.org at the moment, linux-arm builds have failed 6 times on the front page, all due to this TestStackBarrierProfiling failure:

--- FAIL: TestStackBarrierProfiling (5.08s)
    pprof_test.go:342: subprocess failed with exit status 1:
        --- FAIL: TestStackBarrierProfiling (5.02s)
            pprof_test.go:97: profile too short: [0x0 0x3 0x0 0x2710 0x0 0x0 0x1 0x0]
        FAIL
FAIL
FAIL    runtime/pprof   15.412s

http://build.golang.org/log/e65672824d6a674ea5d3281ae0c52a269283446f
http://build.golang.org/log/82d27007b1060f62d3ed9ef7c80dd0deb1fbfe8a
http://build.golang.org/log/8fe20cf43733befab66e42326f23727d1d923f07
http://build.golang.org/log/544dd7f74f4fcc8ec46b6ec8c254337146cf7f09
http://build.golang.org/log/979e69618860a68ed8f91c98613d1f4cf3346037
http://build.golang.org/log/92ffe2d467cb2306b9758d1d7672ba8c86d4b140

@bradfitz bradfitz added this to the Go1.6 milestone Nov 25, 2015
@aclements commented:

Similar failures happened a few times before this test was even introduced, though they're probably unrelated:

2015-02-27T23:31:36-c2025c4/windows-amd64-gce
2015-03-13T22:46:56-f52b234/windows-amd64-gce
2015-09-02T05:58:52-5f2cda5/windows-amd64-2003kardianos

This particular failure really just means that there weren't any samples. @minux, @davecheney, any idea what could be causing zero profiling samples in TestStackBarrierProfiling specifically on linux/arm?

@rsc commented Dec 5, 2015:

This is happening on trybots, which suggests that these are scaleway VMs. Is that right, @bradfitz?

Maybe the scaleway VMs have a profiling issue. Have we seen this on real hardware?

@bradfitz commented Dec 5, 2015:

They are on Scaleway but Scaleway has no virtualization. Every instance is the same size and on physical hardware.

@aclements commented:

To reproduce:

VM=$(gomote create linux-arm)
gomote push $VM
gomote run $VM /bin/sh -c 'PATH=$PWD/go/bin:$PATH; cd go/src && ./make.bash && cd runtime/pprof && go test -c'
gomote run $VM /bin/sh -c 'PATH=$PWD/go/bin:$PATH; cd go/src/runtime/pprof; while GODEBUG=gcstackbarrierall=1 GOGC=1 ./pprof.test -test.run TestStackBarrierProfiling -test.v; do true; done'

It usually only takes a few iterations.

Nothing appears particularly special about the iterations that fail. Successful runs collect only on the order of two to ten samples despite running for 5 seconds; with some distribution around that, a run sometimes gets zero samples.

This failure requires all three of GOGC=1, TestStackBarrierProfiling, and arm. On arm, without GOGC=1 or with TestMathBigDivide instead, we get ~100 samples (still ~5x fewer than expected, but at least not almost zero). On amd64, the test gets 600–700 samples.

@aclements commented:

According to strace (run on an ARM Chromebook; on gomote strace makes it unreproducible), we're making what appear to be reasonable calls to setitimer and rt_sigprocmask, yet when the test fails the kernel hasn't sent us any SIGPROFs (it's not just that we're failing to process them).

This appears to be related to STW GC phases. The test spends ~50% of its time in GC and most of that is in STW phases. Forcing full STW with GODEBUG=gcstoptheworld=1 significantly reduces the number of collected samples. Even with GOGC=100, it usually only collects 10–20 samples. With STW and GOGC=1, the test always fails.

GOMAXPROCS also significantly affects the test. Setting GOMAXPROCS=1 fixes the test, even with STW GC, and yields ~500 samples in 5 seconds, which is the expected number. GOMAXPROCS=2 yields ~100 samples in 5 seconds. Usually I'm running with GOMAXPROCS=4.

Clearly the kernel doesn't know the runtime is in a STW GC phase, so there's something different about the process' behavior when it's in STW GC that's interacting with the kernel's profiling signals.

@aclements commented:

This has something to do with sleeping. If I force getfull to osyield instead of usleep(100) when it's out of work and disable the usleep(100) in runqgrab, all of these problems go away and I start getting 600–700 samples, just like on amd64. Disabling one or the other isn't sufficient.

@aclements commented:

This is a result of ARM's poor timer hardware and some strange design decisions in the Linux kernel.

First, Linux/ARM (at least the 3.2 kernel on the Scaleway machines the builders use [1]) is totally incapable of sleeping for 100 microseconds. A usleep(100) gives up the entire 10ms scheduling quantum, even if there's nothing else to run. This is probably because, according to /proc/timer_list, all of the timers have a resolution of 10ms (compare this to, say, linux/amd64, where the resolution is 1ns).

This alone doesn't explain why we aren't getting profiling signals; we still rack up about 2.3 seconds of CPU time during the 5-second test. However, ITIMER_PROF is fired by update_process_times, which checks whether the current process' user plus system CPU time has passed the firing threshold. update_process_times is called from the system tick (specifically tick_nohz_handler in this case). We call usleep roughly 1,200 times across the various threads (there are ~4 active threads, so that's ~3 wall-clock seconds of 10ms sleeps, which adds up almost perfectly with the 2.3 seconds of CPU time), so it's likely we usleep in every single scheduling quantum and hence always give away our time slice. As a result, our process is never the one running when update_process_times fires. Even though the process racks up more than enough CPU time according to /proc/self/stat, update_process_times simply never looks at it and thus never fires a SIGPROF.

[1] Scaleway kernel configuration is here: https://github.com/scaleway/kernel-tools/blob/master/armv7l/3.2.34-std/.config . Notably, CONFIG_HZ=100, which is where the 10ms scheduling quantum comes from, and CONFIG_HIGH_RES_TIMERS is not set.

/cc @dvyukov @RLH in case either of you has insights on what to do about these 100µs sleeps.

@bradfitz commented Jan 5, 2016:

What about OpenBSD?

@aclements commented:

> What about OpenBSD?

Probably. I can't say about the signal-firing aspect without digging into the OpenBSD kernel, but I tested usleep(100) and it takes 20ms on OpenBSD (!).

@aclements commented:

> What about OpenBSD?

Yes, OpenBSD's logic for firing profiling signals has exactly the same problem as Linux's. I bet Solaris does the same thing.

I ran https://go-review.googlesource.com/18251 through the trybots to see what else has poor usleep resolution and this seems to be the whole list.

(FWIW, I also checked the timer support on my Android phone and it actually supports 1ns resolution like linux/amd64, so it's not all linux/arm that's necessarily broken.)

@gopherbot commented:

CL https://golang.org/cl/18252 mentions this issue.

@aclements commented:

One solution is simply to ignore this problem, say that profiling won't work very well on these platforms, and wait for them to catch up. This sounds like a cop-out, but I think it's reasonable because:

  1. The linux/arm builders are using an old kernel (in fact, the oldest supported Scaleway kernel). Every newer Scaleway kernel supports high-res timers. (https://github.com/scaleway/kernel-tools/tree/master/armv7l)
  2. As far as I can tell, every Raspberry Pi kernel (both v1 and v2 and even the "cut-down" kernels) has supported high-res timers. (https://github.com/raspberrypi/linux/tree/rpi-4.4.y/arch/arm/configs bcmrpi_defconfig for v1, bcm2709_defconfig for v2 and true all the way back to 3.2)
  3. OpenBSD and Solaris are second-class ports and there's evidence that only some Solaris platforms actually have trouble with short sleeps (so those, too, will probably go away).
  4. It's 2016. 10 ms is an eternity and it's reasonable to expect decent timer support.

That said, I would like to eliminate the usleep(100) from getfull because it's bad for other reasons. However, the usleep in runqgrab is a good idea (assuming the OS can do it), which makes that one much harder to eliminate.

Another alternative is to try to detect if usleep(100) actually works when the runtime starts up, but I don't know how to do this reliably on a system that may be loaded.

aclements added a commit that referenced this issue Jan 5, 2016
This test triggers a large number of usleep(100)s. linux/arm, openbsd,
and solaris have very poor timer resolution on the builders, so
usleep(100) actually gives up the whole scheduling quantum. On Linux
and OpenBSD (and probably Solaris), profiling signals are only
generated when a process completes a whole scheduling quantum, so this
test often gets zero profiling signals and fails.

Until we figure out what to do about this, skip this test on these
platforms.

Updates #13405.

Change-Id: Ica94e4a8ae7a8df3e5a840504f83ee2ec08727df
Reviewed-on: https://go-review.googlesource.com/18252
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
@rsc commented Jan 6, 2016 via email:

Per discussion yesterday, let's disable the tests and then document the issue and wait for obsolete operating systems to fix themselves.

@rsc rsc changed the title runtime/pprof: TestStackBarrierProfiling flaky on linux-arm runtime/pprof: CPU profiles incorrect on kernels with bad profile signal accounting Jan 6, 2016
@rsc commented Jan 6, 2016:

I updated the title but I'm going to create a whole new issue instead.

@aclements commented:

I've moved eliminating the usleep(100) from getfull (for reasons in addition to this issue) to #13842.

Should we close this issue as resolved, or do we want to hold it open until we can eliminate (or at least narrow down) the list of excluded platforms in TestStackBarrierProfiling?

@rsc commented Jan 6, 2016:

I sent https://golang.org/cl/18257 for the documentation.

On Wed, Jan 6, 2016 at 10:28 AM, Russ Cox <rsc@golang.org> wrote:

> Per discussion yesterday, let's disable the tests and then document the
> issue and wait for obsolete operating systems to fix themselves.

@rsc commented Jan 6, 2016:

Fine to close.
