
runtime: network blips during concurrent GC #20457

Closed

petermattis opened this issue May 22, 2017 · 11 comments

petermattis commented May 22, 2017

What version of Go are you using (go version)?

go version go1.8.1 darwin/amd64

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/pmattis/Development/go"
GORACE=""
GOROOT="/Users/pmattis/Development/go-1.8"
GOTOOLDIR="/Users/pmattis/Development/go-1.8/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/qc/fpqpgdqd167c70dtc6840xxh0000gn/T/go-build955668609=/tmp/go-build -gno-record-gcc-switches -fno-common"
CXX="clang++"
CGO_ENABLED="1"
PKG_CONFIG="pkg-config"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"

What did you do?

While investigating tail latencies in CockroachDB, I noticed network blips during concurrent GC. The trace below is from a single-node cockroach cluster with a single client sending requests as fast as possible. Even so, the load is modest: my test machine is ~75% idle while gathering these traces.

[screenshot: execution trace showing network blips during each GC run]

trace.8.out.zip

Every time GC runs, there are no networking events for ~10ms, which is quite close to the length of the GC run itself. This behavior occurs very quickly after starting the cluster and load generator. The heap size is modest (O(100 MB)).

Here is a zoomed-in view of the first GC run from the trace above:

[screenshot: zoomed-in view of the first GC run]

In addition to the blip while GC is running, there is a smaller blip after each GC run while sweeping. The image below shows G611 being called on to assist with sweep work from 62ms to 69ms in the trace.

[screenshot: G611 assisting with sweep work from 62ms to 69ms]

Note that I'm running both the cockroach process and the load generator (kv) on the same machine. Are the GC runs in the cockroach process starving the kv process of CPU? Running both processes with GOMAXPROCS=4 (I'm using an 8-core machine) shows much better behavior.

[screenshot: execution trace with GOMAXPROCS=4, showing much smaller blips]

trace.4.out.zip

dsnet (Member) commented May 22, 2017

\cc @aclements

ianlancetaylor (Contributor) commented

I note that the sysmon goroutine polls the network every 10ms if nothing else has polled. Normally findrunnable will poll the network whenever it doesn't have another goroutine to run. Perhaps during the GC there is always another goroutine to run (to do GC work) and therefore we fall back to the sysmon polling.
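
For context, here is a heavily abridged sketch of the two polling paths in the Go 1.8-era scheduler (runtime/proc.go); the real functions are much longer, and the bodies here are simplified for illustration:

```go
// Abridged from Go 1.8's runtime/proc.go; bodies heavily simplified.

// findrunnable polls the network, but only after this P has failed to
// find any other goroutine to run.
func findrunnable() (gp *g, inheritTime bool) {
	// ... local run queue, global run queue, work stealing ...
	if netpollinited() && sched.lastpoll != 0 {
		if gp := netpoll(false); gp != nil { // non-blocking poll
			// inject the ready goroutines and run one immediately
		}
	}
	// ... otherwise park, or block in netpoll(true) as the last idle P ...
}

// sysmon is the fallback poller: it polls only if nothing else has
// polled in the last 10ms, which matches the observed blip length.
func sysmon() {
	for {
		// ...
		lastpoll := int64(atomic.Load64(&sched.lastpoll))
		now := nanotime()
		if netpollinited() && lastpoll != 0 && lastpoll+10*1000*1000 < now {
			gp := netpoll(false) // non-blocking
			if gp != nil {
				injectglist(gp)
			}
		}
		// ...
	}
}
```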

petermattis (Author) commented

I forgot to add that I instrumented the Go runtime and can see lots of calls to netpoll via gcDrain -> pollWork. I also tried lowering the max sleep time in sysmon to 1ms, but that had no effect. Perhaps the 10ms is some OS-level time slice on Darwin. I'll try to verify this same behavior on Linux.
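
For reference, pollWork is the hook that lets idle-mode GC workers notice pending work; this sketch is abridged from Go 1.8's runtime/proc.go:

```go
// gcDrain calls pollWork in idle mode to decide whether to stop
// marking; as a side effect it does a non-blocking netpoll and
// injects any goroutines that became ready.
func pollWork() bool {
	if sched.runqsize != 0 {
		return true
	}
	p := getg().m.p.ptr()
	if !runqempty(p) {
		return true
	}
	if netpollinited() && sched.lastpoll != 0 {
		if gp := netpoll(false); gp != nil {
			injectglist(gp)
			return true
		}
	}
	return false
}
```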

petermattis (Author) commented

Linux shows similar behavior:

[screenshot: Linux execution trace showing the same network blips]

trace.linux.out.zip

I'm not sure what to make of the 3 idle procs (5, 6 and 7).

josharian (Contributor) commented

Probably unrelated, but #20307 also observed that GC had a detrimental effect on goroutine scheduling.

josharian (Contributor) commented

@petermattis if you can reproduce easily, would you mind trying out tip (and maybe 1.7.5) to see whether anything has changed?

josharian added this to the Go1.9Maybe milestone May 23, 2017
petermattis (Author) commented

@josharian Tip as of go version devel +5e79787 Mon May 22 13:44:12 2017 +0000 darwin/amd64:

[screenshot: execution trace at tip]

trace.tip.out.zip

The networking blip is still there during GC, but it is no longer precisely 10ms. The network blips during sweep seem to be much reduced:

[screenshot: much-reduced blips during sweep at tip]

petermattis (Author) commented

I'm reasonably convinced that what I'm seeing is CPU starvation of my load generator. I tweaked the output of GODEBUG=gctrace=1 to compute the amount of CPU used by GC during the concurrent mark phase:

gc 203 @28.154s 0/98%: 0.007+9.5+0.039 ms clock, 0.060+0.21/18/56+0.31 ms cpu, 82->82->42 MB, 84 MB goal, 8 P

The 0/98% indicates that GC has used 0% of the CPU over the lifetime of the process, but 98% of the CPU during the 9.5ms of concurrent marking. Using the idle CPU for GC is reasonable on one hand, but it makes the Go process a poor neighbor.
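
As a sanity check, the per-cycle figure can be recovered from the stock gctrace fields, reading the cpu triple during mark as assist/background/idle time. The helper below is hypothetical, purely for illustration:

```go
// markCPUFraction estimates the share of available CPU consumed by GC
// during one concurrent mark phase, from gctrace fields. Hypothetical
// helper, not part of the runtime.
func markCPUFraction(assistMS, backgroundMS, idleMS, wallMS float64, procs int) float64 {
	return (assistMS + backgroundMS + idleMS) / (wallMS * float64(procs))
}
```

For the cycle above: (0.21 + 18 + 56) / (9.5 × 8) ≈ 74.2 / 76 ≈ 98%.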

I tried disabling idle GC workers, which significantly reduced the CPU spikes during GC but made latencies much worse:

[screenshot: trace with idle GC workers disabled, showing worse latencies]

trace.noidlegc.out.zip

In the above trace, G377 is running on Proc 5 when GC starts. G377 then blocks on network IO and the 2 dedicated GC workers get started. But only 2 Procs are running, and there does not seem to be a mechanism to kick off another one (or perhaps I disabled that mechanism in my hack to disable idle GC workers). Adding a wakep call after a dedicated GC worker is started fixes this issue (see the sketch below). This gets latencies to a better place:

[screenshot: trace with idle GC workers disabled plus the wakep fix]

trace.noidlegc2.out.zip
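
A hypothetical sketch of that wakep hack, against the dedicated-worker dispatch in Go 1.8's runtime/proc.go schedule(); this is illustrative, not the actual patch:

```go
// Sketch only: schedule() is heavily abridged, and the wakep() call is
// the hack described above.
func schedule() {
	_g_ := getg()
	// ...
	var gp *g
	if gp == nil && gcBlackenEnabled != 0 {
		gp = gcController.findRunnableGCWorker(_g_.m.p.ptr())
		if gp != nil {
			// The GC worker will occupy this P for a while; wake
			// another P so that runnable goroutines (e.g. ones made
			// ready by netpoll) are not starved in the meantime.
			wakep()
		}
	}
	// ... run queues, findrunnable, execute(gp) ...
}
```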

CPU usage during marking is much lower than with stock go1.8.1:

gc 201 @30.960s 1/29%: 0.025+22+0.080 ms clock, 0.20+11/41/0+0.64 ms cpu, 87->88->45 MB, 89 MB goal, 8 P

The downside to removing the idle GC workers is that GC cycles take about twice as long.

Seems like there is a high-level question to be answered about how much CPU GC should use when the process is not CPU limited.

bradfitz (Contributor) commented

Dup of #17969 ("aggressive GC completion is disruptive to co-tenants") then?

petermattis (Author) commented

Yep, that seems like the same issue I'm seeing.

aclements (Member) commented

> But there are only 2 Procs running and there does not seem to be a mechanism to kick off another one (or perhaps I disabled that mechanism in my hack to disable idle GC workers).

You're right that at the moment it's possible for not all idle Ps to help with GC. This is #14179 (which we fixed once, but the fix was broken, so then we unfixed it).

> Seems like there is a high-level question to be answered about how much CPU GC should use when the process is not CPU limited.

Yes. :) But I don't know what the answer is. Ideally we'd get the OS scheduler involved, since it's the only thing that knows what CPU resources are really idle. But (right now) we need global coordination between the GC workers, so we can't just run them at OS idle priority.

Thanks for all the traces and debugging! Closing as a dup of #17969.

golang locked and limited conversation to collaborators Jul 20, 2018