Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: concurrent phase of GC causing full request stalls #17341

Closed
zviadm opened this issue Oct 4, 2016 · 5 comments
Closed

runtime: concurrent phase of GC causing full request stalls #17341

zviadm opened this issue Oct 4, 2016 · 5 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@zviadm
Copy link

zviadm commented Oct 4, 2016

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

Go 1.7
Also tried with: GO: fd29628

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0"
CXX="g++"
CGO_ENABLED="1"

What did you do?

If possible, provide a recipe for reproducing the error.
A complete runnable program is good.
A link on play.golang.org is best.

Wrote a benchmark to showcase the issue using native net/http library:
https://gist.github.com/zviadm/18263037d4b9b624fc8a5b4cb7005bb4

What this benchmark does is, create a simple http server, and then create lots of connections to it and perform very simple requests to it from those connections/goroutines. This simulates having an http server and lots of clients where most of them are idle.

What did you expect to see?

When run using something like:
$: ulimit -Sn 500000
$: GODEBUG=gctrace=1 ./net_gctest -port=5002 -nroutines=30000 -sleep_t=3s -max_latency=30ms &> benchout.txt

Expect to see it doing 10k QPS, with no stalls or tail latencies.

What did you see instead?

Sample output: https://gist.github.com/zviadm/82e9b899833db9dc741f9a6400a09d6c

gc 25 @222.337s 0%: 0.34+165+1.3 ms clock, 11+477/1307/2896+43 ms cpu, 1411->1415->723 MB, 1447 MB goal, 32 P
gc 26 @237.844s 0%: 0.30+165+1.4 ms clock, 9.8+504/1306/3076+47 ms cpu, 1411->1415->723 MB, 1447 MB goal, 32 P
gc 27 @253.378s 0%: 0.30+162+1.2 ms clock, 9.4+485/1286/2993+39 ms cpu, 1411->1415->723 MB, 1447 MB goal, 32 P

even though, STW phase of GC is <2ms, we see lots of stalls of >30ms while concurrent phase of the GC runs. pretty much all requests stall during that time period.

This is a pretty serious issue because that basically means if you have lots of idle connections (or potentially just go routines?) in a server, it causes huge tail latency spikes. And in real programs GC frequency will be higher, it can happen once every 5-10 seconds and now these spikes easily show up in p95/p99 latency times.

Note that the cpu utilization is very low throughout the test. If we perform same test with much smaller number of go routines, say only 200 or 300, it can do 200k QPS without any STALLs on that same machine.

@bradfitz bradfitz changed the title Concurrent phase of GC causing full request stalls. runtime: concurrent phase of GC causing full request stalls Oct 4, 2016
@zviadm
Copy link
Author

zviadm commented Oct 4, 2016

Maybe related to:
#15847
#16293

But it seems like issue remains with current 'master' too, so seems like something else

@quentinmit quentinmit added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Oct 4, 2016
@quentinmit quentinmit added this to the Go1.8 milestone Oct 4, 2016
@quentinmit
Copy link
Contributor

@zviadm Can you send an execution trace? There are several known issues around this, which we're working on for Go 1.8. From the trace we should be able to tell which one(s) are affecting you.

https://golang.org/pkg/runtime/trace/

@cespare
Copy link
Contributor

cespare commented Oct 5, 2016

I reproduced this using the given gist. Log output here.

Relevant bit of the trace:

screen_20161005100414

I attached the full trace here. For me, the pause appears in the split 6/8 (15.636125645s-19.117813135s).

Edit: I originally posted a trace using Go 1.7. I updated the log and trace using head (dce0df2) instead.

@aclements
Copy link
Member

Thanks! It's clear from the trace that this is #16528. Closing as a duplicate.

@zviadm, if you subscribe to #16528, it would be great to get feedback on future fixes. You can also try CL 30031 from that issue, though based on my own tests that just shifts the pauses around (albeit to an easier-to-fix place, which is progress in some sense :)

@zviadm
Copy link
Author

zviadm commented Oct 5, 2016

Thanks for quick responses, I have subscribed to the other task. Will try to see what happens with that CL too, just as a data point.

@golang golang locked and limited conversation to collaborators Oct 5, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

5 participants