runtime: gc pause bursts after upgrading from 1.16 to 1.17 #49542
Comments
@dop251 in your trace this is a forced GC ( |
I have seen 3 examples and yes, it is |
Thanks, one more question:
This in particular does sound like the potential problem you describe. Specifically STW has to get all Ps to stop, so each P should have a "proc stop" event before STW continues. (Note that a P may already be stopped, so "proc stop" could be before STW). If a P takes a long time to stop, then that would block everything else. I'd love to look at the trace you've collected to see if this is the case and what else is going on, if you think that is something you can share.
Is there a way to share it privately?
Sure, feel free to email me
Shared a Google Drive folder with you.
Hi, @prattmic. I encountered a similar problem. Any update about this issue?
Hi @zhouguangyuan0718, please file a new issue with additional details (platform, how you measured pause times, which versions of Go you're running, etc.). I think this issue is unfortunately too stale to continue. I don't believe we were able to reproduce this at the time. Closing for now.
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
N/A
What operating system and processor architecture are you using (go env)?
What did you do?
After upgrading our production servers to a version built with Go 1.17.2, we saw a radically different GC pause profile (compared to Go 1.16.6):
What's plotted here is the change in PauseTotalNs, measured every minute. The marker corresponds to the time of the rollout. Note that the bursts did not start straight away, only once the load kicked in in the morning.
Zooming in showed there are "bursts" among otherwise normal sub-millisecond runs:
To confirm this is a regression we did a build of the same code using Go 1.16.6 and deployed it to one of the nodes. We also restarted another node at the same time, but left it running with 1.17.2. Comparing the graphs for the two nodes shows an identical profile, except for the bursts, which only occur on the node running 1.17.2:
(green represents 1.16, yellow -- 1.17).
Here is what it looks like in the trace (we've managed to get a couple of examples):
The root cause is not immediately obvious to me, but I suspect there is a rare race condition which sometimes prevents STW from completing. Note that there is one event recorded during STW. In the end stack trace it shows runtime.selectgo:327. However, in another example there is also an event recorded shortly before the end of STW, but it just says "proc stop".
We haven't been able to create a reproducible case for it. It looks like this only happens when the heap size grows to a few tens of GB, and then the frequency (but not the size) of the bursts depends on the load.
If there is any additional info required please let me know.
What did you expect to see?
Normal, sub-millisecond GC pauses.
What did you see instead?
Random bursts of up to 100ms.