
runtime: 10ms-26ms latency from GC in go1.14rc1, possibly due to 'GC (idle)' work #37116

Closed
thepudds opened this issue Feb 7, 2020 · 15 comments
Labels
FrozenDueToAge, NeedsInvestigation

@thepudds
Contributor

thepudds commented Feb 7, 2020

What version of Go are you using (go version)?

$ go1.14rc1 version

go version go1.14rc1 linux/amd64

Does this issue reproduce with the latest release?

Yes, go1.14rc1.

What operating system and processor architecture are you using (go env)?

Ubuntu 18.04.2 LTS (stock GCE image). 4 vCPUs, 15 GB.

go env Output
$ go1.14rc1 env

GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/user/.cache/go-build"
GOENV="/home/user/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/user/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/user/sdk/go1.14rc1"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/user/sdk/go1.14rc1/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build656951356=/tmp/go-build -gno-record-gcc-switches"

Background

A recent blog post reported an issue seen with the (now more than two-year-old) Go 1.10 when an LRU cache held "tens of millions" of entries: CPU spikes and response-time spikes every two minutes, with p95 response times reportedly hitting 50ms-300ms.

The two-minute periodicity was explained in the blog by the following, which seems plausible:

we had written the Go code very efficiently and had very few allocations. We were not creating a lot of garbage.
After digging through the Go source code, we learned that Go will force a garbage collection run every 2 minutes at minimum. In other words, if garbage collection has not run for 2 minutes, regardless of heap growth, go will still force a garbage collection.
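As an aside, that two-minute forced collection is easy to observe directly with an otherwise-idle program; a minimal sketch (my own, not from the blog), run with GODEBUG=gctrace=1, where the periodic collections show up as 'GC forced' lines roughly every two minutes:

package main

import (
	"runtime"
	"time"
)

func main() {
	// Retain a modest amount of memory so the GC has something to mark.
	retained := make([][]byte, 0, 1024)
	for i := 0; i < 1024; i++ {
		retained = append(retained, make([]byte, 1<<10))
	}
	// With no further allocation, the only collections should be the
	// runtime's periodic forced GCs, roughly every two minutes.
	time.Sleep(10 * time.Minute)
	runtime.KeepAlive(retained)
}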

A separate recent post from one of the engineers theorized that the problem was due to a very long linked list, though I am less sure whether this comment is a precise description of the actual root cause:

we used a large free-list like structure, a very long linked list. The head of the list is maintained as a variable, which means that Go's mark phase must start scanning from the head and then pointer chase its way through the list. For whatever reason, Go does (did?) this section in a single-threaded manner with a global lock held. As a result, everything must wait until this extremely long pointer chase occurs.

An engineer from the same team may also have posted this roughly 2.5 years ago in #14812 (comment):

I've also got a service that's encountering long request latencies due to CPU starvation when the GC runs. Like a couple others have mentioned, I think the GC may act poorly on long linked lists - I have a 1M-long list/map used as an LRU cache.

What did you do?

Wrote a simple GC benchmark intended to emulate an LRU with a separate free list.

It's not 100% clear what their exact data structures were: it may have been just an LRU whose elements were linked together to help expire older items, an LRU plus a separate free list, or something else entirely.

Last night, without much to go on, I put together a simple GC benchmark of an LRU cache along with a separate free list.

The benchmark is here:

https://play.golang.org/p/-Vo1hqIAoNz
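For reference, the rough shape of the data structures in the benchmark is sketched below (simplified; the field names are illustrative and the playground link above is authoritative):

package lrubench

import "container/list"

// entry is one LRU cache item; the value size here is arbitrary.
type entry struct {
	key   int64
	value [32]byte
}

// lru is a map plus a doubly linked list, similar in shape to
// github.com/golang/groupcache/lru.
type lru struct {
	items map[int64]*list.Element // ~50M entries in the runs below
	order *list.List              // maintains expiry (LRU) order
}

// freeNode is one element of the separate free list; during marking the
// GC must chase these pointers one at a time.
type freeNode struct {
	next    *freeNode
	payload [64]byte
}

type freeList struct {
	head *freeNode // ~1M nodes in the runs below
}

// get looks up a key and, on a hit, moves it to the front of the list.
func (c *lru) get(k int64) ([32]byte, bool) {
	if e, ok := c.items[k]; ok {
		c.order.MoveToFront(e)
		return e.Value.(*entry).value, true
	}
	return [32]byte{}, false
}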

Here is a sample run with 50M items in an LRU cache (a map whose items are also linked together on a list), 1M items on a separate free list, and a background allocation rate of about 10k/sec:

$ go1.14rc1 build 
$ GODEBUG=gctrace=1 ./lru-gc-benchmark -free-list-items=1000000 -lru-items=50000000 -duration=20m -bg-alloc-sleep=10ms

What did you expect to see?

Ideally, sub-ms worst-case latency impact of GC.

What did you see instead?

This was fairly quick and dirty, and might not be related to the problem reported in the blog at all, but it did occasionally observe higher latency values.

It is important to note that the issue here does not seem to be anywhere near the scale reported in the blog:

  • The amplitude of the latency impact here (10ms-26ms) seems substantially smaller than what was reported on the blog (50ms-300ms).
  • The numbers here are worst-case latencies, whereas the blog reported p95 latencies.

Here is an example where a single 26ms worst-case latency was observed when looking up an element in the LRU during a forced GC.

$ go1.14rc1 build  && GODEBUG=gctrace=1 ./lru-gc-benchmark -free-list-items=1000000 -lru-items=50000000 -duration=20m -bg-alloc-sleep=10ms 

gc 1 @0.011s 5%: 0.066+4.1+0.023 ms clock, 0.26+0.30/3.0/6.9+0.095 ms cpu, 4->4->3 MB, 5 MB goal, 4 P
gc 2 @0.022s 10%: 0.008+9.0+0.021 ms clock, 0.033+0.33/8.7/10+0.087 ms cpu, 7->7->6 MB, 8 MB goal, 4 P
gc 3 @0.044s 10%: 0.012+13+0.020 ms clock, 0.049+0/12/23+0.083 ms cpu, 13->13->12 MB, 14 MB goal, 4 P
gc 4 @0.087s 10%: 0.009+26+0.030 ms clock, 0.038+0/24/46+0.12 ms cpu, 27->27->25 MB, 28 MB goal, 4 P
gc 5 @0.177s 10%: 0.010+49+0.023 ms clock, 0.042+0/47/89+0.095 ms cpu, 53->54->50 MB, 54 MB goal, 4 P
gc 6 @0.369s 10%: 0.015+98+0.022 ms clock, 0.060+0/98/179+0.089 ms cpu, 107->108->100 MB, 108 MB goal, 4 P
gc 7 @0.767s 10%: 0.012+195+0.021 ms clock, 0.049+0/194/380+0.086 ms cpu, 213->216->200 MB, 214 MB goal, 4 P
gc 8 @1.586s 11%: 7.4+377+0.028 ms clock, 29+71/376/742+0.11 ms cpu, 426->430->398 MB, 427 MB goal, 4 P
gc 9 @3.343s 11%: 0.024+787+0.023 ms clock, 0.099+236/785/1501+0.094 ms cpu, 853->859->795 MB, 854 MB goal, 4 P
gc 10 @7.071s 11%: 0.023+1806+0.023 ms clock, 0.092+497/1803/3513+0.095 ms cpu, 1706->1719->1591 MB, 1707 MB goal, 4 P
gc 11 @14.905s 12%: 49+3796+0.023 ms clock, 196+1084/3796/7271+0.093 ms cpu, 3411->3439->3182 MB, 3412 MB goal, 4 P
gc 12 @31.292s 14%: 226+8162+0.043 ms clock, 904+5028/8161/14256+0.17 ms cpu, 6823->6866->6353 MB, 6824 MB goal, 4 P
LRU count:       50.00M
Free list count: 1.00M
BG allocation:   10000 / sec (approx.)
BG retain count: 10000
Printing slowest lookup times:
i=0          bgAllocs=540000   at=64.026s  lookup=0.009ms
i=4096       bgAllocs=540100   at=64.028s  lookup=0.016ms
i=8944       bgAllocs=540100   at=64.030s  lookup=0.118ms
i=9410       bgAllocs=540100   at=64.031s  lookup=0.123ms
i=52120      bgAllocs=540300   at=64.053s  lookup=0.141ms
i=54295      bgAllocs=540300   at=64.054s  lookup=0.148ms
i=81167      bgAllocs=540400   at=64.068s  lookup=0.185ms
i=141525     bgAllocs=540700   at=64.099s  lookup=0.193ms
i=473297     bgAllocs=542400   at=64.271s  lookup=0.205ms
i=820934     bgAllocs=544200   at=64.451s  lookup=0.233ms
i=1045129    bgAllocs=545300   at=64.572s  lookup=0.323ms
i=49675599   bgAllocs=810200   at=91.600s  lookup=0.651ms
i=59934276   bgAllocs=863000   at=97.000s  lookup=0.661ms
i=63432689   bgAllocs=881700   at=98.900s  lookup=0.696ms
GC forced
i=173370432  bgAllocs=1476800   at=159.688s  lookup=3.407ms
i=173381145  bgAllocs=1476800   at=159.727s  lookup=26.235ms
gc 13 @159.687s 5%: 0.096+17356+0.020 ms clock, 0.38+0.65/17348/34668+0.081 ms cpu, 9642->9659->8463 MB, 12706 MB goal, 4 P
GC forced
gc 14 @297.047s 4%: 0.12+17250+0.021 ms clock, 0.49+30/17250/34376+0.086 ms cpu, 8696->8713->8463 MB, 16926 MB goal, 4 P
GC forced
gc 15 @434.306s 4%: 0.10+16952+0.024 ms clock, 0.41+0.20/16949/33820+0.096 ms cpu, 8696->8712->8463 MB, 16927 MB goal, 4 P

I would guess that the 26ms lookup latency is related to this 26ms 'GC (idle)' period, during which the main goroutine did not seem to be on a P:

[execution trace screenshot: a ~26ms 'GC (idle)' slice with the main goroutine off a P]

Another sample run:


$ go1.14rc1 build  && GODEBUG=gctrace=1 ./lru-gc-benchmark -free-list-items=1000000 -lru-items=50000000 -duration=20m -bg-alloc-sleep=10ms 

gc 1 @0.011s 6%: 0.014+4.3+0.025 ms clock, 0.056+0.023/3.7/6.6+0.10 ms cpu, 4->4->3 MB, 5 MB goal, 4 P
gc 2 @0.021s 10%: 0.009+8.4+0.026 ms clock, 0.038+0.34/7.8/11+0.10 ms cpu, 7->7->6 MB, 8 MB goal, 4 P
gc 3 @0.043s 10%: 0.009+12+0.023 ms clock, 0.036+0/11/23+0.092 ms cpu, 13->13->12 MB, 14 MB goal, 4 P
gc 4 @0.084s 11%: 0.010+24+0.023 ms clock, 0.041+0/24/42+0.093 ms cpu, 27->27->25 MB, 28 MB goal, 4 P
gc 5 @0.171s 10%: 0.014+48+0.024 ms clock, 0.058+0/47/93+0.097 ms cpu, 53->54->50 MB, 54 MB goal, 4 P
gc 6 @0.364s 10%: 0.017+97+0.022 ms clock, 0.070+0/96/177+0.088 ms cpu, 107->108->100 MB, 108 MB goal, 4 P
gc 7 @0.758s 10%: 5.6+203+0.022 ms clock, 22+0.49/201/386+0.091 ms cpu, 213->216->200 MB, 214 MB goal, 4 P
gc 8 @1.605s 10%: 0.019+411+0.023 ms clock, 0.078+0/409/796+0.095 ms cpu, 426->432->399 MB, 427 MB goal, 4 P
gc 9 @3.312s 11%: 28+790+0.031 ms clock, 114+194/787/1520+0.12 ms cpu, 853->860->795 MB, 854 MB goal, 4 P
gc 10 @6.964s 12%: 46+1586+0.033 ms clock, 185+730/1585/2988+0.13 ms cpu, 1706->1716->1587 MB, 1707 MB goal, 4 P
gc 11 @14.638s 13%: 79+3760+0.022 ms clock, 316+1164/3758/7168+0.088 ms cpu, 3411->3438->3182 MB, 3412 MB goal, 4 P
gc 12 @31.187s 18%: 196+6723+0.007 ms clock, 785+11516/6720/7379+0.028 ms cpu, 6823->6824->6311 MB, 6824 MB goal, 4 P
LRU count:       50.00M
Free list count: 1.00M
BG allocation:   10000 / sec (approx.)
BG retain count: 10000
Printing slowest lookup times:
i=0          bgAllocs=530900   at=63.844s  lookup=0.006ms
i=100        bgAllocs=530900   at=63.844s  lookup=0.006ms
i=671        bgAllocs=530900   at=63.844s  lookup=0.028ms
i=2407       bgAllocs=530900   at=63.845s  lookup=0.030ms
i=4258       bgAllocs=530900   at=63.847s  lookup=0.120ms
i=80012      bgAllocs=531300   at=63.886s  lookup=0.634ms
i=134006913  bgAllocs=1258500   at=138.086s  lookup=0.679ms
i=145128912  bgAllocs=1321200   at=144.486s  lookup=0.703ms
GC forced
i=170136450  bgAllocs=1454900   at=158.133s  lookup=16.008ms
gc 13 @158.108s 6%: 0.11+17483+0.022 ms clock, 0.44+0.23/17480/34884+0.089 ms cpu, 9639->9656->8463 MB, 12622 MB goal, 4 P
GC forced
gc 14 @295.596s 5%: 0.13+16977+0.024 ms clock, 0.52+28/16977/33847+0.099 ms cpu, 8695->8712->8463 MB, 16927 MB goal, 4 P
GC forced
i=632749868  bgAllocs=3978800   at=432.596s  lookup=11.794ms
gc 15 @432.576s 4%: 0.073+17356+0.021 ms clock, 0.29+0.23/17355/34654+0.084 ms cpu, 8697->8714->8463 MB, 16927 MB goal, 4 P
GC forced
gc 16 @569.938s 4%: 0.17+16911+0.021 ms clock, 0.71+4.9/16908/33760+0.086 ms cpu, 8696->8713->8463 MB, 16926 MB goal, 4 P
GC forced
gc 17 @706.859s 3%: 0.19+17107+0.023 ms clock, 0.76+0.41/17106/34143+0.093 ms cpu, 8697->8714->8463 MB, 16927 MB goal, 4 P
GC forced
i=1327474340 bgAllocs=7763400   at=843.987s  lookup=8.791ms
gc 18 @843.969s 3%: 0.28+17297+0.022 ms clock, 1.1+0.82/17295/34532+0.088 ms cpu, 8696->8713->8463 MB, 16927 MB goal, 4 P
i=1344317800 bgAllocs=7849300   at=861.294s  lookup=13.015ms
...
^C

I suspect this trace section corresponds to the at=432.596s lookup=11.794ms observed latency:

[execution trace screenshot corresponding to the at=432.596s lookup]

In both cases, the traces were started roughly 100 seconds after the start of the process, so the timestamps in the log output and in the trace screenshots are offset by roughly 100 seconds.
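For anyone reproducing this, an execution trace like the ones above can be captured in-process with the runtime/trace package; a minimal sketch (the sleep durations are assumptions chosen to roughly match the runs above, and the real benchmark's capture may differ):

package main

import (
	"log"
	"os"
	"runtime/trace"
	"time"
)

func main() {
	// In the real benchmark the lookup loop runs concurrently; here a
	// sleep stands in for the warm-up so the trace starts ~100s in.
	time.Sleep(100 * time.Second)

	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	// Capture a window long enough to cover a forced GC, then inspect
	// the result with: go tool trace trace.out
	time.Sleep(60 * time.Second)
}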

Related issues

This issue might very well be a duplicate. For example, there is a recent list of 5 known issues in #14812 (comment). That said, if this issue is explained by one of those 5 items, it is not immediately obvious to me which one; some of the issues in that list are resolved now, yet this still happens in go1.14rc1. Also possibly related to #14179, #18155, or others.

@cagedmantis cagedmantis modified the milestones: Backlog, Go1.14 Feb 7, 2020
@cagedmantis cagedmantis added the NeedsInvestigation label Feb 7, 2020
@cagedmantis
Contributor

cagedmantis commented Feb 7, 2020

/cc @aclements @rsc @randall77

@heschi
Contributor

heschi commented Feb 9, 2020

I think this is working as expected. At the heart of github.com/golang/groupcache/lru is a single large map. AFAIK the map has to be marked all in one go, which will take a long time. In the execution trace above you can see a MARK ASSIST segment below G21. If you click on that I think you'll find that it spent all of its time marking a single object, presumably the map. Austin and others will know better, of course.

Since a real server would need to lock the map, sharding it would be a good idea, which would also reduce the latency impact of a mark.
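For concreteness, a minimal sketch of the kind of sharding I have in mind (illustrative only, not a drop-in fix for this benchmark):

package lrubench

import "sync"

const numShards = 64

// shard is one independently locked piece of what was a single big map.
type shard struct {
	mu sync.Mutex
	m  map[int64][]byte
}

type shardedMap struct {
	shards [numShards]shard
}

func newShardedMap() *shardedMap {
	s := &shardedMap{}
	for i := range s.shards {
		s.shards[i].m = make(map[int64][]byte)
	}
	return s
}

func (s *shardedMap) shardFor(k int64) *shard {
	return &s.shards[uint64(k)%numShards]
}

func (s *shardedMap) get(k int64) ([]byte, bool) {
	sh := s.shardFor(k)
	sh.mu.Lock()
	defer sh.mu.Unlock()
	v, ok := sh.m[k]
	return v, ok
}

func (s *shardedMap) put(k int64, v []byte) {
	sh := s.shardFor(k)
	sh.mu.Lock()
	defer sh.mu.Unlock()
	sh.m[k] = v
}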

(cc @rsc, since I think @cagedmantis made a typo.)

@RLH
Contributor

RLH commented Feb 9, 2020 via email

@randall77
Contributor

randall77 commented Feb 9, 2020 via email

@cagedmantis
Contributor

Thanks @heschik

@aclements
Member

Thanks for trying to put something together to recreate this behavior, especially from the little scraps of information we have. :)

Based on the execution traces, I think you're probably right that the idle GC worker is the root of the problem here. The idle worker currently runs until a scheduler preemption, which happens only every 10 to 20ms. The trace shows a fifth goroutine that appears to enter the run queue a moment after the idle worker kicks in, which is presumably the user goroutine.

I agree with @randall77 that this is not likely to be related to the large map, since we do break up large objects when marking.

In general, long linked lists are not very GC-friendly, since the collector has no choice but to walk the list sequentially. That will inherently serialize that part of marking (assuming it's the only work left), though I don't understand the quoted comment about this happening "with a global lock". I don't think there's been anything like a global lock here since we switched to a concurrent GC in Go 1.5.
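For what it's worth, one common mitigation on the application side (not something the runtime can do for you) is to avoid per-element pointers entirely, for example by linking free-list entries by index into one pointer-free slice; a rough sketch with illustrative names:

package lrubench

// item carries no pointers, so the GC does not have to scan the backing
// slice element by element, let alone chase a long chain.
type item struct {
	next    int32 // index of the next item; -1 marks the end of the list
	payload [64]byte
}

// indexFreeList links items by index into one big slice instead of
// chaining ~1M individual heap objects together.
type indexFreeList struct {
	items []item
	head  int32
}

// newIndexFreeList builds a list of n items (n must be >= 1).
func newIndexFreeList(n int) *indexFreeList {
	fl := &indexFreeList{items: make([]item, n)}
	for i := 0; i < n-1; i++ {
		fl.items[i].next = int32(i + 1)
	}
	fl.items[n-1].next = -1
	return fl
}

// take pops the head index, or returns -1 if the list is empty.
func (fl *indexFreeList) take() int32 {
	i := fl.head
	if i >= 0 {
		fl.head = fl.items[i].next
	}
	return i
}

// put pushes index i back onto the head of the list.
func (fl *indexFreeList) put(i int32) {
	fl.items[i].next = fl.head
	fl.head = i
}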

@aclements aclements modified the milestones: Go1.14, Go1.15 Feb 10, 2020
@thepudds
Contributor Author

thepudds commented Feb 11, 2020

The Discord blog post described above said they saw GC-related latency problems with Go 1.9.2 (and later said elsewhere that they had tested, but seemingly not deployed, Go 1.10), so part of the intent here was to see whether the pauses they reported could be reproduced in Go 1.9.2.

Here are some results from running that benchmark above in a loop overnight, with ten 15-minute runs first on 1.9.2 and then on 1.14rc1.

Comparing 1.9.2 to 1.14rc1

  • 1.9.2 was seeing occasional delays in the range 32->64ms, as well as above 64ms. None of those were present in 1.14rc1.
  • The frequency of delays in the range 4->16ms reduced by a factor of roughly 40x with 1.14rc1.
  • The frequency of delays in the range 1->4ms reduced by a factor of roughly 100x with 1.14rc1.
  • This was a latency benchmark and not really focused on throughput, but 1.14rc1 did about 30% more work over the same time interval.
  • Delays in the range 16->32ms were rare in both cases (1 or 0 out of a billion lookups or so), but persist in Go 1.14rc1, which is partly why this issue was opened.

Each line here in these two tables shows the counts of delays in different ranges for one 15-minute run:

Go 1.9.2 counts of delays

 0->1ms: 1034773027  1->4ms: 1640  4->16ms: 188  16->32ms: 0  32->64ms: 1  64ms->∞: 2
 0->1ms: 1020927457  1->4ms: 1232  4->16ms: 144  16->32ms: 1  32->64ms: 2  64ms->∞: 0
 0->1ms: 1027859639  1->4ms: 1724  4->16ms: 189  16->32ms: 0  32->64ms: 1  64ms->∞: 2
 0->1ms: 1023974382  1->4ms: 981   4->16ms: 109  16->32ms: 0  32->64ms: 0  64ms->∞: 0
 0->1ms: 1024773755  1->4ms: 1767  4->16ms: 186  16->32ms: 1  32->64ms: 0  64ms->∞: 1
 0->1ms: 1028694483  1->4ms: 1755  4->16ms: 187  16->32ms: 1  32->64ms: 0  64ms->∞: 2
 0->1ms: 1028308359  1->4ms: 2226  4->16ms: 236  16->32ms: 1  32->64ms: 0  64ms->∞: 1
 0->1ms: 1034795065  1->4ms: 1273  4->16ms: 143  16->32ms: 0  32->64ms: 0  64ms->∞: 0
 0->1ms: 1033581601  1->4ms: 1643  4->16ms: 172  16->32ms: 0  32->64ms: 0  64ms->∞: 0
 0->1ms: 1027894298  1->4ms: 1412  4->16ms: 152  16->32ms: 1  32->64ms: 0  64ms->∞: 0

Go 1.14rc1 counts of delays

 0->1ms: 1359388697  1->4ms: 6     4->16ms: 6    16->32ms: 1  32->64ms: 0  64ms->∞: 0
 0->1ms: 1355108389  1->4ms: 9     4->16ms: 5    16->32ms: 0  32->64ms: 0  64ms->∞: 0
 0->1ms: 1344033293  1->4ms: 13    4->16ms: 3    16->32ms: 0  32->64ms: 0  64ms->∞: 0
 0->1ms: 1333374534  1->4ms: 21    4->16ms: 5    16->32ms: 1  32->64ms: 0  64ms->∞: 0
 0->1ms: 1340122378  1->4ms: 19    4->16ms: 4    16->32ms: 0  32->64ms: 0  64ms->∞: 0
 0->1ms: 1353391580  1->4ms: 14    4->16ms: 6    16->32ms: 1  32->64ms: 0  64ms->∞: 0
 0->1ms: 1354059777  1->4ms: 9     4->16ms: 5    16->32ms: 0  32->64ms: 0  64ms->∞: 0
 0->1ms: 1366321263  1->4ms: 14    4->16ms: 3    16->32ms: 0  32->64ms: 0  64ms->∞: 0
 0->1ms: 1350138536  1->4ms: 16    4->16ms: 6    16->32ms: 0  32->64ms: 0  64ms->∞: 0
 0->1ms: 1366766376  1->4ms: 15    4->16ms: 1    16->32ms: 0  32->64ms: 0  64ms->∞: 0

I'm including the raw results here so that anyone running this benchmark in the future can see some of the current run-to-run variability.
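For reference, the bucketing above is just a coarse histogram over individual lookup times; a rough sketch of the approach (bucket boundaries match the table headings; the actual benchmark code may differ in detail):

package lrubench

import "time"

// delayHist is a coarse histogram of individual lookup latencies.
type delayHist struct {
	counts [6]int64 // 0->1ms, 1->4ms, 4->16ms, 16->32ms, 32->64ms, 64ms->∞
}

func (h *delayHist) record(d time.Duration) {
	switch {
	case d < time.Millisecond:
		h.counts[0]++
	case d < 4*time.Millisecond:
		h.counts[1]++
	case d < 16*time.Millisecond:
		h.counts[2]++
	case d < 32*time.Millisecond:
		h.counts[3]++
	case d < 64*time.Millisecond:
		h.counts[4]++
	default:
		h.counts[5]++
	}
}

// timeLookup times a single LRU lookup and records it in the histogram.
func timeLookup(h *delayHist, lookup func()) {
	start := time.Now()
	lookup()
	h.record(time.Since(start))
}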

@thepudds
Contributor Author

thepudds commented Feb 11, 2020

Also, the 5 worst lookups seen in 1.9.2 across those ten runs:

i=509412400  bgAllocs=4601400   at=486.606s  lookup=260.360ms
i=863625700  bgAllocs=7347600   at=775.641s  lookup=171.862ms
i=4738900    bgAllocs=760700    at=89.159s   lookup=137.284ms
i=163862600  bgAllocs=2008200   at=216.857s  lookup=117.707ms
i=429438200  bgAllocs=4006925   at=423.171s  lookup=117.672ms

As I noted above, this is a simplified benchmark; in particular, the lookups are all done sequentially, so a GC-induced delay of >100ms might be recorded in only one measurement here, whereas in a real system with parallel inbound requests the same delay might impact many requests at once and hence show up in multiple response-time measurements.

@ianlancetaylor
Contributor

Rolling forward to 1.16.

@ianlancetaylor ianlancetaylor modified the milestones: Go1.15, Go1.16 Jun 16, 2020
@lwpyr

lwpyr commented Aug 21, 2020

[execution trace screenshot: one P doing GC mark work while the other Ps sit idle]
I observe a similar phenomenon in our production code; it may be the same issue. While one processor is doing the marking work, the other processors are idle. G89980 in the picture is blocked in a select waiting on a timer, but the runtime seems to fail to preempt the idle GC processors and therefore fails to wake G89980 on time, so the whole GC behaves like a stop-the-world for 20+ ms. The Go version is 1.15.

@aclements
Member

Rolling forward to 1.17 since this is certainly too risky to address in 1.16 at this point. Hopefully @mknyszek's broader investigation into GC scheduling will improve the situation here.

@aclements aclements modified the milestones: Go1.16, Go1.17 Dec 22, 2020
@ianlancetaylor
Contributor

I'm guessing that this isn't done for 1.17. Should we move this to 1.18 or should we move it to Backlog?

@mknyszek
Contributor

Looking over this quickly, I think it'll be helped a bit by #44163. It looks like the scheduler saw the chance to schedule an idle GC worker because the only piece of available work hadn't yet been observed in the run queue, and the worker then proceeded to hog the whole quantum.

I think there are other good reasons to remove the idle GC workers, and this is one more. I would really like to remove them early in the cycle for 1.18.

@gopherbot

Change https://go.dev/cl/393394 mentions this issue: runtime: reduce max idle mark workers during periodic GC cycles

@mknyszek
Contributor

Re-reading the issue, https://go.dev/cl/393394 should solve it. Please feel free to reopen if that's not true.

@golang golang locked and limited conversation to collaborators Apr 26, 2023