
runtime: sub optimal gc scalability #21056

Open
TocarIP opened this issue Jul 17, 2017 · 13 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Comments

@TocarIP
Contributor

TocarIP commented Jul 17, 2017

What version of Go are you using (go version)?

go version devel +4e9c86a Wed Jun 28 17:33:40 2017 +0000 linux/amd64

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/localdisk/itocar/gopath"
GORACE=""
GOROOT="/localdisk/itocar/golang"
GOTOOLDIR="/localdisk/itocar/golang/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build633808214=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"

What did you do?

Run test/bench/garbage/tree2 on a machine with 88 threads (2 sockets, 22 cores per socket, 2 threads per core), 2x Xeon E5-2699 v4,
with the following options:
./tree2 -cpus=88 -heapsize=1000000000 -cpuprofile=tree2.pprof

What did you expect to see?

runtime.gcDrain taking an insignificant amount of time

What did you see instead?

runtime.gcDrain taking about half of all time:

Showing top 10 nodes out of 33
      flat  flat%   sum%        cum   cum%
    36.95s 45.03% 45.03%     75.98s 92.59%  runtime.gcDrain /localdisk/itocar/golang/src/runtime/mgcmark.go
    12.38s 15.09% 60.11%     12.38s 15.09%  runtime.(*lfstack).pop /localdisk/itocar/golang/src/runtime/lfstack.go
     7.51s  9.15% 69.27%      7.51s  9.15%  runtime.greyobject /localdisk/itocar/golang/src/runtime/mgcmark.go
     6.28s  7.65% 76.92%     19.49s 23.75%  runtime.scanobject /localdisk/itocar/golang/src/runtime/mgcmark.go
     4.54s  5.53% 82.45%      4.54s  5.53%  runtime.(*lfstack).push /localdisk/itocar/golang/src/runtime/lfstack.go

Looking into runtime.gcDrain, I see that almost all time is spent on
35.66s 35.66s 924: if work.full == 0 {

I couldn't reproduce this behavior on a machine with a small number of cores.
Looking at the cache-miss profile shows that this is due to all cores updating the head of work.full,
which causes every read needed for the check to miss the cache.
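
For illustration only (this is not runtime code), a standalone sketch of why a single shared head word hurts at this core count: every CAS on the word by one core invalidates the cache line on all the other cores, so even plain loads of it, like the work.full == 0 check, pay for coherence traffic. On a many-core machine the "shared" variant below should be dramatically slower than the cache-line-padded "sharded" one.

package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
	"time"
)

// padded keeps each counter on its own 64-byte cache line.
type padded struct {
	n atomic.Uint64
	_ [56]byte
}

func run(workers int, shared bool) time.Duration {
	const opsPerWorker = 1 << 20
	counters := make([]padded, workers)
	var wg sync.WaitGroup
	start := time.Now()
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			slot := &counters[i]
			if shared {
				slot = &counters[0] // all workers contend on one cache line
			}
			for j := 0; j < opsPerWorker; j++ {
				slot.n.Add(1)
			}
		}(i)
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	p := runtime.GOMAXPROCS(0)
	fmt.Println("shared: ", run(p, true))
	fmt.Println("sharded:", run(p, false))
}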

@TocarIP
Contributor Author

TocarIP commented Jul 17, 2017

Also reproducible on 1.8

@ianlancetaylor
Contributor

CC @RLH @aclements

@aclements
Member

Curiously, while I was able to reproduce this on a 48 thread (24 core) machine with ./tree2 -cpus=48 -heapsize=1000000000, I wasn't able to reproduce it on that machine with garbage -benchmem 1024.

@TocarIP, have you seen this in other/more realistic benchmarks? I wonder if there's something pessimal about tree2's heap.

@bradfitz bradfitz added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jul 18, 2017
@TocarIP
Contributor Author

TocarIP commented Jul 18, 2017

I got a report from a customer about excessive cache misses in gcDrain, but there was no reproducer. Tree2 is just a random benchmark where I was able to reproduce it. For ./garbage -benchmem=1024 I see about 13% of time in gcDrain[N] on the machine above (88 threads), with almost all of that time spent in the same work.full == 0 check.
perf output below:

 12.19%  garbage  garbage           [.] runtime.scanobject
   6.68%  garbage  garbage           [.] runtime.gcDrainN
   6.49%  garbage  garbage           [.] runtime.gcDrain
   6.45%  garbage  garbage           [.] runtime.greyobject
   5.90%  garbage  garbage           [.] runtime.mallocgc
   4.87%  garbage  garbage           [.] runtime.heapBitsForObject
   4.59%  garbage  garbage           [.] runtime.lock
   2.77%  garbage  garbage           [.] runtime.lfstackpop

@gopherbot

Change https://golang.org/cl/62971 mentions this issue: runtime: reduce contention in gcDrain

@RLH
Contributor

RLH commented Sep 12, 2017 via email

@TocarIP
Contributor Author

TocarIP commented Sep 13, 2017

I agree that tree2 is not very representative, but I'm not sure what is.
E.g. golang.org/x/benchmarks/garbage got 10% faster in the 88-thread case:

Garbage/benchmem-MB=1024-88                   944µs ± 1%                   854µs ± 2%  -9.50%

Is this more representative?
I'm more than willing to measure some other, more representative case; I'm just not sure what exactly to measure.
As for literature, the idea of having several read/write locations was proposed in e.g. https://people.csail.mit.edu/shanir/publications/Lock_Free.pdf . TBB uses a similar data structure for its priority queue (https://github.com/01org/tbb/blob/tbb_2018/src/tbb/task_stream.h), but with an extra bitmask for faster checking. I didn't bother with the bitmask because the current version reduced the time spent in gcDrain to a reasonable level by itself.
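
For illustration, a minimal sketch of the several-read/write-locations idea (not the code in the CL): the list is split into per-P shards, a push goes to the caller's shard, and a pop scans all shards starting with the caller's own, so contention spreads across several head words instead of one. The sketch ignores the ABA problem; the runtime's real lfstack copes with it by packing a counter into its 64-bit head word.

package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// node stands in for a workbuf on the list; payload omitted.
type node struct {
	next *node
}

// shard is padded so neighbouring heads don't share a cache line.
type shard struct {
	head atomic.Pointer[node]
	_    [56]byte
}

type partitionedStack struct {
	shards []shard
}

func newPartitionedStack() *partitionedStack {
	return &partitionedStack{shards: make([]shard, runtime.GOMAXPROCS(0))}
}

// push adds n to the shard picked by the caller-supplied hint
// (in the runtime this could be the current P's id).
func (s *partitionedStack) push(hint int, n *node) {
	h := &s.shards[hint%len(s.shards)].head
	for {
		old := h.Load()
		n.next = old
		if h.CompareAndSwap(old, n) {
			return
		}
	}
}

// pop tries the caller's own shard first, then scans the others.
// NOTE: nodes here are never reused, which sidesteps ABA; the real
// lfstack needs its version counter for this CAS to be safe.
func (s *partitionedStack) pop(hint int) *node {
	for i := 0; i < len(s.shards); i++ {
		h := &s.shards[(hint+i)%len(s.shards)].head
		for {
			old := h.Load()
			if old == nil {
				break // this shard looks empty, try the next one
			}
			if h.CompareAndSwap(old, old.next) {
				return old
			}
		}
	}
	return nil // every shard looked empty at some point during the scan
}

// empty reports whether every shard looked empty; like pop, under
// concurrency this is only a heuristic (see the linearizability note
// later in the thread).
func (s *partitionedStack) empty() bool {
	for i := range s.shards {
		if s.shards[i].head.Load() != nil {
			return false
		}
	}
	return true
}

func main() {
	s := newPartitionedStack()
	s.push(0, &node{})
	fmt.Println(s.empty())       // false
	fmt.Println(s.pop(1) != nil) // true: the scan finds shard 0's node
	fmt.Println(s.empty())       // true
}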

domodwyer added a commit to domodwyer/mpjbt that referenced this issue Sep 27, 2017
When running on machines with large numbers of cores (~40+) a Golang
garbage collection bug seems to dominate runtime - see
golang/go#21056 for more info.

This change limits the number of cores to what we found was most
performant on our hardware - others may want to experiment.
@rsc rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017
@ianlancetaylor ianlancetaylor modified the milestones: Go1.11, Go1.12 Jul 10, 2018
@aclements aclements modified the milestones: Go1.12, Go1.13 Jan 8, 2019
@aclements aclements modified the milestones: Go1.13, Go1.14 Jun 25, 2019
@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
@zhuangqh

zhuangqh commented Aug 22, 2023

Hi Go maintainers @rsc @aclements, I also hit this issue when running containerd on a large-CPU machine (192 cores). Setting GOMAXPROCS to 192, or anything larger than 32, leads to much worse GC latency.
Here is some investigation.

machine spec

Architecture:        x86_64
CPU(s):              192
On-line CPU(s) list: 0-191
Thread(s) per core:  2
Core(s) per socket:  48
Model name:          AMD EPYC 7K62 48-Core Processor
NUMA node0 CPU(s):   0-47,96-143
NUMA node1 CPU(s):   48-95,144-191

go version 1.21

My workload is a typical container engine, which forks container processes and performs some I/O. We create containers with high concurrency at high speed, so our business is very sensitive to latency (containers are started with a concurrency of 200).

GOMAXPROCS=32

The top entries are syscalls and rtsigprocmask, invoked by forking.

(pprof) top
Showing nodes accounting for 42.81s, 82.61% of 51.82s total
Dropped 1077 nodes (cum <= 0.26s)
Showing top 10 nodes out of 308
      flat  flat%   sum%        cum   cum%
    33.85s 65.32% 65.32%     33.85s 65.32%  runtime/internal/syscall.Syscall6
     5.79s 11.17% 76.50%      5.79s 11.17%  runtime.rtsigprocmask
     0.82s  1.58% 78.08%      0.82s  1.58%  runtime.futex
     0.43s  0.83% 78.91%      2.15s  4.15%  runtime.mallocgc
     0.42s  0.81% 79.72%      0.42s  0.81%  runtime.epollctl
     0.37s  0.71% 80.43%      0.57s  1.10%  runtime.heapBitsSetType
     0.33s  0.64% 81.07%      0.33s  0.64%  runtime.nextFreeFast (inline)
     0.28s  0.54% 81.61%      0.28s  0.54%  runtime.epollwait
     0.26s   0.5% 82.11%      0.26s   0.5%  runtime.markBits.isMarked
     0.26s   0.5% 82.61%      0.28s  0.54%  runtime.step

Scheduler wait is high due to the small number of Ps and the highly concurrent forking, together with 20000+ resident goroutines.

So I tried increasing GOMAXPROCS.

GOMAXPROCS=96

(pprof) top
Showing nodes accounting for 55.11s, 78.05% of 70.61s total
Dropped 1221 nodes (cum <= 0.35s)
Showing top 10 nodes out of 349
      flat  flat%   sum%        cum   cum%
    33.48s 47.42% 47.42%     33.48s 47.42%  runtime/internal/syscall.Syscall6
    11.10s 15.72% 63.14%     11.98s 16.97%  runtime.(*lfstack).pop (inline)
     4.15s  5.88% 69.01%      4.15s  5.88%  runtime.rtsigprocmask
     1.53s  2.17% 71.18%      2.03s  2.87%  runtime.gcDrainN
     1.34s  1.90% 73.08%      1.34s  1.90%  runtime.futex
     0.88s  1.25% 74.32%      0.88s  1.25%  runtime.lfstackUnpack (inline)
     0.78s  1.10% 75.43%      0.78s  1.10%  runtime.epollctl
     0.69s  0.98% 76.41%     13.41s 18.99%  runtime.gcDrain
     0.66s  0.93% 77.34%      5.24s  7.42%  runtime.mallocgc

lfstack and gcDrain cost a lot of CPU.

Diving into the go trace: 25% of Ps are in the GC (dedicated) state, but the remaining ones seem to do no useful work; all of them are busy doing GC (idle).
The entire GC cycle took about 110 milliseconds, and almost no user threads were executing. This introduces huge latency.

GOMAXPROCS=96 with patch

I've applied @TocarIP's patch, and also changed workType.empty to partionedStack, since it was still a hot spot under 96 Ps.

type workType struct {
	full  partionedStack // lock-free list of full blocks workbuf
	empty partionedStack // lock-free list of empty blocks workbuf
	// ... remaining fields unchanged ...
}

After that, lfstack.pop is reduced to an acceptable level:

(pprof) top
Showing nodes accounting for 27.76s, 74.15% of 37.44s total
Dropped 1068 nodes (cum <= 0.19s)
Showing top 10 nodes out of 371
      flat  flat%   sum%        cum   cum%
    21.55s 57.56% 57.56%     21.55s 57.56%  runtime/internal/syscall.Syscall6
     2.28s  6.09% 63.65%      2.28s  6.09%  runtime.rtsigprocmask
     0.96s  2.56% 66.21%      0.96s  2.56%  runtime.(*lfstack).pop (inline)
     0.81s  2.16% 68.38%      0.81s  2.16%  runtime.futex
     0.43s  1.15% 69.52%      2.21s  5.90%  runtime.mallocgc
     0.40s  1.07% 70.59%      0.40s  1.07%  runtime.epollctl
     0.39s  1.04% 71.63%      0.39s  1.04%  runtime.nextFreeFast (inline)
     0.34s  0.91% 72.54%      0.34s  0.91%  runtime.(*lfstack).empty (inline)
     0.31s  0.83% 73.37%      2.16s  5.77%  runtime.gentraceback
     0.29s  0.77% 74.15%      0.47s  1.26%  runtime.heapBitsSetType

and user goroutines get a chance to run during GC.

The whole gc time is reduced to 25ms.

conclusion

@TocarIP's patch makes sense; it reduces CPU cache misses when performing the CAS.
I'm willing to take over this patch if you are not working on it 😄 @TocarIP

@TocarIP
Contributor Author

TocarIP commented Aug 22, 2023

I'm not working on it, so feel free to take over

@zhuangqh

I'm not working on it, so feel free to take over

There is an issue with the partitioned stack 😂. It is not linearizable, because the partitioning introduces multiple linearization points.
For example, goroutine A performs a pop, iterating the shards in index order 2, 3, 0, 1. Another goroutine B may push successfully to index 3 after A has already iterated past it. Thus A pops nothing, even though the stack is non-empty.
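
A contrived, deterministic replay of that interleaving (a hypothetical four-shard layout; channels force the schedule, nothing here is the actual patch): goroutine A scans the shards in order 2, 3, 0, 1; goroutine B stores an element into shard 3 only after A has passed it, so A reports finding nothing even though the element is present for the rest of A's scan and afterwards.

package main

import (
	"fmt"
	"sync/atomic"
)

func main() {
	var shards [4]atomic.Pointer[int]
	order := []int{2, 3, 0, 1}

	passedShard3 := make(chan struct{})
	pushed := make(chan struct{})
	found := make(chan bool)

	// Goroutine A: the popper's scan.
	go func() {
		got := false
		for _, i := range order {
			if shards[i].Load() != nil {
				got = true
			}
			if i == 3 {
				close(passedShard3) // A has moved past shard 3...
				<-pushed            // ...now let B push there.
			}
		}
		found <- got
	}()

	// Goroutine B: pushes only after A has already scanned shard 3.
	go func() {
		<-passedShard3
		v := 42
		shards[3].Store(&v)
		close(pushed)
	}()

	fmt.Println("pop found an element:", <-found)                            // false
	fmt.Println("shard 3 occupied after the scan:", shards[3].Load() != nil) // true
}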

@zhuangqh

zhuangqh commented Sep 4, 2023

Here is a realistic benchmark that reproduces this issue: https://github.com/grpc/grpc-go/blob/master/benchmark/run_bench.sh

Run a heavy benchmark:

./run_bench.sh -c 96 -r 1 -req 4096 -resp 4096

On an 8-core machine, everything works fine:

(pprof) top gcDrain
Active filters:
   focus=gcDrain
Showing nodes accounting for 1.91s, 2.91% of 65.55s total
Dropped 48 nodes (cum <= 0.33s)
Showing top 10 nodes out of 92
      flat  flat%   sum%        cum   cum%
     0.44s  0.67%  0.67%      0.56s  0.85%  runtime.findObject
     0.41s  0.63%  1.30%      0.41s  0.63%  runtime.procyield
     0.38s  0.58%  1.88%      1.41s  2.15%  runtime.scanobject
     0.11s  0.17%  2.04%      0.11s  0.17%  runtime.(*gcBits).bitp (inline)
     0.11s  0.17%  2.21%      0.22s  0.34%  runtime.(*unwinder).resolveInternal
     0.10s  0.15%  2.36%      0.10s  0.15%  runtime.heapBits.nextFast (inline)
     0.10s  0.15%  2.52%      0.13s   0.2%  runtime.heapBitsForAddr
     0.09s  0.14%  2.65%      0.09s  0.14%  runtime.(*mspan).base (inline)
     0.09s  0.14%  2.79%      0.11s  0.17%  runtime.spanOf (inline)
     0.08s  0.12%  2.91%      0.24s  0.37%  runtime.pcvalue

On a 96-core machine:

(pprof) top gcDrain
Active filters:
   focus=gcDrain
Showing nodes accounting for 258.07s, 26.51% of 973.33s total
Dropped 184 nodes (cum <= 4.87s)
Showing top 10 nodes out of 63
      flat  flat%   sum%        cum   cum%
   198.70s 20.41% 20.41%    198.79s 20.42%  runtime.(*lfstack).pop (inline)
    23.07s  2.37% 22.78%     23.07s  2.37%  runtime.tgkill
     9.79s  1.01% 23.79%    168.38s 17.30%  runtime.gcDrain
     6.12s  0.63% 24.42%      6.12s  0.63%  runtime.(*lfstack).push
     5.15s  0.53% 24.95%      5.19s  0.53%  runtime.(*gcBits).bitp (inline)
     4.12s  0.42% 25.37%      4.55s  0.47%  runtime.findObject
     3.99s  0.41% 25.78%     98.81s 10.15%  runtime.gcDrainN
     3.46s  0.36% 26.14%    191.66s 19.69%  runtime.getempty
     3.30s  0.34% 26.48%     13.73s  1.41%  runtime.scanobject
     0.37s 0.038% 26.51%     10.48s  1.08%  runtime.greyobject

@zhuangqh

zhuangqh commented Sep 4, 2023

After investigation, the elimination-backoff stack algorithm may be a better fit for this problem.
I created an initial implementation here: https://go-review.googlesource.com/c/go/+/525435
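
For reference, a minimal sketch of the elimination-backoff idea (Hendler, Shavit and Yerushalmi), not the code in the CL: a Treiber stack whose operations, after a failed CAS on the head, try to meet an opposite operation in a small elimination array, so a colliding push and pop can cancel each other without touching the contended head at all. The runtime's lfstack works on packed uint64 head words and cannot be written this directly; the sketch only shows the algorithm.

package main

import (
	"fmt"
	"math/rand"
	"sync"
	"sync/atomic"
	"time"
)

type node struct {
	val  int
	next *node
}

// exchanger is one slot of the elimination array: a pusher parks its
// node here briefly; a popper that arrives in time takes it.
type exchanger struct {
	slot atomic.Pointer[node]
}

type eliminationStack struct {
	head atomic.Pointer[node]
	elim [8]exchanger
}

func (s *eliminationStack) Push(v int) {
	n := &node{val: v}
	for {
		old := s.head.Load()
		n.next = old
		if s.head.CompareAndSwap(old, n) {
			return
		}
		// Head is contended: park the node in a random slot and wait
		// briefly for a popper to take it.
		e := &s.elim[rand.Intn(len(s.elim))]
		if e.slot.CompareAndSwap(nil, n) {
			time.Sleep(time.Microsecond) // rendezvous window
			if !e.slot.CompareAndSwap(n, nil) {
				return // a popper took the node: the push has happened
			}
			// Nobody came; the node is withdrawn, retry on the head.
		}
	}
}

func (s *eliminationStack) Pop() (int, bool) {
	for {
		old := s.head.Load()
		if old == nil {
			return 0, false
		}
		if s.head.CompareAndSwap(old, old.next) {
			return old.val, true
		}
		// Head is contended: try to take a node a pusher parked in the
		// elimination array, bypassing the head entirely.
		e := &s.elim[rand.Intn(len(s.elim))]
		if n := e.slot.Load(); n != nil && e.slot.CompareAndSwap(n, nil) {
			return n.val, true
		}
	}
}

func main() {
	var s eliminationStack
	var wg sync.WaitGroup
	for i := 0; i < 64; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s.Push(i)
			s.Pop()
		}(i)
	}
	wg.Wait()
	fmt.Println("done")
}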

What do you think about this issue, @aclements @mknyszek @ianlancetaylor? I'm struggling with GC latency when running on machines with a large number of CPUs.

I am not an expert on the Go runtime and need some help.

@mknyszek mknyszek added the compiler/runtime Issues related to the Go compiler and/or runtime. label Sep 5, 2023
@mknyszek
Contributor

mknyszek commented Sep 5, 2023

CC @golang/runtime
