
runtime: sub optimal gc scalability #21056

Open
TocarIP opened this issue Jul 17, 2017 · 13 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Comments

@TocarIP
Contributor

TocarIP commented Jul 17, 2017

What version of Go are you using (go version)?

go version devel +4e9c86a Wed Jun 28 17:33:40 2017 +0000 linux/amd64

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/localdisk/itocar/gopath"
GORACE=""
GOROOT="/localdisk/itocar/golang"
GOTOOLDIR="/localdisk/itocar/golang/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build633808214=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"

What did you do?

Run test/bench/garbage/tree2 on a machine with 88 threads (2 sockets, 22 cores per socket, 2 threads per core), 2x Xeon E5-2699 v4,
with the following options:
./tree2 -cpus=88 -heapsize=1000000000 -cpuprofile=tree2.pprof

What did you expect to see?

runtime.gcDrain taking an insignificant amount of time

What did you see instead?

runtime.gcDrain taking about half of all time:

Showing top 10 nodes out of 33
      flat  flat%   sum%        cum   cum%
    36.95s 45.03% 45.03%     75.98s 92.59%  runtime.gcDrain /localdisk/itocar/golang/src/runtime/mgcmark.go
    12.38s 15.09% 60.11%     12.38s 15.09%  runtime.(*lfstack).pop /localdisk/itocar/golang/src/runtime/lfstack.go
     7.51s  9.15% 69.27%      7.51s  9.15%  runtime.greyobject /localdisk/itocar/golang/src/runtime/mgcmark.go
     6.28s  7.65% 76.92%     19.49s 23.75%  runtime.scanobject /localdisk/itocar/golang/src/runtime/mgcmark.go
     4.54s  5.53% 82.45%      4.54s  5.53%  runtime.(*lfstack).push /localdisk/itocar/golang/src/runtime/lfstack.go

Looking into runtime.gcDrain, I see that almost all time is spent on
35.66s 35.66s 924: if work.full == 0 {

I couldn't reproduce this behavior on a machine with a small number of cores.
Looking at the cache-miss profile shows that this is due to all cores updating the head of work.full,
which causes every read needed for the check to miss the cache.
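
For illustration only (this is not runtime code), a standalone sketch of why a single shared head word hurts at this core count: every CAS on the word by one core invalidates the cache line on all the other cores, so even plain loads of it, like the work.full == 0 check, pay for coherence traffic. On a many-core machine the "shared" variant below should be dramatically slower than the cache-line-padded "sharded" one.

package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
	"time"
)

// padded keeps each counter on its own 64-byte cache line.
type padded struct {
	n atomic.Uint64
	_ [56]byte
}

func run(workers int, shared bool) time.Duration {
	const opsPerWorker = 1 << 20
	counters := make([]padded, workers)
	var wg sync.WaitGroup
	start := time.Now()
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			slot := &counters[i]
			if shared {
				slot = &counters[0] // all workers contend on one cache line
			}
			for j := 0; j < opsPerWorker; j++ {
				slot.n.Add(1)
			}
		}(i)
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	p := runtime.GOMAXPROCS(0)
	fmt.Println("shared: ", run(p, true))
	fmt.Println("sharded:", run(p, false))
}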

@TocarIP
Contributor Author

TocarIP commented Jul 17, 2017

Also reproducible on 1.8

@ianlancetaylor
Contributor

CC @RLH @aclements

@aclements
Member

Curiously, while I was able to reproduce this on a 48 thread (24 core) machine with ./tree2 -cpus=48 -heapsize=1000000000, I wasn't able to reproduce it on that machine with garbage -benchmem 1024.

@TocarIP, have you seen this in other/more realistic benchmarks? I wonder if there's something pessimal about tree2's heap.

@bradfitz bradfitz added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jul 18, 2017
@TocarIP
Contributor Author

TocarIP commented Jul 18, 2017

I got a report from a customer about excessive cache misses in gcDrain, but there was no reproducer. Tree2 is just a random benchmark where I was able to reproduce it. For ./garbage -benchmem=1024 I see about 13% of time in gcDrain[N] on the machine above (88 threads), with almost all of that time spent in the same work.full == 0 check.
perf output below:

 12.19%  garbage  garbage           [.] runtime.scanobject
   6.68%  garbage  garbage           [.] runtime.gcDrainN
   6.49%  garbage  garbage           [.] runtime.gcDrain
   6.45%  garbage  garbage           [.] runtime.greyobject
   5.90%  garbage  garbage           [.] runtime.mallocgc
   4.87%  garbage  garbage           [.] runtime.heapBitsForObject
   4.59%  garbage  garbage           [.] runtime.lock
   2.77%  garbage  garbage           [.] runtime.lfstackpop

@gopherbot

Change https://golang.org/cl/62971 mentions this issue: runtime: reduce contention in gcDrain

@RLH
Contributor

RLH commented Sep 12, 2017 via email

@TocarIP
Contributor Author

TocarIP commented Sep 13, 2017

I agree that tree2 is not very representative, but I'm not sure what is.
E.g. golang.org/x/benchmarks/garbage got 10% faster in the 88-thread case:

Garbage/benchmem-MB=1024-88                   944µs ± 1%                   854µs ± 2%  -9.50%

Is this more representative?
I'm more than willing to measure some other, more representative case; I'm just not sure what exactly to measure.
As for literature, the idea of having several read/write locations was proposed in e.g. https://people.csail.mit.edu/shanir/publications/Lock_Free.pdf . TBB uses a similar data structure for its priority queue (https://github.com/01org/tbb/blob/tbb_2018/src/tbb/task_stream.h), but with an extra bitmask for faster checking. I didn't bother with the bitmask because the current version reduced the time spent in gcDrain to a reasonable level by itself.
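
For illustration, a minimal sketch of the several-read/write-locations idea (not the code in the CL): the list is split into per-P shards, a push goes to the caller's shard, and a pop scans all shards starting with the caller's own, so contention spreads across several head words instead of one. The sketch ignores the ABA problem; the runtime's real lfstack copes with it by packing a counter into its 64-bit head word.

package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// node stands in for a workbuf on the list; payload omitted.
type node struct {
	next *node
}

// shard is padded so neighbouring heads don't share a cache line.
type shard struct {
	head atomic.Pointer[node]
	_    [56]byte
}

type partitionedStack struct {
	shards []shard
}

func newPartitionedStack() *partitionedStack {
	return &partitionedStack{shards: make([]shard, runtime.GOMAXPROCS(0))}
}

// push adds n to the shard picked by the caller-supplied hint
// (in the runtime this could be the current P's id).
func (s *partitionedStack) push(hint int, n *node) {
	h := &s.shards[hint%len(s.shards)].head
	for {
		old := h.Load()
		n.next = old
		if h.CompareAndSwap(old, n) {
			return
		}
	}
}

// pop tries the caller's own shard first, then scans the others.
// NOTE: nodes here are never reused, which sidesteps ABA; the real
// lfstack needs its version counter for this CAS to be safe.
func (s *partitionedStack) pop(hint int) *node {
	for i := 0; i < len(s.shards); i++ {
		h := &s.shards[(hint+i)%len(s.shards)].head
		for {
			old := h.Load()
			if old == nil {
				break // this shard looks empty, try the next one
			}
			if h.CompareAndSwap(old, old.next) {
				return old
			}
		}
	}
	return nil // every shard looked empty at some point during the scan
}

// empty reports whether every shard looked empty; like pop, under
// concurrency this is only a heuristic (see the linearizability note
// later in the thread).
func (s *partitionedStack) empty() bool {
	for i := range s.shards {
		if s.shards[i].head.Load() != nil {
			return false
		}
	}
	return true
}

func main() {
	s := newPartitionedStack()
	s.push(0, &node{})
	fmt.Println(s.empty())       // false
	fmt.Println(s.pop(1) != nil) // true: the scan finds shard 0's node
	fmt.Println(s.empty())       // true
}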

domodwyer added a commit to domodwyer/mpjbt that referenced this issue Sep 27, 2017
When running on machines with large numbers of cores (~40+) a Golang
garbage collection bug seems to dominate runtime - see
golang/go#21056 for more info.

This change limits the number of cores to what we found was most
performant on our hardware - others may want to experiment.
@rsc rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017
@ianlancetaylor ianlancetaylor modified the milestones: Go1.11, Go1.12 Jul 10, 2018
@aclements aclements modified the milestones: Go1.12, Go1.13 Jan 8, 2019
@aclements aclements modified the milestones: Go1.13, Go1.14 Jun 25, 2019
@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
@zhuangqh

zhuangqh commented Aug 22, 2023

Hi Go maintainers @rsc @aclements, I also hit this issue when running containerd on a large-CPU machine (192 cores). Setting GOMAXPROCS to 192, or anything larger than 32, leads to much worse GC latency.
Here is some investigation.

machine spec

Architecture:        x86_64
CPU(s):              192
On-line CPU(s) list: 0-191
Thread(s) per core:  2
Core(s) per socket:  48
Model name:          AMD EPYC 7K62 48-Core Processor
NUMA node0 CPU(s):   0-47,96-143
NUMA node1 CPU(s):   48-95,144-191

go version 1.21

My workload is a typical container engine, which forks container processes and performs some I/O. We create containers with high concurrency at high speed, so our business is very sensitive to latency (containers are started with a concurrency of 200).

GOMAXPROCS=32

The top entries are syscalls and rtsigprocmask, invoked by forking.

(pprof) top
Showing nodes accounting for 42.81s, 82.61% of 51.82s total
Dropped 1077 nodes (cum <= 0.26s)
Showing top 10 nodes out of 308
      flat  flat%   sum%        cum   cum%
    33.85s 65.32% 65.32%     33.85s 65.32%  runtime/internal/syscall.Syscall6
     5.79s 11.17% 76.50%      5.79s 11.17%  runtime.rtsigprocmask
     0.82s  1.58% 78.08%      0.82s  1.58%  runtime.futex
     0.43s  0.83% 78.91%      2.15s  4.15%  runtime.mallocgc
     0.42s  0.81% 79.72%      0.42s  0.81%  runtime.epollctl
     0.37s  0.71% 80.43%      0.57s  1.10%  runtime.heapBitsSetType
     0.33s  0.64% 81.07%      0.33s  0.64%  runtime.nextFreeFast (inline)
     0.28s  0.54% 81.61%      0.28s  0.54%  runtime.epollwait
     0.26s   0.5% 82.11%      0.26s   0.5%  runtime.markBits.isMarked
     0.26s   0.5% 82.61%      0.28s  0.54%  runtime.step

Scheduler wait is high due to the small number of Ps and the highly concurrent forking, together with 20000+ resident goroutines.

So I tried increasing GOMAXPROCS.

GOMAXPROCS=96

(pprof) top
Showing nodes accounting for 55.11s, 78.05% of 70.61s total
Dropped 1221 nodes (cum <= 0.35s)
Showing top 10 nodes out of 349
      flat  flat%   sum%        cum   cum%
    33.48s 47.42% 47.42%     33.48s 47.42%  runtime/internal/syscall.Syscall6
    11.10s 15.72% 63.14%     11.98s 16.97%  runtime.(*lfstack).pop (inline)
     4.15s  5.88% 69.01%      4.15s  5.88%  runtime.rtsigprocmask
     1.53s  2.17% 71.18%      2.03s  2.87%  runtime.gcDrainN
     1.34s  1.90% 73.08%      1.34s  1.90%  runtime.futex
     0.88s  1.25% 74.32%      0.88s  1.25%  runtime.lfstackUnpack (inline)
     0.78s  1.10% 75.43%      0.78s  1.10%  runtime.epollctl
     0.69s  0.98% 76.41%     13.41s 18.99%  runtime.gcDrain
     0.66s  0.93% 77.34%      5.24s  7.42%  runtime.mallocgc

lfstack and gcDrain cost a lot of CPU.

Diving into the go trace: 25% of Ps are in the GC (dedicated) state, but the remaining ones seem to do no useful work; all of them are busy doing GC (idle).
The entire GC cycle took about 110 milliseconds, and almost no user threads were executing. This introduces huge latency.

GOMAXPROCS=96 with patch

I've applied @TocarIP's patch, and also changed workType.empty to partionedStack, since it was still a hot spot under 96 Ps.

type workType struct {
	full  partionedStack // lock-free list of full blocks workbuf
	empty partionedStack // lock-free list of empty blocks workbuf
	// ... remaining fields unchanged ...
}

After that, lfstack.pop is reduced to an acceptable level:

(pprof) top
Showing nodes accounting for 27.76s, 74.15% of 37.44s total
Dropped 1068 nodes (cum <= 0.19s)
Showing top 10 nodes out of 371
      flat  flat%   sum%        cum   cum%
    21.55s 57.56% 57.56%     21.55s 57.56%  runtime/internal/syscall.Syscall6
     2.28s  6.09% 63.65%      2.28s  6.09%  runtime.rtsigprocmask
     0.96s  2.56% 66.21%      0.96s  2.56%  runtime.(*lfstack).pop (inline)
     0.81s  2.16% 68.38%      0.81s  2.16%  runtime.futex
     0.43s  1.15% 69.52%      2.21s  5.90%  runtime.mallocgc
     0.40s  1.07% 70.59%      0.40s  1.07%  runtime.epollctl
     0.39s  1.04% 71.63%      0.39s  1.04%  runtime.nextFreeFast (inline)
     0.34s  0.91% 72.54%      0.34s  0.91%  runtime.(*lfstack).empty (inline)
     0.31s  0.83% 73.37%      2.16s  5.77%  runtime.gentraceback
     0.29s  0.77% 74.15%      0.47s  1.26%  runtime.heapBitsSetType

and user goroutines get a chance to run during GC.

The whole gc time is reduced to 25ms.

conclusion

@TocarIP's patch makes sense; it reduces CPU cache misses when performing the CAS.
I'm willing to take over this patch if you are not working on it 😄 @TocarIP

@TocarIP
Contributor Author

TocarIP commented Aug 22, 2023

I'm not working on it, so feel free to take over

@zhuangqh

I'm not working on it, so feel free to take over

There is an issue with the partitioned stack 😂. It is not linearizable, because the partitioning introduces multiple linearization points.
For example, goroutine A performs a pop, iterating the shards in index order 2, 3, 0, 1. Another goroutine B may push successfully to index 3 after A has already iterated past it. Thus A pops nothing, even though the stack is non-empty.
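
A contrived, deterministic replay of that interleaving (a hypothetical four-shard layout; channels force the schedule, nothing here is the actual patch): goroutine A scans the shards in order 2, 3, 0, 1; goroutine B stores an element into shard 3 only after A has passed it, so A reports finding nothing even though the element is present for the rest of A's scan and afterwards.

package main

import (
	"fmt"
	"sync/atomic"
)

func main() {
	var shards [4]atomic.Pointer[int]
	order := []int{2, 3, 0, 1}

	passedShard3 := make(chan struct{})
	pushed := make(chan struct{})
	found := make(chan bool)

	// Goroutine A: the popper's scan.
	go func() {
		got := false
		for _, i := range order {
			if shards[i].Load() != nil {
				got = true
			}
			if i == 3 {
				close(passedShard3) // A has moved past shard 3...
				<-pushed            // ...now let B push there.
			}
		}
		found <- got
	}()

	// Goroutine B: pushes only after A has already scanned shard 3.
	go func() {
		<-passedShard3
		v := 42
		shards[3].Store(&v)
		close(pushed)
	}()

	fmt.Println("pop found an element:", <-found)                            // false
	fmt.Println("shard 3 occupied after the scan:", shards[3].Load() != nil) // true
}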

@zhuangqh

zhuangqh commented Sep 4, 2023

Here is a realistic benchmark that reproduces this issue: https://github.com/grpc/grpc-go/blob/master/benchmark/run_bench.sh

Run a heavy benchmark:

./run_bench.sh -c 96 -r 1 -req 4096 -resp 4096

On an 8-core machine, everything works fine:

(pprof) top gcDrain
Active filters:
   focus=gcDrain
Showing nodes accounting for 1.91s, 2.91% of 65.55s total
Dropped 48 nodes (cum <= 0.33s)
Showing top 10 nodes out of 92
      flat  flat%   sum%        cum   cum%
     0.44s  0.67%  0.67%      0.56s  0.85%  runtime.findObject
     0.41s  0.63%  1.30%      0.41s  0.63%  runtime.procyield
     0.38s  0.58%  1.88%      1.41s  2.15%  runtime.scanobject
     0.11s  0.17%  2.04%      0.11s  0.17%  runtime.(*gcBits).bitp (inline)
     0.11s  0.17%  2.21%      0.22s  0.34%  runtime.(*unwinder).resolveInternal
     0.10s  0.15%  2.36%      0.10s  0.15%  runtime.heapBits.nextFast (inline)
     0.10s  0.15%  2.52%      0.13s   0.2%  runtime.heapBitsForAddr
     0.09s  0.14%  2.65%      0.09s  0.14%  runtime.(*mspan).base (inline)
     0.09s  0.14%  2.79%      0.11s  0.17%  runtime.spanOf (inline)
     0.08s  0.12%  2.91%      0.24s  0.37%  runtime.pcvalue

On a 96-core machine:

(pprof) top gcDrain
Active filters:
   focus=gcDrain
Showing nodes accounting for 258.07s, 26.51% of 973.33s total
Dropped 184 nodes (cum <= 4.87s)
Showing top 10 nodes out of 63
      flat  flat%   sum%        cum   cum%
   198.70s 20.41% 20.41%    198.79s 20.42%  runtime.(*lfstack).pop (inline)
    23.07s  2.37% 22.78%     23.07s  2.37%  runtime.tgkill
     9.79s  1.01% 23.79%    168.38s 17.30%  runtime.gcDrain
     6.12s  0.63% 24.42%      6.12s  0.63%  runtime.(*lfstack).push
     5.15s  0.53% 24.95%      5.19s  0.53%  runtime.(*gcBits).bitp (inline)
     4.12s  0.42% 25.37%      4.55s  0.47%  runtime.findObject
     3.99s  0.41% 25.78%     98.81s 10.15%  runtime.gcDrainN
     3.46s  0.36% 26.14%    191.66s 19.69%  runtime.getempty
     3.30s  0.34% 26.48%     13.73s  1.41%  runtime.scanobject
     0.37s 0.038% 26.51%     10.48s  1.08%  runtime.greyobject

@zhuangqh

zhuangqh commented Sep 4, 2023

After investigation, the elimination-backoff stack algorithm may be a better fit for this problem.
I created an initial implementation here: https://go-review.googlesource.com/c/go/+/525435
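
For reference, a minimal sketch of the elimination-backoff idea (Hendler, Shavit and Yerushalmi), not the code in the CL: a Treiber stack whose operations, after a failed CAS on the head, try to meet an opposite operation in a small elimination array, so a colliding push and pop can cancel each other without touching the contended head at all. The runtime's lfstack works on packed uint64 head words and cannot be written this directly; the sketch only shows the algorithm.

package main

import (
	"fmt"
	"math/rand"
	"sync"
	"sync/atomic"
	"time"
)

type node struct {
	val  int
	next *node
}

// exchanger is one slot of the elimination array: a pusher parks its
// node here briefly; a popper that arrives in time takes it.
type exchanger struct {
	slot atomic.Pointer[node]
}

type eliminationStack struct {
	head atomic.Pointer[node]
	elim [8]exchanger
}

func (s *eliminationStack) Push(v int) {
	n := &node{val: v}
	for {
		old := s.head.Load()
		n.next = old
		if s.head.CompareAndSwap(old, n) {
			return
		}
		// Head is contended: park the node in a random slot and wait
		// briefly for a popper to take it.
		e := &s.elim[rand.Intn(len(s.elim))]
		if e.slot.CompareAndSwap(nil, n) {
			time.Sleep(time.Microsecond) // rendezvous window
			if !e.slot.CompareAndSwap(n, nil) {
				return // a popper took the node: the push has happened
			}
			// Nobody came; the node is withdrawn, retry on the head.
		}
	}
}

func (s *eliminationStack) Pop() (int, bool) {
	for {
		old := s.head.Load()
		if old == nil {
			return 0, false
		}
		if s.head.CompareAndSwap(old, old.next) {
			return old.val, true
		}
		// Head is contended: try to take a node a pusher parked in the
		// elimination array, bypassing the head entirely.
		e := &s.elim[rand.Intn(len(s.elim))]
		if n := e.slot.Load(); n != nil && e.slot.CompareAndSwap(n, nil) {
			return n.val, true
		}
	}
}

func main() {
	var s eliminationStack
	var wg sync.WaitGroup
	for i := 0; i < 64; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s.Push(i)
			s.Pop()
		}(i)
	}
	wg.Wait()
	fmt.Println("done")
}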

What do you think about this issue, @aclements @mknyszek @ianlancetaylor? I'm struggling with GC latency when running on machines with a large number of CPUs.

I am not an expert on the Go runtime and need some help.

@mknyszek mknyszek added the compiler/runtime Issues related to the Go compiler and/or runtime. label Sep 5, 2023
@mknyszek
Contributor

mknyszek commented Sep 5, 2023

CC @golang/runtime
