runtime: sub optimal gc scalability #21056
Comments
Also reproducible on 1.8 |
CC @RLH @aclements |
Curiously, while I was able to reproduce this on a 48 thread (24 core) machine, @TocarIP, have you seen this in other/more realistic benchmarks? I wonder if there's something pessimal about tree2's heap. |
I've got a report from a customer about excessive cache misses in gcDrain, but there was no reproducer. Tree2 is just the random benchmark where I was able to reproduce it. For ./garbage -benchmem=1024 I see about 13% of time in gcDrain[N] on the machine above (88 threads), with almost all of that time spent in the same work.full == 0 check.
|
Change https://golang.org/cl/62971 mentions this issue: runtime: reduce contention in gcDrain |
Tree2 builds a large tree and then calls runtime.GC() a number of times so
it isn't a very representative benchmark. Does the non-blocking partitioned
stack data structure appear in the literature? Googling doesn't immediately
reveal a reference.
|
I agree that tree2 is not very representative, but I'm not sure what is representative.
Is this more representative? |
When running on machines with large numbers of cores (~40+) a Golang garbage collection bug seems to dominate runtime - see golang/go#21056 for more info. This change limits the number of cores to what we found was most performant on our hardware - others may want to experiment.
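For reference, a minimal sketch of the kind of workaround that change describes: capping the number of Ps at startup. The cap of 32 below is purely illustrative; as the comment says, the best value depends on the hardware.

```go
package main

import "runtime"

func init() {
	// Hypothetical cap: the referenced change limits the number of cores
	// to whatever was most performant on their hardware; 32 is only an example.
	const maxProcs = 32
	if runtime.GOMAXPROCS(0) > maxProcs {
		runtime.GOMAXPROCS(maxProcs)
	}
}

func main() {
	// Application code now runs with at most maxProcs Ps.
}
```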
Hi Go maintainers @rsc @aclements, I also hit this issue when running containerd on a machine with many CPUs (192 cores). Setting GOMAXPROCS to 192, or to anything much larger than 32, causes much worse GC latency.

Machine spec: 192 cores. Go version: 1.21.

My workload is a typical container engine, which forks container processes and performs some I/O. Because we need to create containers with high concurrency at high speed, our business is very latency-sensitive (containers are started with a concurrency of 200).

GOMAXPROCS=32: the top operations are syscall and rt_sigprocmask, invoked by forking. Scheduler wait is high due to the small number of Ps combined with highly concurrent forking and 20000+ resident goroutines, so I tried increasing GOMAXPROCS.

GOMAXPROCS=96: lfstack and gcDrain cost a lot of CPU. Diving into the go trace, 25% of Ps are in the GC (dedicated) state, but the remaining goroutines seem to do nothing; all of them are busy doing GC (idle).

GOMAXPROCS=96 with patch: I applied @TocarIP's patch, and also changed workType.empty to partionedStack, which was still a hot spot under 96 Ps:

type workType struct {
	full  partionedStack // lock-free list of full blocks workbuf
	empty partionedStack // lock-free list of empty blocks workbuf
	...
}

After that, lfstackpop is reduced to an acceptable level and user goroutines get a chance to run during GC. The whole GC time is reduced to 25ms.

Conclusion: @TocarIP's patch makes sense; it reduces CPU cache misses when performing CAS. |
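For illustration only, a minimal sketch of what such a partitioned lock-free stack could look like, assuming the idea is to shard the single global head across several cache-line-padded Treiber stacks; the names partitionedStack and shard and the per-caller sharding are assumptions, not the actual patch:

```go
package main

import (
	"runtime"
	"sync/atomic"
)

type pnode struct {
	next  *pnode
	value uint64
}

// shard holds one independent Treiber-stack head, padded so that
// neighbouring shards do not share a cache line.
// Note: this sketch ignores the ABA problem that the real runtime lfstack
// guards against with a packed counter; treat nodes as pushed at most once.
type shard struct {
	head atomic.Pointer[pnode]
	_    [56]byte
}

type partitionedStack struct {
	shards []shard
}

func newPartitionedStack() *partitionedStack {
	return &partitionedStack{shards: make([]shard, runtime.GOMAXPROCS(0))}
}

// push prepends n onto the shard chosen by the caller's index (e.g. its P id),
// so CAS traffic is spread over many heads instead of one.
func (s *partitionedStack) push(i int, n *pnode) {
	sh := &s.shards[i%len(s.shards)]
	for {
		old := sh.head.Load()
		n.next = old
		if sh.head.CompareAndSwap(old, n) {
			return
		}
	}
}

// pop tries the caller's own shard first and then scans the others,
// so an "any work left?" check does not hammer a single shared word.
func (s *partitionedStack) pop(i int) *pnode {
	for off := 0; off < len(s.shards); off++ {
		sh := &s.shards[(i+off)%len(s.shards)]
		for {
			old := sh.head.Load()
			if old == nil {
				break
			}
			if sh.head.CompareAndSwap(old, old.next) {
				return old
			}
		}
	}
	return nil
}
```

The trade-off in this kind of sharding is that a single hot cache line is exchanged for a scan over several mostly-idle heads, which is consistent with the lfstackpop reduction reported above.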
I'm not working on it, so feel free to take over |
There is an issue with the partitioned stack 😂: it is not "linear", because partitioning introduces many linearization points. |
There is a realistic benchmark that reproduces this issue: https://github.com/grpc/grpc-go/blob/master/benchmark/run_bench.sh. Issue a heavy benchmark: ./run_bench.sh -c 96 -r 1 -req 4096 -resp 4096. On an 8-core machine everything works fine; on a 96-core machine the problem reproduces.
|
After investigation, the elimination-backoff stack algorithm may be a better algorithm for this problem. What do you think about this issue, @aclements @mknyszek @ianlancetaylor? I'm struggling with GC latency when running on a machine with a large number of CPUs. I am not an expert on the Go runtime and need some help.... |
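For reference, a minimal sketch of the elimination-backoff idea (in the spirit of Hendler, Shavit and Yerushalmi's scalable lock-free stack), written here as a plain Treiber stack whose contended pushes and pops try to pair up through channel-based elimination slots; all names are illustrative and this is not the runtime's lfstack:

```go
package main

import (
	"math/rand"
	"sync/atomic"
	"time"
)

type node struct {
	value int
	next  *node
}

// eliminationStack is a Treiber stack with an elimination layer: when the
// CAS on the shared head fails, a Push and a Pop try to hand the value over
// directly through a random slot instead of immediately retrying the head.
type eliminationStack struct {
	head atomic.Pointer[node]
	elim [16]chan int // each slot pairs exactly one pusher with one popper
}

func newEliminationStack() *eliminationStack {
	s := &eliminationStack{}
	for i := range s.elim {
		s.elim[i] = make(chan int) // unbuffered: a send completes only when a popper receives
	}
	return s
}

func (s *eliminationStack) Push(v int) {
	n := &node{value: v}
	for {
		old := s.head.Load()
		n.next = old
		if s.head.CompareAndSwap(old, n) {
			return
		}
		// Contention: back off onto a random elimination slot and wait
		// briefly for a concurrent Pop to take the value directly.
		select {
		case s.elim[rand.Intn(len(s.elim))] <- v:
			return // eliminated; the shared head was never touched
		case <-time.After(time.Microsecond):
			// no partner showed up; retry the CAS
		}
	}
}

func (s *eliminationStack) Pop() (int, bool) {
	for {
		old := s.head.Load()
		if old == nil {
			return 0, false
		}
		if s.head.CompareAndSwap(old, old.next) {
			return old.value, true
		}
		// Contention: try to eliminate against a concurrent Push instead.
		select {
		case v := <-s.elim[rand.Intn(len(s.elim))]:
			return v, true
		case <-time.After(time.Microsecond):
			// no partner; retry the CAS
		}
	}
}
```

The appeal of the elimination layer is that matched push/pop pairs complete without ever writing the shared head, so pressure on that one cache line drops as contention rises.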
CC @golang/runtime |
Some of the bottlenecks here on lfstack look related to #68399. Regarding the GOMAXPROCS=96 case with lots of idle workers: the idle GC workers should only be running if there are no user goroutines to run, so they shouldn't cause "huge latency" spikes; if they do, something definitely seems wrong there. To be clear, some approach to avoid hammering the lfstack so hard does seem like it would make sense, but that may be within the lfstack itself (such as your backoff), or some more fundamental reorganization to avoid needing to hit it so hard in the first place. |
There is another case of this in #65064 (comment). After the epoll issue is resolved, the next visible scalability bottleneck is on |
What version of Go are you using (go version)?
go version devel +4e9c86a Wed Jun 28 17:33:40 2017 +0000 linux/amd64
What operating system and processor architecture are you using (go env)?
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/localdisk/itocar/gopath"
GORACE=""
GOROOT="/localdisk/itocar/golang"
GOTOOLDIR="/localdisk/itocar/golang/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build633808214=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
What did you do?
Run test/bench/garbage/tree2 on a machine with 88 threads (2 sockets, 22 cores per socket, 2 threads per core), 2x Xeon E5-2699 v4, with the following options:
./tree2 -cpus=88 -heapsize=1000000000 -cpuprofile=tree2.pprof
What did you expect to see?
runtime.gcDrain taking an insignificant amount of time
What did you see instead?
runtime.gcDrain taking about half of all time:
Looking into runtime.gcDrain, I see that almost all time is spent on
35.66s 35.66s 924: if work.full == 0 {
I couldn't reproduce this behavior on a machine with a small number of cores.
Looking at the cache-miss profile shows that this is due to all cores updating the head of work.full,
which causes the reads needed for this check to miss the cache.
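As a rough illustration (not the runtime's code), the pattern looks like the sketch below: every worker keeps CASing the shared head word while also polling it, so the cache line holding it bounces between all the cores and the cheap-looking emptiness check becomes a stream of cache misses on a many-core machine.

```go
package main

import (
	"runtime"
	"sync"
	"sync/atomic"
)

func main() {
	var head atomic.Uint64 // stands in for the lfstack head word behind work.full
	var wg sync.WaitGroup
	for i := 0; i < runtime.GOMAXPROCS(0); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for n := 0; n < 1_000_000; n++ {
				// Writers keep invalidating the line...
				old := head.Load()
				head.CompareAndSwap(old, old+1)
				// ...so this read (the analogue of `work.full == 0`)
				// almost always misses the cache under high core counts.
				if head.Load() == 0 {
					continue
				}
			}
		}()
	}
	wg.Wait()
}
```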