
runtime: frequent enlisting of short-lived background workers leads to performance regression with async preemption #37465

Open
klauspost opened this issue Feb 26, 2020 · 40 comments
Labels
help wanted, NeedsFix (The path to resolution is known, but the work has not been done.), Performance

Comments

@klauspost (Contributor) commented Feb 26, 2020

What version of Go are you using (go version)?

λ go version
go version go1.14 windows/amd64

λ go version
go version go1.13.8 windows/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
λ go env                                                                                                                                       
set GO111MODULE=                                                                                                                               
set GOARCH=amd64                                                                                                                               
set GOBIN=                                                                                                                                     
set GOCACHE=C:\Users\klaus\AppData\Local\go-build                                                                                              
set GOENV=C:\Users\klaus\AppData\Roaming\go\env                                                                                                
set GOEXE=.exe                                                                                                                                 
set GOFLAGS=                                                                                                                                   
set GOHOSTARCH=amd64                                                                                                                           
set GOHOSTOS=windows                                                                                                                           
set GONOPROXY=                                                                                                                                 
set GONOSUMDB=                                                                                                                                 
set GOOS=windows                                                                                                                               
set GOPATH=e:\gopath                                                                                                                           
set GOPRIVATE=                                                                                                                                 
set GOPROXY=https://goproxy.io                                                                                                                 
set GOROOT=c:\go                                                                                                                               
set GOSUMDB=sum.golang.org                                                                                                                     
set GOTMPDIR=                                                                                                                                  
set GOTOOLDIR=c:\go\pkg\tool\windows_amd64                                                                                                     
set GCCGO=gccgo                                                                                                                                
set AR=ar                                                                                                                                      
set CC=gcc                                                                                                                                     
set CXX=g++                                                                                                                                    
set CGO_ENABLED=1                                                                                                                              
set GOMOD=                                                                                                                                     
set CGO_CFLAGS=-g -O2                                                                                                                          
set CGO_CPPFLAGS=                                                                                                                              
set CGO_CXXFLAGS=-g -O2                                                                                                                        
set CGO_FFLAGS=-g -O2                                                                                                                          
set CGO_LDFLAGS=-g -O2                                                                                                                         
set PKG_CONFIG=pkg-config                                                                                                                      
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0 -fdebug-prefix-map=d:\temp\wintemp\go-build155272862=/tmp/go-build -gno-record-gcc-switches   

What did you do?

Run benchmark: https://play.golang.org/p/WeuJg6yaOuJ

go test -bench=. -test.benchtime=10s was used to run it.
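
For readers without access to the playground link, here is a minimal sketch of what the benchmark plausibly looks like, reconstructed from the sub-benchmark names and the ~814 kB / 16-17 allocs per op reported further down (the payload size and exact structure are assumptions, not the actual playground source):

	package main

	import (
		"compress/flate"
		"compress/gzip"
		"io/ioutil"
		"testing"
	)

	func BenchmarkCompressAllocationsSingle(b *testing.B) {
		// Enough input to exercise all of the compressor's internal allocations.
		payload := make([]byte, 800<<10)
		b.Run("flate", func(b *testing.B) {
			b.ReportAllocs()
			for i := 0; i < b.N; i++ {
				// A fresh writer per operation: this is what allocates ~814 kB/op.
				w, err := flate.NewWriter(ioutil.Discard, flate.DefaultCompression)
				if err != nil {
					b.Fatal(err)
				}
				_, _ = w.Write(payload)
				_ = w.Close()
			}
		})
		b.Run("gzip", func(b *testing.B) {
			b.ReportAllocs()
			for i := 0; i < b.N; i++ {
				w := gzip.NewWriter(ioutil.Discard)
				_, _ = w.Write(payload)
				_ = w.Close()
			}
		})
	}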

What did you expect to see?

Close or similar benchmark speeds.

What did you see instead?

40% performance regression.

λ benchcmp go113.txt go114.txt
benchmark                                       old ns/op     new ns/op     delta
BenchmarkCompressAllocationsSingle/flate-32     87026         121741        +39.89%
BenchmarkCompressAllocationsSingle/gzip-32      88654         122632        +38.33%

This is not a purely theoretical benchmark. While suboptimal, this is the easiest way to compress a piece of data, so this pattern will be seen in the wild. It could also indicate a general regression for applications that allocate a lot.

Edit: This is not related to changes in the referenced packages. I see this when using packages outside the stdlib as well.
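
To make concrete what "suboptimal but easiest" means here, a sketch of the two patterns (compressEasy and compressPooled are illustrative names, not code from the benchmark): the first allocates a fresh ~800 kB compressor per call, the second amortizes that state with Reset and a sync.Pool.

	package main

	import (
		"compress/gzip"
		"io"
		"sync"
	)

	// compressEasy is the "easiest way": a fresh writer per payload,
	// paying the full internal-state allocation on every call.
	func compressEasy(dst io.Writer, data []byte) error {
		zw := gzip.NewWriter(dst)
		if _, err := zw.Write(data); err != nil {
			return err
		}
		return zw.Close()
	}

	// compressPooled reuses writer state across calls via Reset.
	var zwPool = sync.Pool{New: func() interface{} { return gzip.NewWriter(nil) }}

	func compressPooled(dst io.Writer, data []byte) error {
		zw := zwPool.Get().(*gzip.Writer)
		defer zwPool.Put(zw)
		zw.Reset(dst)
		if _, err := zw.Write(data); err != nil {
			return err
		}
		return zw.Close()
	}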

@klauspost (Contributor, Author)

Somebody proposed removing ReportAllocs. It has an impact, but not a very big one:

λ benchcmp go113-no.txt go114-no.txt
benchmark                                       old ns/op     new ns/op     delta
BenchmarkCompressAllocationsSingle/flate-32     88082         117794        +33.73%
BenchmarkCompressAllocationsSingle/gzip-32      89279         118946        +33.23%

@klauspost (Contributor, Author) commented Feb 26, 2020

Turning off DEP seems to reduce the impact, but it is still considerably slower:

λ bcdedit.exe /set {current} nx AlwaysOff, then reboot. Confirm the setting with λ wmic OS Get DataExecutionPrevention_SupportPolicy, which should report 0.

λ benchcmp go113-depoff.txt go114-depoff.txt
benchmark                                       old ns/op     new ns/op     delta
BenchmarkCompressAllocationsSingle/flate-32     87619         107249        +22.40%
BenchmarkCompressAllocationsSingle/gzip-32      89344         107670        +20.51%

This actually matches my initial measurements before isolating the issue, so I am unsure if this is just some variance based on system state.

@cagedmantis changed the title from "go 1.14: Big allocation speed regression" to "runtime: big allocation speed regression in go1.14" on Feb 26, 2020
@josharian (Contributor)

cc @mknyszek

@cagedmantis (Contributor)

/cc @aclements @rsc @randall77

@cagedmantis added the NeedsInvestigation label (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.) on Feb 26, 2020
@cagedmantis added this to the Backlog milestone on Feb 26, 2020
@mknyszek self-assigned this on Feb 26, 2020
@mknyszek

This comment has been minimized.

@mknyszek (Contributor)

Or... hm. I spoke too soon. Five runs show that the benchmark is pretty noisy on Linux, and it's not obvious that there's a regression.

name                                old time/op    new time/op    delta
CompressAllocationsSingle/flate-48     182µs ±12%     169µs ± 4%    ~     (p=0.151 n=5+5)
CompressAllocationsSingle/gzip-48      170µs ± 7%     184µs ± 9%    ~     (p=0.056 n=5+5)

name                                old alloc/op   new alloc/op   delta
CompressAllocationsSingle/flate-48     814kB ± 0%     814kB ± 0%  +0.00%  (p=0.016 n=4+5)
CompressAllocationsSingle/gzip-48      814kB ± 0%     814kB ± 0%  +0.00%  (p=0.008 n=5+5)

name                                old allocs/op  new allocs/op  delta
CompressAllocationsSingle/flate-48      16.0 ± 0%      16.0 ± 0%    ~     (all equal)
CompressAllocationsSingle/gzip-48       17.0 ± 0%      17.0 ± 0%    ~     (all equal)

That's what I get for drawing conclusions from just one benchmark run. This probably needs many more runs to show a concrete difference, if one exists.

Since this seems much more easily reproducible on Windows, I'm going to try there.

@leafrhythms commented Feb 26, 2020

Cannot reproduce on Linux.

$ go env

GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/leaf/.cache/go-build"
GOENV="/home/leaf/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/leaf/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/lib/go-1.14"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/go-1.14/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build636896530=/tmp/go-build -gno-record-gcc-switches" 

$ ~/go/bin/benchstat bench_1.1{3,4}.txt

name                               old time/op  new time/op  delta
CompressAllocationsSingle/flate-8  180µs ± 2%   181µs ± 4%   ~        (p=1.000 n=10+10)
CompressAllocationsSingle/gzip-8   183µs ± 7%   204µs ± 1%   +11.38%  (p=0.000 n=9+10)

Edit: benchmark run with ReportAllocs removed.

@mknyszek (Contributor)

While I set up my Windows environment: @klauspost, what suggests to you that this is a regression in allocation performance rather than something else?

@klauspost (Contributor, Author) commented Feb 26, 2020

@mknyszek It is only really doing allocations. The compression part is trivial, but it needs enough content to activate all allocations.

Profiles: alloc-pprof.zip

λ go tool pprof cpu113.pprof
Type: cpu
Time: Feb 26, 2020 at 5:05pm (CET)
Duration: 25.46s, Total samples = 1.01mins (238.80%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 39.56s, 65.07% of 60.80s total
Dropped 227 nodes (cum <= 0.30s)
Showing top 10 nodes out of 144
      flat  flat%   sum%        cum   cum%
    11.28s 18.55% 18.55%     14.95s 24.59%  runtime.scanobject
     5.51s  9.06% 27.62%      5.54s  9.11%  runtime.stdcall2
     4.86s  7.99% 35.61%      4.86s  7.99%  runtime.runGCProg
     4.76s  7.83% 43.44%      4.76s  7.83%  runtime.memclrNoHeapPointers
     3.82s  6.28% 49.72%      3.83s  6.30%  runtime.stdcall3
        2s  3.29% 53.01%      2.03s  3.34%  runtime.runqgrab
     1.97s  3.24% 56.25%      1.97s  3.24%  runtime.procyield
     1.94s  3.19% 59.44%      1.96s  3.22%  runtime.(*lfstack).pop
     1.86s  3.06% 62.50%      1.90s  3.12%  runtime.stdcall1
     1.56s  2.57% 65.07%     21.19s 34.85%  runtime.gcDrain

λ go tool pprof cpu114.pprof
Type: cpu
Time: Feb 26, 2020 at 5:07pm (CET)
Duration: 26.61s, Total samples = 1.34mins (303.24%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 1.07mins, 79.31% of 1.34mins total
Dropped 271 nodes (cum <= 0.01mins)
Showing top 10 nodes out of 95
      flat  flat%   sum%        cum   cum%
  0.54mins 40.51% 40.51%   0.55mins 40.59%  runtime.stdcall2
  0.17mins 12.93% 53.43%   0.22mins 16.34%  runtime.scanobject
  0.12mins  8.99% 62.42%   0.12mins  8.99%  runtime.memclrNoHeapPointers
  0.08mins  5.94% 68.36%   0.08mins  6.00%  runtime.stdcall1
  0.07mins  4.87% 73.23%   0.07mins  4.87%  runtime.runGCProg
  0.02mins  1.65% 74.88%   0.02mins  1.65%  runtime.procyield
  0.02mins  1.28% 76.15%   0.02mins  1.28%  runtime.runqgrab
  0.02mins  1.17% 77.32%   0.02mins  1.18%  runtime.stdcall7
  0.01mins  1.00% 78.32%   0.02mins  1.61%  runtime.greyobject
  0.01mins  0.99% 79.31%   0.03mins  1.90%  runtime.findObject

@klauspost (Contributor, Author) commented Feb 26, 2020

[Attached CPU profile SVGs for go1.13 and go1.14; referenced in a later comment.]

@leitzler (Contributor)

FWIW, I ran it on Linux arm64 as well (a shared VM, so anything goes), but the results seem consistent:

$ benchcmp go11{3,4}.txt
benchcmp is deprecated in favor of benchstat: https://pkg.go.dev/golang.org/x/perf/cmd/benchstat
benchmark                                      old ns/op     new ns/op     delta
BenchmarkCompressAllocationsSingle/flate-8     408292        491127        +20.29%
BenchmarkCompressAllocationsSingle/gzip-8      365706        453792        +24.09%

$ go env
GO111MODULE=""
GOARCH="arm64"
GOBIN=""
GOCACHE="/home/leitzler/.cache/go-build"
GOENV="/home/leitzler/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/leitzler/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_arm64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build024325012=/tmp/go-build -gno-record-gcc-switches"

@mknyszek (Contributor) commented Feb 26, 2020

@klauspost Thanks for the profiles! I was about to do this myself.

I disagree that this indicates anything to do with allocation; none of those symbols are involved in allocation. The first allocation-related symbol I see in the top 20 of the diff is right at the bottom (and it's a deleted symbol at that!).
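
For context, the listing below diffs the go1.14 profile against the go1.13 one (negative values mean less time in go1.14), presumably produced by an invocation along these lines (the exact command isn't shown in the thread):

	go tool pprof -diff_base cpu113.pprof cpu114.pprof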

(pprof) top 20
Showing nodes accounting for 21.44s, 35.26% of 60.80s total
Dropped 261 nodes (cum <= 0.30s)
Showing top 20 nodes out of 173
      flat  flat%   sum%        cum   cum%
    27.17s 44.69% 44.69%     27.21s 44.75%  runtime.stdcall2
    -3.74s  6.15% 38.54%     -3.75s  6.17%  runtime.stdcall3
     2.93s  4.82% 43.36%      2.94s  4.84%  runtime.stdcall1
     2.49s  4.10% 47.45%      2.49s  4.10%  runtime.memclrNoHeapPointers
    -1.48s  2.43% 45.02%     -1.50s  2.47%  runtime.(*lfstack).pop
    -0.97s  1.60% 43.42%        -1s  1.64%  runtime.runqgrab
     0.94s  1.55% 44.97%      0.95s  1.56%  runtime.stdcall7
    -0.93s  1.53% 43.44%     -0.93s  1.53%  runtime.runGCProg
    -0.89s  1.46% 41.97%     31.05s 51.07%  runtime.gcDrain
    -0.88s  1.45% 40.53%     20.40s 33.55%  runtime.lock
    -0.85s  1.40% 39.13%     -1.77s  2.91%  runtime.scanobject
    -0.64s  1.05% 38.08%     -0.64s  1.05%  runtime.procyield
    -0.55s   0.9% 37.17%     -0.53s  0.87%  runtime.heapBits.next
     0.50s  0.82% 37.99%      0.50s  0.82%  runtime.shouldStealTimers (inline)
    -0.45s  0.74% 37.25%     -0.14s  0.23%  runtime.findObject
    -0.43s  0.71% 36.55%     -1.43s  2.35%  runtime.runqsteal
    -0.41s  0.67% 35.87%     -4.83s  7.94%  runtime.findrunnable
    -0.38s  0.62% 35.25%     -0.39s  0.64%  runtime.(*lfstack).push
     0.33s  0.54% 35.79%      0.34s  0.56%  runtime.scanblock
    -0.32s  0.53% 35.26%     -0.40s  0.66%  runtime.(*treapNode).updateInvariants

What's more interesting to me is the prevalence of stdcall2 in the diff. Peeking at that a few times...

(pprof) peek runtime.stdcall2 
Showing nodes accounting for 19.88s, 32.70% of 60.80s total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context 	 	 
----------------------------------------------------------+-------------
                                            21.65s 79.57% |   runtime.semasleep
                                             5.56s 20.43% |   runtime.preemptM
    27.17s 44.69% 44.69%     27.21s 44.75%                | runtime.stdcall2
                                             0.04s  0.15% |   runtime.stdcall
----------------------------------------------------------+-------------
(pprof) peek runtime.semasleep
Showing nodes accounting for 19.88s, 32.70% of 60.80s total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context 	 	 
----------------------------------------------------------+-------------
                                            21.68s   100% |   runtime.lock
     0.03s 0.049% 0.049%     21.68s 35.66%                | runtime.semasleep
                                            21.65s 99.86% |   runtime.stdcall2
----------------------------------------------------------+-------------
(pprof) peek runtime.lock
Showing nodes accounting for 19.88s, 32.70% of 60.80s total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context 	 	 
----------------------------------------------------------+-------------
                                            27.34s 134.02% |   runtime.preemptM
                                            -2.33s 11.42% |   runtime.(*mheap).alloc_m
                                            -1.57s  7.70% |   runtime.findrunnable
                                            -0.79s  3.87% |   runtime.stopm
                                            -0.62s  3.04% |   runtime.exitsyscallfast_pidle
                                            -0.33s  1.62% |   runtime.gcResetMarkState
                                            -0.32s  1.57% |   runtime.(*mheap).allocManual
                                            -0.31s  1.52% |   runtime.(*mheap).freeSpan.func1
                                            -0.22s  1.08% |   runtime.bgscavenge.func2
                                            -0.14s  0.69% |   runtime.timerproc
                                             0.14s  0.69% |   runtime.goschedImpl
                                             0.13s  0.64% |   runtime.(*mheap).allocSpan
                                            -0.11s  0.54% |   runtime.startm
                                            -0.07s  0.34% |   runtime.addtimer
                                            -0.07s  0.34% |   runtime.handoffp
                                            -0.07s  0.34% |   runtime.scavengeSleep
                                            -0.06s  0.29% |   runtime.gcBgMarkWorker.func2
                                            -0.05s  0.25% |   runtime.exitsyscall0
                                            -0.04s   0.2% |   runtime.gcstopm
                                             0.04s   0.2% |   runtime.gcFlushBgCredit
                                            -0.03s  0.15% |   runtime.runSafePointFn
                                            -0.03s  0.15% |   runtime.wakeScavenger
                                             0.02s 0.098% |   runtime.(*pageAlloc).scavengeOne.func1
                                            -0.02s 0.098% |   runtime.addspecial
                                             0.02s 0.098% |   runtime.forEachP
                                            -0.02s 0.098% |   runtime.acquireSudog
                                            -0.02s 0.098% |   runtime.gcSweep
                                            -0.02s 0.098% |   runtime.stackcache_clear
                                            -0.02s 0.098% |   runtime.stopTheWorldWithSema
                                             0.01s 0.049% |   runtime.(*mcentral).cacheSpan
                                             0.01s 0.049% |   runtime.freespecial
                                             0.01s 0.049% |   runtime.mProf_Free
                                            -0.01s 0.049% |   runtime.(*mheap).freeManual
                                            -0.01s 0.049% |   runtime.(*mheap).reclaimChunk
                                            -0.01s 0.049% |   runtime.nextMarkBitArenaEpoch
                                            -0.01s 0.049% |   runtime.scavengeSleep.func1
                                            -0.01s 0.049% |   runtime.setprofilebucket
                                            -0.01s 0.049% |   runtime.startTheWorldWithSema
                                                 0     0% |   runtime.(*mcentral).freeSpan
                                                 0     0% |   runtime.(*mcentral).uncacheSpan
                                                 0     0% |   runtime.(*mheap).reclaim
                                                 0     0% |   runtime.mProf_Malloc
                                                 0     0% |   runtime.schedEnableUser
    -0.88s  1.45%  1.45%     20.40s 33.55%                | runtime.lock
                                            21.68s 106.27% |   runtime.semasleep
                                            -0.58s  2.84% |   runtime.procyield
                                             0.14s  0.69% |   runtime.osyield
                                             0.03s  0.15% |   runtime.semacreate
                                             0.01s 0.049% |   runtime.newstack
----------------------------------------------------------+-------------
                                            -0.15s   100% |   runtime.checkdead
    -0.15s  0.25%  1.69%     -0.15s  0.25%                | runtime.lockextra
----------------------------------------------------------+-------------
(pprof) peek runtime.preemptM
Showing nodes accounting for 19.88s, 32.70% of 60.80s total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context 	 	 
----------------------------------------------------------+-------------
                                            38.18s   100% |   runtime.preemptone
     0.15s  0.25%  0.25%     38.18s 62.80%                | runtime.preemptM
                                            27.34s 71.61% |   runtime.lock
                                             5.56s 14.56% |   runtime.stdcall2
                                             2.92s  7.65% |   runtime.stdcall1
                                             1.14s  2.99% |   runtime.unlock
                                             0.95s  2.49% |   runtime.stdcall7
                                             0.07s  0.18% |   runtime.wantAsyncPreempt (inline)
                                             0.03s 0.079% |   runtime.gFromTLS
                                             0.02s 0.052% |   runtime.isAsyncSafePoint
----------------------------------------------------------+-------------

It looks like a lock in preemptM is either contended or hit very often. I'll try to see exactly what line it's coming from.
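
As an aside, pprof's list command attributes flat samples to individual source lines; something like the session below is presumably how the exact line gets pinned down in the follow-up comment (hypothetical invocation, output elided):

	(pprof) list runtime.preemptM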

@klauspost Your SVGs suggest the same thing (look at the difference between stdcall2 in both) AFAICT.

CC @aclements because asynchronous preemption.

@mknyszek (Contributor)

@leitzler Could you share some profiles? I don't readily have access to dedicated arm64 hardware and I'm curious if that's the same thing or a different issue.

@mknyszek (Contributor)

@aclements Most of it is coming from os_windows.go:1185 in go1.14. The line is lock(&suspendLock).

@mknyszek changed the title from "runtime: big allocation speed regression in go1.14" to "runtime: compression benchmark regression in go1.14 due to async preemption" on Feb 26, 2020
@mknyszek (Contributor)

Digging deeper with @cherrymui, it looks like most of the preemption requests are coming from runtime.(*gcControllerState).enlistWorker. This benchmark is single-threaded, and enlistWorker only tries to preempt user code instead of taking an idle P. I notice that GOMAXPROCS=32 for the originally reported issue, so it should just take an idle P (since there are lots!) but it doesn't. It preempts user code instead.

There's a comment there referencing #19112, which describes the deadlock concerns explaining why we can't just take an idle P (see the excerpt below)...
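
For reference, here is the relevant go1.14 code, lightly abridged from runtime/mgc.go (runtime-internal, shown for context rather than as a compilable unit; the tail of the function is summarized in a comment):

	// enlistWorker encourages another dedicated mark worker to start on
	// another P if there are spare worker slots.
	func (c *gcControllerState) enlistWorker() {
		// If there are idle Ps, wake one so it will run an idle worker.
		// NOTE: This is suspected of causing deadlocks. See golang.org/issue/19112.
		//
		//	if atomic.Load(&sched.npidle) != 0 && atomic.Load(&sched.nmspinning) == 0 {
		//		wakep()
		//		return
		//	}

		// There are no idle Ps. If we need more dedicated workers,
		// try to preempt a running P so it will switch to a worker.
		if c.dedicatedMarkWorkersNeeded <= 0 {
			return
		}
		// Abridged: pick a random other running P (up to five tries)
		// and call preemptone(p) on it, which is what generates the
		// preemptM traffic seen in the profiles above.
	}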

@mknyszek (Contributor)

@leitzler Can you please file a new issue? It's almost certainly unrelated to this one, because this is all about Windows-specific code.

@klauspost (Contributor, Author) commented Feb 26, 2020

@mknyszek No expert on this, but it could be processor specific AFAICT. Running on Ryzen 3950x here - 16c/32t. I will run on my 6c/12t Intel laptop and see if it reproduces as well.

Edit: 6c/12t shows a 24% degradation as well. HOWEVER, running with -cpu=4 makes it neutral. I will run a few more tests on my desktop, since my laptop is a bit noisier.

@leitzler (Contributor)

I’ll give it another run with profiles later tonight when I get home. (In a new issue, depending on how this turns out in the meantime.)

@klauspost (Contributor, Author) commented Feb 26, 2020

The number of threads affects the numbers quite a bit:

Command line: go test -bench=. -test.benchtime=10s -cpu=1,2,4,8,16,32

λ benchcmp 113t.txt 114t.txt
benchmark                                       old ns/op     new ns/op     delta
BenchmarkCompressAllocationsSingle/flate        83597         65062         -22.17%
BenchmarkCompressAllocationsSingle/flate-2      83767         91519         +9.25%
BenchmarkCompressAllocationsSingle/flate-4      75547         80647         +6.75%
BenchmarkCompressAllocationsSingle/flate-8      79197         84755         +7.02%
BenchmarkCompressAllocationsSingle/flate-16     82064         97530         +18.85%
BenchmarkCompressAllocationsSingle/flate-32     88006         106613        +21.14%
BenchmarkCompressAllocationsSingle/gzip         91624         108324        +18.23%
BenchmarkCompressAllocationsSingle/gzip-2       80929         97885         +20.95%
BenchmarkCompressAllocationsSingle/gzip-4       79115         81557         +3.09%
BenchmarkCompressAllocationsSingle/gzip-8       82221         85995         +4.59%
BenchmarkCompressAllocationsSingle/gzip-16      86477         96064         +11.09%
BenchmarkCompressAllocationsSingle/gzip-32      91024         108272        +18.95%

Even though the test is single-threaded, the total number of threads available has a very big impact.

I am rerunning the benchmark with -test.count=5, but so far the numbers match up. The outliers are not flukes AFAICT.

Edit: To confirm the numbers, here is 5 runs of each:

name                                old time/op  new time/op   delta
CompressAllocationsSingle/flate     82.4µs ± 1%   64.5µs ± 0%  -21.70%  (p=0.008 n=5+5)
CompressAllocationsSingle/flate-2   85.4µs ± 1%   89.6µs ± 2%   +4.98%  (p=0.008 n=5+5)
CompressAllocationsSingle/flate-4   75.7µs ± 1%   84.7µs ± 1%  +11.82%  (p=0.008 n=5+5)
CompressAllocationsSingle/flate-8   77.3µs ± 1%   84.7µs ± 0%   +9.58%  (p=0.008 n=5+5)
CompressAllocationsSingle/flate-16  83.0µs ± 2%   97.4µs ± 2%  +17.28%  (p=0.008 n=5+5)
CompressAllocationsSingle/flate-32  87.8µs ± 1%  108.7µs ± 0%  +23.79%  (p=0.008 n=5+5)
CompressAllocationsSingle/gzip      91.2µs ± 0%  107.6µs ± 1%  +17.92%  (p=0.008 n=5+5)
CompressAllocationsSingle/gzip-2    83.8µs ± 2%   95.5µs ± 5%  +13.98%  (p=0.008 n=5+5)
CompressAllocationsSingle/gzip-4    79.1µs ± 0%   82.7µs ± 2%   +4.47%  (p=0.008 n=5+5)
CompressAllocationsSingle/gzip-8    80.3µs ± 1%   87.9µs ± 0%   +9.49%  (p=0.008 n=5+5)

(The benchmark auto-terminated, so the numbers for gzip-16 and gzip-32 still need to be confirmed.)

@cherrymui (Member)

Yeah, I think this is consistent with our previous analysis. There is no contention if there is just one thread. The more threads there are, the more they try to preempt each other, and thus the heavier the contention.

@leitzler (Contributor)

@leitzler Could you share some profiles? I don't readily have access to dedicated arm64 hardware and I'm curious if that's the same thing or a different issue.

It isn't dedicated hardware, so I'm not sure how much it is worth; it's an 8 vCPU / 8 GB VM. I re-ran the test 10 times each, alternating between 1.13 and 1.14, and the difference was actually much smaller:

$ benchstat go113.txt go114.txt
name                               old time/op    new time/op    delta
CompressAllocationsSingle/flate-8     437µs ± 3%     469µs ± 1%   +7.37%  (p=0.000 n=10+9)
CompressAllocationsSingle/gzip-8      381µs ± 2%     435µs ± 3%  +14.28%  (p=0.000 n=9+10)

profiles.zip

@klauspost (Contributor, Author)

@mknyszek I don't know why you changed the title to 'compression benchmark'. This is not related to compression.

Change the benchmark to:

		for i := 0; i < b.N; i++ {
			// Renamed from b to buf to avoid shadowing the *testing.B receiver.
			buf := make([]byte, 800<<10)
			_, _ = io.Copy(ioutil.Discard, bytes.NewBuffer(buf))
		}

and the result is:

λ benchcmp 113.txt 114.txt
benchmark                                       old ns/op     new ns/op     delta
BenchmarkAllocation/flate-32     49657         56362         +13.50%

@mknyszek (Contributor)

@klauspost That's true, thanks for pointing it out. I'll fix it again.

@leitzler Looking at the profiles... it looks like it's exactly the same issue. runtime.tgkill is suddenly at the top of the profile and it comes from signalM, which in turn comes from preemptM. Then it follows the same path up to enlistWorker.

@mknyszek changed the title from "runtime: compression benchmark regression in go1.14 due to async preemption" to "runtime: frequent enlisting of short-lived background workers leads to performance regression with async preemption" on Feb 26, 2020
@mknyszek added the NeedsFix (The path to resolution is known, but the work has not been done.) and help wanted labels and removed the NeedsInvestigation label on Feb 28, 2020
@mknyszek (Contributor)

I think the answer here is to just fix #19112 and this will mostly go away.

Putting the help wanted label on preemptively (ha) in case it turns out #19112 is too hard to fix (or I just need help fixing it); in that case we need a different solution.

@mknyszek (Contributor)

So I think fixing #19112 is feasible... but it's tricky. Anything we do would basically need the invariant that you can't try to change a G's status while holding sched.lock, because the scan path can acquire it in the general case.

The fix is really simple, I think: make injectglist just casgstatus(_Gwaiting, _Grunnable) for all the goroutines in the list before putting them on the global run queue. AFAICT this is the only place in the whole codebase where we try to change a G's status while holding sched.lock but maybe this is too restrictive. I'm pretty sure it's also safe to do this without the sched.lock because Gs on a gList are only owned by the list.
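
A hypothetical sketch of that reordering, using runtime-internal identifiers (paraphrasing the paragraph above; not a compilable unit, and the CL mentioned below is the authoritative version):

	func injectglist(glist *gList) {
		if glist.empty() {
			return
		}
		// Gs on a gList are owned exclusively by the list, so flip them
		// all to _Grunnable *before* acquiring sched.lock.
		head := glist.head.ptr()
		var tail *g
		var n int32
		for gp := head; gp != nil; gp = gp.schedlink.ptr() {
			tail = gp
			n++
			casgstatus(gp, _Gwaiting, _Grunnable)
		}
		// Turn the gList into a gQueue and splice it onto the global run
		// queue; the lock now protects only queue manipulation, with no
		// status transitions happening while it is held.
		var q gQueue
		q.head.set(head)
		q.tail.set(tail)
		*glist = gList{}
		lock(&sched.lock)
		globrunqputbatch(&q, n)
		unlock(&sched.lock)
	}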

The alternative is to have enlistWorker hand off the "wakep" to somebody else (like a "p waker" goroutine or sysmon or something) but the delay there may be too high.

@aclements and @ianlancetaylor, WDYT?

Yet another alternative is to not try to solve this problem and instead avoid having enlistWorker fire so often. This would require a heuristic or something for when more mark workers are actually needed, and would require more thought.

@ianlancetaylor (Contributor) commented Feb 28, 2020

@mknyszek Well, what I think is that someone should review https://golang.org/cl/216198, which seems to already do what you want.

@mknyszek (Contributor)

@ianlancetaylor Ah. Well, I totally forgot about that patch. :) Happy to review it.

@gopherbot

Change https://golang.org/cl/223797 mentions this issue: runtime: prefer to wake an idle P when enlisting bg mark workers

@networkimprov

Should this get the 1.15 milestone and a backport to 1.14?

@mknyszek (Contributor)

@networkimprov Milestone 1.15 definitely. A backport to 1.14 might be possible, though it's a bit risky (https://golang.org/cl/223797 alone isn't actually enough; the change it depends on is larger).

@mknyszek modified the milestones: Backlog → Go1.15 on Mar 17, 2020
@mknyszek (Contributor)

@klauspost I haven't gotten around to verifying this myself yet (some noise problems with the Windows box I have access to), but it would still be good to make sure the change fixes the problem for you (perhaps there's a difference between our Windows systems; and it also looks like maybe you have a less noisy system). Could you please try to run your benchmark with Go at the current master branch and https://golang.org/cl/223797 applied on top?

@klauspost (Contributor, Author)

I will attempt to verify it tomorrow 👍

@mknyszek (Contributor)

@klauspost The freeze is coming soon, friendly ping! :) Please give it a try if you get the chance or let me know if you can't.

@klauspost (Contributor, Author)

Sorry @mknyszek. I didn't have a build system set up, so that got me sidetracked.

Unfortunately it doesn't seem to make any measurable difference for me, and the original regression still stands.

Go 1.13.8:

BenchmarkCompressAllocationsSingle/flate-32         	  147931	     81671 ns/op	  813699 B/op	      16 allocs/op
BenchmarkCompressAllocationsSingle/gzip-32          	  145420	     83511 ns/op	  813876 B/op	      17 allocs/op

Current master (79395c55e2):

BenchmarkCompressAllocationsSingle/flate-32         	  116026	    109378 ns/op	  813701 B/op	      16 allocs/op
BenchmarkCompressAllocationsSingle/gzip-32          	  106546	    110225 ns/op	  813877 B/op	      17 allocs/op

Master with 223797 applied:

BenchmarkCompressAllocationsSingle/flate-32         	  109663	    110618 ns/op	  813701 B/op	      16 allocs/op
BenchmarkCompressAllocationsSingle/gzip-32          	  107888	    111298 ns/op	  813876 B/op	      17 allocs/op

@mknyszek (Contributor)

Hmmm... I see. Thank you for trying! This needs more investigation, then. That CL is still probably worth landing (if only to close #19112).

@ianlancetaylor (Contributor)

Moving milestone to 1.16.

@ianlancetaylor modified the milestones: Go1.15 → Go1.16 on Jun 16, 2020
@aclements (Member)

Unfortunately, CL 223797 still has some lock ordering issues, so we've decided it's safer to bump this to 1.17.

@aclements modified the milestones: Go1.16 → Go1.17 on Nov 17, 2020
@mknyszek modified the milestones: Go1.17 → Backlog on Apr 14, 2021
@mknyszek (Contributor)

Kicking this to the backlog. I don't have time before the release to try to get this change in. It's pretty risky, and doesn't even resolve the issue here.

To that point, @tklauser, it's been a long time, but is this still observable for you? Did anything improve?

@networkimprov

@klauspost I believe the previous comment was intended for you :-)

@klauspost (Contributor, Author)

Still seeing pretty much the same performance as 1.14:

λ go version
go version go1.16.2 windows/amd64
...
BenchmarkCompressAllocationsSingle/flate-32         	    8982	    120915 ns/op	  813701 B/op	      16 allocs/op
BenchmarkCompressAllocationsSingle/gzip-32          	   10000	    120592 ns/op	  813876 B/op	      17 allocs/op
