Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: GC 3x slower in tip vs 1.1.2 for large numbers of goroutines #6482

Closed
rhysh opened this issue Sep 25, 2013 · 12 comments
Closed

runtime: GC 3x slower in tip vs 1.1.2 for large numbers of goroutines #6482

rhysh opened this issue Sep 25, 2013 · 12 comments
Milestone

Comments

@rhysh
Copy link
Contributor

rhysh commented Sep 25, 2013

I'm working on a server process that handles a large number of TCP connections
(100k-750k) and uses three goroutines to service each of them. I've been seeing
embarrassing GC pause times with tip, something like 9x longer pauses than I see with
1.1. The goroutines communicate over channels, and one goroutine per connection uses a
select statement in a for loop.

I've written a benchmark that shows a 3x slowdown between 1.1.2 and today's tip
(9c91baff43a1) - it seems to be related to the number of goroutines there are, not the
specifics of the channel operations they're doing.

Benchmark at: http://play.golang.org/p/eA_zTh8H_Y

Do you have any thoughts?


$ ./go1.1.2/bin/go test -bench=Goroutine
testing: warning: no tests to run
PASS
BenchmarkGoroutineSelect          50     141360272 ns/op
BenchmarkGoroutineBlocking        20     120926236 ns/op
BenchmarkGoroutineForRange        20     118987469 ns/op
BenchmarkGoroutineIdle        20     101294142 ns/op
ok      _/home/users/rhys/gc    27.293s
$ ./go1.2.fa9c4c1f250d/bin/go test -bench=Goroutine
testing: warning: no tests to run
PASS
BenchmarkGoroutineSelect          10     353352975 ns/op
BenchmarkGoroutineBlocking        10     333268543 ns/op
BenchmarkGoroutineForRange        10     355201404 ns/op
BenchmarkGoroutineIdle        10     216753281 ns/op
ok      _/home/users/rhys/gc    20.722s
$ ./go1.2.9c91baff43a1/bin/go test -bench=Goroutine
testing: warning: no tests to run
PASS
BenchmarkGoroutineSelect          10     354450215 ns/op
BenchmarkGoroutineBlocking        10     336186693 ns/op
BenchmarkGoroutineForRange        10     359474965 ns/op
BenchmarkGoroutineIdle        10     216975255 ns/op
ok      _/home/users/rhys/gc    20.848s

$ ./go1.1.2/bin/go version
go version go1.1.2 linux/amd64
$ ./go1.2.fa9c4c1f250d/bin/go version
go version devel +fa9c4c1f250d Mon Sep 23 17:16:59 2013 -0400 linux/amd64
$ ./go1.2.9c91baff43a1/bin/go version
go version devel +9c91baff43a1 Wed Sep 25 18:15:29 2013 +1000 linux/amd64
@dvyukov
Copy link
Member

dvyukov commented Sep 25, 2013

Comment 1:

This is known issue. This is caused by GC precision improvements.
I am not sure what we can do at this point for Go1.2.

@gopherbot
Copy link
Contributor

Comment 2 by hans.stimer:

Simple, don't ship 1.2 until it is fixed, or rip it out until it is ready.
I'm not trying to be flip, but Go's main selling point is the ease of creating large
quantities of concurrent processes with acceptable performance. This is hardly an edge
condition.
This is such a severe problem that it will prevent many serious Go users from upgrading. 
From where I stand, the additional precision is meaningless if it increases GC time 9x.
With 1.1.x the VM grows to about 2x resident, but the resident stays fairly constant,
and the VM doesn't appear to grow without bounds. Not ideal, but a mild problem compared
to a 9x increase in GC time.

@ianlancetaylor
Copy link
Member

Comment 3:

Labels changed: added go1.2maybe.

@gopherbot
Copy link
Contributor

Comment 4 by hans.stimer:

May I also suggest a GO1.2Maybe for the new GC going in. :-)

@davecheney
Copy link
Contributor

Comment 5:

I think your micro benchmark is leading you astray. In the Go1 benchmarks, gc bound
tasks like binary tree have _decreased_ in runtime, which runs counter to your assertion.
Full results:
lucky(~/devel/autobench) % make
/home/dfc/devel/autobench/work/go.11/misc/benchcmp
/home/dfc/devel/autobench/work/go1-11.txt /home/dfc/devel/autobench/work/go1-tip.txt
benchmark                         old ns/op    new ns/op    delta
BenchmarkBinaryTree17            2147483647   2147483647  -16.35%
BenchmarkFannkuch11              2147483647   2147483647   -2.93%
BenchmarkFmtFprintfEmpty                103          100   -2.91%
BenchmarkFmtFprintfString               302          253  -16.23%
BenchmarkFmtFprintfInt                  200          221  +10.50%
BenchmarkFmtFprintfIntInt               335          327   -2.39%
BenchmarkFmtFprintfPrefixedInt          373          291  -21.98%
BenchmarkFmtFprintfFloat                529          455  -13.99%
BenchmarkFmtManyArgs                   1408         1207  -14.28%
BenchmarkGobDecode                 13340638     10182422  -23.67%
BenchmarkGobEncode                 14378134     15727007   +9.38%
BenchmarkGzip                     511413238    466426341   -8.80%
BenchmarkGunzip                   141615409    115645734  -18.34%
BenchmarkHTTPClientServer             58614        49320  -15.86%
BenchmarkJSONEncode                51195827     29764960  -41.86%
BenchmarkJSONDecode               109983875    110148725   +0.15%
BenchmarkMandelbrot200              4878674      4846368   -0.66%
BenchmarkGoParse                    7006652      6030072  -13.94%
BenchmarkRegexpMatchEasy0_32            136          140   +2.94%
BenchmarkRegexpMatchEasy0_1K            350          383   +9.43%
BenchmarkRegexpMatchEasy1_32            107          116   +8.41%
BenchmarkRegexpMatchEasy1_1K            922          934   +1.30%
BenchmarkRegexpMatchMedium_32           197          203   +3.05%
BenchmarkRegexpMatchMedium_1K         74180        73690   -0.66%
BenchmarkRegexpMatchHard_32            3744         3428   -8.44%
BenchmarkRegexpMatchHard_1K          118619       111374   -6.11%
BenchmarkRevcomp                  973314967    761606833  -21.75%
BenchmarkTemplate                 147874541    133148104   -9.96%
BenchmarkTimeParse                      521          448  -14.01%
BenchmarkTimeFormat                     583          450  -22.81%
benchmark                          old MB/s     new MB/s  speedup
BenchmarkGobDecode                    57.53        75.38    1.31x
BenchmarkGobEncode                    53.38        48.80    0.91x
BenchmarkGzip                         37.94        41.60    1.10x
BenchmarkGunzip                      137.02       167.79    1.22x
BenchmarkJSONEncode                   37.90        65.19    1.72x
BenchmarkJSONDecode                   17.64        17.62    1.00x
BenchmarkGoParse                       8.27         9.61    1.16x
BenchmarkRegexpMatchEasy0_32         235.21       227.39    0.97x
BenchmarkRegexpMatchEasy0_1K        2924.41      2671.68    0.91x
BenchmarkRegexpMatchEasy1_32         297.64       275.47    0.93x
BenchmarkRegexpMatchEasy1_1K        1110.12      1095.48    0.99x
BenchmarkRegexpMatchMedium_32          5.06         4.92    0.97x
BenchmarkRegexpMatchMedium_1K         13.80        13.90    1.01x
BenchmarkRegexpMatchHard_32            8.55         9.33    1.09x
BenchmarkRegexpMatchHard_1K            8.63         9.19    1.06x
BenchmarkRevcomp                     261.14       333.72    1.28x
BenchmarkTemplate                     13.12        14.57    1.11x
/home/dfc/devel/autobench/work/go.11/misc/benchcmp
/home/dfc/devel/autobench/work/runtime-11.txt
/home/dfc/devel/autobench/work/runtime-tip.txt
benchmark                                old ns/op    new ns/op    delta
BenchmarkAppend                                104          104   +0.00%
BenchmarkAppend1Byte                           213          142  -33.33%
BenchmarkAppend4Bytes                          299          132  -55.85%
BenchmarkAppend8Bytes                          382          134  -64.92%
BenchmarkAppend16Bytes                         529          232  -56.14%
BenchmarkAppend32Bytes                         531          139  -73.82%
BenchmarkAppendSpecialCase                      45           39  -13.35%
BenchmarkSelectUncontended                     265          272   +2.64%
BenchmarkSelectContended                       266          266   +0.00%
BenchmarkSelectNonblock                        112          112   +0.00%
BenchmarkChanUncontended                        69           67   -1.88%
BenchmarkChanContended                          70           68   -2.42%
BenchmarkChanSync                              152          153   +0.66%
BenchmarkChanProdCons0                         158          155   -1.90%
BenchmarkChanProdCons10                         93           95   +1.82%
BenchmarkChanProdCons100                        71           72   +1.12%
BenchmarkChanProdConsWork0                     777          782   +0.64%
BenchmarkChanProdConsWork10                    720          722   +0.28%
BenchmarkChanProdConsWork100                   699          698   -0.14%
BenchmarkChanCreation                          194          194   +0.00%
BenchmarkChanSem                                66           65   -0.30%
BenchmarkCallClosure                             2            2   +0.00%
BenchmarkCallClosure1                            3            3   +5.83%
BenchmarkCallClosure2                           55           45  -17.09%
BenchmarkCallClosure3                           56           45  -18.51%
BenchmarkCallClosure4                           55           46  -16.99%
BenchmarkComplex128DivNormal                    24           24   +0.00%
BenchmarkComplex128DivNisNaN                    15           15   +5.33%
BenchmarkComplex128DivDisNaN                    15           15   +1.31%
BenchmarkComplex128DivNisInf                    10           12  +14.68%
BenchmarkComplex128DivDisInf                    11           11   +3.60%
BenchmarkConvT2ESmall                           10           10   +3.92%
BenchmarkConvT2EUintptr                          0            0   +4.69%
BenchmarkConvT2ELarge                           63           61   -3.31%
BenchmarkConvT2ISmall                           12           12   -0.79%
BenchmarkConvT2IUintptr                          1            1   +0.90%
BenchmarkConvT2ILarge                           66           61   -8.41%
BenchmarkConvI2E                                 4            4   +7.64%
BenchmarkConvI2I                                16           16   +2.42%
BenchmarkAssertE2T                              11           11   -4.20%
BenchmarkAssertE2TLarge                         12           12   +0.00%
BenchmarkAssertE2I                              18           18   +0.00%
BenchmarkAssertI2T                              12           11   -4.17%
BenchmarkAssertI2I                              25           18  -28.63%
BenchmarkAssertI2E                               4            4   -1.11%
BenchmarkAssertE2E                               0            0   +0.00%
BenchmarkMalloc8                                52           42  -19.70%
BenchmarkMalloc16                               56           47  -16.13%
BenchmarkMallocTypeInfo8                        75           60  -19.41%
BenchmarkMallocTypeInfo16                       79           65  -16.84%
BenchmarkHashStringSpeed                        27           28   +2.54%
BenchmarkHashInt32Speed                         17           20  +18.24%
BenchmarkHashInt64Speed                         17           18   +3.45%
BenchmarkHashStringArraySpeed                   85           85   -0.35%
BenchmarkMegMap                                 18           23  +25.95%
BenchmarkMegOneMap                              25           14  -43.97%
BenchmarkMegEqMap                            80808        77276   -4.37%
BenchmarkMegEmptyMap                             3            3   +7.91%
BenchmarkSmallStrMap                            20           22  +10.40%
BenchmarkMapStringKeysEight_16                  21           25  +21.43%
BenchmarkMapStringKeysEight_32                  18           22  +19.57%
BenchmarkMapStringKeysEight_64                  18           22  +19.57%
BenchmarkMapStringKeysEight_1M                  18           22  +19.57%
BenchmarkIntMap                                 13           19  +46.67%
BenchmarkRepeatedLookupStrMapKey32              40           40   -0.74%
BenchmarkRepeatedLookupStrMapKey1M          340521       336266   -1.25%
BenchmarkNewEmptyMap                           135          137   +1.48%
BenchmarkMemmove32                              22            5  -76.68%
BenchmarkMemmove4K                             185          177   -4.32%
BenchmarkMemmove64K                           3593         3547   -1.28%
BenchmarkMemmove4M                          830265       823422   -0.82%
BenchmarkMemmove64M                       13271665     13210000   -0.46%
BenchmarkFinalizer                             171          179   +4.68%
BenchmarkFinalizerRun                          717          698   -2.65%
BenchmarkStackGrowth                           738          878  +18.97%
BenchmarkStackGrowthDeep                    105341       105403   +0.06%
BenchmarkCreateGoroutines                      115          118   +2.61%
BenchmarkCreateGoroutinesParallel              114          118   +3.51%
BenchmarkMatmult                                10           11   +2.78%
BenchmarkIfaceCmp100                           236          227   -3.81%
BenchmarkIfaceCmpNil100                        174          172   -1.15%
BenchmarkDefer                                  97           71  -26.57%
BenchmarkDefer10                                90           65  -27.56%
BenchmarkDeferMany                             130          103  -20.77%
BenchmarkCompareStringEqual                     10           11   +3.70%
BenchmarkCompareStringIdentical                  4            4  +10.99%
BenchmarkCompareStringSameLength                 9            9   +3.58%
BenchmarkCompareStringDifferentLength            1            2   +3.63%
BenchmarkCompareStringBigUnaligned           86791        81568   -6.02%
BenchmarkCompareStringBig                    77180        77602   +0.55%
benchmark                                 old MB/s     new MB/s  speedup
BenchmarkMemmove32                         1396.98      5987.51    4.29x
BenchmarkMemmove4K                        22137.54     23092.03    1.04x
BenchmarkMemmove64K                       18237.54     18475.72    1.01x
BenchmarkMemmove4M                         5051.76      5093.75    1.01x
BenchmarkMemmove64M                        5056.55      5080.16    1.00x
BenchmarkCompareStringBigUnaligned        12081.62     12855.27    1.06x
BenchmarkCompareStringBig                 13586.08     13512.33    0.99x
/home/dfc/devel/autobench/work/go.11/misc/benchcmp
/home/dfc/devel/autobench/work/http-11.txt /home/dfc/devel/autobench/work/http-tip.txt
benchmark                                   old ns/op    new ns/op    delta
BenchmarkHeaderWriteSubset                       1455         1119  -23.09%
BenchmarkReadRequestChrome                       7697         6959   -9.59%
BenchmarkReadRequestCurl                         4065         3573  -12.10%
BenchmarkReadRequestApachebench                  4017         3555  -11.50%
BenchmarkReadRequestSiege                        5321         4675  -12.14%
BenchmarkReadRequestWrk                          2698         2651   -1.74%
BenchmarkClientServer                           58534        57838   -1.19%
BenchmarkClientServerParallel4                  54188        55070   +1.63%
BenchmarkClientServerParallel64                 56747        53766   -5.25%
BenchmarkServer                                335560       299159  -10.85%
BenchmarkServerFakeConnNoKeepAlive              16659        12884  -22.66%
BenchmarkServerFakeConnWithKeepAlive            13830        10271  -25.73%
BenchmarkServerFakeConnWithKeepAliveLite         9030         6427  -28.83%
BenchmarkServerHandlerTypeLen                   11513         8706  -24.38%
BenchmarkServerHandlerNoLen                     10593         7930  -25.14%
BenchmarkServerHandlerNoType                    11033         8248  -25.24%
BenchmarkServerHandlerNoHeader                   8611         5716  -33.62%
benchmark                                    old MB/s     new MB/s  speedup
BenchmarkReadRequestChrome                      79.37        87.79    1.11x
BenchmarkReadRequestCurl                        19.19        21.83    1.14x
BenchmarkReadRequestApachebench                 20.41        23.06    1.13x
BenchmarkReadRequestSiege                       28.38        32.29    1.14x
BenchmarkReadRequestWrk                         14.83        15.08    1.02x
/home/dfc/devel/autobench/work/go.11/misc/benchcmp
/home/dfc/devel/autobench/work/floats-11.txt
/home/dfc/devel/autobench/work/floats-tip.txt
benchmark                  old ns/op    new ns/op    delta
BenchmarkMinSmall                 12           13   +0.78%
BenchmarkMinMed                  693          692   -0.14%
BenchmarkMinLarge              67208        66055   -1.72%
BenchmarkMinHuge             8936179      9025296   +1.00%
BenchmarkAddTwoSmall              21           25  +17.59%
BenchmarkAddFourSmall             43           48  +11.81%
BenchmarkAddTwoMed               982         1018   +3.67%
BenchmarkAddFourMed             2903         2929   +0.90%
BenchmarkAddTwoLarge          103809       101980   -1.76%
BenchmarkAddFourLarge         343695       342168   -0.44%
BenchmarkAddTwoHuge         16345354     16320656   -0.15%
BenchmarkAddFourHuge        49105734     49138748   +0.07%
BenchmarkLogSumExpSmall          265          264   -0.38%
BenchmarkLogSumExpMed          20086        20072   -0.07%
BenchmarkLogSumExpLarge      2015764      2012523   -0.16%
BenchmarkLogSumExpHuge     206482185    206366944   -0.06%
BenchmarkDotSmall                 15           16   +5.23%
BenchmarkDotMed                 1024         1032   +0.78%
BenchmarkDotLarge             106982       109836   +2.67%
BenchmarkDotHuge            13493495     13540756   +0.35%

@minux
Copy link
Member

minux commented Sep 26, 2013

Comment 6:

just curious, is the overhead caused by precise scan of goroutine stacks?

@rsc
Copy link
Contributor

rsc commented Sep 26, 2013

Comment 7:

FWIW I have already been examining the performance implications of the
various changes and whether some should be disabled. Thanks for the
additional test case.

@lexprfuncall
Copy link

Comment 8:

You can disable scanning stacks by frames by changing the its namesake variable at the
top of mgc0.c.

@rsc
Copy link
Contributor

rsc commented Oct 1, 2013

Comment 9:

Labels changed: added go1.2, removed go1.2maybe.

@rsc
Copy link
Contributor

rsc commented Oct 2, 2013

Comment 10:

This issue was closed by revision c3dadca.

Status changed to Fixed.

@gopherbot
Copy link
Contributor

Comment 11 by hans.stimer:

Thank you.

@rhysh
Copy link
Contributor Author

rhysh commented Oct 3, 2013

Comment 12:

Hi Russ,
Thanks for checking this out. I ran my app with the pre- and post-fix versions and it
looks like the GC time is back under control (and slightly better than 1.1.2's
performance, as Dave predicted).
With 7266a3768bfa (post-fix) and around 280k goroutines I see GC times like:
gc59(8): 564+287+183 ms, 4586 -> 2621 MB
roughly 3.7µs per goroutine (ignoring the effect that heap size has on GC time :) )
With 4971493b0e14 (pre-fix) and around 300k goroutines I see GC times like:
gc78(8): 5391+282+1142 ms, 4527 -> 2511 MB
roughly 22.7µs per goroutine
And with go1.1.2 and around 540k goroutines I see GC times like:
gc167(8): 1566+381+335 ms, 11746 -> 9231 MB
roughly 4.2µs per goroutine
This looks resolved to me, thank you!

@rsc rsc added this to the Go1.2 milestone Apr 14, 2015
@rsc rsc removed the go1.2 label Apr 14, 2015
@golang golang locked and limited conversation to collaborators Jun 25, 2016
This issue was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

8 participants