runtime: mutator assists are over-aggressive, especially at high GOGC #14951

Closed
aclements opened this issue Mar 24, 2016 · 15 comments

@aclements (Member)

@RLH and I have been discussing this problem for the past week. This issue is to track our thoughts.

The current GC pacer is conservative about scheduling mutator assists and this leads to a cycle where we over-assist (use too much CPU for GC) and under-shoot the heap goal.

Specifically, GC currently assumes it may have to scan the entire live heap before the live heap size reaches the heap goal. It sets the assist ratio accordingly. If the reachable heap is growing, this conservative assumption is necessary to prevent the mutator from outpacing the garbage collector and spiraling the heap size up. However, in steady state (where the reachable heap size is roughly the same as it was the last GC cycle), this means we'll finish garbage collection when we're less than 100/(GOGC+100) of the way from the trigger to the goal (1/2 way for GOGC=100; 1/4 way for GOGC=300; etc).
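
To make the steady-state numbers concrete, here is a small worked example (ordinary Go, not runtime code) showing how far between the trigger and the goal GC finishes under the whole-heap assist assumption:

```go
package main

import "fmt"

func main() {
	// In steady state the live heap equals last cycle's marked heap, which is
	// roughly 100/(GOGC+100) of the heap goal. If assists are paced to scan a
	// goal-sized heap over the trigger-to-goal allocation window, marking
	// completes after only that fraction of the window has been used.
	for _, gogc := range []int{100, 200, 300} {
		frac := 100.0 / float64(gogc+100)
		fmt.Printf("GOGC=%d: GC finishes ~%.0f%% of the way from trigger to goal\n",
			gogc, 100*frac)
	}
}
```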

[figure: overassist]

You might think the trigger controller would adapt by setting the GC trigger lower in response to the over-utilization, but in fact, this drives the trigger controller into a state of persistent CPU over-utilization and heap under-shoot. Currently, the trigger controller will leave the trigger where it is if we're anywhere on the white edge between CPU utilization and heap over-shoot shown below:

[figure: trigger]

The under-shoot caused by the over-assist forces the trigger controller to the left of the ideal point on this curve. Since this persists, we eventually settle well to the left of the ideal, repeatedly setting the trigger too close to the goal, which leads to the next cycle setting a high assist ratio and again finishing early.

This is particularly visible in applications with a high allocation rate and/or high GOGC.

@aclements (Member, Author)

Some approaches we've taken or thought about:

  1. Multiply the assist ratio by an "aggressiveness" factor that starts at 0 at the trigger and grows linearly to 1 at the goal. The idea is that this would damp the assists early in the cycle, let the background mark workers get ahead, and hopefully finish closer to the goal. This didn't work well because, without assists to slow it down, allocation proceeded so quickly early in the GC cycle that the assist ratio quickly ramped up anyway.
  2. A variation of this was to set the assist ratio to 0 until the live heap reached halfway between the trigger and the goal. This had the same problem because allocation was completely unimpeded.
  3. Let GC go into "overtime" if the heap is growing. When the live heap is between the trigger and the goal, compute the assist ratio assuming we only have to scan the same amount of heap as the last GC cycle marked. If we pass the goal, set an "overtime" goal to something like 2*goal and set the assist ratio assuming we have to scan the whole heap by the time we reach the overtime goal.
  4. Set (or cap) the assist ratio at 2 (scan 2 bytes for every 1 byte allocated). The idea is to eliminate the within-cycle feedback loop, allow GC to pass the goal, and use this feedback to adjust the trigger between cycles. In principle, the cross-cycle feedback loop should adjust in steady state so that the GC stops passing the goal after a few cycles. By setting the assist ratio to > 1, we still bound the total size the heap can grow to during a cycle; setting it to 2 specifically bounds it to 2*trigger. (A rough sketch follows this list.)
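
As a rough sketch of approach 4 (illustrative names only, not the runtime's real pacer fields), capping the assist ratio turns the per-allocation assist into a bounded, proportional scan debt and removes the within-cycle feedback loop:

```go
package pacersketch

// Illustrative sketch of approach 4; not the runtime's actual pacer code.
type assistPacer struct {
	ratio float64 // scan bytes owed per byte allocated this cycle
}

// scanDebt returns the scan work a mutator owes after allocating n bytes
// during the mark phase. Capping the ratio at 2 bounds the heap at
// 2*trigger for the cycle while letting GC overshoot the goal, so the
// correction happens in the cross-cycle trigger controller instead.
func (p *assistPacer) scanDebt(n int64) int64 {
	r := p.ratio
	if r > 2 {
		r = 2
	}
	return int64(float64(n) * r)
}
```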

@randall77 (Contributor)

I'm not sure I understand the desire to hit the heap goal if the live set is growing. If the live set is growing, we're going to set a higher heap goal for the next cycle. Why not go over the heap goal this cycle in that case?

At the last GC, assume we had a live set of size L, so we set the heap goal to be 2L (for GOGC=100). Just set the trigger & assist ratio so that we scan L bytes by the time total allocation is up to 2L. Then, if we reach a total allocation of 2L and scanning still isn't done, we know the live set is larger than L and we can grow the heap (i.e., miss the heap goal) with a clean conscience.
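
One way to read the pacing sketched above, as a hedged illustration (made-up names, not actual runtime code): the assist ratio is simply last cycle's live set divided by the allocation budget between the trigger and the goal, and the heap only grows past the goal if that much scanning turns out not to be enough.

```go
package pacersketch

// assistRatio paces assists so that lastLive bytes of scan work finish by
// the time the heap grows from trigger to goal (goal = 2*lastLive for
// GOGC=100). Illustrative only.
func assistRatio(lastLive, trigger, goal int64) float64 {
	allocBudget := goal - trigger // bytes the mutator may allocate during marking
	if allocBudget <= 0 {
		return 2 // degenerate configuration: fall back to an aggressive fixed ratio
	}
	return float64(lastLive) / float64(allocBudget) // scan bytes per allocated byte
}
```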

@aclements (Member, Author)

@randall77, this is basically the idea behind proposal 3. However, it's important to have the more aggressive fallback if the heap is actually growing. In fact, in the original pacer design, I set the assist ratio to target scanning the expected reachable heap (the marked heap from last cycle) by the heap goal and this let aggressive mutators spiral the heap size out of control. I don't remember the exact details of this failure mode, but it may have happened if the system got into a state where the assist ratio was less than 1 (or only slightly larger than 1).

aclements added this to the Unplanned milestone Mar 27, 2016
@gopherbot

CL https://golang.org/cl/21324 mentions this issue.

@RLH (Contributor) commented Mar 30, 2016

These numbers show various approaches to dealing with increasing CPU availability
for the mutators. While the numbers have not led to an obvious solution, they are
valuable in understanding the issue. As a baseline I ran the garbage benchmark on
1.6. Whatever solution we come up with should have comparable throughput to 1.6.

The benchMasterMar28 runs show the performance of tip on March 28, which is
approximately 9% faster than 1.6. The dev:garbage branch is an additional 3% on top
of the 9%. Bottom line: we have 3% of wiggle room in trading GC throughput for
increased concurrency.

If we change benchMasterMar28 to allocate black during the GC cycle, the result is
not significantly different from allocating white, because the assists are so high
that mutator availability, and hence allocation, is very low.

Here is the timing grid that considers throughput.

The 1.6 baseline is 2.70 s. Master is 2.45 s. The dev:garbage branch is 2.352 s, 3% better
than Master. The large improvement seen between allocating white and allocating
black for the scan-assist schemes is significant and we are investigating it.

The fixed-ratio schemes scan either one word or two words for each word allocated.

              White    Black
Master        2.45     2.41
Scan1for1     3.78     2.92
Scan2for1     3.07     2.61

Here is the grid from a gctrace sampling somewhere near 3.5 seconds. The traces nearby
all exhibited similar data.

Master White
gc 63 @3.520s 25%: 0.062+20+0.42 ms clock, 0.12+143/62/0+0.84 ms cpu, 122->125->70 MB, 127 MB goal, 12 P

Master Black
gc 66 @3.600s 24%: 0.20+20+0.63 ms clock, 0.41+132/60/0+1.2 ms cpu, 132->135->68 MB, 139 MB goal, 12 P

Scan1for1 Black
gc 53 @3.349s 21%: 0.032+40+0.29 ms clock, 0.096+64/120/0+0.87 ms cpu, 129->150->84 MB, 132 MB goal, 12 P

Scan2for1 Black
gc 62 @3.584s 22%: 0.015+30+0.44 ms clock, 0.18+91/89/0+5.3 ms cpu, 125->138->75 MB, 128 MB goal, 12 P

Scan1for1 White
gc 45 @3.573s 22%: 0.10+75+0.44 ms clock, 0.31+23/222/0+1.3 ms cpu, 117->154->95 MB, 118 MB goal, 12 P

Scan2for1 White
gc 57 @3.487s 24%: 0.17+33+0.66 ms clock, 0.34+99/100/14+1.3 ms cpu, 128->142->79 MB, 132 MB goal, 12 P

Time spent doing assists dropped when we went to either the 2-for-1 or the 1-for-1 fixed-ratio
approach. At the same time, the length of the GC cycle increased significantly over the baseline,
and, not surprisingly, the 2-for-1 cycles were shorter than the 1-for-1 cycles due to the increased assists.

Allocating black shortened the GC since the amount of scan work decreases.

The more of the CPUs we make available to the mutators during a GC, the more advantage there
is in allocating black, probably because there is less to scan and less time with the write
barrier on. The literature argues for allocating white to decrease floating garbage; these
numbers do not support that conclusion.

We missed the target heap size in all of the fixed-ratio tests, sometimes by quite a bit.
The 2-for-1 approach used 25% of the CPU for the GC proper and 25% for the assists, so
even with 50% of the CPUs dedicated to the GC we were unable to meet our heap size goals.

CL 21287 shows where to make changes if one wishes to duplicate these numbers.

Here is the raw data from the various runs.

cd benchmarks/bench/bin
bench1.6 bench=garbage
GOPERF-METRIC:allocated=2801684
GOPERF-METRIC:allocs=68905
GOPERF-METRIC:cputime=31848608
GOPERF-METRIC:gc-pause-one=704841
GOPERF-METRIC:gc-pause-total=31153
GOPERF-METRIC:rss=193101824
GOPERF-METRIC:sys-gc=5554176
GOPERF-METRIC:sys-heap=170098688
GOPERF-METRIC:sys-other=15517896
GOPERF-METRIC:sys-stack=884736
GOPERF-METRIC:sys-total=192055496
GOPERF-METRIC:time=2699980
GOPERF-METRIC:virtual-mem=763604992
GOPERF-FILE:cpuprof=/tmp/46.prof.txt
GOPERF-FILE:memprof=/tmp/47.prof.txt
gc 47 @3.017s 25%: 0.079+24+0.62 ms clock, 0.15+166/72/0.068+1.2 ms cpu, 128->130->67 MB, 132 MB goal, 12 P
gc 48 @3.087s 25%: 0.10+26+0.55 ms clock, 0.20+179/77/0.092+1.1 ms cpu, 127->130->69 MB, 133 MB goal, 12 P
gc 49 @3.161s 25%: 0.086+27+0.70 ms clock, 0.17+184/76/0+1.4 ms cpu, 130->133->65 MB, 136 MB goal, 12 P
gc 50 @3.237s 25%: 0.12+25+0.72 ms clock, 0.25+169/71/0+1.4 ms cpu, 126->128->69 MB, 129 MB goal, 12 P
gc 51 @3.312s 25%: 0.21+24+0.52 ms clock, 0.42+166/69/2.8+1.0 ms cpu, 134->135->67 MB, 137 MB goal, 12 P
gc 52 @3.384s 25%: 0.10+22+0.59 ms clock, 0.20+164/65/0.021+1.1 ms cpu, 131->133->63 MB, 135 MB goal, 12 P
gc 53 @3.454s 25%: 0.10+23+0.55 ms clock, 0.20+169/67/0.067+1.1 ms cpu, 123->125->66 MB, 127 MB goal, 12 P
gc 54 @3.522s 25%: 0.085+23+0.49 ms clock, 0.17+148/69/0.047+0.99 ms cpu, 127->129->65 MB, 132 MB goal, 12 P
gc 55 @3.591s 25%: 0.085+25+0.63 ms clock, 0.17+153/74/0.022+1.2 ms cpu, 123->126->63 MB, 128 MB goal, 12 P
gc 56 @3.663s 25%: 0.10+21+0.55 ms clock, 0.20+162/60/0.047+1.1 ms cpu, 121->123->62 MB, 124 MB goal, 12 P
gc 57 @3.730s 25%: 0.12+23+0.58 ms clock, 0.25+182/67/0.15+1.1 ms cpu, 122->123->64 MB, 125 MB goal, 12 P
gc 58 @3.799s 25%: 0.14+24+0.66 ms clock, 0.28+177/70/0.004+1.3 ms cpu, 125->127->66 MB, 129 MB goal, 12 P
gc 59 @3.866s 25%: 0.083+25+0.78 ms clock, 0.16+175/74/0+1.5 ms cpu, 125->128->66 MB, 132 MB goal, 12 P
gc 60 @3.937s 25%: 0.65+31+0.50 ms clock, 1.3+171/93/0+1.0 ms cpu, 121->124->68 MB, 128 MB goal, 12 P

./benchMasterMar28 -bench=garbage
GOPERF-METRIC:allocated=2780204
GOPERF-METRIC:allocs=68624
GOPERF-METRIC:cputime=28810325
GOPERF-METRIC:gc-pause-one=575984
GOPERF-METRIC:gc-pause-total=25343
GOPERF-METRIC:rss=196517888
GOPERF-METRIC:sys-gc=5629952
GOPERF-METRIC:sys-heap=173768704
GOPERF-METRIC:sys-other=11366928
GOPERF-METRIC:sys-stack=884736
GOPERF-METRIC:sys-total=191650320
GOPERF-METRIC:time=2452319
GOPERF-METRIC:virtual-mem=775798784
GOPERF-FILE:cpuprof=/tmp/46.prof.txt
GOPERF-FILE:memprof=/tmp/47.prof.txt
gc 54 @3.010s 25%: 0.025+18+0.46 ms clock, 0.10+135/53/0.072+1.8 ms cpu, 122->124->65 MB, 125 MB goal, 12 P
gc 55 @3.068s 25%: 0.031+19+0.39 ms clock, 0.063+142/56/0.066+0.79 ms cpu, 127->129->67 MB, 130 MB goal, 12 P
gc 56 @3.127s 25%: 0.036+18+0.44 ms clock, 0.073+134/56/0+0.88 ms cpu, 130->132->65 MB, 135 MB goal, 12 P
gc 57 @3.182s 25%: 0.026+18+0.49 ms clock, 0.15+125/53/0+2.9 ms cpu, 124->127->62 MB, 129 MB goal, 12 P
gc 58 @3.236s 25%: 0.037+18+0.44 ms clock, 0.34+107/54/19+4.0 ms cpu, 118->121->63 MB, 123 MB goal, 12 P
gc 59 @3.291s 25%: 0.017+18+0.48 ms clock, 0.21+129/54/4.1+5.8 ms cpu, 122->123->65 MB, 125 MB goal, 12 P
gc 60 @3.351s 25%: 0.042+18+0.58 ms clock, 0.085+141/54/0.006+1.1 ms cpu, 128->129->66 MB, 131 MB goal, 12 P
gc 61 @3.407s 25%: 0.045+18+0.48 ms clock, 0.090+143/54/0+0.97 ms cpu, 128->129->66 MB, 133 MB goal, 12 P
gc 62 @3.465s 25%: 0.35+17+0.39 ms clock, 0.70+130/53/0+0.79 ms cpu, 128->130->64 MB, 133 MB goal, 12 P
gc 63 @3.520s 25%: 0.062+20+0.42 ms clock, 0.12+143/62/0+0.84 ms cpu, 122->125->70 MB, 127 MB goal, 12 P
gc 64 @3.578s 25%: 0.038+19+0.40 ms clock, 0.077+129/57/0+0.80 ms cpu, 130->133->64 MB, 137 MB goal, 12 P
gc 65 @3.634s 25%: 0.032+18+0.23 ms clock, 0.19+122/54/11+1.4 ms cpu, 123->125->64 MB, 126 MB goal, 12 P
gc 66 @3.691s 26%: 0.020+18+0.29 ms clock, 0.12+145/55/2.2+1.7 ms cpu, 126->128->69 MB, 129 MB goal, 12 P
gc 67 @3.750s 26%: 0.035+18+0.43 ms clock, 0.071+130/54/0+0.87 ms cpu, 133->135->64 MB, 138 MB goal, 12 P
gc 68 @3.805s 26%: 0.027+18+0.89 ms clock, 0.13+134/54/0+4.4 ms cpu, 121->123->65 MB, 127 MB goal, 12 P
gc 69 @3.861s 26%: 0.19+19+0.63 ms clock, 0.96+135/55/0.063+3.1 ms cpu, 124->127->67 MB, 129 MB goal, 12 P
gc 70 @3.920s 26%: 0.027+19+0.60 ms clock, 0.10+139/57/0+2.4 ms cpu, 127->130->67 MB, 131 MB goal, 12 P
gc 71 @3.979s 26%: 0.077+19+0.45 ms clock, 0.69+138/56/0.073+4.0 ms cpu, 129->131->68 MB, 132 MB goal, 12 P

./benchAllocatBlackOrig -bench=garbage
GOPERF-METRIC:allocated=2780656
GOPERF-METRIC:allocs=68649
GOPERF-METRIC:cputime=28378118
GOPERF-METRIC:gc-pause-one=606225
GOPERF-METRIC:gc-pause-total=26552
GOPERF-METRIC:rss=190029824
GOPERF-METRIC:sys-gc=5603328
GOPERF-METRIC:sys-heap=173146112
GOPERF-METRIC:sys-other=11368768
GOPERF-METRIC:sys-stack=786432
GOPERF-METRIC:sys-total=190904640
GOPERF-METRIC:time=2414792
GOPERF-METRIC:virtual-mem=814186496
GOPERF-FILE:cpuprof=/tmp/46.prof.txt
GOPERF-FILE:memprof=/tmp/47.prof.txt

gc 56 @3.044s 24%: 0.027+18+0.25 ms clock, 0.081+128/53/0.46+0.75 ms cpu, 121->123->65 MB, 125 MB goal, 12 P
gc 57 @3.099s 24%: 0.026+18+0.41 ms clock, 0.20+129/53/0.063+3.3 ms cpu, 125->127->64 MB, 129 MB goal, 12 P
gc 58 @3.155s 24%: 0.024+16+0.27 ms clock, 0.19+123/49/13+2.2 ms cpu, 125->126->65 MB, 128 MB goal, 12 P
gc 59 @3.210s 24%: 0.021+18+0.25 ms clock, 0.14+133/54/0+1.8 ms cpu, 127->129->66 MB, 130 MB goal, 12 P
gc 60 @3.266s 24%: 0.025+18+0.45 ms clock, 0.050+135/54/0+0.90 ms cpu, 127->129->66 MB, 132 MB goal, 12 P
gc 61 @3.323s 24%: 0.022+17+0.41 ms clock, 0.22+128/52/0+4.1 ms cpu, 127->129->64 MB, 132 MB goal, 12 P
gc 62 @3.377s 24%: 0.027+16+0.81 ms clock, 0.32+123/49/0+9.7 ms cpu, 124->126->62 MB, 128 MB goal, 12 P
gc 63 @3.431s 24%: 0.023+17+0.57 ms clock, 0.16+127/52/0.023+3.9 ms cpu, 120->121->63 MB, 123 MB goal, 12 P
gc 64 @3.486s 24%: 0.028+18+0.59 ms clock, 0.11+134/55/0.051+2.3 ms cpu, 122->124->66 MB, 127 MB goal, 12 P
gc 65 @3.541s 24%: 0.033+21+0.36 ms clock, 0.23+139/63/0+2.5 ms cpu, 124->128->71 MB, 131 MB goal, 12 P
gc 66 @3.600s 24%: 0.20+20+0.63 ms clock, 0.41+132/60/0+1.2 ms cpu, 132->135->68 MB, 139 MB goal, 12 P
gc 67 @3.659s 24%: 0.036+17+0.44 ms clock, 0.073+128/52/0+0.88 ms cpu, 129->131->63 MB, 133 MB goal, 12 P
gc 68 @3.714s 24%: 0.030+19+0.48 ms clock, 0.061+135/55/1.0+0.96 ms cpu, 124->125->67 MB, 127 MB goal, 12 P
gc 69 @3.774s 24%: 0.029+18+0.53 ms clock, 0.17+137/54/0+3.1 ms cpu, 130->132->66 MB, 133 MB goal, 12 P
gc 70 @3.831s 24%: 0.025+17+0.42 ms clock, 0.10+135/52/0.024+1.7 ms cpu, 129->131->66 MB, 133 MB goal, 12 P
gc 71 @3.887s 24%: 0.026+19+0.52 ms clock, 0.10+122/56/0+2.0 ms cpu, 126->129->64 MB, 131 MB goal, 12 P
gc 72 @3.943s 24%: 0.017+19+0.28 ms clock, 0.17+115/56/16+2.8 ms cpu, 122->125->67 MB, 127 MB goal, 12 P

./benchDev -bench=garbage # March 28 best numbers using the dev branch with Ctz optimizations.
GOPERF-METRIC:allocated=3074068
GOPERF-METRIC:allocs=75288
GOPERF-METRIC:cputime=27570650
GOPERF-METRIC:gc-pause-one=626263
GOPERF-METRIC:gc-pause-total=27555
GOPERF-METRIC:rss=191041536
GOPERF-METRIC:sys-gc=5519360
GOPERF-METRIC:sys-heap=171081728
GOPERF-METRIC:sys-other=12169368
GOPERF-METRIC:sys-stack=819200
GOPERF-METRIC:sys-total=189589656
GOPERF-METRIC:time=2351587
GOPERF-METRIC:virtual-mem=805724160
GOPERF-FILE:cpuprof=/tmp/46.prof.txt
GOPERF-FILE:memprof=/tmp/47.prof.txt

gc 53 @3.056s 23%: 0.031+17+0.38 ms clock, 0.15+132/52/0+1.9 ms cpu, 125->126->64 MB, 128 MB goal, 12 P
gc 54 @3.110s 23%: 0.20+17+0.39 ms clock, 0.41+120/50/4.2+0.78 ms cpu, 125->126->62 MB, 128 MB goal, 12 P
gc 55 @3.163s 23%: 0.024+20+0.45 ms clock, 0.19+142/59/0.050+3.6 ms cpu, 119->122->70 MB, 123 MB goal, 12 P
gc 56 @3.221s 23%: 0.23+21+0.22 ms clock, 0.46+140/53/0+0.45 ms cpu, 131->134->69 MB, 138 MB goal, 12 P
gc 57 @3.279s 23%: 0.71+19+0.54 ms clock, 2.8+141/57/0+2.1 ms cpu, 130->132->67 MB, 135 MB goal, 12 P
gc 58 @3.337s 23%: 0.17+21+0.55 ms clock, 1.4+143/62/0+4.4 ms cpu, 130->132->68 MB, 133 MB goal, 12 P
gc 59 @3.397s 23%: 0.045+17+0.35 ms clock, 0.090+133/52/0+0.70 ms cpu, 133->135->63 MB, 137 MB goal, 12 P
gc 60 @3.452s 23%: 0.033+17+0.58 ms clock, 0.16+129/50/2.6+2.9 ms cpu, 124->125->63 MB, 127 MB goal, 12 P
gc 61 @3.505s 23%: 0.46+18+0.50 ms clock, 1.4+130/55/0.075+1.5 ms cpu, 122->124->63 MB, 126 MB goal, 12 P
gc 62 @3.560s 23%: 0.15+20+0.61 ms clock, 0.92+120/59/0+3.6 ms cpu, 121->124->66 MB, 126 MB goal, 12 P
gc 63 @3.617s 23%: 0.063+20+0.48 ms clock, 0.18+134/58/0.050+1.4 ms cpu, 124->127->65 MB, 129 MB goal, 12 P
gc 64 @3.673s 24%: 0.019+20+0.44 ms clock, 0.21+142/58/0+4.9 ms cpu, 123->125->69 MB, 127 MB goal, 12 P
gc 65 @3.732s 24%: 0.16+20+0.49 ms clock, 0.33+141/58/0.001+0.99 ms cpu, 133->135->67 MB, 137 MB goal, 12 P
gc 66 @3.798s 24%: 0.30+17+0.82 ms clock, 0.60+98/50/8.2+1.6 ms cpu, 132->134->58 MB, 135 MB goal, 12 P
gc 67 @3.851s 24%: 0.029+18+0.54 ms clock, 0.20+115/55/8.2+3.7 ms cpu, 114->116->64 MB, 117 MB goal, 12 P
gc 68 @3.905s 24%: 0.19+19+0.60 ms clock, 1.1+119/55/0.95+3.6 ms cpu, 122->125->65 MB, 128 MB goal, 12 P
gc 69 @3.960s 24%: 0.11+20+0.64 ms clock, 0.83+126/61/0+4.4 ms cpu, 122->125->65 MB, 128 MB goal, 12 P

./benchAllocatBlackScan1for1 -bench=garbage
GOPERF-METRIC:allocated=2780205
GOPERF-METRIC:allocs=68623
GOPERF-METRIC:cputime=34221560
GOPERF-METRIC:gc-pause-one=735410
GOPERF-METRIC:gc-pause-total=32505
GOPERF-METRIC:rss=220323840
GOPERF-METRIC:sys-gc=10213376
GOPERF-METRIC:sys-heap=193134592
GOPERF-METRIC:sys-other=11352352
GOPERF-METRIC:sys-stack=786432
GOPERF-METRIC:sys-total=215486752
GOPERF-METRIC:time=2920883
GOPERF-METRIC:virtual-mem=943915008
GOPERF-FILE:cpuprof=/tmp/46.prof.txt
GOPERF-FILE:memprof=/tmp/47.prof.txt
gc 48 @3.016s 21%: 0.56+34+0.29 ms clock, 2.2+57/101/0+1.1 ms cpu, 128->144->73 MB, 131 MB goal, 12 P
gc 49 @3.077s 21%: 0.67+42+0.23 ms clock, 5.3+51/126/0+1.8 ms cpu, 112->135->85 MB, 115 MB goal, 12 P
gc 50 @3.145s 21%: 0.023+41+0.28 ms clock, 0.23+35/122/13+2.8 ms cpu, 125->148->83 MB, 128 MB goal, 12 P
gc 51 @3.212s 21%: 0.12+43+0.38 ms clock, 1.0+21/128/38+3.0 ms cpu, 121->146->87 MB, 124 MB goal, 12 P
gc 52 @3.281s 21%: 0.017+40+0.25 ms clock, 0.21+62/106/25+3.1 ms cpu, 125->145->84 MB, 128 MB goal, 12 P
gc 53 @3.349s 21%: 0.032+40+0.29 ms clock, 0.096+64/120/0+0.87 ms cpu, 129->150->84 MB, 132 MB goal, 12 P
gc 54 @3.417s 21%: 0.43+41+0.29 ms clock, 5.1+47/124/0+3.5 ms cpu, 124->147->83 MB, 128 MB goal, 12 P
gc 55 @3.483s 21%: 0.16+34+0.28 ms clock, 1.6+25/95/61+2.8 ms cpu, 120->135->77 MB, 123 MB goal, 12 P
gc 56 @3.548s 21%: 0.021+43+0.27 ms clock, 0.085+58/124/0+1.0 ms cpu, 124->148->87 MB, 127 MB goal, 12 P
gc 57 @3.616s 21%: 1.0+40+0.30 ms clock, 12+64/121/0+3.6 ms cpu, 127->147->83 MB, 130 MB goal, 12 P
gc 58 @3.684s 21%: 0.65+37+0.33 ms clock, 5.2+61/111/0+2.6 ms cpu, 124->143->79 MB, 127 MB goal, 12 P
gc 59 @3.750s 21%: 0.028+38+0.27 ms clock, 0.22+60/103/0+2.1 ms cpu, 120->140->79 MB, 123 MB goal, 12 P
gc 60 @3.813s 21%: 7.0+35+0.30 ms clock, 21+74/103/0+0.91 ms cpu, 116->134->79 MB, 119 MB goal, 12 P
gc 61 @3.883s 21%: 0.031+38+0.29 ms clock, 0.22+64/114/0+2.0 ms cpu, 122->142->82 MB, 125 MB goal, 12 P
gc 62 @3.949s 21%: 0.39+38+0.24 ms clock, 1.1+70/115/0+0.73 ms cpu, 123->143->81 MB, 126 MB goal, 12 P

./benchAllocatBlackScan2for1 -bench=garbage
GOPERF-METRIC:allocated=2780980
GOPERF-METRIC:allocs=68669
GOPERF-METRIC:cputime=30555106
GOPERF-METRIC:gc-pause-one=710828
GOPERF-METRIC:gc-pause-total=30992
GOPERF-METRIC:rss=205619200
GOPERF-METRIC:sys-gc=6676480
GOPERF-METRIC:sys-heap=183631872
GOPERF-METRIC:sys-other=11679120
GOPERF-METRIC:sys-stack=851968
GOPERF-METRIC:sys-total=202839440
GOPERF-METRIC:time=2608209
GOPERF-METRIC:virtual-mem=1094815744
GOPERF-FILE:cpuprof=/tmp/46.prof.txt
GOPERF-FILE:memprof=/tmp/47.prof.txt
gc 53 @3.041s 21%: 3.2+28+0.31 ms clock, 13+103/84/0+1.2 ms cpu, 124->135->74 MB, 127 MB goal, 12 P
gc 54 @3.106s 21%: 0.18+28+0.41 ms clock, 0.37+103/85/0+0.83 ms cpu, 127->139->76 MB, 131 MB goal, 12 P
gc 55 @3.169s 21%: 0.026+30+0.25 ms clock, 0.18+83/88/20+1.8 ms cpu, 129->141->78 MB, 132 MB goal, 12 P
gc 56 @3.233s 22%: 0.024+28+0.25 ms clock, 0.14+121/81/0+1.5 ms cpu, 132->142->78 MB, 135 MB goal, 12 P
gc 57 @3.297s 22%: 0.030+20+0.60 ms clock, 0.060+97/61/0+1.2 ms cpu, 135->142->61 MB, 139 MB goal, 12 P
gc 58 @3.350s 22%: 0.036+28+0.24 ms clock, 0.072+97/81/0+0.49 ms cpu, 109->121->73 MB, 112 MB goal, 12 P
gc 59 @3.411s 22%: 0.15+24+0.46 ms clock, 0.95+101/71/0+2.8 ms cpu, 124->134->70 MB, 128 MB goal, 12 P
gc 60 @3.468s 22%: 0.025+25+0.44 ms clock, 0.17+84/76/0+3.0 ms cpu, 121->131->67 MB, 124 MB goal, 12 P
gc 61 @3.524s 22%: 0.030+27+0.58 ms clock, 0.060+80/82/17+1.1 ms cpu, 114->125->73 MB, 117 MB goal, 12 P
gc 62 @3.584s 22%: 0.015+30+0.44 ms clock, 0.18+91/89/0+5.3 ms cpu, 125->138->75 MB, 128 MB goal, 12 P
gc 63 @3.646s 22%: 0.024+30+0.43 ms clock, 0.14+81/89/1.0+2.6 ms cpu, 125->138->73 MB, 128 MB goal, 12 P
gc 64 @3.706s 22%: 0.16+26+0.42 ms clock, 0.32+93/78/0+0.84 ms cpu, 122->132->70 MB, 124 MB goal, 12 P
gc 65 @3.764s 22%: 0.013+26+0.49 ms clock, 0.16+69/78/22+5.9 ms cpu, 120->129->69 MB, 123 MB goal, 12 P
gc 66 @3.822s 22%: 0.017+26+0.49 ms clock, 0.17+92/79/0.001+4.9 ms cpu, 119->130->70 MB, 122 MB goal, 12 P
gc 67 @3.879s 22%: 5.7+26+0.52 ms clock, 11+121/77/0+1.0 ms cpu, 119->127->71 MB, 122 MB goal, 12 P
gc 68 @3.946s 22%: 0.028+26+0.31 ms clock, 0.056+104/72/0+0.63 ms cpu, 126->135->70 MB, 129 MB goal, 12 P

./benchAllocatWhiteScan1for1 -bench=garbage
GOPERF-METRIC:allocated=2780554
GOPERF-METRIC:allocs=68638
GOPERF-METRIC:cputime=44090842
GOPERF-METRIC:gc-pause-one=1437192
GOPERF-METRIC:gc-pause-total=66829
GOPERF-METRIC:rss=232046592
GOPERF-METRIC:sys-gc=7659520
GOPERF-METRIC:sys-heap=209879040
GOPERF-METRIC:sys-other=10847632
GOPERF-METRIC:sys-stack=753664
GOPERF-METRIC:sys-total=229139856
GOPERF-METRIC:time=3779232
GOPERF-METRIC:virtual-mem=766160896
GOPERF-FILE:cpuprof=/tmp/46.prof.txt
GOPERF-FILE:memprof=/tmp/47.prof.txt
gc 40 @3.072s 22%: 0.027+97+0.47 ms clock, 0.27+39/254/0+4.7 ms cpu, 110->161->109 MB, 112 MB goal, 12 P
gc 41 @3.182s 22%: 0.11+88+0.55 ms clock, 0.89+67/237/0.078+4.4 ms cpu, 116->163->103 MB, 119 MB goal, 12 P
gc 42 @3.284s 22%: 0.031+70+0.63 ms clock, 0.063+33/193/0+1.2 ms cpu, 114->148->86 MB, 116 MB goal, 12 P
gc 43 @3.375s 22%: 0.041+77+0.95 ms clock, 0.083+18/214/0+1.9 ms cpu, 105->140->90 MB, 107 MB goal, 12 P
gc 44 @3.472s 22%: 0.22+87+0.44 ms clock, 0.44+63/224/0+0.89 ms cpu, 109->152->100 MB, 112 MB goal, 12 P
gc 45 @3.573s 22%: 0.10+75+0.44 ms clock, 0.31+23/222/0+1.3 ms cpu, 117->154->95 MB, 118 MB goal, 12 P
gc 46 @3.667s 22%: 0.047+79+0.60 ms clock, 0.28+41/235/0+3.6 ms cpu, 115->160->106 MB, 117 MB goal, 12 P
gc 47 @3.762s 22%: 0.037+72+0.61 ms clock, 0.075+42/215/0+1.2 ms cpu, 121->157->92 MB, 124 MB goal, 12 P
gc 48 @3.855s 22%: 0.024+70+0.59 ms clock, 0.12+47/210/0+2.9 ms cpu, 113->150->89 MB, 116 MB goal, 12 P
gc 49 @3.941s 22%: 0.067+69+0.51 ms clock, 0.33+44/206/0+2.5 ms cpu, 106->144->95 MB, 109 MB goal, 12 P
gc 50 @4.026s 22%: 0.11+58+0.44 ms clock, 0.77+40/176/0+3.1 ms cpu, 114->145->83 MB, 117 MB goal, 12 P

./benchAllocatWhiteScan2for1 -bench=garbage
GOPERF-METRIC:allocated=2780262
GOPERF-METRIC:allocs=68629
GOPERF-METRIC:cputime=32769944
GOPERF-METRIC:gc-pause-one=861449
GOPERF-METRIC:gc-pause-total=37559
GOPERF-METRIC:rss=204873728
GOPERF-METRIC:sys-gc=6516736
GOPERF-METRIC:sys-heap=185761792
GOPERF-METRIC:sys-other=11133096
GOPERF-METRIC:sys-stack=819200
GOPERF-METRIC:sys-total=204230824
GOPERF-METRIC:time=3069541
GOPERF-METRIC:virtual-mem=719392768
GOPERF-FILE:cpuprof=/tmp/46.prof.txt
GOPERF-FILE:memprof=/tmp/47.prof.txt
gc 50 @3.020s 23%: 0.026+34+0.61 ms clock, 0.13+121/102/0+3.0 ms cpu, 122->136->83 MB, 125 MB goal, 12 P
gc 51 @3.088s 23%: 0.028+36+0.49 ms clock, 0.17+105/108/0+2.9 ms cpu, 138->154->78 MB, 141 MB goal, 12 P
gc 52 @3.155s 23%: 0.023+39+0.54 ms clock, 0.19+96/117/10+4.3 ms cpu, 123->142->83 MB, 127 MB goal, 12 P
gc 53 @3.224s 23%: 0.043+32+0.61 ms clock, 0.39+101/95/0+5.5 ms cpu, 130->143->75 MB, 133 MB goal, 12 P
gc 54 @3.287s 23%: 0.017+36+0.45 ms clock, 0.17+72/107/28+4.5 ms cpu, 123->138->76 MB, 126 MB goal, 12 P
gc 55 @3.352s 23%: 0.041+35+0.46 ms clock, 0.41+104/104/0+4.6 ms cpu, 121->136->78 MB, 124 MB goal, 12 P
gc 56 @3.418s 24%: 2.0+35+0.54 ms clock, 4.0+114/107/0+1.0 ms cpu, 125->140->80 MB, 128 MB goal, 12 P
gc 57 @3.487s 24%: 0.17+33+0.66 ms clock, 0.34+99/100/14+1.3 ms cpu, 128->142->79 MB, 132 MB goal, 12 P
gc 58 @3.555s 24%: 0.10+29+0.45 ms clock, 0.43+99/87/0+1.8 ms cpu, 131->143->70 MB, 135 MB goal, 12 P
gc 59 @3.617s 24%: 0.043+31+0.50 ms clock, 0.39+93/94/0+4.5 ms cpu, 116->130->71 MB, 119 MB goal, 12 P
gc 60 @3.678s 24%: 0.32+36+0.49 ms clock, 2.6+102/109/0+3.9 ms cpu, 114->130->76 MB, 117 MB goal, 12 P
gc 61 @3.743s 24%: 3.0+29+0.53 ms clock, 6.0+98/88/7.0+1.0 ms cpu, 122->133->71 MB, 124 MB goal, 12 P
gc 62 @3.807s 24%: 0.17+34+0.43 ms clock, 1.0+108/101/0+2.6 ms cpu, 119->134->78 MB, 122 MB goal, 12 P
gc 63 @3.872s 24%: 2.0+38+0.71 ms clock, 20+102/114/0+7.1 ms cpu, 127->145->80 MB, 130 MB goal, 12 P
gc 64 @3.941s 24%: 0.027+32+0.53 ms clock, 0.24+103/97/0+4.8 ms cpu, 124->138->73 MB, 127 MB goal, 12 P

RLH closed this as completed Mar 30, 2016
@cespare (Contributor) commented Mar 30, 2016

@RLH did you mean to close this?

@RLH (Contributor) commented Mar 30, 2016

No. How do I fix this?


bradfitz reopened this Mar 30, 2016
@bradfitz (Contributor)

You don't see the "Reopen Issue" button? I pressed it, in any case. Maybe you're not in the right groups.

gopherbot pushed a commit that referenced this issue Apr 21, 2016
Currently when we compute the trigger for the next GC, we do it based
on an estimate of the reachable heap size at the start of the GC
cycle, which is itself based on an estimate of the floating garbage.
This was introduced by 4655aad to fix a bad feedback loop that allowed
the heap to grow to many times the true reachable size.

However, this estimate gets easily confused by rapidly allocating
applications and, worse, it's different from the heap size the trigger
controller uses to compute the trigger itself. This results in the
trigger controller often thinking that GC finished before it started.
Since this would be a pretty great outcome from its perspective, it
sets the trigger for the next cycle as close to the next goal as
possible (which is limited to 95% of the goal).

Furthermore, the bad feedback loop this estimate originally fixed
seems not to happen any more, suggesting it was fixed more correctly
by some other change in the mean time. Finally, with the change to
allocate black, it shouldn't even be theoretically possible for this
bad feedback loop to occur.

Hence, eliminate the floating garbage estimate and simply consider the
reachable heap to be the marked heap. This harms overall throughput
slightly for allocation-heavy benchmarks, but significantly improves
mutator availability.
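
For reference, a minimal sketch (illustrative names, not the actual runtime code) of what "consider the reachable heap to be the marked heap" means for the next cycle's goal:

```go
package pacersketch

// nextGoal computes the next cycle's heap goal directly from the heap
// marked by the cycle that just finished, with no floating-garbage
// estimate in between.
func nextGoal(heapMarked, gogcPercent int64) int64 {
	return heapMarked + heapMarked*gogcPercent/100
}
```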

Fixes #12204. This brings the average trigger in this benchmark from
0.95 (the cap) to 0.7 and the active GC utilization from ~90% to ~45%.

Updates #14951. This makes the trigger controller much better behaved,
so it pulls the trigger lower if assists are consuming a lot of CPU
like it's supposed to, increasing mutator availability.

name              old time/op  new time/op  delta
XBenchGarbage-12  2.21ms ± 1%  2.28ms ± 3%  +3.29%  (p=0.000 n=17+17)

Some of this slowdown we paid for in earlier commits. Relative to the
start of the series to switch to allocate-black (the parent of "count
black allocations toward scan work"), the garbage benchmark is 2.62%
slower.

name                      old time/op    new time/op    delta
BinaryTree17-12              2.53s ± 3%     2.53s ± 3%    ~     (p=0.708 n=20+19)
Fannkuch11-12                2.08s ± 0%     2.08s ± 0%  -0.22%  (p=0.002 n=19+18)
FmtFprintfEmpty-12          45.3ns ± 2%    45.2ns ± 3%    ~     (p=0.505 n=20+20)
FmtFprintfString-12          129ns ± 0%     131ns ± 2%  +1.80%  (p=0.000 n=16+19)
FmtFprintfInt-12             121ns ± 2%     121ns ± 2%    ~     (p=0.768 n=19+19)
FmtFprintfIntInt-12          186ns ± 1%     188ns ± 3%  +0.99%  (p=0.000 n=19+19)
FmtFprintfPrefixedInt-12     188ns ± 1%     188ns ± 1%    ~     (p=0.947 n=18+16)
FmtFprintfFloat-12           254ns ± 1%     255ns ± 1%  +0.30%  (p=0.002 n=19+17)
FmtManyArgs-12               763ns ± 0%     770ns ± 0%  +0.92%  (p=0.000 n=18+18)
GobDecode-12                7.00ms ± 1%    7.04ms ± 1%  +0.61%  (p=0.049 n=20+20)
GobEncode-12                5.88ms ± 1%    5.88ms ± 0%    ~     (p=0.641 n=18+19)
Gzip-12                      214ms ± 1%     215ms ± 1%  +0.43%  (p=0.002 n=18+19)
Gunzip-12                   37.6ms ± 0%    37.6ms ± 0%  +0.11%  (p=0.015 n=17+18)
HTTPClientServer-12         76.9µs ± 2%    78.1µs ± 2%  +1.44%  (p=0.000 n=20+18)
JSONEncode-12               15.2ms ± 2%    15.1ms ± 1%    ~     (p=0.271 n=19+18)
JSONDecode-12               53.1ms ± 1%    53.3ms ± 0%  +0.49%  (p=0.000 n=18+19)
Mandelbrot200-12            4.04ms ± 1%    4.03ms ± 0%  -0.33%  (p=0.005 n=18+18)
GoParse-12                  3.29ms ± 1%    3.28ms ± 1%    ~     (p=0.146 n=16+17)
RegexpMatchEasy0_32-12      69.9ns ± 3%    69.5ns ± 1%    ~     (p=0.785 n=20+19)
RegexpMatchEasy0_1K-12       237ns ± 0%     237ns ± 0%    ~     (p=1.000 n=18+18)
RegexpMatchEasy1_32-12      69.5ns ± 1%    69.2ns ± 1%  -0.44%  (p=0.020 n=16+19)
RegexpMatchEasy1_1K-12       372ns ± 1%     371ns ± 2%    ~     (p=0.086 n=20+19)
RegexpMatchMedium_32-12      108ns ± 3%     107ns ± 1%  -1.00%  (p=0.004 n=19+14)
RegexpMatchMedium_1K-12     34.2µs ± 4%    34.0µs ± 2%    ~     (p=0.380 n=19+20)
RegexpMatchHard_32-12       1.77µs ± 4%    1.76µs ± 3%    ~     (p=0.558 n=18+20)
RegexpMatchHard_1K-12       53.4µs ± 4%    52.8µs ± 2%  -1.10%  (p=0.020 n=18+20)
Revcomp-12                   359ms ± 4%     377ms ± 0%  +5.19%  (p=0.000 n=20+18)
Template-12                 63.7ms ± 2%    62.9ms ± 2%  -1.27%  (p=0.005 n=18+20)
TimeParse-12                 316ns ± 2%     313ns ± 1%    ~     (p=0.059 n=20+16)
TimeFormat-12                329ns ± 0%     331ns ± 0%  +0.39%  (p=0.000 n=16+18)
[Geo mean]                  51.6µs         51.7µs       +0.18%

Change-Id: I1dce4640c8205d41717943b021039fffea863c57
Reviewed-on: https://go-review.googlesource.com/21324
Reviewed-by: Rick Hudson <rlh@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@gopherbot

Change https://golang.org/cl/59970 mentions this issue: runtime: separate soft and hard heap limits

@gopherbot

Change https://golang.org/cl/59971 mentions this issue: runtime: reduce background mark work from 25% to 20%

@gopherbot

Change https://golang.org/cl/60790 mentions this issue: cmd/trace: add minimum mutator utilization (MMU) plot

@gopherbot

Change https://golang.org/cl/73711 mentions this issue: runtime: buffered write barrier implementation

@gopherbot

Change https://golang.org/cl/73712 mentions this issue: cmd/compile: compiler support for buffered write barrier

gopherbot pushed a commit that referenced this issue Oct 30, 2017
This implements runtime support for buffered write barriers on amd64.
The buffered write barrier has a fast path that simply enqueues
pointers in a per-P buffer. Unlike the current write barrier, this
fast path is *not* a normal Go call and does not require the compiler
to spill general-purpose registers or put arguments on the stack. When
the buffer fills up, the write barrier takes the slow path, which
spills all general purpose registers and flushes the buffer. We don't
allow safe-points or stack splits while this frame is active, so it
doesn't matter that we have no type information for the spilled
registers in this frame.
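
Conceptually (an illustrative sketch with made-up names and sizes, not the runtime's actual buffered write barrier), the fast path only appends the written pointer to a fixed-size per-P buffer, and the slow path drains the batch when it fills:

```go
package wbsketch

// writeBarrierBuf is a per-P buffer of recently written pointers.
type writeBarrierBuf struct {
	buf  [256]uintptr
	next int
}

// record is the fast path: enqueue the pointer and return. Only when the
// buffer is full does it take the slow path, handing the whole batch to
// flush (which would spill registers and feed the GC work queues).
func (b *writeBarrierBuf) record(ptr uintptr, flush func([]uintptr)) {
	b.buf[b.next] = ptr
	b.next++
	if b.next == len(b.buf) {
		flush(b.buf[:b.next])
		b.next = 0
	}
}
```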

One minor complication is cgocheck=2 mode, which uses the write
barrier to detect Go pointers being written to non-Go memory. We
obviously can't buffer this, so instead we set the buffer to its
minimum size, forcing the write barrier into the slow path on every
call. For this specific case, we pass additional information as
arguments to the flush function. This also requires enabling the cgo
write barrier slightly later during runtime initialization, after Ps
(and the per-P write barrier buffers) have been initialized.

The code in this CL is not yet active. The next CL will modify the
compiler to generate calls to the new write barrier.

This reduces the average cost of the write barrier by roughly a factor
of 4, which will pay for the cost of having it enabled more of the
time after we make the GC pacer less aggressive. (Benchmarks will be
in the next CL.)

Updates #14951.
Updates #22460.

Change-Id: I396b5b0e2c5e5c4acfd761a3235fd15abadc6cb1
Reviewed-on: https://go-review.googlesource.com/73711
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
gopherbot pushed a commit that referenced this issue Oct 30, 2017
This CL implements the compiler support for calling the buffered write
barrier added by the previous CL.

Since the buffered write barrier is only implemented on amd64 right
now, this still supports the old, eager write barrier as well. There's
little overhead to supporting both and this way a few tests in
test/fixedbugs that expect to have liveness maps at write barrier
calls can easily opt in to the old, eager barrier.

This significantly improves the performance of the write barrier:

name             old time/op  new time/op  delta
WriteBarrier-12  73.5ns ±20%  19.2ns ±27%  -73.90%  (p=0.000 n=19+18)

It also reduces the size of binaries because the write barrier call is
more compact:

name        old object-bytes  new object-bytes  delta
Template           398k ± 0%         393k ± 0%  -1.14%  (p=0.008 n=5+5)
Unicode            208k ± 0%         206k ± 0%  -1.00%  (p=0.008 n=5+5)
GoTypes           1.18M ± 0%        1.15M ± 0%  -2.00%  (p=0.008 n=5+5)
Compiler          4.05M ± 0%        3.88M ± 0%  -4.26%  (p=0.008 n=5+5)
SSA               8.25M ± 0%        8.11M ± 0%  -1.59%  (p=0.008 n=5+5)
Flate              228k ± 0%         224k ± 0%  -1.83%  (p=0.008 n=5+5)
GoParser           295k ± 0%         284k ± 0%  -3.62%  (p=0.008 n=5+5)
Reflect           1.00M ± 0%        0.99M ± 0%  -0.70%  (p=0.008 n=5+5)
Tar                339k ± 0%         333k ± 0%  -1.67%  (p=0.008 n=5+5)
XML                404k ± 0%         395k ± 0%  -2.10%  (p=0.008 n=5+5)
[Geo mean]         704k              690k       -2.00%

name        old exe-bytes     new exe-bytes     delta
HelloSize         1.05M ± 0%        1.04M ± 0%  -1.55%  (p=0.008 n=5+5)

https://perf.golang.org/search?q=upload:20171027.1

(Amusingly, this also reduces compiler allocations by 0.75%, which,
combined with the better write barrier, speeds up the compiler overall
by 2.10%. See the perf link.)

It slightly improves the performance of most of the go1 benchmarks and
improves the performance of the x/benchmarks:

name                      old time/op    new time/op    delta
BinaryTree17-12              2.40s ± 1%     2.47s ± 1%  +2.69%  (p=0.000 n=19+19)
Fannkuch11-12                2.95s ± 0%     2.95s ± 0%  +0.21%  (p=0.000 n=20+19)
FmtFprintfEmpty-12          41.8ns ± 4%    41.4ns ± 2%  -1.03%  (p=0.014 n=20+20)
FmtFprintfString-12         68.7ns ± 2%    67.5ns ± 1%  -1.75%  (p=0.000 n=20+17)
FmtFprintfInt-12            79.0ns ± 3%    77.1ns ± 1%  -2.40%  (p=0.000 n=19+17)
FmtFprintfIntInt-12          127ns ± 1%     123ns ± 3%  -3.42%  (p=0.000 n=20+20)
FmtFprintfPrefixedInt-12     152ns ± 1%     150ns ± 1%  -1.02%  (p=0.000 n=18+17)
FmtFprintfFloat-12           211ns ± 1%     209ns ± 0%  -0.99%  (p=0.000 n=20+16)
FmtManyArgs-12               500ns ± 0%     496ns ± 0%  -0.73%  (p=0.000 n=17+20)
GobDecode-12                6.44ms ± 1%    6.53ms ± 0%  +1.28%  (p=0.000 n=20+19)
GobEncode-12                5.46ms ± 0%    5.46ms ± 1%    ~     (p=0.550 n=19+20)
Gzip-12                      220ms ± 1%     216ms ± 0%  -1.75%  (p=0.000 n=19+19)
Gunzip-12                   38.8ms ± 0%    38.6ms ± 0%  -0.30%  (p=0.000 n=18+19)
HTTPClientServer-12         79.0µs ± 1%    78.2µs ± 1%  -1.01%  (p=0.000 n=20+20)
JSONEncode-12               11.9ms ± 0%    11.9ms ± 0%  -0.29%  (p=0.000 n=20+19)
JSONDecode-12               52.6ms ± 0%    52.2ms ± 0%  -0.68%  (p=0.000 n=19+20)
Mandelbrot200-12            3.69ms ± 0%    3.68ms ± 0%  -0.36%  (p=0.000 n=20+20)
GoParse-12                  3.13ms ± 1%    3.18ms ± 1%  +1.67%  (p=0.000 n=19+20)
RegexpMatchEasy0_32-12      73.2ns ± 1%    72.3ns ± 1%  -1.19%  (p=0.000 n=19+18)
RegexpMatchEasy0_1K-12       241ns ± 0%     239ns ± 0%  -0.83%  (p=0.000 n=17+16)
RegexpMatchEasy1_32-12      68.6ns ± 1%    69.0ns ± 1%  +0.47%  (p=0.015 n=18+16)
RegexpMatchEasy1_1K-12       364ns ± 0%     361ns ± 0%  -0.67%  (p=0.000 n=16+17)
RegexpMatchMedium_32-12      104ns ± 1%     103ns ± 1%  -0.79%  (p=0.001 n=20+15)
RegexpMatchMedium_1K-12     33.8µs ± 3%    34.0µs ± 2%    ~     (p=0.267 n=20+19)
RegexpMatchHard_32-12       1.64µs ± 1%    1.62µs ± 2%  -1.25%  (p=0.000 n=19+18)
RegexpMatchHard_1K-12       49.2µs ± 0%    48.7µs ± 1%  -0.93%  (p=0.000 n=19+18)
Revcomp-12                   391ms ± 5%     396ms ± 7%    ~     (p=0.154 n=19+19)
Template-12                 63.1ms ± 0%    59.5ms ± 0%  -5.76%  (p=0.000 n=18+19)
TimeParse-12                 307ns ± 0%     306ns ± 0%  -0.39%  (p=0.000 n=19+17)
TimeFormat-12                325ns ± 0%     323ns ± 0%  -0.50%  (p=0.000 n=19+19)
[Geo mean]                  47.3µs         46.9µs       -0.67%

https://perf.golang.org/search?q=upload:20171026.1

name                       old time/op  new time/op  delta
Garbage/benchmem-MB=64-12  2.25ms ± 1%  2.20ms ± 1%  -2.31%  (p=0.000 n=18+18)
HTTP-12                    12.6µs ± 0%  12.6µs ± 0%  -0.72%  (p=0.000 n=18+17)
JSON-12                    11.0ms ± 0%  11.0ms ± 1%  -0.68%  (p=0.000 n=17+19)

https://perf.golang.org/search?q=upload:20171026.2

Updates #14951.
Updates #22460.

Change-Id: Id4c0932890a1d41020071bec73b8522b1367d3e7
Reviewed-on: https://go-review.googlesource.com/73712
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
gopherbot pushed a commit that referenced this issue Oct 31, 2017
Currently, GC pacing is based on a single hard heap limit computed
based on GOGC. In order to achieve this hard limit, assist pacing
makes the conservative assumption that the entire heap is live.
However, in the steady state (with GOGC=100), only half of the heap is
live. As a result, the garbage collector works twice as hard as
necessary and finishes half way between the trigger and the goal.
Since this is a stable state for the trigger controller, this repeats
from cycle to cycle. Matters are even worse if GOGC is higher. For
example, if GOGC=200, only a third of the heap is live in steady
state, so the GC will work three times harder than necessary and
finish only a third of the way between the trigger and the goal.

Since this causes the garbage collector to consume ~50% of the
available CPU during marking instead of the intended 25%, about 25% of
the CPU goes to mutator assists. This high mutator assist cost causes
high mutator latency variability.

This commit improves the situation by separating the heap goal into
two goals: a soft goal and a hard goal. The soft goal is set based on
GOGC, just like the current goal is, and the hard goal is set at a 10%
larger heap than the soft goal. Prior to the soft goal, assist pacing
assumes the heap is in steady state (e.g., only half of it is live).
Between the soft goal and the hard goal, assist pacing switches to the
current conservative assumption that the entire heap is live.
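
A minimal sketch of the two-goal pacing described above (illustrative names, not the pacer's actual fields): before the soft goal, assists are paced against the steady-state live-heap estimate; between the soft and hard goals, they fall back to the conservative whole-heap assumption.

```go
package pacersketch

// expectedScanWork picks the scan-work assumption used to pace assists
// under the soft/hard goal scheme sketched above.
func expectedScanWork(heapLive, softGoal, lastMarked int64) int64 {
	if heapLive <= softGoal {
		return lastMarked // steady state: only last cycle's marked heap is live
	}
	return heapLive // past the soft goal: assume the entire heap is live
}

// hardGoal sits 10% beyond the soft (GOGC-based) goal.
func hardGoal(softGoal int64) int64 {
	return softGoal + softGoal/10
}
```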

In benchmarks, this nearly eliminates mutator assists. However, since
background marking is fixed at 25% CPU, this causes the trigger
controller to saturate, which leads to somewhat higher variability in
heap size. The next commit will address this.

The lower CPU usage of course leads to longer mark cycles, though
really it means the mark cycles are as long as they should have been
in the first place. This does, however, lead to two potential
down-sides compared to the current pacing policy: 1. the total
overhead of the write barrier is higher because it's enabled more of
the time and 2. the heap size may be larger because there's more
floating garbage. We addressed 1 by significantly improving the
performance of the write barrier in the preceding commits. 2 can be
demonstrated in intense GC benchmarks, but doesn't seem to be a
problem in any real applications.

Updates #14951.
Updates #14812 (fixes?).
Fixes #18534.

This has no significant effect on the throughput of the
github.com/dr2chase/bent benchmarks-50.

This has little overall throughput effect on the go1 benchmarks:

name                      old time/op    new time/op    delta
BinaryTree17-12              2.41s ± 0%     2.40s ± 0%  -0.22%  (p=0.007 n=20+18)
Fannkuch11-12                2.95s ± 0%     2.95s ± 0%  +0.07%  (p=0.003 n=17+18)
FmtFprintfEmpty-12          41.7ns ± 3%    42.2ns ± 0%  +1.17%  (p=0.002 n=20+15)
FmtFprintfString-12         66.5ns ± 0%    67.9ns ± 2%  +2.16%  (p=0.000 n=16+20)
FmtFprintfInt-12            77.6ns ± 2%    75.6ns ± 3%  -2.55%  (p=0.000 n=19+19)
FmtFprintfIntInt-12          124ns ± 1%     123ns ± 1%  -0.98%  (p=0.000 n=18+17)
FmtFprintfPrefixedInt-12     151ns ± 1%     148ns ± 1%  -1.75%  (p=0.000 n=19+20)
FmtFprintfFloat-12           210ns ± 1%     212ns ± 0%  +0.75%  (p=0.000 n=19+16)
FmtManyArgs-12               501ns ± 1%     499ns ± 1%  -0.30%  (p=0.041 n=17+19)
GobDecode-12                6.50ms ± 1%    6.49ms ± 1%    ~     (p=0.234 n=19+19)
GobEncode-12                5.43ms ± 0%    5.47ms ± 0%  +0.75%  (p=0.000 n=20+19)
Gzip-12                      216ms ± 1%     220ms ± 1%  +1.71%  (p=0.000 n=19+20)
Gunzip-12                   38.6ms ± 0%    38.8ms ± 0%  +0.66%  (p=0.000 n=18+19)
HTTPClientServer-12         78.1µs ± 1%    78.5µs ± 1%  +0.49%  (p=0.035 n=20+20)
JSONEncode-12               12.1ms ± 0%    12.2ms ± 0%  +1.05%  (p=0.000 n=18+17)
JSONDecode-12               53.0ms ± 0%    52.3ms ± 0%  -1.27%  (p=0.000 n=19+19)
Mandelbrot200-12            3.74ms ± 0%    3.69ms ± 0%  -1.17%  (p=0.000 n=18+19)
GoParse-12                  3.17ms ± 1%    3.17ms ± 1%    ~     (p=0.569 n=19+20)
RegexpMatchEasy0_32-12      73.2ns ± 1%    73.7ns ± 0%  +0.76%  (p=0.000 n=18+17)
RegexpMatchEasy0_1K-12       239ns ± 0%     238ns ± 0%  -0.27%  (p=0.000 n=13+17)
RegexpMatchEasy1_32-12      69.0ns ± 2%    69.1ns ± 1%    ~     (p=0.404 n=19+19)
RegexpMatchEasy1_1K-12       367ns ± 1%     365ns ± 1%  -0.60%  (p=0.000 n=19+19)
RegexpMatchMedium_32-12      105ns ± 1%     104ns ± 1%  -1.24%  (p=0.000 n=19+16)
RegexpMatchMedium_1K-12     34.1µs ± 2%    33.6µs ± 3%  -1.60%  (p=0.000 n=20+20)
RegexpMatchHard_32-12       1.62µs ± 1%    1.67µs ± 1%  +2.75%  (p=0.000 n=18+18)
RegexpMatchHard_1K-12       48.8µs ± 1%    50.3µs ± 2%  +3.07%  (p=0.000 n=20+19)
Revcomp-12                   386ms ± 0%     384ms ± 0%  -0.57%  (p=0.000 n=20+19)
Template-12                 59.9ms ± 1%    61.1ms ± 1%  +2.01%  (p=0.000 n=20+19)
TimeParse-12                 301ns ± 2%     307ns ± 0%  +2.11%  (p=0.000 n=20+19)
TimeFormat-12                323ns ± 0%     323ns ± 0%    ~     (all samples are equal)
[Geo mean]                  47.0µs         47.1µs       +0.23%

https://perf.golang.org/search?q=upload:20171030.1

Likewise, the throughput effect on the x/benchmarks is minimal (and
reasonably positive on the garbage benchmark with a large heap):

name                         old time/op  new time/op  delta
Garbage/benchmem-MB=1024-12  2.40ms ± 4%  2.29ms ± 3%  -4.57%  (p=0.000 n=19+18)
Garbage/benchmem-MB=64-12    2.23ms ± 1%  2.24ms ± 2%  +0.59%  (p=0.016 n=19+18)
HTTP-12                      12.5µs ± 1%  12.6µs ± 1%    ~     (p=0.326 n=20+19)
JSON-12                      11.1ms ± 1%  11.3ms ± 2%  +2.15%  (p=0.000 n=16+17)

It does increase the heap size of the garbage benchmarks, but seems to
have relatively little impact on more realistic programs. Also, we'll
gain some of this back with the next commit.

name                         old peak-RSS-bytes  new peak-RSS-bytes  delta
Garbage/benchmem-MB=1024-12          1.21G ± 1%          1.88G ± 2%  +55.59%  (p=0.000 n=19+20)
Garbage/benchmem-MB=64-12             168M ± 3%           248M ± 8%  +48.08%  (p=0.000 n=18+20)
HTTP-12                              45.6M ± 9%          47.0M ±27%     ~     (p=0.925 n=20+20)
JSON-12                               193M ±11%           206M ±11%   +7.06%  (p=0.001 n=20+20)

https://perf.golang.org/search?q=upload:20171030.2

Change-Id: Ic78904135f832b4d64056cbe734ab979f5ad9736
Reviewed-on: https://go-review.googlesource.com/59970
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
gopherbot pushed a commit that referenced this issue Oct 31, 2017
Currently, the background mark worker and the goal GC CPU are both
fixed at 25%. The trigger controller's goal is to achieve the
goal CPU usage, and with the previous commit it can actually achieve
this. But this means there are *no* assists, which sounds ideal but
actually causes problems for the trigger controller. Since the
controller can't lower CPU usage below the background mark worker CPU,
it saturates at the CPU goal and no longer gets feedback, which
translates into higher variability in heap growth.

This commit fixes this by allowing assists 5% CPU beyond the 25% fixed
background mark. This avoids saturating the trigger controller, since
it can now get feedback from both sides of the CPU goal. This leads to
low variability in both CPU usage and heap growth, at the cost of
reintroducing a low rate of mark assists.
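
In rough numbers (illustrative constants mirroring the split described above, not necessarily the runtime's exact tuning):

```go
package pacersketch

const (
	backgroundUtilization = 0.25 // fixed dedicated background mark workers
	goalUtilization       = 0.30 // total GC CPU the trigger controller aims for

	// assistHeadroom is the ~5% of CPU left for mutator assists, which keeps
	// the trigger controller getting feedback on both sides of its goal.
	assistHeadroom = goalUtilization - backgroundUtilization
)
```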

We also experimented with 20% background plus 5% assist, but 25%+5%
clearly performed better in benchmarks.

Updates #14951.
Updates #14812.
Updates #18534.

Combined with the previous CL, this significantly improves tail
mutator utilization in the x/benchmarks garbage benchmark. On a sample
trace, it increased the 99.9%ile mutator utilization at 10ms from 26%
to 59%, and at 5ms from 17% to 52%. It reduced the 99.9%ile zero
utilization window from 2ms to 700µs. It also helps the mean mutator
utilization: it increased the 10s mutator utilization from 83% to 94%.
The minimum mutator utilization is also somewhat improved, though
there is still some unknown artifact that causes a minuscule fraction
of mutator assists to take 5--10ms (in fact, there was exactly one
10ms mutator assist in my sample trace).

This has no significant effect on the throughput of the
github.com/dr2chase/bent benchmarks-50.

This has little effect on the go1 benchmarks (and the slight overall
improvement makes up for the slight overall slowdown from the previous
commit):

name                      old time/op    new time/op    delta
BinaryTree17-12              2.40s ± 0%     2.41s ± 1%  +0.26%  (p=0.010 n=18+19)
Fannkuch11-12                2.95s ± 0%     2.93s ± 0%  -0.62%  (p=0.000 n=18+15)
FmtFprintfEmpty-12          42.2ns ± 0%    42.3ns ± 1%  +0.37%  (p=0.001 n=15+14)
FmtFprintfString-12         67.9ns ± 2%    67.2ns ± 3%  -1.03%  (p=0.002 n=20+18)
FmtFprintfInt-12            75.6ns ± 3%    76.8ns ± 2%  +1.59%  (p=0.000 n=19+17)
FmtFprintfIntInt-12          123ns ± 1%     124ns ± 1%  +0.77%  (p=0.000 n=17+14)
FmtFprintfPrefixedInt-12     148ns ± 1%     150ns ± 1%  +1.28%  (p=0.000 n=20+20)
FmtFprintfFloat-12           212ns ± 0%     211ns ± 1%  -0.67%  (p=0.000 n=16+17)
FmtManyArgs-12               499ns ± 1%     500ns ± 0%  +0.23%  (p=0.004 n=19+16)
GobDecode-12                6.49ms ± 1%    6.51ms ± 1%  +0.32%  (p=0.008 n=19+19)
GobEncode-12                5.47ms ± 0%    5.43ms ± 1%  -0.68%  (p=0.000 n=19+20)
Gzip-12                      220ms ± 1%     216ms ± 1%  -1.66%  (p=0.000 n=20+19)
Gunzip-12                   38.8ms ± 0%    38.5ms ± 0%  -0.80%  (p=0.000 n=19+20)
HTTPClientServer-12         78.5µs ± 1%    78.1µs ± 1%  -0.53%  (p=0.008 n=20+19)
JSONEncode-12               12.2ms ± 0%    11.9ms ± 0%  -2.38%  (p=0.000 n=17+19)
JSONDecode-12               52.3ms ± 0%    53.3ms ± 0%  +1.84%  (p=0.000 n=19+20)
Mandelbrot200-12            3.69ms ± 0%    3.69ms ± 0%  -0.19%  (p=0.000 n=19+19)
GoParse-12                  3.17ms ± 1%    3.19ms ± 1%  +0.61%  (p=0.000 n=20+20)
RegexpMatchEasy0_32-12      73.7ns ± 0%    73.2ns ± 1%  -0.66%  (p=0.000 n=17+20)
RegexpMatchEasy0_1K-12       238ns ± 0%     239ns ± 0%  +0.32%  (p=0.000 n=17+16)
RegexpMatchEasy1_32-12      69.1ns ± 1%    69.2ns ± 1%    ~     (p=0.669 n=19+13)
RegexpMatchEasy1_1K-12       365ns ± 1%     367ns ± 1%  +0.49%  (p=0.000 n=19+19)
RegexpMatchMedium_32-12      104ns ± 1%     105ns ± 1%  +1.33%  (p=0.000 n=16+20)
RegexpMatchMedium_1K-12     33.6µs ± 3%    34.1µs ± 4%  +1.67%  (p=0.001 n=20+20)
RegexpMatchHard_32-12       1.67µs ± 1%    1.62µs ± 1%  -2.78%  (p=0.000 n=18+17)
RegexpMatchHard_1K-12       50.3µs ± 2%    48.7µs ± 1%  -3.09%  (p=0.000 n=19+18)
Revcomp-12                   384ms ± 0%     386ms ± 0%  +0.59%  (p=0.000 n=19+19)
Template-12                 61.1ms ± 1%    60.5ms ± 1%  -1.02%  (p=0.000 n=19+20)
TimeParse-12                 307ns ± 0%     303ns ± 1%  -1.23%  (p=0.000 n=19+15)
TimeFormat-12                323ns ± 0%     323ns ± 0%  -0.12%  (p=0.011 n=15+20)
[Geo mean]                  47.1µs         47.0µs       -0.20%

https://perf.golang.org/search?q=upload:20171030.4

It slightly improves the performance of the x/benchmarks:

name                         old time/op  new time/op  delta
Garbage/benchmem-MB=1024-12  2.29ms ± 3%  2.22ms ± 2%  -2.97%  (p=0.000 n=18+18)
Garbage/benchmem-MB=64-12    2.24ms ± 2%  2.21ms ± 2%  -1.64%  (p=0.000 n=18+18)
HTTP-12                      12.6µs ± 1%  12.6µs ± 1%    ~     (p=0.690 n=19+17)
JSON-12                      11.3ms ± 2%  11.3ms ± 1%    ~     (p=0.163 n=17+18)

and fixes some of the heap size bloat caused by the previous commit:

name                         old peak-RSS-bytes  new peak-RSS-bytes  delta
Garbage/benchmem-MB=1024-12          1.88G ± 2%          1.77G ± 2%  -5.52%  (p=0.000 n=20+18)
Garbage/benchmem-MB=64-12             248M ± 8%           226M ± 5%  -8.93%  (p=0.000 n=20+20)
HTTP-12                              47.0M ±27%          47.2M ±12%    ~     (p=0.512 n=20+20)
JSON-12                               206M ±11%           206M ±10%    ~     (p=0.841 n=20+20)

https://perf.golang.org/search?q=upload:20171030.5

Combined with the change to add a soft goal in the previous commit,
this achieves a decent performance improvement on the garbage
benchmark:

name                         old time/op  new time/op  delta
Garbage/benchmem-MB=1024-12  2.40ms ± 4%  2.22ms ± 2%  -7.40%  (p=0.000 n=19+18)
Garbage/benchmem-MB=64-12    2.23ms ± 1%  2.21ms ± 2%  -1.06%  (p=0.000 n=19+18)
HTTP-12                      12.5µs ± 1%  12.6µs ± 1%    ~     (p=0.330 n=20+17)
JSON-12                      11.1ms ± 1%  11.3ms ± 1%  +1.87%  (p=0.000 n=16+18)

https://perf.golang.org/search?q=upload:20171030.6

Change-Id: If04ddb57e1e58ef2fb9eec54c290eb4ae4bea121
Reviewed-on: https://go-review.googlesource.com/59971
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
@aclements (Member, Author)

Commit 03eb948 fixed this issue (not sure why I only marked it "updates"). I have a design doc that I hope to clean up and publish under this issue in the near future.

@gopherbot

Change https://golang.org/cl/122796 mentions this issue: design: soft/hard heap limits design

gopherbot pushed a commit to golang/proposal that referenced this issue Jul 10, 2018
This is a design document I wrote a long time ago but hadn't published
yet (this change was released in Go 1.10).

Updates golang/go#14951.

Change-Id: Ib5d648eede153d68af503eaada455069c311e27e
Reviewed-on: https://go-review.googlesource.com/122796
Reviewed-by: Ian Lance Taylor <iant@golang.org>
gopherbot pushed a commit that referenced this issue Nov 5, 2018
This adds an endpoint to the trace tool that plots the minimum mutator
utilization curve using information on mark assists and GC pauses from
the trace.

This commit implements a fairly straightforward O(nm) algorithm for
computing the MMU (and tests against an even more direct but slower
algorithm). Future commits will extend and optimize this algorithm.
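
As a hedged sketch of the "more direct but slower" approach mentioned above (illustrative only; the trace tool's algorithm is exact, while this one just samples window start times on a grid):

```go
package mmusketch

// mmuApprox approximates the minimum mutator utilization for a given
// window length by sampling window start times every step seconds.
// utilAt reports the fraction of CPUs doing mutator work at time t.
func mmuApprox(utilAt func(t float64) float64, traceLen, window, step float64) float64 {
	minUtil := 1.0
	for start := 0.0; start+window <= traceLen; start += step {
		// Average utilization over [start, start+window) by sampling.
		total, n := 0.0, 0
		for t := start; t < start+window; t += step {
			total += utilAt(t)
			n++
		}
		if n > 0 {
			if u := total / float64(n); u < minUtil {
				minUtil = u
			}
		}
	}
	return minUtil
}
```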

This should be useful for debugging and understanding mutator
utilization issues like #14951, #14812, #18155, #18534, and #21107,
particularly once follow-up CLs add trace cross-referencing.

Change-Id: Ic2866869e7da1e6c56ba3e809abbcb2eb9c4923a
Reviewed-on: https://go-review.googlesource.com/c/60790
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Hyang-Ah Hana Kim <hyangah@gmail.com>
golang locked and limited conversation to collaborators Jul 9, 2019