
sync: Pool appears to interact poorly with concurrent GC #11834

Closed
jacobsa opened this issue Jul 23, 2015 · 16 comments

@jacobsa
Contributor

jacobsa commented Jul 23, 2015

Background: I have a program that has a steady-state heap size of <100 MiB, but a high rate of allocation, all of which is freed quickly. One of the main things I allocate is buffers that can be reused, so I intermediate the allocations through a sync.Pool. It may be relevant to my theory below that I call Get on only a single goroutine, but call Put on multiple goroutines, and the Put goroutines are all distinct from the Get one.
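
For reference, the access pattern is roughly the following (a minimal sketch; the names, buffer size, and channel plumbing are illustrative, not the actual gcsfuse code):

    package sketch

    import "sync"

    var bufPool = sync.Pool{
        New: func() interface{} { return make([]byte, 64*1024) },
    }

    // reader is the single goroutine that calls Get.
    func reader(work chan<- []byte) {
        for {
            buf := bufPool.Get().([]byte)
            // ... fill buf ...
            work <- buf
        }
    }

    // workers only ever call Put, never Get.
    func worker(work <-chan []byte) {
        for buf := range work {
            // ... consume buf ...
            bufPool.Put(buf)
        }
    }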

I measured the rate that Get "misses" in the freelist and falls through to the New function, and was surprised by how high it was. Here are some observations I made while debugging:

  • If I set GODEBUG=gcstoptheworld=2 to disable concurrent garbage collection, the number of freelist misses is 25% as high as before, and my program is 50% faster end-to-end.
  • If I set GOGC=off to disable garbage collection altogether, the number of freelist misses is 1% as high as before, and my program is 60% faster end-to-end.

Digging into the implementation, it appears that sync.Pool grows without bound until the stop-the-world phase of GC, at which point it abandons all of the freelist content. I think there may be a race here that makes this poorly tuned for concurrent GC, at least with my load pattern:

  1. Goroutine A calls Get, and finds the freelist empty. So it goes to allocate.
  2. The allocation triggers a GC run. The GC begins happening concurrently.
  3. Goroutine A allocates several more items and hands them off to Goroutines B1, B2, ...
  4. Goroutines B1, B2, ... call Put, contributing items back to the freelist.
  5. GC reaches the stop-the-world phase and empties the freelist, throwing away the items just added.
  6. Go to (1).

From staring at gctrace=1 and printf debugging output, it appears my program may be repeatedly tickling this issue.

I wonder if it would be better to have sync.Pool grow its freelist to a high water mark size, then decay exponentially in size with each GC round?
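
To illustrate, here is a rough user-level sketch of that high-water-mark/decay idea (purely illustrative: this is not how sync.Pool is implemented, and polling runtime.ReadMemStats on every Get would be far too expensive in a real implementation):

    package sketch

    import (
        "runtime"
        "sync"
    )

    // decayingPool keeps its freelist across GC cycles, dropping only half of
    // it each time a cycle completes, instead of abandoning everything.
    type decayingPool struct {
        mu     sync.Mutex
        free   [][]byte
        lastGC uint32
        New    func() []byte
    }

    func (p *decayingPool) maybeDecay() {
        var ms runtime.MemStats
        runtime.ReadMemStats(&ms) // far too heavy for a real Get path; sketch only
        if ms.NumGC == p.lastGC {
            return
        }
        p.lastGC = ms.NumGC
        keep := len(p.free) / 2
        for i := keep; i < len(p.free); i++ {
            p.free[i] = nil // let the GC reclaim the dropped buffers
        }
        p.free = p.free[:keep]
    }

    func (p *decayingPool) Get() []byte {
        p.mu.Lock()
        defer p.mu.Unlock()
        p.maybeDecay()
        if n := len(p.free); n > 0 {
            buf := p.free[n-1]
            p.free = p.free[:n-1]
            return buf
        }
        return p.New()
    }

    func (p *decayingPool) Put(buf []byte) {
        p.mu.Lock()
        defer p.mu.Unlock()
        p.free = append(p.free, buf)
    }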

go version devel +1421bc1 Wed Jul 22 09:18:33 2015 +0000 linux/amd64
@ianlancetaylor ianlancetaylor added this to the Go1.5Maybe milestone Jul 23, 2015
@ianlancetaylor
Contributor

CC @dvyukov @RLH @aclements

@aclements
Member

@jacobsa, can you paste your gctrace output?

@jacobsa
Contributor Author

jacobsa commented Jul 23, 2015

Here is a snippet with GODEBUG=gctrace=1, interleaved with my function that logs when the freelist misses:

gc #2015/07/23 01:46:49.064631 allocMessage called 1044 times
555 @16.671s 6%: 2.1+0.40+0.004+2.6+0.27 ms clock, 6.3+0.40+0+0/2.5/3.0+0.83 ms cpu, 64->81->49 MB, 66 MB goal, 4 P
2015/07/23 01:46:49.067593 allocMessage called 1045 times
gc #556 @16.677s 6%: 2015/07/23 01:46:49.070735 allocMessage called 1046 times
1.7+0.29+0.29+2.2+0.27 ms clock, 5.3+0.29+0+0/1.4/1.2+0.83 ms cpu, 64->81->49 MB, 66 MB goal, 4 P
2015/07/23 01:46:49.073320 allocMessage called 1047 times
2015/07/23 01:46:49.076236 allocMessage called 1048 times
gc #557 @16.683s 6%: 1.5+0.27+0.003+2.6+0.31 ms clock, 4.5+0.27+0+0/1.4/1.0+0.93 ms cpu, 64->81->49 MB, 66 MB goal, 4 P
2015/07/23 01:46:49.080386 allocMessage called 1049 times
2015/07/23 01:46:49.083039 allocMessage called 1050 times
gc #558 @16.689s 6%: 2.1+0.41+2.2+1.2+0.28 ms clock, 8.4+0.41+0+0/0.98/2.2+1.1 ms cpu, 64->82->49 MB, 66 MB goal, 4 P
2015/07/23 01:46:49.087171 allocMessage called 1051 times
2015/07/23 01:46:49.089865 allocMessage called 1052 times
gc #559 @16.697s 6%: 1.6+0.35+0.004+2.5+0.32 ms clock, 4.8+0.35+0+0.005/1.5/1.1+0.96 ms cpu, 64->82->49 MB, 65 MB goal, 4 P
gc #5602015/07/23 01:46:49.121411 allocMessage called 1053 times
 @16.729s 6%: 0.042+0.32+0.11+2.9+0.30 ms clock, 0.17+0.32+0+0.13/1.6/1.0+1.2 ms cpu, 64->81->49 MB, 66 MB goal, 4 P
2015/07/23 01:46:49.124037 allocMessage called 1054 times
gc #561 @2015/07/23 01:46:49.127282 allocMessage called 1055 times
16.734s 6%: 1.4+0.28+0.003+2.6+0.30 ms clock, 4.4+0.28+0+0/1.3/1.5+0.91 ms cpu, 64->81->49 MB, 65 MB goal, 4 P
2015/07/23 01:46:49.130001 allocMessage called 1056 times
2015/07/23 01:46:49.133795 allocMessage called 1057 times
2015/07/23 01:46:49.136930 allocMessage called 1058 times
gc #562 @16.739s 6%: 1.8+0.43+2.4+4.2+0.34 ms clock, 5.5+0.43+0+0.93/1.8/0.82+1.0 ms cpu, 64->98->65 MB, 66 MB goal, 4 P
2015/07/23 01:46:49.141047 allocMessage called 1059 times
gc #563 @16.749s 6%: 0.11+0.27+0.003+2.7+0.43 ms clock, 0.34+0.27+0+0/1.6/0.62+1.3 ms cpu, 64->82->49 MB, 65 MB goal, 4 P
2015/07/23 01:46:49.144764 allocMessage called 1060 times
gc #564 @16.7542015/07/23 01:46:49.147856 allocMessage called 1061 times
s 6%: 2.0+0.37+0.004+2.4+0.26 ms clock, 6.0+0.37+0+0/1.2/1.5+0.80 ms cpu, 64->82->49 MB, 66 MB goal, 4 P

And here with GODEBUG=gctrace=1,gcstoptheworld=2:

2015/07/23 01:45:43.173046 allocMessage called 377 times
gc #349 @21.007s 3%: 0.018+0+0+0+1.9 ms clock, 0.073+0+0+0/0/0+7.7 ms cpu, 62->62->33 MB, 62 MB goal, 4 P (forced)
2015/07/23 01:45:43.186673 allocMessage called 378 times
gc #350 @21.034s 3%: 0.071+0+0+0+2.0 ms clock, 0.21+0+0+0/0/0+6.2 ms cpu, 62->62->33 MB, 62 MB goal, 4 P (forced)
2015/07/23 01:45:43.214135 allocMessage called 379 times
gc #351 @21.062s 3%: 0.015+0+0+0+1.7 ms clock, 0.063+0+0+0/0/0+7.0 ms cpu, 62->62->33 MB, 62 MB goal, 4 P (forced)
2015/07/23 01:45:43.240878 allocMessage called 380 times
gc #352 @21.090s 3%: 0.031+0+0+0+1.8 ms clock, 0.12+0+0+0/0/0+7.5 ms cpu, 62->62->33 MB, 62 MB goal, 4 P (forced)
2015/07/23 01:45:43.282152 allocMessage called 381 times
gc #353 @21.118s 3%: 0.017+0+0+0+2.0 ms clock, 0.070+0+0+0/0/0+8.0 ms cpu, 62->62->33 MB, 62 MB goal, 4 P (forced)
2015/07/23 01:45:43.301456 allocMessage called 382 times
gc #354 @21.137s 3%: 2.22015/07/23 01:45:43.314280 allocMessage called 383 times
+0+0+0+2.4 ms clock, 8.9+0+0+0/0/0+9.7 ms cpu, 73->73->33 MB, 73 MB goal, 4 P (forced)
2015/07/23 01:45:43.317380 allocMessage called 384 times
gc #355 @21.167s 3%: 0.016+0+0+0+1.8 ms clock, 0.065+0+0+0/0/0+7.4 ms cpu, 62->62->33 MB, 62 MB goal, 4 P (forced)
2015/07/23 01:45:43.359269 allocMessage called 385 times
gc #356 @21.195s 3%: 0.019+0+0+0+1.7 ms clock, 0.079+0+0+0/0/0+6.9 ms cpu, 62->62->33 MB, 62 MB goal, 4 P (forced)
2015/07/23 01:45:43.379051 allocMessage called 386 times
gc #357 @21.209s 3%: 0.016+0+0+0+1.6 ms clock, 0.067+0+0+0/0/0+6.7 ms cpu, 69->69->17 MB, 69 MB goal, 4 P (forced)

The latter looks much more 'orderly', with apparently one freelist item being thrown out each time. The former is more variable. I suspect you could reproduce this with a much smaller test case by creating a program that does Get in a loop in a single goroutine, then immediately does go Put(), but I haven't tried.

For my notes: this is GoogleCloudPlatform/gcsfuse@04f3cfb and jacobsa/fuse@3755e07, running the benchmark like ./sequential_read --dir ~/mp --file_size 67108864 --read_size 1048576 --duration 10s.
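
For what it's worth, a minimal sketch of the reproduction suggested above might look like this (the buffer size, iteration count, and miss counter are arbitrary illustrative choices):

    package main

    import (
        "fmt"
        "sync"
        "sync/atomic"
    )

    func main() {
        var misses int64
        pool := sync.Pool{
            New: func() interface{} {
                atomic.AddInt64(&misses, 1)
                return make([]byte, 64*1024)
            },
        }

        var wg sync.WaitGroup
        for i := 0; i < 1000000; i++ {
            buf := pool.Get().([]byte) // Get always on this goroutine
            wg.Add(1)
            go func() {
                defer wg.Done()
                pool.Put(buf) // Put always on another goroutine
            }()
        }
        wg.Wait()
        fmt.Println("pool misses:", atomic.LoadInt64(&misses))
    }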

@aclements
Member

Thanks! Could you explain a bit about when your "allocMessage" logs are printed and what the count is? Also, roughly how large are the objects being allocated to the pool? And do you know roughly how much non-pool allocation is happening? (Given that your live heap size isn't changing at all, I assume there's basically no non-pool allocation, but wanted to confirm this.)

@aclements
Member

Note that emptying the pools is actually one of the very first things GC does when triggered, since it wants to collect those objects in that cycle.

Given the heap goal overshoot in the gctrace and the high allocation rate, I suspect this is equivalent to #11677. I should be able to tell better once I understand more about allocMessage.

@jacobsa
Contributor Author

jacobsa commented Jul 23, 2015

Thanks.

The allocMessage line is printed each time Pool.New is called, with the total number of times it has ever been called. The objects being allocated for the pool are small: 48 bytes according to a heap profile viewed with -alloc_objects. The throughput is high, though: -alloc_space shows 30 GB over about 20 seconds if I'm understanding correctly, which seems almost too high, so maybe I'm not.
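
The instrumentation is roughly of this shape (a sketch; the struct fields and log text are stand-ins, not the actual code in jacobsa/fuse):

    package sketch

    import (
        "log"
        "sync"
        "sync/atomic"
    )

    // message stands in for the real ~48-byte object.
    type message struct {
        header  [16]byte
        payload []byte
        offset  int64
    }

    var allocCount uint64

    var messagePool = sync.Pool{
        New: func() interface{} {
            // Every call here is a freelist miss: count and log it.
            n := atomic.AddUint64(&allocCount, 1)
            log.Printf("allocMessage called %d times", n)
            return new(message)
        },
    }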

There are other sources of allocations in the program. -alloc_objects shows allocMessage as being only 1% weight, but -alloc_space shows it as 88% of the space. (This doesn't make any sense to me given that its allocations are small. Do you understand what these flags cause pprof to show you, exactly?)

Regarding the timing of clearing the pools: it appears the code here stops the world before clearing, and my understanding is that that happened last in the GC process. But I am probably wrong; I haven't looked closely, and the only design doc I could find seems like it may be out of date.

I will go read #11677, thanks for the pointer.

@jacobsa
Contributor Author

jacobsa commented Jul 23, 2015

Okay, the confusion is that -alloc_objects is not showing the allocation of a 16 MiB slice within allocMessage, presumably because it's too infrequent. That explains the crazy disparity I was confused about above.

@dvyukov
Member

dvyukov commented Jul 23, 2015

The program with concurrent GC is slower, because GC runs all the time.
If you look at these lines:

gc #562 @16.739s 6%: 1.8+0.43+2.4+4.2+0.34 ms clock, 5.5+0.43+0+0.93/1.8/0.82+1.0 ms cpu, 64->98->65 MB, 66 MB goal, 4 P
gc #563 @16.749s 6%: 0.11+0.27+0.003+2.7+0.43 ms clock, 0.34+0.27+0+0/1.6/0.62+1.3 ms cpu, 64->82->49 MB, 65 MB goal, 4 P

The second GC starts 10ms after the first one, while the wall-clock time of the first is 9.17ms.
The goal of the second GC is set basically to the current heap size, which leads to overallocation, and the cycle repeats.

Because the concurrent GC constantly lags behind, it also runs more frequently: every ~6ms vs ~28ms for non-concurrent GC. This explains the higher pool miss rate.

What makes the situation worse in this case is that most of the overallocated memory is live at the end of GC, because it is added to the pool, and the pool is cleared only at the beginning of GC.

@rsc
Contributor

rsc commented Jul 23, 2015

Regarding the pprof question above, --inuse_xxx shows data still live at the end of the last GC. --alloc_xxx shows the total number of things ever allocated, as of the last GC. Go knows the total allocated even since the last GC, but we report "as of the last GC" so that the two numbers are comparable.
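
For example, with hypothetical binary and profile file names:

    go tool pprof --alloc_space ./gcsfuse /tmp/mem.prof
    go tool pprof --inuse_space ./gcsfuse /tmp/mem.prof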

@jacobsa
Contributor Author

jacobsa commented Jul 26, 2015

@dvyukov: Thanks, that mostly makes sense. The one thing I can't tell: Is "Goal of the second GC is set basically to the current heap size" a bug, an unfortunate side effect of my program in particular, or what?

@rsc: Thanks for the background. I believe the reason I was mystified by this was that the allocation profile didn't show some very large allocations that only happened a few times. (Which makes sense, it's sampled. Although I do sometimes see such effects even when I've set runtime.MemProfileRate to 1.)
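
A minimal sketch of flipping that knob, assuming it is set early, before the allocations of interest (the default rate is one sample per 512 KiB allocated):

    package sketch

    import "runtime"

    func init() {
        // Record every allocation in the heap profile instead of sampling.
        runtime.MemProfileRate = 1
    }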

@dvyukov
Member

dvyukov commented Jul 27, 2015

Thanks, that mostly makes sense. The one thing I can't tell: Is "Goal of the second GC is set basically to the current heap size" a bug, an unfortunate side effect of my program in particular, or what?

Looks like a general issue in the garbage collector.

@rsc
Contributor

rsc commented Jul 27, 2015

I believe Pool itself is as good as it can get with concurrent GC. (They are somewhat at odds, and I think the implementation does the best it can, clearing the pools at the start of the GC. That means that the objects used during that concurrent GC will not become available for collection until the next one, but so be it.)

I believe the problems you are seeing are #11677 - the allocations are running too far ahead of the GC.

As such, I'm going to close this bug. Please follow #11677, and if you're still seeing problems after that is fixed, feel free to reopen this one (if you can, I don't know what the GitHub permission model is) or open a new one.

@rsc rsc closed this as completed Jul 27, 2015
@aclements
Member

@jacobsa, I have a series of CLs up at #11677. Is it possible for you to pull those locally and see if it helps?

@jacobsa
Contributor Author

jacobsa commented Jul 28, 2015

@aclements: I assume you're talking about the recently committed ones. They appear to make the problem significantly worse for me (in particular, 510fd13 does). Here are the throughputs reported by my benchmark at a selection of commits. It's slightly noisy so I wouldn't worry about single-digit changes, but it looks like the change at 510fd13 is significant:

(The numbers above are for an old version of the program, detailed below and mostly matching the one in my previous posts, where I was still using sync.Pool. In a modern version where I've stopped using sync.Pool and cut out a bunch of allocations, I get a steady 255 MiB/s at both 4c94645 and c1f7a56.)

Here is a sample of GODEBUG=gctrace=1 output at c1f7a56 (without the logging I had previously added for falling through the freelist):

gc 475 @14.155s 5%: 1.7+1.1+0.021+0.19+0.41 ms clock, 5.1+1.1+0+1.3/0.030/0.66+1.2 ms cpu, 33->33->33 MB, 34 MB goal, 4 P
gc 476 @14.160s 5%: 0.048+0.23+0.33+0.79+0.29 ms clock, 0.097+0.23+0+0/0.60/1.7+0.59 ms cpu, 33->34->33 MB, 66 MB goal, 4 P
gc 477 @14.162s 5%: 2.3+0.47+0.53+0.18+0.27 ms clock, 9.3+0.47+0+0.95/0.001/1.1+1.0 ms cpu, 49->49->33 MB, 66 MB goal, 4 P
gc 478 @14.184s 5%: 0.051+0.33+0.029+2.3+0.29 ms clock, 0.15+0.33+0+0/0.64/1.8+0.87 ms cpu, 58->74->49 MB, 66 MB goal, 4 P
gc 479 @14.189s 5%: 2.0+0.28+0.004+0.48+0.28 ms clock, 6.1+0.28+0+0.10/0.36/1.4+0.84 ms cpu, 66->66->33 MB, 66 MB goal, 4 P
gc 480 @14.220s 5%: 0.071+2.6+0.17+0.22+0.28 ms clock, 0.21+2.6+0+0.023/0.031/2.7+0.86 ms cpu, 65->81->49 MB, 66 MB goal, 4 P
gc 481 @14.225s 5%: 1.7+0.27+0.42+0.22+0.28 ms clock, 7.0+0.27+0+0.13/0.087/1.7+1.1 ms cpu, 65->65->33 MB, 66 MB goal, 4 P
gc 482 @14.233s 5%: 0.71+0.30+0.004+0.64+0.26 ms clock, 2.8+0.30+0+0.042/0.50/1.8+1.0 ms cpu, 65->65->33 MB, 66 MB goal, 4 P
gc 483 @14.238s 5%: 2.0+0.38+0.24+0.67+0.26 ms clock, 8.3+0.38+0+0.76/0.62/1.2+1.0 ms cpu, 65->65->33 MB, 66 MB goal, 4 P
gc 484 @14.245s 5%: 2.0+0.44+0.50+0.29+0.26 ms clock, 8.2+0.44+0+0.48/0.096/1.4+1.0 ms cpu, 65->65->33 MB, 66 MB goal, 4 P
gc 485 @14.251s 5%: 1.9+0.45+0.70+0.33+0.39 ms clock, 7.9+0.45+0+0.60/0.038/1.3+1.5 ms cpu, 65->65->33 MB, 66 MB goal, 4 P
gc 486 @14.261s 5%: 0.068+3.2+0.001+0.18+0.27 ms clock, 0.27+3.2+0+0.39/0.049/1.2+1.0 ms cpu, 66->82->49 MB, 66 MB goal, 4 P
gc 487 @14.266s 5%: 0.044+0.24+0.020+0.60+0.32 ms clock, 0.17+0.24+0+0.009/0.48/1.5+1.2 ms cpu, 49->49->33 MB, 65 MB goal, 4 P
gc 488 @14.278s 5%: 0.033+2.5+0.16+0.27+0.27 ms clock, 0.13+2.5+0+0.17/0.11/2.4+1.0 ms cpu, 53->70->49 MB, 66 MB goal, 4 P
gc 489 @14.282s 5%: 1.9+0.86+0.001+0.22+0.25 ms clock, 7.9+0.86+0+0.90/0.028/1.1+1.0 ms cpu, 65->65->33 MB, 66 MB goal, 4 P
gc 490 @14.288s 5%: 2.0+1.0+0.043+0.27+1.2 ms clock, 8.1+1.0+0+1.2/0.020/1.1+5.0 ms cpu, 65->65->33 MB, 66 MB goal, 4 P
gc 491 @14.296s 5%: 2.1+0.34+0.009+0.51+0.26 ms clock, 8.4+0.34+0+0.095/0.39/1.5+1.0 ms cpu, 65->65->33 MB, 66 MB goal, 4 P
gc 492 @14.302s 5%: 1.9+1.0+0.008+0.30+0.24 ms clock, 7.6+1.0+0+1.1/0.026/0.91+0.98 ms cpu, 65->65->33 MB, 66 MB goal, 4 P
gc 493 @14.310s 5%: 2.0+2.1+0.001+0.21+0.26 ms clock, 8.0+2.1+0+1.9/0.026/0.019+1.0 ms cpu, 65->65->49 MB, 66 MB goal, 4 P
gc 494 @14.315s 5%: 0.20+0.26+0.020+2.4+2.4 ms clock, 0.61+0.26+0+0/0.62/1.3+7.4 ms cpu, 49->65->49 MB, 98 MB goal, 4 P
gc 495 @14.323s 5%: 0.038+2.5+1.7+0.27+0.46 ms clock, 0.15+2.5+0+1.7/0.079/0.086+1.8 ms cpu, 49->65->33 MB, 66 MB goal, 4 P

For my records, I am running the following with GoogleCloudPlatform/gcsfuse@d4bbf69:

go version && go install -v && gcsfuse jacobsa-standard-asia ~/mp
go version && go build ./benchmarks/sequential_read && ./sequential_read --dir ~/mp --file_size 67108864 --read_size 1048576 --duration 10s

@aclements
Member

Hmm. Well, the overallocation you were seeing is gone, but the trigger is really close to the goal in most of the cycles, which is a little strange. Roughly how many goroutines does your test have, and how many do you expect to be runnable at a given moment?

@jacobsa
Contributor Author

jacobsa commented Jul 28, 2015

For each request the program receives from fuse in the kernel, it starts a goroutine. The goroutine replies to the kernel and then goes away. This particular benchmark consists of a ton of serial requests from the kernel, so I expect at most a small number of runnable goroutines at a time, but high churn in creating and destroying goroutines.

@golang golang locked and limited conversation to collaborators Aug 5, 2016