Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: simplify mark termination and eliminate mark 2 #26903

Closed
aclements opened this issue Aug 9, 2018 · 20 comments
Closed

runtime: simplify mark termination and eliminate mark 2 #26903

aclements opened this issue Aug 9, 2018 · 20 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@aclements
Copy link
Member

aclements commented Aug 9, 2018

The garbage collector currently uses a racy and complex algorithm to enter mark termination, which has wide-ranging consequences for the complexity of the mark termination implementation. I propose replacing it with a race-free distributed termination detection-like algorithm. This should open the door to several significant simplifications of the garbage collector algorithm.

The design doc can be viewed here.

/cc @RLH

@aclements aclements added this to the Go1.12 milestone Aug 9, 2018
@aclements aclements self-assigned this Aug 9, 2018
@gopherbot
Copy link

Change https://golang.org/cl/128896 mentions this issue: design: simplify mark termination and eliminate mark 2

@bcmills bcmills added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Aug 10, 2018
gopherbot pushed a commit to golang/proposal that referenced this issue Sep 8, 2018
Updates golang/go#26903.

Change-Id: I8b0fd619282f148db2aa3d4e0a18be9f3241d392
Reviewed-on: https://go-review.googlesource.com/128896
Reviewed-by: Austin Clements <austin@google.com>
@gopherbot
Copy link

Change https://golang.org/cl/134319 mentions this issue: runtime: eliminate gcBlackenPromptly mode

@gopherbot
Copy link

Change https://golang.org/cl/134317 mentions this issue: runtime: track whether any buffer has been flushed from gcWork

@gopherbot
Copy link

Change https://golang.org/cl/134316 mentions this issue: runtime: remove GODEBUG=gctrace=2 mode

@gopherbot
Copy link

Change https://golang.org/cl/134320 mentions this issue: runtime: don't disable GC work caching during mark termination

@gopherbot
Copy link

Change https://golang.org/cl/134315 mentions this issue: runtime: flush write barrier buffer to create work

@gopherbot
Copy link

Change https://golang.org/cl/134318 mentions this issue: runtime: eliminate mark 2 and fix mark termination race

@gopherbot
Copy link

Change https://golang.org/cl/134778 mentions this issue: runtime: add a more stable isSystemGoroutine mode

@gopherbot
Copy link

Change https://golang.org/cl/134784 mentions this issue: runtime: eliminate work.markrootdone and second root marking pass

@gopherbot
Copy link

Change https://golang.org/cl/134779 mentions this issue: runtime: support disabling goroutine scheduling by class

@gopherbot
Copy link

Change https://golang.org/cl/134780 mentions this issue: runtime: implement STW GC in terms of concurrent GC

@gopherbot
Copy link

Change https://golang.org/cl/134776 mentions this issue: runtime: avoid using STW GC mechanism for checkmarks mode

@gopherbot
Copy link

Change https://golang.org/cl/134783 mentions this issue: runtime: flush mcaches lazily

@gopherbot
Copy link

Change https://golang.org/cl/134785 mentions this issue: runtime: eliminate gchelper mechanism

@gopherbot
Copy link

Change https://golang.org/cl/134782 mentions this issue: runtime: eliminate blocking GC work drains

@gopherbot
Copy link

Change https://golang.org/cl/134781 mentions this issue: runtime: clean up remaining mark work check

@gopherbot
Copy link

Change https://golang.org/cl/134775 mentions this issue: runtime: remove gcStart's mode argument

@gopherbot
Copy link

Change https://golang.org/cl/134777 mentions this issue: runtime: remove GODEBUG=gcrescanstacks=1 mode

gopherbot pushed a commit that referenced this issue Oct 2, 2018
Currently, if the gcWork runs out of work, we'll fall out of the GC
worker, even though flushing the write barrier buffer could produce
more work. While this is not a correctness issue, it can lead to
premature mark 2 or mark termination.

Fix this by flushing the write barrier buffer if the local gcWork runs
out of work and then checking the local gcWork again.

This reduces the number of premature mark terminations during all.bash
by about a factor of 10.

Updates #26903. This is preparation for eliminating mark 2.

Change-Id: I48577220b90c86bfd28d498e8603bc379a8cd617
Reviewed-on: https://go-review.googlesource.com/c/134315
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
gopherbot pushed a commit that referenced this issue Oct 2, 2018
It turns out if you set GODEBUG=gctrace=2, it enables an obscure
debugging mode that, in addition to printing gctrace statistics, also
does a second STW GC following each regular GC. This debugging mode
has long since lost its value (you could maybe use it to analyze
floating garbage, except that we don't print the gctrace line on the
second GC), and it interferes substantially with the operation of the
GC by messing up the statistics used to schedule GCs.

It's also a source of mark termination GC work when we're in
concurrent GC mode, so it's going to interfere with eliminating mark
2. And it's going to get in the way of unifying STW and concurrent GC.

This CL removes this debugging mode.

Updates #26903. This is preparation for eliminating mark 2 and
unifying STW GC and concurrent GC.

Change-Id: Ib5bce05d8c4d5b6559c89a65165d49532165df07
Reviewed-on: https://go-review.googlesource.com/c/134316
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
gopherbot pushed a commit that referenced this issue Oct 2, 2018
Nothing currently consumes the flag, but we'll use it in the
distributed termination detection algorithm.

Updates #26903. This is preparation for eliminating mark 2.

Change-Id: I5e149a05b1c878fe1009150da21f8bd8ae2b9b6a
Reviewed-on: https://go-review.googlesource.com/c/134317
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
gopherbot pushed a commit that referenced this issue Oct 2, 2018
The mark 2 phase was originally introduced as a way to reduce the
chance of entering STW mark termination while there was still marking
work to do. It works by flushing and disabling all local work caches
so that all enqueued work becomes immediately globally visible.
However, mark 2 is not only slow–disabling caches makes marking and
the write barrier both much more expensive–but also imperfect. There
is still a rare but possible race (~once per all.bash) that can cause
GC to enter mark termination while there is still marking work. This
race is detailed at
https://github.com/golang/proposal/blob/master/design/17503-eliminate-rescan.md#appendix-mark-completion-race
The effect of this is that mark termination must still cope with the
possibility that there may be work remaining after a concurrent mark
phase. Dealing with this increases STW pause time and increases the
complexity of mark termination.

Furthermore, a similar but far more likely race can cause early
transition from mark 1 to mark 2. This is unfortunate because it
causes performance instability because of the cost of mark 2.

This CL fixes this by replacing mark 2 with a distributed termination
detection algorithm. This algorithm is correct, so it eliminates the
mark termination race, and doesn't require disabling local caches. It
ensures that there are no grey objects upon entering mark termination.
With this change, we're one step closer to eliminating marking from
mark termination entirely (it's still used by STW GC and checkmarks
mode).

This CL does not eliminate the gcBlackenPromptly global flag, though
it is always set to false now. It will be removed in a cleanup CL.

This led to only minor variations in the go1 benchmarks
(https://perf.golang.org/search?q=upload:20180909.1) and compilebench
benchmarks (https://perf.golang.org/search?q=upload:20180910.2).

This significantly improves performance of the garbage benchmark, with
no impact on STW times:

name                        old time/op    new time/op   delta
Garbage/benchmem-MB=64-12    2.21ms ± 1%   2.05ms ± 1%   -7.38% (p=0.000 n=18+19)
Garbage/benchmem-MB=1024-12  2.30ms ±16%   2.20ms ± 7%   -4.51% (p=0.001 n=20+20)

name                        old STW-ns/GC  new STW-ns/GC  delta
Garbage/benchmem-MB=64-12      138k ±44%     141k ±23%     ~    (p=0.309 n=19+20)
Garbage/benchmem-MB=1024-12    159k ±25%     178k ±98%     ~    (p=0.798 n=16+18)

name                        old STW-ns/op  new STW-ns/op                delta
Garbage/benchmem-MB=64-12     4.42k ±44%    4.24k ±23%     ~    (p=0.531 n=19+20)
Garbage/benchmem-MB=1024-12     591 ±24%      636 ±111%    ~    (p=0.309 n=16+18)

(https://perf.golang.org/search?q=upload:20180910.1)

Updates #26903.
Updates #17503.

Change-Id: Icbd1e12b7a12a76f423c9bf033b13cb363e4cd19
Reviewed-on: https://go-review.googlesource.com/c/134318
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
gopherbot pushed a commit that referenced this issue Oct 2, 2018
Now that there is no mark 2 phase, gcBlackenPromptly is no longer
used.

Updates #26903. This is a follow-up to eliminating mark 2.

Change-Id: Ib9c534f21b36b8416fcf3cab667f186167b827f8
Reviewed-on: https://go-review.googlesource.com/c/134319
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
gopherbot pushed a commit that referenced this issue Oct 2, 2018
Currently, we disable GC work caching during mark termination. This is
no longer necessary with the new mark completion detection because

1. There's no way for any of the GC mark termination helpers to have
any real work queued and,

2. Mark termination has to explicitly flush every P's buffers anyway
in order to flush Ps that didn't run a GC mark termination helper.

Hence, remove the code that disposes gcWork buffers during mark
termination.

Updates #26903. This is a follow-up to eliminating mark 2.

Change-Id: I81f002ee25d5c10f42afd39767774636519007f9
Reviewed-on: https://go-review.googlesource.com/c/134320
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
gopherbot pushed a commit that referenced this issue Oct 2, 2018
This argument is always gcBackgroundMode since only
debug.gcstoptheworld can trigger a STW GC at this point. Remove the
unnecessary argument.

Updates #26903. This is preparation for unifying STW GC and concurrent
GC.

Change-Id: Icb4ba8f10f80c2b69cf51a21e04fa2c761b71c94
Reviewed-on: https://go-review.googlesource.com/c/134775
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
gopherbot pushed a commit that referenced this issue Oct 2, 2018
Currently, checkmarks mode uses the full STW GC infrastructure to
perform mark checking. We're about to remove that infrastructure and,
furthermore, since checkmarks is about doing the simplest thing
possible to check concurrent GC, it's valuable for it to be simpler.

Hence, this CL makes checkmarks even simpler by making it non-parallel
and divorcing it from the STW GC infrastructure (including the
gchelper mechanism).

Updates #26903. This is preparation for unifying STW GC and concurrent
GC.

Change-Id: Iad21158123e025e3f97d7986d577315e994bd43e
Reviewed-on: https://go-review.googlesource.com/c/134776
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
gopherbot pushed a commit that referenced this issue Oct 2, 2018
Currently, setting GODEBUG=gcrescanstacks=1 enables a debugging mode
where the garbage collector re-scans goroutine stacks during mark
termination. This was introduced in Go 1.8 to debug the hybrid write
barrier, but I don't think we ever used it.

Now it's one of the last sources of mark work during mark termination.
This CL removes it.

Updates #26903. This is preparation for unifying STW GC and concurrent
GC.

Updates #17503.

Change-Id: I6ae04d3738aa9c448e6e206e21857a33ecd12acf
Reviewed-on: https://go-review.googlesource.com/c/134777
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
gopherbot pushed a commit that referenced this issue Oct 2, 2018
Currently, isSystemGoroutine varies on whether it considers the
finalizer goroutine a user goroutine or a system goroutine. For the
next CL, we're going to want to always consider the finalier goroutine
a user goroutine, so add a flag that indicates that.

Updates #26903. This is preparation for unifying STW GC and concurrent
GC.

Change-Id: Iafc92e519c13d9f8d879332cb5f0d12164104c33
Reviewed-on: https://go-review.googlesource.com/c/134778
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
gopherbot pushed a commit that referenced this issue Oct 2, 2018
This adds support for disabling the scheduling of user goroutines
while allowing system goroutines like the garbage collector to
continue running. User goroutines pass through the usual state
transitions, but if we attempt to actually schedule one, it will get
put on a deferred scheduling list.

Updates #26903. This is preparation for unifying STW GC and concurrent
GC.

Updates #25578. This same mechanism can form the basis for disabling
all but a single user goroutine for the purposes of debugger function
call injection.

Change-Id: Ib72a808e00c25613fe6982f5528160d3de3dbbc6
Reviewed-on: https://go-review.googlesource.com/c/134779
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
gopherbot pushed a commit that referenced this issue Oct 2, 2018
Currently, STW GC works very differently from concurrent GC. The
largest differences in that in concurrent GC, all marking work is done
by background mark workers during the mark phase, while in STW GC, all
marking work is done by gchelper during the mark termination phase.

This is a consequence of the evolution of Go's GC from a STW GC by
incrementally moving work from STW mark termination into concurrent
mark. However, at this point, the STW code paths exist only as a
debugging mode. Having separate code paths for this increases the
maintenance burden and complexity of the garbage collector. At the
same time, these code paths aren't tested nearly as well, making it
far more likely that they will bit-rot.

This CL reverses the relationship between STW GC, by re-implementing
STW GC in terms of concurrent GC.

This builds on the new scheduled support for disabling user goroutine
scheduling. During sweep termination, it disables user scheduling, so
when the GC starts the world again for concurrent mark, it's really
only "concurrent" with itself.

There are several code paths that were specific to STW GC that are now
vestigial. We'll remove these in the follow-up CLs.

Updates #26903.

Change-Id: Ia3883d2fcf7ab1d89bdc9c8ee54bf9bffb32c096
Reviewed-on: https://go-review.googlesource.com/c/134780
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
gopherbot pushed a commit that referenced this issue Oct 2, 2018
Now that STW GC marking is unified with concurrent marking, there
should never be mark work remaining in mark termination. Hence, we can
make that check unconditional.

Updates #26903. This is a follow-up to unifying STW GC and concurrent GC.

Change-Id: I43a21df5577635ab379c397a7405ada68d331e03
Reviewed-on: https://go-review.googlesource.com/c/134781
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
gopherbot pushed a commit that referenced this issue Oct 2, 2018
Now work.helperDrainBlock is always false, so we can remove it and
code paths that only ran when it was true. That means we no longer use
the gcDrainBlock mode of gcDrain, so we can eliminate that. That means
we no longer use gcWork.get, so we can eliminate that. That means we
no longer use getfull, so we can eliminate that.

Updates #26903. This is a follow-up to unifying STW GC and concurrent GC.

Change-Id: I8dbcf8ce24861df0a6149e0b7c5cd0eadb5c13f6
Reviewed-on: https://go-review.googlesource.com/c/134782
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
gopherbot pushed a commit that referenced this issue Oct 2, 2018
Currently, all mcaches are flushed during STW mark termination as a
root marking job. This is currently necessary because all spans must
be out of these caches before sweeping begins to avoid races with
allocation and to ensure the spans are in the state expected by
sweeping. We do it as a root marking job because mcache flushing is
somewhat expensive and O(GOMAXPROCS) and this parallelizes the work
across the Ps. However, it's also the last remaining root marking job
performed during mark termination.

This CL moves mcache flushing out of mark termination and performs it
lazily. We keep track of the last sweepgen at which each mcache was
flushed and as each P is woken from STW, it observes that its mcache
is out-of-date and flushes it.

The introduces a complication for spans cached in stale mcaches. These
may now be observed by background or proportional sweeping or when
attempting to add a finalizer, but aren't in a stable state. For
example, they are likely to be on the wrong mcentral list. To fix
this, this CL extends the sweepgen protocol to also capture whether a
span is cached and, if so, whether or not its cache is stale. This
protocol blocks asynchronous sweeping from touching cached spans and
makes it the responsibility of mcache flushing to sweep the flushed
spans.

This eliminates the last mark termination root marking job, which
means we can now eliminate that entire infrastructure.

Updates #26903. This implements lazy mcache flushing.

Change-Id: Iadda7aabe540b2026cffc5195da7be37d5b4125e
Reviewed-on: https://go-review.googlesource.com/c/134783
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
gopherbot pushed a commit that referenced this issue Oct 2, 2018
Before STW and concurrent GC were unified, there could be either one
or two root marking passes per GC cycle. There were several tasks we
had to make sure happened once and only once (whether that was at the
beginning of concurrent mark for concurrent GC or during mark
termination for STW GC). We kept track of this in work.markrootdone.

Now that STW and concurrent GC both use the concurrent marking code
and we've eliminated all work done by the second root marking pass, we
only ever need a single root marking pass. Hence, we can eliminate
work.markrootdone and all of the code that's conditional on it.

Updates #26903.

Change-Id: I654a0f5e21b9322279525560a31e64b8d33b790f
Reviewed-on: https://go-review.googlesource.com/c/134784
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
gopherbot pushed a commit that referenced this issue Oct 2, 2018
Now that we do no mark work during mark termination, we no longer need
the gchelper mechanism.

Updates #26903.
Updates #17503.

Change-Id: Ie94e5c0f918cfa047e88cae1028fece106955c1b
Reviewed-on: https://go-review.googlesource.com/c/134785
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
@aclements
Copy link
Member Author

The only unimplemented step of my original proposal is "Allow safe-points without preemption in dedicated workers". Working on that should probably be folded into a broader re-think of how we schedule and preempt workers.

The primary goal was eliminating mark 2, which is done.

@golang golang locked and limited conversation to collaborators Oct 24, 2019
@aclements
Copy link
Member Author

We've known about and worked around a bug in this protocol for several years, which apparently we never referenced from this issue. To save myself time searching GitHub in the future, this is where I described the problem: #27993 (comment)

To be clear, the work-around we have in place works, it's just kind of unfortunate and very occasionally causes an extra STW.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

3 participants