runtime: add per-G shadows of writeBarrier.enabled #20005

sorear · 2017-04-17T03:29:42Z

A large fraction of static instructions are used to implement the write barrier enabled check, which currently always uses an absolute memory reference.
On supported RISC architectures accessing data at a small offset from a pointer takes fewer instructions than accessing data at an absolute offset. On all(?) supported architectures, it takes fewer bytes.
Reserving a register to point at the runtime.writeBarrier struct is possible, but would be a difficult tradeoff. However, many architectures already reserve a register to point at the executing G. If we could check write barrier status relative to G, it would save instructions on all RISC architectures.
There are many Gs. Since the enabled flag is updated very rarely, it would be possible to update some or all Gs whenever the master flag is updated. There is a tradeoff of which Gs to update: all of them, those that have Ms, or those that have Ps.
I have a proof of concept patch against the riscv tree (https://review.gerrithub.io/#/c/357282/ or sorear/riscv-go@dab0f89). I can rebase it against master if there is interest. This patch takes the approach of keeping Gs updated if they are referenced by any M; thus the additional STW latency scales as the number of Ms, but there is less potential for a race with exitsyscall.
It is far from clear that I have accounted for all possible races, especially with regards to asynchronous cgo callbacks that could(?) create new Ms at any time.
Initial results measured by .text size on cmd/compile:
```
         before   after    %
386      5730855  5731751 +0.016
amd64    6764675  6765155 +0.007
arm      6155060  6081080 -1.202
arm64    5850320  5725184 -2.139
mips64   7297336  7173880 -1.692
mips     7159648  7097940 -0.862
ppc64    6120800  6058392 -1.020
riscv    3986656  3924656 -1.555
s390x    8253200  8343808 +1.098
```
ppc64 and s390x do not benefit from this patch alone as the current backends for those architectures are unable to use the G register as a base register for memory accesses. For ppc64 I did the measurement with a one-line change that enables G as a base register, for s390x I tried to make a similar change but was not able to get it to work.

It may make sense to exclude s390x from the code generation change since s390x can fetch from an absolute address in one instruction; currently the code generation is conditionalized exclusively on hasGReg.

The demonstration patch keeps the per-G shadows updated even on 386 and amd64 where they are not used. There could be conditionals added to the runtime to avoid that overhead.
I do not have any physical user-programmable hardware of the most affected architectures. While a 2.1% reduction in static instructions for arm64 looks nice on paper, it's moot if it turns out to make things slower for whatever reason.
Is this strategically desirable? It makes moving away from STW at GC phase changes marginally more difficult, increases g size, and might cause other problems I'm not considering.

The text was updated successfully, but these errors were encountered:

josharian · 2017-04-17T14:42:50Z

Loosely related: #15245 and #19838.

rsc · 2017-04-17T19:47:29Z

Personally, I would much rather have lower complexity in the GC code than save 1% in text size. It sounds like the proposed changes are quite subtle and difficult to maintain, and that's already true of many other parts of the GC. I'm not convinced this is the right tradeoff, at least not right now.

rsc · 2017-05-15T20:06:35Z

-> @aclements for decision

sorear · 2017-05-15T20:19:29Z

Shall I make a patch against current master so someone can run benchmarks on affected architectures?

aclements · 2017-05-15T20:53:41Z

I'm somewhat concerned about the potential complexity of this, though your current implementation seems pretty simple.

This has interesting interactions with some changes I have on my list to experiment with. Primarily, I want to try an SSB-style write barrier where the code to append to a local write barrier buffer is inlined and only the overflow involves a call. This would need efficient access to a per-P buffer (which could be mirrored into the G or wherever's best), which probably needs to be hooked in at exactly the same places as your change.

There are many Gs. Since the enabled flag is updated very rarely, it would be possible to update some or all Gs whenever the master flag is updated. There is a tradeoff of which Gs to update: all of them, those that have Ms, or those that have Ps.

I don't think your CL's current approach of iterating over the Ms to set the flag during STW is the right approach. There can be a huge number of Ms, especially in syscall- or cgo-heavy code, so this could significantly increase STW time. It would be much better to do this over the Ps. Doing it over Ps also has a nice parallel with the write barriers themselves: code isn't allowed to execute a write barrier unless it has a P. There's probably some subtlety here, but I think the right places to hook this in would be acquirep1 and execute. I don't think you'd have to do anything specifically during STW since starting the world requires re-acquiring all of the Ps.

A large fraction of static instructions are used to implement the write barrier enabled check, which currently always uses an absolute memory reference.

Interesting. Do you have a number for this?

I do not have any physical user-programmable hardware of the most affected architectures. While a 2.1% reduction in static instructions for arm64 looks nice on paper, it's moot if it turns out to make things slower for whatever reason.

Do you have performance numbers for the architectures you do have access to? It would be particularly interesting to know the results for golang.org/x/benchmarks/garbage, which spends roughly 25% of the time with write barriers enabled.

rsc · 2017-06-12T20:22:02Z

Any progress on making a decision here?

aclements · 2017-06-20T19:34:43Z

@sorear, in principle I think we're okay with this, but we'd like to see a CL with the changes I mentioned above and, if possible, performance numbers.

I'm somewhat concerned about updating the shadow on the g0s. I definitely don't want to loop over the Ms because of the impact on STW time. However, code running on an M that isn't associated with a P isn't allowed to have write barriers anyway, so in principle updating the shadow for these g0s only needs to be done in acquirep.

aclements · 2017-06-21T18:19:45Z

@cherrymui points out that, rather than storing the value of the write barrier flag in the G, we could store a pointer to it. That would be far simpler and more robust, but would probably still reduce the inlined check sequence (though not as much).

sorear · 2017-06-21T19:01:20Z

That'd help on mips and possibly ppc64, but for arm, aarch64, and riscv64 global variable access is already 2-instructions.

(My copy of Go is generating a 3-instruction sequence for global variable access ADRP-ADD-LDR, but clang and gcc generate ADRP-LDR with the immediate folded into the LDR and there's no reason in principle the Go aarch64 backend couldn't be taught the same)

I'll make some time later to make an up to date CL and benchmark it on ARM hardware.

rsc · 2017-07-17T20:51:02Z

Marking this accepted, though of course if the performance is no good then it doesn't have to go in.

cherrymui · 2018-02-20T18:31:35Z

Coincidentally or not, architectures where accessing global variable is expensive tend to have a large number of registers. Another idea is that we could globally reserve a register to hold &runtime.writeBarrier. The load will be just a single instruction. This way, we only need to initialize the register in a few entry points (just like we set the G register), without the complexity of maintaining the per-G shadow.

I did a prototype on PPC64 (mostly working, but not complete). It speeds up go1 benchmarks about 1% , and reduces cmd/go text size by 0.6%.

The drawback here is that this is a potentially incompatible change in the aspect of handling user assembly code. The breakage may be rare if we carefully choose a high-numbered register.

Something to think about.

gopherbot added this to the Proposal milestone Apr 17, 2017

gopherbot added the Proposal label Apr 17, 2017

josharian changed the title ~~Proposal: per-G shadows of runtime.writeBarrier.enabled~~ proposal: add per-G shadows of runtime.writeBarrier.enabled Apr 17, 2017

rsc changed the title ~~proposal: add per-G shadows of runtime.writeBarrier.enabled~~ proposal: runtime: add per-G shadows of writeBarrier.enabled Apr 17, 2017

rsc added the Proposal-Accepted label Jul 17, 2017

rsc modified the milestones: Go1.10, Proposal Jul 17, 2017

rsc changed the title ~~proposal: runtime: add per-G shadows of writeBarrier.enabled~~ runtime: add per-G shadows of writeBarrier.enabled Jul 17, 2017

aclements modified the milestones: Go1.10, Go1.11 Nov 8, 2017

bradfitz modified the milestones: Go1.11, Unplanned Jun 20, 2018

josharian added the Performance label Mar 28, 2019

josharian mentioned this issue Apr 2, 2019

cmd/compile, runtime: reduce function prologue overhead #31193

Open

gopherbot added the compiler/runtime label Jul 7, 2022

mknyszek added this to Go Compiler / Runtime Jul 7, 2022

mknyszek removed this from Go Compiler / Runtime Jul 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

runtime: add per-G shadows of writeBarrier.enabled #20005

runtime: add per-G shadows of writeBarrier.enabled #20005

sorear commented Apr 17, 2017

josharian commented Apr 17, 2017

rsc commented Apr 17, 2017

rsc commented May 15, 2017

sorear commented May 15, 2017

aclements commented May 15, 2017

rsc commented Jun 12, 2017

aclements commented Jun 20, 2017

aclements commented Jun 21, 2017

sorear commented Jun 21, 2017

rsc commented Jul 17, 2017

cherrymui commented Feb 20, 2018

runtime: add per-G shadows of writeBarrier.enabled #20005

runtime: add per-G shadows of writeBarrier.enabled #20005

Comments

sorear commented Apr 17, 2017

josharian commented Apr 17, 2017

rsc commented Apr 17, 2017

rsc commented May 15, 2017

sorear commented May 15, 2017

aclements commented May 15, 2017

rsc commented Jun 12, 2017

aclements commented Jun 20, 2017

aclements commented Jun 21, 2017

sorear commented Jun 21, 2017

rsc commented Jul 17, 2017

cherrymui commented Feb 20, 2018