Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: group write barrier calls more aggresively #19838

Closed
cherrymui opened this issue Apr 4, 2017 · 9 comments
Closed

cmd/compile: group write barrier calls more aggresively #19838

cherrymui opened this issue Apr 4, 2017 · 9 comments

Comments

@cherrymui
Copy link
Member

Currently, for things like

v0 = some memory
v1 = Store ...  v0 // write barrier
v2 = Store ...  v1 // no write barrier
v3 = Store ...  v2 // write barrier
v4 = Store ...  v3 // no write barrier
v5 = Store ...  v4 // write barrier

The compiler generates separate branches for v1, v3, v5. Something like:

v0 = some memory
if write barrier is on {
    ... // prepare args
    v1a = Call runtime.writebarrierPtr ...
} else {
    v1b = Store ...  v0 // regular store
}
v1 = Phi v1a v1b
v2 = Store ...  v1 // no write barrier
if write barrier is on {
    ... // prepare args
    v3a = Call runtime.writebarrierPtr ...
} else {
    v3b = Store ...  v2 // regular store
}
v3 = Phi v3a v3b
v4 = Store ...  v3 // no write barrier
if write barrier is on {
    ... // prepare args
    v5a = Call runtime.writebarrierPtr ...
} else {
    v5b = Store ...  v4 // regular store
}
v5 = Phi v5a v5b

Sometimes it creates too many blocks. (When building SSA we try to put pointer writes together exactly for this reason.)

It would be better to group them in a single branch, when it is safe. But it needs to know whether it is ok to reorder stores.

Alternatively, maybe we could duplicate v2, v4 to both side of the branch (some conditions needed, at least v2.Uses == 1). But the question is how many of them we want to duplicate. What if there are 5 no-WB stores between two WB stores?

@randall77
Copy link
Contributor

A simple rule of thumb would be to do whatever generates less total code. So if the source is

storeptr
store * N
storeptr

Compare (some approximation of) the code size between

if wbon {
  storeptrwb
} else {
  storeptr
}
store * N
if wbon {
  storeptrwb
} else {
  storeptr
}

and

if wbon {
   storeptrwb
   store *N
   storeptrwb
} else {
   storeptr
   store *N
   storeptr
}

So basically, if the size of the N stores is smaller than an additional condition (load, compare, conditional branch, unconditional branch).

@josharian
Copy link
Contributor

I've started experimenting with this. I hope to some data tomorrow.

@gopherbot
Copy link

CL https://golang.org/cl/42010 mentions this issue.

@gopherbot
Copy link

CL https://golang.org/cl/42011 mentions this issue.

@gopherbot
Copy link

CL https://golang.org/cl/42013 mentions this issue.

@josharian
Copy link
Contributor

The CL series just mailed takes a look at this. The final CL needs some numbers. I'd love suggestions about what other than binary size to look at--e.g. regular benchmarks to run.

Unfortunately, my original personal motivation for digging into this--improving compilation time of large static literals--isn't helped, since that involves duplicating calls, not stores, and that's probably a bad idea. So back to the drawing board on those.

gopherbot pushed a commit that referenced this issue Apr 27, 2017
This CL mainly moves some work to the switch on w.Op,
to make a follow-up change simpler and clearer.

Updates #19838

Change-Id: I86f3181c380dd60960afcc24224f655276b8956c
Reviewed-on: https://go-review.googlesource.com/42010
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
josharian added a commit to josharian/go that referenced this issue May 2, 2017
…ocks

OpVarXXX Values don't generate instructions,
so there's no reason not to duplicate them,
and duplicating them generates better code
(fewer branches).

This requires changing the start/end accounting
to correctly handle the case in which we have run
of Values beginning with an OpVarXXX, e.g.
OpVarDef, OpZeroWB, OpMoveWB.
In that case, the sequence of values should begin
at the OpZeroWB, not the OpVarDef.

This also lays the groundwork for experimenting
with allowing duplication of some scalar stores.

Shrinks function text sizes a tiny amount:

name        old object-bytes  new object-bytes  delta
Template           381k ± 0%         381k ± 0%  -0.01%  (p=0.008 n=5+5)
Unicode            203k ± 0%         203k ± 0%  -0.04%  (p=0.008 n=5+5)
GoTypes           1.17M ± 0%        1.17M ± 0%  -0.01%  (p=0.008 n=5+5)
SSA               8.24M ± 0%        8.24M ± 0%  -0.00%  (p=0.008 n=5+5)
Flate              230k ± 0%         230k ± 0%    ~     (all equal)
GoParser           286k ± 0%         286k ± 0%    ~     (all equal)
Reflect           1.00M ± 0%        1.00M ± 0%    ~     (all equal)
Tar                189k ± 0%         189k ± 0%    ~     (all equal)
XML                415k ± 0%         415k ± 0%  -0.01%  (p=0.008 n=5+5)

Updates golang#19838

Change-Id: Ic5ef30855919f1468066eba08ae5c4bd9a01db27
josharian added a commit to josharian/go that referenced this issue May 2, 2017
DO NOT SUBMIT

[mailing for initial feedback, needs benchmark numbers]

Duplicate some scalar stores on both halves of a
writeBarrier.enabled block, to reduce branching.

There are other memory ops besides OpVarXXX and OpStore
that we could consider here: OpMove, OpZero, Op*Call,
atomic stores, OpSelect1, OpPhi.

However, OpVarXXX and OpStore are by far the most common,
and they correspond to a fixed number of instructions.

The others are all rare and are either tricky
(OpSelect1, OpPhi, atomic stores) or
correspond to potentially lots of code.

Fixes golang#19838

Change-Id: I0d5a68c2dfb3b6a6916b92151b3e64df533d4581
josharian added a commit to josharian/go that referenced this issue May 2, 2017
This CL mainly moves some work to the switch on w.Op,
to make a follow-up change simpler and clearer.

Updates golang#19838

Change-Id: I86f3181c380dd60960afcc24224f655276b8956c
gopherbot pushed a commit that referenced this issue May 9, 2017
…ocks

OpVarXXX Values don't generate instructions,
so there's no reason not to duplicate them,
and duplicating them generates better code
(fewer branches).

This requires changing the start/end accounting
to correctly handle the case in which we have run
of Values beginning with an OpVarXXX, e.g.
OpVarDef, OpZeroWB, OpMoveWB.
In that case, the sequence of values should begin
at the OpZeroWB, not the OpVarDef.

This also lays the groundwork for experimenting
with allowing duplication of some scalar stores.

Shrinks function text sizes a tiny amount:

name        old object-bytes  new object-bytes  delta
Template           381k ± 0%         381k ± 0%  -0.01%  (p=0.008 n=5+5)
Unicode            203k ± 0%         203k ± 0%  -0.04%  (p=0.008 n=5+5)
GoTypes           1.17M ± 0%        1.17M ± 0%  -0.01%  (p=0.008 n=5+5)
SSA               8.24M ± 0%        8.24M ± 0%  -0.00%  (p=0.008 n=5+5)
Flate              230k ± 0%         230k ± 0%    ~     (all equal)
GoParser           286k ± 0%         286k ± 0%    ~     (all equal)
Reflect           1.00M ± 0%        1.00M ± 0%    ~     (all equal)
Tar                189k ± 0%         189k ± 0%    ~     (all equal)
XML                415k ± 0%         415k ± 0%  -0.01%  (p=0.008 n=5+5)

Updates #19838

Change-Id: Ic5ef30855919f1468066eba08ae5c4bd9a01db27
Reviewed-on: https://go-review.googlesource.com/42011
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
josharian added a commit to josharian/go that referenced this issue May 10, 2017
…ocks

OpVarXXX Values don't generate instructions,
so there's no reason not to duplicate them,
and duplicating them generates better code
(fewer branches).

This requires changing the start/end accounting
to correctly handle the case in which we have run
of Values beginning with an OpVarXXX, e.g.
OpVarDef, OpZeroWB, OpMoveWB.
In that case, the sequence of values should begin
at the OpZeroWB, not the OpVarDef.

This also lays the groundwork for experimenting
with allowing duplication of some scalar stores.

Shrinks function text sizes a tiny amount:

name        old object-bytes  new object-bytes  delta
Template           381k ± 0%         381k ± 0%  -0.01%  (p=0.008 n=5+5)
Unicode            203k ± 0%         203k ± 0%  -0.04%  (p=0.008 n=5+5)
GoTypes           1.17M ± 0%        1.17M ± 0%  -0.01%  (p=0.008 n=5+5)
SSA               8.24M ± 0%        8.24M ± 0%  -0.00%  (p=0.008 n=5+5)
Flate              230k ± 0%         230k ± 0%    ~     (all equal)
GoParser           286k ± 0%         286k ± 0%    ~     (all equal)
Reflect           1.00M ± 0%        1.00M ± 0%    ~     (all equal)
Tar                189k ± 0%         189k ± 0%    ~     (all equal)
XML                415k ± 0%         415k ± 0%  -0.01%  (p=0.008 n=5+5)

Updates golang#19838

Change-Id: Ic5ef30855919f1468066eba08ae5c4bd9a01db27
josharian added a commit to josharian/go that referenced this issue May 10, 2017
DO NOT SUBMIT

[mailing for initial feedback, needs benchmark numbers]

Duplicate some scalar stores on both halves of a
writeBarrier.enabled block, to reduce branching.

There are other memory ops besides OpVarXXX and OpStore
that we could consider here: OpMove, OpZero, Op*Call,
atomic stores, OpSelect1, OpPhi.

However, OpVarXXX and OpStore are by far the most common,
and they correspond to a fixed number of instructions.

The others are all rare and are either tricky
(OpSelect1, OpPhi, atomic stores) or
correspond to potentially lots of code.

Fixes golang#19838

Change-Id: I0d5a68c2dfb3b6a6916b92151b3e64df533d4581
@josharian
Copy link
Contributor

Anything more on this front should wait for 1.10.

@josharian josharian modified the milestones: Go1.10, Go1.9 May 15, 2017
@bradfitz bradfitz modified the milestones: Go1.10, Go1.11 Nov 28, 2017
@bradfitz bradfitz modified the milestones: Go1.11, Unplanned May 18, 2018
@aclements
Copy link
Member

This optimization would interact poorly with safe-points everywhere, since we have to disallow safe-points between the write barrier flag check and the last write barrier that depends on it (or, if we can construct the write barrier as a restartable sequence, back up to the flag load, but that doesn't help this optimization). Though I think this optimization may be less important now, too, since we generate significantly less code for a write barrier call than we used to. We could generate even less if we pulled the write out of runtime.gcWriteBarrier.

@josharian
Copy link
Contributor

I never got conclusive improvements when I implemented this. Let’s focus any effort on this front on improving write barriers generally. (E.g. #24686)

@golang golang locked and limited conversation to collaborators May 18, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

6 participants