runtime, cmd/compile: consider write barrier short-circuit when *dst == src #14921
Just for kicks, I tried that; it showed some mixed results. Want me to prep a CL?

By "that", do you mean (1), (2), or (3)?

We have to be careful not to run afoul of the issue in #14855. I don't think this optimization would cause that problem, but it probably deserves a closer look.

I tried (2). I'll mess with it later and test against the case in #14855.

I posted CL 21027 with the benchstat results.

CL https://golang.org/cl/21027 mentions this issue.
Thinking out loud, one interesting aspect of modifying the calling code (3 above) is that it might allow other optimization passes to remove the code entirely by proving that *dst == src.
The benchmark results from CL 21027 suggest that append may be the main source of idempotent pointer updates. Josh, do you happen to know what was one level up from the append calls?

One potential downside of modifying writebarrierptr this way is that it can cause extra coherence traffic by first pulling the pointer's cache line into the shared state and then upgrading it to modified. Though I doubt that's what's causing the slowdowns shown in the CL commit message, since those benchmarks are sequential.

I don't think these changes run afoul of the races in #14855. In that case we were doing the pointer write, but not doing the write barrier. It should be safe if we do neither.

If we go the route of modifying the calling code, it would probably be better to test *dst != src only when write barriers are enabled, so we don't penalize the code with an extra (and potentially poorly predictable) branch when write barriers are off.
@aclements I'm trying something like that; I'll let you know what happens after the benchmarks finish.
I did a bit of experimenting. All of these results are for 4c9a470 (two days ago; forgot to sync). First, I reproduced @josharian's result using the following patch:
I get this result:
That is, of 14M write barriers, 10% are writing nil and 23.5% have *dst == src (but are not writing nil). I separately checked that almost all of the 10% nil writes also have *dst == src. We should probably skip the write barrier earlier if src is nil; currently we go through quite a bit of work before checking for a nil pointer. Focusing on the writes where *dst == src but src != nil, I profiled where they're coming from using this patch:
The result is:
Of these, all but dcl.go:58, dcl.go:753, mheap.go:603, and hashmap.go:793 are appends.

@randall77, can we optimize the generated append code so that the path that grows the slice in place doesn't write back the pointer if we're writing to the same slice we're growing? I imagine this would be something like the slicing optimization.

We can safely skip the write barrier on mheap.go:603. I'm a little surprised it does equal pointer writes so often; there may also be an algorithmic improvement here.

We could put a conditional around the write in hashmap.go. I think that happens when we're iterating over a map and the iterator stays in the same bucket. I don't totally understand this code; there may be better ways to improve it.
I accidentally uploaded a second CL, https://go-review.googlesource.com/#/c/21138/

@randall77, this is a regression in the SSA back end, similar to the reslicing one. The old back end avoided updating the base pointer in the append fast path, and therefore avoided the write barrier.

For the hashtable iterator case, we can just do the check to see if we're writing something new back explicitly.
I remeasured this after CLs 21812, 21813, and 21814, which address the append and the map iteration cases. New results, using Austin's approach above:
So 6% of writes have *dst == src, and 10% have src == nil. The new top lines for *dst == src (line numbers at commit de7ee57) are:
The gopark calls are unnecessary, since all the strings that reach it are constants, but there's no way for the compiler to know that. I've filed #15226 for allocSpanLocked. gcCopySpans might deserve a look as well, but we're close to scraping bottom (yay!). I will investigate whether it is better to exit writebarrierptr early when src == nil or to check it in the generated code, and will send a CL for one of those to close this issue.
CL https://golang.org/cl/21813 mentions this issue. |
CL https://golang.org/cl/21814 mentions this issue. |
Update #14921

Change-Id: I5c5816d0193757bf7465b1e09c27ca06897df4bf
Reviewed-on: https://go-review.googlesource.com/21814
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Mailed CLs 21820 and 21821 that implement the two "do less work if src == nil" strategies described above. The numbers for adding a check to the wrapper code are better, although it does increase binary size by ~0.1%. It's not obvious to me which of the two (if either) should go in. Feedback or better benchmark numbers welcomed.
CL https://golang.org/cl/21821 mentions this issue. |
CL https://golang.org/cl/21820 mentions this issue. |
When we are writing the result of an append back to the same slice, we don't need a write barrier on the fast path. This re-implements an optimization that was present in the old backend.

Updates #14921
Fixes #14969

Sample code:

```go
var x []byte

func p() {
	x = append(x, 1, 2, 3)
}
```

Before:

```
"".p t=1 size=224 args=0x0 locals=0x48
0x0000 00000 (append.go:21) TEXT "".p(SB), $72-0
0x0000 00000 (append.go:21) MOVQ (TLS), CX
0x0009 00009 (append.go:21) CMPQ SP, 16(CX)
0x000d 00013 (append.go:21) JLS 199
0x0013 00019 (append.go:21) SUBQ $72, SP
0x0017 00023 (append.go:21) FUNCDATA $0, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
0x0017 00023 (append.go:21) FUNCDATA $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
0x0017 00023 (append.go:19) MOVQ "".x+16(SB), CX
0x001e 00030 (append.go:19) MOVQ "".x(SB), DX
0x0025 00037 (append.go:19) MOVQ "".x+8(SB), BX
0x002c 00044 (append.go:19) MOVQ BX, "".autotmp_0+64(SP)
0x0031 00049 (append.go:22) LEAQ 3(BX), BP
0x0035 00053 (append.go:22) CMPQ BP, CX
0x0038 00056 (append.go:22) JGT $0, 131
0x003a 00058 (append.go:22) MOVB $1, (DX)(BX*1)
0x003e 00062 (append.go:22) MOVB $2, 1(DX)(BX*1)
0x0043 00067 (append.go:22) MOVB $3, 2(DX)(BX*1)
0x0048 00072 (append.go:22) MOVQ BP, "".x+8(SB)
0x004f 00079 (append.go:22) MOVQ CX, "".x+16(SB)
0x0056 00086 (append.go:22) MOVL runtime.writeBarrier(SB), AX
0x005c 00092 (append.go:22) TESTB AL, AL
0x005e 00094 (append.go:22) JNE $0, 108
0x0060 00096 (append.go:22) MOVQ DX, "".x(SB)
0x0067 00103 (append.go:23) ADDQ $72, SP
0x006b 00107 (append.go:23) RET
0x006c 00108 (append.go:22) LEAQ "".x(SB), CX
0x0073 00115 (append.go:22) MOVQ CX, (SP)
0x0077 00119 (append.go:22) MOVQ DX, 8(SP)
0x007c 00124 (append.go:22) PCDATA $0, $0
0x007c 00124 (append.go:22) CALL runtime.writebarrierptr(SB)
0x0081 00129 (append.go:23) JMP 103
0x0083 00131 (append.go:22) LEAQ type.[]uint8(SB), AX
0x008a 00138 (append.go:22) MOVQ AX, (SP)
0x008e 00142 (append.go:22) MOVQ DX, 8(SP)
0x0093 00147 (append.go:22) MOVQ BX, 16(SP)
0x0098 00152 (append.go:22) MOVQ CX, 24(SP)
0x009d 00157 (append.go:22) MOVQ BP, 32(SP)
0x00a2 00162 (append.go:22) PCDATA $0, $0
0x00a2 00162 (append.go:22) CALL runtime.growslice(SB)
0x00a7 00167 (append.go:22) MOVQ 40(SP), DX
0x00ac 00172 (append.go:22) MOVQ 48(SP), AX
0x00b1 00177 (append.go:22) MOVQ 56(SP), CX
0x00b6 00182 (append.go:22) ADDQ $3, AX
0x00ba 00186 (append.go:19) MOVQ "".autotmp_0+64(SP), BX
0x00bf 00191 (append.go:22) MOVQ AX, BP
0x00c2 00194 (append.go:22) JMP 58
0x00c7 00199 (append.go:22) NOP
0x00c7 00199 (append.go:21) CALL runtime.morestack_noctxt(SB)
0x00cc 00204 (append.go:21) JMP 0
```

After:

```
"".p t=1 size=208 args=0x0 locals=0x48
0x0000 00000 (append.go:21) TEXT "".p(SB), $72-0
0x0000 00000 (append.go:21) MOVQ (TLS), CX
0x0009 00009 (append.go:21) CMPQ SP, 16(CX)
0x000d 00013 (append.go:21) JLS 191
0x0013 00019 (append.go:21) SUBQ $72, SP
0x0017 00023 (append.go:21) FUNCDATA $0, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
0x0017 00023 (append.go:21) FUNCDATA $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
0x0017 00023 (append.go:19) MOVQ "".x+16(SB), CX
0x001e 00030 (append.go:19) MOVQ "".x+8(SB), DX
0x0025 00037 (append.go:19) MOVQ DX, "".autotmp_0+64(SP)
0x002a 00042 (append.go:19) MOVQ "".x(SB), BX
0x0031 00049 (append.go:22) LEAQ 3(DX), BP
0x0035 00053 (append.go:22) MOVQ BP, "".x+8(SB)
0x003c 00060 (append.go:22) CMPQ BP, CX
0x003f 00063 (append.go:22) JGT $0, 84
0x0041 00065 (append.go:22) MOVB $1, (BX)(DX*1)
0x0045 00069 (append.go:22) MOVB $2, 1(BX)(DX*1)
0x004a 00074 (append.go:22) MOVB $3, 2(BX)(DX*1)
0x004f 00079 (append.go:23) ADDQ $72, SP
0x0053 00083 (append.go:23) RET
0x0054 00084 (append.go:22) LEAQ type.[]uint8(SB), AX
0x005b 00091 (append.go:22) MOVQ AX, (SP)
0x005f 00095 (append.go:22) MOVQ BX, 8(SP)
0x0064 00100 (append.go:22) MOVQ DX, 16(SP)
0x0069 00105 (append.go:22) MOVQ CX, 24(SP)
0x006e 00110 (append.go:22) MOVQ BP, 32(SP)
0x0073 00115 (append.go:22) PCDATA $0, $0
0x0073 00115 (append.go:22) CALL runtime.growslice(SB)
0x0078 00120 (append.go:22) MOVQ 40(SP), CX
0x007d 00125 (append.go:22) MOVQ 56(SP), AX
0x0082 00130 (append.go:22) MOVQ AX, "".x+16(SB)
0x0089 00137 (append.go:22) MOVL runtime.writeBarrier(SB), AX
0x008f 00143 (append.go:22) TESTB AL, AL
0x0091 00145 (append.go:22) JNE $0, 168
0x0093 00147 (append.go:22) MOVQ CX, "".x(SB)
0x009a 00154 (append.go:22) MOVQ "".x(SB), BX
0x00a1 00161 (append.go:19) MOVQ "".autotmp_0+64(SP), DX
0x00a6 00166 (append.go:22) JMP 65
0x00a8 00168 (append.go:22) LEAQ "".x(SB), DX
0x00af 00175 (append.go:22) MOVQ DX, (SP)
0x00b3 00179 (append.go:22) MOVQ CX, 8(SP)
0x00b8 00184 (append.go:22) PCDATA $0, $0
0x00b8 00184 (append.go:22) CALL runtime.writebarrierptr(SB)
0x00bd 00189 (append.go:22) JMP 154
0x00bf 00191 (append.go:22) NOP
0x00bf 00191 (append.go:21) CALL runtime.morestack_noctxt(SB)
0x00c4 00196 (append.go:21) JMP 0
```

Change-Id: I77a41ad3a22557a4bb4654de7d6d24a029efe34a
Reviewed-on: https://go-review.googlesource.com/21813
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
So besides src == nil and *dst == src, there's another condition that would be interesting to get stats on: dst points to the stack. It might illuminate code (I'm looking at you, convT2I and friends) where the writes are always within the stack. We might still need the write barrier because of stack barriers, but maybe some rearrangement/optimization would be possible.
I had a sudden suspicion that I should look again at this. With the move of the write barrier to assembly, it has gotten a bit harder to instrument. Leaving in the vestigial wbz counter, the instrumentation patch is:

```diff
diff --git a/src/runtime/asm_amd64.s b/src/runtime/asm_amd64.s
index 2376fe0aae..f07c0bcbe1 100644
--- a/src/runtime/asm_amd64.s
+++ b/src/runtime/asm_amd64.s
@@ -2384,6 +2384,15 @@ TEXT runtime·gcWriteBarrier(SB),NOSPLIT,$120
 	// faster than having the caller spill these.
 	MOVQ	R14, 104(SP)
 	MOVQ	R13, 112(SP)
+
+	MOVQ	$1, R13
+	CMPQ	AX, (DI)
+	JNE	3(PC)
+	LOCK; XADDQ	R13, runtime·wbe(SB)
+
+	MOVQ	$1, R13
+	LOCK; XADDQ	R13, runtime·wb(SB)
+
 	// TODO: Consider passing g.m.p in as an argument so they can be shared
 	// across a sequence of write barriers.
 	get_tls(R13)
diff --git a/src/runtime/proc.go b/src/runtime/proc.go
index f20e77eee5..0e0c1d446c 100644
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -224,9 +224,12 @@ func main() {
 	}
 }
 
+var wb, wbe, wbz uint64
+
 // os_beforeExit is called from os.Exit(0).
 //go:linkname os_beforeExit os.runtime_beforeExit
 func os_beforeExit() {
+	println("WB", wb, wbe, wbz)
 	if raceenabled {
 		racefini()
 	}
```

The result for make.bash piped through Austin's awk command: more than 23% of write barriers still have *dst == src.
No, it is not lost. Using @rsc's test case:
I think we also have a test that makes sure that we never write the pointer field in the non-growing case.

Hmm. Thanks. I wonder where all the *dst == src writebarrier calls are coming from now.

Maybe we need the check for zero... maybe all the *dst == src cases are ones where both are nil? Either way, this seems worth digging into a bit.
Change https://golang.org/cl/99078 mentions this issue: |
Every time I poke at #14921, the g.waitreason string pointer writes show up. They're not particularly important performance-wise, but it'd be nice to clear the noise away. And it does open up a few extra bytes in the g struct for some future use.

Change-Id: I7ffbd52fbc2a286931a2218038fda52ed6473cc9
Reviewed-on: https://go-review.googlesource.com/99078
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
Change https://golang.org/cl/111255 mentions this issue: |
Change https://golang.org/cl/111256 mentions this issue: |
Every time I poke at #14921, the g.waitreason string pointer writes show up. They're not particularly important performance-wise, but it'd be nice to clear the noise away. And it does open up a few extra bytes in the g struct for some future use.

This is a re-roll of CL 99078, which was rolled back because of failures on s390x. Those failures were apparently due to an old version of gdb.

Change-Id: Icc2c12f449b2934063fd61e272e06237625ed589
Reviewed-on: https://go-review.googlesource.com/111256
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Michael Munday <mike.munday@ibm.com>
I did a quick and dirty instrumentation of writebarrierptr and found that when executing cmd/compile to build std, about 25% of calls on average had *dst == src. In that case, there's no need to do the actual assignment or gray any objects (I think). I don't know how (a)typical that 25% number is.

We should investigate whether checking for this and short-circuiting is a net performance gain, in general. There are multiple places in the stack where this could occur:
(1) writebarrierptr (and friends)

Add a check to the beginning of the method that returns immediately when *dst == src.
(2) Wrapper routines

Add checks to the runtime routines that call writebarrierptr (the wbfat.go routines, writebarrierstring, and friends) so that they only make the call when *dst != src. This is different than (1) insofar as it skips the writebarrierptr function call instead of having it return immediately.

(3) Generated code
We currently generate wb-calling code that tests writeBarrier.enabled and either calls writebarrierptr or performs the plain store. We could instead generate code that first tests *dst != src and skips both the store and the barrier when they are equal.
cc @aclements @randall77
I don't plan to work on this soon, as I need to undistract myself.