runtime, cmd/compile: consider write barrier short-circuit when *dst == src #14921

Open
josharian opened this issue Mar 22, 2016 · 27 comments
Labels: compiler/runtime (Issues related to the Go compiler and/or runtime), Performance
Milestone: Unplanned

Comments

@josharian
Contributor

I did a quick and dirty instrumentation of writebarrierptr and found that when executing cmd/compile to build std, about 25% of calls on average had *dst == src. In that case, there's no need to do the actual assignment or gray any objects (I think). I don't know how (a)typical that 25% number is.

We should investigate whether checking for this and short-circuiting is a net performance gain in general. There are multiple places in the stack where this check could be added:

(1) writebarrierptr (and friends)

Add

if *dst == src {
  return
}

to the beginning of the function.

(2) Wrapper routines

Add checks like:

if dst[1] != src[1] {
    writebarrierptr(&dst[1], src[1])
}

to the runtime routines that call writebarrierptr, like the wbfat.go routines, and writebarrierstring and friends. This is different than (1) insofar as it skips the writebarrierptr function call instead of having it return immediately.
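
For example, a rough sketch for writebarrierstring (assuming its current two-word signature; only the pointer word ever needs a barrier):

//go:nosplit
func writebarrierstring(dst *[2]uintptr, src [2]uintptr) {
	// Only take the write barrier when the pointer word actually changes.
	if dst[0] != src[0] {
		writebarrierptr(&dst[0], src[0])
	}
	dst[1] = src[1]
}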

(3) Generated code

We currently generate wb-calling code like:

if writeBarrier.enabled {
  writebarrierptr(dst, src)
} else {
  *dst = src
}

We could instead generate code like:

if *dst != src {
  if writeBarrier.enabled {
    writebarrierptr(dst, src)
  } else {
    *dst = src
  }
}

cc @aclements @randall77

I don't plan to work on this soon, as I need to undistract myself.

@josharian added this to the Unplanned milestone Mar 22, 2016
@OneOfOne
Contributor

Just for kicks, I tried that; it showed some mixed results for go test -v -bench=. -run=NONE runtime:

benchmark                        old ns/op     new ns/op     delta
BenchmarkGrowSliceBytes-8        33.3          32.4          -2.70%
BenchmarkGrowSliceInts-8         58.0          59.4          +2.41%
BenchmarkAppend-8                18.6          17.4          -6.45%
BenchmarkAppendGrowByte-8        1903243       1920778       +0.92%
BenchmarkAppendGrowString-8      278920100     143081182     -48.70% # << this is really interesting
BenchmarkAppend1Byte-8           3.13          3.11          -0.64%
BenchmarkAppend4Bytes-8          3.08          3.18          +3.25%
BenchmarkAppend7Bytes-8          3.12          3.19          +2.24%
BenchmarkAppend8Bytes-8          2.79          3.04          +8.96%
BenchmarkAppend15Bytes-8         4.03          4.27          +5.96%
BenchmarkAppend16Bytes-8         4.28          4.18          -2.34%
BenchmarkAppend32Bytes-8         3.16          3.38          +6.96%
BenchmarkAppendStr1Byte-8        3.10          3.62          +16.77%
BenchmarkAppendStr4Bytes-8       3.05          3.23          +5.90%
BenchmarkAppendStr8Bytes-8       3.06          2.83          -7.52%
BenchmarkAppendStr16Bytes-8      4.18          4.00          -4.31%
BenchmarkAppendStr32Bytes-8      3.45          4.38          +26.96%
BenchmarkAppendSpecialCase-8     18.3          19.3          +5.46%
BenchmarkCopy1Byte-8             3.86          4.47          +15.80%
BenchmarkCopy2Byte-8             3.93          4.20          +6.87%

Want me to prep a CL?

@randall77
Contributor

By "that", do you mean (1), (2), or (3)?

@randall77
Contributor

We have to be careful not to run afoul of the issue in #14855. I don't think this optimization would cause that problem, but it probably deserves a closer look.

@OneOfOne
Contributor

I tried 2, modified writebarrierptr itself.

I'll mess with it later and test against the case in #14855.

@OneOfOne
Contributor

I posted CL 21027 with the benchstat results.

@gopherbot

CL https://golang.org/cl/21027 mentions this issue.

@josharian
Contributor Author

Thinking out loud, one interesting aspect of modifying the calling code (3 above) is that it might allow other optimization passes to remove the code entirely by proving that *dst == src. I guess that depends on whether *dst is actually a load or is an ssa.Value.

@aclements
Member

The benchmark results from CL 21027 suggest that append may be the main source of idempotent pointer updates. Josh, do you happen to know what was one level up from the *dst==src write barriers? If it's mostly append, I suspect we can optimize just append to avoid the unnecessary pointer updates altogether.

One potential downside of modifying writebarrierptr this way is that it can cause extra coherence traffic by first pulling the pointer's cache line into shared and then upgrading it to modified. Though I doubt that's what's causing the slowdowns shown in the CL commit message, since those benchmarks are sequential.

I don't think these changes run afoul of the races in #14855. In that case we were doing the pointer write, but not doing the write barrier. It should be safe if we don't do either.

If we go the route of modifying the calling code, it would probably be better to make it:

if writeBarrier.enabled {
  if *dst != src {
    writebarrierptr(dst, src)
  }
} else {
  *dst = src
}

so we don't penalize the code with an extra (and potentially poorly predictable) branch when write barriers are off.

@OneOfOne
Contributor

@aclements I'm trying something like that; I'll let you know what happens after the benchmarks finish.

@aclements
Member

I did a bit of experimenting. All of these results are for 4c9a470 (two days ago; forgot to sync).

First, I reproduced @josharian's result using the following patch:

diff --git a/src/runtime/mbarrier.go b/src/runtime/mbarrier.go
index 523d890..15244f8 100644
--- a/src/runtime/mbarrier.go
+++ b/src/runtime/mbarrier.go
@@ -14,6 +14,7 @@
 package runtime

 import (
+       "runtime/internal/atomic"
        "runtime/internal/sys"
        "unsafe"
 )
@@ -123,10 +124,19 @@ func writebarrierptr_nostore1(dst *uintptr, src uintptr) {
        releasem(mp)
 }

+var wb, wbe, wbz uint64
+
 // NOTE: Really dst *unsafe.Pointer, src unsafe.Pointer,
 // but if we do that, Go inserts a write barrier on *dst = src.
 //go:nosplit
 func writebarrierptr(dst *uintptr, src uintptr) {
+       if src == 0 {
+               atomic.Xadd64(&wbz, +1)
+       } else if *dst == src {
+               atomic.Xadd64(&wbe, +1)
+       }
+       atomic.Xadd64(&wb, +1)
+
        *dst = src
        if writeBarrier.cgo {
                cgoCheckWriteBarrier(dst, src)
diff --git a/src/runtime/proc.go b/src/runtime/proc.go
index d386797..b146fa4 100644
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -208,6 +208,7 @@ func main() {
 // os_beforeExit is called from os.Exit(0).
 //go:linkname os_beforeExit os.runtime_beforeExit
 func os_beforeExit() {
+       println("WB", wb, wbe, wbz)
        if raceenabled {
                racefini()
        }

I get this result:

$ go build -a |& awk '/^WB/{n+=$2; eq+=$3; z+=$4} END{print n " " eq " " z}'
14525087 3412554 1422002

That is, of 14M write barriers, 10% are writing nil and 23.5% are *dst == src (but not writing nil). I separately checked that almost all of the 10% nil writes also have *dst == src.

We should probably skip the write barrier earlier if src is nil. Currently we go through quite a bit of work before checking for a nil pointer.
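
A rough sketch of that early exit, assuming the current shape of writebarrierptr (cgo check kept; the bad-pointer check is elided here):

//go:nosplit
func writebarrierptr(dst *uintptr, src uintptr) {
	*dst = src
	if writeBarrier.cgo {
		cgoCheckWriteBarrier(dst, src)
	}
	if !writeBarrier.needed || src == 0 {
		// Nothing to shade: either barriers are off or the new
		// pointer is nil, so skip the expensive path entirely.
		return
	}
	writebarrierptr_nostore1(dst, src)
}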

Focusing on the ones where *dst == src, but src != nil, I profiled where they're coming from using this patch:

diff --git a/src/runtime/mbarrier.go b/src/runtime/mbarrier.go
index 523d890..e8ce821 100644
--- a/src/runtime/mbarrier.go
+++ b/src/runtime/mbarrier.go
@@ -127,6 +127,13 @@ func writebarrierptr_nostore1(dst *uintptr, src uintptr) {
 // but if we do that, Go inserts a write barrier on *dst = src.
 //go:nosplit
 func writebarrierptr(dst *uintptr, src uintptr) {
+       if src != 0 && *dst == src {
+               cpc := getcallerpc((unsafe.Pointer)(&dst)) - 1
+               f := findfunc(cpc)
+               file, line := funcline(f, cpc)
+               print("WBE ", file, ":", line, " ", funcname(f), "\n")
+       }
+
        *dst = src
        if writeBarrier.cgo {
                cgoCheckWriteBarrier(dst, src)

The result is

$ go build -a std |& awk '/^WBE/{print $2}' logx | sort | uniq -c | sort -nr | head -n 10
  16519 /home/austin/go.dev/src/cmd/compile/internal/gc/typecheck.go:176
  15369 /home/austin/go.dev/src/cmd/compile/internal/gc/dcl.go:58
  15155 /home/austin/go.dev/src/cmd/compile/internal/ssa/sparseset.go:45
  12066 /home/austin/go.dev/src/cmd/compile/internal/gc/dcl.go:753
   8865 /home/austin/go.dev/src/cmd/compile/internal/ssa/value.go:167
   8493 /home/austin/go.dev/src/cmd/compile/internal/ssa/func.go:94
   8206 /home/austin/go.dev/src/cmd/compile/internal/ssa/stackalloc.go:266
   7828 /home/austin/go.dev/src/runtime/mheap.go:603
   7288 /home/austin/go.dev/src/runtime/mheap.go:196
   6871 /home/austin/go.dev/src/runtime/hashmap.go:793

Of these, all but dcl.go:58, dcl.go:753, mheap.go:603, and hashmap.go:793 are appends. @randall77, can we optimize the generated append code so that the path that grows the slice in place doesn't write back the pointer if we're writing to the same slice we're growing? I imagine this would be something like the slicing optimization.
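
In pseudocode (same style as the generated-code snippets above; this is the shape we'd want, not actual compiler output), growing in place would skip the pointer write-back entirely:

// s = append(s, x), with the result assigned back to s:
newlen := len(s) + 1
if newlen > cap(s) {
	// slow path: growslice may move the backing array, so the pointer
	// word must be written back, behind the write barrier
	s = growslice(...)
} else {
	// fast path: the backing array is unchanged and only the length
	// word changes (as in the reslicing optimization), so there is
	// no pointer store and no write barrier
	s = s[:newlen]
}
s[newlen-1] = x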

We can safely skip the write barrier on mheap.go:603. I'm a little surprised that it does equal pointer writes so often. There may also be an algorithmic improvement here.

We could put a conditional around the one in hashmap.go. I think that happens when we're iterating over a map and the iterator stays in the same bucket. I don't totally understand this code; there may be better ways to improve it.

@OneOfOne
Contributor

I accidentally uploaded a 2nd CL, https://go-review.googlesource.com/#/c/21138/

➜ benchcmp /tmp/master.txt /tmp/wbfat.txt | grep -v 0\. # just show changes > +-0.x%
benchmark                        old ns/op     new ns/op     delta
BenchmarkGrowSliceBytes-8        35.4          32.1          -9.32%
BenchmarkAppend1Byte-8           3.16          3.11          -1.58%
BenchmarkAppend8Bytes-8          2.79          2.69          -3.58%
BenchmarkAppendStr8Bytes-8       3.38          3.23          -4.44%
BenchmarkAppendStr32Bytes-8      3.23          3.18          -1.55%
BenchmarkAppendSpecialCase-8     17.6          18.5          +5.11%
BenchmarkCopy4Byte-8             3.89          3.78          -2.83%
BenchmarkCopy128Byte-8           5.47          5.62          +2.74%
BenchmarkCopy2String-8           3.52          3.62          +2.84%
BenchmarkCopy12String-8          3.68          3.52          -4.35%
BenchmarkCopy32String-8          3.47          3.35          -3.46%
BenchmarkCopy128String-8         5.47          5.33          -2.56%
BenchmarkCopyFat32-8             1.36          1.31          -3.68%
BenchmarkStackGrowthDeep-8       12598         13692         +8.68%

benchmark                     old MB/s     new MB/s     speedup

benchmark                        old allocs     new allocs     delta
BenchmarkAppendGrowByte-8        51             52             +1.96%
BenchmarkAppendGrowString-8      184            188            +2.17%

@rsc
Contributor

rsc commented Mar 25, 2016

@randall77, this is a regression in the SSA back end, similar to the reslicing one. The old back end avoided updating the base pointer in the append fast path, and therefore avoided the write barrier.

$ cat /tmp/x.go
package p

func f(x *[]int) []int {
    *x = append(*x, 1)
    return *x
}
$ go tool compile -ssa=0 -S /tmp/x.go | grep :4
    // fast path - no write barrier
    0x0028 00040 (/tmp/x.go:4)  MOVQ    "".x+88(FP), BX
    0x002d 00045 (/tmp/x.go:4)  NOP
    0x002d 00045 (/tmp/x.go:4)  MOVQ    (BX), DX
    0x0030 00048 (/tmp/x.go:4)  MOVQ    8(BX), AX
    0x0034 00052 (/tmp/x.go:4)  MOVQ    16(BX), CX
    0x0038 00056 (/tmp/x.go:4)  MOVQ    AX, BP
    0x003b 00059 (/tmp/x.go:4)  INCQ    BP
    0x003e 00062 (/tmp/x.go:4)  CMPQ    BP, CX
    0x0041 00065 (/tmp/x.go:4)  JHI $1, 128
    0x0043 00067 (/tmp/x.go:4)  MOVQ    BP, 8(BX)
    0x0047 00071 (/tmp/x.go:4)  LEAQ    (DX)(AX*8), BX
    0x004b 00075 (/tmp/x.go:4)  MOVQ    $1, (BX)

    // slow path code at bottom of function
    0x0080 00128 (/tmp/x.go:4)  LEAQ    type.[]int(SB), BX
    0x0087 00135 (/tmp/x.go:4)  MOVQ    BX, (SP)
    0x008b 00139 (/tmp/x.go:4)  MOVQ    DX, 8(SP)
    0x0090 00144 (/tmp/x.go:4)  MOVQ    AX, 16(SP)
    0x0095 00149 (/tmp/x.go:4)  MOVQ    CX, 24(SP)
    0x009a 00154 (/tmp/x.go:4)  MOVQ    BP, 32(SP)
    0x009f 00159 (/tmp/x.go:4)  PCDATA  $0, $0
    0x009f 00159 (/tmp/x.go:4)  CALL    runtime.growslice(SB)
    0x00a4 00164 (/tmp/x.go:4)  MOVQ    40(SP), DX
    0x00a9 00169 (/tmp/x.go:4)  MOVQ    48(SP), AX
    0x00ae 00174 (/tmp/x.go:4)  MOVQ    56(SP), CX
    0x00b3 00179 (/tmp/x.go:4)  MOVQ    "".x+88(FP), BX
    0x00b8 00184 (/tmp/x.go:4)  CMPQ    BX, $0
    0x00bc 00188 (/tmp/x.go:4)  JEQ $1, 260
    0x00be 00190 (/tmp/x.go:4)  MOVQ    AX, BP
    0x00c1 00193 (/tmp/x.go:4)  MOVQ    AX, "".autotmp_0001+64(SP)
    0x00c6 00198 (/tmp/x.go:4)  INCQ    BP
    0x00c9 00201 (/tmp/x.go:4)  MOVQ    BP, 8(BX)
    0x00cd 00205 (/tmp/x.go:4)  MOVQ    CX, 16(BX)
    0x00d1 00209 (/tmp/x.go:4)  MOVQ    DX, "".autotmp_0000+72(SP)
    0x00d6 00214 (/tmp/x.go:4)  CMPB    runtime.writeBarrier(SB), $0
    0x00dd 00221 (/tmp/x.go:4)  JNE $0, 231
    0x00df 00223 (/tmp/x.go:4)  MOVQ    DX, (BX)
    0x00e2 00226 (/tmp/x.go:4)  JMP 71
    0x00e7 00231 (/tmp/x.go:4)  MOVQ    BX, (SP)
    0x00eb 00235 (/tmp/x.go:4)  MOVQ    DX, 8(SP)
    0x00f0 00240 (/tmp/x.go:4)  PCDATA  $0, $1
    0x00f0 00240 (/tmp/x.go:4)  CALL    runtime.writebarrierptr(SB)
    0x00f5 00245 (/tmp/x.go:4)  MOVQ    "".autotmp_0000+72(SP), DX
    0x00fa 00250 (/tmp/x.go:4)  MOVQ    "".autotmp_0001+64(SP), AX
    0x00ff 00255 (/tmp/x.go:4)  JMP 71
    0x0104 00260 (/tmp/x.go:4)  MOVL    AX, (BX)
    0x0106 00262 (/tmp/x.go:4)  JMP 190
    0x0108 00264 (/tmp/x.go:4)  NOP
$ 

$ go tool compile -ssa=1 -S /tmp/x.go | grep :4
    // fast path - unnecessary update of base, including test of write barrier flag + possible call
    0x0017 00023 (/tmp/x.go:4)  MOVQ    "".x+88(FP), CX
    0x001c 00028 (/tmp/x.go:4)  MOVQ    16(CX), DX
    0x0020 00032 (/tmp/x.go:4)  MOVQ    8(CX), BX
    0x0024 00036 (/tmp/x.go:4)  MOVQ    BX, "".autotmp_0000+72(SP)
    0x0029 00041 (/tmp/x.go:4)  MOVQ    (CX), BP
    0x002c 00044 (/tmp/x.go:4)  LEAQ    1(BX), SI
    0x0030 00048 (/tmp/x.go:4)  MOVQ    SI, "".autotmp_0001+64(SP)
    0x0035 00053 (/tmp/x.go:4)  CMPQ    SI, DX
    0x0038 00056 (/tmp/x.go:4)  JGT $0, 140
    0x003a 00058 (/tmp/x.go:4)  MOVQ    $1, (BP)(BX*8)
    0x0043 00067 (/tmp/x.go:4)  MOVQ    SI, 8(CX)
    0x0047 00071 (/tmp/x.go:4)  MOVQ    DX, 16(CX)
    0x004b 00075 (/tmp/x.go:4)  MOVL    runtime.writeBarrier(SB), AX
    0x0051 00081 (/tmp/x.go:4)  TESTB   AL, AL
    0x0053 00083 (/tmp/x.go:4)  JNE $0, 119
    0x0055 00085 (/tmp/x.go:4)  MOVQ    BP, (CX)

    // slow path for write barrier call in append "fast" path
    0x0077 00119 (/tmp/x.go:4)  MOVQ    CX, (SP)
    0x007b 00123 (/tmp/x.go:4)  MOVQ    BP, 8(SP)
    0x0080 00128 (/tmp/x.go:4)  PCDATA  $0, $0
    0x0080 00128 (/tmp/x.go:4)  CALL    runtime.writebarrierptr(SB)

    // slow path for append calling growslice
    0x008c 00140 (/tmp/x.go:4)  LEAQ    type.[]int(SB), AX
    0x0093 00147 (/tmp/x.go:4)  MOVQ    AX, (SP)
    0x0097 00151 (/tmp/x.go:4)  MOVQ    BP, 8(SP)
    0x009c 00156 (/tmp/x.go:4)  MOVQ    BX, 16(SP)
    0x00a1 00161 (/tmp/x.go:4)  MOVQ    DX, 24(SP)
    0x00a6 00166 (/tmp/x.go:4)  MOVQ    SI, 32(SP)
    0x00ab 00171 (/tmp/x.go:4)  PCDATA  $0, $0
    0x00ab 00171 (/tmp/x.go:4)  CALL    runtime.growslice(SB)
    0x00b0 00176 (/tmp/x.go:4)  MOVQ    40(SP), BP
    0x00b5 00181 (/tmp/x.go:4)  MOVQ    56(SP), DX
    0x00bf 00191 (/tmp/x.go:4)  MOVQ    "".autotmp_0000+72(SP), BX
    0x00c4 00196 (/tmp/x.go:4)  MOVQ    "".autotmp_0001+64(SP), SI
    0x00c9 00201 (/tmp/x.go:4)  JMP 58
    0x00ce 00206 (/tmp/x.go:4)  NOP
$ 

@randall77
Contributor

For the hashtable iterator case, we can just check explicitly whether we're writing something new back.
I'll open a separate bug for the append writeback.
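
For the hashtable case, the check is roughly this shape (illustrative only; the real field names in hashmap.go differ):

// Only store the iterator's saved key pointer when it actually changes,
// so the write barrier is skipped when the iterator stays put.
if it.key != k {
	it.key = k
}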

@josharian
Contributor Author

I remeasured this after CLs 21812, 21813, and 21814, which address the append and the map iteration cases.

New results, using Austin's approach above:

29354 1173 2934

So about 4% of writes have *dst == src, and 10% have src == nil.

The new top lines for *dst == src (line numbers at commit de7ee57) are:

 37.57%   420 runtime/proc.go:259 runtime.gopark
 14.13%   158 runtime/mheap.go:603 runtime.(*mheap).allocSpanLocked
  5.28%    59 runtime/mheap.go:588 runtime.(*mheap).allocSpanLocked
  4.92%    55 runtime/mgc.go:1725 runtime.gcCopySpans
  4.92%    55 runtime/mgc.go:1726 runtime.gcCopySpans

The gopark calls are unnecessary, since all the strings that reach it are constants, but there's no way for the compiler to know that.

I've filed #15226 for allocSpanLocked. gcCopySpans might deserve a look as well, but we're close to scraping bottom (yay!).

I will investigate whether it is better to exit writebarrierptr early when src == nil or to check for it in the generated code, and will send a CL for one of those to close this issue.
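
For reference, the generated-code variant would look roughly like this (pseudocode in the same style as the snippets above, not actual compiler output):

if writeBarrier.enabled && src != nil {
	writebarrierptr(dst, src)
} else {
	// src is nil or barriers are off; a plain store suffices because
	// there is no new object to shade.
	*dst = src
}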

@gopherbot

CL https://golang.org/cl/21813 mentions this issue.

@gopherbot

CL https://golang.org/cl/21814 mentions this issue.

gopherbot pushed a commit that referenced this issue Apr 10, 2016
Update #14921

Change-Id: I5c5816d0193757bf7465b1e09c27ca06897df4bf
Reviewed-on: https://go-review.googlesource.com/21814
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
@josharian
Contributor Author

Mailed CLs 21820 and 21821 that implement the two "do less work if src==nil" strategies described above. The numbers for adding a check to the wrapper code are better, although it does increase binary size by ~0.1%. It's not obvious to me which of the two (if either) should go in. Feedback or better benchmark numbers welcomed.

@gopherbot

CL https://golang.org/cl/21821 mentions this issue.

@gopherbot

CL https://golang.org/cl/21820 mentions this issue.

gopherbot pushed a commit that referenced this issue Apr 11, 2016
When we are writing the result of an append back
to the same slice, we don’t need a write barrier
on the fast path.

This re-implements an optimization that was present
in the old backend.

Updates #14921
Fixes #14969

Sample code:

var x []byte

func p() {
	x = append(x, 1, 2, 3)
}

Before:

"".p t=1 size=224 args=0x0 locals=0x48
	0x0000 00000 (append.go:21)	TEXT	"".p(SB), $72-0
	0x0000 00000 (append.go:21)	MOVQ	(TLS), CX
	0x0009 00009 (append.go:21)	CMPQ	SP, 16(CX)
	0x000d 00013 (append.go:21)	JLS	199
	0x0013 00019 (append.go:21)	SUBQ	$72, SP
	0x0017 00023 (append.go:21)	FUNCDATA	$0, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
	0x0017 00023 (append.go:21)	FUNCDATA	$1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
	0x0017 00023 (append.go:19)	MOVQ	"".x+16(SB), CX
	0x001e 00030 (append.go:19)	MOVQ	"".x(SB), DX
	0x0025 00037 (append.go:19)	MOVQ	"".x+8(SB), BX
	0x002c 00044 (append.go:19)	MOVQ	BX, "".autotmp_0+64(SP)
	0x0031 00049 (append.go:22)	LEAQ	3(BX), BP
	0x0035 00053 (append.go:22)	CMPQ	BP, CX
	0x0038 00056 (append.go:22)	JGT	$0, 131
	0x003a 00058 (append.go:22)	MOVB	$1, (DX)(BX*1)
	0x003e 00062 (append.go:22)	MOVB	$2, 1(DX)(BX*1)
	0x0043 00067 (append.go:22)	MOVB	$3, 2(DX)(BX*1)
	0x0048 00072 (append.go:22)	MOVQ	BP, "".x+8(SB)
	0x004f 00079 (append.go:22)	MOVQ	CX, "".x+16(SB)
	0x0056 00086 (append.go:22)	MOVL	runtime.writeBarrier(SB), AX
	0x005c 00092 (append.go:22)	TESTB	AL, AL
	0x005e 00094 (append.go:22)	JNE	$0, 108
	0x0060 00096 (append.go:22)	MOVQ	DX, "".x(SB)
	0x0067 00103 (append.go:23)	ADDQ	$72, SP
	0x006b 00107 (append.go:23)	RET
	0x006c 00108 (append.go:22)	LEAQ	"".x(SB), CX
	0x0073 00115 (append.go:22)	MOVQ	CX, (SP)
	0x0077 00119 (append.go:22)	MOVQ	DX, 8(SP)
	0x007c 00124 (append.go:22)	PCDATA	$0, $0
	0x007c 00124 (append.go:22)	CALL	runtime.writebarrierptr(SB)
	0x0081 00129 (append.go:23)	JMP	103
	0x0083 00131 (append.go:22)	LEAQ	type.[]uint8(SB), AX
	0x008a 00138 (append.go:22)	MOVQ	AX, (SP)
	0x008e 00142 (append.go:22)	MOVQ	DX, 8(SP)
	0x0093 00147 (append.go:22)	MOVQ	BX, 16(SP)
	0x0098 00152 (append.go:22)	MOVQ	CX, 24(SP)
	0x009d 00157 (append.go:22)	MOVQ	BP, 32(SP)
	0x00a2 00162 (append.go:22)	PCDATA	$0, $0
	0x00a2 00162 (append.go:22)	CALL	runtime.growslice(SB)
	0x00a7 00167 (append.go:22)	MOVQ	40(SP), DX
	0x00ac 00172 (append.go:22)	MOVQ	48(SP), AX
	0x00b1 00177 (append.go:22)	MOVQ	56(SP), CX
	0x00b6 00182 (append.go:22)	ADDQ	$3, AX
	0x00ba 00186 (append.go:19)	MOVQ	"".autotmp_0+64(SP), BX
	0x00bf 00191 (append.go:22)	MOVQ	AX, BP
	0x00c2 00194 (append.go:22)	JMP	58
	0x00c7 00199 (append.go:22)	NOP
	0x00c7 00199 (append.go:21)	CALL	runtime.morestack_noctxt(SB)
	0x00cc 00204 (append.go:21)	JMP	0

After:

"".p t=1 size=208 args=0x0 locals=0x48
	0x0000 00000 (append.go:21)	TEXT	"".p(SB), $72-0
	0x0000 00000 (append.go:21)	MOVQ	(TLS), CX
	0x0009 00009 (append.go:21)	CMPQ	SP, 16(CX)
	0x000d 00013 (append.go:21)	JLS	191
	0x0013 00019 (append.go:21)	SUBQ	$72, SP
	0x0017 00023 (append.go:21)	FUNCDATA	$0, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
	0x0017 00023 (append.go:21)	FUNCDATA	$1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
	0x0017 00023 (append.go:19)	MOVQ	"".x+16(SB), CX
	0x001e 00030 (append.go:19)	MOVQ	"".x+8(SB), DX
	0x0025 00037 (append.go:19)	MOVQ	DX, "".autotmp_0+64(SP)
	0x002a 00042 (append.go:19)	MOVQ	"".x(SB), BX
	0x0031 00049 (append.go:22)	LEAQ	3(DX), BP
	0x0035 00053 (append.go:22)	MOVQ	BP, "".x+8(SB)
	0x003c 00060 (append.go:22)	CMPQ	BP, CX
	0x003f 00063 (append.go:22)	JGT	$0, 84
	0x0041 00065 (append.go:22)	MOVB	$1, (BX)(DX*1)
	0x0045 00069 (append.go:22)	MOVB	$2, 1(BX)(DX*1)
	0x004a 00074 (append.go:22)	MOVB	$3, 2(BX)(DX*1)
	0x004f 00079 (append.go:23)	ADDQ	$72, SP
	0x0053 00083 (append.go:23)	RET
	0x0054 00084 (append.go:22)	LEAQ	type.[]uint8(SB), AX
	0x005b 00091 (append.go:22)	MOVQ	AX, (SP)
	0x005f 00095 (append.go:22)	MOVQ	BX, 8(SP)
	0x0064 00100 (append.go:22)	MOVQ	DX, 16(SP)
	0x0069 00105 (append.go:22)	MOVQ	CX, 24(SP)
	0x006e 00110 (append.go:22)	MOVQ	BP, 32(SP)
	0x0073 00115 (append.go:22)	PCDATA	$0, $0
	0x0073 00115 (append.go:22)	CALL	runtime.growslice(SB)
	0x0078 00120 (append.go:22)	MOVQ	40(SP), CX
	0x007d 00125 (append.go:22)	MOVQ	56(SP), AX
	0x0082 00130 (append.go:22)	MOVQ	AX, "".x+16(SB)
	0x0089 00137 (append.go:22)	MOVL	runtime.writeBarrier(SB), AX
	0x008f 00143 (append.go:22)	TESTB	AL, AL
	0x0091 00145 (append.go:22)	JNE	$0, 168
	0x0093 00147 (append.go:22)	MOVQ	CX, "".x(SB)
	0x009a 00154 (append.go:22)	MOVQ	"".x(SB), BX
	0x00a1 00161 (append.go:19)	MOVQ	"".autotmp_0+64(SP), DX
	0x00a6 00166 (append.go:22)	JMP	65
	0x00a8 00168 (append.go:22)	LEAQ	"".x(SB), DX
	0x00af 00175 (append.go:22)	MOVQ	DX, (SP)
	0x00b3 00179 (append.go:22)	MOVQ	CX, 8(SP)
	0x00b8 00184 (append.go:22)	PCDATA	$0, $0
	0x00b8 00184 (append.go:22)	CALL	runtime.writebarrierptr(SB)
	0x00bd 00189 (append.go:22)	JMP	154
	0x00bf 00191 (append.go:22)	NOP
	0x00bf 00191 (append.go:21)	CALL	runtime.morestack_noctxt(SB)
	0x00c4 00196 (append.go:21)	JMP	0

Change-Id: I77a41ad3a22557a4bb4654de7d6d24a029efe34a
Reviewed-on: https://go-review.googlesource.com/21813
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
@randall77
Contributor

So besides src==nil and *dst==src, there's another condition that would be interesting to get stats on: dst points to the stack. It might illuminate code (I'm looking at you, convT2I and friends) where the writes are always within the stack. We might still need the write barrier because of stack barriers, but maybe some rearrangement/optimization would be possible.
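
A rough way to gather that stat (sketch only; wbStack is a made-up uint64 counter, and this assumes hanging the check off the Go-level writebarrierptr as in the earlier instrumentation patches):

// Count barriers whose destination lies on the current goroutine's stack.
gp := getg()
if uintptr(unsafe.Pointer(dst)) >= gp.stack.lo && uintptr(unsafe.Pointer(dst)) < gp.stack.hi {
	atomic.Xadd64(&wbStack, +1)
}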

@josharian
Contributor Author

josharian commented Mar 2, 2018

I had a sudden suspicion that I should look again at this. With the move of the write barrier to assembly, it has gotten a bit harder to instrument. Leaving in the vestigial wbz variable so I could re-use Austin's awk command, here is my instrumentation for amd64:

diff --git a/src/runtime/asm_amd64.s b/src/runtime/asm_amd64.s
index 2376fe0aae..f07c0bcbe1 100644
--- a/src/runtime/asm_amd64.s
+++ b/src/runtime/asm_amd64.s
@@ -2384,6 +2384,15 @@ TEXT runtime·gcWriteBarrier(SB),NOSPLIT,$120
        // faster than having the caller spill these.
        MOVQ    R14, 104(SP)
        MOVQ    R13, 112(SP)
+
+       MOVQ    $1, R13
+       CMPQ    AX, (DI)
+       JNE     3(PC)
+       LOCK; XADDQ R13, runtime·wbe(SB)
+
+       MOVQ    $1, R13
+       LOCK; XADDQ R13, runtime·wb(SB)
+
        // TODO: Consider passing g.m.p in as an argument so they can be shared
        // across a sequence of write barriers.
        get_tls(R13)
diff --git a/src/runtime/proc.go b/src/runtime/proc.go
index f20e77eee5..0e0c1d446c 100644
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -224,9 +224,12 @@ func main() {
        }
 }
 
+var wb, wbe, wbz uint64
+
 // os_beforeExit is called from os.Exit(0).
 //go:linkname os_beforeExit os.runtime_beforeExit
 func os_beforeExit() {
+       println("WB", wb, wbe, wbz)
        if raceenabled {
                racefini()
        }

The result for make.bash piped through Austin's awk command is: 70074409 16189273 0.

That's >23% of write barriers with *dst == src, which is suspiciously similar to the amount prior to the append optimization. I suspect that that optimization got lost when the write barrier insertion got moved into the SSA backend. I've filed #24209 to look at that again.

@cherrymui
Member

I suspect that that optimization got lost when the write barrier insertion got moved into the SSA backend.

No, it is not lost. Using @rsc's test case:

$ cat x.go
package p

func f(x *[]int) []int {
    *x = append(*x, 1)
    return *x
}
$ go tool compile -S x.go | grep :4
	// fast path
	0x0026 00038 (x.go:4)	MOVQ	8(AX), CX
	0x002a 00042 (x.go:4)	MOVQ	16(AX), DX
	0x002e 00046 (x.go:4)	MOVQ	(AX), BX
	0x0031 00049 (x.go:4)	LEAQ	1(CX), SI
	0x0035 00053 (x.go:4)	CMPQ	SI, DX
	0x0038 00056 (x.go:4)	JGT	110
	0x003a 00058 (x.go:4)	LEAQ	1(CX), DX
	0x003e 00062 (x.go:4)	MOVQ	DX, 8(AX)
	0x0042 00066 (x.go:4)	MOVQ	$1, (BX)(CX*8)
	// slow path
	0x006e 00110 (x.go:4)	LEAQ	type.int(SB), AX
	0x0075 00117 (x.go:4)	MOVQ	AX, (SP)
	0x0079 00121 (x.go:4)	MOVQ	BX, 8(SP)
	0x007e 00126 (x.go:4)	MOVQ	CX, 16(SP)
	0x0083 00131 (x.go:4)	MOVQ	DX, 24(SP)
	0x0088 00136 (x.go:4)	MOVQ	SI, 32(SP)
	0x008d 00141 (x.go:4)	PCDATA	$0, $0
	0x008d 00141 (x.go:4)	CALL	runtime.growslice(SB)
	0x0092 00146 (x.go:4)	MOVQ	40(SP), AX
	0x0097 00151 (x.go:4)	MOVQ	48(SP), CX
	0x009c 00156 (x.go:4)	MOVQ	56(SP), DX
	0x00a1 00161 (x.go:4)	MOVQ	"".x+80(SP), DI
	0x00a6 00166 (x.go:4)	MOVQ	DX, 16(DI)
	0x00aa 00170 (x.go:4)	CMPL	runtime.writeBarrier(SB), $0
	0x00b1 00177 (x.go:4)	JNE	193
	0x00b3 00179 (x.go:4)	MOVQ	AX, (DI)
	0x00b6 00182 (x.go:4)	MOVQ	AX, BX
	0x00b9 00185 (x.go:4)	MOVQ	DI, AX
	0x00bc 00188 (x.go:4)	JMP	58
	0x00c1 00193 (x.go:4)	CALL	runtime.gcWriteBarrier(SB)
	0x00c6 00198 (x.go:4)	JMP	182
	0x00c8 00200 (x.go:4)	NOP

I think we also have a test that makes sure that we never write the pointer field in the non-growing case.

@josharian
Contributor Author

Hmm. Thanks. I wonder where all the *dst == src writebarrier calls are coming from now.

@josharian
Contributor Author

Maybe we need the check for zero... maybe all the *dst == src cases are ones where both are nil?

Any which way, this seems worth digging into a bit.

@gopherbot

Change https://golang.org/cl/99078 mentions this issue: runtime: convert g.waitreason from string to uint8

gopherbot pushed a commit that referenced this issue Mar 12, 2018
Every time I poke at #14921, the g.waitreason string
pointer writes show up.

They're not particularly important performance-wise,
but it'd be nice to clear the noise away.

And it does open up a few extra bytes in the g struct
for some future use.

Change-Id: I7ffbd52fbc2a286931a2218038fda52ed6473cc9
Reviewed-on: https://go-review.googlesource.com/99078
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
@gopherbot

Change https://golang.org/cl/111255 mentions this issue: runtime: convert g.waitreason from string to uint8

@gopherbot

Change https://golang.org/cl/111256 mentions this issue: runtime: convert g.waitreason from string to uint8

gopherbot pushed a commit that referenced this issue May 3, 2018
Every time I poke at #14921, the g.waitreason string
pointer writes show up.

They're not particularly important performance-wise,
but it'd be nice to clear the noise away.

And it does open up a few extra bytes in the g struct
for some future use.

This is a re-roll of CL 99078, which was rolled
back because of failures on s390x.
Those failures were apparently due to an old version of gdb.

Change-Id: Icc2c12f449b2934063fd61e272e06237625ed589
Reviewed-on: https://go-review.googlesource.com/111256
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Michael Munday <mike.munday@ibm.com>
josharian added a commit to josharian/go that referenced this issue May 8, 2018
Every time I poke at golang#14921, the g.waitreason string
pointer writes show up.

They're not particularly important performance-wise,
but it'd be nice to clear the noise away.

And it does open up a few extra bytes in the g struct
for some future use.

Change-Id: I7ffbd52fbc2a286931a2218038fda52ed6473cc9
@gopherbot added the compiler/runtime label Jul 7, 2022