Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: optimize append performance #14758

Closed
dsnet opened this issue Mar 10, 2016 · 12 comments
Closed

cmd/compile: optimize append performance #14758

dsnet opened this issue Mar 10, 2016 · 12 comments

Comments

@dsnet
Copy link
Member

dsnet commented Mar 10, 2016

Using go devel +ed4a27a

I ran the following benchmarks:

func BenchmarkFoo(b *testing.B) {
    a := make([]byte, 1000)
    b.SetBytes(int64(len(a)))

    for i := 0; i < b.N; i++ {
        var n int
        for j := 0; j < 1000; j++ {
            a[n] = byte(j)
            n++
        }
        a = a[:n]
    }
}

func BenchmarkBar(b *testing.B) {
    a := make([]byte, 1000)
    b.SetBytes(int64(len(a)))

    for i := 0; i < b.N; i++ {
        a = a[:0]
        for j := 0; j < 1000; j++ {
            a = append(a, byte(j))
        }
    }
}

Currently I see:

BenchmarkFoo-4   3000000           588 ns/op    1700.17 MB/s
BenchmarkBar-4   2000000           871 ns/op    1147.11 MB/s

I expect to see these two to perform about the same.

@klauspost

@bradfitz
Copy link
Contributor

Dup of #11419?

I'm not sure I'd expect them to be equal, though. append is a runtime call and will be more expensive than not making a call.

@martisch
Copy link
Contributor

similar to #14718
Here its slightly different in that the a have the same cap at the end.
But the argument that one is a runtime call i would share and so they can never be equal.
But maybe the gap can be narrowed...

@randall77
Copy link
Contributor

This is unrelated to either #11419 or #14718. It should be fixable.

We inline the easy part of append in the compiler. a = append(a, b) compiles to something like:

if len(a) == cap(a) {
  a = ...some runtime call...
}
a[len(a)] = b
len(a)++

The branch is predicted not taken (and will never be taken in this case), so the grow path is out-of-line.

The inner loop of BenchmarkFoo is:

    0x006b 00107 (/home/khr/go/issue14758.go:12)    CMPQ    BP, AX
    0x006e 00110 (/home/khr/go/issue14758.go:12)    JCC $0, 173
    0x0070 00112 (/home/khr/go/issue14758.go:12)    MOVB    BPB, (BX)(BP*1)
    0x0074 00116 (/home/khr/go/issue14758.go:11)    INCQ    BP
    0x0077 00119 (/home/khr/go/issue14758.go:11)    CMPQ    BP, $1000
    0x007e 00126 (/home/khr/go/issue14758.go:11)    JLT $0, 107

The inner loop of BenchmarkBar is:

    0x0083 00131 (/home/khr/go/issue14758.go:26)    LEAQ    1(BP), SI
    0x0087 00135 (/home/khr/go/issue14758.go:26)    MOVQ    SI, "".autotmp_0012+72(SP)
    0x008c 00140 (/home/khr/go/issue14758.go:26)    CMPQ    SI, AX
    0x008f 00143 (/home/khr/go/issue14758.go:26)    JGT $0, 191
    0x0091 00145 (/home/khr/go/issue14758.go:26)    MOVB    $0, (DX)(BP*1)
    0x0095 00149 (/home/khr/go/issue14758.go:26)    MOVQ    SI, BP
    0x0098 00152 (/home/khr/go/issue14758.go:20)    MOVQ    BP, "".autotmp_0011+80(SP)
    0x009d 00157 (/home/khr/go/issue14758.go:25)    CMPQ    BP, $1000
    0x00a4 00164 (/home/khr/go/issue14758.go:25)    JLT $0, 131

I suspect that most of the time is lost with the extra writes to those autotmp variables. We might be able to lift those writes to after the loop, or get rid of them altogether. David was thinking about this at some point.

@dr2chase

@bradfitz bradfitz added this to the Unplanned milestone Mar 10, 2016
@dsnet
Copy link
Member Author

dsnet commented Mar 10, 2016

Just for context, this came up in CL/20513, and provided a very small performance boost of ~1% when the slice length was tracked manually.

This is not a priority, but a nice to have.

@gopherbot
Copy link

CL https://golang.org/cl/20570 mentions this issue.

gopherbot pushed a commit that referenced this issue Mar 13, 2016
* Shaves about 10k from pkg/tools/linux_amd64.
* Was suggested by drchase before
* Found by looking at ssa output of #14758

Change-Id: If2c4ddf3b2603d4dfd8fb4d9199b9a3dcb05b17d
Reviewed-on: https://go-review.googlesource.com/20570
Reviewed-by: David Chase <drchase@google.com>
Run-TryBot: Alexandru Moșoi <alexandru@mosoi.ro>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@ALTree
Copy link
Member

ALTree commented Mar 21, 2017

On tip (to-be-1.9), 10 runs, I see:

name   time/op
Foo-4   1.02µs ± 1%
Bar-4   1.02µs ± 0%

name   speed
Foo-4  977MB/s ± 1%
Bar-4  978MB/s ± 0%

I guess this is fixed, so I'm closing this issue.

@ALTree ALTree closed this as completed Mar 21, 2017
@ALTree
Copy link
Member

ALTree commented Mar 21, 2017

Looks like actually is BenchmarkFoo that got slower on tip; so they're not the same (on tip) because Bar got faster, but because Foo got slower (resp. go1.8). Reopening (and sorry for the noise).

@ALTree ALTree reopened this Mar 21, 2017
@bradfitz bradfitz modified the milestones: Go1.9Maybe, Unplanned Mar 21, 2017
@randall77
Copy link
Contributor

The original bug was fixed recently with better spill code generation: https://go-review.googlesource.com/c/34822/
The regression from 1.8 is #15837

@bradfitz
Copy link
Contributor

@randall77, does that mean this bug should be closed, or ... ?

@randall77
Copy link
Contributor

Well, I suspect the regression is because of #15837. Leave this open so I can double-check after I get a fix together for that issue.

@gopherbot
Copy link

CL https://golang.org/cl/38431 mentions this issue.

gopherbot pushed a commit that referenced this issue May 16, 2017
In Go 1.8.x, panics are generally scheduled at the very end of functions.
That property was lost in Go 1.9; this CL restores it.

This helps with the Fannkuch benchmark:

name          old time/op  new time/op  delta
Fannkuch11-8   2.74s ± 2%   2.55s ± 2%  -7.03%  (p=0.000 n=20+20)

This increases the fannkuch function size from 801 bytes to 831 bytes,
but that is still smaller than Go 1.8.1 at 844 bytes.

It generally increases binary size a tiny amount.
Negligible compiler performance impact.

For the code in #14758:

name   old time/op    new time/op    delta
Foo-8     326ns ± 3%     312ns ± 3%  -4.32%  (p=0.000 n=28+30)
Bar-8     560ns ± 2%     565ns ± 2%  +0.96%  (p=0.002 n=30+27)

Updates #18977

name        old alloc/op      new alloc/op      delta
Template         38.8MB ± 0%       38.8MB ± 0%    ~     (p=0.690 n=5+5)
Unicode          28.7MB ± 0%       28.7MB ± 0%    ~     (p=0.841 n=5+5)
GoTypes           109MB ± 0%        109MB ± 0%    ~     (p=0.690 n=5+5)
Compiler          457MB ± 0%        457MB ± 0%    ~     (p=0.841 n=5+5)
SSA              1.10GB ± 0%       1.10GB ± 0%  +0.03%  (p=0.032 n=5+5)
Flate            24.4MB ± 0%       24.5MB ± 0%    ~     (p=0.690 n=5+5)
GoParser         30.9MB ± 0%       30.9MB ± 0%    ~     (p=0.421 n=5+5)
Reflect          73.3MB ± 0%       73.3MB ± 0%    ~     (p=1.000 n=5+5)
Tar              25.5MB ± 0%       25.5MB ± 0%    ~     (p=0.095 n=5+5)
XML              40.8MB ± 0%       40.9MB ± 0%    ~     (p=0.056 n=5+5)
[Geo mean]       71.6MB            71.6MB       +0.01%

name        old allocs/op     new allocs/op     delta
Template           395k ± 0%         394k ± 1%    ~     (p=1.000 n=5+5)
Unicode            344k ± 0%         344k ± 0%    ~     (p=0.690 n=5+5)
GoTypes           1.16M ± 0%        1.16M ± 0%    ~     (p=0.421 n=5+5)
Compiler          4.41M ± 0%        4.41M ± 0%    ~     (p=0.841 n=5+5)
SSA               9.79M ± 0%        9.79M ± 0%    ~     (p=0.310 n=5+5)
Flate              237k ± 0%         237k ± 0%    ~     (p=0.841 n=5+5)
GoParser           321k ± 0%         321k ± 1%    ~     (p=0.421 n=5+5)
Reflect            956k ± 0%         956k ± 0%    ~     (p=1.000 n=5+5)
Tar                251k ± 1%         252k ± 0%    ~     (p=0.095 n=5+5)
XML                399k ± 0%         400k ± 0%    ~     (p=0.222 n=5+5)
[Geo mean]         741k              741k       +0.03%

name        old object-bytes  new object-bytes  delta
Template           386k ± 0%         386k ± 0%  +0.05%  (p=0.008 n=5+5)
Unicode            202k ± 0%         202k ± 0%  +0.02%  (p=0.008 n=5+5)
GoTypes           1.16M ± 0%        1.16M ± 0%  +0.07%  (p=0.008 n=5+5)
Compiler          3.91M ± 0%        3.91M ± 0%  +0.05%  (p=0.008 n=5+5)
SSA               7.86M ± 0%        7.87M ± 0%  +0.07%  (p=0.008 n=5+5)
Flate              227k ± 0%         227k ± 0%  +0.10%  (p=0.008 n=5+5)
GoParser           283k ± 0%         283k ± 0%  +0.04%  (p=0.008 n=5+5)
Reflect            950k ± 0%         951k ± 0%  +0.04%  (p=0.008 n=5+5)
Tar                187k ± 0%         187k ± 0%  -0.03%  (p=0.008 n=5+5)
XML                406k ± 0%         406k ± 0%  +0.04%  (p=0.008 n=5+5)
[Geo mean]         647k              647k       +0.04%

Change-Id: I2015aa26338b90cf41e47f89564e336dc02608df
Reviewed-on: https://go-review.googlesource.com/43293
Reviewed-by: Keith Randall <khr@golang.org>
gopherbot pushed a commit that referenced this issue May 18, 2017
looprotate finds loop headers and arranges for them to be placed
after the body of the loop. This eliminates a jump from the body.

However, if the loop header is a series of contiguously laid out blocks,
the rotation introduces a new jump in that series.
This CL expands the "loop header" to move to be the entire
run of contiguously laid out blocks in the same loop.

This shrinks object files a little, and actually speeds up
the compiler noticeably. Numbers below.

Fannkuch performance seems to vary a lot by machine. On my laptop:

name          old time/op  new time/op  delta
Fannkuch11-8   2.89s ± 2%   2.85s ± 3%  -1.22%  (p=0.000 n=50+50)

This has a significant affect on the append benchmarks in #14758:

name   old time/op    new time/op    delta
Foo-8     312ns ± 3%     276ns ± 2%  -11.37%  (p=0.000 n=30+29)
Bar-8     565ns ± 2%     456ns ± 2%  -19.27%  (p=0.000 n=27+28)

Updates #18977
Fixes #20355

name        old time/op       new time/op       delta
Template          205ms ± 5%        204ms ± 8%    ~     (p=0.903 n=92+99)
Unicode          85.3ms ± 4%       85.1ms ± 3%    ~     (p=0.191 n=92+94)
GoTypes           512ms ± 4%        507ms ± 4%  -0.93%  (p=0.000 n=95+97)
Compiler          2.38s ± 3%        2.35s ± 3%  -1.27%  (p=0.000 n=98+95)
SSA               4.67s ± 3%        4.64s ± 3%  -0.62%  (p=0.000 n=95+96)
Flate             117ms ± 3%        117ms ± 3%    ~     (p=0.099 n=84+86)
GoParser          139ms ± 4%        137ms ± 4%  -0.90%  (p=0.000 n=97+98)
Reflect           329ms ± 5%        326ms ± 6%  -0.97%  (p=0.002 n=99+98)
Tar               102ms ± 6%        101ms ± 5%  -0.97%  (p=0.006 n=97+97)
XML               198ms ±10%        196ms ±13%    ~     (p=0.087 n=100+100)
[Geo mean]        318ms             316ms       -0.72%

name        old user-time/op  new user-time/op  delta
Template          250ms ± 7%        250ms ± 7%    ~     (p=0.850 n=94+92)
Unicode           107ms ± 8%        106ms ± 5%  -0.76%  (p=0.005 n=98+91)
GoTypes           665ms ± 5%        659ms ± 5%  -0.85%  (p=0.003 n=93+98)
Compiler          3.15s ± 3%        3.10s ± 3%  -1.60%  (p=0.000 n=99+98)
SSA               6.82s ± 3%        6.72s ± 4%  -1.55%  (p=0.000 n=94+98)
Flate             138ms ± 8%        138ms ± 6%    ~     (p=0.369 n=94+92)
GoParser          170ms ± 5%        168ms ± 6%  -1.13%  (p=0.002 n=96+98)
Reflect           412ms ± 8%        416ms ± 8%    ~     (p=0.169 n=100+100)
Tar               123ms ±18%        123ms ±14%    ~     (p=0.896 n=100+100)
XML               236ms ± 9%        234ms ±11%    ~     (p=0.124 n=100+100)
[Geo mean]        401ms             398ms       -0.63%

name        old alloc/op      new alloc/op      delta
Template         38.8MB ± 0%       38.8MB ± 0%    ~     (p=0.222 n=5+5)
Unicode          28.7MB ± 0%       28.7MB ± 0%    ~     (p=0.421 n=5+5)
GoTypes           109MB ± 0%        109MB ± 0%    ~     (p=0.056 n=5+5)
Compiler          457MB ± 0%        457MB ± 0%  +0.07%  (p=0.008 n=5+5)
SSA              1.10GB ± 0%       1.10GB ± 0%  +0.05%  (p=0.008 n=5+5)
Flate            24.5MB ± 0%       24.5MB ± 0%    ~     (p=0.222 n=5+5)
GoParser         30.9MB ± 0%       31.0MB ± 0%  +0.21%  (p=0.016 n=5+5)
Reflect          73.4MB ± 0%       73.4MB ± 0%    ~     (p=0.421 n=5+5)
Tar              25.5MB ± 0%       25.5MB ± 0%    ~     (p=0.548 n=5+5)
XML              40.9MB ± 0%       40.9MB ± 0%    ~     (p=0.151 n=5+5)
[Geo mean]       71.6MB            71.6MB       +0.07%

name        old allocs/op     new allocs/op     delta
Template           394k ± 0%         394k ± 0%    ~     (p=1.000 n=5+5)
Unicode            344k ± 0%         343k ± 0%    ~     (p=0.310 n=5+5)
GoTypes           1.16M ± 0%        1.16M ± 0%    ~     (p=1.000 n=5+5)
Compiler          4.42M ± 0%        4.42M ± 0%    ~     (p=1.000 n=5+5)
SSA               9.80M ± 0%        9.80M ± 0%    ~     (p=0.095 n=5+5)
Flate              237k ± 1%         238k ± 1%    ~     (p=0.310 n=5+5)
GoParser           320k ± 0%         322k ± 1%  +0.50%  (p=0.032 n=5+5)
Reflect            958k ± 0%         957k ± 0%    ~     (p=0.548 n=5+5)
Tar                252k ± 1%         252k ± 0%    ~     (p=1.000 n=5+5)
XML                400k ± 0%         400k ± 0%    ~     (p=0.841 n=5+5)
[Geo mean]         741k              742k       +0.06%

name        old object-bytes  new object-bytes  delta
Template           386k ± 0%         386k ± 0%  -0.05%  (p=0.008 n=5+5)
Unicode            202k ± 0%         202k ± 0%  -0.01%  (p=0.008 n=5+5)
GoTypes           1.16M ± 0%        1.16M ± 0%  -0.06%  (p=0.008 n=5+5)
Compiler          3.91M ± 0%        3.91M ± 0%  -0.06%  (p=0.008 n=5+5)
SSA               7.91M ± 0%        7.92M ± 0%  +0.01%  (p=0.008 n=5+5)
Flate              228k ± 0%         227k ± 0%  -0.04%  (p=0.008 n=5+5)
GoParser           283k ± 0%         283k ± 0%  -0.06%  (p=0.008 n=5+5)
Reflect            952k ± 0%         951k ± 0%  -0.02%  (p=0.008 n=5+5)
Tar                187k ± 0%         187k ± 0%  -0.04%  (p=0.008 n=5+5)
XML                406k ± 0%         406k ± 0%  -0.05%  (p=0.008 n=5+5)
[Geo mean]         648k              648k       -0.04%

Change-Id: I8630c4291a0eb2f7e7927bc04d7cc0efef181094
Reviewed-on: https://go-review.googlesource.com/43491
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
josharian added a commit to josharian/go that referenced this issue May 23, 2017
For the append benchmarks in golang#14758:

name   old time/op    new time/op    delta
Foo-8     276ns ± 2%     275ns ± 1%  -0.52%  (p=0.008 n=29+29)
Bar-8     456ns ± 2%     456ns ± 2%    ~     (p=0.290 n=28+29)

Change-Id: I6bc74ad26ff632388f0ab355ec998f7dc9eaee8d
@gopherbot
Copy link

Change https://golang.org/cl/43492 mentions this issue: cmd/compile: mark exit blocks as very unlikely in regalloc

josharian added a commit to josharian/go that referenced this issue Nov 13, 2017
For the append benchmarks in golang#14758:

name   old time/op    new time/op    delta
Foo-8     276ns ± 2%     275ns ± 1%  -0.52%  (p=0.008 n=29+29)
Bar-8     456ns ± 2%     456ns ± 2%    ~     (p=0.290 n=28+29)

Change-Id: I6bc74ad26ff632388f0ab355ec998f7dc9eaee8d
@golang golang locked and limited conversation to collaborators Aug 26, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

6 participants