
cmd/compile: performance regression with SSA backend #14606

Closed

dgryski opened this issue Mar 2, 2016 · 13 comments

@dgryski
Contributor

dgryski commented Mar 2, 2016

Please answer these questions before submitting your issue. Thanks!

  1. What version of Go are you using (go version)?
    go version devel +0d1a98e Wed Mar 2 13:01:44 2016 +0000 linux/amd64
    and
    go version go1.6 linux/amd64
  2. What operating system and processor architecture are you using (go env)?
    linux/amd64
  3. What did you do?
    I ran benchmarks for a consistent hashing algorithm (multi-probe consistent hashing; code at https://github.com/dgryski/go-mpchash) with the new SSA backend and with Go 1.6. (A rough sketch of the benchmark shape appears after the results below.)
 ~/work/src/cvs/go.tip/bin/go test -test.bench=Lookup -test.count=20 >choose.ssa
go test -test.bench=Lookup -test.count=20 >choose.16
  4. What did you expect to see?

No performance regressions.

  5. What did you see instead?

The number after the benchmark is the number of shards. For 8 and 32 shards, the new SSA backend is faster. For larger numbers of shards, the lookup time is worse.

<dgryski@kamek[go-mpchash] \ʕ◔ϖ◔ʔ/ > benchstat choose.16 choose.ssa 
name          old time/op  new time/op  delta
Lookup8-4      248ns ± 1%   209ns ± 3%  -15.67%  (p=0.000 n=20+17)
Lookup32-4     262ns ± 2%   221ns ± 2%  -15.77%  (p=0.000 n=19+18)
Lookup128-4    410ns ± 1%   410ns ± 1%     ~     (p=0.482 n=20+19)
Lookup512-4    458ns ± 1%   502ns ± 1%   +9.58%  (p=0.000 n=19+20)
Lookup2048-4   501ns ± 1%   548ns ± 1%   +9.41%  (p=0.000 n=20+20)
Lookup8192-4   584ns ± 1%   624ns ± 2%   +6.97%  (p=0.000 n=18+20)
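
For context, the Lookup benchmarks have roughly the shape sketched here. This is a minimal, hypothetical sketch in a _test.go file (buildTable and lookup are stand-ins; the real go-mpchash constructor and (*Multi).Hash signatures differ): each benchmark builds a table with N shards and times one lookup per iteration.

package mpchash

import (
	"strconv"
	"testing"
)

// buildTable and lookup are hypothetical stand-ins for the real
// go-mpchash API, which hashes keys with siphash and probes a
// consistent-hash table.
func buildTable(shards int) []string {
	hosts := make([]string, shards)
	for i := range hosts {
		hosts[i] = "shard" + strconv.Itoa(i)
	}
	return hosts
}

func lookup(hosts []string, key uint64) string {
	return hosts[key%uint64(len(hosts))] // placeholder for the real probe loop
}

func benchmarkLookup(b *testing.B, shards int) {
	hosts := buildTable(shards)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = lookup(hosts, uint64(i))
	}
}

func BenchmarkLookup8(b *testing.B)    { benchmarkLookup(b, 8) }
func BenchmarkLookup8192(b *testing.B) { benchmarkLookup(b, 8192) }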

To determine whether this was due to runtime changes or to the SSA backend, I also ran the benchmarks against tip with SSA disabled (GOSSAHASH=x). Between 1.6 and tip without SSA, the regression is ~5%.

<dgryski@kamek[go-mpchash] \ʕ◔ϖ◔ʔ/ > benchstat choose.16 choose.tip
name          old time/op  new time/op  delta
Lookup8-4      248ns ± 1%   246ns ± 1%  -1.13%  (p=0.000 n=20+20)
Lookup32-4     262ns ± 2%   276ns ± 2%  +5.16%  (p=0.000 n=19+19)
Lookup128-4    410ns ± 1%   431ns ± 2%  +5.31%  (p=0.000 n=20+20)
Lookup512-4    458ns ± 1%   485ns ± 1%  +5.83%  (p=0.000 n=19+20)
Lookup2048-4   501ns ± 1%   540ns ± 3%  +7.77%  (p=0.000 n=20+20)
Lookup8192-4   584ns ± 1%   624ns ± 2%  +7.00%  (p=0.000 n=18+20)

Between tip and tip-with-SSA, it's faster for 8 and 32 shards, with only slight changes at larger shard counts.

<dgryski@kamek[go-mpchash] \ʕ◔ϖ◔ʔ/ > benchstat choose.tip choose.ssa
name          old time/op  new time/op  delta
Lookup8-4      246ns ± 1%   209ns ± 3%  -14.71%  (p=0.000 n=20+17)
Lookup32-4     276ns ± 2%   221ns ± 2%  -19.90%  (p=0.000 n=19+18)
Lookup128-4    431ns ± 2%   410ns ± 1%   -4.95%  (p=0.000 n=20+19)
Lookup512-4    485ns ± 1%   502ns ± 1%   +3.55%  (p=0.000 n=20+20)
Lookup2048-4   540ns ± 3%   548ns ± 1%   +1.52%  (p=0.000 n=20+20)
Lookup8192-4   624ns ± 2%   624ns ± 2%     ~     (p=0.506 n=20+20)
@ianlancetaylor ianlancetaylor changed the title Performance regression with SSA backend cmd/compile: performance regression with SSA backend Mar 2, 2016
@ianlancetaylor ianlancetaylor added this to the Go1.7 milestone Mar 2, 2016
@ianlancetaylor
Contributor

CC @randall77 @dr2chase

@randall77
Contributor

It's definitely strange that the performance is data-dependent. That makes me think some code is slower with SSA and some is faster, and the number of shards shifts the balance between the two routines being run.
Can you get a profile from the best and worst tip performers (relative to 1.6)? It might shed some light.
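
For reference, such profiles can be collected with the standard test flags and inspected with pprof (file and binary names below are illustrative):

go test -test.bench=Lookup -test.cpuprofile=cpu.16.prof
~/work/src/cvs/go.tip/bin/go test -test.bench=Lookup -test.cpuprofile=cpu.ssa.prof
go tool pprof go-mpchash.test cpu.ssa.prof
(pprof) top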

@gopherbot

CL https://golang.org/cl/20160 mentions this issue.

@dgryski
Contributor Author

dgryski commented Mar 3, 2016

A quick scan of the 4 profiles shows that, for the 8-shard lookups, the next-highest entry after the lookup function itself is the siphash core (https://github.com/dchest/siphash). For 8192 shards, the second item in the profile is runtime.mapaccess1_fast64.

@gopherbot

CL https://golang.org/cl/20172 mentions this issue.

@randall77
Contributor

That makes sense; the map lookup should be more expensive for 8192 shards. It doesn't explain the performance difference, though, as the map access accounts for only ~5% of total running time.
It looks like the extra time is in (*Multi).Hash.

gopherbot pushed a commit that referenced this issue Mar 9, 2016
* Move lowering into a separate pass.
* SliceLen/SliceCap is now available to various intermediate passes,
  which is useful for bounds checking.
* Add a second opt pass to handle the new opportunities.

Decreases the code size of binaries in pkg/tool/linux_amd64
by ~45K.

Updates #14564 #14606

Change-Id: I5b2bd6202181c50623a3585fbf15c0d6db6d4685
Reviewed-on: https://go-review.googlesource.com/20172
Run-TryBot: Alexandru Moșoi <alexandru@mosoi.ro>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
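
As a rough illustration of why exposing slice lengths to intermediate passes matters for bounds checking (a sketch under general assumptions, not code from the CL): once the compiler can reason about len(s), a single length test lets it prove that later constant indices are in range and drop their checks.

package main

import "fmt"

// sum4 shows the kind of pattern that benefits from bounds-check
// elimination: the one length test up front lets an optimizing backend
// prove s[0]..s[3] are in range and elide the per-index checks.
func sum4(s []int) int {
	if len(s) < 4 {
		return 0
	}
	return s[0] + s[1] + s[2] + s[3]
}

func main() {
	fmt.Println(sum4([]int{1, 2, 3, 4, 5}))
}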
@dgryski
Contributor Author

dgryski commented Mar 11, 2016

I will rerun these benchmarks today against the current tip.

@dgryski
Contributor Author

dgryski commented Mar 11, 2016

Still present:

<dgryski@kamek[go-mpchash] \ʕ◔ϖ◔ʔ/ >  ~/work/src/cvs/go.tip/bin/go  version
go version devel +b354f91 Fri Mar 11 06:43:01 2016 +0000 linux/amd64
<dgryski@kamek[go-mpchash] ʕ╯◔ϖ◔ʔ╯ > ~/work/src/cvs/go.tip/bin/go test -test.bench=Lookup -test.count=20 >choose.ssa
<dgryski@kamek[go-mpchash] \ʕ◔ϖ◔ʔ/ > go version
go version go1.6 linux/amd64
<dgryski@kamek[go-mpchash] \ʕ◔ϖ◔ʔ/ > go test -test.bench=Lookup -test.count=20 >choose.16
<dgryski@kamek[go-mpchash] \ʕ◔ϖ◔ʔ/ > benchstat choose.16 choose.ssa 
name          old time/op  new time/op  delta
Lookup8-4      316ns ±15%   288ns ±18%   -8.97%  (p=0.000 n=19+19)
Lookup32-4     346ns ± 7%   304ns ± 9%  -12.06%  (p=0.000 n=17+20)
Lookup128-4    525ns ±12%   528ns ±12%     ~     (p=0.616 n=20+20)
Lookup512-4    608ns ±15%   676ns ± 8%  +11.31%  (p=0.000 n=20+20)
Lookup2048-4   662ns ±11%   752ns ±12%  +13.49%  (p=0.000 n=18+20)
Lookup8192-4   740ns ±11%   867ns ±20%  +17.10%  (p=0.000 n=20+20)

@randall77
Contributor

I found one small fix that should help some. In general, the performance difference baffles me; the generated code looks pretty good, certainly as good as what the legacy compiler generates.

One problem is that this is a doubly-nested loop where the inner loop count is small. The SSA optimization for loop entry doesn't trigger on the outer loop. That's something I intend to get to; I'll open a bug to track it.

Another suspect is branch mispredictions: the old and new compilers lay out code differently, and the fact that the inner loop count is small may contribute. This is only a suspicion.
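
For concreteness, the loop shape being described is roughly the following (a hypothetical sketch, not the go-mpchash code): the hot work sits in the outer loop, while the inner loop runs only a handful of iterations, so optimizations that fire only on the innermost loop miss most of it.

package main

import "fmt"

// probeAll is a hypothetical doubly-nested loop with a small inner
// trip count: the outer loop over keys is hot, the inner probe loop
// is short, so inner-loop-only optimizations help little here.
func probeAll(keys []uint64, probes int) uint64 {
	var best uint64
	for _, k := range keys { // hot outer loop
		for p := 0; p < probes; p++ { // short inner loop
			h := k ^ (uint64(p) * 0x9e3779b97f4a7c15)
			if h > best {
				best = h
			}
		}
	}
	return best
}

func main() {
	fmt.Println(probeAll([]uint64{1, 2, 3}, 4))
}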

@gopherbot

CL https://golang.org/cl/20567 mentions this issue.

gopherbot pushed a commit that referenced this issue Mar 12, 2016
We use *24 a lot for pointer arithmetic when accessing slices
of slices ([][]T).  Rewrite to use an LEA and a shift.
The shift will likely be free, as it often gets folded into
an indexed load/store.

Update #14606

Change-Id: Ie0bf6dc1093876efd57e88ce5f62c26a9bf21cec
Reviewed-on: https://go-review.googlesource.com/20567
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Todd Neal <todd@tneal.org>
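
The *24 scaling mentioned in the CL comes from the slice header layout on amd64 (pointer, length, capacity: 24 bytes), so indexing the outer slice of a [][]T scales the index by 24 before the inner index is applied. A minimal example of the access pattern (illustrative, not taken from go-mpchash):

package main

import "fmt"

// Loading rows[i] from a [][]int64 first computes the address of the
// i-th 24-byte slice header, i.e. roughly &rows[0] + i*24 on amd64,
// and only then applies the inner index j.
func sum(rows [][]int64) int64 {
	var total int64
	for i := range rows {
		for j := range rows[i] {
			total += rows[i][j]
		}
	}
	return total
}

func main() {
	rows := [][]int64{{1, 2, 3}, {4, 5}}
	fmt.Println(sum(rows))
}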
@gopherbot

CL https://golang.org/cl/20660 mentions this issue.

gopherbot pushed a commit that referenced this issue Mar 18, 2016
This seems to help the problem reported in #14606; this
change seems to produce about a 4% improvement (mostly
for the 128-8192 shards).

Fixes #14789.

Change-Id: I1bd52c82d4ca81d9d5e9ab371fdfc860d7e8af50
Reviewed-on: https://go-review.googlesource.com/20660
Run-TryBot: David Chase <drchase@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
@randall77
Contributor

Looks like recent improvements at tip have fixed this.

@dgryski
Contributor Author

dgryski commented Apr 21, 2016

For the record,

<dgryski@kamek[go-mpchash] \ʕ◔ϖ◔ʔ/ > cat run.sh
#!/bin/sh
set -x
git show |grep ^commit
~/work/src/cvs/go.tip/bin/go version
~/work/src/cvs/go.tip/bin/go test -test.bench=Lookup -test.count=40 >choose.ssa
go version
go test -test.bench=Lookup -test.count=40 >choose.16
benchstat choose.16 choose.ssa
<dgryski@kamek[go-mpchash] \ʕ◔ϖ◔ʔ/ > ./run.sh
+ + grep ^commit
git show
commit 7a1df2af545d653c6e01b11220835896fcd5e124
+ /home/dgryski/work/src/cvs/go.tip/bin/go version
go version devel +45522a6 Thu Apr 21 06:35:48 2016 +0000 linux/amd64
+ /home/dgryski/work/src/cvs/go.tip/bin/go test -test.bench=Lookup -test.count=40
+ go version
go version go1.6.2 linux/amd64
+ go test -test.bench=Lookup -test.count=40
+ benchstat choose.16 choose.ssa
name          old time/op  new time/op  delta
Lookup8-4      224ns ± 2%   187ns ± 3%  -16.30%  (p=0.000 n=37+38)
Lookup32-4     243ns ± 3%   205ns ± 2%  -15.51%  (p=0.000 n=38+40)
Lookup128-4    418ns ± 1%   368ns ± 1%  -11.81%  (p=0.000 n=40+36)
Lookup512-4    458ns ± 1%   446ns ± 2%   -2.50%  (p=0.000 n=39+36)
Lookup2048-4   500ns ± 1%   497ns ± 2%   -0.75%  (p=0.000 n=38+39)
Lookup8192-4   580ns ± 3%   557ns ± 3%   -3.83%  (p=0.000 n=37+40)

@golang golang locked and limited conversation to collaborators Apr 21, 2017