
cmd/compile: performance regression with SSA backend #14606

Closed

dgryski opened this issue Mar 2, 2016 · 13 comments

@dgryski
Contributor

dgryski commented Mar 2, 2016

Please answer these questions before submitting your issue. Thanks!

  1. What version of Go are you using (go version)?
    go version devel +0d1a98e Wed Mar 2 13:01:44 2016 +0000 linux/amd64
    and
    go version go1.6 linux/amd64
  2. What operating system and processor architecture are you using (go env)?
    linux/amd64
  3. What did you do?
    I ran benchmarks for a consistent hashing algorithm (multi-probe consistent hashing; code at https://github.com/dgryski/go-mpchash) with the new SSA backend and with Go 1.6. (A rough sketch of the benchmark shape appears after the results below.)
 ~/work/src/cvs/go.tip/bin/go test -test.bench=Lookup -test.count=20 >choose.ssa
go test -test.bench=Lookup -test.count=20 >choose.16
  4. What did you expect to see?

No performance regressions.

  5. What did you see instead?

The number after the benchmark is the number of shards. For 8 and 32 shards, the new SSA backend is faster. For larger numbers of shards, the lookup time is worse.

<dgryski@kamek[go-mpchash] \ʕ◔ϖ◔ʔ/ > benchstat choose.16 choose.ssa 
name          old time/op  new time/op  delta
Lookup8-4      248ns ± 1%   209ns ± 3%  -15.67%  (p=0.000 n=20+17)
Lookup32-4     262ns ± 2%   221ns ± 2%  -15.77%  (p=0.000 n=19+18)
Lookup128-4    410ns ± 1%   410ns ± 1%     ~     (p=0.482 n=20+19)
Lookup512-4    458ns ± 1%   502ns ± 1%   +9.58%  (p=0.000 n=19+20)
Lookup2048-4   501ns ± 1%   548ns ± 1%   +9.41%  (p=0.000 n=20+20)
Lookup8192-4   584ns ± 1%   624ns ± 2%   +6.97%  (p=0.000 n=18+20)
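
For context, the Lookup benchmarks have roughly the shape sketched here. This is a minimal, hypothetical sketch in a _test.go file (buildTable and lookup are stand-ins; the real go-mpchash constructor and (*Multi).Hash signatures differ): each benchmark builds a table with N shards and times one lookup per iteration.

package mpchash

import (
	"strconv"
	"testing"
)

// buildTable and lookup are hypothetical stand-ins for the real
// go-mpchash API, which hashes keys with siphash and probes a
// consistent-hash table.
func buildTable(shards int) []string {
	hosts := make([]string, shards)
	for i := range hosts {
		hosts[i] = "shard" + strconv.Itoa(i)
	}
	return hosts
}

func lookup(hosts []string, key uint64) string {
	return hosts[key%uint64(len(hosts))] // placeholder for the real probe loop
}

func benchmarkLookup(b *testing.B, shards int) {
	hosts := buildTable(shards)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = lookup(hosts, uint64(i))
	}
}

func BenchmarkLookup8(b *testing.B)    { benchmarkLookup(b, 8) }
func BenchmarkLookup8192(b *testing.B) { benchmarkLookup(b, 8192) }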

To determine whether this was due to runtime changes or to the SSA backend, I also ran the benchmarks against tip with SSA disabled (GOSSAHASH=x). Between 1.6 and tip without SSA, the regression is ~5%.

<dgryski@kamek[go-mpchash] \ʕ◔ϖ◔ʔ/ > benchstat choose.16 choose.tip
name          old time/op  new time/op  delta
Lookup8-4      248ns ± 1%   246ns ± 1%  -1.13%  (p=0.000 n=20+20)
Lookup32-4     262ns ± 2%   276ns ± 2%  +5.16%  (p=0.000 n=19+19)
Lookup128-4    410ns ± 1%   431ns ± 2%  +5.31%  (p=0.000 n=20+20)
Lookup512-4    458ns ± 1%   485ns ± 1%  +5.83%  (p=0.000 n=19+20)
Lookup2048-4   501ns ± 1%   540ns ± 3%  +7.77%  (p=0.000 n=20+20)
Lookup8192-4   584ns ± 1%   624ns ± 2%  +7.00%  (p=0.000 n=18+20)

Between tip and tip-with-SSA, it's faster for 8 and 32 shards, with only slight changes at larger shard counts.

<dgryski@kamek[go-mpchash] \ʕ◔ϖ◔ʔ/ > benchstat choose.tip choose.ssa
name          old time/op  new time/op  delta
Lookup8-4      246ns ± 1%   209ns ± 3%  -14.71%  (p=0.000 n=20+17)
Lookup32-4     276ns ± 2%   221ns ± 2%  -19.90%  (p=0.000 n=19+18)
Lookup128-4    431ns ± 2%   410ns ± 1%   -4.95%  (p=0.000 n=20+19)
Lookup512-4    485ns ± 1%   502ns ± 1%   +3.55%  (p=0.000 n=20+20)
Lookup2048-4   540ns ± 3%   548ns ± 1%   +1.52%  (p=0.000 n=20+20)
Lookup8192-4   624ns ± 2%   624ns ± 2%     ~     (p=0.506 n=20+20)
@ianlancetaylor ianlancetaylor changed the title Performance regression with SSA backend cmd/compile: performance regression with SSA backend Mar 2, 2016
@ianlancetaylor ianlancetaylor added this to the Go1.7 milestone Mar 2, 2016
@ianlancetaylor
Contributor

CC @randall77 @dr2chase

@randall77
Contributor

It's definitely strange that the performance is data-dependent. That makes me think some code is slower with SSA and some is faster, and the number of shards shifts the balance between the two routines being run.
Can you get a profile from the best and worst tip performers (relative to 1.6)? It might shed some light.
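
For reference, such profiles can be collected with the standard test flags and inspected with pprof (file and binary names below are illustrative):

go test -test.bench=Lookup -test.cpuprofile=cpu.16.prof
~/work/src/cvs/go.tip/bin/go test -test.bench=Lookup -test.cpuprofile=cpu.ssa.prof
go tool pprof go-mpchash.test cpu.ssa.prof
(pprof) top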

@gopherbot

CL https://golang.org/cl/20160 mentions this issue.

@dgryski
Contributor Author

dgryski commented Mar 3, 2016

A quick scan of the 4 profiles shows that, for the 8-shard lookups, the next-highest entry after the lookup function itself is the siphash core (https://github.com/dchest/siphash). For 8192 shards, the second item in the profile is runtime.mapaccess1_fast64.

@gopherbot

CL https://golang.org/cl/20172 mentions this issue.

@randall77
Contributor

That makes sense; the map lookup should be more expensive for 8192 shards. It doesn't explain the performance difference, though, as the map access accounts for only ~5% of total running time.
It looks like the extra time is in (*Multi).Hash.

gopherbot pushed a commit that referenced this issue Mar 9, 2016
* Move lowering into a separate pass.
* SliceLen/SliceCap is now available to various intermediate passes,
  which is useful for bounds checking.
* Add a second opt pass to handle the new opportunities.

Decreases the code size of binaries in pkg/tool/linux_amd64
by ~45K.

Updates #14564 #14606

Change-Id: I5b2bd6202181c50623a3585fbf15c0d6db6d4685
Reviewed-on: https://go-review.googlesource.com/20172
Run-TryBot: Alexandru Moșoi <alexandru@mosoi.ro>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
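
As a rough illustration of why exposing slice lengths to intermediate passes matters for bounds checking (a sketch under general assumptions, not code from the CL): once the compiler can reason about len(s), a single length test lets it prove that later constant indices are in range and drop their checks.

package main

import "fmt"

// sum4 shows the kind of pattern that benefits from bounds-check
// elimination: the one length test up front lets an optimizing backend
// prove s[0]..s[3] are in range and elide the per-index checks.
func sum4(s []int) int {
	if len(s) < 4 {
		return 0
	}
	return s[0] + s[1] + s[2] + s[3]
}

func main() {
	fmt.Println(sum4([]int{1, 2, 3, 4, 5}))
}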
@dgryski
Contributor Author

dgryski commented Mar 11, 2016

I will rerun these benchmarks today against the current tip.

@dgryski
Contributor Author

dgryski commented Mar 11, 2016

Still present:

<dgryski@kamek[go-mpchash] \ʕ◔ϖ◔ʔ/ >  ~/work/src/cvs/go.tip/bin/go  version
go version devel +b354f91 Fri Mar 11 06:43:01 2016 +0000 linux/amd64
<dgryski@kamek[go-mpchash] ʕ╯◔ϖ◔ʔ╯ > ~/work/src/cvs/go.tip/bin/go test -test.bench=Lookup -test.count=20 >choose.ssa
<dgryski@kamek[go-mpchash] \ʕ◔ϖ◔ʔ/ > go version
go version go1.6 linux/amd64
<dgryski@kamek[go-mpchash] \ʕ◔ϖ◔ʔ/ > go test -test.bench=Lookup -test.count=20 >choose.16
<dgryski@kamek[go-mpchash] \ʕ◔ϖ◔ʔ/ > benchstat choose.16 choose.ssa 
name          old time/op  new time/op  delta
Lookup8-4      316ns ±15%   288ns ±18%   -8.97%  (p=0.000 n=19+19)
Lookup32-4     346ns ± 7%   304ns ± 9%  -12.06%  (p=0.000 n=17+20)
Lookup128-4    525ns ±12%   528ns ±12%     ~     (p=0.616 n=20+20)
Lookup512-4    608ns ±15%   676ns ± 8%  +11.31%  (p=0.000 n=20+20)
Lookup2048-4   662ns ±11%   752ns ±12%  +13.49%  (p=0.000 n=18+20)
Lookup8192-4   740ns ±11%   867ns ±20%  +17.10%  (p=0.000 n=20+20)

@randall77
Contributor

I found one small fix that should help some. In general, the performance difference baffles me; the generated code looks pretty good, certainly as good as what the legacy compiler generates.

One problem is that this is a doubly-nested loop where the inner loop count is small. The SSA optimization for loop entry doesn't trigger on the outer loop. That's something I intend to get to; I'll open a bug to track it.

Another suspect is branch mispredictions: the old and new compilers lay out code differently, and the fact that the inner loop count is small may contribute. This is only a suspicion.
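
For concreteness, the loop shape being described is roughly the following (a hypothetical sketch, not the go-mpchash code): the hot work sits in the outer loop, while the inner loop runs only a handful of iterations, so optimizations that fire only on the innermost loop miss most of it.

package main

import "fmt"

// probeAll is a hypothetical doubly-nested loop with a small inner
// trip count: the outer loop over keys is hot, the inner probe loop
// is short, so inner-loop-only optimizations help little here.
func probeAll(keys []uint64, probes int) uint64 {
	var best uint64
	for _, k := range keys { // hot outer loop
		for p := 0; p < probes; p++ { // short inner loop
			h := k ^ (uint64(p) * 0x9e3779b97f4a7c15)
			if h > best {
				best = h
			}
		}
	}
	return best
}

func main() {
	fmt.Println(probeAll([]uint64{1, 2, 3}, 4))
}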

@gopherbot

CL https://golang.org/cl/20567 mentions this issue.

gopherbot pushed a commit that referenced this issue Mar 12, 2016
We use *24 a lot for pointer arithmetic when accessing slices
of slices ([][]T).  Rewrite to use an LEA and a shift.
The shift will likely be free, as it often gets folded into
an indexed load/store.

Update #14606

Change-Id: Ie0bf6dc1093876efd57e88ce5f62c26a9bf21cec
Reviewed-on: https://go-review.googlesource.com/20567
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Todd Neal <todd@tneal.org>
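
The *24 scaling mentioned in the CL comes from the slice header layout on amd64 (pointer, length, capacity: 24 bytes), so indexing the outer slice of a [][]T scales the index by 24 before the inner index is applied. A minimal example of the access pattern (illustrative, not taken from go-mpchash):

package main

import "fmt"

// Loading rows[i] from a [][]int64 first computes the address of the
// i-th 24-byte slice header, i.e. roughly &rows[0] + i*24 on amd64,
// and only then applies the inner index j.
func sum(rows [][]int64) int64 {
	var total int64
	for i := range rows {
		for j := range rows[i] {
			total += rows[i][j]
		}
	}
	return total
}

func main() {
	rows := [][]int64{{1, 2, 3}, {4, 5}}
	fmt.Println(sum(rows))
}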
@gopherbot

CL https://golang.org/cl/20660 mentions this issue.

gopherbot pushed a commit that referenced this issue Mar 18, 2016
This seems to help the problem reported in #14606; this
change seems to produce about a 4% improvement (mostly
for the 128-8192 shards).

Fixes #14789.

Change-Id: I1bd52c82d4ca81d9d5e9ab371fdfc860d7e8af50
Reviewed-on: https://go-review.googlesource.com/20660
Run-TryBot: David Chase <drchase@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
@randall77
Contributor

Looks like recent improvements at tip have fixed this.

@dgryski
Contributor Author

dgryski commented Apr 21, 2016

For the record,

<dgryski@kamek[go-mpchash] \ʕ◔ϖ◔ʔ/ > cat run.sh
#!/bin/sh
set -x
git show |grep ^commit
~/work/src/cvs/go.tip/bin/go version
~/work/src/cvs/go.tip/bin/go test -test.bench=Lookup -test.count=40 >choose.ssa
go version
go test -test.bench=Lookup -test.count=40 >choose.16
benchstat choose.16 choose.ssa
<dgryski@kamek[go-mpchash] \ʕ◔ϖ◔ʔ/ > ./run.sh
+ + grep ^commit
git show
commit 7a1df2af545d653c6e01b11220835896fcd5e124
+ /home/dgryski/work/src/cvs/go.tip/bin/go version
go version devel +45522a6 Thu Apr 21 06:35:48 2016 +0000 linux/amd64
+ /home/dgryski/work/src/cvs/go.tip/bin/go test -test.bench=Lookup -test.count=40
+ go version
go version go1.6.2 linux/amd64
+ go test -test.bench=Lookup -test.count=40
+ benchstat choose.16 choose.ssa
name          old time/op  new time/op  delta
Lookup8-4      224ns ± 2%   187ns ± 3%  -16.30%  (p=0.000 n=37+38)
Lookup32-4     243ns ± 3%   205ns ± 2%  -15.51%  (p=0.000 n=38+40)
Lookup128-4    418ns ± 1%   368ns ± 1%  -11.81%  (p=0.000 n=40+36)
Lookup512-4    458ns ± 1%   446ns ± 2%   -2.50%  (p=0.000 n=39+36)
Lookup2048-4   500ns ± 1%   497ns ± 2%   -0.75%  (p=0.000 n=38+39)
Lookup8192-4   580ns ± 3%   557ns ± 3%   -3.83%  (p=0.000 n=37+40)

@golang golang locked and limited conversation to collaborators Apr 21, 2017