cmd/compile: performance regression with SSA backend #14606
Comments
Definitely strange that the performance is data-dependent. Makes me think there's some code that is slower with SSA and some code that is faster with SSA, and the number of shards affects the balance of the two routines that are run.
CL https://golang.org/cl/20160 mentions this issue.
A quick scan of the 4 profiles shows that for the lookups with 8 keys, after the lookup function itself, the next-highest entry is the siphash core (https://github.com/dchest/siphash). For 8192 shards, the second item in the profile is runtime.mapaccess1_fast64.
CL https://golang.org/cl/20172 mentions this issue.
That makes sense; the map lookup should be more expensive for 8192 shards. It doesn't explain the performance difference, though, as the map access only accounts for ~5% of total running time.
* Move lowering into a separate pass.
* SliceLen/SliceCap is now available to various intermediate passes, which is useful for bounds checking.
* Add a second opt pass to handle the new opportunities.
Decreases the code size of binaries in pkg/tool/linux_amd64 by ~45K.
Updates #14564 #14606
Change-Id: I5b2bd6202181c50623a3585fbf15c0d6db6d4685
Reviewed-on: https://go-review.googlesource.com/20172
Run-TryBot: Alexandru Moșoi <alexandru@mosoi.ro>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
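For context, here is a minimal, hypothetical Go sketch of the kind of code that benefits when slice-length information stays visible to later passes: a pass that tracks len(s) can prove the index is in range and drop the per-iteration bounds check (whether a particular compiler version actually eliminates it here is an assumption, not a claim from the CL).

```go
package main

import "fmt"

// sumFirst sums the first n elements of s. The explicit comparison against
// len(s) makes the relationship between n and the slice length visible, so a
// pass that knows the slice length can prove s[i] is always in range and
// remove the bounds check inside the loop.
func sumFirst(s []int, n int) int {
	if n > len(s) {
		n = len(s)
	}
	total := 0
	for i := 0; i < n; i++ {
		total += s[i] // candidate for bounds-check elimination
	}
	return total
}

func main() {
	fmt.Println(sumFirst([]int{1, 2, 3, 4, 5}, 3)) // prints 6
}
```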
I will rerun these benchmarks today against the current tip.
Still present:
I found one small fix that should help some. Generally the performance difference baffles me. The generated code looks pretty good, certainly as good as what the legacy compiler generates. One problem is that this is a doubly-nested loop where the inner loop count is small; the SSA optimization for loop entry doesn't trigger on the outer loop. That's something I intend to get to, and I'll open a bug for tracking. Another suspect is branch mispredicts: the old and new compilers lay out code differently. The fact that the inner loop count is small may contribute. This is only a suspicion.
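To illustrate the shape being described (this is a hypothetical stand-in, not the actual go-mpchash code), here is a minimal sketch of a doubly-nested loop whose inner trip count is small, so overhead and branch behavior around the outer loop dominate:

```go
package main

import "fmt"

// loopShape is a hypothetical stand-in for the structure described above: a
// hot outer loop whose inner loop has a small, fixed trip count. Loop-entry
// optimizations that apply only to the inner loop leave the outer loop's
// per-iteration overhead untouched.
func loopShape(keys []uint64, probes int) uint64 {
	var best uint64
	for _, k := range keys { // outer loop: many iterations
		for p := 0; p < probes; p++ { // inner loop: small count
			h := k ^ (uint64(p) * 0x9e3779b97f4a7c15) // placeholder hash mix, not siphash
			if h > best {
				best = h
			}
		}
	}
	return best
}

func main() {
	fmt.Println(loopShape([]uint64{1, 2, 3}, 6))
}
```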
CL https://golang.org/cl/20567 mentions this issue.
We use *24 a lot for pointer arithmetic when accessing slices of slices ([][]T). Rewrite to use an LEA and a shift. The shift will likely be free, as it often gets folded into an indexed load/store.
Update #14606
Change-Id: Ie0bf6dc1093876efd57e88ce5f62c26a9bf21cec
Reviewed-on: https://go-review.googlesource.com/20567
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Todd Neal <todd@tneal.org>
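As a reference point, the *24 comes from indexing the outer slice of a [][]T: on amd64 a slice header is 24 bytes (pointer, length, capacity), so the address of m[i] needs i*24. Since 24 = 3*8, the multiply can be done as an LEA computing i*3 (i + i*2) followed by a shift left by 3 or an indexed addressing mode, instead of a general multiply. A small sketch of Go code that produces this addressing pattern (the exact assembly emitted is compiler- and version-dependent):

```go
package main

import "fmt"

// rowLen indexes the outer slice of a [][]byte. Each element of m is a slice
// header, 24 bytes on amd64, so the address calculation for m[i] scales i by
// 24. Because 24 = 3*8, the scaling can be lowered to LEA plus a shift.
func rowLen(m [][]byte, i int) int {
	return len(m[i])
}

func main() {
	m := [][]byte{{1}, {2, 3}, {4, 5, 6}}
	fmt.Println(rowLen(m, 2)) // prints 3
}
```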
CL https://golang.org/cl/20660 mentions this issue.
This seems to help the problem reported in #14606; this change seems to produce about a 4% improvement (mostly for the 128-8192 shards).
Fixes #14789.
Change-Id: I1bd52c82d4ca81d9d5e9ab371fdfc860d7e8af50
Reviewed-on: https://go-review.googlesource.com/20660
Run-TryBot: David Chase <drchase@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Looks like recent improvements at tip have fixed this.
For the record,
Please answer these questions before submitting your issue. Thanks!
What version of Go are you using (go version)?
go version devel +0d1a98e Wed Mar 2 13:01:44 2016 +0000 linux/amd64
and
go version go1.6 linux/amd64
What operating system and processor architecture are you using (go env)?
linux/amd64
I ran some benchmarks for a consistent hashing algorithm (multi-probe consistent hashing, code at https://github.com/dgryski/go-mpchash) with the new SSA backend and also with 1.6.
Note the performance regressions.
The number after the benchmark is the number of shards. For 8 and 32 shards, the new SSA backend is faster. For larger numbers of shards, the lookup time is worse.
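For readers who want to reproduce the shape of the measurement, a rough sketch of how per-shard-count benchmarks are typically structured follows; the lookup function below is a placeholder, not the actual go-mpchash API.

```go
package mpchash_test

import (
	"strconv"
	"testing"
)

// lookup is a hypothetical stand-in for the multi-probe consistent-hashing
// lookup in github.com/dgryski/go-mpchash; the real implementation hashes the
// key several times and picks the owning shard.
func lookup(key string, shards []string) string {
	return shards[len(key)%len(shards)] // placeholder only
}

// benchmarkLookup measures lookups against n shards; the shard count becomes
// the numeric suffix in the benchmark name, matching the naming described above.
func benchmarkLookup(b *testing.B, n int) {
	shards := make([]string, n)
	for i := range shards {
		shards[i] = "shard-" + strconv.Itoa(i)
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = lookup("user:"+strconv.Itoa(i), shards)
	}
}

func Benchmark8(b *testing.B)    { benchmarkLookup(b, 8) }
func Benchmark32(b *testing.B)   { benchmarkLookup(b, 32) }
func Benchmark8192(b *testing.B) { benchmarkLookup(b, 8192) }
```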
To determine whether this was due to runtime changes or the SSA backend, I also ran it against tip with SSA off (GOSSAHASH=x). Between 1.6 and tip (without SSA), the performance regression is ~5%. Between tip and tip-with-SSA, it's faster for 8 and 32 shards, but there are only slight changes at larger numbers.