-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cmd/compile: benchmark regression in github.com/cespare/xxhash between 1.9.2 and 1.10beta2 #23424
Comments
hi @cespare, i think that you are right - the slowdown in string to []byte conversion.
befor this change i have (on linux machine)
after
|
Intel i7-4510U, I get (1.9.2 vs tip):
It seems like the
now (tip) we have
I'm not very good at reading @cespare can you confirm that the slowdown was introduced by 9c99512? |
Reverting 9c99512 seems to fix this. |
This is somewhat unrelated, but moving string to [] byte conversion out of the loop results in :
Making it ~2x faster than 1.9 |
I concur, sounds like we need to revert that CL. Or at least the part of it that does the 8->16 byte combination. (Maybe we keep the smaller ones?) Do we understand why using sse registers is so much slower? Does this optimization break store forwarding or something? |
If this were ARM/NEON, I'd say it was a memory hazard. I couldn't say about amd64, though. |
Looks like problem with forwarding 64-bit store to 128-bit load, there is ~1 ld_blocks.store_forward event per iteration. Honestly I'm surprised by performance impact. We did sse loads of 64-bit values in memmove, duffcopy, opcopy etc. for a while without problems. |
The unaligned MOVUPS 0x18(SP), X0 is clearly the issue here. While MOVUPS allows unaligned reads and writes, they are not as fast as aligned read and writes. |
Doesn't look like it. Rsp contains 0xc420034f08, so 0x18(SP) = 0xC420034F20 is a 32-byte aligned address. |
@minaevmike @TocarIP -- somewhat offtopic, but FYI this benchmark was added specifically to check that Sum64 doesn't cause the argument to escape (see cespare/xxhash#2) so doing the string->slice inside the loop is intentional. |
Thanks all for investigating. I can confirm that Go 1.10beta2 + reverting 9c99512 fixes my regression. It also fixes another, similar regression in the same benchmark suite (which I suspected was the same issue.) |
I was also going to file a bug for slowdowns of a bunch of benchmarks in github.com/spaolacci/murmur3, but happily the same revert also fixes those. If you're interested in reproducing those:
Comparing 1.9.2 to 1.10beta2:
Comparing 1.9.2 to [1.10beta2 + revert of 9c99512]:
|
@TocarIP You are right, the load is aligned. But in that case the store is unaligned. So what can be seen in the profile may be an effect of out-of-order execution. In any case I can also confirm that the reversal of 9c99612 fixes the issue with cespare's code. |
Change https://golang.org/cl/88135 mentions this issue: |
I sent a CL which removes the 64-bit combining but leaves 8/16/32-bit in place. That fixes all of my benchmark regressions. |
I have a small benchmark that shows a significant regression between go1.9.2 and go1.10beta2.
You can check out the code at https://github.com/cespare/xxhash. This is done with the latest commit as of writing (e4e2bd419cbedb5b2ec48e4a88e4c993ddb74555).
I'm on linux/amd64. The benchmark in question is BenchmarkStringHash. The code is simply:
Between 1.9.2 and 1.10beta2 I see a 2x slowdown:
Note that the implementation of Sum64 is in assembly. If I use the
noasm
tag to select a pure-Go implementation, I still see a large slowdown:Because of the slowdown in the assembly case where the code for Sum64 should be identical, I infer that the difference is inside the BenchmarkStringHash function itself (related to the string->[]byte conversion or the function call). Perhaps that's an incorrect assumption, though.
I've poked around a profile a bit but I haven't spotted a smoking gun. Perhaps someone else can make more sense of it.
Here's some pprof disasm output for 1.9.2 (fast):
and here's the same output for 1.10beta2 (slow):
The text was updated successfully, but these errors were encountered: