cmd/compile: multiplication strength-reduction rules hurting performance #21434
Comments
I don't think your test is actually measuring the multiply latency. It is only measuring multiply throughput.
So when measuring throughput (F1 vs F2), the rewrite hurts, probably because we're substituting three instructions for one. (Side note: why does this hurt? I would expect a reasonable fetch/retire engine to keep up with this loop.) When measuring latency (F3 vs F4), however, the rewrite helps: we're now doing the multiply in two latency-1 instructions instead of one latency-3 instruction. I think latency is more important than throughput, so we should keep the rewrite. I'm happy to hear arguments otherwise, though. My processor is an Intel(R) Xeon(R) CPU E5-1650 0 @ 3.20GHz; YMMV.
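To make the throughput/latency distinction concrete, here is a minimal Go sketch. It is not the F1–F4 code from this issue (which is not reproduced here); the function names and the constant 7 are hypothetical, chosen only for illustration. In the first function the multiplications are independent of each other, so the CPU can overlap them and the loop is bound by throughput. In the second, each multiplication consumes the previous result, so the loop is bound by the latency of the multiply (or of whatever sequence the compiler emits in its place).

```go
package main

import "fmt"

// independentMuls multiplies every element by a constant. No product
// depends on another product, so multiplies can overlap in the pipeline.
// Hypothetical example, not the benchmark from the issue.
func independentMuls(xs []uint64) uint64 {
	var sum uint64
	for _, x := range xs {
		sum += x * 7
	}
	return sum
}

// dependentMuls feeds each product into the next multiplication, forming
// a serial dependency chain: each step waits for the previous result.
// Hypothetical example, not the benchmark from the issue.
func dependentMuls(seed uint64, n int) uint64 {
	x := seed
	for i := 0; i < n; i++ {
		x = x*7 + 1
	}
	return x
}

func main() {
	fmt.Println(independentMuls([]uint64{1, 2, 3}))
	fmt.Println(dependentMuls(1, 3))
}
```

Timing loops shaped like these (for instance with the standard testing package's benchmarks) would be one way to observe the two effects separately; actual numbers depend on the CPU and compiler version.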
Ah, the reduction improves latency. Thanks for the explanation.
Well, the other compilers I looked at (GCC, Clang, Intel) also seem to optimize for latency, so I guess that's a point in favour of keeping those rewrites.
Ok, I'll close this issue then. Feel free to reopen if you want to continue discussing.
I noticed that certain strength-reduction rules for multiplication seem to harm performance on my machine.
Take the following rule in AMD64.rules (which reduces c * n when c is a constant one less than a power of 2):
Two silly benchmarks:
The generated code for F1 with and without the strength-reduction rule:
A single IMULQ instruction is replaced by a MOVQ, a SHLQ, and a SUBQ; the reduced code also uses one more register. The reduction seems to harm performance. Benchmark results with tip vs. tip+rule-disabled:
The rule makes no difference on the microbenchmark, and strongly hurts performance on the more realistic second benchmark.
This is on a Haswell machine, where the rule of thumb currently used for this kind of reduction (addq, shlq, leaq, negq all cost 1, imulq costs 3) should be valid. And yet, the reduced code is either as fast as or significantly slower than the code using imul.
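For reference, the rewrite under discussion rests on the identity n * (2^k - 1) = (n << k) - n. The following is a minimal Go sketch of that identity for k = 3. The function names are hypothetical, and this is source-level code, not the compiler rule itself; the Go compiler may well compile both functions to the same instructions.

```go
package main

import "fmt"

// mulDirect multiplies by the constant 7 (that is, 2^3 - 1) with an
// ordinary multiplication.
func mulDirect(n int64) int64 {
	return n * 7
}

// mulReduced computes the same product as a shift followed by a
// subtraction: (n << 3) - n. In Go, << binds tighter than -.
func mulReduced(n int64) int64 {
	return n<<3 - n
}

func main() {
	// Compare the two forms on a few sample inputs.
	for _, n := range []int64{0, 1, 5, -3, 1 << 40} {
		fmt.Println(n, mulDirect(n), mulReduced(n), mulDirect(n) == mulReduced(n))
	}
}
```

Because Go integer arithmetic wraps on overflow, the two forms agree for every int64 input, which is what allows a compiler to substitute one for the other.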