Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: strength reduce floating point #19827

Closed
randall77 opened this issue Apr 3, 2017 · 6 comments
Closed

cmd/compile: strength reduce floating point #19827

randall77 opened this issue Apr 3, 2017 · 6 comments

Comments

@randall77
Copy link
Contributor

In cases where it doesn't affect correctness, we could strength reduce some floating-point ops.

  • x * 2.0 -> x + x
  • x / 2.0 -> x * 0.5

For the divide->multiply reduction, we could do any value whose reciprocal is representable exactly. Is that just powers of two?

We already do x1,x-1,x/1,x/-1.

See issue #19819

@randall77 randall77 added this to the Go1.9 milestone Apr 3, 2017
@randall77 randall77 self-assigned this Apr 3, 2017
@gopherbot
Copy link

CL https://golang.org/cl/39295 mentions this issue.

@robpike
Copy link
Contributor

robpike commented Apr 3, 2017

Let me be ignorant and ask whether turning floating point multiplication into addition is indeed a performance win. One requires touching an exponent; the other requires touching every bit and normalizing.

@philhofer
Copy link
Contributor

For amd64, the Agner Fog tables say that turning an fmul into an fadd might be a small improvement. On Haswell, for example, both are 1 micro-op, but fadd has a 3-cycle result latency, whereas fmul has a 5-cycle result latency. However, these instructions use different ports, so code with interleaved fmul and fadd ops that gets turned into just a series of fadd ops will not execute as quickly.

Conversely, on arm64, there may not be a win. Both FADD and FMUL have a 5-cycle result latency on the Cortex A57 (provided that denormals are flushed to zero).

Division, on the other hand, appears to be consistently more expensive (dozens of cycles of result latency... up to 64 on the A57).

@randall77
Copy link
Contributor Author

randall77 commented Apr 3, 2017

x*2 -> x+x is almost certainly a win.
+ requires futzing with an exponent, * requires a carry-save adder chain of some sort. But that's neither here nor there, in the end we just accept the cycle latency the hardware gives us. The processors I've checked have + faster than or equal to *. In addition, with this rewrite we no longer have to materialize the constant 2.

x*-2 -> -(x+x) is less clear. I could be convinced either way.

@randall77
Copy link
Contributor Author

randall77 commented Apr 3, 2017

name       old time/op  new time/op  delta
Mul2-8      626ns ± 0%   382ns ± 1%  -38.96%  (p=0.000 n=8+10)
MulNeg2-8   635ns ± 1%   759ns ± 1%  +19.57%  (p=0.000 n=9+10)

At least on amd64, *2 is a win. I'll remove the *-2 code from the CL.

@robpike
Copy link
Contributor

robpike commented Apr 3, 2017

Thanks for checking.

lparth pushed a commit to lparth/go that referenced this issue Apr 13, 2017
x*2 -> x+x
x/c, c power of 2 -> x*(1/c)

Fixes golang#19827

Change-Id: I74c9f0b5b49b2ed26c0990314c7d1d5f9631b6f1
Reviewed-on: https://go-review.googlesource.com/39295
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
@golang golang locked and limited conversation to collaborators Apr 3, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants