Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: intrinsify some math functions #17069

Closed
martisch opened this issue Sep 12, 2016 · 9 comments
Closed

cmd/compile: intrinsify some math functions #17069

martisch opened this issue Sep 12, 2016 · 9 comments
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. Performance
Milestone

Comments

@martisch
Copy link
Contributor

martisch commented Sep 12, 2016

Currently math.sqrt (OSQRT) is lowered to OpSqrt and in case of constants precomputed and otherwise a hardware opcode may be generated (SQRTSD on amd64).
#15543

Other math functions like NaN,Inf,... are "only" inlined.

I started experimenting with a CL to intrinsify NaN, Inf, ... due to the idea in #17063.

At a first quick glance it seems the compiler is able to optimize code better if we would lower the precomputeable values to constants (NaN,...) or in some cases just directly emit efficient code in the backend if the function can be replaced by 1-2 opcodes.

More simplifications and general constant folding can then be handled and infrastructure for it extended in later stages of the backend.

Possible functions that look (at first glance) easy or useful to be lowered or have hardware opcodes (on Amd64):

Abs
Ceil
Copysign
Cos
Exp
Exp2
Float32bits
Float32frombits
Float64bits
Float64frombits
Floor
IsInf
IsNaN
Log
Log10
Log2
Max
Min
Mod
Modf
NaN
Pow
Pow10
Remainder
Signbit
Sin
Sqrt (OSQRT opcode could possibly be removed from frontend when handled as intrinsic)

If this finds support i would like to assign myself to start investigate this further and start CL's that would intrinsify some of the above functions (depending on popularity, usefulness, ease of integration into the backend,...).

@josharian
Copy link
Contributor

cc @randall77

See also #13095.

@martisch
Copy link
Contributor Author

#13095 sounds like we should not lower them but instead improve the inlining and load+stores. We could still consider cases were we have hardware opcodes that can replace these functions such as Sqrt as these can not directly be simplified from go code to the opcode.

@bradfitz
Copy link
Contributor

When intrinsifying more things in the compiler, please do not base the decision on the (package name, symbol name) alone. If I copy/vendor/fork the stdlib elsewhere, the identical package "math" code elsewhere should be just as fast.

... unless the package is some "guts" package like runtime or reflect. But normal packages which are pure Go (or could be pure Go) should not be blessed by the compiler based on whether they live in std or golang.org/x/* or github.

@josharian
Copy link
Contributor

@bradfitz agreed in principle, although it is worth noting that Sqrt currently violates that principle, and I don't see any reasonable way to make that practice agree with principle. Most of the math functions are not like Sqrt, though.

@martisch
Copy link
Contributor Author

martisch commented Sep 12, 2016

@bradfitz good point. As for copying code that is then not lowered to opcodes (because of name change) it might also be the case that it does not produce equal results for some float math.

Sadly math.sqrt already seems to violate what you said. But i guess we can at least move math.Sqrt from frontend as intrinsic to the backend for cleanup since removing its special treatment would otherwise cause a performance regression.

So for now i have a look at lowering sqrt as an intrinsic and leave the rest alone :) (unless other reasons and candidates surface to handle more math functions as opcodes directly). If needed for compiler or runtime performance we could always have internal function names and lower only those.

@dr2chase
Copy link
Contributor

@bradfitz how do we get value out of that, and how does that value compare to improved performance (which has also yet to be measured)? The cases I know where people want to replace libraries wholesale do not usually also require exact performance compatibility; in my experience it's for debugging (have done it myself) and performance is not critical.

That said, it plausible that we could build some flow-graph pattern-matchers for the things we wish to substitute (have also done that myself).

@dr2chase
Copy link
Contributor

Also, sin and cosine are not candidates for this unless their inputs are range-reduced, and they're also x87 FP, not "modern" Intel FP. Very large inputs merely set a flag and return their input unmodified, so sin(2-to-the-64) returns 2-to-the-64. Intel hardware transcendentals use a 68-bit approximation of Pi (66-bits, but the first two missing bits are zero, so 68) which means that sin(machinePi) (for 53-bit mantissa double) is incorrect in its 16th bit, or at least it was back in the x87. (I may have misremembered the number of bits, but the last hex digit is C and it is more than 64)

// use number[m:n] notation to indicate value with same exponent as number, mantissa zero
// except for bits m through n-1 inclusive, where they are equal to their corresponding bits in
// number.  The first bit of the mantissa has index zero.
hwPi = realPi - realpi[68:]
swPi = realPi - realPi[53:]

// mathematical result -- goal is it get correctly rounded 53 bits of this.
sin (swPi) = sin (realPi - realPi[53:]) = sin(realPi[53:]) approx= realPi[53:]

// machine result "sin" = machine sine, sin = mathematical sine
"sin" (swPi) = "sin" (realPi - realPi[53:]) = - sin (realPi - realPi[53:] - hwPi) =
   - sin (realPi - realPi[53:] - (realPi - realpi[68:])) = -sin(realpi[68:]-realPi[53:]) =
  sin(realpi[53:]-realPi[68:]) = sin(realpi[53:68]) approx = realpi[53:68]
// i.e. 15 bits, then zeroes, where the true answer would have its 16th bit set.

I know this well, because Java's hotspot compiler once replaced correct software sine with incorrect hardware sine after the test loop had been executed "enough". Moving the problematic case to the front of the iteration ran it through the software path, and the problem would disappear.

@TocarIP
Copy link
Contributor

TocarIP commented Sep 12, 2016

For max and min math.Max and maxsd have different semantics in NAN case (at least on AMD64).
For ceil and floor math package currently uses roundsd instruction(sse4) + cpuid check, while compiler is limited to sse2, so this may actually cause slowdown.

@quentinmit quentinmit added this to the Go1.8 milestone Sep 13, 2016
@quentinmit quentinmit added the NeedsFix The path to resolution is known, but the work has not been done. label Oct 11, 2016
@rsc
Copy link
Contributor

rsc commented Oct 21, 2016

These break down into:

  1. Functions that sound trivial but are actually complicated once you include the special cases they need to get right (Abs, Ceil, ...) or the complexity of the implementation (Cos, Sin, ...).
  2. Functions that can be written in Go and inline or should inline fine (Copysign, Float32bits, IsInf, IsNaN, ...).
  3. Functions that have trivial, correct, not-possible-to-write-in-Go assembly implementations (Sqrt).

Only category 3 is worth turning into an intrinsic, and then only when it is a demonstrated hotspot in a plausible real program. We waited for years until we found a real demonstration that the call overhead for Sqrt mattered, and then we added the intrinsic.

If any of category 2 shows up as a hotspot and isn't written in the best inlinable way, better to do that. The current Inf and NaN may be examples of that, but it would help to have a demonstration that the current performance is inadequate.

I'm going to close this issue because it is so broad, but please feel free to open new specific ones for Inf, NaN, or any of the other specific cases, provided there's a plausible path to optimization and a plausible need.

@rsc rsc closed this as completed Oct 21, 2016
@golang golang locked and limited conversation to collaborators Oct 21, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. Performance
Projects
None yet
Development

No branches or pull requests

8 participants