Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: intrinsify more of math/bits on amd64 #28273

Closed
bradfitz opened this issue Oct 18, 2018 · 6 comments
Closed

cmd/compile: intrinsify more of math/bits on amd64 #28273

bradfitz opened this issue Oct 18, 2018 · 6 comments
Labels
FrozenDueToAge help wanted NeedsFix The path to resolution is known, but the work has not been done. Performance
Milestone

Comments

@bradfitz
Copy link
Contributor

Now that more API has been added to math/bits (#24813), this is a tracking bug to intrinsify more of it in the compiler for amd64.

As @randall77 said in #24813 (comment) ...

Only Mul has been intrinsified so far.
I think we want Add/Sub intrinsified also (or better, recognize the Go code and generate the add-with-carry instruction).

@bradfitz bradfitz added Performance help wanted NeedsFix The path to resolution is known, but the work has not been done. labels Oct 18, 2018
@bradfitz bradfitz added this to the Unplanned milestone Oct 18, 2018
@bmkessler
Copy link
Contributor

I originally had added the Div64 intrinsic for amd64, but there were concerns during review about the behavior when the quotient won't fit into 64-bits. In particular, the pure go implementation ported from math/big returns a value of 1<<64 - 1, 1<<64 - 1 in this case.

if hi >= y {
	return 1<<64 - 1, 1<<64 - 1
}

whereas the DIVQ instruction will cause a divide by zero exception. Additionally, the 32-bit pure go version just truncates quo to 32-bits, which is not consistent either.

quo, rem = uint32(z/uint64(y)), uint32(z%uint64(y))

Currently, the documentation from the original proposal states:

hi must be < y otherwise the behavior is undefined (the quotient won't fit into quo).

The concern was that "undefined" shouldn't encompass both returning a value in one implementation and panic in another.

I propose that the behavior when hi >= y should be well-defined so that an intrinsic can be implemented consistently across platforms; either to panic or return a sensible value. I recommend changing the definition to panic when hi >= y since returning a value when the actual result does not fit in the output variable could be the source of subtle bugs. A panic will force users to handle hi >= y correctly if that is a possibility in their code. It also aligns with the behavior of the DIVQ instruction on amd64.

@griesemer @randall77

@randall77
Copy link
Contributor

I'm ok with defining it to panic.
Divides are expensive enough that a compare and branch before each isn't a big overhead.

We should fix this for 1.12 before the API is set in stone.

@randall77
Copy link
Contributor

Actually, probably deserves its own issue.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/144257 mentions this issue: cmd/compile: intrinsify math/bits.Add on amd64

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/144258 mentions this issue: cmd/compile: intrinsify math/bits.Sub on amd64

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/144378 mentions this issue: cmd/compile: intirnsify math/bits.Div on amd64

gopherbot pushed a commit that referenced this issue Oct 25, 2018
name             old time/op  new time/op  delta
Add-8            1.11ns ± 0%  1.18ns ± 0%   +6.31%  (p=0.029 n=4+4)
Add32-8          1.02ns ± 0%  1.02ns ± 1%     ~     (p=0.333 n=4+5)
Add64-8          1.11ns ± 1%  1.17ns ± 0%   +5.79%  (p=0.008 n=5+5)
Add64multiple-8  4.35ns ± 1%  0.86ns ± 0%  -80.22%  (p=0.000 n=5+4)

The individual ops are a bit slower (but still very fast).
Using the ops in carry chains is very fast.

Update #28273

Change-Id: Id975f76df2b930abf0e412911d327b6c5b1befe5
Reviewed-on: https://go-review.googlesource.com/c/144257
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
gopherbot pushed a commit that referenced this issue Nov 27, 2018
Note that the intrinsic implementation panics separately for overflow and
divide by zero, which matches the behavior of the pure go implementation.
There is a modest performance improvement after intrinsic implementation.

name     old time/op  new time/op  delta
Div-4    53.0ns ± 1%  47.0ns ± 0%  -11.28%  (p=0.008 n=5+5)
Div32-4  18.4ns ± 0%  18.5ns ± 1%     ~     (p=0.444 n=5+5)
Div64-4  53.3ns ± 0%  47.5ns ± 4%  -10.77%  (p=0.008 n=5+5)

Updates #28273

Change-Id: Ic1688ecc0964acace2e91bf44ef16f5fb6b6bc82
Reviewed-on: https://go-review.googlesource.com/c/144378
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
@golang golang locked and limited conversation to collaborators Oct 25, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge help wanted NeedsFix The path to resolution is known, but the work has not been done. Performance
Projects
None yet
Development

No branches or pull requests

4 participants