New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
math/bits: OnesCount* if non-instrinsified should return 1 ASAP for powers of 2 if the price of branching is acceptable #27892
Comments
On most amd64 machines, OnesCount is intrinsified into a POPCNT and it generally take less than a nanosecond
We even have a test for this: https://github.com/golang/go/blob/master/test/codegen/mathbits.go#L111 I think the way you wrote the benchmark defeated the optimization (either that, or you're actually measuring your benchmarking setup). |
Obviously the optimization you suggested would still benefit architectures where we don't intrinsify OnesCount, my point is just that we likely wouldn't see any benefit on amd64. |
@ALTree in deed, I had only examined that code once ever :) and was wondering why pop count wasn't being used. Oh yes, I see your point as bits.OnesCount64 is used as an argument which for sure defeats the optimization on machines with POPCNT* |
Also note that the OP's benchmark measures mostly the happy case (10 out of 15 cases), but there are only 64 powers of two out of 2^64 numbers. That means the optimization can make the performance actually get actually worse for a random input because of the added test. |
Yes in deed @cznic, great point I think the common case of non-powers beats the esoteric case of just 64 powers of 2. I found the need for this while working on a fast non-intrinsic square rooting algorithm but since the proposal here is marginally applicable useful in the common case, I'll rescind this issue. Thank you all for the input and sorry for the noise. |
I was just studying some Mathematics and while examining some bit patterns thought that perhaps Go could use some optimizations if not already there(we don't have them in). The optimization is that we can return 1 ASAP if OnesCount* gets a power of 2. I checked through the bits.go code and didn't see such a condition.
A power of 2 is a value with only 1 set bit due to the incremental left shift of 1 0b1 and hence the reason why to detect a power of 2 we do:
x&(x-1) == 0
Anyways, running some benchmarks on my Macbook pro show massive improvements for powers of 2 even with a branch to detect if powers of 2 ~29% speed up
Source code
Throwing in a mix of values
Benchmark results
Giving
If this doesn't make sense, please feel free to close it. I just thought that it would useful to mention.
The text was updated successfully, but these errors were encountered: