Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

math/bits: OnesCount* if non-instrinsified should return 1 ASAP for powers of 2 if the price of branching is acceptable #27892

Closed
odeke-em opened this issue Sep 27, 2018 · 6 comments

Comments

@odeke-em
Copy link
Member

I was just studying some Mathematics and while examining some bit patterns thought that perhaps Go could use some optimizations if not already there(we don't have them in). The optimization is that we can return 1 ASAP if OnesCount* gets a power of 2. I checked through the bits.go code and didn't see such a condition.
A power of 2 is a value with only 1 set bit due to the incremental left shift of 1 0b1 and hence the reason why to detect a power of 2 we do: x&(x-1) == 0

Anyways, running some benchmarks on my Macbook pro show massive improvements for powers of 2 even with a branch to detect if powers of 2 ~29% speed up

Source code

Throwing in a mix of values

package main_test

import (
	"math/bits"
	"testing"
)

var tests = []struct {
	v    uint64
	want int
}{
	{1, 1}, {2, 1}, {3, 2}, {4, 1}, {5, 2}, {9, 2}, {8, 1}, {16, 1},
         {32, 1}, {64, 1}, {127, 7}, {256, 1}, {65536, 1},
         {4294967296, 1}, {4294967295, 32},
}

var sink int

func BenchmarkCustomOnesCount(b *testing.B) {
	runBenchmarks(b, customOnesCountForPowerOf2)
}

func BenchmarkStdlibOnesCount(b *testing.B) {
	runBenchmarks(b, bits.OnesCount64)
}

func runBenchmarks(b *testing.B, fn func(uint64) int) {
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		for _, tt := range tests {
			sink = fn(tt.v)
			if sink != tt.want {
				b.Errorf("%d: got %d want %d", tt.v, sink, tt.want)
			}
		}
	}
	if sink == 0 {
		b.Fatal("Did not run the benchmark!")
	}
}

func customOnesCountForPowerOf2(x uint64) int {
	// Simulate absolute branch since powerOf2 check.
	if x > 0 && x&(x-1) == 0 {
		return 1
	}
	const m = 1<<64 - 1
	x = x>>1&(m0&m) + x&(m0&m)
	x = x>>2&(m1&m) + x&(m1&m)
	x = (x>>4 + x) & (m2 & m)
	x += x >> 8
	x += x >> 16
	x += x >> 32
	return int(x) & (1<<7 - 1)
}

// Inlining these constants code here to show what
// the code for OnesCount64 will look like after the alteration.
const m0 = 0x5555555555555555 // 01010101 ...
const m1 = 0x3333333333333333 // 00110011 ...
const m2 = 0x0f0f0f0f0f0f0f0f // 00001111 ...
const m3 = 0x00ff00ff00ff00ff // etc.
const m4 = 0x0000ffff0000ffff

Benchmark results

goos: darwin
goarch: amd64
BenchmarkCustomOnesCount-8   	30000000	        42.1 ns/op
BenchmarkCustomOnesCount-8   	30000000	        41.3 ns/op
BenchmarkCustomOnesCount-8   	30000000	        41.5 ns/op
BenchmarkCustomOnesCount-8   	30000000	        42.7 ns/op
BenchmarkCustomOnesCount-8   	30000000	        42.8 ns/op
BenchmarkStdlibOnesCount-8   	30000000	        57.9 ns/op
BenchmarkStdlibOnesCount-8   	20000000	        61.2 ns/op
BenchmarkStdlibOnesCount-8   	20000000	        56.6 ns/op
BenchmarkStdlibOnesCount-8   	20000000	        59.6 ns/op
BenchmarkStdlibOnesCount-8   	20000000	        59.5 ns/op

Giving

$ benchstat before after 
name               old time/op  new time/op  delta
CustomOnesCount-8  59.0ns ± 4%  42.1ns ± 2%  -28.63%  (p=0.008 n=5+5)

If this doesn't make sense, please feel free to close it. I just thought that it would useful to mention.

@odeke-em
Copy link
Member Author

@ALTree
Copy link
Member

ALTree commented Sep 27, 2018

On most amd64 machines, OnesCount is intrinsified into a POPCNT and it generally take less than a nanosecond

$ cd go/src/math/bits

$ go test -run=XX -bench=BenchmarkOnesCount64
BenchmarkOnesCount64-2          2000000000               0.77 ns/op
PASS
ok      math/bits       1.743s

We even have a test for this:

https://github.com/golang/go/blob/master/test/codegen/mathbits.go#L111

I think the way you wrote the benchmark defeated the optimization (either that, or you're actually measuring your benchmarking setup).

@ALTree
Copy link
Member

ALTree commented Sep 27, 2018

Obviously the optimization you suggested would still benefit architectures where we don't intrinsify OnesCount, my point is just that we likely wouldn't see any benefit on amd64.

@odeke-em
Copy link
Member Author

@ALTree in deed, I had only examined that code once ever :) and was wondering why pop count wasn't being used.

Oh yes, I see your point as bits.OnesCount64 is used as an argument which for sure defeats the optimization on machines with POPCNT*

@odeke-em odeke-em changed the title math/bits: OnesCount* should return 1 ASAP for powers of 2 if the price of branching is acceptable math/bits: OnesCount* if non-instrinsified should return 1 ASAP for powers of 2 if the price of branching is acceptable Sep 27, 2018
@cznic
Copy link
Contributor

cznic commented Sep 27, 2018

Also note that the OP's benchmark measures mostly the happy case (10 out of 15 cases), but there are only 64 powers of two out of 2^64 numbers. That means the optimization can make the performance actually get actually worse for a random input because of the added test.

@odeke-em
Copy link
Member Author

Yes in deed @cznic, great point I think the common case of non-powers beats the esoteric case of just 64 powers of 2. I found the need for this while working on a fast non-intrinsic square rooting algorithm but since the proposal here is marginally applicable useful in the common case, I'll rescind this issue. Thank you all for the input and sorry for the noise.

@golang golang locked and limited conversation to collaborators Sep 27, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants