math/bits: OnesCount* if non-instrinsified should return 1 ASAP for powers of 2 if the price of branching is acceptable #27892

odeke-em · 2018-09-27T07:00:29Z

I was just studying some Mathematics and while examining some bit patterns thought that perhaps Go could use some optimizations if not already there(we don't have them in). The optimization is that we can return 1 ASAP if OnesCount* gets a power of 2. I checked through the bits.go code and didn't see such a condition.
A power of 2 is a value with only 1 set bit due to the incremental left shift of 1 0b1 and hence the reason why to detect a power of 2 we do: x&(x-1) == 0

Anyways, running some benchmarks on my Macbook pro show massive improvements for powers of 2 even with a branch to detect if powers of 2 ~29% speed up

Source code

Throwing in a mix of values

package main_test

import (
	"math/bits"
	"testing"
)

var tests = []struct {
	v    uint64
	want int
}{
	{1, 1}, {2, 1}, {3, 2}, {4, 1}, {5, 2}, {9, 2}, {8, 1}, {16, 1},
         {32, 1}, {64, 1}, {127, 7}, {256, 1}, {65536, 1},
         {4294967296, 1}, {4294967295, 32},
}

var sink int

func BenchmarkCustomOnesCount(b *testing.B) {
	runBenchmarks(b, customOnesCountForPowerOf2)
}

func BenchmarkStdlibOnesCount(b *testing.B) {
	runBenchmarks(b, bits.OnesCount64)
}

func runBenchmarks(b *testing.B, fn func(uint64) int) {
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		for _, tt := range tests {
			sink = fn(tt.v)
			if sink != tt.want {
				b.Errorf("%d: got %d want %d", tt.v, sink, tt.want)
			}
		}
	}
	if sink == 0 {
		b.Fatal("Did not run the benchmark!")
	}
}

func customOnesCountForPowerOf2(x uint64) int {
	// Simulate absolute branch since powerOf2 check.
	if x > 0 && x&(x-1) == 0 {
		return 1
	}
	const m = 1<<64 - 1
	x = x>>1&(m0&m) + x&(m0&m)
	x = x>>2&(m1&m) + x&(m1&m)
	x = (x>>4 + x) & (m2 & m)
	x += x >> 8
	x += x >> 16
	x += x >> 32
	return int(x) & (1<<7 - 1)
}

// Inlining these constants code here to show what
// the code for OnesCount64 will look like after the alteration.
const m0 = 0x5555555555555555 // 01010101 ...
const m1 = 0x3333333333333333 // 00110011 ...
const m2 = 0x0f0f0f0f0f0f0f0f // 00001111 ...
const m3 = 0x00ff00ff00ff00ff // etc.
const m4 = 0x0000ffff0000ffff

Benchmark results

goos: darwin
goarch: amd64
BenchmarkCustomOnesCount-8   	30000000	        42.1 ns/op
BenchmarkCustomOnesCount-8   	30000000	        41.3 ns/op
BenchmarkCustomOnesCount-8   	30000000	        41.5 ns/op
BenchmarkCustomOnesCount-8   	30000000	        42.7 ns/op
BenchmarkCustomOnesCount-8   	30000000	        42.8 ns/op
BenchmarkStdlibOnesCount-8   	30000000	        57.9 ns/op
BenchmarkStdlibOnesCount-8   	20000000	        61.2 ns/op
BenchmarkStdlibOnesCount-8   	20000000	        56.6 ns/op
BenchmarkStdlibOnesCount-8   	20000000	        59.6 ns/op
BenchmarkStdlibOnesCount-8   	20000000	        59.5 ns/op

Giving

$ benchstat before after 
name               old time/op  new time/op  delta
CustomOnesCount-8  59.0ns ± 4%  42.1ns ± 2%  -28.63%  (p=0.008 n=5+5)

If this doesn't make sense, please feel free to close it. I just thought that it would useful to mention.

The text was updated successfully, but these errors were encountered:

odeke-em · 2018-09-27T07:00:51Z

/cc @randall77 @griesemer @bmkessler @cherrymui

ALTree · 2018-09-27T07:28:51Z

On most amd64 machines, OnesCount is intrinsified into a POPCNT and it generally take less than a nanosecond

$ cd go/src/math/bits

$ go test -run=XX -bench=BenchmarkOnesCount64
BenchmarkOnesCount64-2          2000000000               0.77 ns/op
PASS
ok      math/bits       1.743s

We even have a test for this:

https://github.com/golang/go/blob/master/test/codegen/mathbits.go#L111

I think the way you wrote the benchmark defeated the optimization (either that, or you're actually measuring your benchmarking setup).

ALTree · 2018-09-27T07:31:09Z

Obviously the optimization you suggested would still benefit architectures where we don't intrinsify OnesCount, my point is just that we likely wouldn't see any benefit on amd64.

odeke-em · 2018-09-27T07:42:14Z

@ALTree in deed, I had only examined that code once ever :) and was wondering why pop count wasn't being used.

Oh yes, I see your point as bits.OnesCount64 is used as an argument which for sure defeats the optimization on machines with POPCNT*

cznic · 2018-09-27T09:31:29Z

Also note that the OP's benchmark measures mostly the happy case (10 out of 15 cases), but there are only 64 powers of two out of 2^64 numbers. That means the optimization can make the performance actually get actually worse for a random input because of the added test.

odeke-em · 2018-09-27T09:54:40Z

Yes in deed @cznic, great point I think the common case of non-powers beats the esoteric case of just 64 powers of 2. I found the need for this while working on a fast non-intrinsic square rooting algorithm but since the proposal here is marginally applicable useful in the common case, I'll rescind this issue. Thank you all for the input and sorry for the noise.

odeke-em added the Performance label Sep 27, 2018

odeke-em changed the title ~~math/bits: OnesCount* should return 1 ASAP for powers of 2 if the price of branching is acceptable~~ math/bits: OnesCount* if non-instrinsified should return 1 ASAP for powers of 2 if the price of branching is acceptable Sep 27, 2018

odeke-em closed this as completed Sep 27, 2018

golang locked and limited conversation to collaborators Sep 27, 2019

gopherbot added the FrozenDueToAge label Sep 27, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

math/bits: OnesCount* if non-instrinsified should return 1 ASAP for powers of 2 if the price of branching is acceptable #27892

math/bits: OnesCount* if non-instrinsified should return 1 ASAP for powers of 2 if the price of branching is acceptable #27892

odeke-em commented Sep 27, 2018

odeke-em commented Sep 27, 2018

ALTree commented Sep 27, 2018 •

edited

ALTree commented Sep 27, 2018

odeke-em commented Sep 27, 2018

cznic commented Sep 27, 2018

odeke-em commented Sep 27, 2018

math/bits: OnesCount* if non-instrinsified should return 1 ASAP for powers of 2 if the price of branching is acceptable #27892

math/bits: OnesCount* if non-instrinsified should return 1 ASAP for powers of 2 if the price of branching is acceptable #27892

Comments

odeke-em commented Sep 27, 2018

Source code

Benchmark results

odeke-em commented Sep 27, 2018

ALTree commented Sep 27, 2018 • edited

ALTree commented Sep 27, 2018

odeke-em commented Sep 27, 2018

cznic commented Sep 27, 2018

odeke-em commented Sep 27, 2018

ALTree commented Sep 27, 2018 •

edited