math/bits: faster way to compute OnesCount64 #46188

ttsugriy · 2021-05-15T19:39:18Z

What version of Go are you using (`go version`)?

$ go version
go version go1.16.3 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (`go env`)?

go env Output

$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/ttsugrii/Library/Caches/go-build"
GOENV="/Users/ttsugrii/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/ttsugrii/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/ttsugrii/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/Cellar/go/1.16.3/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.16.3/libexec/pkg/tool/darwin_amd64"
GOVCS=""
GOVERSION="go1.16.3"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/ttsugrii/projects/go/testing/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/kx/wp6rcc054zv_xyf4lpmlztr00000gn/T/go-build1168578044=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I've ported Wilkes-Wheeler-Gill algorithm from "Faster Population Counts Using AVX2 Instructions" paper:

const c1 = 0x5555555555555555
const c2 = 0x3333333333333333
const c4 = 0x0F0F0F0F0F0F0F0F

//go:noinline
func wilkes_wheeler_gill(x uint64) int {
	x -= (x >> 1) & c1
	x = ((x >> 2) & c2) + (x & c2)
	x = (x + (x >> 4)) & c4
	x *= 0x0101010101010101
	return int(x >> 56)
}

and benchmarked it against the version used by math.bits.OnesCount64 in case POPCNT is not supported:

const m0 = 0x5555555555555555 // 01010101 ...
const m1 = 0x3333333333333333 // 00110011 ...
const m2 = 0x0f0f0f0f0f0f0f0f // 00001111 ...

//go:noinline
func parallel_summing(x uint64) int {
	const m = 1<<64 - 1
	x = x>>1&(m0&m) + x&(m0&m)
	x = x>>2&(m1&m) + x&(m1&m)
	x = (x>>4 + x) & (m2 & m)
	x += x >> 8
	x += x >> 16
	x += x >> 32
	return int(x) & (1<<7 - 1)
}

The benchmarks:

var x uint64 = <the input value>
var result int

func BenchmarkPopCountParallelSumming(b *testing.B) {
	var cnt int
	for i := 0; i < b.N; i++ {
		cnt += parallel_summing(x)
	}
	result += cnt
}

func BenchmarkPopCountWilkesWheelerGill(b *testing.B) {
	var cnt int
	for i := 0; i < b.N; i++ {
		cnt += wilkes_wheeler_gill(x)
	}
	result += cnt
}

For var x uint64 = 0xffffffffffffffff:

cpu: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
BenchmarkPopCountParallelSumming-16      	360195423	         3.094 ns/op
BenchmarkPopCountWilkesWheelerGill-16    	418102376	         2.798 ns/op

and for var x uint64 = 0

cpu: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
BenchmarkPopCountParallelSumming-16      	379682172	         3.126 ns/op
BenchmarkPopCountWilkesWheelerGill-16    	431421498	         2.870 ns/op

What did you expect to see?

math.bits.OnesCount64's version to be faster

What did you see instead?

Wilkes-Wheeler-Gill algorithm was faster by ~10% and, given its implementation is also more concise, it makes sense to replace with it existing implementation.

The text was updated successfully, but these errors were encountered:

This implementation is based on the C function from "Faster Population Counts Using AVX2 Instructions" paper, Figure 3, available at https://arxiv.org/pdf/1612.07612.pdf More details and benchmark results are available in the golang#46188. Closes golang#46188

gopherbot · 2021-05-15T20:05:36Z

Change https://golang.org/cl/320112 mentions this issue: math/bits: use Wilkes-Wheeler-Gill algorithm for OnesCount64

ALTree · 2021-05-15T22:06:11Z

This could be useful in general, but I think it would be better to double-check it's a win on the architectures that actually use the pure Go version.

Suppose the new code is actually 10% slower on some GOARCH without a POPCNT... then this change would be a net loss, because amd64 won't actually benefit from the 10% speedup you measured (since all processors made after 2006 have POPCNT), but the GOARCH that does use the pure Go version will be slowed down.

ttsugriy · 2021-05-16T08:25:14Z

Thank you for the feedback, @ALTree! Do you happen to have any particular architecture in mind? New algorithm assembly is shorter on both ARM:

"".parallel_summing STEXT size=208 args=0xc locals=0x0 funcid=0x0 leaf
        0x0000 00000 (popcount_test.go:20)      TEXT    "".parallel_summing(SB), LEAF|NOFRAME|ABIInternal, $-4-12
        0x0000 00000 (popcount_test.go:20)      FUNCDATA        $0, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
        0x0000 00000 (popcount_test.go:20)      FUNCDATA        $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
        0x0000 00000 (popcount_test.go:22)      MOVW    "".x(FP), R0
        0x0004 00004 (popcount_test.go:22)      SRL     $1, R0, R1
        0x0008 00008 (popcount_test.go:22)      MOVW    "".x+4(FP), R2
        0x000c 00012 (popcount_test.go:22)      ORR     R2<<31, R1, R1
        0x0010 00016 (popcount_test.go:22)      AND     $1431655765, R1, R1
        0x0018 00024 (popcount_test.go:22)      AND     $1431655765, R0, R0
        0x0020 00032 (popcount_test.go:22)      ADD.S   R1, R0, R0
        0x0024 00036 (popcount_test.go:22)      SRL     $1, R2, R1
        0x0028 00040 (popcount_test.go:22)      AND     $1431655765, R1, R1
        0x0030 00048 (popcount_test.go:22)      AND     $1431655765, R2, R2
        0x0038 00056 (popcount_test.go:22)      ADC     R1, R2, R1
        0x003c 00060 (popcount_test.go:23)      SRL     $2, R0, R2
        0x0040 00064 (popcount_test.go:23)      ORR     R1<<30, R2, R2
        0x0044 00068 (popcount_test.go:23)      AND     $858993459, R2, R2
        0x004c 00076 (popcount_test.go:23)      AND     $858993459, R0, R0
        0x0054 00084 (popcount_test.go:23)      ADD.S   R2, R0, R0
        0x0058 00088 (popcount_test.go:23)      SRL     $2, R1, R2
        0x005c 00092 (popcount_test.go:23)      AND     $858993459, R2, R2
        0x0064 00100 (popcount_test.go:23)      AND     $858993459, R1, R1
        0x006c 00108 (popcount_test.go:23)      ADC     R2, R1, R1
        0x0070 00112 (popcount_test.go:24)      SRL     $4, R0, R2
        0x0074 00116 (popcount_test.go:24)      ORR     R1<<28, R2, R2
        0x0078 00120 (popcount_test.go:24)      ADD.S   R0, R2, R0
        0x007c 00124 (popcount_test.go:24)      ADC     R1>>4, R1, R1
        0x0080 00128 (popcount_test.go:24)      AND     $252645135, R1, R1
        0x0088 00136 (popcount_test.go:24)      AND     $252645135, R0, R0
        0x0090 00144 (popcount_test.go:25)      SRL     $8, R0, R2
        0x0094 00148 (popcount_test.go:25)      ORR     R1<<24, R2, R2
        0x0098 00152 (popcount_test.go:25)      ADD.S   R0, R2, R0
        0x009c 00156 (popcount_test.go:25)      ADC     R1>>8, R1, R1
        0x00a0 00160 (popcount_test.go:26)      SRL     $16, R0, R2
        0x00a4 00164 (popcount_test.go:26)      ORR     R1<<16, R2, R2
        0x00a8 00168 (popcount_test.go:26)      ADD.S   R0, R2, R0
        0x00ac 00172 (popcount_test.go:26)      ADC     R1>>16, R1, R1
        0x00b0 00176 (popcount_test.go:27)      ADD.S   R0, R1, R0
        0x00b4 00180 (popcount_test.go:28)      AND     $127, R0, R0
        0x00b8 00184 (popcount_test.go:28)      MOVW    R0, "".~r1+8(FP)
        0x00bc 00188 (popcount_test.go:28)      JMP     (R14)
        0x00c0 00192 (popcount_test.go:28)      JMP     0(PC)
        0x00c4 00196 (popcount_test.go:28)      WORD    $1431655765
        0x00c8 00200 (popcount_test.go:28)      WORD    $858993459
        0x00cc 00204 (popcount_test.go:28)      WORD    $252645135

"".wilkes_wheeler_gill STEXT size=176 args=0xc locals=0x0 funcid=0x0 leaf
        0x0000 00000 (popcount_test.go:60)      TEXT    "".wilkes_wheeler_gill(SB), LEAF|NOFRAME|ABIInternal, $-4-12
        0x0000 00000 (popcount_test.go:60)      FUNCDATA        $0, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
        0x0000 00000 (popcount_test.go:60)      FUNCDATA        $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
        0x0000 00000 (popcount_test.go:61)      MOVW    "".x(FP), R0
        0x0004 00004 (popcount_test.go:61)      SRL     $1, R0, R1
        0x0008 00008 (popcount_test.go:61)      MOVW    "".x+4(FP), R2
        0x000c 00012 (popcount_test.go:61)      ORR     R2<<31, R1, R1
        0x0010 00016 (popcount_test.go:61)      AND     $1431655765, R1, R1
        0x0018 00024 (popcount_test.go:61)      SUB.S   R1, R0, R0
        0x001c 00028 (popcount_test.go:61)      SRL     $1, R2, R1
        0x0020 00032 (popcount_test.go:61)      AND     $1431655765, R1, R1
        0x0028 00040 (popcount_test.go:61)      SBC     R1, R2, R1
        0x002c 00044 (popcount_test.go:62)      SRL     $2, R0, R2
        0x0030 00048 (popcount_test.go:62)      ORR     R1<<30, R2, R2
        0x0034 00052 (popcount_test.go:62)      AND     $858993459, R2, R2
        0x003c 00060 (popcount_test.go:62)      AND     $858993459, R0, R0
        0x0044 00068 (popcount_test.go:62)      ADD.S   R0, R2, R0
        0x0048 00072 (popcount_test.go:62)      SRL     $2, R1, R2
        0x004c 00076 (popcount_test.go:62)      AND     $858993459, R2, R2
        0x0054 00084 (popcount_test.go:62)      AND     $858993459, R1, R1
        0x005c 00092 (popcount_test.go:62)      ADC     R1, R2, R1
        0x0060 00096 (popcount_test.go:63)      SRL     $4, R0, R2
        0x0064 00100 (popcount_test.go:63)      ORR     R1<<28, R2, R2
        0x0068 00104 (popcount_test.go:63)      ADD.S   R0, R2, R0
        0x006c 00108 (popcount_test.go:63)      AND     $252645135, R0, R0
        0x0074 00116 (popcount_test.go:64)      MOVW    $16843009, R2
        0x0078 00120 (popcount_test.go:64)      MULLU   R0, R2, (R3, R4)
        0x007c 00124 (popcount_test.go:63)      ADC     R1>>4, R1, R1
        0x0080 00128 (popcount_test.go:63)      AND     $252645135, R1, R1
        0x0088 00136 (popcount_test.go:64)      MULA    R1, R2, R3, R1
        0x008c 00140 (popcount_test.go:64)      MULA    R0, R2, R1, R0
        0x0090 00144 (popcount_test.go:65)      SRL     $24, R0, R0
        0x0094 00148 (popcount_test.go:65)      MOVW    R0, "".~r1+8(FP)
        0x0098 00152 (popcount_test.go:65)      JMP     (R14)
        0x009c 00156 (popcount_test.go:65)      JMP     0(PC)
        0x00a0 00160 (popcount_test.go:65)      WORD    $1431655765
        0x00a4 00164 (popcount_test.go:65)      WORD    $858993459
        0x00a8 00168 (popcount_test.go:65)      WORD    $252645135
        0x00ac 00172 (popcount_test.go:65)      WORD    $16843009

and ARM64:

"".parallel_summing STEXT size=80 args=0x10 locals=0x0 funcid=0x0 leaf
        0x0000 00000 (popcount_test.go:20)      TEXT    "".parallel_summing(SB), LEAF|NOFRAME|ABIInternal, $0-16
        0x0000 00000 (popcount_test.go:20)      FUNCDATA        ZR, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
        0x0000 00000 (popcount_test.go:20)      FUNCDATA        $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
        0x0000 00000 (popcount_test.go:22)      MOVD    "".x(FP), R0
        0x0004 00004 (popcount_test.go:22)      LSR     $1, R0, R1
        0x0008 00008 (popcount_test.go:22)      AND     $6148914691236517205, R1, R1
        0x000c 00012 (popcount_test.go:22)      AND     $6148914691236517205, R0, R0
        0x0010 00016 (popcount_test.go:22)      ADD     R1, R0, R0
        0x0014 00020 (popcount_test.go:23)      LSR     $2, R0, R1
        0x0018 00024 (popcount_test.go:23)      AND     $3689348814741910323, R1, R1
        0x001c 00028 (popcount_test.go:23)      AND     $3689348814741910323, R0, R0
        0x0020 00032 (popcount_test.go:23)      ADD     R1, R0, R0
        0x0024 00036 (popcount_test.go:24)      ADD     R0>>4, R0, R0
        0x0028 00040 (popcount_test.go:24)      AND     $1085102592571150095, R0, R0
        0x002c 00044 (popcount_test.go:25)      ADD     R0>>8, R0, R0
        0x0030 00048 (popcount_test.go:26)      ADD     R0>>16, R0, R0
        0x0034 00052 (popcount_test.go:27)      ADD     R0>>32, R0, R0
        0x0038 00056 (popcount_test.go:28)      AND     $127, R0, R0
        0x003c 00060 (popcount_test.go:28)      MOVD    R0, "".~r1+8(FP)
        0x0040 00064 (popcount_test.go:28)      RET     (R30)

"".wilkes_wheeler_gill STEXT size=64 args=0x10 locals=0x0 funcid=0x0 leaf
        0x0000 00000 (popcount_test.go:60)      TEXT    "".wilkes_wheeler_gill(SB), LEAF|NOFRAME|ABIInternal, $0-16
        0x0000 00000 (popcount_test.go:60)      FUNCDATA        ZR, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
        0x0000 00000 (popcount_test.go:60)      FUNCDATA        $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
        0x0000 00000 (popcount_test.go:61)      MOVD    "".x(FP), R0
        0x0004 00004 (popcount_test.go:61)      LSR     $1, R0, R1
        0x0008 00008 (popcount_test.go:61)      AND     $6148914691236517205, R1, R1
        0x000c 00012 (popcount_test.go:61)      SUB     R1, R0, R0
        0x0010 00016 (popcount_test.go:62)      LSR     $2, R0, R1
        0x0014 00020 (popcount_test.go:62)      AND     $3689348814741910323, R1, R1
        0x0018 00024 (popcount_test.go:62)      AND     $3689348814741910323, R0, R0
        0x001c 00028 (popcount_test.go:62)      ADD     R0, R1, R0
        0x0020 00032 (popcount_test.go:63)      ADD     R0>>4, R0, R0
        0x0024 00036 (popcount_test.go:63)      AND     $1085102592571150095, R0, R0
        0x0028 00040 (popcount_test.go:64)      MOVD    $72340172838076673, R1
        0x002c 00044 (popcount_test.go:64)      MUL     R1, R0, R0
        0x0030 00048 (popcount_test.go:65)      LSR     $56, R0, R0
        0x0034 00052 (popcount_test.go:65)      MOVD    R0, "".~r1+8(FP)
        0x0038 00056 (popcount_test.go:65)      RET     (R30)

The only reason to suspect potential performance regression on these platforms is the usage of multiplication but it's hard to tell for sure without having access to the target platform.

randall77 · 2021-05-16T15:33:31Z

On arm64 we always use the intrinsic, so the speed of this backup code shouldn't matter there (unless it is faster than the intrinsic!).
On arm we never do, so that would be a good place to test. Any mips would be good also.
I'm not particularly concerned about amd64. If someone is using a >15 yr old processor, they probably don't care that much about performance.

greatroar · 2021-05-22T10:52:59Z

On the Raspberry Pi 4B running in 32-bit mode (ARM Cortex-A72), the WWG version is significantly faster:

name        old time/op  new time/op  delta
PopCount-4  21.5ns ± 0%  20.6ns ± 0%  -4.22%  (p=0.000 n=9+9)

OnesCount64 of uint64(i) inside the loop shows similar performance.

~~Then again, a version that just adds the OnesCount32 of both halves of the input is faster still, if I remove the noinline pragmas. The assembler output shows that the function is not optimized away.~~ No, that's only in the uint64(i) benchmark, where the compiler can determine that the upper half is zero.

ttsugriy · 2021-05-22T19:30:12Z

Thank you for checking the 32bit ARM performance, @greatroar! @randall77, does this address your concern about arm perf impact?

randall77 · 2021-05-22T19:54:21Z

Seems fine.

A 4% performance improvement doesn't seem great though. That's well within the "code alignment could have caused it" performance range.

randall77 · 2021-10-13T17:04:55Z

@andig This change shouldn't affect arm64 at all - it already uses an intrinsic, not the fallback code being discussed here.

ttsugriy linked a pull request May 15, 2021 that will close this issue

math/bits: use Wilkes-Wheeler-Gill algorithm for OnesCount64 #46189

Open

ALTree changed the title ~~Faster way to compute math.bits.OnesCount64~~ math/bits: faster way to compute OnesCount64 May 15, 2021

ALTree added the Performance label May 15, 2021

dmitshur added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label May 17, 2021

dmitshur added this to the Backlog milestone May 17, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

math/bits: faster way to compute OnesCount64 #46188

math/bits: faster way to compute OnesCount64 #46188

ttsugriy commented May 15, 2021

gopherbot commented May 15, 2021

ALTree commented May 15, 2021 •

edited

ttsugriy commented May 16, 2021

randall77 commented May 16, 2021

greatroar commented May 22, 2021 •

edited

ttsugriy commented May 22, 2021

randall77 commented May 22, 2021

randall77 commented Oct 13, 2021

math/bits: faster way to compute OnesCount64 #46188

math/bits: faster way to compute OnesCount64 #46188

Comments

ttsugriy commented May 15, 2021

What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (go env)?

What did you do?

What did you expect to see?

What did you see instead?

gopherbot commented May 15, 2021

ALTree commented May 15, 2021 • edited

ttsugriy commented May 16, 2021

randall77 commented May 16, 2021

greatroar commented May 22, 2021 • edited

ttsugriy commented May 22, 2021

randall77 commented May 22, 2021

randall77 commented Oct 13, 2021

What version of Go are you using (`go version`)?

What operating system and processor architecture are you using (`go env`)?

ALTree commented May 15, 2021 •

edited

greatroar commented May 22, 2021 •

edited