cmd/compile: slower bit operations (regression in go 1.16.3) #45790

Closed
rayjcwu opened this issue Apr 26, 2021 · 7 comments
Labels: FrozenDueToAge, NeedsInvestigation, Performance
Milestone: Go1.17

Comments


rayjcwu commented Apr 26, 2021

What version of Go are you using (go version)?

$ go version
1.16.3

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE="auto"
GOARCH="amd64"
GOBIN=""
GOCACHE="/mnt/go-cache"
GOENV="/mnt/ubuntu/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/mnt/jenkins/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/mnt/jenkins"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.16.3"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build4206356499=/tmp/go-build -gno-record-gcc-switches"

What did you do?

We are upgrading from Go 1.15.8 to Go 1.16.3 and found a noticeable benchmark regression. Here is a bitset that sets and clears the i-th bit:

package main

const remMask = uint64(1)<<6 - 1 // 63: masks the bit index within a 64-bit word

type bitset struct {
	b []uint64
}

// set sets the i-th bit: word i>>6, bit i&63.
func (b *bitset) set(i int) {
	m := uint64(1) << (uint64(i) & remMask)
	b.b[i>>6] |= m
}

// unset clears the i-th bit.
func (b *bitset) unset(i int) {
	m := uint64(1) << (uint64(i) & remMask)
	b.b[i>>6] &^= m
}
And here is the benchmark (a _test.go file in the same package):

package main

import (
	"math/rand"
	"testing"
)

func BenchmarkSet(b *testing.B) {
	const n = 10
	const m = 1<<n - 1
	const numWords = (1 << n + 63) >> 6
	bs := &bitset{
		b: make([]uint64, numWords),
	}
	r := rand.New(rand.NewSource(0))
	bits := make([]bool, 1<<n)
	for i := range bits {
		bits[i] = r.Intn(2) == 1
	}
	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		if bits[i&m] {
			bs.set(i & m)
		} else {
			bs.unset(i & m)
		}
	}
}

Here is the benchmark result comparing 1.15.8 (old) and 1.16.3 (new):

goos: linux
goarch: amd64
cpu: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz

Set-36     1.64ns ± 1%    2.89ns ± 1%  +76.54%  (p=0.001 n=7+7)

I bisected it; the benchmark regression started at 96139f2

What did you expect to see?

Less performance regression.

What did you see instead?

+76.54% performance regression.

@randall77

I can reproduce. Looks like it was introduced from 1.16.2 -> 1.16.3.

That makes me think it was probably https://go-review.googlesource.com/c/go/+/305069

The regression remains at tip.

Looking at the assembly, set does a BTSQ directly on memory instead of an ORQ.
Not sure why the BTSQ is that much slower.

1.16.2

	0x002b 00043 (/Users/khr/gowork/issue45790_test.go:16)	MOVQ	BX, CX
	0x002e 00046 (/Users/khr/gowork/issue45790_test.go:16)	MOVL	$1, SI
	0x0033 00051 (/Users/khr/gowork/issue45790_test.go:16)	SHLQ	CX, SI
	0x0036 00054 (/Users/khr/gowork/issue45790_test.go:17)	ORQ	SI, (DX)(AX*8)

1.16.3

	0x002b 00043 (/Users/khr/gowork/issue45790_test.go:17)	LEAQ	(DX)(AX*8), AX
	0x002f 00047 (/Users/khr/gowork/issue45790_test.go:17)	ANDQ	$63, BX
	0x0033 00051 (/Users/khr/gowork/issue45790_test.go:17)	BTSQ	BX, (AX)

The latter code looks better, so I suspect BTSQ is slow for some reason. The CL referenced above fixed some rewrite rules and, as a side effect, triggered this optimization. I think it would be worth going back to the CLs that introduced the BTSQmodify instructions to make sure those are actually performant.

@pgavlin @benshi001
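
As an aside, a minimal micro-benchmark that isolates just this pattern (an OR of a variable single-bit mask directly into memory) might look like the sketch below. The names are made up for illustration, and whether this exact form triggers the BTSQmodify rewrite would need to be confirmed against the generated assembly.

package main

import "testing"

// words is package-level so the compiler cannot eliminate the stores;
// 16 words mirrors numWords in the benchmark above.
var words [16]uint64

// BenchmarkOrSingleBitToMem ORs a variable single-bit mask directly into
// a memory location, the shape targeted by the BTSQmodify rewrite rules.
func BenchmarkOrSingleBitToMem(b *testing.B) {
	for i := 0; i < b.N; i++ {
		words[i&15] |= 1 << (uint(i) & 63)
	}
}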

randall77 added this to the Go1.17 milestone Apr 27, 2021
@randall77

@martisch

@ulikunitz

One observation: the ANDQ is not required; the masking is done by BTSQ itself.

Another: BTSQ probably decodes into multiple micro-ops, including a write of the CF flag. The earlier sequence is probably translated 1-to-1 into micro-ops and will have a lower micro-op count.

@ulikunitz

https://www.agner.org/optimize/optimizing_assembly.pdf states that BTS with a memory operand should be avoided on Intel. So computing the result in a register and then writing it to memory might be faster.

cherrymui added the NeedsInvestigation label Apr 27, 2021

pgavlin commented Apr 28, 2021

One observation: the ANDQ is not required, it is done by BTSQ itself.

@ulikunitz This is only true if the destination operand is not memory. From the instruction reference manual, page 3-11:

If BitBase is a memory address, the BitOffset can have different ranges depending on the operand size (see Table 3-2).

Table 3-2. Range of Bit Positions Specified by Bit Offset Operands

Operand Size | Immediate BitOffset | Register BitOffset
16           | 0 to 15             | −2^15 to 2^15 − 1
32           | 0 to 31             | −2^31 to 2^31 − 1
64           | 0 to 63             | −2^63 to 2^63 − 1

The addressed bit is numbered (Offset MOD 8) within the byte at address (BitBase + (BitOffset DIV 8)), where DIV is signed division with rounding towards negative infinity and MOD returns a positive number (see Figure 3-2).

The lack of masking was the cause of the issue that prompted the fix.
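
To make the quoted semantics concrete, here is a small Go simulation of what the memory-operand form of BTS does with an unmasked bit offset (written for this discussion, not taken from the compiler): the offset selects a byte relative to BitBase, so without the ANDQ $63 the instruction could touch memory outside the addressed 64-bit word.

package main

import "fmt"

// btsMem mimics the memory-operand form of BTS: it sets bit (off MOD 8) of
// the byte at mem[base+(off DIV 8)], where DIV rounds toward negative
// infinity and MOD yields a non-negative result. The offset is not masked
// to the operand size.
func btsMem(mem []byte, base int, off int64) {
	byteIndex := base + int(off>>3) // arithmetic shift == floor division
	bit := uint(off & 7)            // non-negative remainder, 0..7
	mem[byteIndex] |= 1 << bit
}

func main() {
	mem := make([]byte, 16)
	// A bit offset of 70 lands in byte 8 (70 DIV 8), bit 6 (70 MOD 8),
	// i.e. outside the first 64-bit word at mem[0..7].
	btsMem(mem, 0, 70)
	fmt.Println(mem) // [0 0 0 0 0 0 0 0 64 0 0 0 0 0 0 0]
}

This is why the ANDQ $63 has to stay when BTSQ writes to memory, unlike the register form, where the hardware masks the offset to the operand size.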

https://www.agner.org/optimize/optimizing_assembly.pdf states that BTS with a memory operand should be avoided on Intel. So moving the result into a register and writing it might be faster.

Given this, I agree with @randall77:

I think it would be worth going back to the CLs that introduced BTSQmodify instructions to make sure those are actually performant.

@ulikunitz

I agree with both points; I was wrong regarding the AND.

@gopherbot

Change https://golang.org/cl/318149 mentions this issue: cmd/compile: remove bit operations that modify memory directly

golang locked and limited conversation to collaborators May 8, 2022