cmd/compile: slower bit operations (regression in go 1.16.3) #45790

Closed
rayjcwu opened this issue Apr 26, 2021 · 7 comments
Labels: FrozenDueToAge, NeedsInvestigation, Performance
Milestone: Go1.17

Comments


rayjcwu commented Apr 26, 2021

What version of Go are you using (go version)?

$ go version
1.16.3

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE="auto"
GOARCH="amd64"
GOBIN=""
GOCACHE="/mnt/go-cache"
GOENV="/mnt/ubuntu/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/mnt/jenkins/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/mnt/jenkins"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.16.3"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build4206356499=/tmp/go-build -gno-record-gcc-switches"

What did you do?

We are upgrading from Go 1.15.8 to Go 1.16.3 and found a noticeable benchmark regression. Here is a bitset that sets and clears the i-th bit:

package main

const remMask = uint64(1)<<6 - 1 // 63: masks the bit index within a 64-bit word

type bitset struct {
	b []uint64
}

// set sets the i-th bit: word i>>6, bit i&63.
func (b *bitset) set(i int) {
	m := uint64(1) << (uint64(i) & remMask)
	b.b[i>>6] |= m
}

// unset clears the i-th bit.
func (b *bitset) unset(i int) {
	m := uint64(1) << (uint64(i) & remMask)
	b.b[i>>6] &^= m
}
And here is the benchmark (a _test.go file in the same package):

package main

import (
	"math/rand"
	"testing"
)

func BenchmarkSet(b *testing.B) {
	const n = 10
	const m = 1<<n - 1
	const numWords = (1 << n + 63) >> 6
	bs := &bitset{
		b: make([]uint64, numWords),
	}
	r := rand.New(rand.NewSource(0))
	bits := make([]bool, 1<<n)
	for i := range bits {
		bits[i] = r.Intn(2) == 1
	}
	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		if bits[i&m] {
			bs.set(i & m)
		} else {
			bs.unset(i & m)
		}
	}
}

Here is the benchmark result comparing 1.15.8 (old) and 1.16.3 (new):

goos: linux
goarch: amd64
cpu: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz

Set-36     1.64ns ± 1%    2.89ns ± 1%  +76.54%  (p=0.001 n=7+7)

I bisected it; the benchmark regression started at 96139f2

What did you expect to see?

Less performance regression.

What did you see instead?

+76.54% performance regression.

@randall77

I can reproduce. Looks like it was introduced from 1.16.2 -> 1.16.3.

That makes me think it was probably https://go-review.googlesource.com/c/go/+/305069

The regression remains at tip.

Looking at the assembly, set does a BTSQ directly on memory instead of an ORQ.
Not sure why the BTSQ is that much slower.

1.16.2

	0x002b 00043 (/Users/khr/gowork/issue45790_test.go:16)	MOVQ	BX, CX
	0x002e 00046 (/Users/khr/gowork/issue45790_test.go:16)	MOVL	$1, SI
	0x0033 00051 (/Users/khr/gowork/issue45790_test.go:16)	SHLQ	CX, SI
	0x0036 00054 (/Users/khr/gowork/issue45790_test.go:17)	ORQ	SI, (DX)(AX*8)

1.16.3

	0x002b 00043 (/Users/khr/gowork/issue45790_test.go:17)	LEAQ	(DX)(AX*8), AX
	0x002f 00047 (/Users/khr/gowork/issue45790_test.go:17)	ANDQ	$63, BX
	0x0033 00051 (/Users/khr/gowork/issue45790_test.go:17)	BTSQ	BX, (AX)

The latter code looks better, so I suspect BTSQ is slow for some reason. The CL referenced above fixed some rewrite rules and, as a side effect, triggered this optimization. I think it would be worth going back to the CLs that introduced the BTSQmodify instructions to make sure those are actually performant.

@pgavlin @benshi001
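
As an aside, a minimal micro-benchmark that isolates just this pattern (an OR of a variable single-bit mask directly into memory) might look like the sketch below. The names are made up for illustration, and whether this exact form triggers the BTSQmodify rewrite would need to be confirmed against the generated assembly.

package main

import "testing"

// words is package-level so the compiler cannot eliminate the stores;
// 16 words mirrors numWords in the benchmark above.
var words [16]uint64

// BenchmarkOrSingleBitToMem ORs a variable single-bit mask directly into
// a memory location, the shape targeted by the BTSQmodify rewrite rules.
func BenchmarkOrSingleBitToMem(b *testing.B) {
	for i := 0; i < b.N; i++ {
		words[i&15] |= 1 << (uint(i) & 63)
	}
}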

randall77 added this to the Go1.17 milestone Apr 27, 2021
@randall77

@martisch

@ulikunitz

One observation: the ANDQ is not required; the masking is done by BTSQ itself.

Another: BTSQ probably decodes into multiple micro-ops, including a write of the CF flag. The earlier sequence is probably translated 1-to-1 into micro-ops and will have a lower micro-op count.

@ulikunitz

https://www.agner.org/optimize/optimizing_assembly.pdf states that BTS with a memory operand should be avoided on Intel. So computing the result in a register and then writing it to memory might be faster.

cherrymui added the NeedsInvestigation label Apr 27, 2021

pgavlin commented Apr 28, 2021

One observation: the ANDQ is not required, it is done by BTSQ itself.

@ulikunitz This is only true if the destination operand is not memory. From the instruction reference manual, page 3-11:

If BitBase is a memory address, the BitOffset can have different ranges depending on the operand size (see Table 3-2).

Table 3-2. Range of Bit Positions Specified by Bit Offset Operands

Operand Size | Immediate BitOffset | Register BitOffset
16           | 0 to 15             | −2^15 to 2^15 − 1
32           | 0 to 31             | −2^31 to 2^31 − 1
64           | 0 to 63             | −2^63 to 2^63 − 1

The addressed bit is numbered (Offset MOD 8) within the byte at address (BitBase + (BitOffset DIV 8)), where DIV is signed division with rounding towards negative infinity and MOD returns a positive number (see Figure 3-2).

The lack of masking was the cause of the issue that prompted the fix.
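
To make the quoted semantics concrete, here is a small Go simulation of what the memory-operand form of BTS does with an unmasked bit offset (written for this discussion, not taken from the compiler): the offset selects a byte relative to BitBase, so without the ANDQ $63 the instruction could touch memory outside the addressed 64-bit word.

package main

import "fmt"

// btsMem mimics the memory-operand form of BTS: it sets bit (off MOD 8) of
// the byte at mem[base+(off DIV 8)], where DIV rounds toward negative
// infinity and MOD yields a non-negative result. The offset is not masked
// to the operand size.
func btsMem(mem []byte, base int, off int64) {
	byteIndex := base + int(off>>3) // arithmetic shift == floor division
	bit := uint(off & 7)            // non-negative remainder, 0..7
	mem[byteIndex] |= 1 << bit
}

func main() {
	mem := make([]byte, 16)
	// A bit offset of 70 lands in byte 8 (70 DIV 8), bit 6 (70 MOD 8),
	// i.e. outside the first 64-bit word at mem[0..7].
	btsMem(mem, 0, 70)
	fmt.Println(mem) // [0 0 0 0 0 0 0 0 64 0 0 0 0 0 0 0]
}

This is why the ANDQ $63 has to stay when BTSQ writes to memory, unlike the register form, where the hardware masks the offset to the operand size.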

https://www.agner.org/optimize/optimizing_assembly.pdf states that BTS with a memory operand should be avoided on Intel. So moving the result into a register and writing it might be faster.

Given this, I agree with @randall77:

I think it would be worth going back to the CLs that introduced BTSQmodify instructions to make sure those are actually performant.

@ulikunitz

I agree with both points; I was wrong regarding the AND.

@gopherbot

Change https://golang.org/cl/318149 mentions this issue: cmd/compile: remove bit operations that modify memory directly

golang locked and limited conversation to collaborators May 8, 2022