Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: dynamic checks for atomics should be removed for ARM64 targets that support LSE #66131

Open
andreybokhanko opened this issue Mar 6, 2024 · 6 comments
Assignees
Labels
arch-arm64 compiler/runtime Issues related to the Go compiler and/or runtime. FixPending Issues that have a fix which has not yet been reviewed or submitted.
Milestone

Comments

@andreybokhanko
Copy link
Contributor

Go version

upstream

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/abokhanko/.cache/go-build'
GOENV='/home/abokhanko/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/abokhanko/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/abokhanko/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/home/abokhanko/goroot'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/home/abokhanko/goroot/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='devel go1.23-e9ab25bdd0 Mon Feb 19 22:21:38 2024 +0300'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build2057756495=/tmp/go-build -gno-record-gcc-switches'

What did you do?

test.go:

package main

import "sync/atomic"

type Counter struct {
        counter int32
}

func main() {
        x := Counter{10}
        atomic.AddInt32(&x.counter, 1)
}

Build:

GOOS=linux GOARCH=arm64 GOARM64=v8.2 go build test.go

Get disassembly:

go tool objdump -s main\.main test

What did you see happen?

Atomic add includes checks for presence of atomic instructions and a loop for ARM64 architectures that don't support them:

  test.go:11            0x708cc                 b00007db                ADRP 1019904(PC), R27
  test.go:11            0x708d0                 3961af61                MOVBU 2155(R27), R1
  test.go:11            0x708d4                 360000a1                TBZ $0, R1, 5(PC)
  test.go:11            0x708d8                 b24003e1                ORR $1, ZR, R1
  test.go:11            0x708dc                 b8e10002                ?
  test.go:11            0x708e0                 8b010042                ADD R1, R2, R2
  test.go:11            0x708e4                 14000006                JMP 6(PC)
  test.go:11            0x708e8                 b24003e1                ORR $1, ZR, R1
  test.go:11            0x708ec                 885ffc02                LDAXRW (R0), R2
  test.go:11            0x708f0                 8b010042                ADD R1, R2, R2
  test.go:11            0x708f4                 881bfc02                STLXRW R2, (R0), R27
  test.go:11            0x708f8                 b5ffffbb                CBNZ R27, -3(PC)

What did you expect to see?

With GOARCH=arm64 GOARM64=v8.2 we can eliminate runtime checks and thus, improve performance.

@andreybokhanko
Copy link
Contributor Author

Related to #60905.

@andreybokhanko
Copy link
Contributor Author

@aktau @mauri870 @cherrymui @mengzhuo @zhangfannie I plan to work on fixing this; please assign this ticket on me.

@ALTree ALTree changed the title cmd/go: dynamic checks for atomics should be removed for ARM64 targets that support LSE cmd/compile: dynamic checks for atomics should be removed for ARM64 targets that support LSE Mar 6, 2024
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Mar 6, 2024
@ALTree ALTree added arch-arm64 and removed compiler/runtime Issues related to the Go compiler and/or runtime. labels Mar 6, 2024
@ALTree ALTree added this to the Unplanned milestone Mar 6, 2024
@ALTree ALTree added the NeedsFix The path to resolution is known, but the work has not been done. label Mar 6, 2024
@gopherbot
Copy link

Change https://go.dev/cl/569536 mentions this issue: cmd/compile,cmd/go,cmd/internal,runtime: remove dynamic checks for

@ALTree ALTree added the compiler/runtime Issues related to the Go compiler and/or runtime. label Mar 6, 2024
@Jorropo Jorropo added FixPending Issues that have a fix which has not yet been reviewed or submitted. and removed NeedsFix The path to resolution is known, but the work has not been done. labels Mar 6, 2024
gopherbot pushed a commit that referenced this issue Mar 21, 2024
…omics for ARM64 targets that support LSE

Remove dynamic checks for atomic instructions for ARM64 targets that support LSE extension.

For #66131

Change-Id: I0ec1b183a3f4ea4c8a537430646e6bc4b4f64271
Reviewed-on: https://go-review.googlesource.com/c/go/+/569536
Reviewed-by: Mauri de Souza Meneguzzo <mauri870@gmail.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Fannie Zhang <Fannie.Zhang@arm.com>
Reviewed-by: Shu-Chun Weng <scw@google.com>
@zhangfannie
Copy link
Contributor

Recently, I tested the internal/runtime/atomic benchmarks with the setting GOARM64 = v8.1. It was strange that the performance of And64Parallel and Or64Parallel has some regression. Please see the performance data below.

oos: linux
goarch: arm64
pkg: internal/runtime/atomic
                  │   old.txt    │               new.txt                │
                  │    sec/op    │    sec/op     vs base                │
And32-128            8.753n ± 1%    8.745n ± 0%        ~ (p=0.955 n=10)
And32Parallel-128    14.97n ± 1%    15.00n ± 2%        ~ (p=0.469 n=10)
And64-128            8.810n ± 1%    8.739n ± 0%   -0.81% (p=0.002 n=10)
And64Parallel-128    14.99n ± 1%    15.43n ± 2%   +2.94% (p=0.000 n=10)
Or32-128             8.724n ± 0%    8.730n ± 0%        ~ (p=0.469 n=10)
Or32Parallel-128     15.00n ± 1%    15.14n ± 2%        ~ (p=0.288 n=10)
Or64-128             8.720n ± 0%    8.722n ± 0%        ~ (p=0.446 n=10)
Or64Parallel-128     15.01n ± 3%    15.69n ± 5%   +4.46% (p=0.000 n=10)
AtomicLoad64-128    0.3648n ± 0%   0.3648n ± 0%        ~ (p=0.773 n=10)
AtomicStore64-128   0.3650n ± 0%   0.3648n ± 0%   -0.05% (p=0.002 n=10)
AtomicLoad-128      0.3643n ± 0%   0.3643n ± 0%        ~ (p=0.650 n=10)
AtomicStore-128     0.4369n ± 0%   0.4367n ± 0%        ~ (p=0.066 n=10)
And8-128             5.138n ± 0%    4.971n ± 0%   -3.26% (p=0.000 n=10)
And-128              5.150n ± 0%    4.946n ± 0%   -3.96% (p=0.000 n=10)
And8Parallel-128     12.21n ± 1%    11.87n ± 1%   -2.78% (p=0.000 n=10)
AndParallel-128      12.20n ± 2%    11.88n ± 1%   -2.58% (p=0.000 n=10)
Or8-128              5.043n ± 0%    4.934n ± 0%   -2.16% (p=0.000 n=10)
Or-128               5.099n ± 0%    4.922n ± 0%   -3.47% (p=0.000 n=10)
Or8Parallel-128      12.15n ± 1%    11.88n ± 1%   -2.22% (p=0.000 n=10)
OrParallel-128       13.09n ± 1%    11.84n ± 1%   -9.51% (p=0.000 n=10)
Xadd-128             12.76n ± 1%    11.59n ± 4%   -9.13% (p=0.000 n=10)
Xadd64-128           12.74n ± 1%    11.61n ± 1%   -8.91% (p=0.000 n=10)
Cas-128              24.71n ± 1%    20.32n ± 0%  -17.75% (p=0.000 n=10)
Cas64-128            25.46n ± 1%    20.34n ± 0%  -20.09% (p=0.000 n=10)
Xchg-128             12.52n ± 1%    12.23n ± 2%   -2.32% (p=0.005 n=10)
Xchg64-128           12.58n ± 1%    12.19n ± 1%   -3.06% (p=0.000 n=10)
geomean              6.544n         6.322n        -3.39%

The Go compiler do not intrinsify the atomic.And64 and the atomic.Or64. Setting GOARM64 to v8.1, their code will lack the following two instructions. Judging from the number of running codes, there should be no regression in performance.

#ifndef GOARM64_LSE     
  MOVBU   internal∕cpu·ARM64+const_offsetARM64HasATOMICS(SB), R4     
  CBZ     R4, load_store_loop 
#endif

Any ideas? Thank you.

@mauri870
Copy link
Member

Thanks for bringing this up. The And/Or atomic primitives were added very recently, there is not even the public sync/atomic apis exposed yet. I don't think we have any compiler optimized implementation yet, so that might relate with the regression.

@zhangfannie
Copy link
Contributor

zhangfannie commented Apr 17, 2024

@mauri870 Yes, the compiler does not intrinsify them. They are all implemented in assmbly codes. Take the implmentation of And64 as an example.

// func And64(addr *uint64, v uint64) old uint64
TEXT ·And64(SB), NOSPLIT, $0-24
	MOVD	ptr+0(FP), R0
	MOVD	val+8(FP), R1
#ifndef GOARM64_LSE
	MOVBU	internal∕cpu·ARM64+const_offsetARM64HasATOMICS(SB), R4
	CBZ 	R4, load_store_loop
#endif
	MVN 	R1, R2
	LDCLRALD	R2, (R0), R3
	MOVD	R3, ret+16(FP)
	RET
#ifndef GOARM64_LSE
load_store_loop:
	LDAXR	(R0), R2
	AND	R1, R2, R3
	STLXR	R3, (R0), R4
	CBNZ	R4, load_store_loop
	MOVD 	R2, ret+16(FP)
	RET
#endif

If GOARM64 is set to v8.1, then GOARM64_LSE is true, which will reduce two instructions and there should be no regression in performance. This is where I am confused.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arch-arm64 compiler/runtime Issues related to the Go compiler and/or runtime. FixPending Issues that have a fix which has not yet been reviewed or submitted.
Projects
None yet
Development

No branches or pull requests

6 participants