runtime: allocation performance worse on two socket server #47831

Open · gangdeng-intel opened this issue Aug 20, 2021 · 3 comments · May be fixed by #48236

gangdeng-intel commented Aug 20, 2021

What version of Go are you using (go version)?

$ go version
1.16.4

Does this issue reproduce with the latest release?

I did not try the latest release.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env

GO111MODULE="on"
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/root/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/root/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/lib/golang"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/golang/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build394973932=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I ran a Go-based application (TiDB v5.1, compiled with Go 1.16.4) on a two-socket server (24 cores per socket) and compared its performance with running on one socket (using numactl to bind the application to a single socket).

What did you expect to see?

The application should perform better on two sockets than on one socket, since two sockets provide twice as many CPU cores.

What did you see instead?

Instead, the application performed worse on two sockets than on a single socket. Specifically, it reached only 93% of its single-socket performance when running on two sockets.

I used perf top to check the application's hotspots. The top hot function is runtime.heapBitsSetType, which accounts for 6.45% of CPU time on one socket but 17.25% on two sockets.
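For context, runtime.heapBitsSetType writes heap bitmap bits for every allocation of a pointer-containing object, so any allocation-heavy workload exercises it heavily. Below is a minimal sketch of such a workload (a hypothetical stand-in, not TiDB code) that could be run under perf top, with and without numactl binding, to compare one socket against two:

package main

import (
	"runtime"
	"sync"
)

// node contains a pointer, so allocating it goes through the heap-bitmap
// path (runtime.heapBitsSetType in Go 1.16).
type node struct {
	next  *node
	value [4]int64
}

var sink *node // keeps a result reachable so the allocations are not optimized away

func churn(n int) {
	var head *node
	for i := 0; i < n; i++ {
		head = &node{next: head}
		if i%1024 == 0 {
			head = nil // drop the chain periodically so the GC can reclaim it
		}
	}
	sink = head
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			churn(20_000_000)
		}()
	}
	wg.Wait()
}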

Then I used perf c2c to check whether the performance degradation was caused by cache-line sharing. With that tool, I found that runtime.heapBitsSetType is the major source of HITM (loads that hit a cache line modified by another core), and that the stores to the contended line come mainly from runtime.(*mspan).sweep (79.16%), runtime.(*mheap).reclaim (13.34%), and runtime.sweepone (7.48%). More detailed data below.

------ HITM ------   Store Refs    ----- CL -----
 RmtHitm   LclHitm    L1 Hit        Off   Node  PA Cnt   Function
   2.03%     0.38%    79.16%        0x0   1     1        runtime.(*mspan).sweep
   0.86%     0.07%     7.48%        0x30  1     1        runtime.sweepone
   0.18%     0.03%    13.34%        0x30  1     1        runtime.(*mheap).reclaim
  89.98%    92.52%     0.00%        0x38  1     1        runtime.heapBitsSetType

Then I located the corresponding lines of code in those functions:
For runtime.heapBitsSetType, the line is "ha := mheap_.arenas[arena.l1()][arena.l2()]" (in heapBitsForAddr in mbitmap.go).
For runtime.(*mspan).sweep, the line is "atomic.Xadd64(&mheap_.pagesSwept, int64(s.npages))" (in sweep in mgcsweep.go).
For runtime.sweepone, the line is "atomic.Xadduintptr(&mheap_.reclaimCredit, npages)" (in sweepone in mgcsweep.go).
For runtime.(*mheap).reclaim, the line is "if atomic.Casuintptr(&h.reclaimCredit, credit, credit-take) {" (in reclaim in mheap.go).

From that code, I located the related variables defined in the mheap struct: pagesSwept, reclaimCredit, and arenas. Assuming pagesSwept starts a cache line at offset 0x00, reclaimCredit falls at 0x30 and arenas at 0x38, which exactly matches the cache-line offsets (Off) reported by perf c2c:

0x00 pagesSwept uint64
0x08 pagesSweptBasis uint64
0x10 sweepHeapLiveBasis uint64
0x18 sweepPagesPerByte float64
0x20 scavengeGoal uint64
0x28 reclaimIndex uint64
0x30 reclaimCredit uintptr
0x38 arenas [1 << arenaL1Bits]*[1 << arenaL2Bits]*heapArena
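
As a sanity check of these offsets, here is a minimal standalone sketch (not the actual runtime code; heapArena and the arenas array dimensions are stubbed) that mirrors the field order and prints the offsets with unsafe.Offsetof on a 64-bit platform:

package main

import (
	"fmt"
	"unsafe"
)

// Stub standing in for the runtime's heapArena; only the pointer size matters here.
type heapArena struct{}

// mockMheap mirrors the order and sizes of the relevant mheap fields.
type mockMheap struct {
	pagesSwept         uint64
	pagesSweptBasis    uint64
	sweepHeapLiveBasis uint64
	sweepPagesPerByte  float64
	scavengeGoal       uint64
	reclaimIndex       uint64
	reclaimCredit      uintptr
	arenas             [1]*[1]*heapArena // array dimensions stubbed
}

func main() {
	var h mockMheap
	fmt.Printf("pagesSwept    at 0x%02x\n", unsafe.Offsetof(h.pagesSwept))    // 0x00
	fmt.Printf("reclaimCredit at 0x%02x\n", unsafe.Offsetof(h.reclaimCredit)) // 0x30
	fmt.Printf("arenas        at 0x%02x\n", unsafe.Offsetof(h.arenas))        // 0x38
}

With a 64-byte cache line, all three fields fall within the same line.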

Based on this analysis, I think the performance issue on the two-socket server can be addressed either by adding padding before arenas or by reordering the related field definitions in the mheap struct so that the heavily-written counters and arenas end up on different cache lines.
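
Extending the sketch above (illustration only, assuming the same heapArena stub; the real change would go into the runtime's mheap definition), the padding variant could look like:

// Pad so that the frequently-written sweep/reclaim counters and the
// read-mostly arenas pointer table no longer share a 64-byte cache line.
type mockMheapPadded struct {
	pagesSwept         uint64
	pagesSweptBasis    uint64
	sweepHeapLiveBasis uint64
	sweepPagesPerByte  float64
	scavengeGoal       uint64
	reclaimIndex       uint64
	reclaimCredit      uintptr
	_                  [64]byte // padding pushes arenas onto a different cache line
	arenas             [1]*[1]*heapArena
}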

@mknyszek mknyszek changed the title Go runtime has performance issue on 2 socket server runtime: performance worse on two socket server Aug 20, 2021
@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Aug 20, 2021
@mknyszek mknyszek added this to the Backlog milestone Aug 20, 2021
@mknyszek (Contributor) commented:

This is somewhat expected. The Go runtime doesn't do anything special for NUMA nodes, so there's probably a lot of cross-socket traffic it's producing that actively makes things worse.

CC @prattmic @aclements

@mknyszek mknyszek changed the title runtime: performance worse on two socket server runtime: allocation performance worse on two socket server Aug 20, 2021
@gopherbot commented:

Change https://golang.org/cl/348230 mentions this issue: add 64 bytes padding to fix HITM issue across CPU sockets

@nightlyone (Contributor) commented:

Also known as False Sharing.
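
A minimal self-contained toy demonstrating the effect (an illustration, not code from the runtime): two goroutines update counters that either share a 64-byte cache line or are separated by padding, and the shared-line case is noticeably slower, especially across sockets.

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

const iters = 50_000_000

type sharedLine struct {
	a uint64
	b uint64 // adjacent to a: both counters live on the same cache line
}

type paddedLine struct {
	a uint64
	_ [56]byte // push b onto the next 64-byte cache line
	b uint64
}

// hammer increments the two counters from two goroutines and reports how long it takes.
func hammer(a, b *uint64) time.Duration {
	start := time.Now()
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for i := 0; i < iters; i++ {
			atomic.AddUint64(a, 1)
		}
	}()
	go func() {
		defer wg.Done()
		for i := 0; i < iters; i++ {
			atomic.AddUint64(b, 1)
		}
	}()
	wg.Wait()
	return time.Since(start)
}

func main() {
	var s sharedLine
	var p paddedLine
	fmt.Println("same cache line:", hammer(&s.a, &s.b))
	fmt.Println("padded:         ", hammer(&p.a, &p.b))
}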

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 7, 2022