
cmd/compile: poor register allocation with PGO #58298

Open
Deleplace opened this issue Feb 3, 2023 · 17 comments

@Deleplace
Contributor

Deleplace commented Feb 3, 2023

What version of Go are you using (go version)?

$ go version
go version go1.20 darwin/arm64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="arm64"
GOBIN=""
GOCACHE="/Users/deleplace/Library/Caches/go-build"
GOENV="/Users/deleplace/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/deleplace/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/deleplace/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_arm64"
GOVCS=""
GOVERSION="go1.20"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/deleplace/Documents/2023/02/03/go.mod"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/p4/g7fnpss96g5708_mmv3cxzgh0000gn/T/go-build3787014677=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I wanted to try the amazing new PGO (Profile-Guided Optimization) capability of Go 1.20.

For this I wrote a program that converts all the JPEG files in a directory to PNG. It is mostly CPU- and memory-bound, with some I/O to read and write the local filesystem.

I wanted to observe:

  • What % of perf gain PGO would provide,
  • If the perf gain would be consistent when running multiple times,
  • If the perf gain would be sensitive to the original sample inputs used to create the profiles (when executed with new inputs),
  • If the perf gain would be sensitive to the level of concurrency.

I tried 6 settings: {sequential, concurrent} × {3, 10, 82} images to process.

Source at https://github.com/Deleplace/pgo-test
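
As a rough idea of what the program exercises, the core conversion is essentially the following (a minimal sketch using only the standard library; not the exact code from the pgo-test repo, and the file paths are placeholders):

package main

import (
	"image/jpeg"
	"image/png"
	"os"
)

// convert decodes a JPEG file and re-encodes it as a PNG file.
// png.Encode drives compress/flate (the DEFLATE compressor).
func convert(srcPath, dstPath string) error {
	in, err := os.Open(srcPath)
	if err != nil {
		return err
	}
	defer in.Close()

	img, err := jpeg.Decode(in)
	if err != nil {
		return err
	}

	out, err := os.Create(dstPath)
	if err != nil {
		return err
	}
	defer out.Close()

	return png.Encode(out, img)
}

func main() {
	if err := convert(os.Args[1], os.Args[2]); err != nil {
		panic(err)
	}
}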

To create the 6 profiles easily, I executed main() inside a test:

for n in 3 10 82; do
	for conc in 0 1; do
		go test -cpuprofile=conc${conc}_${n}.prof -args -concurrent=$conc ./${n}images
	done
done

To create the 6 PGO-optimized executables:

for n in 3 10 82; do
	for conc in 0 1; do
		go build -pgo=conc${conc}_${n}.prof -o process_conc${conc}_${n}
	done
done
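
For comparison, the non-PGO baseline is simply the same package built without a profile (a sketch; -pgo defaults to off in Go 1.20, so the flag is shown only for clarity, and the output name is arbitrary):

go build -pgo=off -o process_nopgo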

What did you expect to see?

  • A 2-4% perf gain

What did you see instead?

  • A consistent ~5% perf loss

The PGO-enabled programs always seem a bit slower than the original program, regardless of the exact inputs used to produce the profiles.

@Deleplace changed the title affected/package: PGO makes my program slower on Feb 3, 2023
@ianlancetaylor changed the title from PGO makes my program slower to cmd/compile: PGO makes my program slower on Feb 3, 2023
@gopherbot added the compiler/runtime label (Issues related to the Go compiler and/or runtime) on Feb 3, 2023
@ianlancetaylor
Contributor

CC @golang/runtime

@prattmic self-assigned this on Feb 3, 2023
@prattmic added the NeedsInvestigation label (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one) on Feb 3, 2023
@prattmic added this to the Go1.21 milestone on Feb 3, 2023
@prattmic
Member

prattmic commented Feb 3, 2023

Thanks, I can reproduce this:

I modified the repo slightly to disable logging and turn it into a benchmark: Deleplace/pgo-test@main...prattmic:deleplace-pgo-test:prattmic

I used 82 copies of https://en.wikipedia.org/wiki/File:Example.jpg as the input.

$ go test -bench=. -count=10 -args -concurrent=1 ./82images | tee before.txt
$ go test -bench=. -count=10 -cpuprofile=before.pprof -args -concurrent=1 ./82images
$ go test -bench=. -count=10 -pgo=before.pprof -args -concurrent=1 ./82images | tee after.txt
$ benchstat before.txt after.txt
goos: linux
goarch: amd64
pkg: github.com/Deleplace/pgo-test
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
           │ before.txt  │             after.txt              │
           │   sec/op    │   sec/op     vs base               │
Process-12   299.0m ± 6%   320.1m ± 2%  +7.06% (p=0.001 n=10)

As an aside (I don't think this should affect PGO), your included profiles spend a lot of time in runtime.allocSpan -> runtime.sysUsed -> runtime.madvise. conc1_82.prof spends 44% of cycles in madvise! This might be #36603? (cc @mknyszek) (My profile doesn't show any time in madvise, but I'm on Linux, not Darwin)

@mknyszek
Contributor

mknyszek commented Feb 3, 2023

Thinking about this more, I think it's actually a Darwin-specific issue where newly-mapped memory has sysUsed called on it, but it's unnecessary (and maybe even incorrect). My best guess is Darwin is going down some slow error path. :( That would make this distinct from #36603.

If I'm right, what's really annoying is that Darwin goes against every other platform in this respect: we want to sysUsed every time we access not-paged-in memory except the first time.

@prattmic
Member

prattmic commented Feb 3, 2023

My before/after profiles (for 10x runs of the benchmark): profiles.tar.gz

After PGO, the benchmark takes ~3 extra CPU-seconds. ~2 of those are from flate.(*compressor).deflate, which had flate.(*compressor).findMatch inlined into it due to PGO.

I haven't looked closely yet, but I'm guessing inlining has some bad effects on further optimizations which result in a worse output.
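
One way to get this kind of per-function delta (a sketch, not necessarily the exact invocation used here; after.pprof stands in for the profile collected from the PGO build) is pprof's diff mode:

go tool pprof -top -diff_base=before.pprof after.pprof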

@Deleplace
Contributor Author

Thanks Michael and Michael, I wasn't aware yet that (1) Darwin was special here and (2) inlining could have bad effects on performance, besides increasing the binary size.

@Deleplace
Contributor Author

@mknyszek I relaunched my original scripts on Linux amd64, and observed the same ~5% perf loss. My best guess is that the cause is not specific to Darwin.

@Deleplace
Contributor Author

Deleplace commented Feb 6, 2023

@prattmic I can confirm your awesome findings on my Linux amd64:

  • a large part of the extra seconds is spent in flate.(*compressor).deflate
  • inlining kicks in on findMatch only after PGO
  • findMatch has a cost of 208, which exceeds the standard inlining budget of 80 but may still get inlined thanks to PGO, as shown below.
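
One way to inspect that decision (a sketch, assuming the Go 1.20 toolchain and one of the profiles above; not necessarily how the numbers here were obtained) is to have the compiler print its inlining analysis for compress/flate during a PGO build:

go build -pgo=conc1_82.prof -gcflags=compress/flate=-m=2 . 2>&1 | grep findMatch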

The binary size after PGO remains almost the same, so it does not seem that inlining has occurred recklessly at too many call sites.

We do seem to spend 200% of the normal time inside matchLen. matchLen has always been inlined, but it could be that the extra upstream inlining of findMatch confuses the compiler about either BCE (bounds check elimination), or about its decision to pass the slice arguments by reference or by value, given that the slice args are only read, never written to.
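
For reference, matchLen is tiny; as reconstructed from the source lines visible in the annotated disassembly later in this thread, it is essentially:

// matchLen returns the number of matching bytes in a and b
// up to length 'max'. Both slices must be at least 'max' bytes in size.
func matchLen(a, b []byte, max int) int {
	a = a[:max]
	b = b[:len(a)]
	for i, av := range a {
		if b[i] != av {
			return i
		}
	}
	return len(a)
}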

@mknyszek
Contributor

mknyszek commented Feb 6, 2023

To be clear, what I mentioned has nothing to do with PGO; Darwin isn't special for PGO. I was just investigating something else that looked off in the profile you provided.

@prattmic
Member

prattmic commented Feb 6, 2023

I spent a while today looking at profiles and binary diffs (which was fun!) for deflate/findMatch.

In summary:

  • The hottest part of this code (by number of times basic blocks are executed) is this loop, in the case where all of the if conditions in the loop body are false.
    • Both the before and after PGO versions of this case have exactly 26 instructions (excluding one NOP that appears only in the before case). The code (shown below) is fairly different due to different spills.
  • The next hottest case has the first if in the loop taken, and then the hot portion is this loop in the matchLen call.
    • matchLen is inlined both before and after PGO.
    • There are around 50% as many total iterations of this loop as there are of the findMatch loop.
    • This loop got proportionally much longer (code below): 9 instructions before PGO, 14 after. All the new instructions are related to spills (before, 2 instructions load from the stack; after, 6 load from the stack and one stores to the stack).

findMatch loop:

Before:

Samples|
   624 │ c5:   mov    0x28(%rsp),%rcx
   676 │       dec    %rcx
       │     return
   660 │       mov    %r11,0x58(%rsp)
       │     for i := prevHead; tries > 0; tries-- {
   664 │       mov    %rcx,%r11
       │     if wEnd == win[i+length] {
   660 │       mov    %r15,%rcx
       │     // matchLen returns the number of matching bytes in a and b
       │     // up to length 'max'. Both slices must be at least 'max'
       │     // bytes in size.
       │     func matchLen(a, b []byte, max int) int {
       │     a = a[:max]
       │     b = b[:len(a)]
   659 │       mov    0x20(%rsp),%r15
       │     compress/flate.matchLen():
   673 │       nop    
       │     compress/flate.(*compressor).findMatch():
       │     for i := prevHead; tries > 0; tries-- {
   681 │ e0:   test   %r11,%r11
   650 │     ↓ jle    170
   666 │       mov    %r11,0x28(%rsp)
       │     if wEnd == win[i+length] {
   654 │       lea    (%rcx,%rdi,1),%r11
   672 │       cmp    %r9,%r11
   678 │     ↓ jae    2b1
   646 │       movzbl (%r8,%r11,1),%r11d
   664 │       cmp    %r10b,%r11b
   692 │     ↓ jne    13f
...    
   652 │13f:   mov    0x58(%rsp),%r11
       │     if i == minIndex {
   670 │144:   cmp    %rsi,%rcx
   691 │     ↓ je     194
       │     i = int(d.hashPrev[i&windowMask]) - d.hashOffset
   655 │       and    $0x7fff,%ecx
   688 │       mov    0x80068(%rax,%rcx,4),%r15d
   627 │       sub    0xa0068(%rax),%r15
   696 │       xchg   %ax,%ax
       │     if i < minIndex || i < 0 {
   622 │       cmp    %rsi,%r15
   655 │     ↓ jl     18a
   690 │       test   %r15,%r15
   679 │     ↑ jge    c5

After:

Samples|
   722 │4b1bde:   dec    %rdi 
       │       compress/flate.(*compressor).deflate():
       │       if newLength, newOffset, ok := d.findMatch(d.index, d.chainHead-d.hashOffset, minMatchLength-1, lookahead); ok {
   638 │4b1be1:   mov    %r11,0x88(%rsp)
   653 │4b1be9:   mov    %r8,0x70(%rsp)
       │       if b[i] != av {
   618 │4b1bee:   mov    0xd8(%rsp),%r8
       │       compress/flate.(*compressor).findMatch():
       │       if n > length && (n > minMatchLength || pos-i <= 4096) {
   700 │4b1bf6:   mov    0xc8(%rsp),%r11
   680 │4b1bfe:   xchg   %ax,%ax
       │       for i := prevHead; tries > 0; tries-- {
   658 │4b1c00:   test   %rdi,%rdi
   684 │4b1c03: ↓ jle    4b1ccb <compress/flate.(*compressor).deflate+0xa0b>
       │       if wEnd == win[i+length] {
   644 │4b1c09:   mov    0x88(%rsp),%r11
   667 │4b1c11:   lea    0x0(%r13,%r11,1),%r8
   700 │4b1c16:   cmp    %r10,%r8
   681 │4b1c19: ↓ jae    4b1e86 <compress/flate.(*compressor).deflate+0xbc6>
   699 │4b1c1f:   movzbl (%r8,%r12,1),%r8d
   619 │4b1c24:   cmp    %r9b,%r8b
   663 │4b1c27: ↓ jne    4b1c9b <compress/flate.(*compressor).deflate+0x9db>
...
   591 │4b1c9b:   mov    0x70(%rsp),%r8
       │       compress/flate.(*compressor).findMatch():
       │       if i == minIndex {
   640 │4b1ca0:   cmp    %rcx,%r13
   656 │4b1ca3: ↓ je     4b1d0a <compress/flate.(*compressor).deflate+0xa4a>
       │       i = int(d.hashPrev[i&windowMask]) - d.hashOffset
   724 │4b1ca5:   and    $0x7fff,%r13d
   668 │4b1cac:   mov    0x80068(%rax,%r13,4),%r13d
   705 │4b1cb4:   sub    %r15,%r13
       │       if i < minIndex || i < 0 {
   595 │4b1cb7:   cmp    %rcx,%r13
   633 │4b1cba: ↓ jl     4b1cfd <compress/flate.(*compressor).deflate+0xa3d>
   625 │4b1cbc:   nopl   0x0(%rax)
   713 │4b1cc0:   test   %r13,%r13
   688 │4b1cc3: ↑ jge    4b1bde <compress/flate.(*compressor).deflate+0x91e>

matchLen loop:

Before:

Samples|
   358 │19e:   inc    %r11
       │     compress/flate.(*compressor).findMatch():
       │     if i == minIndex {
   344 │       mov    0x38(%rsp),%rsi
       │     compress/flate.matchLen():
       │     b = b[:len(a)]
   358 │       mov    0x20(%rsp),%r15
       │     for i, av := range a {
   394 │1ab:   cmp    %r12,%r11
   366 │     ↓ jge    1d4
   384 │       movzbl (%r11,%r13,1),%r15d
       │     if b[i] != av {
   413 │       movzbl (%r11,%rdx,1),%esi
   400 │       cmp    %r15b,%sil
   381 │     ↑ je     19e

After:

Samples|
   322 │4b1d17:   lea    0x1(%r15),%r12
       │       compress/flate.(*compressor).findMatch():
       │       wEnd = win[pos+n]
   330 │4b1d1b:   mov    0x50(%rsp),%r10
       │       if n > length && (n > minMatchLength || pos-i <= 4096) {
   336 │4b1d20:   mov    0x90(%rsp),%r13
       │       compress/flate.matchLen():
       │       b = b[:len(a)]
   343 │4b1d28:   mov    0x60(%rsp),%r15
       │       for i, av := range a {
   365 │4b1d2d:   mov    %r12,0x48(%rsp)
       │       compress/flate.(*compressor).findMatch():
       │       wEnd = win[pos+n]
   354 │4b1d32:   mov    0xd0(%rsp),%r12
       │       compress/flate.matchLen():
       │       for i, av := range a {
   373 │4b1d3a:   mov    0x48(%rsp),%r15
   413 │4b1d3f:   nop
   354 │4b1d40:   cmp    %rdx,%r15
   368 │4b1d43: ↓ jge    4b1d79 <compress/flate.(*compressor).deflate+0xab9>
   414 │4b1d45:   movzbl (%r15,%r8,1),%r12d
       │       if b[i] != av {
   379 │4b1d4a:   mov    0xd8(%rsp),%r10
   373 │4b1d52:   movzbl (%r15,%r10,1),%r13d
   399 │4b1d57:   cmp    %r12b,%r13b
   411 │4b1d5a: ↑ je     4b1d17 <compress/flate.(*compressor).deflate+0xa57>

@prattmic
Member

prattmic commented Feb 6, 2023

At a high level, I'll say that this application isn't a great fit for PGO today because it isn't "diverse" enough. That is, it spends almost all of its time in a very small number of functions. Right now, PGO works best on programs with hundreds of hot functions, allowing it to accumulate small wins from many places.

That said, we don't want PGO causing regressions, so I would still like to get this back closer to no regression.

@prattmic
Member

prattmic commented Feb 6, 2023

I suspect that the second bullet above (the matchLen loop) would be more-or-less resolved if we hoisted the spill loads/stores out of the loop (is this in scope of #15808?).

The profile does not show the entry to matchLen as hot, which implies that this loop typically runs many iterations, which could amortize the cost of loading several stack values beforehand.

@randall77
Contributor

Yes, that's basically #15808.

@prattmic
Member

I spent more time on this today. While hoisting could perhaps help here, the ultimate problem seems to be completely useless loads.

Here is the final assembly of the core loop:

v588 00539 (+320) INCQ R10
v1091 00540 (256) MOVBLZX compress/flate.wEnd-150(SP), R12
v1096 00541 (267) MOVQ compress/flate.win.len-136(SP), R15
v734 00542 (320) MOVQ R10, compress/flate.~R0-144(SP)
v891 00543 (319) MOVQ compress/flate.wPos.cap-120(SP), R10
// end of b103
v481 00544 (+320) MOVQ compress/flate.~R0-144(SP), R10
v343 00545 (320) CMPQ R10, R9
     00546 (320) JGE 556
// end of b101
v574 00547 (320) MOVBLZX (R10)(R11*1), R15
v214 00548 (+321) MOVBLZX (R10)(R8*1), R12
v337 00549 (321) CMPB R12, R15
     00550 (321) JEQ 539
// end of b107

[screenshot of the SSA/regalloc output, including predecessor blocks b95 and b99, omitted]

The loads of R12, R15, and R10 at 540, 541, and 543 are useless; they are never used. In this analysis I looked specifically at the R10 load which is obviously useless without any extra context, as it is immediately overwritten by the next instruction.

Why are we loading this?

Here are the basic blocks after regalloc:

b101: ← b99 b103
  v565 (320) = Phi <int> v941 v734 : ~R0[int] (i[int], ~R0[int])
  v481 (320) = LoadReg <int> v565 : R10
  v343 (+320) = CMPQ <flags> v481 v976
LT v343 → b107 b104 (likely) (320)

b107: ← b101
  v574 (320) = MOVBloadidx1 <byte> v481 v531 v847 : R15 (av[byte])
  v214 (321) = MOVBloadidx1 <uint8> v481 v494 v847 : R12
  v337 (+321) = CMPB <flags> v214 v574
NE v337 → b106 b103 (unlikely) (321)

b103: ← b107
  v588 (+320) = ADDQconst <int> [1] v481 : R10
  v1091 (256) = LoadReg <byte> v489 : R12
  v1096 (267) = LoadReg <int> v482 : R15
  v734 (320) = StoreReg <int> v588 : ~R0[int]
  v891 (319) = LoadReg <int> v1187 : R10
Plain → b101 (320)

And just before regalloc:

b101: ← b99 b103
  v565 (320) = Phi <int> v926 v588 (~R0[int], i[int])
  v343 (+320) = CMPQ <flags> v565 v416
LT v343 → b107 b104 (likely) (320)

b107: ← b101
  v574 (320) = MOVBloadidx1 <byte> v565 v531 v847 (av[byte])
  v214 (321) = MOVBloadidx1 <uint8> v565 v494 v847
  v337 (+321) = CMPB <flags> v214 v574
NE v337 → b106 b103 (unlikely) (321)

b103: ← b107
  v588 (+320) = ADDQconst <int> [1] v565
Plain → b101 (320)

(I've omitted b101 predecessor b99, but you can see it in the screenshot).

From this diff it isn't clear why b103 adds a load to R10. We can get more context from -gcflags=compress/flate=-d=ssa/regalloc/debug=100:

edge b103->b101
...
src R10: v588 cache=v588
...
dst R10: v490
...
moving v588 to ~R0[int]
sources of v588: R10:v588 [use R10]
v734 = StoreReg <int> v588 : ~R0[int]
v588 now available in ~R0[int]:v734
moving v490 to R10
sources of v490: wPos.cap[int]:v1187 [use wPos.cap[int]]
v588 no longer available in R10:v588
v891 = LoadReg <int> v1187 : R10
v490 now available in R10:v891

This says that on entry (src) this block has v588 in R10, and on exit (dst), it should have v490 in R10. Thus v588 gets spilled to the stack and v490 is loaded from its stack slot (v490 happens to be the capacity of b).

Why does this block think it needs v490 in R10? The expected dst values come from the src values of the successor block. Taking a look at successor b101, we see:

Begin processing block b101
...
starting merge block b101 with end state of b99:
...
  R10: orig:v490 cache:v1138
...
value v343 = CMPQ <flags> v565 v416
  out:
  in0:
  in1: CX
freeReg R10 (dump v490/v1138)
assignReg R10 v565/v481
...
live-at-end spill for v490 at b101

b101 wants v490 in R10 on entry because its other predecessor, b99, ends with v490 in R10 (because b99's predecessor b95 loaded it into R10; you can see this in the screenshot above).

However, b101 proceeds to immediately dump v490 and use R10 for its own purpose. Note that b101 doesn't need to write R10 back to the v490 stack slot (v1138). I think this is because at this point every SSA value still has its own "stack slot", so SSA semantics still apply, but I'm not certain.

It seems to me that it should be possible for regalloc to remove R10 from its requested src state if it knows it will dump the register without needing a spill store.

cc @dr2chase

@gopherbot

Change https://go.dev/cl/471158 mentions this issue: cmd/compile/internal/ssa: drop overwritten regalloc basic block input requirements

gopherbot pushed a commit that referenced this issue Mar 17, 2023
cmd/compile/internal/ssa: drop overwritten regalloc basic block input requirements

For the following description, consider the following basic block graph:

      b1 ───┐┌──── b2
            ││
            ││
            ▼▼
            b3

For register allocator transitions between basic blocks, there are two
key passes (significant paraphrasing):

First, each basic block is visited in some predetermined visit order.
This is the core visitOrder range loop in regAllocState.regalloc. The
specific ordering heuristics aren't important here, except that the
order guarantees that when visiting a basic block at least one of its
predecessors has already been visited.

Upon visiting a basic block, that block sets its expected starting
register state (regAllocState.startRegs) based on the ending register
state (regAlloc.State.endRegs) of one of its predecessors. (How it
chooses which predecessor to use is not important here.)

From that starting state, registers are assigned for all values in the
block, ultimately resulting in some ending register state.

After all blocks have been visited, the shuffle pass
(regAllocState.shuffle) ensures that for each edge, endRegs of the
predecessor == startRegs of the successor. That is, it makes sure that
the startRegs assumptions actually hold true for each edge. It does this
by adding moves to the end of the predecessor block to place values in
the expected register for the successor block. These may be moves from
other registers, or from memory if the value is spilled.

Now on to the actual problem:

Assume that b1 places some value v1 into register R10, and thus ends
with endRegs containing R10 = v1.

When b3 is visited, it selects b1 as its model predecessor and sets
startRegs with R10 = v1.

b2 does not have v1 in R10, so later in the shuffle pass, we will add a
move of v1 into R10 to the end of b2 to ensure it is available for b3.

This is all perfectly fine and exactly how things should work.

Now suppose that b3 does not use v1. It does need to use some other
value v2, which is not currently in a register. When assigning v2 to a
register, it finds all registers are already in use and it needs to dump
a value. Ultimately, it decides to dump v1 from R10 and replace it with
v2.

This is fine, but it has downstream effects on shuffle in b2. b3's
startRegs still state that R10 = v1, so b2 will add a move to R10 even
though b3 will unconditionally overwrite it. i.e., the move at the end
of b2 is completely useless and can result in code like:

// end of b2
MOV n(SP), R10 // R10 = v1 <-- useless
// start of b3
MOV m(SP), R10 // R10 = v2

This is precisely what happened in #58298.

This CL addresses this problem by dropping registers from startRegs if
they are never used in the basic block prior to getting dumped. This
allows the shuffle pass to avoid placing those useless values into the
register.

There is a significant limitation to this CL, which is that it only
impacts the immediate predecessors of an overwriting block. We can
discuss this by zooming out a bit on the previous graph:

b4 ───┐┌──── b5
      ││
      ││
      ▼▼
      b1 ───┐┌──── b2
            ││
            ││
            ▼▼
            b3

Here we have the same graph, except we can see the two predecessors of
b1.

Now suppose that rather than b1 assigning R10 = v1 as above, the
assignment is done in b4. b1 has startRegs R10 = v1, doesn't use the
value at all, and simply passes it through to endRegs R10 = v1.

Now the shuffle pass will require both b2 and b5 to add a move to
assigned R10 = v1, because that is specified in their successor
startRegs.

With this CL, b3 drops R10 = v1 from startRegs, but there is no
backwards propagation, so b1 still has R10 = v1 in startRegs, and b5
still needs to add a useless move.

Extending this CL with such propagation may significantly increase the
number of useless moves we can remove, though it will add complexity to
maintenance and could potentially impact build performance depending on
how efficiently we could implement the propagation (something I haven't
considered carefully).

As-is, this optimization does not impact much code. In bent .text size
geomean is -0.02%. In the container/heap test binary, 18 of ~2500
functions are impacted by this CL. Bent and sweet do not show a
noticeable performance impact one way or another, however #58298 does
show a case where this can have impact if the useless instructions end
up in the hot path of a tight loop.

For #58298.

Change-Id: I2fcef37c955159d068fa0725f995a1848add8a5f
Reviewed-on: https://go-review.googlesource.com/c/go/+/471158
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: David Chase <drchase@google.com>
@gopherbot

Change https://go.dev/cl/484215 mentions this issue: cmd/compile: expose ir.Func to ssa

gopherbot pushed a commit that referenced this issue Apr 20, 2023
ssagen.ssafn already holds the ir.Func, and ssa.Frontend.SetWBPos and
ssa.Frontend.Lsym are simple wrappers around parts of the ir.Func.

Expose the ir.Func through ssa.Frontend, allowing us to remove these
wrapper methods and allowing future access to additional features of the
ir.Func if needed.

While we're here, drop ssa.Frontend.Line, which is unused.

For #58298.

Change-Id: I30c4cbd2743e9ad991d8c6b388484a7d1e95f3ae
Reviewed-on: https://go-review.googlesource.com/c/go/+/484215
Auto-Submit: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
@prattmic changed the title from cmd/compile: PGO makes my program slower to cmd/compile: poor register allocation with PG on Jun 5, 2023
@prattmic changed the title from cmd/compile: poor register allocation with PG to cmd/compile: poor register allocation with PGO on Jun 5, 2023
@prattmic
Member

prattmic commented Jun 5, 2023

https://go.dev/cl/471158 targeted one of the problems in this workload, but it didn't make a huge difference.

Here is the current status:

goos: linux
goarch: amd64
pkg: github.com/Deleplace/pgo-test
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
           │ 120_nopgo.txt │           121_nopgo.txt            │            120_pgo.txt             │            121_pgo.txt             │
           │    sec/op     │   sec/op     vs base               │   sec/op     vs base               │   sec/op     vs base               │
Process-12     294.8m ± 1%   287.2m ± 1%  -2.57% (p=0.000 n=10)   319.9m ± 1%  +8.51% (p=0.000 n=10)   310.6m ± 1%  +5.38% (p=0.000 n=10)

Comparing just Go 1.21, we see there is still an ~8% regression with PGO:

goos: linux
goarch: amd64
pkg: github.com/Deleplace/pgo-test
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
           │ 121_nopgo.txt │            121_pgo.txt             │
           │    sec/op     │   sec/op     vs base               │
Process-12     287.2m ± 1%   310.6m ± 1%  +8.16% (p=0.000 n=10)

@prattmic modified the milestones: Go1.21, Go1.22 on Jun 5, 2023
@Deleplace
Contributor Author

Thank you Michael

@gopherbot removed this from the Go1.22 milestone on Feb 6, 2024
@gopherbot added this to the Go1.23 milestone on Feb 6, 2024