
cmd/compile: poor register allocation with PGO #58298

Open
Deleplace opened this issue Feb 3, 2023 · 17 comments

@Deleplace
Contributor

Deleplace commented Feb 3, 2023

What version of Go are you using (go version)?

$ go version
go version go1.20 darwin/arm64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="arm64"
GOBIN=""
GOCACHE="/Users/deleplace/Library/Caches/go-build"
GOENV="/Users/deleplace/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/deleplace/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/deleplace/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_arm64"
GOVCS=""
GOVERSION="go1.20"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/deleplace/Documents/2023/02/03/go.mod"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/p4/g7fnpss96g5708_mmv3cxzgh0000gn/T/go-build3787014677=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I wanted to try the amazing new PGO (Profile-Guided Optimization) capability of Go 1.20.

For this I wrote a program that converts all the JPEG files in a directory to PNG. It is mostly CPU- and memory-bound, with some I/O to read and write the local filesystem.

I wanted to observe:

  • What % of perf gain PGO would provide,
  • If the perf gain would be consistent when running multiple times,
  • If the perf gain would be sensitive to the original sample inputs used to create the profiles (when executed with new inputs),
  • If the perf gain would be sensitive to the level of concurrency.

I tried 6 settings: {sequential, concurrent} × {3, 10, 82} images to process.

Source at https://github.com/Deleplace/pgo-test
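
As a rough idea of what the program exercises, the core conversion is essentially the following (a minimal sketch using only the standard library; not the exact code from the pgo-test repo, and the file paths are placeholders):

package main

import (
	"image/jpeg"
	"image/png"
	"os"
)

// convert decodes a JPEG file and re-encodes it as a PNG file.
// png.Encode drives compress/flate (the DEFLATE compressor).
func convert(srcPath, dstPath string) error {
	in, err := os.Open(srcPath)
	if err != nil {
		return err
	}
	defer in.Close()

	img, err := jpeg.Decode(in)
	if err != nil {
		return err
	}

	out, err := os.Create(dstPath)
	if err != nil {
		return err
	}
	defer out.Close()

	return png.Encode(out, img)
}

func main() {
	if err := convert(os.Args[1], os.Args[2]); err != nil {
		panic(err)
	}
}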

To create the 6 profiles easily, I executed main() inside a test:

for n in 3 10 82; do
	for conc in 0 1; do
		go test -cpuprofile=conc${conc}_${n}.prof -args -concurrent=$conc ./${n}images
	done
done

To create the 6 PGO-optimized executables:

for n in 3 10 82; do
	for conc in 0 1; do
		go build -pgo=conc${conc}_${n}.prof -o process_conc${conc}_${n}
	done
done
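
For comparison, the non-PGO baseline is simply the same package built without a profile (a sketch; -pgo defaults to off in Go 1.20, so the flag is shown only for clarity, and the output name is arbitrary):

go build -pgo=off -o process_nopgo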

What did you expect to see?

  • A 2-4% perf gain

What did you see instead?

  • A consistent ~5% perf loss

The PGO-enabled programs always seem a bit slower than the original program, regardless of the exact inputs used to produce the profiles.

@Deleplace changed the title affected/package: PGO makes my program slower on Feb 3, 2023
@ianlancetaylor changed the title from PGO makes my program slower to cmd/compile: PGO makes my program slower on Feb 3, 2023
@gopherbot added the compiler/runtime label (Issues related to the Go compiler and/or runtime) on Feb 3, 2023
@ianlancetaylor
Contributor

CC @golang/runtime

@prattmic self-assigned this on Feb 3, 2023
@prattmic added the NeedsInvestigation label (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one) on Feb 3, 2023
@prattmic added this to the Go1.21 milestone on Feb 3, 2023
@prattmic
Member

prattmic commented Feb 3, 2023

Thanks, I can reproduce this:

I modified the repo slightly to disable logging and turn it into a benchmark: Deleplace/pgo-test@main...prattmic:deleplace-pgo-test:prattmic

I used 82 copies of https://en.wikipedia.org/wiki/File:Example.jpg as the input.

$ go test -bench=. -count=10 -args -concurrent=1 ./82images | tee before.txt
$ go test -bench=. -count=10 -cpuprofile=before.pprof -args -concurrent=1 ./82images
$ go test -bench=. -count=10 -pgo=before.pprof -args -concurrent=1 ./82images | tee after.txt
$ benchstat before.txt after.txt
goos: linux
goarch: amd64
pkg: github.com/Deleplace/pgo-test
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
           │ before.txt  │             after.txt              │
           │   sec/op    │   sec/op     vs base               │
Process-12   299.0m ± 6%   320.1m ± 2%  +7.06% (p=0.001 n=10)

As an aside (I don't think this should affect PGO), your included profiles spend a lot of time in runtime.allocSpan -> runtime.sysUsed -> runtime.madvise. conc1_82.prof spends 44% of cycles in madvise! This might be #36603? (cc @mknyszek) (My profile doesn't show any time in madvise, but I'm on Linux, not Darwin)

@mknyszek
Contributor

mknyszek commented Feb 3, 2023

Thinking about this more, I think it's actually a Darwin-specific issue where newly-mapped memory has sysUsed called on it, but it's unnecessary (and maybe even incorrect). My best guess is Darwin is going down some slow error path. :( That would make this distinct from #36603.

If I'm right, what's really annoying is that Darwin goes against every other platform in this respect: we want to sysUsed every time we access not-paged-in memory except the first time.

@prattmic
Member

prattmic commented Feb 3, 2023

My before/after profiles (for 10x runs of the benchmark): profiles.tar.gz

After PGO, the benchmark takes ~3 extra CPU-seconds. ~2 of those are from flate.(*compressor).deflate, which had flate.(*compressor).findMatch inlined into it due to PGO.

I haven't looked closely yet, but I'm guessing inlining has some bad effects on further optimizations which result in a worse output.
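
One way to get this kind of per-function delta (a sketch, not necessarily the exact invocation used here; after.pprof stands in for the profile collected from the PGO build) is pprof's diff mode:

go tool pprof -top -diff_base=before.pprof after.pprof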

@Deleplace
Contributor Author

Thanks Michael and Michael, I wasn't aware yet that (1) Darwin was special here and (2) inlining could have bad effects on performance, besides increasing the binary size.

@Deleplace
Contributor Author

@mknyszek I relaunched my original scripts on Linux amd64, and observed the same ~5% perf loss. My best guess is that the cause is not specific to Darwin.

@Deleplace
Contributor Author

Deleplace commented Feb 6, 2023

@prattmic I can confirm your awesome findings on my Linux amd64:

  • a large part of the extra seconds is spent in flate.(*compressor).deflate
  • inlining kicks in on findMatch only after PGO
  • findMatch has a cost of 208, which exceeds the standard inlining budget of 80 but may still get inlined thanks to PGO, as shown below.
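
One way to inspect that decision (a sketch, assuming the Go 1.20 toolchain and one of the profiles above; not necessarily how the numbers here were obtained) is to have the compiler print its inlining analysis for compress/flate during a PGO build:

go build -pgo=conc1_82.prof -gcflags=compress/flate=-m=2 . 2>&1 | grep findMatch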

The binary size after PGO remains almost the same, so it does not seem that inlining has occurred recklessly at too many call sites.

We do seem to spend 200% of the normal time inside matchLen. matchLen has always been inlined, but it could be that the extra upstream inlining of findMatch confuses the compiler about either BCE (bounds check elimination), or about its decision to pass the slice arguments by reference or by value, given that the slice args are only read, never written to.
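
For reference, matchLen is tiny; as reconstructed from the source lines visible in the annotated disassembly later in this thread, it is essentially:

// matchLen returns the number of matching bytes in a and b
// up to length 'max'. Both slices must be at least 'max' bytes in size.
func matchLen(a, b []byte, max int) int {
	a = a[:max]
	b = b[:len(a)]
	for i, av := range a {
		if b[i] != av {
			return i
		}
	}
	return len(a)
}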

@mknyszek
Contributor

mknyszek commented Feb 6, 2023

To be clear, what I mentioned has nothing to do with PGO; Darwin isn't special for PGO. I was just investigating something else that looked off in the profile you provided.

@prattmic
Member

prattmic commented Feb 6, 2023

I spent a while today looking at profiles and binary diffs (which was fun!) for deflate/findMatch.

In summary:

  • The hottest part of this code (by number of times basic blocks are executed) is this loop, in the case where all of the if conditions in the loop body are false.
    • Both the before and after PGO versions of this case have exactly 26 instructions (excluding one NOP that appears only in the before case). The code (shown below) is fairly different due to different spills.
  • The next hottest case has the first if in the loop taken, and then the hot portion is this loop in the matchLen call.
    • matchLen is inlined both before and after PGO.
    • There are around 50% as many total iterations of this loop as there are of the findMatch loop.
    • This loop got proportionally much longer (code below): 9 instructions before PGO, 14 after. All the new instructions are related to spills (before, 2 instructions load from the stack; after, 6 load from the stack and one stores to the stack).

findMatch loop:

Before:

Samples|
   624 │ c5:   mov    0x28(%rsp),%rcx
   676 │       dec    %rcx
       │     return
   660 │       mov    %r11,0x58(%rsp)
       │     for i := prevHead; tries > 0; tries-- {
   664 │       mov    %rcx,%r11
       │     if wEnd == win[i+length] {
   660 │       mov    %r15,%rcx
       │     // matchLen returns the number of matching bytes in a and b
       │     // up to length 'max'. Both slices must be at least 'max'
       │     // bytes in size.
       │     func matchLen(a, b []byte, max int) int {
       │     a = a[:max]
       │     b = b[:len(a)]
   659 │       mov    0x20(%rsp),%r15
       │     compress/flate.matchLen():
   673 │       nop    
       │     compress/flate.(*compressor).findMatch():
       │     for i := prevHead; tries > 0; tries-- {
   681 │ e0:   test   %r11,%r11
   650 │     ↓ jle    170
   666 │       mov    %r11,0x28(%rsp)
       │     if wEnd == win[i+length] {
   654 │       lea    (%rcx,%rdi,1),%r11
   672 │       cmp    %r9,%r11
   678 │     ↓ jae    2b1
   646 │       movzbl (%r8,%r11,1),%r11d
   664 │       cmp    %r10b,%r11b
   692 │     ↓ jne    13f
...    
   652 │13f:   mov    0x58(%rsp),%r11
       │     if i == minIndex {
   670 │144:   cmp    %rsi,%rcx
   691 │     ↓ je     194
       │     i = int(d.hashPrev[i&windowMask]) - d.hashOffset
   655 │       and    $0x7fff,%ecx
   688 │       mov    0x80068(%rax,%rcx,4),%r15d
   627 │       sub    0xa0068(%rax),%r15
   696 │       xchg   %ax,%ax
       │     if i < minIndex || i < 0 {
   622 │       cmp    %rsi,%r15
   655 │     ↓ jl     18a
   690 │       test   %r15,%r15
   679 │     ↑ jge    c5

After:

Samples|
   722 │4b1bde:   dec    %rdi 
       │       compress/flate.(*compressor).deflate():
       │       if newLength, newOffset, ok := d.findMatch(d.index, d.chainHead-d.hashOffset, minMatchLength-1, lookahead); ok {
   638 │4b1be1:   mov    %r11,0x88(%rsp)
   653 │4b1be9:   mov    %r8,0x70(%rsp)
       │       if b[i] != av {
   618 │4b1bee:   mov    0xd8(%rsp),%r8
       │       compress/flate.(*compressor).findMatch():
       │       if n > length && (n > minMatchLength || pos-i <= 4096) {
   700 │4b1bf6:   mov    0xc8(%rsp),%r11
   680 │4b1bfe:   xchg   %ax,%ax
       │       for i := prevHead; tries > 0; tries-- {
   658 │4b1c00:   test   %rdi,%rdi
   684 │4b1c03: ↓ jle    4b1ccb <compress/flate.(*compressor).deflate+0xa0b>
       │       if wEnd == win[i+length] {
   644 │4b1c09:   mov    0x88(%rsp),%r11
   667 │4b1c11:   lea    0x0(%r13,%r11,1),%r8
   700 │4b1c16:   cmp    %r10,%r8
   681 │4b1c19: ↓ jae    4b1e86 <compress/flate.(*compressor).deflate+0xbc6>
   699 │4b1c1f:   movzbl (%r8,%r12,1),%r8d
   619 │4b1c24:   cmp    %r9b,%r8b
   663 │4b1c27: ↓ jne    4b1c9b <compress/flate.(*compressor).deflate+0x9db>
...
   591 │4b1c9b:   mov    0x70(%rsp),%r8
       │       compress/flate.(*compressor).findMatch():
       │       if i == minIndex {
   640 │4b1ca0:   cmp    %rcx,%r13
   656 │4b1ca3: ↓ je     4b1d0a <compress/flate.(*compressor).deflate+0xa4a>
       │       i = int(d.hashPrev[i&windowMask]) - d.hashOffset
   724 │4b1ca5:   and    $0x7fff,%r13d
   668 │4b1cac:   mov    0x80068(%rax,%r13,4),%r13d
   705 │4b1cb4:   sub    %r15,%r13
       │       if i < minIndex || i < 0 {
   595 │4b1cb7:   cmp    %rcx,%r13
   633 │4b1cba: ↓ jl     4b1cfd <compress/flate.(*compressor).deflate+0xa3d>
   625 │4b1cbc:   nopl   0x0(%rax)
   713 │4b1cc0:   test   %r13,%r13
   688 │4b1cc3: ↑ jge    4b1bde <compress/flate.(*compressor).deflate+0x91e>

matchLen loop:

Before:

Samples|
   358 │19e:   inc    %r11
       │     compress/flate.(*compressor).findMatch():
       │     if i == minIndex {
   344 │       mov    0x38(%rsp),%rsi
       │     compress/flate.matchLen():
       │     b = b[:len(a)]
   358 │       mov    0x20(%rsp),%r15
       │     for i, av := range a {
   394 │1ab:   cmp    %r12,%r11
   366 │     ↓ jge    1d4
   384 │       movzbl (%r11,%r13,1),%r15d
       │     if b[i] != av {
   413 │       movzbl (%r11,%rdx,1),%esi
   400 │       cmp    %r15b,%sil
   381 │     ↑ je     19e

After:

Samples|
   322 │4b1d17:   lea    0x1(%r15),%r12
       │       compress/flate.(*compressor).findMatch():
       │       wEnd = win[pos+n]
   330 │4b1d1b:   mov    0x50(%rsp),%r10
       │       if n > length && (n > minMatchLength || pos-i <= 4096) {
   336 │4b1d20:   mov    0x90(%rsp),%r13
       │       compress/flate.matchLen():
       │       b = b[:len(a)]
   343 │4b1d28:   mov    0x60(%rsp),%r15
       │       for i, av := range a {
   365 │4b1d2d:   mov    %r12,0x48(%rsp)
       │       compress/flate.(*compressor).findMatch():
       │       wEnd = win[pos+n]
   354 │4b1d32:   mov    0xd0(%rsp),%r12
       │       compress/flate.matchLen():
       │       for i, av := range a {
   373 │4b1d3a:   mov    0x48(%rsp),%r15
   413 │4b1d3f:   nop
   354 │4b1d40:   cmp    %rdx,%r15
   368 │4b1d43: ↓ jge    4b1d79 <compress/flate.(*compressor).deflate+0xab9>
   414 │4b1d45:   movzbl (%r15,%r8,1),%r12d
       │       if b[i] != av {
   379 │4b1d4a:   mov    0xd8(%rsp),%r10
   373 │4b1d52:   movzbl (%r15,%r10,1),%r13d
   399 │4b1d57:   cmp    %r12b,%r13b
   411 │4b1d5a: ↑ je     4b1d17 <compress/flate.(*compressor).deflate+0xa57>

@prattmic
Member

prattmic commented Feb 6, 2023

At a high level, I'll say that this application isn't a great fit for PGO today because it isn't "diverse" enough. That is, it spends almost all of its time in a very small number of functions. Right now, PGO works best on programs with hundreds of hot functions, allowing it to accumulate small wins from many places.

That said, we don't want PGO causing regressions, so I would still like to get this back closer to no regression.

@prattmic
Member

prattmic commented Feb 6, 2023

I suspect that the second bullet above (the matchLen loop) would be more-or-less resolved if we hoisted the spill loads/stores out of the loop (is this in scope of #15808?).

The profile does not show the entry to matchLen as hot, which implies that this loop typically runs many iterations, which could amortize the cost of loading several stack values beforehand.

@randall77
Contributor

Yes, that's basically #15808.

@prattmic
Member

I spent more time on this today. While hoisting could perhaps help here, the ultimate problem seems to be completely useless loads.

Here is the final assembly of the core loop:

v588 00539 (+320) INCQ R10
v1091 00540 (256) MOVBLZX compress/flate.wEnd-150(SP), R12
v1096 00541 (267) MOVQ compress/flate.win.len-136(SP), R15
v734 00542 (320) MOVQ R10, compress/flate.~R0-144(SP)
v891 00543 (319) MOVQ compress/flate.wPos.cap-120(SP), R10
// end of b103
v481 00544 (+320) MOVQ compress/flate.~R0-144(SP), R10
v343 00545 (320) CMPQ R10, R9
     00546 (320) JGE 556
// end of b101
v574 00547 (320) MOVBLZX (R10)(R11*1), R15
v214 00548 (+321) MOVBLZX (R10)(R8*1), R12
v337 00549 (321) CMPB R12, R15
     00550 (321) JEQ 539
// end of b107

[screenshot of the SSA/regalloc output, including predecessor blocks b95 and b99, omitted]

The loads of R12, R15, and R10 at 540, 541, and 543 are useless; they are never used. In this analysis I looked specifically at the R10 load which is obviously useless without any extra context, as it is immediately overwritten by the next instruction.

Why are we loading this?

Here are the basic blocks after regalloc:

b101: ← b99 b103
  v565 (320) = Phi <int> v941 v734 : ~R0[int] (i[int], ~R0[int])
  v481 (320) = LoadReg <int> v565 : R10
  v343 (+320) = CMPQ <flags> v481 v976
LT v343 → b107 b104 (likely) (320)

b107: ← b101
  v574 (320) = MOVBloadidx1 <byte> v481 v531 v847 : R15 (av[byte])
  v214 (321) = MOVBloadidx1 <uint8> v481 v494 v847 : R12
  v337 (+321) = CMPB <flags> v214 v574
NE v337 → b106 b103 (unlikely) (321)

b103: ← b107
  v588 (+320) = ADDQconst <int> [1] v481 : R10
  v1091 (256) = LoadReg <byte> v489 : R12
  v1096 (267) = LoadReg <int> v482 : R15
  v734 (320) = StoreReg <int> v588 : ~R0[int]
  v891 (319) = LoadReg <int> v1187 : R10
Plain → b101 (320)

And just before regalloc:

b101: ← b99 b103
  v565 (320) = Phi <int> v926 v588 (~R0[int], i[int])
  v343 (+320) = CMPQ <flags> v565 v416
LT v343 → b107 b104 (likely) (320)

b107: ← b101
  v574 (320) = MOVBloadidx1 <byte> v565 v531 v847 (av[byte])
  v214 (321) = MOVBloadidx1 <uint8> v565 v494 v847
  v337 (+321) = CMPB <flags> v214 v574
NE v337 → b106 b103 (unlikely) (321)

b103: ← b107
  v588 (+320) = ADDQconst <int> [1] v565
Plain → b101 (320)

(I've omitted b101 predecessor b99, but you can see it in the screenshot).

From this diff it isn't clear why b103 adds a load to R10. We can get more context from -gcflags=compress/flate=-d=ssa/regalloc/debug=100:

edge b103->b101
...
src R10: v588 cache=v588
...
dst R10: v490
...
moving v588 to ~R0[int]
sources of v588: R10:v588 [use R10]
v734 = StoreReg <int> v588 : ~R0[int]
v588 now available in ~R0[int]:v734
moving v490 to R10
sources of v490: wPos.cap[int]:v1187 [use wPos.cap[int]]
v588 no longer available in R10:v588
v891 = LoadReg <int> v1187 : R10
v490 now available in R10:v891

This says that on entry (src) this block has v588 in R10, and on exit (dst), it should have v490 in R10. Thus v588 gets spilled to the stack and v490 is loaded from its stack slot (v490 happens to be the capacity of b).

Why does this block think it needs v490 in R10? The expected dst values come from the src values of the successor block. Taking a look at successor b101, we see:

Begin processing block b101
...
starting merge block b101 with end state of b99:
...
  R10: orig:v490 cache:v1138
...
value v343 = CMPQ <flags> v565 v416
  out:
  in0:
  in1: CX
freeReg R10 (dump v490/v1138)
assignReg R10 v565/v481
...
live-at-end spill for v490 at b101

b101 wants v490 in R10 on entry because its other predecessor, b99, ends with v490 in R10 (because b99's predecessor b95 loaded it into R10; you can see this in the screenshot above).

However, b101 proceeds to immediately dump v490 and use R10 for its own purpose. Note that b101 doesn't need to write R10 back to the v490 stack slot (v1138). I think this is because at this point every SSA value still has its own "stack slot", so SSA semantics still apply, but I'm not certain.

It seems to me that it should be possible for regalloc to remove R10 from its requested src state if it knows it will dump the register without needing a spill store.

cc @dr2chase

@gopherbot

Change https://go.dev/cl/471158 mentions this issue: cmd/compile/internal/ssa: drop overwritten regalloc basic block input requirements

gopherbot pushed a commit that referenced this issue Mar 17, 2023
cmd/compile/internal/ssa: drop overwritten regalloc basic block input requirements

For the following description, consider the following basic block graph:

      b1 ───┐┌──── b2
            ││
            ││
            ▼▼
            b3

For register allocator transitions between basic blocks, there are two
key passes (significant paraphrasing):

First, each basic block is visited in some predetermined visit order.
This is the core visitOrder range loop in regAllocState.regalloc. The
specific ordering heuristics aren't important here, except that the
order guarantees that when visiting a basic block at least one of its
predecessors has already been visited.

Upon visiting a basic block, that block sets its expected starting
register state (regAllocState.startRegs) based on the ending register
state (regAlloc.State.endRegs) of one of its predecessors. (How it
chooses which predecessor to use is not important here.)

From that starting state, registers are assigned for all values in the
block, ultimately resulting in some ending register state.

After all blocks have been visited, the shuffle pass
(regAllocState.shuffle) ensures that for each edge, endRegs of the
predecessor == startRegs of the successor. That is, it makes sure that
the startRegs assumptions actually hold true for each edge. It does this
by adding moves to the end of the predecessor block to place values in
the expected register for the successor block. These may be moves from
other registers, or from memory if the value is spilled.

Now on to the actual problem:

Assume that b1 places some value v1 into register R10, and thus ends
with endRegs containing R10 = v1.

When b3 is visited, it selects b1 as its model predecessor and sets
startRegs with R10 = v1.

b2 does not have v1 in R10, so later in the shuffle pass, we will add a
move of v1 into R10 to the end of b2 to ensure it is available for b3.

This is all perfectly fine and exactly how things should work.

Now suppose that b3 does not use v1. It does need to use some other
value v2, which is not currently in a register. When assigning v2 to a
register, it finds all registers are already in use and it needs to dump
a value. Ultimately, it decides to dump v1 from R10 and replace it with
v2.

This is fine, but it has downstream effects on shuffle in b2. b3's
startRegs still state that R10 = v1, so b2 will add a move to R10 even
though b3 will unconditionally overwrite it. i.e., the move at the end
of b2 is completely useless and can result in code like:

// end of b2
MOV n(SP), R10 // R10 = v1 <-- useless
// start of b3
MOV m(SP), R10 // R10 = v2

This is precisely what happened in #58298.

This CL addresses this problem by dropping registers from startRegs if
they are never used in the basic block prior to getting dumped. This
allows the shuffle pass to avoid placing those useless values into the
register.

There is a significant limitation to this CL, which is that it only
impacts the immediate predecessors of an overwriting block. We can
discuss this by zooming out a bit on the previous graph:

b4 ───┐┌──── b5
      ││
      ││
      ▼▼
      b1 ───┐┌──── b2
            ││
            ││
            ▼▼
            b3

Here we have the same graph, except we can see the two predecessors of
b1.

Now suppose that rather than b1 assigning R10 = v1 as above, the
assignment is done in b4. b1 has startRegs R10 = v1, doesn't use the
value at all, and simply passes it through to endRegs R10 = v1.

Now the shuffle pass will require both b2 and b5 to add a move to
assigned R10 = v1, because that is specified in their successor
startRegs.

With this CL, b3 drops R10 = v1 from startRegs, but there is no
backwards propagation, so b1 still has R10 = v1 in startRegs, and b5
still needs to add a useless move.

Extending this CL with such propagation may significantly increase the
number of useless moves we can remove, though it will add complexity to
maintenance and could potentially impact build performance depending on
how efficiently we could implement the propagation (something I haven't
considered carefully).

As-is, this optimization does not impact much code. In bent .text size
geomean is -0.02%. In the container/heap test binary, 18 of ~2500
functions are impacted by this CL. Bent and sweet do not show a
noticeable performance impact one way or another, however #58298 does
show a case where this can have impact if the useless instructions end
up in the hot path of a tight loop.

For #58298.

Change-Id: I2fcef37c955159d068fa0725f995a1848add8a5f
Reviewed-on: https://go-review.googlesource.com/c/go/+/471158
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: David Chase <drchase@google.com>
@gopherbot

Change https://go.dev/cl/484215 mentions this issue: cmd/compile: expose ir.Func to ssa

gopherbot pushed a commit that referenced this issue Apr 20, 2023
ssagen.ssafn already holds the ir.Func, and ssa.Frontend.SetWBPos and
ssa.Frontend.Lsym are simple wrappers around parts of the ir.Func.

Expose the ir.Func through ssa.Frontend, allowing us to remove these
wrapper methods and allowing future access to additional features of the
ir.Func if needed.

While we're here, drop ssa.Frontend.Line, which is unused.

For #58298.

Change-Id: I30c4cbd2743e9ad991d8c6b388484a7d1e95f3ae
Reviewed-on: https://go-review.googlesource.com/c/go/+/484215
Auto-Submit: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
@prattmic changed the title from cmd/compile: PGO makes my program slower to cmd/compile: poor register allocation with PG on Jun 5, 2023
@prattmic changed the title from cmd/compile: poor register allocation with PG to cmd/compile: poor register allocation with PGO on Jun 5, 2023
@prattmic
Member

prattmic commented Jun 5, 2023

https://go.dev/cl/471158 targeted one of the problems in this workload, but it didn't make a huge difference.

Here is the current status:

goos: linux
goarch: amd64
pkg: github.com/Deleplace/pgo-test
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
           │ 120_nopgo.txt │           121_nopgo.txt            │            120_pgo.txt             │            121_pgo.txt             │
           │    sec/op     │   sec/op     vs base               │   sec/op     vs base               │   sec/op     vs base               │
Process-12     294.8m ± 1%   287.2m ± 1%  -2.57% (p=0.000 n=10)   319.9m ± 1%  +8.51% (p=0.000 n=10)   310.6m ± 1%  +5.38% (p=0.000 n=10)

Comparing just Go 1.21, we see there is still an ~8% regression with PGO:

goos: linux
goarch: amd64
pkg: github.com/Deleplace/pgo-test
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
           │ 121_nopgo.txt │            121_pgo.txt             │
           │    sec/op     │   sec/op     vs base               │
Process-12     287.2m ± 1%   310.6m ± 1%  +8.16% (p=0.000 n=10)

@prattmic modified the milestones: Go1.21, Go1.22 on Jun 5, 2023
@Deleplace
Contributor Author

Thank you Michael

@gopherbot removed this from the Go1.22 milestone on Feb 6, 2024
@gopherbot added this to the Go1.23 milestone on Feb 6, 2024