Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: lower performance in tip and AMD64=v3 #59225

Closed
rip-create-your-account opened this issue Mar 24, 2023 · 11 comments
Closed

cmd/compile: lower performance in tip and AMD64=v3 #59225

rip-create-your-account opened this issue Mar 24, 2023 · 11 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Milestone

Comments

@rip-create-your-account

What version of Go are you using (go version)?

$ gotip version
go version devel go1.21-a6c382eaa8 Fri Mar 24 14:23:19 2023 +0000 linux/amd64

$ go1.20.2 version
go version go1.20.2 linux/amd64

Does this issue reproduce with the latest release?

Yes

What did you do?

package main

import (
        "encoding/binary"
        "math/bits"
        "math/rand"
        "testing"
)

func BenchmarkChecksum(b *testing.B) {
        bb := make([]byte, 1024)
        r := rand.New(rand.NewSource(1))

        initial := uint16(r.Uint32())
        r.Read(bb)

        b.ResetTimer()

        var out uint16
        for i := 0; i < b.N; i++ {
                out = Checksum(bb, initial)
        }
        _ = out

        b.SetBytes(int64(len(bb)))
}

func Checksum(buf []byte, _ uint16) uint16 {
        var off int
        var sum, carry uint64
        for ; off <= len(buf)-64; off += 64 {
                chunk := (*[64]byte)(buf[off : off+64 : off+64])

                {
                        n1 := binary.BigEndian.Uint64(chunk[8*0:])
                        sum, carry = bits.Add64(sum, n1, carry)
                        n2 := binary.BigEndian.Uint64(chunk[8*1:])
                        sum, carry = bits.Add64(sum, n2, carry)
                        n3 := binary.BigEndian.Uint64(chunk[8*2:])
                        sum, carry = bits.Add64(sum, n3, carry)
                        n4 := binary.BigEndian.Uint64(chunk[8*3:])
                        sum, carry = bits.Add64(sum, n4, carry)
                }

                {
                        n1 := binary.BigEndian.Uint64(chunk[8*4:])
                        sum, carry = bits.Add64(sum, n1, carry)
                        n2 := binary.BigEndian.Uint64(chunk[8*5:])
                        sum, carry = bits.Add64(sum, n2, carry)
                        n3 := binary.BigEndian.Uint64(chunk[8*6:])
                        sum, carry = bits.Add64(sum, n3, carry)
                        n4 := binary.BigEndian.Uint64(chunk[8*7:])
                        sum, carry = bits.Add64(sum, n4, carry)
                }
        }
        return uint16(sum)
}

What did you expect to see?

gotip and GOAMD64=v3 go1.20.2 to maintain the performance of go1.20.2

What did you see instead?

go1.20.2 test --bench Checksum checksum_test.go
goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
BenchmarkChecksum-8     15659816                72.35 ns/op     14154.40 MB/s
gotip test --bench Checksum checksum_test.go
goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
BenchmarkChecksum-8     12009291                94.35 ns/op     10852.66 MB/s

For go1.20.2 GOAMD64=v3 I get similar performance as for gotip

GOAMD64=v3 go1.20.2 test --bench Checksum checksum_test.go
goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
BenchmarkChecksum-8     12212551                97.91 ns/op     10458.96 MB/s
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Mar 24, 2023
@heschi
Copy link
Contributor

heschi commented Mar 24, 2023

cc @golang/compiler

@heschi heschi added this to the Go1.21 milestone Mar 24, 2023
@heschi heschi added the NeedsFix The path to resolution is known, but the work has not been done. label Mar 24, 2023
@randall77
Copy link
Contributor

I am not seeing the same thing you are seeing. I ran with 1.20.2 and tip, GOAMD64=v1 and v3. They all seem almost identical.

% benchstat 1202v1 1202v3 tipv1 tipv3
name \ time/op  1202v1         1202v3         tipv1          tipv3
Checksum-16       55.5ns ± 0%    55.9ns ± 1%    56.5ns ± 0%    56.5ns ± 0%

My processor is

goos: darwin
goarch: amd64
cpu: Intel(R) Xeon(R) W-3223 CPU @ 3.50GHz

@randall77
Copy link
Contributor

Same on this machine, no significant performance difference:

goos: linux
goarch: amd64
cpu: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz

@randall77 randall77 added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Mar 24, 2023
@rip-create-your-account
Copy link
Author

I modified the benchmark loop to call Checksum 50*1000*1000 times so that b.N=1 and then run with go test ... --exec "perf stat". Here are the results for go1.20.2 GOAMD64=v1 and v3:

GOAMD64=v1 go1.20.2 test --bench Checksum checksum_test.go --exec "perf stat"
BenchmarkChecksum              1        3602958906 ns/op        14210.54 MB/s
PASS

           3604.05 msec task-clock                       #    1.000 CPUs utilized          
               935      context-switches                 #  259.430 /sec                   
                 1      cpu-migrations                   #    0.277 /sec                   
               346      page-faults                      #   96.003 /sec                   
       14349593850      cycles                           #    3.982 GHz                    
       39768319694      instructions                     #    2.77  insn per cycle         
        2653987967      branches                         #  736.389 M/sec                  
             67098      branch-misses                    #    0.00% of all branches
GOAMD64=v3 go1.20.2 test --bench Checksum checksum_test.go --exec "perf stat"
BenchmarkChecksum              1        4906196573 ns/op        10435.78 MB/s
PASS

           4907.21 msec task-clock                       #    1.000 CPUs utilized          
              1216      context-switches                 #  247.799 /sec                   
                 1      cpu-migrations                   #    0.204 /sec                   
               349      page-faults                      #   71.120 /sec                   
       19548126041      cycles                           #    3.984 GHz                    
       33372351551      instructions                     #    1.71  insn per cycle         
        2654871068      branches                         #  541.015 M/sec                  
             77628      branch-misses                    #    0.00% of all branches

So the generated code has something going on that impacts instructions per cycle or perf stat is just unable to report the relevant perf counters to me. If there's nothing silly going in the assembly then I'm happy to accept that my CPU has a less wrinkly brain than most.

@randall77
Copy link
Contributor

At v1, the code uses MOVQ + BSWAPQ. At v3, it uses MOVBE instead (combined load+byteswap).
It sounds like your chip really doesn't like MOVBE for some reason.

@rip-create-your-account
Copy link
Author

I modified the Checksum function to process in chunks of 48 bytes instead of 64. This did the trick and now both GOAMD64=v1 and v3 get similar performance.

GOAMD64=v1 BenchmarkChecksum 1 3627673852 ns/op 14113.73 MB/s
GOAMD64=v3 BenchmarkChecksum 1 3639348795 ns/op 14068.45 MB/s

Looking at the generated code the only real thing that GOAMD64=v3 ever did was to convert MOVQ+BSWAP pair to a MOVBE. With 56 byte chunks I got the same performance as with 64 so I guess 48 bytes is a sweet spot for my CPU and that's that.

@randall77
Copy link
Contributor

I did notice that 64-byte chunks needed one more register than was available, so it had to spill a bit. 48-byte chunks would probably avoid that. That may be related to what you are seeing (it's unhappy with MOVBE followed immediately by a store?).

@rip-create-your-account
Copy link
Author

For both 64-byte 56-byte chunks it also seems to be spilling the slice length. But I might be misreading the assembly. Also go tool objdump tells me that there's a sequence of 6 NOPL instructions in the hot loop after all the ADCQs. Why is there so many of them in a sequence? For 64-byte chunks version it's a sequence of 7 NOPLs.

@randall77
Copy link
Contributor

Those NOPs mark the inlining callsites of the binary.Uint64 calls. They are used for tracebacks.
Removing them is #29571

@rip-create-your-account
Copy link
Author

Alright, I now finally realized to check how the assembly diffs between GOAMD64=v1 go1.20.2 and GOAMD64=v1 gotip for the Checksum function. The difference boils down to those NOPLs just being in a different location yet the performance goes from ~14000MB/s -> ~10000MB/s.
diff go1.20.2vsgotip.txt
Understanding why this is the case is above my pay-grade. Feel free to close the issue if there's nothing to really do. I already figured out the fix for me.

@randall77
Copy link
Contributor

The nop thing is tracked on a different issue, so we can close this.

@golang golang locked and limited conversation to collaborators Mar 24, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Projects
None yet
Development

No branches or pull requests

4 participants