cmd/compile: performance regression in 1.20 #57505

Open
changkun opened this issue Dec 29, 2022 · 10 comments
Labels: compiler/runtime (Issues related to the Go compiler and/or runtime), NeedsInvestigation (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one), Performance

@changkun
Member

What version of Go are you using (go version)?

$ go version
go version go1.19.4 linux/amd64
$ gotip version
go version devel go1.20-e870de9 Tue Dec 27 21:10:04 2022 +0000 linux/amd64

Does this issue reproduce with the latest release?

Yes, also in 1.20rc1

What did you do?

$ cat go.mod
module mymodule/math

go 1.20
$ cat math.go
package math

import (
        "math"
        "math/rand"
)

const Epsilon = 1e-7

type Float interface {
        ~float32 | ~float64
}

type Mat4[T Float] struct {
        X00, X01, X02, X03 T
        X10, X11, X12, X13 T
        X20, X21, X22, X23 T
        X30, X31, X32, X33 T
}

func (m Mat4[T]) Eq(n Mat4[T]) bool {
        return ApproxEq(m.X00, n.X00, Epsilon) &&
                ApproxEq(m.X10, n.X10, Epsilon) &&
                ApproxEq(m.X20, n.X20, Epsilon) &&
                ApproxEq(m.X30, n.X30, Epsilon) &&
                ApproxEq(m.X01, n.X01, Epsilon) &&
                ApproxEq(m.X11, n.X11, Epsilon) &&
                ApproxEq(m.X21, n.X21, Epsilon) &&
                ApproxEq(m.X31, n.X31, Epsilon) &&
                ApproxEq(m.X02, n.X02, Epsilon) &&
                ApproxEq(m.X12, n.X12, Epsilon) &&
                ApproxEq(m.X22, n.X22, Epsilon) &&
                ApproxEq(m.X32, n.X32, Epsilon) &&
                ApproxEq(m.X03, n.X03, Epsilon) &&
                ApproxEq(m.X13, n.X13, Epsilon) &&
                ApproxEq(m.X23, n.X23, Epsilon) &&
                ApproxEq(m.X33, n.X33, Epsilon)
}

func Abs[T Float](x T) T {
        return T(math.Abs(float64(x)))
}

func ApproxEq[T Float](v1, v2, epsilon T) bool {
        return Abs(v1-v2) <= epsilon
}

type Vec4[T Float] struct {
        X, Y, Z, W T
}

func NewRandVec4[T Float]() Vec4[T] {
        return Vec4[T]{
                T(rand.Float64()),
                T(rand.Float64()),
                T(rand.Float64()),
                T(rand.Float64()),
        }
}

func (v Vec4[T]) Dot(u Vec4[T]) T {
        return FMA(v.X, u.X, FMA(v.Y, u.Y, FMA(v.Z, u.Z, v.W*u.W)))
}

func FMA[T Float](x, y, z T) T {
        return T(math.FMA(float64(x), float64(y), float64(z)))
}
$ cat bench_test.go 
package math_test

import (
        "testing"

        "mymodule/math"
)

func BenchmarkMat4_Eq(b *testing.B) {
        m1 := math.Mat4[float32]{
                5, 1, 5, 6,
                8, 71, 2, 47,
                5, 1, 582, 4,
                2, 1, 7, 25,
        }
        m2 := math.Mat4[float32]{
                5, 1, 5, 6,
                8, 71, 2, 47,
                5, 1, 582, 4,
                2, 1, 7, 25,
        }

        b.ResetTimer()
        b.ReportAllocs()
        var m bool
        for i := 0; i < b.N; i++ {
                m = m1.Eq(m2)
        }
        _ = m
}

var v float32

func BenchmarkVec_Dot(b *testing.B) {
        b.Run("Vec4", func(b *testing.B) {
                v1 := math.NewRandVec4[float32]()
                v2 := math.NewRandVec4[float32]()

                b.ReportAllocs()
                b.ResetTimer()
                for i := 0; i < b.N; i++ {
                        v = v1.Dot(v2)
                }
        })
}
$ perflock go test -run=none -bench=. -count=10 | tee bench119.txt
goos: linux
goarch: amd64
pkg: mymodule/math
cpu: Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz
BenchmarkMat4_Eq-8      64214283                18.66 ns/op            0 B/op          0 allocs/op
BenchmarkMat4_Eq-8      64270538                18.66 ns/op            0 B/op          0 allocs/op
BenchmarkMat4_Eq-8      64261249                18.66 ns/op            0 B/op          0 allocs/op
...
$ perflock gotip test -run=none -bench=. -count=10 | tee bench120.txt
goos: linux
goarch: amd64
pkg: mymodule/math
cpu: Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz
BenchmarkMat4_Eq-8      35130938                35.00 ns/op            0 B/op          0 allocs/op
BenchmarkMat4_Eq-8      35127861                34.20 ns/op            0 B/op          0 allocs/op
BenchmarkMat4_Eq-8      34658744                34.21 ns/op            0 B/op          0 allocs/op
...

What did you expect to see?

Same performance.

What did you see instead?

$ benchstat bench119.txt bench120.txt
name            old time/op    new time/op    delta
Mat4_Eq-8         18.7ns ± 0%    34.2ns ± 0%   +83.24%  (p=0.000 n=8+9)
Vec_Dot/Vec4-8    4.44ns ± 0%    9.30ns ± 0%  +109.24%  (p=0.000 n=8+8)

name            old alloc/op   new alloc/op   delta
Mat4_Eq-8          0.00B          0.00B           ~     (all equal)
Vec_Dot/Vec4-8     0.00B          0.00B           ~     (all equal)

name            old allocs/op  new allocs/op  delta
Mat4_Eq-8           0.00           0.00           ~     (all equal)
Vec_Dot/Vec4-8      0.00           0.00           ~     (all equal)
@changkun changed the title from "cmd/go: performance regression in 1.20" to "cmd/compile: performance regression in 1.20" on Dec 29, 2022
@gopherbot added the compiler/runtime label on Dec 29, 2022
@changkun
Member Author

cc @golang/runtime

@dmitshur added the Performance and NeedsInvestigation labels on Dec 29, 2022
@dmitshur added this to the Go1.20 milestone on Dec 29, 2022
@cherrymui
Member

I can reproduce the regression on Mat4_Eq, but not on Vec_Dot/Vec4. Setting GOEXPERIMENT=nounified brings the performance back. So it looks like it is due to unified IR.

It looks to me that with Go 1.19 or non-unified IR, the compiler inlines math.Abs, math.Float64bits, and math.Float64frombits (from the standard library math package, not mymodule/math) into their callers. At tip with unified IR, they are not inlined. Maybe there is an issue with inlining a non-generic callee into a generic caller?

cc @mdempsky

@cherrymui
Member

Yeah, if I remove the type parameters (hard code float32), I get the same performance as Go 1.19.
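
For reference, a non-generic variant along these lines might look like the sketch below. It assumes the type parameter is simply replaced by float32 throughout; the function names mirror the reproducer, and main is added only so the snippet runs on its own.

```go
package main

import (
	"fmt"
	"math"
)

const Epsilon = 1e-7

// Abs is the reproducer's Abs with T hard-coded to float32.
func Abs(x float32) float32 {
	return float32(math.Abs(float64(x)))
}

// ApproxEq reports whether v1 and v2 differ by at most epsilon.
func ApproxEq(v1, v2, epsilon float32) bool {
	return Abs(v1-v2) <= epsilon
}

func main() {
	fmt.Println(ApproxEq(1, 1, Epsilon))
	fmt.Println(ApproxEq(1, 2, Epsilon))
}
```

With no type parameters involved, Abs and ApproxEq are ordinary functions of the package that declares them, so the standard library calls they contain can be inlined when that package itself is compiled.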

@changkun
Member Author

changkun commented Dec 29, 2022

It is weird that Vec_Dot/Vec4 is not reproducible. Nevertheless, in https://github.com/polyred/polyred/tree/develop/math, there are more regression examples:

name                        old time/op    new time/op    delta
Mat_Mul-8                     5.80ms ± 0%    5.92ms ± 0%    +2.11%  (p=0.000 n=9+8)
Mat4_Eq-8                     19.0ns ± 0%    34.2ns ± 0%   +79.90%  (p=0.000 n=10+9)
Vec_Eq/Vec2-8                 1.82ns ± 1%    2.52ns ± 0%   +38.40%  (p=0.000 n=8+9)
Vec_Eq/Vec3-8                 1.90ns ± 2%    2.74ns ± 0%   +44.39%  (p=0.000 n=9+10)
Vec_Eq/Vec4-8                 2.10ns ± 1%    3.08ns ± 0%   +47.06%  (p=0.000 n=10+10)
Vec_IsZero/Vec2-8             1.82ns ± 1%    2.51ns ± 0%   +37.99%  (p=0.000 n=9+8)
Vec_IsZero/Vec3-8             1.71ns ± 3%    2.51ns ± 0%   +46.40%  (p=0.000 n=9+8)
Vec_IsZero/Vec4-8             1.82ns ± 0%    2.51ns ± 1%   +38.45%  (p=0.000 n=9+9)
Vec_Dot/Vec2-8                2.06ns ± 2%    2.95ns ± 0%   +43.22%  (p=0.000 n=10+9)
Vec_Dot/Vec3-8                3.13ns ± 0%    6.15ns ± 1%   +96.09%  (p=0.000 n=8+9)
Vec_Dot/Vec4-8                4.21ns ± 0%    9.34ns ± 1%  +121.83%  (p=0.000 n=8+10)
Vec_Len/Vec2-8                2.12ns ± 0%    3.20ns ± 0%   +50.73%  (p=0.000 n=9+9)
Vec_Len/Vec3-8                2.98ns ± 1%    3.80ns ± 0%   +27.45%  (p=0.000 n=10+8)
Vec_Len/Vec4-8                4.34ns ± 1%    5.14ns ± 1%   +18.35%  (p=0.000 n=9+10)
Vec_Unit/Vec3-8               8.90ns ± 0%   13.35ns ± 0%   +50.06%  (p=0.000 n=9+8)
Vec_Unit/Vec4-8               15.7ns ± 0%    19.3ns ± 1%   +23.15%  (p=0.000 n=8+10)
Vec_Apply/Vec3-8              8.90ns ± 0%   13.40ns ± 1%   +50.53%  (p=0.000 n=9+10)
Vec_Apply/Vec4-8              15.7ns ± 0%    19.3ns ± 0%   +22.86%  (p=0.000 n=8+8)
Vec_Cross/Vec4-8              4.77ns ± 0%    4.90ns ± 1%    +2.81%  (p=0.000 n=8+10)

@cherrymui
Member

This is also multi-level inlining. E.g. standard math.Abs inlined into user-defined, instantiated Abs[go.shape.float32_0], then inlined into ApproxEq[go.shape.float32_0]. #56280 may be related.

@mdempsky self-assigned this on Dec 30, 2022
@mdempsky
Member

I'm on vacation (and currently on a plane), but briefly looking at the compiler's -m and -S output, it looks like everything is inlining the same. I don't see anything obviously wrong. (Caveat: I had to retype stuff from my phone onto my laptop and I simplified things slightly because of that.)

I'll take a look once I'm back in the office on Monday.

@cherrymui
Member

cherrymui commented Jan 3, 2023

Hmmm, I got different results with tip vs. 1.19 or non-unified.

With Go 1.19,

$ go1.19 test -c -gcflags=-m 
# mymodule/math_test [mymodule/math.test]
./math.go:40:6: can inline "mymodule/math".Abs[go.shape.float32_0]
./math.go:41:26: inlining call to "math".Abs
./math.go:41:26: inlining call to "math".Float64bits
./math.go:41:26: inlining call to "math".Float64frombits
./math.go:44:6: can inline "mymodule/math".ApproxEq[go.shape.float32_0]
./math.go:45:19: inlining call to "mymodule/math".Abs[go.shape.float32_0]
./math.go:45:19: inlining call to "math".Abs
./math.go:45:19: inlining call to "math".Float64bits
./math.go:45:19: inlining call to "math".Float64frombits
./math.go:22:24: inlining call to "mymodule/math".ApproxEq[go.shape.float32_0]
./math.go:22:24: inlining call to "mymodule/math".Abs[go.shape.float32_0]
./math.go:22:24: inlining call to "math".Abs
./math.go:22:24: inlining call to "math".Float64bits
./math.go:22:24: inlining call to "math".Float64frombits
...

With tip,

$ go test -c -gcflags=-m 
# mymodule/math_test [mymodule/math.test]
./math.go:40:6: can inline math.Abs[go.shape.float32]
./math.go:44:6: can inline math.ApproxEq[go.shape.float32]
./math.go:45:19: inlining call to math.Abs[go.shape.float32]
./math.go:22:24: inlining call to math.ApproxEq[go.shape.float32]
./math.go:23:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:24:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:25:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:26:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:27:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:28:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:29:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:30:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:31:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:32:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:33:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:34:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:35:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:36:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:37:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:22:24: inlining call to math.Abs[go.shape.float32]
./math.go:23:25: inlining call to math.Abs[go.shape.float32]
./math.go:24:25: inlining call to math.Abs[go.shape.float32]
./math.go:25:25: inlining call to math.Abs[go.shape.float32]
./math.go:26:25: inlining call to math.Abs[go.shape.float32]
./math.go:27:25: inlining call to math.Abs[go.shape.float32]
./math.go:28:25: inlining call to math.Abs[go.shape.float32]
./math.go:29:25: inlining call to math.Abs[go.shape.float32]
./math.go:30:25: inlining call to math.Abs[go.shape.float32]
./math.go:31:25: inlining call to math.Abs[go.shape.float32]
./math.go:32:25: inlining call to math.Abs[go.shape.float32]
./math.go:33:25: inlining call to math.Abs[go.shape.float32]
./math.go:34:25: inlining call to math.Abs[go.shape.float32]
./math.go:35:25: inlining call to math.Abs[go.shape.float32]
./math.go:36:25: inlining call to math.Abs[go.shape.float32]
./math.go:37:25: inlining call to math.Abs[go.shape.float32]
./math_test.go:24:23: inlining call to testing.(*B).ReportAllocs
...

but no Float64bits and Float64frombits.

In particular,

$ go1.19 test -c -gcflags=-m 2>&1 | grep Float64frombits
./math.go:41:26: inlining call to "math".Float64frombits
./math.go:45:19: inlining call to "math".Float64frombits
./math.go:22:24: inlining call to "math".Float64frombits
./math.go:23:25: inlining call to "math".Float64frombits
./math.go:24:25: inlining call to "math".Float64frombits
./math.go:25:25: inlining call to "math".Float64frombits
./math.go:26:25: inlining call to "math".Float64frombits
./math.go:27:25: inlining call to "math".Float64frombits
./math.go:28:25: inlining call to "math".Float64frombits
./math.go:29:25: inlining call to "math".Float64frombits
./math.go:30:25: inlining call to "math".Float64frombits
./math.go:31:25: inlining call to "math".Float64frombits
./math.go:32:25: inlining call to "math".Float64frombits
./math.go:33:25: inlining call to "math".Float64frombits
./math.go:34:25: inlining call to "math".Float64frombits
./math.go:35:25: inlining call to "math".Float64frombits
./math.go:36:25: inlining call to "math".Float64frombits
./math.go:37:25: inlining call to "math".Float64frombits
$ go test -c -gcflags=-m 2>&1 | grep Float64frombits
$ # no output

@mdempsky
Member

mdempsky commented Jan 3, 2023

@cherrymui Thanks, I'm able to repro the issue now. Not sure what went wrong with my earlier attempt.

@mdempsky
Member

mdempsky commented Jan 3, 2023

The issue here is that unified IR has a simpler heuristic for deciding which function bodies to re-export. It simply re-exports functions that were inlined into the current compilation unit. (It also always exports its own inlinable functions.)

The problem manifests here because mymodule/math doesn't actually instantiate its generic types/functions, so math.{Abs,Float64bits,Float64frombits} never get inlined within that package, and therefore they're never re-exported by that package either. Then when compiling mymodule/math_test, the inline bodies aren't available, so the calls don't get inlined.

Two possible workarounds:

  1. Instantiate the generic functions/types within mymodule/math. For example, add two statements var _ Mat4[float64]; var _ Vec4[float64].
  2. Within mymodule/math_test, add an import _ "math" declaration. This will make sure the math.{Abs,Float64bits,Float64frombits} inline bodies are available from the origin package, regardless of re-exporting.
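
A minimal sketch of the first workaround is shown below. It is collapsed into package main so that it runs standalone; in the real module the var declaration would sit in mymodule/math next to the type definitions, and Mat4 would get the same treatment. Whether this restores the Go 1.19 numbers for a given benchmark is something to confirm with benchstat rather than assume.

```go
package main

import (
	"fmt"
	"math"
)

type Float interface {
	~float32 | ~float64
}

type Vec4[T Float] struct {
	X, Y, Z, W T
}

func FMA[T Float](x, y, z T) T {
	return T(math.FMA(float64(x), float64(y), float64(z)))
}

func (v Vec4[T]) Dot(u Vec4[T]) T {
	return FMA(v.X, u.X, FMA(v.Y, u.Y, FMA(v.Z, u.Z, v.W*u.W)))
}

// Workaround 1: force an instantiation inside the defining package.
// The blank identifier means no variable is actually kept around;
// the declaration exists only so the generic code is instantiated here.
var _ Vec4[float64]

func main() {
	v := Vec4[float32]{1, 2, 3, 4}
	fmt.Println(v.Dot(v))
}
```

The second workaround needs no code beyond the blank import itself, added to the import block of the test file.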

There's supposed to be a compiler diagnostic to warn when this happens. I'm not sure at the moment why it's not firing.

@mknyszek
Contributor

mknyszek commented Jun 9, 2023

Hey @mdempsky, doing a sweep of the Go 1.21 milestone. Any updates here? Should this go into Backlog? Thanks.

@mdempsky modified the milestones: Go1.21, Go1.22 on Jun 27, 2023
@gopherbot modified the milestones: Go1.22, Go1.23 on Feb 6, 2024
6 participants