cmd/compile: performance regression in 1.20 #57505

changkun · 2022-12-29T10:11:48Z

What version of Go are you using (`go version`)?

$ go version
go version go1.19.4 linux/amd64
$ gotip version
go version devel go1.20-e870de9 Tue Dec 27 21:10:04 2022 +0000 linux/amd64

Does this issue reproduce with the latest release?

Yes, also in 1.20rc1

What did you do?

$ cat go.mod
module mymodule/math

go 1.20

$ cat math.go
package math

import (
        "math"
        "math/rand"
)

const Epsilon = 1e-7

type Float interface {
        ~float32 | ~float64
}

type Mat4[T Float] struct {
        X00, X01, X02, X03 T
        X10, X11, X12, X13 T
        X20, X21, X22, X23 T
        X30, X31, X32, X33 T
}

func (m Mat4[T]) Eq(n Mat4[T]) bool {
        return ApproxEq(m.X00, n.X00, Epsilon) &&
                ApproxEq(m.X10, n.X10, Epsilon) &&
                ApproxEq(m.X20, n.X20, Epsilon) &&
                ApproxEq(m.X30, n.X30, Epsilon) &&
                ApproxEq(m.X01, n.X01, Epsilon) &&
                ApproxEq(m.X11, n.X11, Epsilon) &&
                ApproxEq(m.X21, n.X21, Epsilon) &&
                ApproxEq(m.X31, n.X31, Epsilon) &&
                ApproxEq(m.X02, n.X02, Epsilon) &&
                ApproxEq(m.X12, n.X12, Epsilon) &&
                ApproxEq(m.X22, n.X22, Epsilon) &&
                ApproxEq(m.X32, n.X32, Epsilon) &&
                ApproxEq(m.X03, n.X03, Epsilon) &&
                ApproxEq(m.X13, n.X13, Epsilon) &&
                ApproxEq(m.X23, n.X23, Epsilon) &&
                ApproxEq(m.X33, n.X33, Epsilon)
}

func Abs[T Float](x T) T {
        return T(math.Abs(float64(x)))
}

func ApproxEq[T Float](v1, v2, epsilon T) bool {
        return Abs(v1-v2) <= epsilon
}

type Vec4[T Float] struct {
        X, Y, Z, W T
}

func NewRandVec4[T Float]() Vec4[T] {
        return Vec4[T]{
                T(rand.Float64()),
                T(rand.Float64()),
                T(rand.Float64()),
                T(rand.Float64()),
        }
}

func (v Vec4[T]) Dot(u Vec4[T]) T {
        return FMA(v.X, u.X, FMA(v.Y, u.Y, FMA(v.Z, u.Z, v.W*u.W)))
}

func FMA[T Float](x, y, z T) T {
        return T(math.FMA(float64(x), float64(y), float64(z)))
}

$ cat bench_test.go 
package math_test

import (
        "testing"

        "mymodule/math"
)

func BenchmarkMat4_Eq(b *testing.B) {
        m1 := math.Mat4[float32]{
                5, 1, 5, 6,
                8, 71, 2, 47,
                5, 1, 582, 4,
                2, 1, 7, 25,
        }
        m2 := math.Mat4[float32]{
                5, 1, 5, 6,
                8, 71, 2, 47,
                5, 1, 582, 4,
                2, 1, 7, 25,
        }

        b.ResetTimer()
        b.ReportAllocs()
        var m bool
        for i := 0; i < b.N; i++ {
                m = m1.Eq(m2)
        }
        _ = m
}

var v float32

func BenchmarkVec_Dot(b *testing.B) {
        b.Run("Vec4", func(b *testing.B) {
                v1 := math.NewRandVec4[float32]()
                v2 := math.NewRandVec4[float32]()

                b.ReportAllocs()
                b.ResetTimer()
                for i := 0; i < b.N; i++ {
                        v = v1.Dot(v2)
                }
        })
}

$ perflock go test -run=none -bench=. -count=10 | tee bench119.txt
goos: linux
goarch: amd64
pkg: mymodule/math
cpu: Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz
BenchmarkMat4_Eq-8      64214283                18.66 ns/op            0 B/op          0 allocs/op
BenchmarkMat4_Eq-8      64270538                18.66 ns/op            0 B/op          0 allocs/op
BenchmarkMat4_Eq-8      64261249                18.66 ns/op            0 B/op          0 allocs/op
...

$ perflock gotip test -run=none -bench=. -count=10 | tee bench120.txt
goos: linux
goarch: amd64
pkg: mymodule/math
cpu: Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz
BenchmarkMat4_Eq-8      35130938                35.00 ns/op            0 B/op          0 allocs/op
BenchmarkMat4_Eq-8      35127861                34.20 ns/op            0 B/op          0 allocs/op
BenchmarkMat4_Eq-8      34658744                34.21 ns/op            0 B/op          0 allocs/op
...

What did you expect to see?

Same performance.

What did you see instead?

$ benchstat bench119.txt bench120.txt
name            old time/op    new time/op    delta
Mat4_Eq-8         18.7ns ± 0%    34.2ns ± 0%   +83.24%  (p=0.000 n=8+9)
Vec_Dot/Vec4-8    4.44ns ± 0%    9.30ns ± 0%  +109.24%  (p=0.000 n=8+8)

name            old alloc/op   new alloc/op   delta
Mat4_Eq-8          0.00B          0.00B           ~     (all equal)
Vec_Dot/Vec4-8     0.00B          0.00B           ~     (all equal)

name            old allocs/op  new allocs/op  delta
Mat4_Eq-8           0.00           0.00           ~     (all equal)
Vec_Dot/Vec4-8      0.00           0.00           ~     (all equal)

changkun · 2022-12-29T10:13:03Z

cc @golang/runtime

cherrymui · 2022-12-29T22:18:47Z

I can reproduce the regression on Mat4_Eq, but not on Vec_Dot/Vec4. Setting GOEXPERIMENT=nounified brings the performance back. So it looks like it is due to unified IR.

It looks to me that with Go 1.19 or non-unified IR, the compiler inlines math.Abs, math.Float64bits, and math.Float64frombits (from the standard library math package, not mymodule/math) into the caller. At tip with unified IR, they are not inlined. Maybe there is an issue with inlining a non-generic callee into a generic caller?

cc @mdempsky

cherrymui · 2022-12-29T22:22:57Z

Yeah, if I remove the type parameters (hard code float32), I get the same performance as Go 1.19.
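A minimal sketch of what "removing the type parameters" could look like for the Abs/ApproxEq pair, assuming float32 is hard-coded. The names abs32 and approxEq32 are hypothetical and do not appear in the original package:

```go
package main

import (
	"fmt"
	"math"
)

const epsilon = 1e-7

// abs32 is a hypothetical float32-only stand-in for the generic Abs[T].
// With no type parameters involved, the call to the standard library's
// math.Abs is an ordinary non-generic call site.
func abs32(x float32) float32 {
	return float32(math.Abs(float64(x)))
}

// approxEq32 is a hypothetical float32-only stand-in for ApproxEq[T].
func approxEq32(v1, v2, eps float32) bool {
	return abs32(v1-v2) <= eps
}

func main() {
	// Identical values should compare as approximately equal.
	fmt.Println(approxEq32(582, 582, epsilon))
	// Values that differ by 1 should not.
	fmt.Println(approxEq32(1, 2, epsilon))
}
```

Building this variant with -gcflags=-m is one way to check whether the math.Abs call is reported as inlined on a given toolchain.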

changkun · 2022-12-29T22:27:56Z

It is weird that Vec_Dot/Vec4 is not reproducible. Nevertheless, in https://github.com/polyred/polyred/tree/develop/math, there are more regression examples:

name                        old time/op    new time/op    delta
Mat_Mul-8                     5.80ms ± 0%    5.92ms ± 0%    +2.11%  (p=0.000 n=9+8)
Mat4_Eq-8                     19.0ns ± 0%    34.2ns ± 0%   +79.90%  (p=0.000 n=10+9)
Vec_Eq/Vec2-8                 1.82ns ± 1%    2.52ns ± 0%   +38.40%  (p=0.000 n=8+9)
Vec_Eq/Vec3-8                 1.90ns ± 2%    2.74ns ± 0%   +44.39%  (p=0.000 n=9+10)
Vec_Eq/Vec4-8                 2.10ns ± 1%    3.08ns ± 0%   +47.06%  (p=0.000 n=10+10)
Vec_IsZero/Vec2-8             1.82ns ± 1%    2.51ns ± 0%   +37.99%  (p=0.000 n=9+8)
Vec_IsZero/Vec3-8             1.71ns ± 3%    2.51ns ± 0%   +46.40%  (p=0.000 n=9+8)
Vec_IsZero/Vec4-8             1.82ns ± 0%    2.51ns ± 1%   +38.45%  (p=0.000 n=9+9)
Vec_Dot/Vec2-8                2.06ns ± 2%    2.95ns ± 0%   +43.22%  (p=0.000 n=10+9)
Vec_Dot/Vec3-8                3.13ns ± 0%    6.15ns ± 1%   +96.09%  (p=0.000 n=8+9)
Vec_Dot/Vec4-8                4.21ns ± 0%    9.34ns ± 1%  +121.83%  (p=0.000 n=8+10)
Vec_Len/Vec2-8                2.12ns ± 0%    3.20ns ± 0%   +50.73%  (p=0.000 n=9+9)
Vec_Len/Vec3-8                2.98ns ± 1%    3.80ns ± 0%   +27.45%  (p=0.000 n=10+8)
Vec_Len/Vec4-8                4.34ns ± 1%    5.14ns ± 1%   +18.35%  (p=0.000 n=9+10)
Vec_Unit/Vec3-8               8.90ns ± 0%   13.35ns ± 0%   +50.06%  (p=0.000 n=9+8)
Vec_Unit/Vec4-8               15.7ns ± 0%    19.3ns ± 1%   +23.15%  (p=0.000 n=8+10)
Vec_Apply/Vec3-8              8.90ns ± 0%   13.40ns ± 1%   +50.53%  (p=0.000 n=9+10)
Vec_Apply/Vec4-8              15.7ns ± 0%    19.3ns ± 0%   +22.86%  (p=0.000 n=8+8)
Vec_Cross/Vec4-8              4.77ns ± 0%    4.90ns ± 1%    +2.81%  (p=0.000 n=8+10)

cherrymui · 2022-12-29T22:28:17Z

This is also multi-level inlining. E.g. standard math.Abs inlined into user-defined, instantiated Abs[go.shape.float32_0], then inlined into ApproxEq[go.shape.float32_0]. #56280 may be related.
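To make the chain concrete: the pure-Go math.Abs clears the sign bit using math.Float64bits and math.Float64frombits, which is why those three names show up together in the inlining output. The sketch below compares the generic ApproxEq from the report against a hand-inlined float32 version. approxEqInlined is a hypothetical name used only for illustration, and the code the compiler actually generates may differ (for example, where Abs is treated as an intrinsic):

```go
package main

import (
	"fmt"
	"math"
)

type Float interface {
	~float32 | ~float64
}

// Abs and ApproxEq are copied from the report's math.go.
func Abs[T Float](x T) T {
	return T(math.Abs(float64(x)))
}

func ApproxEq[T Float](v1, v2, epsilon T) bool {
	return Abs(v1-v2) <= epsilon
}

// approxEqInlined is a hypothetical hand-written equivalent of
// ApproxEq[float32] after every level has been inlined:
// ApproxEq -> Abs[T] -> math.Abs -> Float64bits/Float64frombits.
func approxEqInlined(v1, v2, epsilon float32) bool {
	// Clear the sign bit of the float64 representation, as math.Abs does.
	bits := math.Float64bits(float64(v1-v2)) &^ (1 << 63)
	return float32(math.Float64frombits(bits)) <= epsilon
}

func main() {
	pairs := [][2]float32{{5, 5}, {1, -1}, {71, 71.5}}
	for _, p := range pairs {
		generic := ApproxEq(p[0], p[1], 1e-7)
		inlined := approxEqInlined(p[0], p[1], 1e-7)
		fmt.Println(p, generic, inlined)
	}
}
```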

mdempsky · 2022-12-30T04:37:24Z

I'm on vacation (and currently on a plane), but briefly looking at the compiler's -m and -S output, it looks like everything is inlining the same. I don't see anything obviously wrong. (Caveat: I had to retype stuff from my phone onto my laptop and I simplified things slightly because of that.)

I'll take a look once I'm back in the office on Monday.

cherrymui · 2023-01-03T18:53:27Z

Hmmm, I got different results with tip vs. 1.19 or non-unified.

With Go 1.19,

$ go1.19 test -c -gcflags=-m 
# mymodule/math_test [mymodule/math.test]
./math.go:40:6: can inline "mymodule/math".Abs[go.shape.float32_0]
./math.go:41:26: inlining call to "math".Abs
./math.go:41:26: inlining call to "math".Float64bits
./math.go:41:26: inlining call to "math".Float64frombits
./math.go:44:6: can inline "mymodule/math".ApproxEq[go.shape.float32_0]
./math.go:45:19: inlining call to "mymodule/math".Abs[go.shape.float32_0]
./math.go:45:19: inlining call to "math".Abs
./math.go:45:19: inlining call to "math".Float64bits
./math.go:45:19: inlining call to "math".Float64frombits
./math.go:22:24: inlining call to "mymodule/math".ApproxEq[go.shape.float32_0]
./math.go:22:24: inlining call to "mymodule/math".Abs[go.shape.float32_0]
./math.go:22:24: inlining call to "math".Abs
./math.go:22:24: inlining call to "math".Float64bits
./math.go:22:24: inlining call to "math".Float64frombits
...

With tip,

$ go test -c -gcflags=-m 
# mymodule/math_test [mymodule/math.test]
./math.go:40:6: can inline math.Abs[go.shape.float32]
./math.go:44:6: can inline math.ApproxEq[go.shape.float32]
./math.go:45:19: inlining call to math.Abs[go.shape.float32]
./math.go:22:24: inlining call to math.ApproxEq[go.shape.float32]
./math.go:23:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:24:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:25:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:26:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:27:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:28:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:29:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:30:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:31:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:32:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:33:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:34:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:35:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:36:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:37:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:22:24: inlining call to math.Abs[go.shape.float32]
./math.go:23:25: inlining call to math.Abs[go.shape.float32]
./math.go:24:25: inlining call to math.Abs[go.shape.float32]
./math.go:25:25: inlining call to math.Abs[go.shape.float32]
./math.go:26:25: inlining call to math.Abs[go.shape.float32]
./math.go:27:25: inlining call to math.Abs[go.shape.float32]
./math.go:28:25: inlining call to math.Abs[go.shape.float32]
./math.go:29:25: inlining call to math.Abs[go.shape.float32]
./math.go:30:25: inlining call to math.Abs[go.shape.float32]
./math.go:31:25: inlining call to math.Abs[go.shape.float32]
./math.go:32:25: inlining call to math.Abs[go.shape.float32]
./math.go:33:25: inlining call to math.Abs[go.shape.float32]
./math.go:34:25: inlining call to math.Abs[go.shape.float32]
./math.go:35:25: inlining call to math.Abs[go.shape.float32]
./math.go:36:25: inlining call to math.Abs[go.shape.float32]
./math.go:37:25: inlining call to math.Abs[go.shape.float32]
./math_test.go:24:23: inlining call to testing.(*B).ReportAllocs
...

but there are no inlining messages for Float64bits and Float64frombits.

In particular,

$ go1.19 test -c -gcflags=-m 2>&1 | grep Float64frombits
./math.go:41:26: inlining call to "math".Float64frombits
./math.go:45:19: inlining call to "math".Float64frombits
./math.go:22:24: inlining call to "math".Float64frombits
./math.go:23:25: inlining call to "math".Float64frombits
./math.go:24:25: inlining call to "math".Float64frombits
./math.go:25:25: inlining call to "math".Float64frombits
./math.go:26:25: inlining call to "math".Float64frombits
./math.go:27:25: inlining call to "math".Float64frombits
./math.go:28:25: inlining call to "math".Float64frombits
./math.go:29:25: inlining call to "math".Float64frombits
./math.go:30:25: inlining call to "math".Float64frombits
./math.go:31:25: inlining call to "math".Float64frombits
./math.go:32:25: inlining call to "math".Float64frombits
./math.go:33:25: inlining call to "math".Float64frombits
./math.go:34:25: inlining call to "math".Float64frombits
./math.go:35:25: inlining call to "math".Float64frombits
./math.go:36:25: inlining call to "math".Float64frombits
./math.go:37:25: inlining call to "math".Float64frombits
$ go test -c -gcflags=-m 2>&1 | grep Float64frombits
$ # no output

mdempsky · 2023-01-03T19:12:19Z

@cherrymui Thanks, I'm able to repro the issue now. Not sure what went wrong with my earlier attempt.

mdempsky · 2023-01-03T19:48:23Z

The issue here is that unified IR has a simpler heuristic for deciding which function bodies to re-export. It simply re-exports functions that were inlined into the current compilation unit. (It also always exports its own inlinable functions.)

The problem manifests here because mymodule/math never instantiates its own generic types and functions, so math.{Abs,Float64bits,Float64frombits} never get inlined within that package and are therefore never re-exported by it. Then, when compiling mymodule/math_test, the inline bodies aren't available, so the calls don't get inlined.

Two possible workarounds:

- Instantiate the generic functions/types within mymodule/math. For example, add the two statements var _ Mat4[float64]; var _ Vec4[float64].
- Within mymodule/math_test, add an import _ "math" declaration. This makes sure the inline bodies of math.{Abs,Float64bits,Float64frombits} are available from the origin package, regardless of re-exporting.
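A minimal sketch of the first workaround, collapsed into a single file so it runs on its own. In the real layout the two var declarations would live in mymodule/math next to the generic definitions. The var _ = ApproxEq[float64] form is an assumption about how to instantiate a generic function in the same way, and whether either declaration restores the Go 1.19 timings would need to be confirmed by re-running the benchmarks:

```go
package main

import (
	"fmt"
	"math"
)

type Float interface {
	~float32 | ~float64
}

type Vec4[T Float] struct {
	X, Y, Z, W T
}

func Abs[T Float](x T) T {
	return T(math.Abs(float64(x)))
}

func ApproxEq[T Float](v1, v2, epsilon T) bool {
	return Abs(v1-v2) <= epsilon
}

func FMA[T Float](x, y, z T) T {
	return T(math.FMA(float64(x), float64(y), float64(z)))
}

func (v Vec4[T]) Dot(u Vec4[T]) T {
	return FMA(v.X, u.X, FMA(v.Y, u.Y, FMA(v.Z, u.Z, v.W*u.W)))
}

// Workaround 1: instantiate the generic type inside the defining package,
// as suggested in the comment above.
var _ Vec4[float64]

// Assumed equivalent for a generic function: reference one instantiation.
var _ = ApproxEq[float64]

// Workaround 2 (not shown here, since it belongs in the test package):
// add `import _ "math"` to a file in mymodule/math_test.

func main() {
	v := Vec4[float32]{1, 2, 3, 4}
	u := Vec4[float32]{5, 6, 7, 8}
	fmt.Println(v.Dot(u))               // 1*5 + 2*6 + 3*7 + 4*8
	fmt.Println(ApproxEq(v.X, 1, 1e-7)) // v.X is exactly 1
}
```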

There's supposed to be a compiler diagnostic to warn when this happens. I'm not sure at the moment why it's not firing.

mknyszek · 2023-06-09T18:40:17Z

Hey @mdempsky, doing a sweep of the Go 1.21 milestone. Any updates here? Should this go into Backlog? Thanks.

changkun changed the title ~~cmd/go: performance regression in 1.20~~ cmd/compile: performance regression in 1.20 Dec 29, 2022

gopherbot added the compiler/runtime label Dec 29, 2022

dmitshur added the Performance and NeedsInvestigation labels Dec 29, 2022

dmitshur added this to the Go1.20 milestone Dec 29, 2022

mdempsky self-assigned this Dec 30, 2022

mdempsky modified the milestones: Go1.20, Go1.21 Jan 4, 2023

cherrymui mentioned this issue Mar 16, 2023

cmd/compile: does not inline method of generic type across packages when there are multiple instantiations #59070

Open

mdempsky modified the milestones: Go1.21, Go1.22 Jun 27, 2023

gopherbot modified the milestones: Go1.22, Go1.23 Feb 6, 2024
