cmd/compile: generic functions are significantly slower than identical non-generic functions in some cases #50182
Not sure if this is expected at this stage or not, but it could be a consequence of the current implementation.
If you use
I still do not know what the
This is because we convert
@randall77, why does that lead to 14 allocations per iteration, though? I would expect that to result in 1 additional allocation per iteration. Instead, we seem to be allocating for each element of the slice.
It's one allocation per iteration of the loop in
@komuw Further reading:
Is there a reason to add this to Go 1.19 rather than 1.18 itself, considering 1.18 is still in beta?
I suspect any fix for this is going to be large + difficult. We don't want to put changes like that in at this point, especially if it is just for performance. |
Why isn't data.Less() inlined into the loop and then, particularly for integers, it becomes a single comparison instruction? You know the precise type at the call site, amirite? |
No, we don't. We know the type at the callsite implements |
@randall77 It would be useful to represent explicitly the knowledge that all dictionaries are read-only (static). Maybe we should represent access to a dictionary entry as a single operation that survives into the SSA passes, rather than immediately breaking it down into the indexing and loading operations, as we currently do during instantiation (stenciling). That way, we could ensure that the dictionary access operation takes no memory input, and therefore can always be CSE'd when it has identical inputs.

In this case (and probably most cases), it seems like we also need to make sure the CONVIDATA calls can be CSE'd, but they are transformed during the order() part of Walk() into run-time calls. We might want the CONVIDATA call to make it to SSA (so again, it can easily be CSE'd for the same input value), but that seems harder, since all the logic for understanding/translating CONVIDATA is in walk.go.
@randall77 But you do know the exact type; it's passed right there in the brackets: IsSortedGeneric[sort.IntSlice](data). The compiler should be able to deduce that it can short-circuit the interface machinery here, since only sort.IntSlice's Less() implementation can be called. This is a real issue. What's the point of generics if the compiler can't short-circuit the interface machinery when it knows the precise types? You lose the performance benefit of inlining if it can't.
We don't know it when compiling IsSortedGeneric. We don't compile |
@randall77 So, for example, does a generic sorting implementation (think sort.Stable, but generic instead of interface-based) basically compile down to something functionally identical to the existing sort.Stable that takes an interface? I've run into performance problems with sort.Sort and sort.Stable (copying the implementation to a separate file and running s/Interface/MyType/g can be many times faster), and I was hoping that generics would resolve this.
@g-talbot The goal of generics is not to provide faster execution time. The goal is to make it easier to write certain kinds of code with less repetitive boilerplate and more compile-time type checking. Rewriting code to use generics may sometimes make it faster. It may sometimes make it slower (as in this example). While we will continue to improve the compiler balancing the needs of fast builds and fast runtimes, it will never be the case that generic code will always be faster than non-generic code. That is not a goal. |
Change https://golang.org/cl/378178 mentions this issue: |
Just wanted to reference #48849 which to me is a very similar issue. |
Change https://go.dev/cl/385274 mentions this issue: |
Change https://go.dev/cl/385254 mentions this issue: |
Just wanted to share some benchmarks showing the improvement for the example at the top of this issue.
…ctionary

Currently we do quite a dance for method expressions on generic types. We write a new closure, and in that closure convert the receiver to an interface with the required method, then call the target using an interface call.

Instead, in this CL we just allocate a (captureless) closure in the dictionary which implements that method expression.

This CL makes method expressions faster and simpler. But the real win is some follow-on CLs, where we can use the same closure to implement bound method calls, instead of converting to interface and having wrappers convert back. Much faster and simpler.

Still thinking about how to do method values. The receiver still needs to be captured, so there must be some closure involved, I think.

Update #50182

Change-Id: I1fbd57e7105663f8b049955b8f4111649a5f4aa8
Reviewed-on: https://go-review.googlesource.com/c/go/+/385254
Trust: Keith Randall <khr@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
… calls

When we have x.M(args) where x is a value of type parameter type, we currently cast x to the bound of that type parameter (which is an interface) and then invoke the method on that interface. That's pretty inefficient because:

1) We need to convert x to an interface, which often requires allocation. With CL 378178 it is at least stack allocation, but allocation nonetheless.
2) We need to call through wrapper functions to unpack the interface into the right argument locations for the callee.

Instead, let's just call the target directly. The previous CL to this one added method expression closures to the dictionary: a simple captureless closure that implements T.M for type parameter T and method M. So to implement x.M(args) for x of type T, we use methodexpr(T,M)(x, args). We just need to move x from the receiver slot to the first argument, and use the dictionary entry to implement the polymorphism. This works because we stencil by shape, so we know how to marshal x for the call even though we don't know its exact type.

We should be able to revert CL 378178 after this one, as that optimization will no longer be necessary since we're not converting values to interfaces to implement this language construct anymore.

Update #50182

Change-Id: I813de4510e41ab63626e58bd1167f9ae93016202
Reviewed-on: https://go-review.googlesource.com/c/go/+/385274
Trust: Keith Randall <khr@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
… allocating

Let T be a type parameter, and say we instantiate it with S, a type that isn't pointer-like (e.g. a pair of ints, or as in #50182, a slice). Then to call a method m on a variable of type T, the compiler does essentially:

    var v T = ...
    i := (interface{ m() })(v)
    i.m()

The conversion on that second line allocates, as we need to make the data word for an interface. And in the general case, that interface may live an arbitrarily long time. But in this case, we know it doesn't.

The data word of i has type *S. When we call i.m, we can't call S.m directly; it is expecting an S, not a *S. We call through a wrapper defined on *S, which looks like:

    func (p *S) m() {
        var s S = *p
        s.m()
    }

The value passed in for p is exactly the data word mentioned above. It never escapes anywhere: the wrapper copies a type S variable out of *p, and p is dead after that. That means that in the situation where we build an interface for the explicit purpose of calling a method on it, and use that built interface nowhere else, the allocation of the data word for that interface is known to die before the call returns and thus can be stack allocated.

One tricky case: although the allocation of the backing store of the interface conversion doesn't escape, pointers we store *inside* that allocation might escape (in fact they definitely will, unless we can devirtualize the receiver).

Fixes golang#50182

Change-Id: I40e893955c2e6871c54ccecf1b9f0cae17871b0d
Reviewed-on: https://go-review.googlesource.com/c/go/+/378178
Trust: Keith Randall <khr@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
Trust: Dan Scales <danscales@google.com>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Dan Scales <danscales@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

It reproduces on the 1.18 Beta 1 release.

What operating system and processor architecture are you using (go env)?

go env Output

What did you do?
sort_test.go:
What did you expect to see?
IsSortedGeneric to be faster than, or at least very similar in runtime to, the otherwise identical sort.IsSorted.
What did you see instead?
This appears to be a problem specifically when methods are part of the type constraint; you may note above that the number of allocs (14) is exactly equal to the number of method calls (1 Len call, 13 Less calls). Compiler output points to runtime.convTslice, triggered at each method call, as the cause of the allocs.
A generic IsSorted implementation that restricts to a slice of constraints.Ordered does not have this slowdown. Experimentation also finds that supplying a pointer type instead of a value does not have this issue: