Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: adding panic to a never taken branch of an if inside a loop significantly speeds it up by removing knots in the CFG and marking that branch cold #70030

Closed
bboreham opened this issue Oct 24, 2024 · 7 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime.

Comments

@bboreham
Copy link
Contributor

Go version

go version go1.23.2 linux/amd64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/root/.cache/go-build'
GOENV='/root/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='local'
GOTOOLDIR='/usr/local/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.23.2'
GODEBUG=''
GOTELEMETRY='local'
GOTELEMETRYDIR='/root/.config/go/telemetry'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/tmp/bryan/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build2605208413=/tmp/go-build -gno-record-gcc-switches'

What did you do?

Ran this benchmark:
( I think I can't post benchmarks on go.dev/play)

package bryan

import (
	"testing"
)

func toLower(s string, a []byte) []byte {
	var buf []byte
	for i := 0; i < len(s); i++ {
		c := s[i]
		if 'A' <= c && c <= 'Z' {
			if buf == nil {
				if cap(a) > len(s) {
					buf = a[:len(s)]
					copy(buf, s)
					//panic("copy")  // Goes faster with this line uncommented.
				} else {
					buf = []byte(s)
				}
			}
			buf[i] = c + 'a' - 'A'
		}
	}
	return buf
}

func BenchmarkToLower(b *testing.B) {

	inputs := make([]string, 10)
	for i := range inputs {
		chars := "abcdefghijklmnopqrstuvwxyz"
		// Swap the alphabet to make alternatives.
		inputs[i] = chars[i%len(chars):] + chars[:i%len(chars)]
	}
	b.ResetTimer()
	for n := 0; n < b.N; n++ {
		var a [256]byte
		toLower(inputs[n%len(inputs)], a[:])
	}
}

What did you see happen?

With the panic it takes 25ns per op, and without it takes 45ns. (Panic is never executed)
According to the profiler, it's calling copy, which it shouldn't do. Hence I added a panic to see how it was called, and then it went faster.

What did you expect to see?

It shouldn't call copy. It shouldn't go faster if I add a redundant panic.

@gabyhelp
Copy link

Related Issues and Documentation

(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)

@Jorropo
Copy link
Member

Jorropo commented Oct 24, 2024

It shouldn't call copy. It shouldn't go faster if I add a redundant panic.

It is easy to assume that code that is never ran does not affect performance, code exists create side effects which could be optimized at runtime but a purely static "naive" compiler like go's one can't see that.
As a caricature example if you have something like this:

var alwaysFalse bool
var leak *int

func f(x *int) int {
 if alwaysFalse {
  leak = x
 }
 return *x
}

the go compiler will heap allocate x when it is passed in, while it could stack allocate it if you did const alwaysFalse = false.


For your function, without the panic it needs to handle cases where calling copy destroyed the values in the registers and they need to be restored. With the panic, that branch of code never goes back to loop, breaking the control flow graph and simplifying the compiler's job.
Altho []byte() calls runtime.stringtoslicebyte which also have this effect so the different aint that drastic.

More importantly the compiler assumes panics are bad, and treat them as cold, the panic is placed after the loop in memory, while the first one interweave the two clauses of the if, experimentally this has the biggest effect, manually taking the assembly for without panic and moving the never taken branch at the bottom of the funtion make it go* from 25ns to 20ns (while the compiler's with panic is 17ns but is significantly shorter).

* on my computer with the winds blowing 10km/h South-SouthEast

Tl;Dr

Adding panic to a never taken branch of an if inside a loop improves the loop layout by moving the never taken branch away from the loop and significantly reducing codesize by removing the need to restore registers after a runtime.memmove call.

I have multiple ideas on how to improve this, like interprocedural register allocation or using PGO to annotate cold branches but neither is particularly easy nor fast to implement, and have downsides so I don't know how to make this issue actionable.

cc @golang/compiler

@Jorropo Jorropo changed the title Benchmark goes faster if I add a panic. cmd/compile: adding panic to a never taken branch of an if inside a loop significantly speeds it up by removing knots in the CFG and marking that branch cold Oct 24, 2024
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Oct 24, 2024
@bboreham
Copy link
Contributor Author

But it did run, according to the profiler.

@bboreham
Copy link
Contributor Author

Maybe I misunderstood. Are you saying the time is attributed to that line, which is still not running?

@Jorropo
Copy link
Member

Jorropo commented Oct 24, 2024

I'm not sure I understand the question or where the profiler is, when I run your benchmark the branch does not run otherwise it would call panic which would blow up loudly. The benchmark does not blow up loudly so the branch does not run.

@randall77
Copy link
Contributor

The compiler knows that control flow never continues past a panic call. That avoids the compiler needing to do the merge of control flow at line 19.
I believe that accounts for the difference in time.

Inner loop with panic:

100108b78: 910004c6     add     x6, x6, #1
100108b7c: eb0200df     cmp     x6, x2
100108b80: 54fffb4a     b.ge    0x100108ae8 <_command-line-arguments.BenchmarkToLower+0xe8>
100108b84: 38666889     ldrb    w9, [x4, x6]
100108b88: d101052a     sub     x10, x9, #65
100108b8c: d3401d4a     ubfx    x10, x10, #0, #8
100108b90: 7100655f     cmp     w10, #25
100108b94: 54ffff28     b.hi    0x100108b78 <_command-line-arguments.BenchmarkToLower+0x178>

inner loop without panic:

100108b78: 910004c6     add     x6, x6, #1
100108b7c: aa0a03e7     mov     x7, x10
100108b80: aa0203e8     mov     x8, x2
100108b84: f940a3e2     ldr     x2, [sp, #320]
100108b88: eb0200df     cmp     x6, x2
100108b8c: 54fffaea     b.ge    0x100108ae8 <_command-line-arguments.BenchmarkToLower+0xe8>
100108b90: 38666889     ldrb    w9, [x4, x6]
100108b94: d101052a     sub     x10, x9, #65
100108b98: d3401d4a     ubfx    x10, x10, #0, #8
100108b9c: 7100655f     cmp     w10, #25
100108ba0: 540006c8     b.hi    0x100108c78 <_command-line-arguments.BenchmarkToLower+0x278>
...
100108c78: aa0803e2     mov     x2, x8
100108c7c: aa0703ea     mov     x10, x7
100108c80: 910103e7     add     x7, sp, #64
100108c84: f940a3e8     ldr     x8, [sp, #320]
100108c88: 17ffffbc     b       0x100108b78 <_command-line-arguments.BenchmarkToLower+0x178>

Note the code being jumped around (as @Jorropo says, in the panic case that code gets pushed to the end) and the additional mov instructions to get values back into the right registers for the merge. (Ideally regalloc should be better here, but nothing is perfect.)

@Jorropo
Copy link
Member

Jorropo commented Oct 24, 2024

According to my experiments 17ns vs 25ns, 5ns comes from improving the loop layout by marking the branch cold. And the remaining 3ns are unexplained but are likely the extra merge you pointed out as there isn't much other differences.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime.
Projects
None yet
Development

No branches or pull requests

5 participants