cmd/compile: utf8.DecodeRune should be fast/slow-path inlineable #31666
Comments
While we're at it, we should also make utf8.DecodeRuneInString inlineable in the same way.
Perhaps very clear cases like this can be used to start tweaking the inlining heuristic in the right direction, with the smallest steps possible.
@corona10 you're certainly welcome to take a look, but if you're looking for low-hanging-fruit issues to work on, this isn't a good candidate. It likely requires someone with a good understanding of the compiler, and it's not a simple fix; @josharian already gave it a try.
@mvdan
binary.Uvarint has the same problem: it's impossible to write the one-byte fast path as a mid-stack function without exceeding the inlining cost budget.
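To illustrate the pattern being asked for, here is a hedged sketch of what such a one-byte fast path could look like as a wrapper outside the package. The name `uvarint` is hypothetical; it delegates to the real `binary.Uvarint` for the multi-byte slow path:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// uvarint is a hypothetical wrapper showing the desired shape: a tiny
// fast path for the common one-byte case (values < 128), cheap enough
// that mid-stack inlining could hoist it into callers, with the full
// decoder as the out-of-line slow path.
func uvarint(buf []byte) (uint64, int) {
	if len(buf) > 0 && buf[0] < 0x80 {
		return uint64(buf[0]), 1 // fast path: single-byte varint
	}
	return binary.Uvarint(buf) // slow path: multi-byte varint
}

func main() {
	v, n := uvarint([]byte{0x7f})
	fmt.Println(v, n) // 127 1
	v, n = uvarint([]byte{0x80, 0x01})
	fmt.Println(v, n) // 128 2
}
```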
This commit adds an ASCII fast path to bytes/strings EqualFold that roughly doubles performance when all characters are ASCII. It also changes strings.EqualFold to use `for range` for the first string, since this is ~10% faster than using utf8.DecodeRuneInString for both (see golang#31666).

```
goos: darwin
goarch: arm64
pkg: strings
name                  old time/op  new time/op  delta
EqualFold/Tests-10    242ns ± 0%   176ns ± 1%   -27.34%  (p=0.008 n=5+5)
EqualFold/ASCII-10    21.0ns ± 0%  9.9ns ± 0%   -52.98%  (p=0.008 n=5+5)
EqualFold/Unicode-10  88.5ns ± 0%  79.2ns ± 0%  -10.52%  (p=0.008 n=5+5)
```

```
goos: windows
goarch: amd64
pkg: strings
cpu: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
EqualFold-16          264ns ± 3%   183ns ± 5%   -30.72%  (p=0.008 n=5+5)
EqualFold_ASCII-16    25.1ns ± 3%  10.4ns ± 1%  -58.72%  (p=0.008 n=5+5)
EqualFold_Unicode-16  97.6ns ± 2%  85.0ns ± 1%  -12.92%  (p=0.008 n=5+5)
```

Change-Id: I058f3f97a08dc04d65af895674d85420f920abe1
Change https://go.dev/cl/425459 mentions this issue:
This commit adds an ASCII fast path to bytes/strings EqualFold that roughly doubles performance when all characters are ASCII. It also changes strings.EqualFold to use `for range` for the first string, since this is ~10% faster than using utf8.DecodeRuneInString for both (see golang#31666).

Performance (similar results on arm64 and amd64):

```
name                  old time/op  new time/op  delta
EqualFold/Tests-10    242ns ± 0%   176ns ± 1%   -27.34%  (p=0.008 n=5+5)
EqualFold/ASCII-10    21.0ns ± 0%  9.9ns ± 0%   -52.98%  (p=0.008 n=5+5)
EqualFold/Unicode-10  88.5ns ± 0%  79.2ns ± 0%  -10.52%  (p=0.008 n=5+5)
```

Change-Id: I058f3f97a08dc04d65af895674d85420f920abe1
lexer.next is a hot function, as demonstrated by profiling. Most programs will consist of ASCII characters only, which we can optimize for. Ideally DecodeRuneInString would be inlined here and this wouldn't be a problem at all, but that won't be the case until golang/go#31666 is resolved.

```
name                 old time/op    new time/op    delta
Parse/lorem_ipsum-4  733ns ± 2%     733ns ± 2%     ~        (p=0.933 n=8+8)
Parse/short-4        16.6µs ± 1%    15.5µs ± 2%    -6.75%   (p=0.000 n=8+8)
Parse/medium-4       40.1µs ± 1%    38.7µs ± 1%    -3.51%   (p=0.000 n=8+8)
Parse/long-4         123µs ± 1%     115µs ± 1%     -5.86%   (p=0.001 n=7+7)
Parse/very-long-4    416µs ± 1%     396µs ± 1%     -4.70%   (p=0.000 n=8+8)

name                 old alloc/op   new alloc/op   delta
Parse/lorem_ipsum-4  1.16kB ± 0%    1.16kB ± 0%    ~  (all equal)
Parse/short-4        5.42kB ± 0%    5.42kB ± 0%    ~  (all equal)
Parse/medium-4       12.6kB ± 0%    12.6kB ± 0%    ~  (all equal)
Parse/long-4         35.3kB ± 0%    35.3kB ± 0%    ~  (all equal)
Parse/very-long-4    113kB ± 0%     113kB ± 0%     ~  (all equal)

name                 old allocs/op  new allocs/op  delta
Parse/lorem_ipsum-4  10.0 ± 0%      10.0 ± 0%      ~  (all equal)
Parse/short-4        107 ± 0%       107 ± 0%       ~  (all equal)
Parse/medium-4       296 ± 0%       296 ± 0%       ~  (all equal)
Parse/long-4         777 ± 0%       777 ± 0%       ~  (all equal)
Parse/very-long-4    2.61k ± 0%     2.61k ± 0%     ~  (all equal)
```
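As a rough sketch of the optimization this commit describes — take a single-byte shortcut for ASCII so the common case avoids the non-inlinable DecodeRuneInString call. The `lexer` type here is a minimal hypothetical stand-in, not the actual template lexer:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

const eof = -1

// lexer is a minimal stand-in; only the fields next needs are shown.
type lexer struct {
	input string
	pos   int
	width int
}

// next returns the next rune in the input. Bytes below utf8.RuneSelf
// (0x80) are complete runes on their own, so most program text never
// reaches the full UTF-8 decoder.
func (l *lexer) next() rune {
	if l.pos >= len(l.input) {
		l.width = 0
		return eof
	}
	if c := l.input[l.pos]; c < utf8.RuneSelf {
		// ASCII fast path: one byte, no decoding needed.
		l.width = 1
		l.pos++
		return rune(c)
	}
	r, w := utf8.DecodeRuneInString(l.input[l.pos:])
	l.width = w
	l.pos += w
	return r
}

func main() {
	l := &lexer{input: "aé"}
	fmt.Println(l.next() == 'a', l.next() == 'é', l.next() == eof)
}
```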
* template: add parse benchmarks

  Taken from the yagpdb-cc repository. To avoid licensing issues, I only used programs written by me. Baseline performance:

  ```
  name                 time/op
  Parse/lorem_ipsum-4  2.01µs ± 7%
  Parse/short-4        70.2µs ± 1%
  Parse/medium-4       165µs ± 1%
  Parse/long-4         517µs ± 1%
  Parse/very-long-4    1.79ms ± 1%

  name                 alloc/op
  Parse/lorem_ipsum-4  1.23kB ± 0%
  Parse/short-4        5.48kB ± 0%
  Parse/medium-4       12.7kB ± 0%
  Parse/long-4         35.3kB ± 0%
  Parse/very-long-4    113kB ± 0%

  name                 allocs/op
  Parse/lorem_ipsum-4  12.0 ± 0%
  Parse/short-4        109 ± 0%
  Parse/medium-4       298 ± 0%
  Parse/long-4         779 ± 0%
  Parse/very-long-4    2.61k ± 0%
  ```

* template: port upstream changes

  Notably, this includes https://go-review.googlesource.com/c/go/+/421883, which changes the lexer to not spawn a new goroutine.

  ```
  name                 old time/op    new time/op    delta
  Parse/lorem_ipsum-4  2.01µs ± 7%    0.73µs ± 2%    -63.56%  (p=0.000 n=7+8)
  Parse/short-4        70.2µs ± 1%    16.6µs ± 1%    -76.32%  (p=0.000 n=7+8)
  Parse/medium-4       165µs ± 1%     40µs ± 1%      -75.74%  (p=0.000 n=8+8)
  Parse/long-4         517µs ± 1%     123µs ± 1%     -76.28%  (p=0.000 n=8+7)
  Parse/very-long-4    1.79ms ± 1%    0.42ms ± 1%    -76.81%  (p=0.000 n=8+8)

  name                 old alloc/op   new alloc/op   delta
  Parse/lorem_ipsum-4  1.23kB ± 0%    1.16kB ± 0%    -5.56%   (p=0.000 n=8+8)
  Parse/short-4        5.48kB ± 0%    5.42kB ± 0%    -1.17%   (p=0.000 n=8+8)
  Parse/medium-4       12.7kB ± 0%    12.6kB ± 0%    -0.51%   (p=0.000 n=8+8)
  Parse/long-4         35.3kB ± 0%    35.3kB ± 0%    -0.17%   (p=0.000 n=8+8)
  Parse/very-long-4    113kB ± 0%     113kB ± 0%     -0.05%   (p=0.000 n=8+8)

  name                 old allocs/op  new allocs/op  delta
  Parse/lorem_ipsum-4  12.0 ± 0%      10.0 ± 0%      -16.67%  (p=0.000 n=8+8)
  Parse/short-4        109 ± 0%       107 ± 0%       -1.83%   (p=0.000 n=8+8)
  Parse/medium-4       298 ± 0%       296 ± 0%       -0.67%   (p=0.000 n=8+8)
  Parse/long-4         779 ± 0%       777 ± 0%       -0.26%   (p=0.000 n=8+8)
  Parse/very-long-4    2.61k ± 0%     2.61k ± 0%     -0.08%   (p=0.000 n=8+8)
  ```

* template: add ASCII fast path for lexer.next

  lexer.next is a hot function, as demonstrated by profiling. Most programs will consist of ASCII characters only, which we can optimize for. Ideally DecodeRuneInString would be inlined here and this wouldn't be a problem at all, but that won't be the case until golang/go#31666 is resolved.

  ```
  name                 old time/op    new time/op    delta
  Parse/lorem_ipsum-4  733ns ± 2%     733ns ± 2%     ~        (p=0.933 n=8+8)
  Parse/short-4        16.6µs ± 1%    15.5µs ± 2%    -6.75%   (p=0.000 n=8+8)
  Parse/medium-4       40.1µs ± 1%    38.7µs ± 1%    -3.51%   (p=0.000 n=8+8)
  Parse/long-4         123µs ± 1%     115µs ± 1%     -5.86%   (p=0.001 n=7+7)
  Parse/very-long-4    416µs ± 1%     396µs ± 1%     -4.70%   (p=0.000 n=8+8)

  name                 old alloc/op   new alloc/op   delta
  Parse/lorem_ipsum-4  1.16kB ± 0%    1.16kB ± 0%    ~  (all equal)
  Parse/short-4        5.42kB ± 0%    5.42kB ± 0%    ~  (all equal)
  Parse/medium-4       12.6kB ± 0%    12.6kB ± 0%    ~  (all equal)
  Parse/long-4         35.3kB ± 0%    35.3kB ± 0%    ~  (all equal)
  Parse/very-long-4    113kB ± 0%     113kB ± 0%     ~  (all equal)

  name                 old allocs/op  new allocs/op  delta
  Parse/lorem_ipsum-4  10.0 ± 0%      10.0 ± 0%      ~  (all equal)
  Parse/short-4        107 ± 0%       107 ± 0%       ~  (all equal)
  Parse/medium-4       296 ± 0%       296 ± 0%       ~  (all equal)
  Parse/long-4         777 ± 0%       777 ± 0%       ~  (all equal)
  Parse/very-long-4    2.61k ± 0%     2.61k ± 0%     ~  (all equal)
  ```

* template: preallocate for 4 args in CommandNode

  CommandNode.append, and by extension runtime.growslice, was showing up more than expected during profiling. Allocate enough space for four arguments up front so we don't need to reallocate as much. Although this doesn't benefit performance much, it does have a clear positive effect on memory usage.

  ```
  name                 old time/op    new time/op    delta
  Parse/lorem_ipsum-4  733ns ± 2%     728ns ± 2%     ~        (p=0.315 n=8+8)
  Parse/short-4        15.5µs ± 2%    15.2µs ± 1%    -2.12%   (p=0.001 n=8+8)
  Parse/medium-4       38.7µs ± 1%    38.0µs ± 1%    -1.99%   (p=0.000 n=8+8)
  Parse/long-4         115µs ± 1%     117µs ± 5%     ~        (p=0.281 n=7+8)
  Parse/very-long-4    396µs ± 1%     400µs ± 1%     +1.02%   (p=0.002 n=8+8)

  name                 old alloc/op   new alloc/op   delta
  Parse/lorem_ipsum-4  1.16kB ± 0%    1.16kB ± 0%    ~  (all equal)
  Parse/short-4        5.42kB ± 0%    5.08kB ± 0%    -6.20%   (p=0.000 n=8+8)
  Parse/medium-4       12.6kB ± 0%    12.7kB ± 0%    +0.51%   (p=0.000 n=8+8)
  Parse/long-4         35.3kB ± 0%    34.4kB ± 0%    -2.58%   (p=0.000 n=8+8)
  Parse/very-long-4    113kB ± 0%     110kB ± 0%     -2.37%   (p=0.000 n=8+8)

  name                 old allocs/op  new allocs/op  delta
  Parse/lorem_ipsum-4  10.0 ± 0%      10.0 ± 0%      ~  (all equal)
  Parse/short-4        107 ± 0%       93 ± 0%        -13.08%  (p=0.000 n=8+8)
  Parse/medium-4       296 ± 0%       276 ± 0%       -6.76%   (p=0.000 n=8+8)
  Parse/long-4         777 ± 0%       705 ± 0%       -9.27%   (p=0.000 n=8+8)
  Parse/very-long-4    2.61k ± 0%     2.35k ± 0%     -9.77%   (p=0.000 n=8+8)
  ```
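The preallocation idea above can be sketched in a few lines. The `node` and `commandNode` types here are hypothetical stand-ins for the template parser's `parse.Node` and `parse.CommandNode`, trimmed to just what the sketch needs:

```go
package main

import "fmt"

// node stands in for the parser's Node interface.
type node interface{}

// commandNode stands in for parse.CommandNode; only the args slice is shown.
type commandNode struct {
	args []node
}

// newCommandNode reserves capacity for four arguments up front, so
// appending the typical handful of arguments never has to grow the slice
// (i.e. never calls runtime.growslice).
func newCommandNode() *commandNode {
	return &commandNode{args: make([]node, 0, 4)}
}

func main() {
	n := newCommandNode()
	for _, a := range []node{"x", "y", "z"} {
		n.args = append(n.args, a) // no reallocation while len <= 4
	}
	fmt.Println(len(n.args), cap(n.args)) // 3 4
}
```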
This commit adds an ASCII fast path to bytes/strings EqualFold that roughly doubles performance when all characters are ASCII. It also changes strings.EqualFold to use `for range` for the first string, since this is ~10% faster than using utf8.DecodeRuneInString for both (see #31666).

Performance (similar results on arm64 and amd64):

```
name                        old time/op  new time/op  delta
EqualFold/Tests-10          238ns ± 0%   172ns ± 1%   -27.91%  (p=0.000 n=10+10)
EqualFold/ASCII-10          20.5ns ± 0%  9.7ns ± 0%   -52.73%  (p=0.000 n=10+10)
EqualFold/UnicodePrefix-10  86.5ns ± 0%  77.6ns ± 0%  -10.37%  (p=0.000 n=10+10)
EqualFold/UnicodeSuffix-10  86.8ns ± 2%  71.3ns ± 0%  -17.88%  (p=0.000 n=10+8)
```

Change-Id: I058f3f97a08dc04d65af895674d85420f920abe1
Reviewed-on: https://go-review.googlesource.com/c/go/+/425459
Reviewed-by: Ian Lance Taylor <iant@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: Ian Lance Taylor <iant@google.com>
Run-TryBot: Ian Lance Taylor <iant@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
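The two ideas in this commit — an ASCII byte loop, with a rune-wise fallback that iterates the first string via `for range` — can be sketched as follows. This is a hedged illustration, not the stdlib implementation; in particular it folds with `unicode.ToLower` rather than the full `unicode.SimpleFold` logic the real EqualFold uses:

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// equalFold compares s and t under simple case folding. While both
// strings are pure ASCII it compares folded bytes directly; the first
// non-ASCII byte diverts to the rune-wise slow path.
func equalFold(s, t string) bool {
	for len(s) > 0 && len(t) > 0 {
		if s[0] >= utf8.RuneSelf || t[0] >= utf8.RuneSelf {
			return equalFoldSlow(s, t)
		}
		if lowerASCII(s[0]) != lowerASCII(t[0]) {
			return false
		}
		s, t = s[1:], t[1:]
	}
	return len(s) == len(t)
}

// lowerASCII folds an ASCII byte: 'A'..'Z' map to 'a'..'z'.
func lowerASCII(c byte) byte {
	if 'A' <= c && c <= 'Z' {
		c += 'a' - 'A'
	}
	return c
}

// equalFoldSlow handles the general Unicode case. Iterating the first
// string with `for range` decodes its runes without an explicit
// DecodeRuneInString call — the ~10% win the commit message mentions.
func equalFoldSlow(s, t string) bool {
	for _, sr := range s {
		if len(t) == 0 {
			return false
		}
		tr, size := utf8.DecodeRuneInString(t)
		t = t[size:]
		if sr != tr && unicode.ToLower(sr) != unicode.ToLower(tr) {
			return false
		}
	}
	return len(t) == 0
}

func main() {
	fmt.Println(equalFold("Hello", "hELLO")) // true
	fmt.Println(equalFold("Gö", "gÖ"))       // true
	fmt.Println(equalFold("abc", "abd"))     // false
}
```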
We manually partially inline utf8.DecodeRune in a few places in the tree. For example, in bytes.EqualFold, we have:

It'd be nice to rewrite utf8.DecodeRune like this:

and then let mid-stack inlining take it from there. (And delete all the manual partial inlining.)
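As a hedged illustration of the proposed shape (the helper name `decodeRuneSlow` is hypothetical, and the slow path here simply delegates to the real decoder), the fast/slow-path split would look roughly like:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeRune sketches the proposed rewrite of utf8.DecodeRune: a cheap
// ASCII check small enough for mid-stack inlining to hoist into callers,
// with the general case split out into a separate slow-path function.
func decodeRune(p []byte) (r rune, size int) {
	if len(p) > 0 && p[0] < utf8.RuneSelf {
		return rune(p[0]), 1 // fast path: single-byte rune
	}
	return decodeRuneSlow(p)
}

// decodeRuneSlow stands in for the full multi-byte decoder.
func decodeRuneSlow(p []byte) (rune, int) {
	return utf8.DecodeRune(p)
}

func main() {
	r, n := decodeRune([]byte("a"))
	fmt.Println(r == 'a', n) // true 1
	r, n = decodeRune([]byte("é"))
	fmt.Println(r == 'é', n) // true 2
}
```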
Unfortunately, this has a cost of 89, over the budget of 80.
This is basically just another variant of #17566, but I'm filing it separately since that issue is super noisy.
cc @bradfitz @mvdan @dr2chase