cmd/compile: utf8.DecodeRune should be fast/slow-path inlineable #31666
Comments
While we're at it, we should also make utf8.DecodeRuneInString inlineable in the same way.
Perhaps very clear cases like this can be used to start tweaking the inlining heuristic in the right direction, with the smallest steps possible.
@corona10 you're certainly welcome to take a look, but if you're looking for low-hanging-fruit issues to work on, this isn't a good candidate. It likely requires someone with a good understanding of the compiler, and it's not a simple fix; @josharian already gave it a try.
@mvdan
binary.Uvarint has the same problem: it's impossible to write the one-byte fast path as a mid-stack function without exceeding the inlining cost budget.
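To illustrate the pattern being asked for, here is a hedged sketch of what such a one-byte fast path could look like as a wrapper outside the package. The name `uvarint` is hypothetical; it delegates to the real `binary.Uvarint` for the multi-byte slow path:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// uvarint is a hypothetical wrapper showing the desired shape: a tiny
// fast path for the common one-byte case (values < 128), cheap enough
// that mid-stack inlining could hoist it into callers, with the full
// decoder as the out-of-line slow path.
func uvarint(buf []byte) (uint64, int) {
	if len(buf) > 0 && buf[0] < 0x80 {
		return uint64(buf[0]), 1 // fast path: single-byte varint
	}
	return binary.Uvarint(buf) // slow path: multi-byte varint
}

func main() {
	v, n := uvarint([]byte{0x7f})
	fmt.Println(v, n) // 127 1
	v, n = uvarint([]byte{0x80, 0x01})
	fmt.Println(v, n) // 128 2
}
```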
This commit adds an ASCII fast path to bytes/strings EqualFold that roughly doubles performance when all characters are ASCII. It also changes strings.EqualFold to use `for range` for the first string, since this is ~10% faster than using utf8.DecodeRuneInString for both (see golang#31666).

```
goos: darwin
goarch: arm64
pkg: strings
name                  old time/op  new time/op  delta
EqualFold/Tests-10    242ns ± 0%   176ns ± 1%   -27.34%  (p=0.008 n=5+5)
EqualFold/ASCII-10    21.0ns ± 0%  9.9ns ± 0%   -52.98%  (p=0.008 n=5+5)
EqualFold/Unicode-10  88.5ns ± 0%  79.2ns ± 0%  -10.52%  (p=0.008 n=5+5)
```

```
goos: windows
goarch: amd64
pkg: strings
cpu: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
EqualFold-16          264ns ± 3%   183ns ± 5%   -30.72%  (p=0.008 n=5+5)
EqualFold_ASCII-16    25.1ns ± 3%  10.4ns ± 1%  -58.72%  (p=0.008 n=5+5)
EqualFold_Unicode-16  97.6ns ± 2%  85.0ns ± 1%  -12.92%  (p=0.008 n=5+5)
```

Change-Id: I058f3f97a08dc04d65af895674d85420f920abe1
Change https://go.dev/cl/425459 mentions this issue:
This commit adds an ASCII fast path to bytes/strings EqualFold that roughly doubles performance when all characters are ASCII. It also changes strings.EqualFold to use `for range` for the first string, since this is ~10% faster than using utf8.DecodeRuneInString for both (see golang#31666).

Performance (similar results on arm64 and amd64):

```
name                  old time/op  new time/op  delta
EqualFold/Tests-10    242ns ± 0%   176ns ± 1%   -27.34%  (p=0.008 n=5+5)
EqualFold/ASCII-10    21.0ns ± 0%  9.9ns ± 0%   -52.98%  (p=0.008 n=5+5)
EqualFold/Unicode-10  88.5ns ± 0%  79.2ns ± 0%  -10.52%  (p=0.008 n=5+5)
```

Change-Id: I058f3f97a08dc04d65af895674d85420f920abe1
lexer.next is a hot function, as demonstrated by profiling. Most programs will consist of ASCII characters only, which we can optimize for. Ideally DecodeRuneInString would be inlined here and this wouldn't be a problem at all, but that won't be the case until golang/go#31666 is resolved.

```
name                 old time/op    new time/op    delta
Parse/lorem_ipsum-4  733ns ± 2%     733ns ± 2%     ~        (p=0.933 n=8+8)
Parse/short-4        16.6µs ± 1%    15.5µs ± 2%    -6.75%   (p=0.000 n=8+8)
Parse/medium-4       40.1µs ± 1%    38.7µs ± 1%    -3.51%   (p=0.000 n=8+8)
Parse/long-4         123µs ± 1%     115µs ± 1%     -5.86%   (p=0.001 n=7+7)
Parse/very-long-4    416µs ± 1%     396µs ± 1%     -4.70%   (p=0.000 n=8+8)

name                 old alloc/op   new alloc/op   delta
Parse/lorem_ipsum-4  1.16kB ± 0%    1.16kB ± 0%    ~  (all equal)
Parse/short-4        5.42kB ± 0%    5.42kB ± 0%    ~  (all equal)
Parse/medium-4       12.6kB ± 0%    12.6kB ± 0%    ~  (all equal)
Parse/long-4         35.3kB ± 0%    35.3kB ± 0%    ~  (all equal)
Parse/very-long-4    113kB ± 0%     113kB ± 0%     ~  (all equal)

name                 old allocs/op  new allocs/op  delta
Parse/lorem_ipsum-4  10.0 ± 0%      10.0 ± 0%      ~  (all equal)
Parse/short-4        107 ± 0%       107 ± 0%       ~  (all equal)
Parse/medium-4       296 ± 0%       296 ± 0%       ~  (all equal)
Parse/long-4         777 ± 0%       777 ± 0%       ~  (all equal)
Parse/very-long-4    2.61k ± 0%     2.61k ± 0%     ~  (all equal)
```
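As a rough sketch of the optimization this commit describes — take a single-byte shortcut for ASCII so the common case avoids the non-inlinable DecodeRuneInString call. The `lexer` type here is a minimal hypothetical stand-in, not the actual template lexer:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

const eof = -1

// lexer is a minimal stand-in; only the fields next needs are shown.
type lexer struct {
	input string
	pos   int
	width int
}

// next returns the next rune in the input. Bytes below utf8.RuneSelf
// (0x80) are complete runes on their own, so most program text never
// reaches the full UTF-8 decoder.
func (l *lexer) next() rune {
	if l.pos >= len(l.input) {
		l.width = 0
		return eof
	}
	if c := l.input[l.pos]; c < utf8.RuneSelf {
		// ASCII fast path: one byte, no decoding needed.
		l.width = 1
		l.pos++
		return rune(c)
	}
	r, w := utf8.DecodeRuneInString(l.input[l.pos:])
	l.width = w
	l.pos += w
	return r
}

func main() {
	l := &lexer{input: "aé"}
	fmt.Println(l.next() == 'a', l.next() == 'é', l.next() == eof)
}
```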
* template: add parse benchmarks

  Taken from the yagpdb-cc repository. To avoid licensing issues, I only used programs written by me. Baseline performance:

  ```
  name                 time/op
  Parse/lorem_ipsum-4  2.01µs ± 7%
  Parse/short-4        70.2µs ± 1%
  Parse/medium-4       165µs ± 1%
  Parse/long-4         517µs ± 1%
  Parse/very-long-4    1.79ms ± 1%

  name                 alloc/op
  Parse/lorem_ipsum-4  1.23kB ± 0%
  Parse/short-4        5.48kB ± 0%
  Parse/medium-4       12.7kB ± 0%
  Parse/long-4         35.3kB ± 0%
  Parse/very-long-4    113kB ± 0%

  name                 allocs/op
  Parse/lorem_ipsum-4  12.0 ± 0%
  Parse/short-4        109 ± 0%
  Parse/medium-4       298 ± 0%
  Parse/long-4         779 ± 0%
  Parse/very-long-4    2.61k ± 0%
  ```

* template: port upstream changes

  Notably, this includes https://go-review.googlesource.com/c/go/+/421883, which changes the lexer to not spawn a new goroutine.

  ```
  name                 old time/op    new time/op    delta
  Parse/lorem_ipsum-4  2.01µs ± 7%    0.73µs ± 2%    -63.56%  (p=0.000 n=7+8)
  Parse/short-4        70.2µs ± 1%    16.6µs ± 1%    -76.32%  (p=0.000 n=7+8)
  Parse/medium-4       165µs ± 1%     40µs ± 1%      -75.74%  (p=0.000 n=8+8)
  Parse/long-4         517µs ± 1%     123µs ± 1%     -76.28%  (p=0.000 n=8+7)
  Parse/very-long-4    1.79ms ± 1%    0.42ms ± 1%    -76.81%  (p=0.000 n=8+8)

  name                 old alloc/op   new alloc/op   delta
  Parse/lorem_ipsum-4  1.23kB ± 0%    1.16kB ± 0%    -5.56%   (p=0.000 n=8+8)
  Parse/short-4        5.48kB ± 0%    5.42kB ± 0%    -1.17%   (p=0.000 n=8+8)
  Parse/medium-4       12.7kB ± 0%    12.6kB ± 0%    -0.51%   (p=0.000 n=8+8)
  Parse/long-4         35.3kB ± 0%    35.3kB ± 0%    -0.17%   (p=0.000 n=8+8)
  Parse/very-long-4    113kB ± 0%     113kB ± 0%     -0.05%   (p=0.000 n=8+8)

  name                 old allocs/op  new allocs/op  delta
  Parse/lorem_ipsum-4  12.0 ± 0%      10.0 ± 0%      -16.67%  (p=0.000 n=8+8)
  Parse/short-4        109 ± 0%       107 ± 0%       -1.83%   (p=0.000 n=8+8)
  Parse/medium-4       298 ± 0%       296 ± 0%       -0.67%   (p=0.000 n=8+8)
  Parse/long-4         779 ± 0%       777 ± 0%       -0.26%   (p=0.000 n=8+8)
  Parse/very-long-4    2.61k ± 0%     2.61k ± 0%     -0.08%   (p=0.000 n=8+8)
  ```

* template: add ASCII fast path for lexer.next

  lexer.next is a hot function, as demonstrated by profiling. Most programs will consist of ASCII characters only, which we can optimize for. Ideally DecodeRuneInString would be inlined here and this wouldn't be a problem at all, but that won't be the case until golang/go#31666 is resolved.

  ```
  name                 old time/op    new time/op    delta
  Parse/lorem_ipsum-4  733ns ± 2%     733ns ± 2%     ~        (p=0.933 n=8+8)
  Parse/short-4        16.6µs ± 1%    15.5µs ± 2%    -6.75%   (p=0.000 n=8+8)
  Parse/medium-4       40.1µs ± 1%    38.7µs ± 1%    -3.51%   (p=0.000 n=8+8)
  Parse/long-4         123µs ± 1%     115µs ± 1%     -5.86%   (p=0.001 n=7+7)
  Parse/very-long-4    416µs ± 1%     396µs ± 1%     -4.70%   (p=0.000 n=8+8)

  name                 old alloc/op   new alloc/op   delta
  Parse/lorem_ipsum-4  1.16kB ± 0%    1.16kB ± 0%    ~  (all equal)
  Parse/short-4        5.42kB ± 0%    5.42kB ± 0%    ~  (all equal)
  Parse/medium-4       12.6kB ± 0%    12.6kB ± 0%    ~  (all equal)
  Parse/long-4         35.3kB ± 0%    35.3kB ± 0%    ~  (all equal)
  Parse/very-long-4    113kB ± 0%     113kB ± 0%     ~  (all equal)

  name                 old allocs/op  new allocs/op  delta
  Parse/lorem_ipsum-4  10.0 ± 0%      10.0 ± 0%      ~  (all equal)
  Parse/short-4        107 ± 0%       107 ± 0%       ~  (all equal)
  Parse/medium-4       296 ± 0%       296 ± 0%       ~  (all equal)
  Parse/long-4         777 ± 0%       777 ± 0%       ~  (all equal)
  Parse/very-long-4    2.61k ± 0%     2.61k ± 0%     ~  (all equal)
  ```

* template: preallocate for 4 args in CommandNode

  CommandNode.append, and by extension runtime.growslice, was showing up more than expected during profiling. Allocate enough space for four arguments up front so we don't need to reallocate as much. Although this doesn't benefit performance much, it does have a clear positive effect on memory usage.

  ```
  name                 old time/op    new time/op    delta
  Parse/lorem_ipsum-4  733ns ± 2%     728ns ± 2%     ~        (p=0.315 n=8+8)
  Parse/short-4        15.5µs ± 2%    15.2µs ± 1%    -2.12%   (p=0.001 n=8+8)
  Parse/medium-4       38.7µs ± 1%    38.0µs ± 1%    -1.99%   (p=0.000 n=8+8)
  Parse/long-4         115µs ± 1%     117µs ± 5%     ~        (p=0.281 n=7+8)
  Parse/very-long-4    396µs ± 1%     400µs ± 1%     +1.02%   (p=0.002 n=8+8)

  name                 old alloc/op   new alloc/op   delta
  Parse/lorem_ipsum-4  1.16kB ± 0%    1.16kB ± 0%    ~  (all equal)
  Parse/short-4        5.42kB ± 0%    5.08kB ± 0%    -6.20%   (p=0.000 n=8+8)
  Parse/medium-4       12.6kB ± 0%    12.7kB ± 0%    +0.51%   (p=0.000 n=8+8)
  Parse/long-4         35.3kB ± 0%    34.4kB ± 0%    -2.58%   (p=0.000 n=8+8)
  Parse/very-long-4    113kB ± 0%     110kB ± 0%     -2.37%   (p=0.000 n=8+8)

  name                 old allocs/op  new allocs/op  delta
  Parse/lorem_ipsum-4  10.0 ± 0%      10.0 ± 0%      ~  (all equal)
  Parse/short-4        107 ± 0%       93 ± 0%        -13.08%  (p=0.000 n=8+8)
  Parse/medium-4       296 ± 0%       276 ± 0%       -6.76%   (p=0.000 n=8+8)
  Parse/long-4         777 ± 0%       705 ± 0%       -9.27%   (p=0.000 n=8+8)
  Parse/very-long-4    2.61k ± 0%     2.35k ± 0%     -9.77%   (p=0.000 n=8+8)
  ```
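The preallocation idea above can be sketched in a few lines. The `node` and `commandNode` types here are hypothetical stand-ins for the template parser's `parse.Node` and `parse.CommandNode`, trimmed to just what the sketch needs:

```go
package main

import "fmt"

// node stands in for the parser's Node interface.
type node interface{}

// commandNode stands in for parse.CommandNode; only the args slice is shown.
type commandNode struct {
	args []node
}

// newCommandNode reserves capacity for four arguments up front, so
// appending the typical handful of arguments never has to grow the slice
// (i.e. never calls runtime.growslice).
func newCommandNode() *commandNode {
	return &commandNode{args: make([]node, 0, 4)}
}

func main() {
	n := newCommandNode()
	for _, a := range []node{"x", "y", "z"} {
		n.args = append(n.args, a) // no reallocation while len <= 4
	}
	fmt.Println(len(n.args), cap(n.args)) // 3 4
}
```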
This commit adds an ASCII fast path to bytes/strings EqualFold that roughly doubles performance when all characters are ASCII. It also changes strings.EqualFold to use `for range` for the first string, since this is ~10% faster than using utf8.DecodeRuneInString for both (see #31666).

Performance (similar results on arm64 and amd64):

```
name                        old time/op  new time/op  delta
EqualFold/Tests-10          238ns ± 0%   172ns ± 1%   -27.91%  (p=0.000 n=10+10)
EqualFold/ASCII-10          20.5ns ± 0%  9.7ns ± 0%   -52.73%  (p=0.000 n=10+10)
EqualFold/UnicodePrefix-10  86.5ns ± 0%  77.6ns ± 0%  -10.37%  (p=0.000 n=10+10)
EqualFold/UnicodeSuffix-10  86.8ns ± 2%  71.3ns ± 0%  -17.88%  (p=0.000 n=10+8)
```

Change-Id: I058f3f97a08dc04d65af895674d85420f920abe1
Reviewed-on: https://go-review.googlesource.com/c/go/+/425459
Reviewed-by: Ian Lance Taylor <iant@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: Ian Lance Taylor <iant@google.com>
Run-TryBot: Ian Lance Taylor <iant@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
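The two ideas in this commit — an ASCII byte loop, with a rune-wise fallback that iterates the first string via `for range` — can be sketched as follows. This is a hedged illustration, not the stdlib implementation; in particular it folds with `unicode.ToLower` rather than the full `unicode.SimpleFold` logic the real EqualFold uses:

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// equalFold compares s and t under simple case folding. While both
// strings are pure ASCII it compares folded bytes directly; the first
// non-ASCII byte diverts to the rune-wise slow path.
func equalFold(s, t string) bool {
	for len(s) > 0 && len(t) > 0 {
		if s[0] >= utf8.RuneSelf || t[0] >= utf8.RuneSelf {
			return equalFoldSlow(s, t)
		}
		if lowerASCII(s[0]) != lowerASCII(t[0]) {
			return false
		}
		s, t = s[1:], t[1:]
	}
	return len(s) == len(t)
}

// lowerASCII folds an ASCII byte: 'A'..'Z' map to 'a'..'z'.
func lowerASCII(c byte) byte {
	if 'A' <= c && c <= 'Z' {
		c += 'a' - 'A'
	}
	return c
}

// equalFoldSlow handles the general Unicode case. Iterating the first
// string with `for range` decodes its runes without an explicit
// DecodeRuneInString call — the ~10% win the commit message mentions.
func equalFoldSlow(s, t string) bool {
	for _, sr := range s {
		if len(t) == 0 {
			return false
		}
		tr, size := utf8.DecodeRuneInString(t)
		t = t[size:]
		if sr != tr && unicode.ToLower(sr) != unicode.ToLower(tr) {
			return false
		}
	}
	return len(t) == 0
}

func main() {
	fmt.Println(equalFold("Hello", "hELLO")) // true
	fmt.Println(equalFold("Gö", "gÖ"))       // true
	fmt.Println(equalFold("abc", "abd"))     // false
}
```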
We manually partially inline utf8.DecodeRune in a few places in the tree. For example, in bytes.EqualFold, we have:

It'd be nice to rewrite utf8.DecodeRune like this:

and then let mid-stack inlining take it from there. (And delete all the manual partial inlining.)
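As a hedged illustration of the proposed shape (the helper name `decodeRuneSlow` is hypothetical, and the slow path here simply delegates to the real decoder), the fast/slow-path split would look roughly like:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeRune sketches the proposed rewrite of utf8.DecodeRune: a cheap
// ASCII check small enough for mid-stack inlining to hoist into callers,
// with the general case split out into a separate slow-path function.
func decodeRune(p []byte) (r rune, size int) {
	if len(p) > 0 && p[0] < utf8.RuneSelf {
		return rune(p[0]), 1 // fast path: single-byte rune
	}
	return decodeRuneSlow(p)
}

// decodeRuneSlow stands in for the full multi-byte decoder.
func decodeRuneSlow(p []byte) (rune, int) {
	return utf8.DecodeRune(p)
}

func main() {
	r, n := decodeRune([]byte("a"))
	fmt.Println(r == 'a', n) // true 1
	r, n = decodeRune([]byte("é"))
	fmt.Println(r == 'é', n) // true 2
}
```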
Unfortunately, this has a cost of 89, over the budget of 80.
This is basically just another variant of #17566, but I'm filing it separately since that issue is super noisy.
cc @bradfitz @mvdan @dr2chase