
cmd/compile: utf8.DecodeRune should be fast/slow-path inlineable #31666

Open
josharian opened this issue Apr 24, 2019 · 7 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done. Performance
Milestone

Comments

@josharian
Contributor

We manually partially inline utf8.DecodeRune in a few places in the tree. For example, in bytes.EqualFold, we have:

		if s[0] < utf8.RuneSelf {
			sr, s = rune(s[0]), s[1:]
		} else {
			r, size := utf8.DecodeRune(s)
			sr, s = r, s[size:]
		}

It'd be nice to rewrite utf8.DecodeRune like this:

func DecodeRune(p []byte) (r rune, size int) {
	if len(p) > 0 && p[0] < RuneSelf {
		return rune(p[0]), 1 // ASCII fast path: one byte
	}
	return decodeRuneSlow(p) // full multi-byte decoding
}

and then let mid-stack inlining take it from there. (And delete all the manual partial inlining.)

Unfortunately, this version has an inlining cost of 89, over the budget of 80.

This is basically just another variant of #17566, but I'm filing it separately since that issue is super noisy.

cc @bradfitz @mvdan @dr2chase

@josharian josharian added Performance NeedsFix The path to resolution is known, but the work has not been done. labels Apr 24, 2019
@josharian josharian added this to the Unplanned milestone Apr 24, 2019
@martisch
Contributor

While we're at it, we should also make this work for utf8.DecodeRuneInString.
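The string variant would take the same fast/slow shape. A minimal sketch, assuming a hypothetical `decodeRuneInStringSlow` helper (here it simply defers to the real package):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// DecodeRuneInString mirrors the proposed fast/slow split for the
// string variant. Names are illustrative, not the real utf8 internals.
func DecodeRuneInString(s string) (r rune, size int) {
	if len(s) > 0 && s[0] < utf8.RuneSelf {
		return rune(s[0]), 1 // ASCII fast path: one byte, cheap enough to inline
	}
	return decodeRuneInStringSlow(s)
}

// decodeRuneInStringSlow stands in for the full multi-byte decoder;
// here it defers to the standard library.
func decodeRuneInStringSlow(s string) (rune, int) {
	return utf8.DecodeRuneInString(s)
}

func main() {
	fmt.Println(DecodeRuneInString("a")) // 97 1
	fmt.Println(DecodeRuneInString("é")) // 233 2
}
```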

@mvdan
Member

mvdan commented May 8, 2019

Perhaps very clear cases like this can be used to start tweaking the inlining heuristic in the right direction, with the smallest steps possible.

@corona10
Contributor

@martisch @mvdan Can I take a look at this issue?

@mvdan
Member

mvdan commented May 13, 2019

@corona10 you're certainly welcome to take a look, but if you're looking for low-hanging fruit issues to work on, this isn't a good candidate. It likely requires someone with a good understanding of the compiler, and it's not a simple fix, as @josharian already gave that a try.

@corona10
Contributor

@mvdan
Oh, @josharian already tried it, so this isn't a simple issue after all.
I thought it was just a matter of rewriting the function so that the compiler could mid-stack inline it, but I didn't realize the inlining cost was a factor. I'll look for some easier issues.

@renthraysk

binary.Uvarint has the same problem: it's impossible to write the one-byte fast path as a mid-stack function without exceeding the inlining cost budget.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 13, 2022
charlievieth added a commit to charlievieth/go that referenced this issue Aug 25, 2022
This commit adds an ASCII fast path to bytes/strings EqualFold that
roughly doubles performance when all characters are ASCII.

It also changes strings.EqualFold to use `for range` for the first
string since this is ~10% faster than using utf8.DecodeRuneInString for
both (see golang#31666).

```
goos: darwin
goarch: arm64
pkg: strings

name                  old time/op  new time/op  delta
EqualFold/Tests-10     242ns ± 0%   176ns ± 1%  -27.34%  (p=0.008 n=5+5)
EqualFold/ASCII-10    21.0ns ± 0%   9.9ns ± 0%  -52.98%  (p=0.008 n=5+5)
EqualFold/Unicode-10  88.5ns ± 0%  79.2ns ± 0%  -10.52%  (p=0.008 n=5+5)
```

```
goos: windows
goarch: amd64
pkg: strings
cpu: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz

EqualFold-16           264ns ± 3%   183ns ± 5%  -30.72%  (p=0.008 n=5+5)
EqualFold_ASCII-16    25.1ns ± 3%  10.4ns ± 1%  -58.72%  (p=0.008 n=5+5)
EqualFold_Unicode-16  97.6ns ± 2%  85.0ns ± 1%  -12.92%  (p=0.008 n=5+5)
```

Change-Id: I058f3f97a08dc04d65af895674d85420f920abe1
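The shape of such an ASCII fast path can be sketched as follows. This is an illustrative version under assumed simplifications, not the actual stdlib change: compare byte-by-byte while both inputs are ASCII, and fall back to the general Unicode case-folding comparison at the first non-ASCII byte.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// equalFold reports whether s and t are equal under simple Unicode
// case-folding, with an ASCII fast path.
func equalFold(s, t string) bool {
	i := 0
	for ; i < len(s) && i < len(t); i++ {
		sc, tc := s[i], t[i]
		if sc|tc >= utf8.RuneSelf {
			// Non-ASCII byte: hand the rest to the general case.
			return strings.EqualFold(s[i:], t[i:])
		}
		// Fold ASCII upper case to lower case.
		if 'A' <= sc && sc <= 'Z' {
			sc += 'a' - 'A'
		}
		if 'A' <= tc && tc <= 'Z' {
			tc += 'a' - 'A'
		}
		if sc != tc {
			return false
		}
	}
	// Equal only if both strings were fully consumed.
	return len(s) == len(t)
}

func main() {
	fmt.Println(equalFold("Hello", "hELLO")) // true
	fmt.Println(equalFold("Go", "Got"))      // false
	fmt.Println(equalFold("K", "\u212a"))    // true (Kelvin sign folds to k)
}
```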
@gopherbot

Change https://go.dev/cl/425459 mentions this issue: bytes/strings: add ASCII fast path to EqualFold

charlievieth added a commit to charlievieth/go that referenced this issue Aug 26, 2022
This commit adds an ASCII fast path to bytes/strings EqualFold that
roughly doubles performance when all characters are ASCII.

It also changes strings.EqualFold to use `for range` for the first
string since this is ~10% faster than using utf8.DecodeRuneInString for
both (see golang#31666).

Performance (similar results on arm64 and amd64):

name                  old time/op  new time/op  delta
EqualFold/Tests-10     242ns ± 0%   176ns ± 1%  -27.34%  (p=0.008 n=5+5)
EqualFold/ASCII-10    21.0ns ± 0%   9.9ns ± 0%  -52.98%  (p=0.008 n=5+5)
EqualFold/Unicode-10  88.5ns ± 0%  79.2ns ± 0%  -10.52%  (p=0.008 n=5+5)

Change-Id: I058f3f97a08dc04d65af895674d85420f920abe1
jo3-l added a commit to jo3-l/yagpdb that referenced this issue Aug 27, 2022
lexer.next is a hot function, as demonstrated by profiling. Most programs
will consist of ASCII characters only, which we can optimize for. Ideally
DecodeRuneInString would be inlined here and this wouldn't be a problem at all,
but that won't be the case until golang/go#31666
is resolved.

name                 old time/op    new time/op    delta
Parse/lorem_ipsum-4     733ns ± 2%     733ns ± 2%    ~     (p=0.933 n=8+8)
Parse/short-4          16.6µs ± 1%    15.5µs ± 2%  -6.75%  (p=0.000 n=8+8)
Parse/medium-4         40.1µs ± 1%    38.7µs ± 1%  -3.51%  (p=0.000 n=8+8)
Parse/long-4            123µs ± 1%     115µs ± 1%  -5.86%  (p=0.001 n=7+7)
Parse/very-long-4       416µs ± 1%     396µs ± 1%  -4.70%  (p=0.000 n=8+8)

name                 old alloc/op   new alloc/op   delta
Parse/lorem_ipsum-4    1.16kB ± 0%    1.16kB ± 0%    ~     (all equal)
Parse/short-4          5.42kB ± 0%    5.42kB ± 0%    ~     (all equal)
Parse/medium-4         12.6kB ± 0%    12.6kB ± 0%    ~     (all equal)
Parse/long-4           35.3kB ± 0%    35.3kB ± 0%    ~     (all equal)
Parse/very-long-4       113kB ± 0%     113kB ± 0%    ~     (all equal)

name                 old allocs/op  new allocs/op  delta
Parse/lorem_ipsum-4      10.0 ± 0%      10.0 ± 0%    ~     (all equal)
Parse/short-4             107 ± 0%       107 ± 0%    ~     (all equal)
Parse/medium-4            296 ± 0%       296 ± 0%    ~     (all equal)
Parse/long-4              777 ± 0%       777 ± 0%    ~     (all equal)
Parse/very-long-4       2.61k ± 0%     2.61k ± 0%    ~     (all equal)
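The lexer fast path described above can be sketched like this; the `lexer` type and its fields are illustrative stand-ins, not yagpdb's actual code:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

const eof = -1

// lexer is a minimal illustration of a text/template-style lexer.
type lexer struct {
	input string
	pos   int
	width int
}

// next returns the next rune, taking a one-byte fast path when the
// leading byte is ASCII, so utf8.DecodeRuneInString is only called
// for multi-byte runes.
func (l *lexer) next() rune {
	if l.pos >= len(l.input) {
		l.width = 0
		return eof
	}
	if c := l.input[l.pos]; c < utf8.RuneSelf {
		l.pos++
		l.width = 1
		return rune(c)
	}
	r, w := utf8.DecodeRuneInString(l.input[l.pos:])
	l.pos += w
	l.width = w
	return r
}

func main() {
	l := &lexer{input: "aé"}
	fmt.Println(l.next()) // 97
	fmt.Println(l.next()) // 233
	fmt.Println(l.next()) // -1
}
```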
ashishjh-bst pushed a commit to botlabs-gg/yagpdb that referenced this issue Aug 29, 2022
* template: add parse benchmarks

Taken from the yagpdb-cc repository. To avoid licensing issues, I only used
programs written by me.

Baseline performance:

name                 time/op
Parse/lorem_ipsum-4  2.01µs ± 7%
Parse/short-4        70.2µs ± 1%
Parse/medium-4        165µs ± 1%
Parse/long-4          517µs ± 1%
Parse/very-long-4    1.79ms ± 1%

name                 alloc/op
Parse/lorem_ipsum-4  1.23kB ± 0%
Parse/short-4        5.48kB ± 0%
Parse/medium-4       12.7kB ± 0%
Parse/long-4         35.3kB ± 0%
Parse/very-long-4     113kB ± 0%

name                 allocs/op
Parse/lorem_ipsum-4    12.0 ± 0%
Parse/short-4           109 ± 0%
Parse/medium-4          298 ± 0%
Parse/long-4            779 ± 0%
Parse/very-long-4     2.61k ± 0%

* template: port upstream changes

Notably, this includes https://go-review.googlesource.com/c/go/+/421883,
which changes the lexer to not spawn a new goroutine.

name                 old time/op    new time/op    delta
Parse/lorem_ipsum-4    2.01µs ± 7%    0.73µs ± 2%  -63.56%  (p=0.000 n=7+8)
Parse/short-4          70.2µs ± 1%    16.6µs ± 1%  -76.32%  (p=0.000 n=7+8)
Parse/medium-4          165µs ± 1%      40µs ± 1%  -75.74%  (p=0.000 n=8+8)
Parse/long-4            517µs ± 1%     123µs ± 1%  -76.28%  (p=0.000 n=8+7)
Parse/very-long-4      1.79ms ± 1%    0.42ms ± 1%  -76.81%  (p=0.000 n=8+8)

name                 old alloc/op   new alloc/op   delta
Parse/lorem_ipsum-4    1.23kB ± 0%    1.16kB ± 0%   -5.56%  (p=0.000 n=8+8)
Parse/short-4          5.48kB ± 0%    5.42kB ± 0%   -1.17%  (p=0.000 n=8+8)
Parse/medium-4         12.7kB ± 0%    12.6kB ± 0%   -0.51%  (p=0.000 n=8+8)
Parse/long-4           35.3kB ± 0%    35.3kB ± 0%   -0.17%  (p=0.000 n=8+8)
Parse/very-long-4       113kB ± 0%     113kB ± 0%   -0.05%  (p=0.000 n=8+8)

name                 old allocs/op  new allocs/op  delta
Parse/lorem_ipsum-4      12.0 ± 0%      10.0 ± 0%  -16.67%  (p=0.000 n=8+8)
Parse/short-4             109 ± 0%       107 ± 0%   -1.83%  (p=0.000 n=8+8)
Parse/medium-4            298 ± 0%       296 ± 0%   -0.67%  (p=0.000 n=8+8)
Parse/long-4              779 ± 0%       777 ± 0%   -0.26%  (p=0.000 n=8+8)
Parse/very-long-4       2.61k ± 0%     2.61k ± 0%   -0.08%  (p=0.000 n=8+8)

* template: add ASCII fast path for lexer.next

lexer.next is a hot function, as demonstrated by profiling. Most programs
will consist of ASCII characters only, which we can optimize for. Ideally
DecodeRuneInString would be inlined here and this wouldn't be a problem at all,
but that won't be the case until golang/go#31666
is resolved.

name                 old time/op    new time/op    delta
Parse/lorem_ipsum-4     733ns ± 2%     733ns ± 2%    ~     (p=0.933 n=8+8)
Parse/short-4          16.6µs ± 1%    15.5µs ± 2%  -6.75%  (p=0.000 n=8+8)
Parse/medium-4         40.1µs ± 1%    38.7µs ± 1%  -3.51%  (p=0.000 n=8+8)
Parse/long-4            123µs ± 1%     115µs ± 1%  -5.86%  (p=0.001 n=7+7)
Parse/very-long-4       416µs ± 1%     396µs ± 1%  -4.70%  (p=0.000 n=8+8)

name                 old alloc/op   new alloc/op   delta
Parse/lorem_ipsum-4    1.16kB ± 0%    1.16kB ± 0%    ~     (all equal)
Parse/short-4          5.42kB ± 0%    5.42kB ± 0%    ~     (all equal)
Parse/medium-4         12.6kB ± 0%    12.6kB ± 0%    ~     (all equal)
Parse/long-4           35.3kB ± 0%    35.3kB ± 0%    ~     (all equal)
Parse/very-long-4       113kB ± 0%     113kB ± 0%    ~     (all equal)

name                 old allocs/op  new allocs/op  delta
Parse/lorem_ipsum-4      10.0 ± 0%      10.0 ± 0%    ~     (all equal)
Parse/short-4             107 ± 0%       107 ± 0%    ~     (all equal)
Parse/medium-4            296 ± 0%       296 ± 0%    ~     (all equal)
Parse/long-4              777 ± 0%       777 ± 0%    ~     (all equal)
Parse/very-long-4       2.61k ± 0%     2.61k ± 0%    ~     (all equal)

* template: preallocate for 4 args in CommandNode

CommandNode.append and by extension runtime.growslice was showing up
more than expected during profiling. Allocate enough space for four
arguments up front so we don't need to reallocate as much. Although
this doesn't benefit performance much, it does have a clear positive effect on
memory usage.

name                 old time/op    new time/op    delta
Parse/lorem_ipsum-4     733ns ± 2%     728ns ± 2%     ~     (p=0.315 n=8+8)
Parse/short-4          15.5µs ± 2%    15.2µs ± 1%   -2.12%  (p=0.001 n=8+8)
Parse/medium-4         38.7µs ± 1%    38.0µs ± 1%   -1.99%  (p=0.000 n=8+8)
Parse/long-4            115µs ± 1%     117µs ± 5%     ~     (p=0.281 n=7+8)
Parse/very-long-4       396µs ± 1%     400µs ± 1%   +1.02%  (p=0.002 n=8+8)

name                 old alloc/op   new alloc/op   delta
Parse/lorem_ipsum-4    1.16kB ± 0%    1.16kB ± 0%     ~     (all equal)
Parse/short-4          5.42kB ± 0%    5.08kB ± 0%   -6.20%  (p=0.000 n=8+8)
Parse/medium-4         12.6kB ± 0%    12.7kB ± 0%   +0.51%  (p=0.000 n=8+8)
Parse/long-4           35.3kB ± 0%    34.4kB ± 0%   -2.58%  (p=0.000 n=8+8)
Parse/very-long-4       113kB ± 0%     110kB ± 0%   -2.37%  (p=0.000 n=8+8)

name                 old allocs/op  new allocs/op  delta
Parse/lorem_ipsum-4      10.0 ± 0%      10.0 ± 0%     ~     (all equal)
Parse/short-4             107 ± 0%        93 ± 0%  -13.08%  (p=0.000 n=8+8)
Parse/medium-4            296 ± 0%       276 ± 0%   -6.76%  (p=0.000 n=8+8)
Parse/long-4              777 ± 0%       705 ± 0%   -9.27%  (p=0.000 n=8+8)
Parse/very-long-4       2.61k ± 0%     2.35k ± 0%   -9.77%  (p=0.000 n=8+8)
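The preallocation idea above can be sketched as follows; `node` and `commandNode` are illustrative stand-ins for the template parser's types, and the point is only the capacity hint:

```go
package main

import "fmt"

type node interface{}

type commandNode struct {
	args []node
}

// newCommandNode reserves room for four arguments up front, so typical
// small commands never trigger a reallocation (runtime.growslice) in
// append.
func newCommandNode() *commandNode {
	return &commandNode{args: make([]node, 0, 4)}
}

func (c *commandNode) append(n node) {
	c.args = append(c.args, n)
}

func main() {
	c := newCommandNode()
	for i := 0; i < 3; i++ {
		c.append(i)
	}
	fmt.Println(len(c.args), cap(c.args)) // 3 4
}
```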
gopherbot pushed a commit that referenced this issue Sep 21, 2022
This commit adds an ASCII fast path to bytes/strings EqualFold that
roughly doubles performance when all characters are ASCII.

It also changes strings.EqualFold to use `for range` for the first
string since this is ~10% faster than using utf8.DecodeRuneInString for
both (see #31666).

Performance (similar results on arm64 and amd64):

name                        old time/op  new time/op  delta
EqualFold/Tests-10           238ns ± 0%   172ns ± 1%  -27.91%  (p=0.000 n=10+10)
EqualFold/ASCII-10          20.5ns ± 0%   9.7ns ± 0%  -52.73%  (p=0.000 n=10+10)
EqualFold/UnicodePrefix-10  86.5ns ± 0%  77.6ns ± 0%  -10.37%  (p=0.000 n=10+10)
EqualFold/UnicodeSuffix-10  86.8ns ± 2%  71.3ns ± 0%  -17.88%  (p=0.000 n=10+8)

Change-Id: I058f3f97a08dc04d65af895674d85420f920abe1
Reviewed-on: https://go-review.googlesource.com/c/go/+/425459
Reviewed-by: Ian Lance Taylor <iant@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: Ian Lance Taylor <iant@google.com>
Run-TryBot: Ian Lance Taylor <iant@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>