Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bytes, strings: optimize Trim for ASCII cutsets #46446

Closed
dsnet opened this issue May 29, 2021 · 17 comments
Closed

bytes, strings: optimize Trim for ASCII cutsets #46446

dsnet opened this issue May 29, 2021 · 17 comments
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. Performance
Milestone

Comments

@dsnet
Copy link
Member

dsnet commented May 29, 2021

I analyzed usages of strings.Trim (and related functions in both strings and bytes) in all source code known by the module proxy:

  • 76.6% have a cutset of len=1, and
  • 13.4% have a cutset of len=2.

Given that a vast majority of usages only have a cutset of len=1, I argue we should more heavily optimize for that situation. Currently, there is some optimization for cutsets of len=1, but it’s within the internal makeCutsetFunc function. This is sub-optimal as it incurs an allocation in the internal makeCutsetFunc for the closure over that single byte. I believe we should place special handling of cutsets with len=1 directly in Trim, TrimRight, and TrimLeft.

Example benchmark for strings.TrimRight("hello==", "="):

BenchmarkStd   	18299240	        69.16 ns/op	      24 B/op	       1 allocs/op
BenchmarkOpt   	226575349	         5.330 ns/op	       0 B/op	       0 allocs/op

The suggested optimization results in a >10x speedup for this common case.

P.S. There is currently an optimization for cutsets that are all ASCII. This optimization is still justified to keep as 99.3% of all cutsets are pure ASCII.

@ALTree ALTree added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label May 29, 2021
@ianlancetaylor
Copy link
Contributor

Out of curiosity, is there any particular byte that predominantly appears in the set of single-byte cutsets?

@dsnet
Copy link
Member Author

dsnet commented May 30, 2021

Out of curiosity, is there any particular byte that predominantly appears in the set of single-byte cutsets?

The top-ten single-byte cutsets are:

  • '/' at 30.2%
  • '\n' at 15.7%
  • ' ' at 10.1%
  • '.' at 7.0%
  • '\x00' at 5.7%
  • '0' at 5.0%
  • ',' at 3.5%
  • '=' at 3.4%
  • '-' at 2.6%
  • '|' at 2.5%
  • everything else at 14.3%

@dsnet
Copy link
Member Author

dsnet commented May 30, 2021

https://go-review.googlesource.com/c/go/+/323315 is an example of speedup in application code if bytes.Trim were to be optimized.

@gopherbot
Copy link

Change https://golang.org/cl/323318 mentions this issue: bytes, strings: optimize Trim for single byte cutsets

@nightlyone
Copy link
Contributor

Maybe is is useful to do the same optimization as LastIndexAny and IndexAny where the cutover to use a cutset is 8 bytes.

@dsnet
Copy link
Member Author

dsnet commented May 30, 2021

@nightlyone, can you elaborate more what you have in mind?

Currently, the logic for LastIndexAny is already fairly complex. Any optimizations of LastIndexAny should be driven by how Go code at scale uses it, which is an analysis I haven't performed yet.

I should note that the we could conceivably delete the optimization in LastIndexAny for a cutset with len=1, which was added in golang.org/cl/156998. I don't find that specific optimization particularly compelling since:

  1. LastIndexByte already exists, and
  2. a previous commit by mine indicates that only 6% of usages have 1 character cutsets.

@josharian
Copy link
Contributor

I recently cobbled together https://go-review.googlesource.com/c/go/+/271669 to work around that allocation. As I noted there, "another alternative is to duplicate a bunch of the trimming code to work directly on asciiSets", which might be a viable option to consider.

@bradfitz also had some thoughts about this. I've forgotten what they were, but perhaps he remembers. :)

@dsnet
Copy link
Member Author

dsnet commented Jun 1, 2021

Interesting idea caching the cutset functions. I think inlining the single-byte handling is still preferred since a quick check against a single byte is dramatically faster than always calling a func(rune) bool.

That said, if it provides any insights, the top most common cutsets with len>1 are:

  • "\r\n" at 4.6%
  • " \t" at 2.3%
  • "\n\r" at 0.8%
  • "//" at 0.7%
  • ", " at 0.5%
  • " \t\r\n" at 0.5%
  • "\t\n\f\r " at 0.4%
  • " \t\n\f\r" at 0.4%
  • "./" at 0.3%
  • "01234567" at 0.3%

Observations of the above:

  • Most of them deal with ASCII whitespace characters.
  • Trimming ASCII digits seems common.
  • One of them is useless (e.g., "//" is an identical cutset to "/").

@gopherbot
Copy link

Change https://golang.org/cl/325251 mentions this issue: strings,bytes: optimize the Trim/TrimLeft/TrimRight fast paths

CAFxX added a commit to CAFxX/go that referenced this issue Jul 4, 2021
There is evidence that the vast majority of uses for Trim* involve cutsets
with a single ASCII character, and the vast majority of remaining uses
involve cutsets with a small (<4) ASCII characters. For this reason it makes
sense to provide better fast paths for these common cases.

Note that the regressions in TrimSpace are likely due to code alignment,
since TrimSpace shares no code with the changes in this CL.

bytes:
name                    old time/op  new time/op  delta
TrimSpace/NoTrim        3.60ns ± 2%  3.75ns ±10%   +4.13%  (p=0.027 n=10+10)
TrimSpace/ASCII         6.65ns ± 6%  7.30ns ±12%   +9.73%  (p=0.017 n=9+10)
TrimSpace/SomeNonASCII  72.0ns ± 2%  71.7ns ± 3%     ~     (p=0.604 n=10+9)
TrimSpace/JustNonASCII   116ns ± 2%   115ns ± 1%   -1.16%  (p=0.037 n=10+10)
TrimASCII/1:1           40.9ns ± 8%   4.0ns ± 4%  -90.29%  (p=0.000 n=10+10)
TrimASCII/1:2           62.3ns ± 3%   9.1ns ± 3%  -85.45%  (p=0.000 n=10+10)
TrimASCII/1:4           62.9ns ± 2%  10.3ns ± 2%  -83.54%  (p=0.000 n=9+10)
TrimASCII/1:8           69.6ns ± 8%  69.2ns ± 2%     ~     (p=0.959 n=8+8)
TrimASCII/1:16          78.2ns ± 1%  83.1ns ±12%     ~     (p=0.274 n=8+10)
TrimASCII/16:1          83.5ns ± 1%  13.4ns ± 4%  -83.89%  (p=0.000 n=8+9)
TrimASCII/16:2           116ns ±10%    34ns ± 3%  -71.18%  (p=0.000 n=10+10)
TrimASCII/16:4           106ns ± 4%    49ns ± 2%  -53.29%  (p=0.000 n=10+10)
TrimASCII/16:8           110ns ± 3%   114ns ± 2%   +3.67%  (p=0.000 n=10+10)
TrimASCII/16:16          119ns ± 3%   124ns ± 2%   +4.19%  (p=0.000 n=9+10)
TrimASCII/256:1          721ns ± 6%   182ns ± 6%  -74.79%  (p=0.000 n=10+10)
TrimASCII/256:2          739ns ± 1%   437ns ± 8%  -40.84%  (p=0.000 n=8+10)
TrimASCII/256:4          746ns ± 2%   653ns ± 2%  -12.38%  (p=0.000 n=10+9)
TrimASCII/256:8          745ns ± 2%   801ns ± 2%   +7.49%  (p=0.000 n=10+10)
TrimASCII/256:16         808ns ±11%   813ns ± 2%     ~     (p=0.954 n=10+10)
TrimASCII/4096:1        11.4µs ±13%   2.6µs ± 2%  -76.95%  (p=0.000 n=10+9)
TrimASCII/4096:2        10.9µs ± 1%   5.4µs ± 2%  -50.42%  (p=0.000 n=10+10)
TrimASCII/4096:4        11.3µs ± 6%  10.3µs ± 4%   -9.10%  (p=0.000 n=10+10)
TrimASCII/4096:8        11.1µs ± 3%  11.4µs ± 5%   +3.32%  (p=0.029 n=10+10)
TrimASCII/4096:16       10.9µs ± 3%  12.0µs ±12%  +10.21%  (p=0.001 n=10+10)

strings:
name                    old time/op    new time/op    delta
Trim                      1.33µs ± 4%    0.70µs ± 3%  -47.20%  (p=0.000 n=10+10)
TrimASCII/1:1             39.4ns ± 2%     4.6ns ± 3%  -88.24%  (p=0.000 n=10+10)
TrimASCII/1:2             63.1ns ± 3%     8.8ns ± 3%  -86.02%  (p=0.000 n=9+9)
TrimASCII/1:4             63.9ns ± 3%    10.6ns ± 2%  -83.39%  (p=0.000 n=9+10)
TrimASCII/1:8             68.9ns ± 3%    69.0ns ± 5%     ~     (p=1.000 n=9+9)
TrimASCII/1:16            82.7ns ± 9%    83.6ns ±15%     ~     (p=0.684 n=10+10)
TrimASCII/16:1            79.5ns ± 5%    14.6ns ± 2%  -81.60%  (p=0.000 n=10+9)
TrimASCII/16:2             107ns ± 5%      34ns ± 3%  -68.30%  (p=0.000 n=9+10)
TrimASCII/16:4             107ns ± 4%      49ns ± 2%  -54.30%  (p=0.000 n=10+10)
TrimASCII/16:8             111ns ± 5%     114ns ± 2%     ~     (p=0.068 n=9+10)
TrimASCII/16:16            118ns ± 3%     121ns ± 4%   +2.26%  (p=0.014 n=10+10)
TrimASCII/256:1            611ns ± 8%     183ns ± 8%  -70.10%  (p=0.000 n=9+10)
TrimASCII/256:2            811ns ± 8%     442ns ± 9%  -45.54%  (p=0.000 n=10+10)
TrimASCII/256:4            760ns ± 5%     650ns ± 2%  -14.41%  (p=0.000 n=10+9)
TrimASCII/256:8            754ns ± 3%     765ns ± 3%     ~     (p=0.101 n=10+10)
TrimASCII/256:16           757ns ± 4%     773ns ± 3%   +2.19%  (p=0.012 n=10+10)
TrimASCII/4096:1          8.88µs ± 2%    2.70µs ± 3%  -69.64%  (p=0.000 n=10+9)
TrimASCII/4096:2          11.0µs ± 2%     5.5µs ± 3%  -50.06%  (p=0.000 n=10+10)
TrimASCII/4096:4          11.0µs ± 3%    10.3µs ± 5%   -6.53%  (p=0.000 n=10+10)
TrimASCII/4096:8          11.9µs ±13%    12.0µs ±12%     ~     (p=0.796 n=10+10)
TrimASCII/4096:16         11.6µs ±10%    12.0µs ±12%     ~     (p=0.247 n=10+10)
TrimSpace/NoTrim          3.65ns ± 2%    3.73ns ± 2%   +2.13%  (p=0.004 n=10+10)
TrimSpace/ASCII           6.09ns ± 3%    6.71ns ± 3%  +10.13%  (p=0.000 n=9+10)
TrimSpace/SomeNonASCII    71.6ns ± 1%    72.5ns ± 2%     ~     (p=0.077 n=9+9)
TrimSpace/JustNonASCII     109ns ± 2%     111ns ± 5%   +2.19%  (p=0.030 n=10+10)

name                    old alloc/op   new alloc/op   delta
Trim                        576B ± 0%      168B ± 0%  -70.83%  (p=0.000 n=10+10)

name                    old allocs/op  new allocs/op  delta
Trim                        23.0 ± 0%       6.0 ± 0%  -73.91%  (p=0.000 n=10+10)

Fixes golang#46446

Change-Id: I87bf0d621eecb876f61c6efeff0bf5c3213973b5
CAFxX added a commit to CAFxX/go that referenced this issue Jul 4, 2021
There is evidence that the vast majority of uses for Trim* involve cutsets
with a single ASCII character, and the vast majority of remaining uses
involve cutsets with a small (<4) ASCII characters. For this reason it makes
sense to provide better fast paths for these common cases.

Note that the regressions in TrimSpace are likely due to code alignment,
since TrimSpace shares no code with the changes in this CL.

bytes:
name                    old time/op  new time/op  delta
TrimSpace/NoTrim        3.60ns ± 2%  3.75ns ±10%   +4.13%  (p=0.027 n=10+10)
TrimSpace/ASCII         6.65ns ± 6%  7.30ns ±12%   +9.73%  (p=0.017 n=9+10)
TrimSpace/SomeNonASCII  72.0ns ± 2%  71.7ns ± 3%     ~     (p=0.604 n=10+9)
TrimSpace/JustNonASCII   116ns ± 2%   115ns ± 1%   -1.16%  (p=0.037 n=10+10)
TrimASCII/1:1           40.9ns ± 8%   4.0ns ± 4%  -90.29%  (p=0.000 n=10+10)
TrimASCII/1:2           62.3ns ± 3%   9.1ns ± 3%  -85.45%  (p=0.000 n=10+10)
TrimASCII/1:4           62.9ns ± 2%  10.3ns ± 2%  -83.54%  (p=0.000 n=9+10)
TrimASCII/1:8           69.6ns ± 8%  69.2ns ± 2%     ~     (p=0.959 n=8+8)
TrimASCII/1:16          78.2ns ± 1%  83.1ns ±12%     ~     (p=0.274 n=8+10)
TrimASCII/16:1          83.5ns ± 1%  13.4ns ± 4%  -83.89%  (p=0.000 n=8+9)
TrimASCII/16:2           116ns ±10%    34ns ± 3%  -71.18%  (p=0.000 n=10+10)
TrimASCII/16:4           106ns ± 4%    49ns ± 2%  -53.29%  (p=0.000 n=10+10)
TrimASCII/16:8           110ns ± 3%   114ns ± 2%   +3.67%  (p=0.000 n=10+10)
TrimASCII/16:16          119ns ± 3%   124ns ± 2%   +4.19%  (p=0.000 n=9+10)
TrimASCII/256:1          721ns ± 6%   182ns ± 6%  -74.79%  (p=0.000 n=10+10)
TrimASCII/256:2          739ns ± 1%   437ns ± 8%  -40.84%  (p=0.000 n=8+10)
TrimASCII/256:4          746ns ± 2%   653ns ± 2%  -12.38%  (p=0.000 n=10+9)
TrimASCII/256:8          745ns ± 2%   801ns ± 2%   +7.49%  (p=0.000 n=10+10)
TrimASCII/256:16         808ns ±11%   813ns ± 2%     ~     (p=0.954 n=10+10)
TrimASCII/4096:1        11.4µs ±13%   2.6µs ± 2%  -76.95%  (p=0.000 n=10+9)
TrimASCII/4096:2        10.9µs ± 1%   5.4µs ± 2%  -50.42%  (p=0.000 n=10+10)
TrimASCII/4096:4        11.3µs ± 6%  10.3µs ± 4%   -9.10%  (p=0.000 n=10+10)
TrimASCII/4096:8        11.1µs ± 3%  11.4µs ± 5%   +3.32%  (p=0.029 n=10+10)
TrimASCII/4096:16       10.9µs ± 3%  12.0µs ±12%  +10.21%  (p=0.001 n=10+10)

strings:
name                    old time/op    new time/op    delta
Trim                      1.33µs ± 4%    0.70µs ± 3%  -47.20%  (p=0.000 n=10+10)
TrimASCII/1:1             39.4ns ± 2%     4.6ns ± 3%  -88.24%  (p=0.000 n=10+10)
TrimASCII/1:2             63.1ns ± 3%     8.8ns ± 3%  -86.02%  (p=0.000 n=9+9)
TrimASCII/1:4             63.9ns ± 3%    10.6ns ± 2%  -83.39%  (p=0.000 n=9+10)
TrimASCII/1:8             68.9ns ± 3%    69.0ns ± 5%     ~     (p=1.000 n=9+9)
TrimASCII/1:16            82.7ns ± 9%    83.6ns ±15%     ~     (p=0.684 n=10+10)
TrimASCII/16:1            79.5ns ± 5%    14.6ns ± 2%  -81.60%  (p=0.000 n=10+9)
TrimASCII/16:2             107ns ± 5%      34ns ± 3%  -68.30%  (p=0.000 n=9+10)
TrimASCII/16:4             107ns ± 4%      49ns ± 2%  -54.30%  (p=0.000 n=10+10)
TrimASCII/16:8             111ns ± 5%     114ns ± 2%     ~     (p=0.068 n=9+10)
TrimASCII/16:16            118ns ± 3%     121ns ± 4%   +2.26%  (p=0.014 n=10+10)
TrimASCII/256:1            611ns ± 8%     183ns ± 8%  -70.10%  (p=0.000 n=9+10)
TrimASCII/256:2            811ns ± 8%     442ns ± 9%  -45.54%  (p=0.000 n=10+10)
TrimASCII/256:4            760ns ± 5%     650ns ± 2%  -14.41%  (p=0.000 n=10+9)
TrimASCII/256:8            754ns ± 3%     765ns ± 3%     ~     (p=0.101 n=10+10)
TrimASCII/256:16           757ns ± 4%     773ns ± 3%   +2.19%  (p=0.012 n=10+10)
TrimASCII/4096:1          8.88µs ± 2%    2.70µs ± 3%  -69.64%  (p=0.000 n=10+9)
TrimASCII/4096:2          11.0µs ± 2%     5.5µs ± 3%  -50.06%  (p=0.000 n=10+10)
TrimASCII/4096:4          11.0µs ± 3%    10.3µs ± 5%   -6.53%  (p=0.000 n=10+10)
TrimASCII/4096:8          11.9µs ±13%    12.0µs ±12%     ~     (p=0.796 n=10+10)
TrimASCII/4096:16         11.6µs ±10%    12.0µs ±12%     ~     (p=0.247 n=10+10)
TrimSpace/NoTrim          3.65ns ± 2%    3.73ns ± 2%   +2.13%  (p=0.004 n=10+10)
TrimSpace/ASCII           6.09ns ± 3%    6.71ns ± 3%  +10.13%  (p=0.000 n=9+10)
TrimSpace/SomeNonASCII    71.6ns ± 1%    72.5ns ± 2%     ~     (p=0.077 n=9+9)
TrimSpace/JustNonASCII     109ns ± 2%     111ns ± 5%   +2.19%  (p=0.030 n=10+10)

name                    old alloc/op   new alloc/op   delta
Trim                        576B ± 0%      168B ± 0%  -70.83%  (p=0.000 n=10+10)

name                    old allocs/op  new allocs/op  delta
Trim                        23.0 ± 0%       6.0 ± 0%  -73.91%  (p=0.000 n=10+10)

Fixes golang#46446

Change-Id: I87bf0d621eecb876f61c6efeff0bf5c3213973b5
@gopherbot
Copy link

Change https://golang.org/cl/332771 mentions this issue: strings,bytes: optimize the Trim/TrimLeft/TrimRight fast paths

@CAFxX
Copy link
Contributor

CAFxX commented Jul 5, 2021

https://golang.org/cl/332771 removes the allocation in all cases, and in general speeds up Trim/TrimLeft/TrimRight 2x~10x depending on the cutset, with the biggest speedups for the case of a cutset composed of a single ASCII character.

@andig
Copy link
Contributor

andig commented Jul 31, 2021

Out of curiosity as the speedup seems significant: what is missing to move either of the CLs forward? A decision on fast path vs pooling? Combining both?

@josharian
Copy link
Contributor

The CL from @CAFxX looks like a good approach to me. Hopefully once the tree opens again we can get it reviewed and submitted.

@dsnet dsnet changed the title bytes, strings: optimize Trim for 1-byte cutsets bytes, strings: optimize Trim for multi-byte cutsets Aug 25, 2021
@dsnet dsnet reopened this Aug 25, 2021
@dsnet
Copy link
Member Author

dsnet commented Aug 25, 2021

Re-opening since there's some interest in optimizing for multi-byte ASCII cutsets. The evidence for supporting single-byte cutsets is very strong given that they are >75% of all usages. For the remaining 25% where the cutset is not a single byte, 97% of them are all ASCII.

@cespare
Copy link
Contributor

cespare commented Aug 25, 2021

@dsnet what do you think about opening a new issues and reverting the title of this one? I think it's confusing that this was originally about 1-byte cutsets, and then that was implemented, and now the issue is reopened and is about something new.

At minimum, it should be retitled to be about both optimizations.

@dsnet dsnet changed the title bytes, strings: optimize Trim for multi-byte cutsets bytes, strings: optimize Trim for ASCII cutsets Aug 25, 2021
@dsnet
Copy link
Member Author

dsnet commented Aug 25, 2021

I renamed the issue to be about ASCII cutsets, which I think accurately captured both optimizations being discussed.

@CAFxX
Copy link
Contributor

CAFxX commented Sep 4, 2021

I just rebased and re-ran the benchmarks in https://golang.org/cl/332771. Would be ideal if someone could take a look at it. It completely eliminates all memory allocations and significantly speeds up both the ASCII fast path and the slow path.

@dmitshur dmitshur removed the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Nov 19, 2021
@dmitshur dmitshur added the NeedsFix The path to resolution is known, but the work has not been done. label Nov 19, 2021
@dmitshur dmitshur added this to the Go1.18 milestone Nov 19, 2021
@golang golang locked and limited conversation to collaborators Nov 19, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. Performance
Projects
None yet
Development

No branches or pull requests

10 participants