New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bytes, strings: optimize Trim for ASCII cutsets #46446
Comments
Out of curiosity, is there any particular byte that predominantly appears in the set of single-byte cutsets? |
The top-ten single-byte cutsets are:
|
https://go-review.googlesource.com/c/go/+/323315 is an example of speedup in application code if |
Change https://golang.org/cl/323318 mentions this issue: |
Maybe is is useful to do the same optimization as LastIndexAny and IndexAny where the cutover to use a cutset is 8 bytes. |
@nightlyone, can you elaborate more what you have in mind? Currently, the logic for I should note that the we could conceivably delete the optimization in
|
I recently cobbled together https://go-review.googlesource.com/c/go/+/271669 to work around that allocation. As I noted there, "another alternative is to duplicate a bunch of the trimming code to work directly on asciiSets", which might be a viable option to consider. @bradfitz also had some thoughts about this. I've forgotten what they were, but perhaps he remembers. :) |
Interesting idea caching the cutset functions. I think inlining the single-byte handling is still preferred since a quick check against a single byte is dramatically faster than always calling a That said, if it provides any insights, the top most common cutsets with len>1 are:
Observations of the above:
|
Change https://golang.org/cl/325251 mentions this issue: |
There is evidence that the vast majority of uses for Trim* involve cutsets with a single ASCII character, and the vast majority of remaining uses involve cutsets with a small (<4) ASCII characters. For this reason it makes sense to provide better fast paths for these common cases. Note that the regressions in TrimSpace are likely due to code alignment, since TrimSpace shares no code with the changes in this CL. bytes: name old time/op new time/op delta TrimSpace/NoTrim 3.60ns ± 2% 3.75ns ±10% +4.13% (p=0.027 n=10+10) TrimSpace/ASCII 6.65ns ± 6% 7.30ns ±12% +9.73% (p=0.017 n=9+10) TrimSpace/SomeNonASCII 72.0ns ± 2% 71.7ns ± 3% ~ (p=0.604 n=10+9) TrimSpace/JustNonASCII 116ns ± 2% 115ns ± 1% -1.16% (p=0.037 n=10+10) TrimASCII/1:1 40.9ns ± 8% 4.0ns ± 4% -90.29% (p=0.000 n=10+10) TrimASCII/1:2 62.3ns ± 3% 9.1ns ± 3% -85.45% (p=0.000 n=10+10) TrimASCII/1:4 62.9ns ± 2% 10.3ns ± 2% -83.54% (p=0.000 n=9+10) TrimASCII/1:8 69.6ns ± 8% 69.2ns ± 2% ~ (p=0.959 n=8+8) TrimASCII/1:16 78.2ns ± 1% 83.1ns ±12% ~ (p=0.274 n=8+10) TrimASCII/16:1 83.5ns ± 1% 13.4ns ± 4% -83.89% (p=0.000 n=8+9) TrimASCII/16:2 116ns ±10% 34ns ± 3% -71.18% (p=0.000 n=10+10) TrimASCII/16:4 106ns ± 4% 49ns ± 2% -53.29% (p=0.000 n=10+10) TrimASCII/16:8 110ns ± 3% 114ns ± 2% +3.67% (p=0.000 n=10+10) TrimASCII/16:16 119ns ± 3% 124ns ± 2% +4.19% (p=0.000 n=9+10) TrimASCII/256:1 721ns ± 6% 182ns ± 6% -74.79% (p=0.000 n=10+10) TrimASCII/256:2 739ns ± 1% 437ns ± 8% -40.84% (p=0.000 n=8+10) TrimASCII/256:4 746ns ± 2% 653ns ± 2% -12.38% (p=0.000 n=10+9) TrimASCII/256:8 745ns ± 2% 801ns ± 2% +7.49% (p=0.000 n=10+10) TrimASCII/256:16 808ns ±11% 813ns ± 2% ~ (p=0.954 n=10+10) TrimASCII/4096:1 11.4µs ±13% 2.6µs ± 2% -76.95% (p=0.000 n=10+9) TrimASCII/4096:2 10.9µs ± 1% 5.4µs ± 2% -50.42% (p=0.000 n=10+10) TrimASCII/4096:4 11.3µs ± 6% 10.3µs ± 4% -9.10% (p=0.000 n=10+10) TrimASCII/4096:8 11.1µs ± 3% 11.4µs ± 5% +3.32% (p=0.029 n=10+10) TrimASCII/4096:16 10.9µs ± 3% 12.0µs ±12% +10.21% (p=0.001 n=10+10) strings: name old time/op new time/op delta Trim 1.33µs ± 4% 0.70µs ± 3% -47.20% (p=0.000 n=10+10) TrimASCII/1:1 39.4ns ± 2% 4.6ns ± 3% -88.24% (p=0.000 n=10+10) TrimASCII/1:2 63.1ns ± 3% 8.8ns ± 3% -86.02% (p=0.000 n=9+9) TrimASCII/1:4 63.9ns ± 3% 10.6ns ± 2% -83.39% (p=0.000 n=9+10) TrimASCII/1:8 68.9ns ± 3% 69.0ns ± 5% ~ (p=1.000 n=9+9) TrimASCII/1:16 82.7ns ± 9% 83.6ns ±15% ~ (p=0.684 n=10+10) TrimASCII/16:1 79.5ns ± 5% 14.6ns ± 2% -81.60% (p=0.000 n=10+9) TrimASCII/16:2 107ns ± 5% 34ns ± 3% -68.30% (p=0.000 n=9+10) TrimASCII/16:4 107ns ± 4% 49ns ± 2% -54.30% (p=0.000 n=10+10) TrimASCII/16:8 111ns ± 5% 114ns ± 2% ~ (p=0.068 n=9+10) TrimASCII/16:16 118ns ± 3% 121ns ± 4% +2.26% (p=0.014 n=10+10) TrimASCII/256:1 611ns ± 8% 183ns ± 8% -70.10% (p=0.000 n=9+10) TrimASCII/256:2 811ns ± 8% 442ns ± 9% -45.54% (p=0.000 n=10+10) TrimASCII/256:4 760ns ± 5% 650ns ± 2% -14.41% (p=0.000 n=10+9) TrimASCII/256:8 754ns ± 3% 765ns ± 3% ~ (p=0.101 n=10+10) TrimASCII/256:16 757ns ± 4% 773ns ± 3% +2.19% (p=0.012 n=10+10) TrimASCII/4096:1 8.88µs ± 2% 2.70µs ± 3% -69.64% (p=0.000 n=10+9) TrimASCII/4096:2 11.0µs ± 2% 5.5µs ± 3% -50.06% (p=0.000 n=10+10) TrimASCII/4096:4 11.0µs ± 3% 10.3µs ± 5% -6.53% (p=0.000 n=10+10) TrimASCII/4096:8 11.9µs ±13% 12.0µs ±12% ~ (p=0.796 n=10+10) TrimASCII/4096:16 11.6µs ±10% 12.0µs ±12% ~ (p=0.247 n=10+10) TrimSpace/NoTrim 3.65ns ± 2% 3.73ns ± 2% +2.13% (p=0.004 n=10+10) TrimSpace/ASCII 6.09ns ± 3% 6.71ns ± 3% +10.13% (p=0.000 n=9+10) TrimSpace/SomeNonASCII 71.6ns ± 1% 72.5ns ± 2% ~ (p=0.077 n=9+9) TrimSpace/JustNonASCII 109ns ± 2% 111ns ± 5% +2.19% (p=0.030 n=10+10) name old alloc/op new alloc/op delta Trim 576B ± 0% 168B ± 0% -70.83% (p=0.000 n=10+10) name old allocs/op new allocs/op delta Trim 23.0 ± 0% 6.0 ± 0% -73.91% (p=0.000 n=10+10) Fixes golang#46446 Change-Id: I87bf0d621eecb876f61c6efeff0bf5c3213973b5
There is evidence that the vast majority of uses for Trim* involve cutsets with a single ASCII character, and the vast majority of remaining uses involve cutsets with a small (<4) ASCII characters. For this reason it makes sense to provide better fast paths for these common cases. Note that the regressions in TrimSpace are likely due to code alignment, since TrimSpace shares no code with the changes in this CL. bytes: name old time/op new time/op delta TrimSpace/NoTrim 3.60ns ± 2% 3.75ns ±10% +4.13% (p=0.027 n=10+10) TrimSpace/ASCII 6.65ns ± 6% 7.30ns ±12% +9.73% (p=0.017 n=9+10) TrimSpace/SomeNonASCII 72.0ns ± 2% 71.7ns ± 3% ~ (p=0.604 n=10+9) TrimSpace/JustNonASCII 116ns ± 2% 115ns ± 1% -1.16% (p=0.037 n=10+10) TrimASCII/1:1 40.9ns ± 8% 4.0ns ± 4% -90.29% (p=0.000 n=10+10) TrimASCII/1:2 62.3ns ± 3% 9.1ns ± 3% -85.45% (p=0.000 n=10+10) TrimASCII/1:4 62.9ns ± 2% 10.3ns ± 2% -83.54% (p=0.000 n=9+10) TrimASCII/1:8 69.6ns ± 8% 69.2ns ± 2% ~ (p=0.959 n=8+8) TrimASCII/1:16 78.2ns ± 1% 83.1ns ±12% ~ (p=0.274 n=8+10) TrimASCII/16:1 83.5ns ± 1% 13.4ns ± 4% -83.89% (p=0.000 n=8+9) TrimASCII/16:2 116ns ±10% 34ns ± 3% -71.18% (p=0.000 n=10+10) TrimASCII/16:4 106ns ± 4% 49ns ± 2% -53.29% (p=0.000 n=10+10) TrimASCII/16:8 110ns ± 3% 114ns ± 2% +3.67% (p=0.000 n=10+10) TrimASCII/16:16 119ns ± 3% 124ns ± 2% +4.19% (p=0.000 n=9+10) TrimASCII/256:1 721ns ± 6% 182ns ± 6% -74.79% (p=0.000 n=10+10) TrimASCII/256:2 739ns ± 1% 437ns ± 8% -40.84% (p=0.000 n=8+10) TrimASCII/256:4 746ns ± 2% 653ns ± 2% -12.38% (p=0.000 n=10+9) TrimASCII/256:8 745ns ± 2% 801ns ± 2% +7.49% (p=0.000 n=10+10) TrimASCII/256:16 808ns ±11% 813ns ± 2% ~ (p=0.954 n=10+10) TrimASCII/4096:1 11.4µs ±13% 2.6µs ± 2% -76.95% (p=0.000 n=10+9) TrimASCII/4096:2 10.9µs ± 1% 5.4µs ± 2% -50.42% (p=0.000 n=10+10) TrimASCII/4096:4 11.3µs ± 6% 10.3µs ± 4% -9.10% (p=0.000 n=10+10) TrimASCII/4096:8 11.1µs ± 3% 11.4µs ± 5% +3.32% (p=0.029 n=10+10) TrimASCII/4096:16 10.9µs ± 3% 12.0µs ±12% +10.21% (p=0.001 n=10+10) strings: name old time/op new time/op delta Trim 1.33µs ± 4% 0.70µs ± 3% -47.20% (p=0.000 n=10+10) TrimASCII/1:1 39.4ns ± 2% 4.6ns ± 3% -88.24% (p=0.000 n=10+10) TrimASCII/1:2 63.1ns ± 3% 8.8ns ± 3% -86.02% (p=0.000 n=9+9) TrimASCII/1:4 63.9ns ± 3% 10.6ns ± 2% -83.39% (p=0.000 n=9+10) TrimASCII/1:8 68.9ns ± 3% 69.0ns ± 5% ~ (p=1.000 n=9+9) TrimASCII/1:16 82.7ns ± 9% 83.6ns ±15% ~ (p=0.684 n=10+10) TrimASCII/16:1 79.5ns ± 5% 14.6ns ± 2% -81.60% (p=0.000 n=10+9) TrimASCII/16:2 107ns ± 5% 34ns ± 3% -68.30% (p=0.000 n=9+10) TrimASCII/16:4 107ns ± 4% 49ns ± 2% -54.30% (p=0.000 n=10+10) TrimASCII/16:8 111ns ± 5% 114ns ± 2% ~ (p=0.068 n=9+10) TrimASCII/16:16 118ns ± 3% 121ns ± 4% +2.26% (p=0.014 n=10+10) TrimASCII/256:1 611ns ± 8% 183ns ± 8% -70.10% (p=0.000 n=9+10) TrimASCII/256:2 811ns ± 8% 442ns ± 9% -45.54% (p=0.000 n=10+10) TrimASCII/256:4 760ns ± 5% 650ns ± 2% -14.41% (p=0.000 n=10+9) TrimASCII/256:8 754ns ± 3% 765ns ± 3% ~ (p=0.101 n=10+10) TrimASCII/256:16 757ns ± 4% 773ns ± 3% +2.19% (p=0.012 n=10+10) TrimASCII/4096:1 8.88µs ± 2% 2.70µs ± 3% -69.64% (p=0.000 n=10+9) TrimASCII/4096:2 11.0µs ± 2% 5.5µs ± 3% -50.06% (p=0.000 n=10+10) TrimASCII/4096:4 11.0µs ± 3% 10.3µs ± 5% -6.53% (p=0.000 n=10+10) TrimASCII/4096:8 11.9µs ±13% 12.0µs ±12% ~ (p=0.796 n=10+10) TrimASCII/4096:16 11.6µs ±10% 12.0µs ±12% ~ (p=0.247 n=10+10) TrimSpace/NoTrim 3.65ns ± 2% 3.73ns ± 2% +2.13% (p=0.004 n=10+10) TrimSpace/ASCII 6.09ns ± 3% 6.71ns ± 3% +10.13% (p=0.000 n=9+10) TrimSpace/SomeNonASCII 71.6ns ± 1% 72.5ns ± 2% ~ (p=0.077 n=9+9) TrimSpace/JustNonASCII 109ns ± 2% 111ns ± 5% +2.19% (p=0.030 n=10+10) name old alloc/op new alloc/op delta Trim 576B ± 0% 168B ± 0% -70.83% (p=0.000 n=10+10) name old allocs/op new allocs/op delta Trim 23.0 ± 0% 6.0 ± 0% -73.91% (p=0.000 n=10+10) Fixes golang#46446 Change-Id: I87bf0d621eecb876f61c6efeff0bf5c3213973b5
Change https://golang.org/cl/332771 mentions this issue: |
https://golang.org/cl/332771 removes the allocation in all cases, and in general speeds up Trim/TrimLeft/TrimRight 2x~10x depending on the cutset, with the biggest speedups for the case of a cutset composed of a single ASCII character. |
Out of curiosity as the speedup seems significant: what is missing to move either of the CLs forward? A decision on fast path vs pooling? Combining both? |
The CL from @CAFxX looks like a good approach to me. Hopefully once the tree opens again we can get it reviewed and submitted. |
Re-opening since there's some interest in optimizing for multi-byte ASCII cutsets. The evidence for supporting single-byte cutsets is very strong given that they are >75% of all usages. For the remaining 25% where the cutset is not a single byte, 97% of them are all ASCII. |
@dsnet what do you think about opening a new issues and reverting the title of this one? I think it's confusing that this was originally about 1-byte cutsets, and then that was implemented, and now the issue is reopened and is about something new. At minimum, it should be retitled to be about both optimizations. |
I renamed the issue to be about ASCII cutsets, which I think accurately captured both optimizations being discussed. |
I just rebased and re-ran the benchmarks in https://golang.org/cl/332771. Would be ideal if someone could take a look at it. It completely eliminates all memory allocations and significantly speeds up both the ASCII fast path and the slow path. |
I analyzed usages of
strings.Trim
(and related functions in bothstrings
andbytes
) in all source code known by the module proxy:Given that a vast majority of usages only have a cutset of len=1, I argue we should more heavily optimize for that situation. Currently, there is some optimization for cutsets of len=1, but it’s within the internal
makeCutsetFunc
function. This is sub-optimal as it incurs an allocation in the internalmakeCutsetFunc
for the closure over that single byte. I believe we should place special handling of cutsets with len=1 directly inTrim
,TrimRight
, andTrimLeft
.Example benchmark for
strings.TrimRight("hello==", "=")
:The suggested optimization results in a >10x speedup for this common case.
P.S. There is currently an optimization for cutsets that are all ASCII. This optimization is still justified to keep as 99.3% of all cutsets are pure ASCII.
The text was updated successfully, but these errors were encountered: