Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unicode/utf8: optimize utf8.Valid with AVX2 instructions on AMD64 #63347

Open
achille-roussel opened this issue Oct 3, 2023 · 11 comments
Open
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@achille-roussel
Copy link
Contributor

UTF-8 validation is employed in many applications using text formats such as JSON.

An implementation of Ridiculously fast unicode (UTF-8) validation has been available in https://github.com/segmentio/asm/tree/main/utf8 since early 2022; authored by @pelletier and myself, it was published under MIT license.

The algorithm described in the research paper yields up to 10x throughput on UTF-8 validation of medium to large strings, as shown in this preliminary comparison:

goos: linux
goarch: amd64
pkg: unicode/utf8
cpu: Intel(R) Xeon(R) CPU @ 2.60GHz
                            │     base      │                avx2                 │
                            │    sec/op     │   sec/op     vs base                │
ValidTenASCIIChars              5.038n ± 0%   4.156n ± 0%  -17.51% (p=0.000 n=10)
Valid100KASCIIChars            10.736µ ± 1%   3.684µ ± 0%  -65.68% (p=0.000 n=10)
ValidTenJapaneseChars           30.05n ± 0%   30.00n ± 0%   -0.17% (p=0.000 n=10)
ValidLongMostlyASCII           88.697µ ± 0%   3.174µ ± 0%  -96.42% (p=0.000 n=10)
ValidLongJapanese             126.734µ ± 0%   7.712µ ± 0%  -93.91% (p=0.000 n=10)
ValidStringTenASCIIChars        5.044n ± 0%   4.748n ± 0%   -5.88% (p=0.000 n=10)
ValidString100KASCIIChars      10.667µ ± 0%   3.683µ ± 0%  -65.47% (p=0.000 n=10)
ValidStringTenJapaneseChars     30.25n ± 1%   29.59n ± 0%   -2.21% (p=0.000 n=10)
ValidStringLongMostlyASCII     89.232µ ± 1%   3.173µ ± 0%  -96.44% (p=0.000 n=10)
ValidStringLongJapanese       126.678µ ± 0%   7.717µ ± 1%  -93.91% (p=0.000 n=10)
Valid/1kValid                  907.70n ± 0%   83.38n ± 0%  -90.81% (p=0.000 n=10)
Valid/1MValid                  933.39µ ± 1%   80.94µ ± 1%  -91.33% (p=0.000 n=10)
Valid/10ASCII                   4.750n ± 0%   4.152n ± 0%  -12.59% (p=0.000 n=10)
Valid/10Japan                   32.44n ± 0%   32.00n ± 0%   -1.37% (p=0.000 n=10)
Valid/small4096                3710.0n ± 0%   323.3n ± 0%  -91.28% (p=0.000 n=10)
Valid/small32768               29.623µ ± 1%   2.528µ ± 0%  -91.47% (p=0.000 n=10)
Valid/small65536               59.227µ ± 0%   5.052µ ± 0%  -91.47% (p=0.000 n=10)
Valid/small262144              236.84µ ± 0%   20.19µ ± 0%  -91.47% (p=0.000 n=10)
Valid/small4194304             3794.1µ ± 1%   324.6µ ± 0%  -91.44% (p=0.000 n=10)
Valid/small33554432            30.454m ± 0%   2.829m ± 1%  -90.71% (p=0.000 n=10)
Valid/small134217728           121.93m ± 0%   13.81m ± 1%  -88.68% (p=0.000 n=10)
Valid/small268435456           243.73m ± 0%   27.76m ± 1%  -88.61% (p=0.000 n=10)
Valid/tail300                  269.60n ± 0%   30.41n ± 0%  -88.72% (p=0.000 n=10)
Valid/tail316                  283.85n ± 0%   29.59n ± 0%  -89.58% (p=0.000 n=10)
geomean                         11.27µ        1.871µ       -83.41%

                     │     base     │                  avx2                   │
                     │     B/s      │      B/s       vs base                  │
Valid/1kValid          1.051Gi ± 0%   11.437Gi ± 0%   +988.55% (p=0.000 n=10)
Valid/1MValid          1.046Gi ± 1%   12.065Gi ± 1%  +1053.11% (p=0.000 n=10)
Valid/10ASCII          1.961Gi ± 0%    2.243Gi ± 0%    +14.40% (p=0.000 n=10)
Valid/10Japan          881.9Mi ± 0%    894.2Mi ± 0%     +1.40% (p=0.000 n=10)
Valid/small4096        1.028Gi ± 0%   11.798Gi ± 0%  +1047.49% (p=0.000 n=10)
Valid/small32768       1.030Gi ± 1%   12.071Gi ± 0%  +1071.69% (p=0.000 n=10)
Valid/small65536       1.031Gi ± 0%   12.082Gi ± 0%  +1072.40% (p=0.000 n=10)
Valid/small262144      1.031Gi ± 0%   12.090Gi ± 0%  +1072.86% (p=0.000 n=10)
Valid/small4194304     1.030Gi ± 1%   12.034Gi ± 0%  +1068.81% (p=0.000 n=10)
Valid/small33554432    1.026Gi ± 0%   11.044Gi ± 1%   +976.30% (p=0.000 n=10)
Valid/small134217728   1.025Gi ± 0%    9.054Gi ± 1%   +783.14% (p=0.000 n=10)
Valid/small268435456   1.026Gi ± 0%    9.006Gi ± 1%   +777.99% (p=0.000 n=10)
Valid/tail300          1.037Gi ± 0%    9.189Gi ± 0%   +786.53% (p=0.000 n=10)
Valid/tail316          1.037Gi ± 0%    9.948Gi ± 0%   +859.39% (p=0.000 n=10)
geomean                1.067Gi         8.136Gi        +662.20%

The algorithm also scales much better than the current implementation of utf8.Valid as the size of input grows, as shown by this graph plotting the compute time for the two versions for input sizes of 0 to 400 bytes:
image

The main trade-off here is complexity; while the code has been fuzz-tested against the current version of utf8.Valid, as well as successfully used on production workloads, a bug will always be possible in this implementation or on other platforms that would be added later on.

I looked for prior discussions on improving the performance of unicode/utf8, but could not find much content on the topic. This proposal will likely set a precedent for porting the algorithm to other architectures, if accepted.

Part of the requirements to integrate it with the standard library could be to expand test coverage in the unicode/utf8 package to ensure the correctness of this implementation and hypothetical future additions for other architectures.

@gopherbot gopherbot added this to the Proposal milestone Oct 3, 2023
@ianlancetaylor
Copy link
Contributor

This is not an API change, so taking it out of the proposal process.

@ianlancetaylor ianlancetaylor changed the title proposal: unicode/utf8: optimize utf8.Valid with AVX2 instructions on AMD64 unicode/utf8: optimize utf8.Valid with AVX2 instructions on AMD64 Oct 3, 2023
@ianlancetaylor ianlancetaylor added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. and removed Proposal labels Oct 3, 2023
@ianlancetaylor ianlancetaylor modified the milestones: Proposal, Backlog Oct 3, 2023
@ianlancetaylor
Copy link
Contributor

The https://go.dev/wiki/AssemblyPolicy page is relevant here.

@artyom
Copy link
Member

artyom commented Oct 3, 2023

Probably related to #47120

@achille-roussel
Copy link
Contributor Author

achille-roussel commented Oct 3, 2023

Thanks for refining the category @ianlancetaylor, and for pointing out the Assembly Policy (somehow I had not found it); the document is extremely useful to help answer out some of the questions I had:

  • Minimize use of assembly. We'd rather have a small amount of assembly for a 50% speedup rather than twice as much assembly for a 55% speedup. Explain the decision to place the assembly/Go boundary where it is in the commit message, and support it with benchmarks.

Based on that example, it seems that the change would be worth it; ~900% increase in throughput is measurable and likely beneficial to any application that uses utf8.Valid on a non-trivial amount of data. We are adding ~300 lines of assembly, there was no assembly at all in unicode/utf8, but the amount of code remains manageable: there is a research paper documenting the algorithm, and most of it is global masks used in SIMD instructions.

  • Use higher level programs to generate non-trivial amounts of assembly, either standalone Go programs or go get-able programs, like avo. Output of other reproducible processes (like formally verified code generators) will also be considered. Discuss the implementation strategy on the issue tracker in advance.

The implementation was generated with Avo, I will look for examples of how/if Avo has been used in the stdlib (for example, I'm not sure if we should add the Avo generator to the stdlib code somewhere). If anyone has material to share on this topic it will be very useful.

  • Use small, testable units (25–75 lines) called from higher-level logic written in Go. If using small, testable functions called from logic written in Go is too slow, use small, testable assembly units with Go-compatible wrappers, so that Go tests can still test the individual units.
  • Any assembly function needs a reference Go implementation, that’s tested side-by-side with the assembly. Follow golang.org/wiki/TargetSpecific for structure and testing practices.
  • The interface of the assembly units and of the reference Go implementation must be the same across architectures, unless the platforms have fundamentally different capabilities (such as high-level cryptographic instructions).

This pretty much covers all the questions I had about how to shape the implementation and what kind of testing will be expected 👍

  • Unless the Go Security team explicitly commits to owning the specific implementation, an external contributor must commit to maintaining it. If changes are required (for example as part of a broader refactor) and the maintainer is not available, the assembly will be removed.
  • The code must be tested in our CI. This means there need to be builders that support the instructions, and if there are multiple (or fallback) paths they must be tested separately. (Tip: use GODEBUG=cpu.X=off to disable detection of CPU features.)
  • Document in the Go code why the implementation requires assembly (specific performance benefit, access to instructions, etc), so we can reevaluate as the compiler improves.

@seankhliao
Copy link
Member

cc @golang/security

@arl
Copy link
Contributor

arl commented Oct 6, 2023

@achille-roussel avo has already been used in crypto/internal for example.

@achille-roussel
Copy link
Contributor Author

Thank @arl, this is exactly the example I was looking for 👍

@gopherbot
Copy link

Change https://go.dev/cl/535838 mentions this issue: unicode/utf8.Valid: use AVX2 on amd64

@adonovan
Copy link
Member

adonovan commented Mar 5, 2024

It would be interesting to instrument utf8.IsValid to build a histogram of the lengths of strings passed to it, run a handful of applications that use IsValid, and multiply each bucket by its expected benefit according to the graph above. If all strings are short, the benefit will be negative, but it wouldn't take many longer strings to make the balance significantly positive.

My guess is that this call in protobuf alone might justify the new implementation.

@achille-roussel
Copy link
Contributor Author

That's a great suggestion!

Note that the graph I posted above is a bit outdated, in the CL https://go-review.googlesource.com/c/go/+/535838 the implementation was improved and doesn't suffer regressions on small strings anymore.

@pelletier do you have an updated graph that would be more representative of the latest code version?

@pelletier
Copy link

pelletier commented Mar 5, 2024

Here's an updated graph generated on my machine:

image

If anyone is interested in reproducing the results, I've uploaded a small python script I've used to generate this plot: https://gist.github.com/pelletier/46320448776c9ba9163270130ef556aa

For context, the graph initially posted on this issue showed the AVX2 version being slower for pure Go for inputs shorter than 32 bytes. This was caused because the initial implementation called the pure Go version of IsValid for those inputs. That behavior was a trade-off we made to simplify the maintenance of this code: our use case seldom had small inputs, and it required less assembly. The version in CL-535838 doesn't have the same problem, as we decided a little bit more assembly was worth not taking the small inputs hit if that code were to be included in the standard library.

Technically the benchmark above reports the AVX2 version being slower for empty inputs, and I haven't benchmarked it on inputs of sizes 1 through 7. Here are the first few lines of the benchstat output:

goos: linux
goarch: amd64
pkg: unicode/utf8
cpu: AMD Ryzen 9 5950X 16-Core Processor
                  │   base.txt   │              avx2.txt              │
                  │    sec/op    │   sec/op     vs base               │
Valid/small0-16      2.087n ± 0%   2.523n ± 0%  +20.94% (p=0.002 n=6)
Valid/small8-16      4.828n ± 0%   4.215n ± 0%  -12.70% (p=0.002 n=6)
Valid/small16-16     9.149n ± 0%   4.210n ± 0%  -53.98% (p=0.002 n=6)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Projects
None yet
Development

No branches or pull requests

8 participants