
encoding/base64: decoding is slow #19636

Open
josselin-c opened this issue Mar 21, 2017 · 14 comments

@josselin-c
Contributor

What version of Go are you using (go version)?

Go 1.8

What operating system and processor architecture are you using (go env)?

amd64

What did you do?

On my slow computer, using encoding/base64, I can decode data at ~100 MB/s.
It should be much faster, as shown by https://github.com/aklomp/base64.

I'm planning to work on this in my spare time. This issue tracks that effort.
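For context, a minimal sketch of how a throughput figure like this can be measured with the standard benchmark harness (illustrative, not the package's own benchmark; put it in a `_test.go` file and run `go test -bench=.`):

```go
package base64bench

import (
	"encoding/base64"
	"testing"
)

// BenchmarkDecodeString measures decode throughput on an 8 KiB input;
// SetBytes makes the benchmark report MB/s alongside ns/op.
func BenchmarkDecodeString(b *testing.B) {
	data := base64.StdEncoding.EncodeToString(make([]byte, 8192))
	b.SetBytes(int64(len(data)))
	for i := 0; i < b.N; i++ {
		if _, err := base64.StdEncoding.DecodeString(data); err != nil {
			b.Fatal(err)
		}
	}
}
```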

@gopherbot

CL https://golang.org/cl/34950 mentions this issue.

@dgryski
Contributor

dgryski commented Mar 21, 2017

See also https://github.com/powturbo/TurboBase64 for a fast encoder/decoder that doesn't need assembly.

@ALTree ALTree changed the title encoding/base64 decoding is slow encoding/base64: decoding is slow Mar 21, 2017
@bradfitz bradfitz added this to the Go1.9Maybe milestone Mar 21, 2017
@gopherbot

CL https://golang.org/cl/38632 mentions this issue.

@josselin-c
Contributor Author

@dgryski I've seen such LUT-based implementations and I'm wondering whether they're acceptable. What's Go's philosophy on embedding "large" LUTs? Won't such tables thrash the L1 cache when used and penalize the rest of the application?
I have an SSE implementation in the pipeline that will be a further improvement over CL 38632, but it's limited to amd64.
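For reference, the scalar decoder's table is small: the decode map is a 256-byte array, i.e. four cache lines, which is the baseline the "large LUT" concern is measured against. A minimal sketch of how such a map is built (illustrative code, not the stdlib's internals):

```go
package main

import "fmt"

const stdAlphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

// buildDecodeMap builds a 256-entry table mapping each alphabet byte to
// its 6-bit value, with 0xFF marking bytes outside the alphabet.
func buildDecodeMap(alphabet string) [256]byte {
	var m [256]byte
	for i := range m {
		m[i] = 0xFF
	}
	for i := 0; i < len(alphabet); i++ {
		m[alphabet[i]] = byte(i)
	}
	return m
}

func main() {
	m := buildDecodeMap(stdAlphabet)
	fmt.Printf("'A'=%d '/'=%d '*'=%#x\n", m['A'], m['/'], m['*'])
}
```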

gopherbot pushed a commit that referenced this issue Apr 24, 2017
Optimize DecodeString for the common case where most of the input isn't
a newline or a padding character.
Also add some test cases found when fuzzing this implementation against
upstream.
Change the Decode benchmark to run with different input sizes.

name                 old time/op    new time/op    delta
DecodeString/2-4       71.5ns ± 4%    70.0ns ± 6%     ~     (p=0.246 n=5+5)
DecodeString/4-4        112ns ±25%      91ns ± 2%     ~     (p=0.056 n=5+5)
DecodeString/8-4        136ns ± 5%     126ns ± 5%   -7.33%  (p=0.016 n=5+5)
DecodeString/64-4       872ns ±29%     652ns ±21%  -25.23%  (p=0.032 n=5+5)
DecodeString/8192-4    90.9µs ±21%    61.0µs ±13%  -32.87%  (p=0.008 n=5+5)

name                 old speed      new speed      delta
DecodeString/2-4     56.0MB/s ± 4%  57.2MB/s ± 6%     ~     (p=0.310 n=5+5)
DecodeString/4-4     73.4MB/s ±23%  87.7MB/s ± 2%     ~     (p=0.056 n=5+5)
DecodeString/8-4     87.8MB/s ± 5%  94.8MB/s ± 5%   +7.98%  (p=0.016 n=5+5)
DecodeString/64-4     103MB/s ±24%   136MB/s ±19%  +32.63%  (p=0.032 n=5+5)
DecodeString/8192-4   122MB/s ±19%   180MB/s ±11%  +47.75%  (p=0.008 n=5+5)

Improves #19636

Change-Id: I39667f4fb682a12b3137946d017ad999553c5780
Reviewed-on: https://go-review.googlesource.com/34950
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
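A hedged sketch of the fast-path idea from the commit above (illustrative names and simplified layout, not the actual CL): decode 4-byte quanta straight through, and hand control back to the careful path as soon as a newline, padding character, or invalid byte shows up.

```go
package b64sketch

// decodeFast consumes src in 4-byte quanta while every byte is a plain
// alphabet character; m maps bytes to 6-bit values, with 0xFF for
// anything else ('\n', '\r', '=', garbage). It reports how much it
// consumed so the caller can finish with the slower, fully checked path.
func decodeFast(m *[256]byte, dst, src []byte) (ndst, nsrc int) {
	for nsrc+4 <= len(src) {
		a, b, c, d := m[src[nsrc]], m[src[nsrc+1]], m[src[nsrc+2]], m[src[nsrc+3]]
		if a|b|c|d == 0xFF { // some byte needs special handling: bail out
			return ndst, nsrc
		}
		dst[ndst] = a<<2 | b>>4
		dst[ndst+1] = b<<4 | c>>2 // byte arithmetic truncates, keeping b's low 4 bits
		dst[ndst+2] = c<<6 | d
		ndst, nsrc = ndst+3, nsrc+4
	}
	return ndst, nsrc
}
```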
@bradfitz bradfitz modified the milestones: Go1.10, Go1.9Maybe May 24, 2017
gopherbot pushed a commit that referenced this issue Oct 9, 2017
Optimize base64 decoding speed by adding 32-bit and 64-bit specialized
methods that don't perform any error checking and fall back to the more
complex decodeQuantum method when a non-base64 character is present.

On a 64-bit CPU:

name                 old time/op    new time/op     delta
DecodeString/2-4       70.0ns ± 6%     69.2ns ± 0%     ~     (p=0.169 n=5+8)
DecodeString/4-4       91.3ns ± 2%     80.4ns ± 0%  -11.89%  (p=0.001 n=5+10)
DecodeString/8-4        126ns ± 5%      106ns ± 0%  -16.14%  (p=0.000 n=5+7)
DecodeString/64-4       652ns ±21%      361ns ± 0%  -44.57%  (p=0.000 n=5+7)
DecodeString/8192-4    61.0µs ±13%     31.5µs ± 1%  -48.38%  (p=0.001 n=5+9)

name                 old speed      new speed       delta
DecodeString/2-4     57.2MB/s ± 6%   57.7MB/s ± 2%     ~     (p=0.419 n=5+9)
DecodeString/4-4     87.7MB/s ± 2%   99.5MB/s ± 0%  +13.45%  (p=0.001 n=5+10)
DecodeString/8-4     94.8MB/s ± 5%  112.6MB/s ± 1%  +18.82%  (p=0.001 n=5+9)
DecodeString/64-4     136MB/s ±19%    243MB/s ± 0%  +78.17%  (p=0.003 n=5+7)
DecodeString/8192-4   180MB/s ±11%    347MB/s ± 1%  +92.94%  (p=0.001 n=5+9)

Improves #19636

Change-Id: Ic10a454851093a7e1d46ca0c140deed73535d990
Reviewed-on: https://go-review.googlesource.com/38632
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
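A hedged sketch of the 64-bit specialization the commit describes (simplified; the real CL unrolls the lookups): decode eight input bytes into six output bytes with one 64-bit store, and return false on the first non-base64 byte so the caller can retry via decodeQuantum.

```go
package b64sketch

import "encoding/binary"

// decode64 accumulates eight 6-bit values into 48 bits and writes them
// with one big-endian store; dst must have room for 8 bytes, of which
// the first 6 carry data and the last 2 are zero.
func decode64(m *[256]byte, dst, src []byte) bool {
	var dn uint64
	for i := 0; i < 8; i++ {
		n := m[src[i]]
		if n == 0xFF {
			return false // non-base64 byte: fall back to decodeQuantum
		}
		dn = dn<<6 | uint64(n)
	}
	binary.BigEndian.PutUint64(dst, dn<<16)
	return true
}
```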
@bradfitz bradfitz modified the milestones: Go1.10, Go1.11 Nov 15, 2017
@powturbo

powturbo commented Feb 6, 2018

@josselin-c: I've seen such LUT-based implementations and I'm wondering whether they're acceptable.

Strange rumours, but the short scalar "turbob64encs" encoder uses only 64 bytes, and the "turbob64decs" decoder uses just 80 bytes (of a 256-byte LUT) while delivering 2 GB/s in practical scenarios. That is less memory than other SIMD/AVX2 base64 functions use. The fast "turbob64dec" function in fact needs less than 1 KB and decodes faster than other scalar/SSE functions. See the benchmarks: Turbo-Base64

quasilyte added a commit to quasilyte/go-contributing-ru that referenced this issue Apr 1, 2018
New tasks include:
golang/go#19675 cmd/vet: report uses of -0 in float32/64 context
golang/go#19683 cmd/compile: eliminate usages of global lineno
golang/go#19670 x/tools/go/ssa: make opaqueType less annoying to use
golang/go#19636 encoding/base64: decoding is slow
golang/go#23471 x/perf/cmd/benchstat: tips or quickstart for newcomers
golang/go#19577 test: errorcheck support for intraline errors
golang/go#19490 cmd/vet: reduce the amount of false positives for -shadow mode
golang/go#19042 cmd/internal/obj: optimize wrapper method prologue for branch prediction
golang/go#19013 cmd/compile: add tool for understanding/debugging SSA rules
@bradfitz bradfitz modified the milestones: Go1.11, Unplanned May 18, 2018
@gopherbot

Change https://golang.org/cl/113776 mentions this issue: encoding/base64: slight decoding speed-up

gopherbot pushed a commit that referenced this issue Aug 22, 2018
First, use a dummy slice access on decode64 and decode32 to ensure that
there is a single bounds check for src.

Second, move the PutUint64/PutUint32 calls out of these functions,
meaning that they are simpler and smaller. This may also open the door
to inlineability in the future, but for now, they both go past the
budget.

While at it, get rid of the ilen and olen variables, which have no
impact whatsoever on performance. At least, not measurable by any of the
benchmarks.

name                 old time/op    new time/op    delta
DecodeString/2-4       54.3ns ± 1%    55.2ns ± 2%   +1.60%  (p=0.017 n=5+6)
DecodeString/4-4       66.6ns ± 1%    66.8ns ± 2%     ~     (p=0.903 n=6+6)
DecodeString/8-4       79.3ns ± 2%    79.6ns ± 1%     ~     (p=0.448 n=6+6)
DecodeString/64-4       300ns ± 1%     281ns ± 3%   -6.54%  (p=0.002 n=6+6)
DecodeString/8192-4    27.4µs ± 1%    23.7µs ± 2%  -13.47%  (p=0.002 n=6+6)

name                 old speed      new speed      delta
DecodeString/2-4     73.7MB/s ± 1%  72.5MB/s ± 2%   -1.55%  (p=0.026 n=5+6)
DecodeString/4-4      120MB/s ± 1%   120MB/s ± 2%     ~     (p=0.851 n=6+6)
DecodeString/8-4      151MB/s ± 2%   151MB/s ± 1%     ~     (p=0.485 n=6+6)
DecodeString/64-4     292MB/s ± 1%   313MB/s ± 3%   +7.03%  (p=0.002 n=6+6)
DecodeString/8192-4   399MB/s ± 1%   461MB/s ± 2%  +15.58%  (p=0.002 n=6+6)

For #19636.

Change-Id: I0dfbdafa2a41dc4c582f63aef94b90b8e473731c
Reviewed-on: https://go-review.googlesource.com/113776
Reviewed-by: Ian Lance Taylor <iant@golang.org>
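The dummy-slice-access trick reads oddly the first time, so here is a hedged sketch (hypothetical helper name): one early read of src[7] proves to the compiler that the slice holds at least eight bytes, so the eight loads that follow need no individual bounds checks.

```go
package b64sketch

// load8 performs the eight table lookups for one quantum. The dummy
// access to src[7] is the idiom the commit refers to: it gives the
// compiler a single bounds check covering all eight reads below.
func load8(m *[256]byte, src []byte) (n1, n2, n3, n4, n5, n6, n7, n8 byte) {
	_ = src[7]
	return m[src[0]], m[src[1]], m[src[2]], m[src[3]],
		m[src[4]], m[src[5]], m[src[6]], m[src[7]]
}
```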
@gopherbot

Change https://golang.org/cl/151177 mentions this issue: encoding/base64: lift nil check out of decode loop

@gopherbot

Change https://golang.org/cl/151197 mentions this issue: encoding/json: make decode32/decode64 inlineable

@mvdan
Member

mvdan commented Nov 25, 2018

With the two CLs above, the decoder goes from ~500 MB/s to ~630 MB/s on an 8 KiB input. Note that this is on a 2014 ultrabook locked at 70% CPU frequency to prevent overheating and throttling.

The only remaining bottleneck I can see in the pure Go code is that it performs more bounds checks than needed per decoded chunk of bytes. A couple for every 8 bytes should be enough, but it currently does at least eight. If we can fix that via #28942, I think it should give another nice 5-10% speed-up.
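For anyone wanting to reproduce the bounds-check count, the compiler can report every check it keeps (a standard gc debug flag, not specific to this package):

```
go build -gcflags='-d=ssa/check_bce/debug=1' encoding/base64
```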

gopherbot pushed a commit that referenced this issue Mar 13, 2019
Most of the decoding time is spent in the first Decode loop, since the
rest of the function only deals with the few remaining bytes. Any
unnecessary work done in that loop body matters tremendously.

One such unnecessary bottleneck was the use of the enc.decodeMap table.
Since enc is a pointer receiver, and the field is used within the
non-inlineable function decode64, the decoder must perform a nil check
at every iteration.

To fix that, move the enc.decodeMap uses to the parent function, where
we can lift the nil check outside the loop. That gives roughly a 15%
speed-up. The function no longer performs decoding per se, so rename it.
While at it, remove the now unnecessary receivers.

An unfortunate side effect of this change is that the loop now contains
eight bounds checks on src instead of just one. However, not having to
slice src plus the nil check removal well outweigh the added cost.

The other piece that made decode64 slow was that it wasn't inlined, and
had multiple branches. Use a simple bitwise-or trick suggested by Roger
Peppe, and collapse the rest of the bitwise logic into a single
expression. Inlinability and the reduced branching give a further 10%
speed-up.

Finally, add these two functions to TestIntendedInlining, since we want
them to stay inlinable.

Apply the same refactor to decode32 for consistency, and to let 32-bit
architectures see a similar performance gain for large inputs.

name                 old time/op    new time/op    delta
DecodeString/2-8       47.3ns ± 1%    45.8ns ± 0%   -3.28%  (p=0.002 n=6+6)
DecodeString/4-8       55.8ns ± 2%    51.5ns ± 0%   -7.71%  (p=0.004 n=5+6)
DecodeString/8-8       64.9ns ± 0%    61.7ns ± 0%   -4.99%  (p=0.004 n=5+6)
DecodeString/64-8       238ns ± 0%     198ns ± 0%  -16.54%  (p=0.002 n=6+6)
DecodeString/8192-8    19.5µs ± 0%    14.6µs ± 0%  -24.96%  (p=0.004 n=6+5)

name                 old speed      new speed      delta
DecodeString/2-8     84.6MB/s ± 1%  87.4MB/s ± 0%   +3.38%  (p=0.002 n=6+6)
DecodeString/4-8      143MB/s ± 2%   155MB/s ± 0%   +8.41%  (p=0.004 n=5+6)
DecodeString/8-8      185MB/s ± 0%   195MB/s ± 0%   +5.29%  (p=0.004 n=5+6)
DecodeString/64-8     369MB/s ± 0%   442MB/s ± 0%  +19.78%  (p=0.002 n=6+6)
DecodeString/8192-8   560MB/s ± 0%   746MB/s ± 0%  +33.27%  (p=0.004 n=6+5)

Updates #19636.

Change-Id: Ib839577b0e3f5a2bb201f5cae580c61365d92894
Reviewed-on: https://go-review.googlesource.com/c/go/+/151177
Run-TryBot: Daniel Martí <mvdan@mvdan.cc>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: roger peppe <rogpeppe@gmail.com>
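A hedged sketch of the validation helper this commit describes (close to the CL's shape, with names simplified): valid lookups are 6-bit values and failures are 0xFF, so OR-ing all eight results equals 0xFF only when at least one byte was invalid, and a single branch replaces eight.

```go
package b64sketch

// assemble64 validates eight looked-up values with a single comparison
// and packs them into the top 48 bits of a uint64, ready for a
// big-endian PutUint64 in the caller.
func assemble64(n1, n2, n3, n4, n5, n6, n7, n8 byte) (uint64, bool) {
	if n1|n2|n3|n4|n5|n6|n7|n8 == 0xFF {
		return 0, false
	}
	return uint64(n1)<<58 | uint64(n2)<<52 | uint64(n3)<<46 |
		uint64(n4)<<40 | uint64(n5)<<34 | uint64(n6)<<28 |
		uint64(n7)<<22 | uint64(n8)<<16, true
}
```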
@robfig
Contributor

robfig commented Jan 25, 2020

A recent paper describes an algorithm for base64 encoding using 3 AVX-512 instructions per 48/64 bytes:
https://arxiv.org/pdf/1910.05109.pdf

C code implementing the algorithm is here:
https://github.com/WojciechMula/base64simd

It looks like the Go assembler does not have the AVX-512 instructions defined.

@mvdan
Member

mvdan commented Jan 25, 2020

I'm not sure using AVX-512 is worth it for the standard library. We'd probably need to write assembly and start worrying about hardware compatibility.

The current decoder is already capable of reaching 1GiB/s on a modern CPU. Is that still considered slow for the standard library? If so, I think we should continue looking at pure Go improvements.

@powturbo

powturbo commented Jan 26, 2020

The AVX-512 version requires AVX512VBMI, which is currently available only on Cannon Lake and Ice Lake (notebooks). In the paper they state "The speed of the new AVX-512 codec is more than twice that of the state-of-the-art AVX2 codec".
However, the AVX2 Turbo-Base64 version is also 2x faster than their AVX2 fastbase64 (see the benchmarks). For short strings, Turbo-Base64 is 4 times faster than the next fastest base64 SIMD implementation.
And recently, ClickHouse replaced the most popular base64 SIMD library on GitHub with Turbo-Base64.
This is perhaps not for a standard library, but if you're looking for more speed you should consider a SIMD base64 implementation like Turbo-Base64, which delivers 26 GB/s on a modestly clocked 3.4 GHz Skylake.
1 GB/s seems fast enough in general, but it can be too slow for databases, for example, as in the ClickHouse case.

@robfig
Contributor

robfig commented Jan 27, 2020

cc @mvdan

Makes sense; it does seem like a significant and difficult assembly job. I spent some time looking into it, but as a first-time assembly coder I didn't make any progress.

I don't have a use case that requires it to be faster. I assumed there was interest in making it faster given the existence of an open issue, which I came upon while reviewing recent developments.

@mvdan
Member

mvdan commented Jan 27, 2020

Easy wins are always welcome, but I don't think adding hundreds of lines of assembly is in that camp :) A base64 package optimized for performance (rather than safety) should probably live outside the standard library.
