Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hash/crc32: use faster method for CRC32-C #16107

Closed
RaduBerinde opened this issue Jun 18, 2016 · 4 comments
Closed

hash/crc32: use faster method for CRC32-C #16107

RaduBerinde opened this issue Jun 18, 2016 · 4 comments
Milestone

Comments

@RaduBerinde
Copy link
Contributor

The hardware (SSE4.2) assisted CRC32-C implementation can be made faster (from my experience about 2.5 - 2.7 times faster if checksumming enough bytes e.g. 512+),

The gist of the idea is that (on recent Intel platforms at least) the CRC32 instruction takes three cycles but three of them can be pipelined in parallel if they don't depend on each other; so we can split the buffer into three, checksum the three pieces in "parallel" (interleaving the CRC32 instructions) and then combine the results using some tables.

This article goes into more depth: http://www.drdobbs.com/parallel/fast-parallelized-crc-computation-using/229401411

I can take a stab at implementing it at some point when I have some time of my hands, unless there are objections.

@ianlancetaylor ianlancetaylor added this to the Go1.8 milestone Jun 20, 2016
@ianlancetaylor
Copy link
Contributor

If you can make it faster while producing the same results, by all means go for it. Thanks.

@RaduBerinde
Copy link
Contributor Author

Working on a decent initial version. Sneak peek:

CastagnoliCrc15B-4    597MB/s ± 0%    662MB/s ± 1%   +10.86%  (p=0.029 n=4+4)
CastagnoliCrc40B-4   1.74GB/s ± 0%   1.68GB/s ± 1%    -3.35%  (p=0.029 n=4+4)
CastagnoliCrc1KB-4   7.99GB/s ± 0%  14.14GB/s ± 0%   +76.89%  (p=0.029 n=4+4)
CastagnoliCrc4KB-4   8.84GB/s ± 0%  18.15GB/s ± 0%  +105.38%  (p=0.029 n=4+4)
CastagnoliCrc32KB-4  9.11GB/s ± 1%  19.40GB/s ± 0%  +112.94%  (p=0.029 n=4+4)

@gopherbot
Copy link

CL https://golang.org/cl/24471 mentions this issue.

@gopherbot
Copy link

CL https://golang.org/cl/27931 mentions this issue.

gopherbot pushed a commit that referenced this issue Aug 28, 2016
The algorithm is explained in the comments. The improvement in
throughput is about 1.4x for buffers between 500b-4Kb and 2.5x-2.6x
for larger buffers.

Additionally, we no longer initialize the software tables if SSE4.2 is
available.

Adding a test for the SSE implementation (restricted to amd64 and
amd64p32).

Benchmarks on a Haswell i5-4670 @ 3.4 GHz:

name                           old time/op    new time/op     delta
CastagnoliCrc15B-4               21.9ns ± 1%     22.9ns ± 0%    +4.45%
CastagnoliCrc15BMisaligned-4     22.6ns ± 0%     23.4ns ± 0%    +3.43%
CastagnoliCrc40B-4               23.3ns ± 0%     23.9ns ± 0%    +2.58%
CastagnoliCrc40BMisaligned-4     25.4ns ± 0%     26.1ns ± 0%    +2.86%
CastagnoliCrc512-4               72.6ns ± 0%     52.8ns ± 0%   -27.33%
CastagnoliCrc512Misaligned-4     76.3ns ± 1%     56.3ns ± 0%   -26.18%
CastagnoliCrc1KB-4                128ns ± 1%       89ns ± 0%   -30.04%
CastagnoliCrc1KBMisaligned-4      130ns ± 0%       88ns ± 0%   -32.65%
CastagnoliCrc4KB-4                461ns ± 0%      187ns ± 0%   -59.40%
CastagnoliCrc4KBMisaligned-4      463ns ± 0%      191ns ± 0%   -58.77%
CastagnoliCrc32KB-4              3.58µs ± 0%     1.35µs ± 0%   -62.22%
CastagnoliCrc32KBMisaligned-4    3.58µs ± 0%     1.36µs ± 0%   -61.84%

name                           old speed      new speed       delta
CastagnoliCrc15B-4              684MB/s ± 1%    655MB/s ± 0%    -4.32%
CastagnoliCrc15BMisaligned-4    663MB/s ± 0%    641MB/s ± 0%    -3.32%
CastagnoliCrc40B-4             1.72GB/s ± 0%   1.67GB/s ± 0%    -2.69%
CastagnoliCrc40BMisaligned-4   1.58GB/s ± 0%   1.53GB/s ± 0%    -2.82%
CastagnoliCrc512-4             7.05GB/s ± 0%   9.70GB/s ± 0%   +37.59%
CastagnoliCrc512Misaligned-4   6.71GB/s ± 1%   9.09GB/s ± 0%   +35.43%
CastagnoliCrc1KB-4             7.98GB/s ± 1%  11.46GB/s ± 0%   +43.55%
CastagnoliCrc1KBMisaligned-4   7.86GB/s ± 0%  11.70GB/s ± 0%   +48.75%
CastagnoliCrc4KB-4             8.87GB/s ± 0%  21.80GB/s ± 0%  +145.69%
CastagnoliCrc4KBMisaligned-4   8.83GB/s ± 0%  21.39GB/s ± 0%  +142.25%
CastagnoliCrc32KB-4            9.15GB/s ± 0%  24.22GB/s ± 0%  +164.62%
CastagnoliCrc32KBMisaligned-4  9.16GB/s ± 0%  24.00GB/s ± 0%  +161.94%

Fixes #16107.

Change-Id: Ibe50ea76574674ce0571ef31c31015e0ed66b907
Reviewed-on: https://go-review.googlesource.com/27931
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@golang golang locked and limited conversation to collaborators Aug 28, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants