Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hash/adler32: optimize on 64-bit architectures #17867

Open
kirillx opened this issue Nov 9, 2016 · 12 comments
Open

hash/adler32: optimize on 64-bit architectures #17867

kirillx opened this issue Nov 9, 2016 · 12 comments
Labels
NeedsFix The path to resolution is known, but the work has not been done. Performance
Milestone

Comments

@kirillx
Copy link
Contributor

kirillx commented Nov 9, 2016

What version of Go are you using (go version)?

go version go1.7 darwin/amd64

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/dev/gocode"
GORACE=""
GOROOT="/opt/local/lib/go"
GOTOOLDIR="/opt/local/lib/go/pkg/tool/darwin_amd64"
CC="/usr/bin/clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -gno-record-gcc-switches -fno-common"
CXX="clang++"
CGO_ENABLED="1"

What did you do?

$ cd /opt/local/lib/go/src/hash/adler32
$ go test -bench .
BenchmarkAdler32KB-8 3000000 519 ns/op 1971.24 MB/s
PASS

What did you expect to see?

64bit version written in C easily delivers >4GB/s on this machine

Solution

Use 64bit arithmetics via unsafe.Pointer casting.
NOTE: usage of C style indexing instead of slices in the inner loop also gives a small ~5-10% boost.

Result

~1.8x improvement if implemented with 64bit arithmetics.
$ go test -bench .
BenchmarkAdler32KB-8 5000000 289 ns/op 3531.85 MB/s
PASS

@gopherbot
Copy link
Contributor

CL https://golang.org/cl/32990 mentions this issue.

@dsnet dsnet added this to the Go1.9 milestone Nov 9, 2016
@dsnet dsnet changed the title adler32 hash function is quite slower then could be on x64 hash/adler32: optimize on 64-bit architectures Nov 9, 2016
@dr2chase
Copy link
Contributor

Just for future reference, what's the distribution of sizes on adler32 inputs? For sufficiently large inputs it can be computed in parallel (I have done this myself) which is another way of improving performance. The same can be done with somewhat more hair for CRC calculations.

@kirillx
Copy link
Contributor Author

kirillx commented Nov 12, 2016

There are really cases when inputs are GBs/sec. So parallelism is great, but hey, you want everything to be slow just because modern CPUs have 8+ cores typically?

@dsnet
Copy link
Member

dsnet commented Nov 12, 2016

adler32 is used in the zlib format and in many other places (see imports: https://godoc.org/hash/adler32?importers). The size of the inputs probably varies very widely, so I don't think it is appropriate to make the adler32 implementation parallel by default. That said, I personally would be interested in seeing Combine functions added to the various hash packages. I tried pushing for that in #12297.

@dr2chase
Copy link
Contributor

Divide-and-conquer parallelism usually has a minimum chunk size, so "not by default" is hardwired into the algorithm. I recall the minimum chunk size for adler32 was measured in megabytes, but that was using heavier-weight threads, not goroutines. The question is really more one of whether it is worth the programming effort.

@bradfitz bradfitz added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label May 22, 2017
@bradfitz
Copy link
Contributor

@rsc, @ianlancetaylor, @griesemer, @randall77: decision needed: does adler32 warrant unsafe to get a 1.8x speedup?

@bradfitz bradfitz modified the milestones: Go1.9Maybe, Go1.9 May 22, 2017
@dsnet
Copy link
Member

dsnet commented May 22, 2017

I'm not convinced unsafe is needed in that optimization. Why can't it just use the wide integer load technique that binary.LittleEndian.Uint64 uses?

@griesemer
Copy link
Contributor

I'm not against unsafe per se here, but @dsnet's suggestion should be considered first.

@rsc
Copy link
Contributor

rsc commented May 22, 2017

The CL uses unsafe to do an 8-byte load instead of 8 1-byte loads. It has to be amd64-only because of unaligned loads (not valid on some other 64-bit systems). If we use encoding/binary to do the load, doesn't the compiler handle that optimization for us?

While unsafe would be fine here (because assembly would be fine here, given at least a 2X speedup), it may be that unsafe is not necessary and we can still get the speedup, because the compiler knows about the functions in encoding/binary and emits the best instructions it knows how, and in particular it knows how to get an 8-byte load from a byte slice.

@rsc
Copy link
Contributor

rsc commented May 22, 2017

Decision: try to do without unsafe first.

@rsc rsc added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. labels May 22, 2017
@ALTree ALTree modified the milestones: Go1.10, Go1.9Maybe Jun 14, 2017
@mengzhuo
Copy link
Contributor

I have send a CL about it
https://go-review.googlesource.com/c/48270/

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/51850 mentions this issue: hash/adler32: add AMD64 optimized Adler32 calculation

@rsc rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017
@bradfitz bradfitz modified the milestones: Go1.11, Unplanned May 18, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsFix The path to resolution is known, but the work has not been done. Performance
Projects
None yet
Development

No branches or pull requests

9 participants