-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
hash/adler32: optimize on 64-bit architectures #17867
Comments
CL https://golang.org/cl/32990 mentions this issue. |
Just for future reference, what's the distribution of sizes on adler32 inputs? For sufficiently large inputs it can be computed in parallel (I have done this myself) which is another way of improving performance. The same can be done with somewhat more hair for CRC calculations. |
There are really cases when inputs are GBs/sec. So parallelism is great, but hey, you want everything to be slow just because modern CPUs have 8+ cores typically? |
adler32 is used in the zlib format and in many other places (see imports: https://godoc.org/hash/adler32?importers). The size of the inputs probably varies very widely, so I don't think it is appropriate to make the adler32 implementation parallel by default. That said, I personally would be interested in seeing |
Divide-and-conquer parallelism usually has a minimum chunk size, so "not by default" is hardwired into the algorithm. I recall the minimum chunk size for adler32 was measured in megabytes, but that was using heavier-weight threads, not goroutines. The question is really more one of whether it is worth the programming effort. |
@rsc, @ianlancetaylor, @griesemer, @randall77: decision needed: does adler32 warrant unsafe to get a 1.8x speedup? |
I'm not convinced unsafe is needed in that optimization. Why can't it just use the wide integer load technique that binary.LittleEndian.Uint64 uses? |
I'm not against unsafe per se here, but @dsnet's suggestion should be considered first. |
The CL uses unsafe to do an 8-byte load instead of 8 1-byte loads. It has to be amd64-only because of unaligned loads (not valid on some other 64-bit systems). If we use encoding/binary to do the load, doesn't the compiler handle that optimization for us? While unsafe would be fine here (because assembly would be fine here, given at least a 2X speedup), it may be that unsafe is not necessary and we can still get the speedup, because the compiler knows about the functions in encoding/binary and emits the best instructions it knows how, and in particular it knows how to get an 8-byte load from a byte slice. |
Decision: try to do without unsafe first. |
I have send a CL about it |
Change https://golang.org/cl/51850 mentions this issue: |
What version of Go are you using (
go version
)?go version go1.7 darwin/amd64
What operating system and processor architecture are you using (
go env
)?GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/dev/gocode"
GORACE=""
GOROOT="/opt/local/lib/go"
GOTOOLDIR="/opt/local/lib/go/pkg/tool/darwin_amd64"
CC="/usr/bin/clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -gno-record-gcc-switches -fno-common"
CXX="clang++"
CGO_ENABLED="1"
What did you do?
$ cd /opt/local/lib/go/src/hash/adler32
$ go test -bench .
BenchmarkAdler32KB-8 3000000 519 ns/op 1971.24 MB/s
PASS
What did you expect to see?
64bit version written in C easily delivers >4GB/s on this machine
Solution
Use 64bit arithmetics via unsafe.Pointer casting.
NOTE: usage of C style indexing instead of slices in the inner loop also gives a small ~5-10% boost.
Result
~1.8x improvement if implemented with 64bit arithmetics.
$ go test -bench .
BenchmarkAdler32KB-8 5000000 289 ns/op 3531.85 MB/s
PASS
The text was updated successfully, but these errors were encountered: