runtime: support SSE2 memmove on older amd64 hardware #38512
Labels: compiler/runtime, NeedsInvestigation, Performance
Background
In pursuit of some database work I've been doing, I encountered a strange corner of Go. Inserting into packed segments of memory can be expensive due to the special backwards runtime.memmove call required (e.g. copy(x[n:], x); copy(x[:n], newdata)).
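To make the pattern concrete, here is a minimal sketch of that insertion idiom (the helper name `prepend` is mine, not from the issue): the first copy overlaps with dst > src, which is exactly the backwards runtime.memmove path in question.

```go
package main

import "fmt"

// prepend writes newdata at the front of x by first shifting the
// existing contents right by len(newdata). The first copy is an
// overlapping move with dst > src, so the runtime takes the
// backwards memmove path; the second copy is a plain forward copy.
func prepend(x []byte, newdata []byte) {
	n := len(newdata)
	copy(x[n:], x)       // backward, overlapping memmove
	copy(x[:n], newdata) // forward, non-overlapping
}

func main() {
	x := make([]byte, 8)
	copy(x, "abcdefgh")
	prepend(x, []byte("XY"))
	fmt.Println(string(x)) // prints "XYabcdef"
}
```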
When compared to a similar C++/glibc implementation of the same operation, the glibc one was about twice as fast as the Go implementation on my hardware (a Xeon E5 v1, Sandy Bridge).
It turns out that while the Xeon E5 v1s have AVX instructions, they have a slower per-cycle data path for them. That explains why Go 1.14 has the following restriction for the Intel "Bridge" families:
https://github.com/golang/go/blob/go1.14.2/src/runtime/cpuflags_amd64.go#L23
This hardware limitation persisted until the Haswell Xeons (E5 v3), which happily (and quickly) use AVX.
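The check linked above boils down to matching CPUID processor-version signatures. Here is a standalone sketch, paraphrased from memory from runtime/cpuflags_amd64.go in Go 1.14 (the function name and the test values in main are mine): the mask drops the stepping and reserved fields before comparing family/model signatures.

```go
package main

import "fmt"

// isBridgeFamily reports whether a CPUID leaf-01H processor-version
// signature belongs to one of the Intel "Bridge" families that Go's
// runtime excludes from the AVX memmove path. Paraphrased from the
// check in runtime/cpuflags_amd64.go (Go 1.14).
func isBridgeFamily(processorVersionInfo uint32) bool {
	// Mask off the stepping (bits 0-3) and reserved fields so only
	// the family/model signature remains.
	processor := processorVersionInfo &^ (0x3F00000 | 0xF)
	switch processor {
	case 0x206A0, // Sandy Bridge
		0x206D0, // Sandy Bridge EP (e.g. Xeon E5 v1)
		0x306A0, // Ivy Bridge
		0x306E0: // Ivy Bridge EP
		return true
	}
	return false
}

func main() {
	fmt.Println(isBridgeFamily(0x206D7)) // Sandy Bridge EP, stepping 7: true
	fmt.Println(isBridgeFamily(0x306C3)) // Haswell: false
}
```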
glibc concurs: in a perf report of the C++ insertion logic, it uses GNU's __memcpy_sse2_unaligned. Which is interesting, because Go has no equivalent.
gccgo links against glibc and gets the same performance as C++/glibc.

Proposal
Add an SSE2-optimized path for runtime.memmove, at least for backwards copying. This would only affect/benefit older hardware (roughly, Xeons in [Nehalem, Haswell)); newer systems wouldn't notice at all.

I went ahead and implemented it, but the README said to file an issue for inclusion first, so here we are (my first issue!)
Measurements
I wrote a test package and harness to try a bunch of copy methods. Using SSE2 for the forward path as well didn't gain much over the baseline currently in Go 1.14, but for the backward path it was substantially faster.
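The harness itself isn't reproduced in this issue, but a minimal version of the forward-vs-backward comparison can be sketched like this (sizes and the 64-byte offset are my assumptions, not the issue's actual parameters):

```go
package main

import (
	"fmt"
	"testing"
)

const size = 1 << 20 // 1 MiB working buffer (arbitrary choice)

func main() {
	buf := make([]byte, size)

	// Forward path: dst precedes src, runtime copies front-to-back.
	fwd := testing.Benchmark(func(b *testing.B) {
		b.SetBytes(size - 64)
		for i := 0; i < b.N; i++ {
			copy(buf[:size-64], buf[64:])
		}
	})

	// Backward path: dst follows src and overlaps it, so the runtime
	// must copy back-to-front -- the slow case this issue targets.
	bwd := testing.Benchmark(func(b *testing.B) {
		b.SetBytes(size - 64)
		for i := 0; i < b.N; i++ {
			copy(buf[64:], buf[:size-64])
		}
	})

	fmt.Println("forward: ", fwd)
	fmt.Println("backward:", bwd)
}
```

On affected Sandy/Ivy Bridge parts, the issue reports the backward case lagging well behind the forward one.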
Forcing AVX on Sandy Bridge functioned, and varied in speed, but was slower than expected (and slower than SSE2-paths) and especially slower when the non-temporal moves got involved.
The biggest win came in the backwards path alone.
So I implemented the backwards path only on my branch of the Go runtime; here are some preliminary highlights:
I figured compiling and benchmarking other packages from the wild with my patch would be interesting, and sure enough...
Using bbolt's B-Tree write benchmark: