Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: support SSE2 memmove on older amd64 hardware #38512

Open
barakmich opened this issue Apr 18, 2020 · 3 comments
Open

runtime: support SSE2 memmove on older amd64 hardware #38512

barakmich opened this issue Apr 18, 2020 · 3 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@barakmich
Copy link

Background
In pursuit of some database work I've been doing, I encountered a strange corner of Go. Inserting into packed segments of memory can be expensive due to the special, backwards runtime.memmove call required. (eg: copy(x[n:], x); copy(x[:n], newdata)

When compared to a similar C++/glibc implementation of the same operation, the glibc one was about twice as fast on my hardware (a Xeon E5 v1, Sandy Bridge) to the Go implementation.

Come to find out, that while the Xeon E5v1s have AVX instructions, they have a slower per-cycle data path. That explains why Go 1.14 has the following code restriction for Intel "Bridge families":
https://github.com/golang/go/blob/go1.14.2/src/runtime/cpuflags_amd64.go#L23

This hardware issue was present until Haswell Xeons (E5v3) which happily (and quickly) use the AVX.

glibc concurs. In a perf report of the C++ insertion logic, it's using GNU's __memcpy_sse2_unaligned.

Which is interesting -- because Go doesn't have the equivalent. gccgo will use glibc and get the same performance as C++/glibc.

Proposal
Add an SSE2-optimized path for runtime.memmove, at least when backwards copying. This would only affect/benefit older hardware (roughly, Xeons from [Nehalem, Haswell) ). Newer systems wouldn't notice at all.

I went ahead and implemented it; but the README said to file an issue for inclusion first, so here we are (my first issue!)

Measurements
I wrote a test package and harness to try a bunch of copy methods. Using SSE2 for the forward path as well didn't gain much over the baseline currently in Go 1.14, but for the backward path it was substantially faster.

Forcing AVX on Sandy Bridge functioned, and varied in speed, but was slower than expected (and slower than SSE2-paths) and especially slower when the non-temporal moves got involved.

The biggest win came in the backwards path alone.

So I implemented the backwards path only on my branch of the Go runtime and here's some preliminary highlights:

...
MemmoveUnalignedDstBackwards/256-32   20.6GB/s ±22%   19.9GB/s ±14%      ~     (p=0.753 n=11+4)
MemmoveUnalignedDstBackwards/512-32   7.22GB/s ±31%  23.39GB/s ±21%  +223.72%  (p=0.001 n=11+4)
MemmoveUnalignedDstBackwards/1024-32  11.3GB/s ±21%   28.1GB/s ±16%  +149.36%  (p=0.001 n=11+4)
MemmoveUnalignedDstBackwards/2048-32  13.6GB/s ±17%   28.5GB/s ±12%  +108.86%  (p=0.001 n=11+4)
MemmoveUnalignedDstBackwards/4096-32  16.7GB/s ±19%   35.3GB/s ±16%  +110.93%  (p=0.001 n=11+4)

I figured compiling and benchmarking other packages from the wild with my patch would be interesting, and sure enough...

Using bbolt 's B-Tree write benchmark:

$ ./bbolt-orig bench -write-mode rnd -count 100000
# Write	11.688323313s	(116.883µs/op)	(8555 op/sec)
# Read	1.002042799s	(36ns/op)	(27777777 op/sec)

$ ./bbolt-sse2 bench -write-mode rnd -count 100000
# Write	8.114154792s	(81.141µs/op)	(12324 op/sec)
# Read	1.002062417s	(36ns/op)	(27777777 op/sec)
@martisch
Copy link
Contributor

martisch commented Apr 18, 2020

@barakmich I have been looking into a new memmove implementation (https://go-review.googlesource.com/c/go/+/228820) that tries to slim down the function size and also uses sse2 (even for forward). Unfortunately I have not been able to produce a clear net win for all amd64 cpus while playing with new memmove implementations. Any specialization usally increases icache misses and runtime overhead.

We have to take into account the overall performance of sandy bridge, haswell, skylake, ryzen, ... CPUs when modifying the memmove implementation as other CPUs can regress in performance. REP MOVSQ, REP MOVSB, SSE2 vs AVX can differ in performance on different platforms and depending on alignment and size. In general there is an opportunity that by making the implemntation smaller we can save cycles everywhere by avoiding icache misses.

Do you have a link to your modified version?

As a first step just adding simple Backwards memmove benchmarks would be a nice commit.

@ALTree ALTree changed the title runtime: Support SSE2 memmove on older amd64 hardware runtime: support SSE2 memmove on older amd64 hardware Apr 18, 2020
@ALTree ALTree added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance labels Apr 18, 2020
@ALTree ALTree added this to the Unplanned milestone Apr 18, 2020
@martisch
Copy link
Contributor

martisch commented Jul 2, 2020

I made a new memmove implementation a few weeks ago at https://go-review.googlesource.com/c/go/+/228820 which also has an sse2 memmove that aligns moves for backwards copies. It needs a bit more tuning for large copies. Im currently testing around on Ice Lake.

@barakmich
Copy link
Author

@martisch I'll try it out on my array of hardware!

I ran into the same issues with the instruction cache you mentioned and got to losing ~5% on small copies (icache) and huge gains -- 200-300% -- on large backwards ones.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Projects
None yet
Development

No branches or pull requests

4 participants