runtime: memmove 1.6x slower than "REP; MOVSB" on amd64 #14630
Comments
I don't understand where the performance difference comes from. runtime.memmove uses REP; MOVSQ for 2049+ bytes. Is it MOVSB vs. MOVSQ?
I don't see much difference between MOVSB and MOVSQ. MOVSQ: MOVSB:
Alignment is the issue. When the data isn't 8-byte aligned, MOVSQ is slower. MOVSB: MOVSQ: Tocar, do you think we should always use REP; MOVSB then?
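For anyone wanting to reproduce the alignment effect without writing assembly, here is a minimal sketch: a pure-Go microbenchmark (the helper name copyMBps is mine) that times copy(), which for byte slices compiles down to runtime.memmove, at an aligned offset and at a deliberately misaligned one.

```go
package main

import (
	"fmt"
	"testing"
)

// copyMBps measures copy() throughput in MB/s for 32 KiB moves starting
// at the given byte offset into the buffers. Offset 0 gives 8-byte-aligned
// moves; offset 1 forces misaligned ones.
func copyMBps(offset int) float64 {
	const size = 32 << 10
	srcBuf := make([]byte, size+8)
	dstBuf := make([]byte, size+8)
	src := srcBuf[offset : offset+size]
	dst := dstBuf[offset : offset+size]
	r := testing.Benchmark(func(b *testing.B) {
		b.SetBytes(size)
		for i := 0; i < b.N; i++ {
			copy(dst, src) // lowers to runtime.memmove
		}
	})
	return float64(int64(r.N)*size) / r.T.Seconds() / 1e6
}

func main() {
	fmt.Printf("aligned:    %.0f MB/s\n", copyMBps(0))
	fmt.Printf("misaligned: %.0f MB/s\n", copyMBps(1))
}
```

The absolute numbers (and whether misaligned is slower) depend entirely on the CPU, which is the point of this thread.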
FWIW, the very same two-line patch performs 5x worse instead of 1.6x better, on my home computer "Intel(R) Core(TM) i5-2415M CPU @ 2.30GHz":
The equivalent "BM_UFlat/2" C++ snappy benchmark on my home computer claims 16.2 GB/s, compared to 11.8 and 2.4 above, so it may still be possible to optimize runtime.memmove on this hardware.
@nigeltao
@dvnagorny works on memmove performance on amd64, so he should know better.
Comparing MOVSB vs MOVSQ, I'd say that MOVSQ can be 5-7% better on Sandy Bridge for aligned data. I also think that the memmove implementation can distinguish the aligned and unaligned cases. At the same time, Sandy Bridge is quite an old architecture, and for Haswell and newer archs REP MOVS isn't the best solution. So I'm going to return with an AVX-based implementation for modern CPUs.
Just posted a simple patch to use MOVSB on unaligned moves. @dvnagorny if you could do a review that would be great. |
CL https://golang.org/cl/20293 mentions this issue. |
I don't know where we draw the line with older hardware, but 6a33f77 "runtime: use MOVSB instead of MOVSQ etc" is a regression on my 2011ish-era "Intel(R) Core(TM) i5-2415M CPU @ 2.30GHz" home computer that I mentioned above (aka Sandy Bridge, IIUC):

benchmark  old MB/s  new MB/s  speedup

@martisch's comment above says that "REP; MOVSB could be a lot better with Ivy Bridge and later (Core i >=3xxx)" but my CPU isn't one of those.
Yuck. So how do we detect which chips should use REP MOVSB and which chips should use REP MOVSQ? We're doing a large copy by the time we need to make the decision, so we have the time to think about it. But I have no idea what the predicate should be. @dvnagorny, any ideas? In general I'm not too worked up about lower performance on older chips, but this is a very large performance drop to leave unfixed.
IIRC: I will try to find a newer AMD CPU because I suspect AMD in general could also see a performance drop. The "Advanced Micro Devices Software Optimization Guide" also has a lot of advice on the topic of REP MOVS. The Intel manual also says this about enhanced REP MOVSB:
If it is just the ERMS ("Enhanced REP MOVSB/STOSB") CPUID feature bit (leaf EAX=7, EBX bit 9), that would be easy.
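A rough sketch of that check (the bit position is from the Intel SDM; hasERMS is a hypothetical helper, and real code would obtain EBX via assembly, or simply use golang.org/x/sys/cpu, which exposes this as cpu.X86.HasERMS):

```go
package main

import "fmt"

// hasERMS reports whether the ERMS ("Enhanced REP MOVSB/STOSB") bit is
// set in the EBX value returned by CPUID leaf 7, sub-leaf 0; it lives at
// EBX bit 9. Executing CPUID itself requires assembly, so this sketch
// only decodes the bit from a caller-supplied EBX value.
func hasERMS(ebx uint32) bool {
	return ebx&(1<<9) != 0
}

func main() {
	fmt.Println(hasERMS(1 << 9)) // true: ERMS bit set
	fmt.Println(hasERMS(0))      // false: bit clear
}
```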
CL https://golang.org/cl/21300 mentions this issue. |
Only use REP;MOVSB if: 1) The CPUID flag says it is fast, and 2) The pointers are unaligned Otherwise, use REP;MOVSQ. Update #14630 Change-Id: I946b28b87880c08e5eed1ce2945016466c89db66 Reviewed-on: https://go-review.googlesource.com/21300 Reviewed-by: Nigel Tao <nigeltao@golang.org>
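The rule in that commit message can be sketched in plain Go (useRepMovsb is a hypothetical name for illustration; the actual logic lives in the runtime's amd64 assembly):

```go
package main

import "fmt"

// useRepMovsb sketches the dispatch rule from CL 21300 for large copies:
// use REP;MOVSB only when the CPU advertises ERMS and either pointer is
// not 8-byte aligned; otherwise use REP;MOVSQ.
func useRepMovsb(hasERMS bool, dst, src uintptr) bool {
	unaligned := dst%8 != 0 || src%8 != 0
	return hasERMS && unaligned
}

func main() {
	fmt.Println(useRepMovsb(true, 0x1001, 0x2000))  // true: ERMS + unaligned -> MOVSB
	fmt.Println(useRepMovsb(true, 0x1000, 0x2000))  // false: aligned -> MOVSQ
	fmt.Println(useRepMovsb(false, 0x1001, 0x2000)) // false: no ERMS -> MOVSQ
}
```

This keeps MOVSQ on chips like Sandy Bridge that lack ERMS, which is what fixes the regression reported above.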
My 2011ish-era "Intel(R) Core(TM) i5-2415M CPU @ 2.30GHz" no longer regresses on BenchmarkMemmoveUnaligned.*4096. |
Snappy is a simple compression format that, for incompressible input such as a JPEG file, doesn't do much more than call runtime·memmove.
https://github.com/golang/snappy/blob/master/decode_amd64.s has a chunk of code (grep for callMemmove) that essentially does some bounds checking, and then
spill registers
CALL runtime·memmove(DI, SI, CX)
unspill registers
ADDQ CX, DI
ADDQ CX, SI
Replacing this with "REP; MOVSB" should be equivalent, but turns out to be 1.6x faster on one benchmark.
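In Go terms, the spill/CALL/ADDQ sequence above amounts to roughly the following sketch (emitCopy, d, s, and length are hypothetical names standing in for the DI, SI, and CX registers); copy() of byte slices lowers to runtime.memmove:

```go
package main

import "fmt"

// emitCopy copies length bytes from src[s:] to dst[d:] and returns the
// advanced offsets, mirroring the memmove call plus the two ADDQs.
func emitCopy(dst, src []byte, d, s, length int) (int, int) {
	copy(dst[d:d+length], src[s:s+length]) // the runtime·memmove call
	return d + length, s + length          // ADDQ CX, DI / ADDQ CX, SI
}

func main() {
	src := []byte("incompressible literal bytes")
	dst := make([]byte, len(src))
	d, s := emitCopy(dst, src, 0, 0, len(src))
	fmt.Println(d, s, string(dst)) // 28 28 incompressible literal bytes
}
```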
Old:
Benchmark_UFlat2-8 200000 8154 ns/op 15094.70 MB/s
New:
Benchmark_UFlat2-8 300000 4957 ns/op 24831.90 MB/s
The UFlat2 benchmark is indeed decoding a 123,093 byte JPEG file, and the bulk of the time in one decoding is spent in two memmove calls, of length 65218 and 57557. The first one is slightly less than 65536 because the JPEG header is slightly compressible, and is not encoded solely as a literal src-to-dst copy.
I'm happy to try to extract the UFlat2 snappy benchmark as a separate, minimal repro case, if that would help.
Long story short, it seems that if "REP; MOVSB" is significantly faster than runtime·memmove in some cases, then runtime·memmove should just (conditionally) issue "REP; MOVSB". I don't have a good sense what that condition should be though.
This is admittedly on a relatively brawny amd64 machine (/proc/cpuinfo attached), "Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz". I suspect that the same patch actually performs worse on my relatively wimpy home computer (also amd64). I'll update the issue with those numbers later.