
runtime: memmove 1.6x slower than "REP; MOVSB" on amd64 #14630

Closed
nigeltao opened this issue Mar 3, 2016 · 15 comments

@nigeltao
Contributor

nigeltao commented Mar 3, 2016

Snappy is a simple compression format that, for incompressible input such as a JPEG file, doesn't do much more than call runtime·memmove.

https://github.com/golang/snappy/blob/master/decode_amd64.s has a chunk of code (grep for callMemmove) that essentially does some bounds checking, and then

spill registers
CALL runtime·memmove(DI, SI, CX)
unspill registers
ADDQ CX, DI
ADDQ CX, SI

Replacing this with "REP; MOVSB" should be equivalent, but turns out to be 1.6x faster on one benchmark.

$ git diff
diff --git a/decode_amd64.s b/decode_amd64.s
index 1486aba..651519f 100644
--- a/decode_amd64.s
+++ b/decode_amd64.s
@@ -141,6 +141,9 @@ callMemmove:
        CMPQ CX, BX
        JGT  errCorrupt

+       REP; MOVSB
+       JMP loop
+
        // copy(dst[d:], src[s:s+length])
        //
        // This means calling runtime·memmove(&dst[d], &src[s], length), so we push

Old:
Benchmark_UFlat2-8 200000 8154 ns/op 15094.70 MB/s
New:
Benchmark_UFlat2-8 300000 4957 ns/op 24831.90 MB/s

The UFlat2 benchmark is indeed decoding a 123,093 byte JPEG file, and the bulk of the time in one decoding is spent in two memmove calls, of length 65218 and 57557. The first one is slightly less than 65536 because the JPEG header is slightly compressible, and is not encoded solely as a literal src-to-dst copy.

I'm happy to try to extract the UFlat2 snappy benchmark as a separate, minimal repro case, if that would help.

Long story short, it seems that if "REP; MOVSB" is significantly faster than runtime·memmove in some cases, then runtime·memmove should just (conditionally) issue "REP; MOVSB". I don't have a good sense of what that condition should be, though.

This is admittedly on a relatively brawny amd64 machine (/proc/cpuinfo attached), "Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz". I suspect that the same patch actually performs worse on my relatively wimpy home computer (also amd64). I'll update the issue with those numbers later.
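
As a rough starting point, a minimal stand-alone repro might look like the sketch below. Everything here is an assumption except the 65218-byte length taken from the numbers above; for byte slices, Go's copy builtin compiles down to a runtime·memmove call, so benchmarking copy benchmarks memmove:

package memmove_test

import "testing"

// BenchmarkMemmoveLiteral65218 approximates the big literal copy in the
// UFlat2 decode: one ~64KiB memmove per iteration.
func BenchmarkMemmoveLiteral65218(b *testing.B) {
	src := make([]byte, 65218)
	dst := make([]byte, 65218)
	b.SetBytes(int64(len(src)))
	for i := 0; i < b.N; i++ {
		copy(dst, src)
	}
}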

@randall77
Contributor

I don't understand where the performance difference comes from. runtime.memmove uses REP; MOVSQ for 2049+ bytes. Is it MOVSB vs. MOVSQ?

@TocarIP

@randall77
Contributor

I don't see much difference between MOVSB and MOVSQ:

MOVSQ:
BenchmarkMemmove65218-8 1000000 2283 ns/op

MOVSB:
BenchmarkMemmove65218-8 1000000 2233 ns/op

@randall77
Contributor

Alignment is the issue. When the pointers aren't aligned, MOVSQ is slower.

MOVSB:
BenchmarkMemmoveUnaligned65218-8 500000 3143 ns/op

MOVSQ:
BenchmarkMemmoveUnaligned65218-8 300000 4511 ns/op

Tocar, do you think we should always use REP; MOVSB then?
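
(For context, an unaligned variant of such a benchmark can be produced by skewing one of the slices by a byte; a sketch, with the buffer size from above and the one-byte skew assumed:)

package memmove_test

import "testing"

// BenchmarkMemmoveUnaligned65218 copies from a source address that is
// off 8-byte alignment by one, which forces the copy off the aligned
// MOVSQ fast path.
func BenchmarkMemmoveUnaligned65218(b *testing.B) {
	src := make([]byte, 65218+1)
	dst := make([]byte, 65218)
	b.SetBytes(int64(len(dst)))
	for i := 0; i < b.N; i++ {
		copy(dst, src[1:])
	}
}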

@nigeltao
Contributor Author

nigeltao commented Mar 4, 2016

FWIW, the very same two-line patch performs 5x worse instead of 1.6x better on my home computer "Intel(R) Core(TM) i5-2415M CPU @ 2.30GHz":

benchmark                     old MB/s     new MB/s     speedup
Benchmark_UFlat2-4            11765.94     2436.29      0.21x

(/proc/cpuinfo attached)

The equivalent "BM_UFlat/2" C++ snappy benchmark on my home computer claims 16.2GB/s, compared to 11.8 and 2.4 above, so it may still be possible to optimize runtime.memmove on this hardware.

@martisch
Contributor

martisch commented Mar 4, 2016

@nigeltao
REP; MOVSB could be a lot better with Ivy Bridge and later (Core i >=3xxx) as the "Intel® 64 and IA-32 Architectures Optimization Reference Manual" 3-66 notes:

3.7.6 Enhanced REP MOVSB and STOSB operation (ERMSB)

Beginning with processors based on Intel microarchitecture code name Ivy Bridge, REP string operation using MOVSB and STOSB can provide both flexible and high-performance REP string operations for software in common situations like memory copy and set operations. ...

@TocarIP
Contributor

TocarIP commented Mar 4, 2016

@dvnagorny works on memmove performance on amd64, so he would know best.

@dvnagorny
Contributor

Comparing MOVSB vs. MOVSQ, I'd say that MOVSQ can be 5-7% faster on Sandy Bridge for aligned data. I also think the memmove implementation can distinguish the aligned and unaligned cases. At the same time, Sandy Bridge is quite an old architecture, and for Haswell and newer archs REP MOVS isn't the best solution, so I'm going to come back with an AVX-based implementation for modern CPUs.

@randall77
Contributor

Just posted a simple patch to use MOVSB on unaligned moves. @dvnagorny, if you could do a review, that would be great.

@gopherbot

CL https://golang.org/cl/20293 mentions this issue.

@nigeltao
Contributor Author

I don't know where we draw the line with older hardware, but 6a33f77 "runtime: use MOVSB instead of MOVSQ etc" is a regression on my 2011ish-era "Intel(R) Core(TM) i5-2415M CPU @ 2.30GHz" home computer that I mentioned above (aka Sandy Bridge, IIUC):

benchmark                            old MB/s   new MB/s   speedup
BenchmarkMemmove4096-4               34205.06   34079.04   1.00x
BenchmarkMemmoveUnalignedDst4096-4    8371.81    2709.00   0.32x
BenchmarkMemmoveUnalignedSrc4096-4   18481.84    2760.01   0.15x

@martisch's comment above says that "REP; MOVSB could be a lot better with Ivy Bridge and later (Core i >=3xxx)" but my CPU isn't one of those.

@randall77
Contributor

Yuck. So how do we detect which chips should use REP MOVSB and which chips should use REP MOVSQ? We're doing a large copy by the time we need to make the decision, so we have the time to think about it. But I have no idea what the predicate should be. @dvnagorny , any ideas?

In general I'm not too worked up about lower performance on older chips, but this is a very large performance drop to leave unfixed.

@martisch
Contributor

IIRC:

    grep "rep_good" /proc/cpuinfo

might be a way on Linux to see if REP MOV is fast.

I will try to find a newer AMD CPU, because I suspect AMD in general could also see a performance drop. The "Advanced Micro Devices Software Optimization Guide" also has a lot of advice on the topic of REP MOV.

The Intel manual also says this about enhanced REP MOVSB (Table 3-4):

Memcpy length                 <128    128 to 2048    2048 to 4096
Memcpy_ERMSB/Memcpy_AVX128    0.7X    1X             1.02X

Table 3-4 shows the relative performance of the Memcpy function implemented using enhanced REP MOVSB versus 128-bit AVX for several ranges of memcpy lengths, when both the source and destination addresses are 16-byte aligned and the source region and destination region do not overlap. For memcpy length less than 128 bytes, using ERMSB is slower than what’s possible using 128-bit AVX, due to internal start-up overhead in the REP string.

@randall77
Contributor

If it is just the erms "Enhanced REP MOVSB/STOSB" CPUID EAX=7 bit, that would be easy.
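
From user code, that bit can be read without writing any assembly; a sketch using the golang.org/x/sys/cpu package (assuming that package is acceptable here; inside the runtime the check would read CPUID directly at startup):

package main

import (
	"fmt"

	"golang.org/x/sys/cpu"
)

func main() {
	// HasERMS reports the "Enhanced REP MOVSB/STOSB" feature
	// (the CPUID EAX=7, EBX bit 9 flag mentioned above).
	fmt.Println("ERMS:", cpu.X86.HasERMS)
}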

@gopherbot

CL https://golang.org/cl/21300 mentions this issue.

gopherbot pushed a commit that referenced this issue Mar 31, 2016
Only use REP;MOVSB if:
 1) The CPUID flag says it is fast, and
 2) The pointers are unaligned
Otherwise, use REP;MOVSQ.

Update #14630

Change-Id: I946b28b87880c08e5eed1ce2945016466c89db66
Reviewed-on: https://go-review.googlesource.com/21300
Reviewed-by: Nigel Tao <nigeltao@golang.org>
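
In Go-flavored pseudocode, the selection logic from that commit message is roughly the following (a sketch with made-up names; the real implementation is the amd64 assembly in the CL above, and only applies to large copies):

package main

import "fmt"

// chooseCopyLoop sketches the large-copy decision from the commit above:
// REP;MOVSB only when the CPU advertises ERMS and the pointers are
// unaligned; REP;MOVSQ otherwise.
func chooseCopyLoop(dst, src uintptr, hasERMS bool) string {
	unaligned := (dst|src)&7 != 0
	if hasERMS && unaligned {
		return "REP; MOVSB"
	}
	return "REP; MOVSQ"
}

func main() {
	fmt.Println(chooseCopyLoop(0x1001, 0x2000, true)) // unaligned + ERMS -> REP; MOVSB
	fmt.Println(chooseCopyLoop(0x1000, 0x2000, true)) // aligned -> REP; MOVSQ
}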
@nigeltao
Contributor Author

My 2011ish-era "Intel(R) Core(TM) i5-2415M CPU @ 2.30GHz" no longer regresses on BenchmarkMemmoveUnaligned.*4096.

golang locked and limited conversation to collaborators Mar 31, 2017