
runtime: memmove 1.6x slower than "REP; MOVSB" on amd64 #14630

Closed
nigeltao opened this issue Mar 3, 2016 · 15 comments

@nigeltao
Contributor

nigeltao commented Mar 3, 2016

Snappy is a simple compression format that, for incompressible input such as a JPEG file, doesn't do much more than call runtime·memmove.

https://github.com/golang/snappy/blob/master/decode_amd64.s has a chunk of code (grep for callMemmove) that essentially does some bounds checking, and then

spill registers
CALL runtime·memmove(DI, SI, CX)
unspill registers
ADDQ CX, DI
ADDQ CX, SI

Replacing this with "REP; MOVSB" should be equivalent, but turns out to be 1.6x faster on one benchmark.

$ git diff
diff --git a/decode_amd64.s b/decode_amd64.s
index 1486aba..651519f 100644
--- a/decode_amd64.s
+++ b/decode_amd64.s
@@ -141,6 +141,9 @@ callMemmove:
        CMPQ CX, BX
        JGT  errCorrupt

+       REP; MOVSB
+       JMP loop
+
        // copy(dst[d:], src[s:s+length])
        //
        // This means calling runtime·memmove(&dst[d], &src[s], length), so we push

Old:
Benchmark_UFlat2-8 200000 8154 ns/op 15094.70 MB/s
New:
Benchmark_UFlat2-8 300000 4957 ns/op 24831.90 MB/s

The UFlat2 benchmark is indeed decoding a 123,093 byte JPEG file, and the bulk of the time in one decoding is spent in two memmove calls, of length 65218 and 57557. The first one is slightly less than 65536 because the JPEG header is slightly compressible, and is not encoded solely as a literal src-to-dst copy.

I'm happy to try to extract the UFlat2 snappy benchmark as a separate, minimal repro case, if that would help.

Long story short, it seems that if "REP; MOVSB" is significantly faster than runtime·memmove in some cases, then runtime·memmove should just (conditionally) issue "REP; MOVSB". I don't have a good sense of what that condition should be, though.

This is admittedly on a relatively brawny amd64 machine (/proc/cpuinfo attached), "Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz". I suspect that the same patch actually performs worse on my relatively wimpy home computer (also amd64). I'll update the issue with those numbers later.
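
As a rough starting point, a minimal stand-alone repro might look like the sketch below. Everything here is an assumption except the 65218-byte length taken from the numbers above; for byte slices, Go's copy builtin compiles down to a runtime·memmove call, so benchmarking copy benchmarks memmove:

package memmove_test

import "testing"

// BenchmarkMemmoveLiteral65218 approximates the big literal copy in the
// UFlat2 decode: one ~64KiB memmove per iteration.
func BenchmarkMemmoveLiteral65218(b *testing.B) {
	src := make([]byte, 65218)
	dst := make([]byte, 65218)
	b.SetBytes(int64(len(src)))
	for i := 0; i < b.N; i++ {
		copy(dst, src)
	}
}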

@randall77
Contributor

I don't understand where the performance difference comes from. runtime.memmove uses REP; MOVSQ for 2049+ bytes. Is it MOVSB vs. MOVSQ?

@TocarIP

@randall77
Contributor

I don't see much difference between MOVSB and MOVSQ:

MOVSQ:
BenchmarkMemmove65218-8 1000000 2283 ns/op

MOVSB:
BenchmarkMemmove65218-8 1000000 2233 ns/op

@randall77
Contributor

Alignment is the issue. When the pointers aren't aligned, MOVSQ is slower.

MOVSB:
BenchmarkMemmoveUnaligned65218-8 500000 3143 ns/op

MOVSQ:
BenchmarkMemmoveUnaligned65218-8 300000 4511 ns/op

Tocar, do you think we should always use REP; MOVSB then?
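
(For context, an unaligned variant of such a benchmark can be produced by skewing one of the slices by a byte; a sketch, with the buffer size from above and the one-byte skew assumed:)

package memmove_test

import "testing"

// BenchmarkMemmoveUnaligned65218 copies from a source address that is
// off 8-byte alignment by one, which forces the copy off the aligned
// MOVSQ fast path.
func BenchmarkMemmoveUnaligned65218(b *testing.B) {
	src := make([]byte, 65218+1)
	dst := make([]byte, 65218)
	b.SetBytes(int64(len(dst)))
	for i := 0; i < b.N; i++ {
		copy(dst, src[1:])
	}
}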

@nigeltao
Contributor Author

nigeltao commented Mar 4, 2016

FWIW, the very same two-line patch performs 5x worse instead of 1.6x better on my home computer "Intel(R) Core(TM) i5-2415M CPU @ 2.30GHz":

benchmark                     old MB/s     new MB/s     speedup
Benchmark_UFlat2-4            11765.94     2436.29      0.21x

(/proc/cpuinfo attached)

The equivalent "BM_UFlat/2" C++ snappy benchmark on my home computer claims 16.2GB/s, compared to 11.8 and 2.4 above, so it may still be possible to optimize runtime.memmove on this hardware.

@martisch
Contributor

martisch commented Mar 4, 2016

@nigeltao
REP; MOVSB could be a lot better with Ivy Bridge and later (Core i >=3xxx) as the "Intel® 64 and IA-32 Architectures Optimization Reference Manual" 3-66 notes:

3.7.6 Enhanced REP MOVSB and STOSB operation (ERMSB)

Beginning with processors based on Intel microarchitecture code name Ivy Bridge, REP string operation using MOVSB and STOSB can provide both flexible and high-performance REP string operations for software in common situations like memory copy and set operations. ...

@TocarIP
Contributor

TocarIP commented Mar 4, 2016

@dvnagorny works on memmove performance on amd64, so he would know best.

@dvnagorny
Contributor

Comparing MOVSB vs. MOVSQ, I'd say that MOVSQ can be 5-7% faster on Sandy Bridge for aligned data. I also think the memmove implementation can distinguish the aligned and unaligned cases. At the same time, Sandy Bridge is quite an old architecture, and for Haswell and newer archs REP MOVS isn't the best solution, so I'm going to come back with an AVX-based implementation for modern CPUs.

@randall77
Contributor

Just posted a simple patch to use MOVSB on unaligned moves. @dvnagorny, if you could do a review, that would be great.

@gopherbot

CL https://golang.org/cl/20293 mentions this issue.

@nigeltao
Contributor Author

I don't know where we draw the line with older hardware, but 6a33f77 "runtime: use MOVSB instead of MOVSQ etc" is a regression on my 2011ish-era "Intel(R) Core(TM) i5-2415M CPU @ 2.30GHz" home computer that I mentioned above (aka Sandy Bridge, IIUC):

benchmark                            old MB/s   new MB/s   speedup
BenchmarkMemmove4096-4               34205.06   34079.04   1.00x
BenchmarkMemmoveUnalignedDst4096-4    8371.81    2709.00   0.32x
BenchmarkMemmoveUnalignedSrc4096-4   18481.84    2760.01   0.15x

@martisch's comment above says that "REP; MOVSB could be a lot better with Ivy Bridge and later (Core i >=3xxx)" but my CPU isn't one of those.

@randall77
Contributor

Yuck. So how do we detect which chips should use REP MOVSB and which chips should use REP MOVSQ? We're doing a large copy by the time we need to make the decision, so we have the time to think about it. But I have no idea what the predicate should be. @dvnagorny , any ideas?

In general I'm not too worked up about lower performance on older chips, but this is a very large performance drop to leave unfixed.

@martisch
Contributor

IIRC:

    grep "rep_good" /proc/cpuinfo

might be a way on Linux to see if REP MOV is fast.

I will try to find a newer AMD CPU, because I suspect AMD in general could also see a performance drop. The "Advanced Micro Devices Software Optimization Guide" also has a lot of advice on the topic of REP MOV.

The Intel manual also says this about enhanced REP MOVSB (Table 3-4):

Memcpy length                 <128    128 to 2048    2048 to 4096
Memcpy_ERMSB/Memcpy_AVX128    0.7X    1X             1.02X

Table 3-4 shows the relative performance of the Memcpy function implemented using enhanced REP MOVSB versus 128-bit AVX for several ranges of memcpy lengths, when both the source and destination addresses are 16-byte aligned and the source region and destination region do not overlap. For memcpy length less than 128 bytes, using ERMSB is slower than what’s possible using 128-bit AVX, due to internal start-up overhead in the REP string.

@randall77
Contributor

If it is just the erms "Enhanced REP MOVSB/STOSB" CPUID EAX=7 bit, that would be easy.
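
From user code, that bit can be read without writing any assembly; a sketch using the golang.org/x/sys/cpu package (assuming that package is acceptable here; inside the runtime the check would read CPUID directly at startup):

package main

import (
	"fmt"

	"golang.org/x/sys/cpu"
)

func main() {
	// HasERMS reports the "Enhanced REP MOVSB/STOSB" feature
	// (the CPUID EAX=7, EBX bit 9 flag mentioned above).
	fmt.Println("ERMS:", cpu.X86.HasERMS)
}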

@gopherbot

CL https://golang.org/cl/21300 mentions this issue.

gopherbot pushed a commit that referenced this issue Mar 31, 2016
Only use REP;MOVSB if:
 1) The CPUID flag says it is fast, and
 2) The pointers are unaligned
Otherwise, use REP;MOVSQ.

Update #14630

Change-Id: I946b28b87880c08e5eed1ce2945016466c89db66
Reviewed-on: https://go-review.googlesource.com/21300
Reviewed-by: Nigel Tao <nigeltao@golang.org>
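
In Go-flavored pseudocode, the selection logic from that commit message is roughly the following (a sketch with made-up names; the real implementation is the amd64 assembly in the CL above, and only applies to large copies):

package main

import "fmt"

// chooseCopyLoop sketches the large-copy decision from the commit above:
// REP;MOVSB only when the CPU advertises ERMS and the pointers are
// unaligned; REP;MOVSQ otherwise.
func chooseCopyLoop(dst, src uintptr, hasERMS bool) string {
	unaligned := (dst|src)&7 != 0
	if hasERMS && unaligned {
		return "REP; MOVSB"
	}
	return "REP; MOVSQ"
}

func main() {
	fmt.Println(chooseCopyLoop(0x1001, 0x2000, true)) // unaligned + ERMS -> REP; MOVSB
	fmt.Println(chooseCopyLoop(0x1000, 0x2000, true)) // aligned -> REP; MOVSQ
}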
@nigeltao
Contributor Author

My 2011ish-era "Intel(R) Core(TM) i5-2415M CPU @ 2.30GHz" no longer regresses on BenchmarkMemmoveUnaligned.*4096.

golang locked and limited conversation to collaborators Mar 31, 2017