runtime: memmove sometimes faster than memclrNoHeapPointers #23306

alandonovan · 2018-01-02T15:51:43Z

Memory allocation using make([]int, K) is surprisingly slow compared to append(nil, ...), even though append does strictly more work, such as copying.

$ cat a_test.go
package main

import "testing"

const K = 1e6
var escape []int

func BenchmarkMake(b *testing.B) {
	for i := 0; i < b.N; i++ {
		escape = make([]int, K)
	}
}

var empty [K]int

func BenchmarkAppend(b *testing.B) {
	for i := 0; i < b.N; i++ {
		escape = append([]int(nil), empty[:]...)
	}
}

$ go version
go version devel +6317adeed7 Tue Jan 2 13:39:20 2018 +0000 linux/amd64

$ go test -bench=. a_test.go
BenchmarkAppend-12    	    1000	   1208800 ns/op
BenchmarkMake-12      	    1000	   1473106 ns/op

While reporting this issue, I initially used an older runtime from December 18 in which the effect was much stronger: 10x-20x slowdown. But that seems to have been fixed.

Curiously, this issue is the exact opposite of the problem reported in #14718 (now closed).
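To compare the ns/op figures above with throughput numbers, they can be converted to GB/s. The following is only a rough sketch: it assumes an 8-byte int (linux/amd64, as in the go version output) and counts only the 8 MB written to the destination, not the bytes append also reads. The helper name gbPerSec is made up for illustration.

```go
package main

import "fmt"

// gbPerSec converts a benchmark result into throughput in GB/s
// (1 GB = 1e9 bytes), given the bytes written per operation.
func gbPerSec(bytesPerOp, nsPerOp float64) float64 {
	return bytesPerOp / nsPerOp // bytes per nanosecond equals GB/s
}

func main() {
	const k = 1e6     // slice length K from the report
	const intSize = 8 // assumption: 8-byte int on linux/amd64
	bytes := float64(k * intSize)

	fmt.Printf("append: %.2f GB/s\n", gbPerSec(bytes, 1208800))
	fmt.Printf("make:   %.2f GB/s\n", gbPerSec(bytes, 1473106))
	fmt.Printf("make/append time ratio: %.2f\n", 1473106.0/1208800.0)
}
```

With the reported timings this works out to roughly 6.6 GB/s for append and 5.4 GB/s for make, so make takes about 1.22x as long per operation.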

bcmills · 2018-01-02T17:25:50Z

> append does strictly more work, such as copying.

append has to copy, but make has to zero, and either of those operations may be hardware-accelerated. It's not obvious that either is strictly more work than the other.

Are you sure that the escape analysis is working as you expect? Since the escape variable is package-local the compiler could reasonably see through it (and hoist the allocations out of either or both of those loops).
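One way to check whether the allocations are hoisted out of the loops is to count heap allocations per iteration. This is a minimal sketch, not a definitive test: it reuses K, escape and empty from the report, while iters, makeLoop, appendLoop and allocsPerIteration are names invented here. It relies on testing.AllocsPerRun, which can be called from an ordinary program.

```go
package main

import (
	"fmt"
	"testing"
)

const K = 1e6

var (
	escape []int  // package-level sink, as in the report
	empty  [K]int // zeroed source for the append variant
)

const iters = 4 // loop iterations per measured run (illustrative)

// makeLoop mirrors the body of BenchmarkMake for a fixed iteration count.
func makeLoop() {
	for i := 0; i < iters; i++ {
		escape = make([]int, K)
	}
}

// appendLoop mirrors the body of BenchmarkAppend.
func appendLoop() {
	for i := 0; i < iters; i++ {
		escape = append([]int(nil), empty[:]...)
	}
}

// allocsPerIteration reports the average number of heap allocations
// per loop iteration, as measured by testing.AllocsPerRun.
func allocsPerIteration(loop func()) float64 {
	return testing.AllocsPerRun(5, loop) / iters
}

func main() {
	fmt.Println("make allocs/iter:  ", allocsPerIteration(makeLoop))
	fmt.Println("append allocs/iter:", allocsPerIteration(appendLoop))
}
```

A result of at least one allocation per iteration would suggest the allocation still happens inside the loop; a value below one would point to hoisting.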

mdempsky · 2018-01-02T22:25:25Z

Here's a benchmark of the underlying memory copying/clearing primitives (you'll need to put this in its own package directory, along with an empty .s file to work around #23311):

package main

import (
	"testing"
	"unsafe"
)

// Link directly to the runtime's clearing and copying primitives.

//go:linkname memclrNoHeapPointers runtime.memclrNoHeapPointers
func memclrNoHeapPointers(ptr unsafe.Pointer, n uintptr)

//go:linkname memmove runtime.memmove
func memmove(to, from unsafe.Pointer, n uintptr)

const K = 6e5

var a1, a2 [K]int

// BenchmarkMemclr zeroes a1 in place.
func BenchmarkMemclr(b *testing.B) {
	for i := 0; i < b.N; i++ {
		memclrNoHeapPointers(unsafe.Pointer(&a1), unsafe.Sizeof(a1))
	}
}

// BenchmarkMemmove copies a2 into a1.
func BenchmarkMemmove(b *testing.B) {
	for i := 0; i < b.N; i++ {
		memmove(unsafe.Pointer(&a1), unsafe.Pointer(&a2), unsafe.Sizeof(a1))
	}
}

On my laptop, the relative performance seems very sensitive to the exact value of K. For example, at K=6e5, I get:

BenchmarkMemclr-4           5000            322261 ns/op
BenchmarkMemmove-4          5000            305383 ns/op

But at K=1e7, I get:

BenchmarkMemclr-4            300           4485500 ns/op
BenchmarkMemmove-4           300           5060492 ns/op
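To see how the comparison moves with size without linking to runtime internals, a sweep over several sizes can be sketched with plain Go stand-ins. This is an approximation under stated assumptions: the range-and-zero loop is expected to be lowered by the gc compiler to a memclr call, and copy is expected to go through memmove, so the numbers include a little call overhead. The names clearInts, benchClear and benchCopy are hypothetical, and only 600000 comes from the comment above; the other sizes are illustrative.

```go
package main

import (
	"fmt"
	"strconv"
	"testing"
)

const intBytes = strconv.IntSize / 8 // size of int on this platform

// clearInts zeroes s. The gc compiler recognizes this loop shape and is
// expected to lower it to a runtime memclr call.
func clearInts(s []int) {
	for i := range s {
		s[i] = 0
	}
}

// benchClear measures clearing a buffer of n ints.
func benchClear(n int) testing.BenchmarkResult {
	buf := make([]int, n)
	return testing.Benchmark(func(b *testing.B) {
		b.SetBytes(int64(n) * intBytes)
		for i := 0; i < b.N; i++ {
			clearInts(buf)
		}
	})
}

// benchCopy measures copying n ints from one buffer to another.
func benchCopy(n int) testing.BenchmarkResult {
	dst, src := make([]int, n), make([]int, n)
	return testing.Benchmark(func(b *testing.B) {
		b.SetBytes(int64(n) * intBytes)
		for i := 0; i < b.N; i++ {
			copy(dst, src)
		}
	})
}

func main() {
	// 600000 is K from the comment above; the other sizes are illustrative.
	for _, n := range []int{1 << 13, 600000, 1 << 22} {
		fmt.Printf("n=%-8d clear: %v\n", n, benchClear(n))
		fmt.Printf("n=%-8d copy:  %v\n", n, benchCopy(n))
	}
}
```

Because b.SetBytes is set, each result line also reports MB/s, which makes results at different sizes easier to compare.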

josharian · 2018-01-11T22:13:02Z

Probably unrelated, but this reminds me of 4k aliasing: https://lemire.me/blog/2018/01/04/dont-make-it-appear-like-you-are-reading-your-own-recent-writes/

TocarIP · 2018-02-28T19:25:18Z

For the original benchmark, memmove and memclr use different strategies: memmove switches to non-temporal movs, while memclr uses regular movs. Changing the non-temporal mov threshold in memmove to match memclr makes append slower (old vs. new):

Make-6    1.58ms ± 1%  1.58ms ± 1%     ~     (p=0.912 n=10+10)
Append-6  1.36ms ± 1%  1.89ms ± 1%  +39.07%  (p=0.000 n=10+10)

For the memmove tests from the runtime, switching to regular movs makes the benchmarks slower at larger sizes:

Memmove/65536-6                 14.9GB/s ± 0%  14.9GB/s ± 0%   +0.16%  (p=0.028 n=9+10)
Memmove/1048576-6               8.67GB/s ± 1%  8.26GB/s ± 2%   -4.80%  (p=0.000 n=10+10)
Memmove/4194304-6               8.51GB/s ± 2%  8.20GB/s ± 3%   -3.74%  (p=0.000 n=10+10)
Memmove/8388608-6               8.55GB/s ± 2%  6.31GB/s ± 4%  -26.28%  (p=0.000 n=10+10)
Memmove/16777216-6              7.92GB/s ± 1%  4.33GB/s ± 2%  -45.30%  (p=0.000 n=10+10)
Memmove/67108864-6              6.56GB/s ± 2%  6.59GB/s ± 1%     ~     (p=0.315 n=10+9)

MemmoveUnalignedDst/65536-6     14.5GB/s ± 1%  14.5GB/s ± 0%     ~     (p=1.000 n=10+7)
MemmoveUnalignedDst/1048576-6   8.70GB/s ± 2%  8.14GB/s ± 1%   -6.48%  (p=0.000 n=10+9)
MemmoveUnalignedDst/4194304-6   8.64GB/s ± 2%  8.13GB/s ± 2%   -5.92%  (p=0.000 n=10+10)
MemmoveUnalignedDst/8388608-6   8.55GB/s ± 3%  6.24GB/s ± 3%  -27.00%  (p=0.000 n=10+10)
MemmoveUnalignedDst/16777216-6  7.93GB/s ± 3%  4.36GB/s ± 1%  -45.08%  (p=0.000 n=10+9)
MemmoveUnalignedDst/67108864-6  6.66GB/s ± 1%  6.76GB/s ± 2%   +1.49%  (p=0.000 n=9+10)

MemmoveUnalignedSrc/65536-6     14.5GB/s ± 1%  14.5GB/s ± 1%     ~     (p=0.796 n=10+10)
MemmoveUnalignedSrc/1048576-6   8.57GB/s ± 1%  8.20GB/s ± 2%   -4.29%  (p=0.000 n=9+10)
MemmoveUnalignedSrc/4194304-6   8.54GB/s ± 2%  8.19GB/s ± 2%   -4.18%  (p=0.000 n=10+10)
MemmoveUnalignedSrc/8388608-6   8.53GB/s ± 2%  6.25GB/s ± 4%  -26.66%  (p=0.000 n=10+10)
MemmoveUnalignedSrc/16777216-6  8.02GB/s ± 2%  4.36GB/s ± 2%  -45.67%  (p=0.000 n=10+10)
MemmoveUnalignedSrc/67108864-6  6.73GB/s ± 2%  6.82GB/s ± 2%   +1.32%  (p=0.035 n=10+10)

go101 · 2020-09-06T14:09:57Z

It looks like this problem has been solved in the Go 1.15 toolchain.

go101 · 2020-09-06T14:19:39Z

Sorry, I mean that make+copy is specially optimized in Go 1.15, so it is more efficient than a single make (and also more efficient than append in any case). A single make call is still not optimized.
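For reference, the two cloning forms being compared look roughly like this. It is a sketch with made-up helper names (cloneMakeCopy, cloneAppend); whether the compiler applies the make+copy optimization to a given piece of code depends on the toolchain version and on the exact shape of the code, so treat any performance expectation as an assumption to benchmark.

```go
package main

import "fmt"

// cloneMakeCopy uses the make+copy pattern that the comment above says
// is specially optimized in Go 1.15.
func cloneMakeCopy(src []int) []int {
	dst := make([]int, len(src))
	copy(dst, src)
	return dst
}

// cloneAppend uses the append form from the original report.
// The result may have spare capacity beyond len(src).
func cloneAppend(src []int) []int {
	return append([]int(nil), src...)
}

func main() {
	src := []int{1, 2, 3, 4}
	a, b := cloneMakeCopy(src), cloneAppend(src)
	src[0] = 99 // mutating the source must not affect either copy
	fmt.Println(a, b)
}
```

One behavioral difference to keep in mind: for an empty source, cloneAppend returns a nil slice, while cloneMakeCopy returns an empty non-nil slice.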

mdempsky changed the title ~~runtime: allocation using make is 40% slower than append(nil, ...)~~ runtime: memmove sometimes faster than memclrNoHeapPointers Jan 2, 2018

martisch mentioned this issue Feb 18, 2018

proposal: the make function needs to be optimized for large slices #23906

Closed

josharian mentioned this issue Feb 19, 2018

runtime: handle 3 and 4 bytes separately in memclrNoHeapPointers? #23930

Closed

ianlancetaylor added the NeedsInvestigation and Performance labels Mar 28, 2018

ianlancetaylor added this to the Go1.11 milestone Mar 28, 2018

bradfitz modified the milestones: Go1.11, Unplanned May 18, 2018

TocarIP mentioned this issue Jul 6, 2018

cmd/compile: optimize slice copy via make+copy #26252

Closed

go101 mentioned this issue Aug 1, 2018

runtime: append is slower than make+assignments for small slices with known length #26734

Open

gopherbot added the compiler/runtime label Jul 7, 2022
