cmd/gc, runtime: slow swapping of slice values #7927
The reason the idiomatic code is slower is that the temporary is bigger and can't be register-allocated.

a, b = b, a

compiles to

tmp = b
b = a
a = tmp

If tmp is one word in size, it can be allocated to a register. This happens for the unrolled code. If tmp is large, it becomes a stack variable, and there are then three mem->mem copies.

I don't see any obvious answer here. Just unrolling as the example shows seems tricky. The unrolled version is basically trading code size for some performance improvement. That is great if it is hot code, not so good if there are hundreds of these in cold code.

We could implement a swap in assembly which did not need the temporary. It's a slippery slope, though: how about a, b, c = b, c, a? We'd have to decide which cases to handle and which ones to leave as is.

The performance penalty is not that large (about 25% slower), and you don't see large-object swaps that often (you're much better off swapping pointers to those objects if you can arrange it).

Short answer: maybe. I'd like to be more convinced that it would be worth it. A real-world example, perhaps?
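For concreteness, a minimal sketch of the trade-off described above; the Big type and its size are invented for illustration and are not taken from the issue:

package main

import "fmt"

// Big is larger than one machine word, so the compiler's temporary for
// a tuple swap of Big values cannot live in a register.
type Big struct {
	data [16]int64
}

func main() {
	x, y := Big{}, Big{}
	x.data[0], y.data[0] = 1, 2

	// Idiomatic swap of the values themselves: tmp = y; y = x; x = tmp,
	// i.e. three memory-to-memory copies of 128 bytes each.
	x, y = y, x

	// Swapping pointers instead: the temporary is one word and can stay
	// in a register, so only pointer-sized values move.
	px, py := &x, &y
	px, py = py, px

	fmt.Println(x.data[0], y.data[0], px.data[0], py.data[0])
}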
The unrolling was mainly a demonstration of the speed difference, not the code that should be implemented. I was guessing a similar approach to memcopy: small things inlined, large ones call some function. The main use case would be sorting and priority queues.

I noticed this while optimizing Brad's slice package https://github.com/egonelbre/slice. Some rough measurements showed that starting at 3 int64s the unrolled swap started being faster, which corresponds to the size of a slice header. Of course the performance benefit there is quite small, only a few percent.

I tried to measure the actual effect on sorting: http://play.golang.org/p/HoBVsztPlg

The first test gave me some conflicting data:

BenchmarkSortSimple     5000    667638 ns/op
BenchmarkSortCache      5000    612834 ns/op
BenchmarkSortUnroll     5000    685639 ns/op
BenchmarkSortReference  5000    664638 ns/op

I'm not entirely sure why the Cache version here is that much faster, or why the unrolled version ended up slower.

A second example, http://play.golang.org/p/GWupagv5wY, gave somewhat expected results:

BenchmarkSortSimple     5000    506628 ns/op
BenchmarkSortCache      5000    484627 ns/op
BenchmarkSortUnroll     5000    483627 ns/op
BenchmarkSortReference  5000    478427 ns/op

So I would have to conclude that using pointers is preferable to trying to make swaps faster. You would have an extra indirection while using that slice, but I'm not able to come up with a really good example that involves both.

So I don't think I can come up with a better real-world example than the "slice" package; but it uses generated code anyway, so it can just as easily generate the unrolled code. It seems the major performance improvement doesn't come from the unrolling, but rather from how the slice is indexed ("e[i], e[j] = e[j], e[i]" vs. "a, b := &e[i], &e[j]; *a, *b = *b, *a"). So that might be a more reasonable optimization; then again, it can easily be done by hand. A sketch of the two swap forms follows.
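For reference, here is a sketch of the two swap forms compared in that last observation. The Elem and Elems names are placeholders, and the element size roughly matches the three-word slice header mentioned above:

package elems

// Elem is a placeholder element roughly the size of a slice header
// (three words), the point at which the unrolled swap started winning.
type Elem [3]int64

type Elems []Elem

// SwapIndexed is the idiomatic form: e[i] and e[j] are indexed on both
// sides of the assignment.
func (e Elems) SwapIndexed(i, j int) {
	e[i], e[j] = e[j], e[i]
}

// SwapViaPointers takes the two addresses once up front, so each
// element address is computed a single time before the copy.
func (e Elems) SwapViaPointers(i, j int) {
	a, b := &e[i], &e[j]
	*a, *b = *b, *a
}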
> The unrolled version is basically trading code size for some performance improvement.

A data point here: I ran

go tool 6g -S 7927.go 2>&1 | grep "Elems\.Swap.*locals"

for a few different sizes of the Elem array.

For 16 elements:
"".Elems.SwapIdiom t=1 size=240 value=0 args=0x28 locals=0x80
"".Elems.SwapSingle t=1 size=368 value=0 args=0x28 locals=0x8

For 8 elements:
"".Elems.SwapIdiom t=1 size=208 value=0 args=0x28 locals=0x40
"".Elems.SwapSingle t=1 size=240 value=0 args=0x28 locals=0x8

For 6 elements:
"".Elems.SwapIdiom t=1 size=208 value=0 args=0x28 locals=0x30
"".Elems.SwapSingle t=1 size=208 value=0 args=0x28 locals=0x8

For 4 elements (the smallest size for which DUFFCOPY is used):
"".Elems.SwapIdiom t=1 size=208 value=0 args=0x28 locals=0x20
"".Elems.SwapSingle t=1 size=160 value=0 args=0x28 locals=0x8

For 1 element:
"".Elems.SwapIdiom t=1 size=128 value=0 args=0x28 locals=0
"".Elems.SwapSingle t=1 size=112 value=0 args=0x28 locals=0x8

So for moderate numbers of elements, there's both a code size and a perf win.
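The file 7927.go isn't included in this thread, so the following is only a guess at the shape of the two methods being compared, with the element count as the constant that was varied:

package main

// N is the Elem array length varied in the measurements above
// (16, 8, 6, 4, 1). The method bodies are an assumption about what
// 7927.go contains, since the file itself isn't part of this thread.
const N = 8

type Elem [N]int64
type Elems []Elem

// SwapIdiom: the tuple-assignment form, which needs an Elem-sized
// temporary on the stack for larger N.
func (e Elems) SwapIdiom(i, j int) {
	e[i], e[j] = e[j], e[i]
}

// SwapSingle: swap one word at a time, so each temporary is a single
// int64 and fits in a register. (The original presumably unrolls this
// loop by hand.)
func (e Elems) SwapSingle(i, j int) {
	for k := 0; k < N; k++ {
		e[i][k], e[j][k] = e[j][k], e[i][k]
	}
}

func main() {}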
Owner changed to @randall77. |
Doesn't reproduce with revision b9feb91 |
Thanks. Yeah, I suspect SSA helped with this. I'll close this, but @randall77 @brtzsnr @josharian @dr2chase et al can reopen if they don't like the generated code. |
CL https://golang.org/cl/21346 mentions this issue. |
gopherbot pushed a commit that referenced this issue on Mar 31, 2016:
Helpful for indexed loads and stores when the stride is not equal to the size being loaded/stored.

Update #7927

Change-Id: I8714dd4c7b18a96a611bf5647ee21f753d723945
Reviewed-on: https://go-review.googlesource.com/21346
Run-TryBot: Todd Neal <todd@tneal.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Todd Neal <todd@tneal.org>