
cmd/gc, runtime: slow swapping of slice values #7927

Closed
egonelbre opened this issue May 2, 2014 · 8 comments

Comments

@egonelbre
Contributor

The idiomatic swap e[i], e[j] = e[j], e[i] on a slice of structs is slower than
swapping each field of the structs separately.

What does 'go version' print?
go version devel +c19d7fd53785 Wed Apr 30 13:03:38 2014 -0400 windows/amd64

What steps reproduce the problem?

http://play.golang.org/p/M5aR22tZSQ

type Elem [16]uint64
type Elems []Elem

func (e Elems) SwapIdiom(i, j int) {
    e[i], e[j] = e[j], e[i]
}

func (e Elems) SwapUnrolled(i, j int) {
    a, b := &e[i], &e[j]
    a[0], b[0] = b[0], a[0]
    a[1], b[1] = b[1], a[1]
    a[2], b[2] = b[2], a[2]
    // ... continues through a[15], b[15] = b[15], a[15]
}

My measurements got:

SwapIdiom: 46.0ns/op
SwapUnrolled: 36.2ns/op

What should have happened instead?

I would expect the idiomatic version to be as fast as or faster than the manually unrolled one.
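
For reference, a minimal benchmark sketch along the lines of the playground link above; the slice length and indices are assumptions, not the exact benchmark used. It relies on the Elem/Elems types and Swap methods shown earlier and the standard "testing" package:

func BenchmarkSwapIdiom(b *testing.B) {
    e := make(Elems, 1024)
    for n := 0; n < b.N; n++ {
        e.SwapIdiom(1, 2) // arbitrary fixed indices
    }
}

func BenchmarkSwapUnrolled(b *testing.B) {
    e := make(Elems, 1024)
    for n := 0; n < b.N; n++ {
        e.SwapUnrolled(1, 2)
    }
}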
@bradfitz
Contributor

bradfitz commented May 2, 2014

Comment 1:

Labels changed: added performance, release-none, repo-main.

Status changed to Accepted.

@randall77
Contributor

Comment 2:

The reason why the idiomatic code is slower is that the temporary is bigger and can't be
register-allocated.
  a,b = b,a
compiles to
  tmp = b
  b = a
  a = tmp
If tmp is one word in size, it can be allocated to a register.  This happens for the
unrolled code.
If tmp is large, it becomes a stack variable.  There are then 3 mem->mem copies.
I don't see any obvious answer here.  Having the compiler unroll the swap, as the example
does by hand, seems tricky.  The unrolled version is basically trading code size for some
performance improvement.  It is great if it is hot code, not so good if there are hundreds
of these swaps in cold code.
We could implement a swap in assembly which did not need the temporary.  It's a slippery
slope, though - how about a,b,c = b,c,a?  We'd have to decide which cases to handle and
which ones to leave as is.
The performance penalty is not that large (about 25% slower), and you don't see
large-object swaps that often (you're much better off swapping pointers to those objects
if you can arrange it).
Short answer: maybe.  I'd like to be more convinced that it would be worth it.  A
real-world example, perhaps?
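
To make that concrete, here is a hand-written Go sketch of the lowering described above (illustrative only, not actual compiler output), reusing the Elem/Elems types from the report:

// Whole-value swap: the temporary is a full Elem (128 bytes), too large for
// a register, so it lives on the stack and there are three
// memory-to-memory copies.
func swapWholeValue(e Elems, i, j int) {
    tmp := e[j] // 128-byte stack temporary
    e[j] = e[i]
    e[i] = tmp
}

// Per-word swap: each temporary is a single uint64, which fits in a
// register, so no large stack temporary is needed.
func swapPerWord(e Elems, i, j int) {
    a, b := &e[i], &e[j]
    for k := range a {
        tmp := b[k] // word-sized, register-allocatable temporary
        b[k] = a[k]
        a[k] = tmp
    }
}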

@egonelbre
Contributor Author

Comment 3:

The unrolling was mainly a demonstration of the speed difference, not the code that
should be implemented. I was imagining an approach similar to memcopy: small cases
inlined, large ones calling some function.
The main use case would be sorting and priority queues. I noticed this while
optimizing Brad's slice package https://github.com/egonelbre/slice. Some rough
measurements showed that starting at three int64s the unrolled swap became
faster, which corresponds to the size of a slice header. Of course the performance benefit
there is quite small, only a few percent.
I tried to measure the actual effect on sorting http://play.golang.org/p/HoBVsztPlg
The first test gave me some conflicting data:
BenchmarkSortSimple         5000            667638 ns/op
BenchmarkSortCache          5000            612834 ns/op
BenchmarkSortUnroll         5000            685639 ns/op
BenchmarkSortReference      5000            664638 ns/op
I'm not entirely sure why the Cache version here is that much faster, nor why the unrolled
version ended up slower.
A second example http://play.golang.org/p/GWupagv5wY gave somewhat expected results:
BenchmarkSortSimple         5000            506628 ns/op
BenchmarkSortCache          5000            484627 ns/op
BenchmarkSortUnroll         5000            483627 ns/op
BenchmarkSortReference      5000            478427 ns/op
So I would have to conclude that using pointers is preferable to trying to
make swaps faster. You would have an extra indirection while using that slice, but I'm
not able to come up with a really good example that involves both. So I don't think I can
come up with a better real-world example than the "slice" package; but it uses generated
code anyway, so it can just as easily generate the unrolled code.
It seems the major performance improvement doesn't come from the unrolling, but rather
from indexing the slice only once ( "e[i], e[j] = e[j], e[i]" vs. "a, b := &e[i], &e[j]; *a, *b = *b,
*a" ). So that might be a more reasonable optimization; then again, it can easily be
done by hand.
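For clarity, a hedged sketch of that last point, taking the element addresses once and swapping through the pointers; the method name SwapCached is illustrative and not from the original playground code:

func (e Elems) SwapCached(i, j int) {
    // Index the slice once, then swap through the pointers, instead of
    // repeating the e[i]/e[j] indexing on both sides of the assignment.
    a, b := &e[i], &e[j]
    *a, *b = *b, *a
}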

@josharian
Contributor

Comment 4:

> The unrolled version is basically trading code size for some performance improvement.
A data point here: I ran 'go tool 6g -S 7927.go 2>&1 | grep "Elems\.Swap.*locals"' for a
few different sizes of the Elem array.
For 16 elements:
"".Elems.SwapIdiom t=1 size=240 value=0 args=0x28 locals=0x80
"".Elems.SwapSingle t=1 size=368 value=0 args=0x28 locals=0x8
For 8 elements:
"".Elems.SwapIdiom t=1 size=208 value=0 args=0x28 locals=0x40
"".Elems.SwapSingle t=1 size=240 value=0 args=0x28 locals=0x8
For 6 elements:
"".Elems.SwapIdiom t=1 size=208 value=0 args=0x28 locals=0x30
"".Elems.SwapSingle t=1 size=208 value=0 args=0x28 locals=0x8
For 4 elements (the smallest size for which DUFFCOPY is used):
"".Elems.SwapIdiom t=1 size=208 value=0 args=0x28 locals=0x20
"".Elems.SwapSingle t=1 size=160 value=0 args=0x28 locals=0x8
For 1 element:
"".Elems.SwapIdiom t=1 size=128 value=0 args=0x28 locals=0
"".Elems.SwapSingle t=1 size=112 value=0 args=0x28 locals=0x8
So for moderate numbers of elements, there's both a code size and a perf win.

@randall77
Contributor

Comment 5:

Owner changed to @randall77.

@ferluht

ferluht commented Mar 31, 2016

Doesn't reproduce with revision b9feb91

@bradfitz
Contributor

Thanks. Yeah, I suspect SSA helped with this.

I'll close this, but @randall77 @brtzsnr @josharian @dr2chase et al can reopen if they don't like the generated code.

@gopherbot
Contributor

CL https://golang.org/cl/21346 mentions this issue.

gopherbot pushed a commit that referenced this issue Mar 31, 2016
Helpful for indexed loads and stores when the stride is not equal to
the size being loaded/stored.

Update #7927

Change-Id: I8714dd4c7b18a96a611bf5647ee21f753d723945
Reviewed-on: https://go-review.googlesource.com/21346
Run-TryBot: Todd Neal <todd@tneal.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Todd Neal <todd@tneal.org>
@golang locked and limited conversation to collaborators Mar 31, 2017