cmd/gc, runtime: slow swapping of slice values #7927
The reason the idiomatic code is slower is that the temporary is bigger and can't be register-allocated.

a, b = b, a

compiles to

tmp = b
b = a
a = tmp

If tmp is one word in size, it can be allocated to a register. This happens for the unrolled code. If tmp is large, it becomes a stack variable, and there are then three mem->mem copies.

I don't see any obvious answer here. Just unrolling as the example shows seems tricky. The unrolled version is basically trading code size for some performance improvement. That is great if it is hot code, not so good if there are hundreds of these in cold code.

We could implement a swap in assembly which did not need the temporary. It's a slippery slope, though: how about a, b, c = b, c, a? We'd have to decide which cases to handle and which ones to leave as is.

The performance penalty is not that large (about 25% slower), and you don't see large-object swaps that often (you're much better off swapping pointers to those objects if you can arrange it).

Short answer: maybe. I'd like to be more convinced that it would be worth it. A real-world example, perhaps?
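For concreteness, a minimal sketch of the trade-off described above; the Big type and its size are invented for illustration and are not taken from the issue:

package main

import "fmt"

// Big is larger than one machine word, so the compiler's temporary for
// a tuple swap of Big values cannot live in a register.
type Big struct {
	data [16]int64
}

func main() {
	x, y := Big{}, Big{}
	x.data[0], y.data[0] = 1, 2

	// Idiomatic swap of the values themselves: tmp = y; y = x; x = tmp,
	// i.e. three memory-to-memory copies of 128 bytes each.
	x, y = y, x

	// Swapping pointers instead: the temporary is one word and can stay
	// in a register, so only pointer-sized values move.
	px, py := &x, &y
	px, py = py, px

	fmt.Println(x.data[0], y.data[0], px.data[0], py.data[0])
}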
The unrolling was mainly a demonstration of the speed difference, not the code that should be implemented. I was guessing a similar approach to memcopy: small things inlined, large ones call some function. The main use case would be sorting and priority queues.

I noticed this while optimizing Brad's slice package https://github.com/egonelbre/slice. Some rough measurements showed that starting at 3 int64s the unrolled swap started being faster, which corresponds to the size of a slice header. Of course the performance benefit there is quite small, only a few percent.

I tried to measure the actual effect on sorting: http://play.golang.org/p/HoBVsztPlg

The first test gave me some conflicting data:

BenchmarkSortSimple     5000    667638 ns/op
BenchmarkSortCache      5000    612834 ns/op
BenchmarkSortUnroll     5000    685639 ns/op
BenchmarkSortReference  5000    664638 ns/op

I'm not entirely sure why the Cache version here is that much faster, or why the unrolled version ended up slower.

A second example, http://play.golang.org/p/GWupagv5wY, gave somewhat expected results:

BenchmarkSortSimple     5000    506628 ns/op
BenchmarkSortCache      5000    484627 ns/op
BenchmarkSortUnroll     5000    483627 ns/op
BenchmarkSortReference  5000    478427 ns/op

So I would have to conclude that using pointers is preferable to trying to make swaps faster. You would have an extra indirection while using that slice, but I'm not able to come up with a really good example that involves both.

So I don't think I can come up with a better real-world example than the "slice" package; but it uses generated code anyway, so it can just as easily generate the unrolled code. It seems the major performance improvement doesn't come from the unrolling, but rather from how the slice is indexed ("e[i], e[j] = e[j], e[i]" vs. "a, b := &e[i], &e[j]; *a, *b = *b, *a"). So that might be a more reasonable optimization; then again, it can easily be done by hand. A sketch of the two swap forms follows.
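For reference, here is a sketch of the two swap forms compared in that last observation. The Elem and Elems names are placeholders, and the element size roughly matches the three-word slice header mentioned above:

package elems

// Elem is a placeholder element roughly the size of a slice header
// (three words), the point at which the unrolled swap started winning.
type Elem [3]int64

type Elems []Elem

// SwapIndexed is the idiomatic form: e[i] and e[j] are indexed on both
// sides of the assignment.
func (e Elems) SwapIndexed(i, j int) {
	e[i], e[j] = e[j], e[i]
}

// SwapViaPointers takes the two addresses once up front, so each
// element address is computed a single time before the copy.
func (e Elems) SwapViaPointers(i, j int) {
	a, b := &e[i], &e[j]
	*a, *b = *b, *a
}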
> The unrolled version is basically trading code size for some performance improvement.

A data point here: I ran

go tool 6g -S 7927.go 2>&1 | grep "Elems\.Swap.*locals"

for a few different sizes of the Elem array.

For 16 elements:
"".Elems.SwapIdiom t=1 size=240 value=0 args=0x28 locals=0x80
"".Elems.SwapSingle t=1 size=368 value=0 args=0x28 locals=0x8

For 8 elements:
"".Elems.SwapIdiom t=1 size=208 value=0 args=0x28 locals=0x40
"".Elems.SwapSingle t=1 size=240 value=0 args=0x28 locals=0x8

For 6 elements:
"".Elems.SwapIdiom t=1 size=208 value=0 args=0x28 locals=0x30
"".Elems.SwapSingle t=1 size=208 value=0 args=0x28 locals=0x8

For 4 elements (the smallest size for which DUFFCOPY is used):
"".Elems.SwapIdiom t=1 size=208 value=0 args=0x28 locals=0x20
"".Elems.SwapSingle t=1 size=160 value=0 args=0x28 locals=0x8

For 1 element:
"".Elems.SwapIdiom t=1 size=128 value=0 args=0x28 locals=0
"".Elems.SwapSingle t=1 size=112 value=0 args=0x28 locals=0x8

So for moderate numbers of elements, there's both a code size and a perf win.
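The file 7927.go isn't included in this thread, so the following is only a guess at the shape of the two methods being compared, with the element count as the constant that was varied:

package main

// N is the Elem array length varied in the measurements above
// (16, 8, 6, 4, 1). The method bodies are an assumption about what
// 7927.go contains, since the file itself isn't part of this thread.
const N = 8

type Elem [N]int64
type Elems []Elem

// SwapIdiom: the tuple-assignment form, which needs an Elem-sized
// temporary on the stack for larger N.
func (e Elems) SwapIdiom(i, j int) {
	e[i], e[j] = e[j], e[i]
}

// SwapSingle: swap one word at a time, so each temporary is a single
// int64 and fits in a register. (The original presumably unrolls this
// loop by hand.)
func (e Elems) SwapSingle(i, j int) {
	for k := 0; k < N; k++ {
		e[i][k], e[j][k] = e[j][k], e[i][k]
	}
}

func main() {}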
Owner changed to @randall77. |
Doesn't reproduce with revision b9feb91 |
Thanks. Yeah, I suspect SSA helped with this. I'll close this, but @randall77 @brtzsnr @josharian @dr2chase et al can reopen if they don't like the generated code. |
CL https://golang.org/cl/21346 mentions this issue. |
gopherbot pushed a commit that referenced this issue on Mar 31, 2016:
Helpful for indexed loads and stores when the stride is not equal to the size being loaded/stored.

Update #7927

Change-Id: I8714dd4c7b18a96a611bf5647ee21f753d723945
Reviewed-on: https://go-review.googlesource.com/21346
Run-TryBot: Todd Neal <todd@tneal.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Todd Neal <todd@tneal.org>