
sort: improve performance of heapSort for worst case inputs #39512

Open
danielf opened this issue Jun 10, 2020 · 14 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@danielf
Contributor

danielf commented Jun 10, 2020

What version of Go are you using (go version)?

$ go version
go version devel +7b872b6d95 Tue Jun 9 23:24:08 2020 +0000 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

$ go env
GOARCH="amd64"
GOOS="darwin"

What did you do?

I've implemented a faster version of siftDown for when it's expected that the root will end up in one of the lower levels. This is the likely outcome right after a "pop", since we move a former leaf to the root.

Benchmarks using an adversarial input (the one in sort_test.go) show that the new code is ~11.8% faster when using a struct with a complex Less, and ~5.5% faster when sorting ints.

My changes make the code slower when many elements are equivalent (!Less(a, b) && !Less(b, a)), but that case is uncommon by the time the heapsort fallback in sort.go is activated.

Notes

The same changes can be applied to container/heap and to runtime, assuming benchmarks support them.

benchstat results

name                          old time/op  new time/op  delta
Adversary/Using_complex_Less  49.5ms ± 0%  43.7ms ± 0%  -11.77%  (p=0.000 n=10+10)
Adversary/Using_Ints          31.3ms ± 0%  29.6ms ± 0%   -5.48%  (p=0.000 n=10+10)
@gopherbot

Change https://golang.org/cl/237437 mentions this issue: sort: improve speed of HeapSort

@ianlancetaylor ianlancetaylor changed the title Improving the performance of sort for bad inputs (when heapSort is called) sort: improve performance of heapSort for worst case inputs Jun 10, 2020
@ianlancetaylor ianlancetaylor added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jun 10, 2020
@ianlancetaylor ianlancetaylor added this to the Backlog milestone Jun 10, 2020
@martisch
Contributor

martisch commented Jun 10, 2020

As a data point for the complexity/maintenance/speedup tradeoff: in the production profiling I have available, if the time spent in and below sort.Sort is 100%, then ~0.001% of it is spent in sort.heapSort.

I therefore think it is fine to improve the speed if the added code is very small and obvious, but adding another function that would rarely run and needs extra care to test, because it is only exercised on adversarial inputs, might not be a good tradeoff.

@Windsooon

There are two open PRs for improving the heap: PR1 and PR2.

@danielf
Contributor Author

danielf commented Jun 18, 2020

@martisch -- I understand that the amount of time spent in heapSort during a sort.Sort is minimal in typical cases. That said, not all cases are typical, and an 11% performance improvement is not negligible in those cases. I disagree that the code is complex, but I am the author, so I'm probably biased. :)

@Windsooon -- The two PRs change different things. PR1 changes container/heap, while PR2 changes sort. I purposely didn't make my changes in container/heap (yet) because it's harder to reason about the improvement there, since there is no information on how the heap will be used. For instance, if all elements are the same (if Less always returns false), then my code would make container/heap's pop about log(N) times slower. The use case in sort is more predictable, and therefore easier to reason about.

@martisch
Contributor

@danielf You are right that not all cases are typical, but the non-typical cases are rare.
Speeding them up by 11% is good in relative terms but tiny in absolute terms.

The cost for the typical case isn't free, because the code adds binary size and occupies icache lines that might otherwise hold more typically used code, which now gets flushed when the new code runs. With 0.001% usage relative to normal sort, an 11% gain is about a 0.0001% overall speedup. The code should therefore not regress the typical case by more than ~0.0001%, or it is close to a net loss in performance. Of course this is hard to measure, and microbenchmarks won't show it. The tradeoff might still be OK here; I just want to point out that adding code also has negative performance side effects that microbenchmarks won't show. On top of that, more code also needs more maintenance.

@danielf
Contributor Author

danielf commented Jun 18, 2020

Hi @martisch ,

  1. Could you share the production data that shows the 0.001% usage? Thanks. (I understand if you can't.)
  2. My point is that it's not unthinkable that there are applications where the "atypical case" happens, say, 70% of the time. For the maintainers of those applications, an 11% speedup might be sizeable. In other words, I'm not convinced that ~0.001% of sort time is spent in heapSort uniformly across all applications that call sort.

If it is true that in ALL applications only 0.001% of sort time is spent in heapSort, then there are easier ways to guarantee O(N log N) worst-case runtime (with extremely high probability) that would use fewer lines of code to maintain, but that is another discussion. :)

In any case, thanks for your comments!

@martisch
Contributor

martisch commented Jun 18, 2020

  1. Could you share the production data that shows the 0.001% usage? Thanks. (I understand if you can't.)

I can't share the raw data. It would be too large and would reveal call stacks. For context: this is the relative difference, within a fixed time window, of all CPU time sampled from Google's machines running Google's own services, for the sort vs. heapSort portion (Google-Wide Profiling paper: https://research.google/pubs/pub36575/). The absolute value is also tiny (I had to make the normal search more granular and exhaustive to even find it). If we improved the runtime.siftdownTimer heap implementation by 0.001%, it would make up for more CPU time than is spent in heapSort in total, and the runtime functions are used by many more (if not all) Go programs. Note that the relative differences could be off by even 10x, since the heapSort portion is so tiny to begin with.

  1. My point is that it's not unthinkable that there are applications where the "atypical case" happens, say, 70% of the time.

Agreed, but the same rationale can be applied to over-optimize any function. It's not unthinkable that there are applications that would benefit from not having the heapSort fallback at all, to reduce binary size. It's not unthinkable that there are applications that would benefit from a memmove implementation tuned for 48-byte copies. But adding more ifs to functions for specialized cases also has negative costs for other cases. I checked the data, and the time spent in heapSort was actually mostly caused by one binary with a spike in the last few days I looked over.

If there are specific applications that would benefit from a more tuned heapsort, maybe they should use a tuned heapsort directly instead of sort.Sort?

If it is true that in ALL applications only 0.001% of sort time is spent in heapSort, then there are easier ways to guarantee O(N log N) worst-case runtime (with extremely high probability) that would use fewer lines of code to maintain, but that is another discussion. :)

If it guarantees O(N log N) worst case and is simpler while keeping the current performance, I think a CL for it would be welcome.

Being locked into an implementation because it has been over-optimized for edge cases is, I think, something to avoid: it blocks other implementations that give better overall performance from being introduced, because they would regress some application benchmark.

I would think that the general approach for the Go standard library should be to optimize for the common case while allowing more specialized implementations to live outside the standard library.
That's not to say I haven't over-optimized some functions myself in the past. But I'm calibrated differently nowadays (probably still too much on the performance side rather than the simplicity side) and would take some of those back if I could.

It's fair to say that the data point I provided above is biased in many ways, but it is also quite a large sample over many uses of Go. If others can share data too, maybe it would show that heapSort is called much more often than my data shows.

This is not to say there can't exist an application in the world that would benefit from this. But then there might be better ways to apply an application-specific optimization, such as its own heapsort implementation that doesn't go through an interface.

@danielf
Contributor Author

danielf commented Jun 18, 2020

@martisch , thanks for this very comprehensive reply! I agree with most (if not all) of the points raised, but I think that most of them don't apply to this CL, so I will comment below. Of course, terms like "code simplicity" are subjective.

The general idea of the comments below is: the changes will help a few, harm no one, and are simple enough to be easily maintainable.

I can't share the raw data. It would be too large and would reveal call stacks.

I expected this, but I wouldn't be doing my job if I hadn't asked :)

If we improved the runtime.siftdownTimer heap implementation by 0.001%, it would make up for more CPU time than is spent in heapSort in total, and the runtime functions are used by many more (if not all) Go programs. Note that the relative differences could be off by even 10x, since the heapSort portion is so tiny to begin with.

I thought about making the same changes in both container/heap and runtime, but I didn't know how to properly benchmark them, since it's unclear to me what the usual case looks like. For container/heap I guess I could benchmark with things like a naive implementation of Dijkstra's algorithm, but I wouldn't be sure that corresponds to "typical use" by the mass of Go developers. If you can suggest a reliable way to benchmark them, I would be glad to replicate these changes! (runtime uses a 4-ary heap, not a binary heap, so the code would have to be adapted.)

  1. My point is that it's not unthinkable that there are applications where the "atypical case" happens, say, 70% of the time.

Agreed, but the same rationale can be applied to over-optimize any function. It's not unthinkable that there are applications that would benefit from not having the heapSort fallback at all, to reduce binary size. It's not unthinkable that there are applications that would benefit from a memmove implementation tuned for 48-byte copies.

I agree with the general idea of what you said, but:

  1. The memmove example is hyperbole, at least compared to this heapSort debate.
  2. I considered the binary-size aspect, but then I realized that people who need small binaries should not be using Go. A simple hello world in Go produces a 2 MB ELF64 binary (compared with 8.2 KB in C, using gcc with no flags). After stripping, the binaries go to 1.2 MB and 6 KB. My point here is not that Go produces bloated binaries, but that Go was not intended for uses where binary size is that important. (I understand the performance benefits of small binaries, but one extra small function will not take them away.)

But adding more ifs to functions for specialized cases also has negative costs for other cases.

My code doesn't have an "if" to decide on the algorithm. It ALWAYS uses the new function to sift down after a pop. It's NOT a specialized solution for a particular scenario.

If there are specific applications that would benefit from a more tuned heapsort maybe they should use a tuned heapsort directly instead of sort.Sort?

I agree with this sentence; I just don't see it as an argument against making sort.Sort faster (on rare occasions) by adding one uncomplicated function.

Being locked in to an implementation because it has been overoptimized for edge cases is I think something to avoid because it blocks other implementations that give better performance overall from being introduced due to regressing some application benchmark.

I agree with this sentence, but I really don't think that it is a fair representation of what my CL proposes.

I would think that the general approach to the Go standard library should be to optimize for the common case while allowing more specialized implementations to live outside the standard library.

I agree with the sentence "the common case should not suffer in favor of the edge cases" (a little mantra in my field). I also agree with your comment about specialized implementations. That said, there is already an implementation of heapSort in sort.Sort. Why not make it faster? We are not adding any specialized implementations.

If the idea is just maintainability, then using container/heap to do the heapsort part of sort would make it much cleaner and leave fewer lines of code to maintain.

That's not to say I haven't over-optimized some functions myself in the past. But I'm calibrated differently nowadays (probably still too much on the performance side rather than the simplicity side) and would take some of those back if I could.

The solution is simple AND performant. :D

This is not to say there can't exist an application in the world that would benefit from this. But then there might be better ways to apply an application-specific optimization, such as its own heapsort implementation that doesn't go through an interface.

There will be applications that benefit from it, there will be no applications that suffer from it, and it is simple, small, self-contained code.

@martisch
Contributor

martisch commented Jun 18, 2020

There will be applications that benefit from it, there will be no applications that suffer from it, and it is simple, small, self-contained code.

That is where I disagree: there can be applications that suffer. It might not be a net negative here, but any change that increases instruction size has the potential to decrease overall performance (when those instructions are executed) by evicting more useful instructions from the CPU (i)caches or by needing to load more cache lines first (in this case this might be offset by the data being touched less). That is why I mentioned memmove before. Profiling across large pools of machines shows that over-optimized memmove implementations (with lots of special cases) perform better in microbenchmarks but then perform worse overall due to more icache misses (branch mispredicts can be kept lower with jump tables).

The memmove example is hyperbole, at least compared to this heapSort debate.

I might have reached for extremes because we talked about the extreme of unthinkables :)

My code doesn't have an "if" to decide on the algorithm. It ALWAYS uses the new function to sift down after a pop. It's NOT a specialized solution for a particular scenario.

True, the "if" was meant in general. For a specialization that makes the instruction footprint larger to handle the uncommon case better, there can still be negative effects.

@danielf
Contributor Author

danielf commented Jun 18, 2020

That is where I disagree: there can be applications that suffer. It might not be a net negative here, but any change that increases instruction size has the potential to decrease overall performance (when those instructions are executed) by evicting more useful instructions from the CPU (i)caches. That is why I mentioned memmove before. Profiling across large pools of machines shows that over-optimized memmove implementations (with lots of special cases) perform better in microbenchmarks but then perform worse overall due to more icache misses (branch mispredicts can be kept lower with jump tables).

Mmm, I think you are nitpicking a little bit... This code (when activated, which is rare) will be called at least 12 times (because of the insertionSort optimization), and likely considerably more (the adversarial case calls it for basically the full array). Code that is called 12 times (and likely far more) deserves to replace something else temporarily in the L1i cache. :)

Note that this code will be called all those "at least 12 but likely far more" times before any other code runs, so it's not as if it will be inserted into L1i and bumped out repeatedly. It will go in (bumping something else out), be executed a bunch of times, and then be bumped out again. The code that was bumped out will very likely still reside in L2 for the duration of the heapSort execution, so returning it to L1i shouldn't hurt the CPU pipeline too much.

@martisch
Contributor

martisch commented Jun 18, 2020

Mmm, I think that you are nitpicking a little bit...

It may very well be nit-picky, but then the argument was the absolute claim that this cannot harm any program :)

I think we largely agree on the effects that could potentially happen but disagree on where to draw the line for improving the uncommon cases in the algorithm. And I think that's fine. It would be boring if we all thought alike. Thanks for the constructive discussion.

@danielf
Contributor Author

danielf commented Jun 18, 2020

It may very well be nit-picky, but then the argument was the absolute claim that this cannot harm any program :)

I think we largely agree on the effects that can happen but disagree on where to draw the line for improving the uncommon cases in the algorithm.

I am good with this conclusion! Thanks also, this was great. :)

@Windsooon

@danicat @martisch Do you think we should rewrite the quicksort function in an iterative version? In most cases, it should be faster.

@martisch
Contributor

As long as it does not cause any more heap allocations, you could write a prototype and post the numbers; then we can start a discussion. If it's a very small win, it might not be worth breaking people's tests if the order of the sort changes, even though this is not a stable sort.
