
sort: improve performance of heapSort for worst case inputs #39512

Open
danielf opened this issue Jun 10, 2020 · 14 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@danielf
Contributor

danielf commented Jun 10, 2020

What version of Go are you using (go version)?

$ go version
go version devel +7b872b6d95 Tue Jun 9 23:24:08 2020 +0000 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

$ go env
GOARCH="amd64"
GOOS="darwin"

What did you do?

I've implemented a faster version of siftDown for when it's expected that the root will end up in one of the lower levels. This is the likely outcome right after a "pop", since we move a former leaf to the root.

Benchmarks using an adversarial input (the one in sort_test.go) show that the new code is ~11.8% faster when using a struct with a complex Less, and ~5.5% faster when sorting ints.

My changes make the code slower when many elements are equivalent (!Less(a, b) && !Less(b, a)), but that case is uncommon by the time the heapsort fallback in sort.go is activated.

Notes

The same changes can be applied to container/heap and to runtime, assuming benchmarks support them.

benchstat results

name                          old time/op  new time/op  delta
Adversary/Using_complex_Less  49.5ms ± 0%  43.7ms ± 0%  -11.77%  (p=0.000 n=10+10)
Adversary/Using_Ints          31.3ms ± 0%  29.6ms ± 0%   -5.48%  (p=0.000 n=10+10)
@gopherbot

Change https://golang.org/cl/237437 mentions this issue: sort: improve speed of HeapSort

@ianlancetaylor ianlancetaylor changed the title Improving the performance of sort for bad inputs (when heapSort is called) sort: improve performance of heapSort for worst case inputs Jun 10, 2020
@ianlancetaylor ianlancetaylor added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jun 10, 2020
@ianlancetaylor ianlancetaylor added this to the Backlog milestone Jun 10, 2020
@martisch
Contributor

martisch commented Jun 10, 2020

As a data point for the complexity/maintenance/speedup tradeoff: in the production profiling I have available, if the time spent in and below sort.Sort is 100%, then ~0.001% of it is spent in sort.heapSort.

I therefore think it is fine to improve the speed if the added code is very small and obvious, but adding another function that would rarely run and needs extra care to test, because it is only exercised on adversarial inputs, might not be a good tradeoff.

@Windsooon

There are two open PRs for improving the heap: PR1 and PR2.

@danielf
Contributor Author

danielf commented Jun 18, 2020

@martisch -- I understand that the amount of time spent in heapSort during a sort.Sort is minimal in typical cases. That said, not all cases are typical, and an 11% performance improvement is not negligible in those cases. I disagree that the code is complex, but I am the author, so I'm probably biased. :)

@Windsooon -- The two PRs change different things. PR1 changes container/heap, while PR2 changes sort. I purposely didn't make my changes in container/heap (yet) because it's harder to reason about the improvement there, since there is no information on how the heap will be used. For instance, if all elements are the same (if Less always returns false), then my code would make container/heap's pop about log(N) times slower. The use case in sort is more predictable, and therefore easier to reason about.

@martisch
Contributor

@danielf You are right that not all cases are typical, but the non-typical cases are rare.
Speeding them up by 11% is good in relative terms but tiny in absolute terms.

The cost for the typical case isn't free, because the code adds binary size and occupies icache lines that might otherwise hold more typically used code, which now gets flushed when the new code runs. With 0.001% usage relative to normal sort, an 11% gain is about a 0.0001% overall speedup. The code should therefore not regress the typical case by more than ~0.0001%, or it is close to a net loss in performance. Of course this is hard to measure, and microbenchmarks won't show it. The tradeoff might still be OK here; I just want to point out that adding code also has negative performance side effects that microbenchmarks won't show. On top of that, more code also needs more maintenance.

@danielf
Contributor Author

danielf commented Jun 18, 2020

Hi @martisch ,

  1. Could you share the production data that shows the 0.001% usage? Thanks. (I understand if you can't.)
  2. My point is that it's not unthinkable that there are applications where the "atypical case" happens, say, 70% of the time. For the maintainers of those applications, an 11% speedup might be sizeable. In other words, I'm not convinced that ~0.001% of sort time is spent in heapSort uniformly across all applications that call sort.

If it is true that in ALL applications only 0.001% of sort time is spent in heapSort, then there are easier ways to guarantee O(N log N) worst-case runtime (with extremely high probability) that would use fewer lines of code to maintain, but that is another discussion. :)

In any case, thanks for your comments!

@martisch
Contributor

martisch commented Jun 18, 2020

  1. Could you share the production data that shows the 0.001% usage? Thanks. (I understand if you can't.)

I can't share the raw data. It would be too large and would reveal call stacks. For context: this is the relative difference, within a fixed time window, of all CPU time sampled from Google's machines running Google's own services, for the sort vs. heapSort portion (Google-Wide Profiling paper: https://research.google/pubs/pub36575/). The absolute value is also tiny (I had to make the normal search more granular and exhaustive to even find it). If we improved the runtime.siftdownTimer heap implementation by 0.001%, it would make up for more CPU time than is spent in heapSort in total, and the runtime functions are used by many more (if not all) Go programs. Note that the relative differences could be off by even 10x, since the heapSort portion is so tiny to begin with.

  1. My point is that it's not unthinkable that there are applications where the "atypical case" happens, say, 70% of the time.

Agreed, but the same rationale can be applied to over-optimize any function. It's not unthinkable that there are applications that would benefit from not having the heapSort fallback at all, to reduce binary size. It's not unthinkable that there are applications that would benefit from a memmove implementation tuned for 48-byte copies. But adding more ifs to functions for specialized cases also has negative costs for other cases. I checked the data, and the time spent in heapSort was actually mostly caused by one binary with a spike in the last few days I looked over.

If there are specific applications that would benefit from a more tuned heapsort, maybe they should use a tuned heapsort directly instead of sort.Sort?

If it is true that in ALL applications only 0.001% of sort time is spent in heapSort, then there are easier ways to guarantee O(N log N) worst-case runtime (with extremely high probability) that would use fewer lines of code to maintain, but that is another discussion. :)

If it guarantees O(N log N) worst case and is simpler while keeping the current performance, I think a CL for it would be welcome.

Being locked into an implementation because it has been over-optimized for edge cases is, I think, something to avoid: it blocks other implementations that give better overall performance from being introduced, because they would regress some application benchmark.

I would think that the general approach for the Go standard library should be to optimize for the common case while allowing more specialized implementations to live outside the standard library.
That's not to say I haven't over-optimized some functions myself in the past. But I'm calibrated differently nowadays (probably still too much on the performance side rather than the simplicity side) and would take some of those back if I could.

It's fair to say that the data point I provided above is biased in many ways, but it is also quite a large sample over many uses of Go. If others can share data too, maybe it would show that heapSort is called much more often than my data shows.

This is not to say there can't exist an application in the world that would benefit from this. But then there might be better ways to apply an application-specific optimization, such as its own heapsort implementation that doesn't go through an interface.

@danielf
Contributor Author

danielf commented Jun 18, 2020

@martisch , thanks for this very comprehensive reply! I agree with most (if not all) of the points raised, but I think that most of them don't apply to this CL, so I will comment below. Of course, terms like "code simplicity" are subjective.

The general idea of the comments below is: the changes will help a few, harm no one, and are simple enough to be easily maintainable.

I can't share the raw data. It would be too large and would reveal call stacks.

I expected this, but I wouldn't be doing my job if I hadn't asked :)

If we improved the runtime.siftdownTimer heap implementation by 0.001%, it would make up for more CPU time than is spent in heapSort in total, and the runtime functions are used by many more (if not all) Go programs. Note that the relative differences could be off by even 10x, since the heapSort portion is so tiny to begin with.

I thought about making the same changes in both container/heap and runtime, but I didn't know how to properly benchmark them, since it's unclear to me what the usual case looks like. For container/heap I guess I could benchmark with things like a naive implementation of Dijkstra's algorithm, but I wouldn't be sure that corresponds to "typical use" by the mass of Go developers. If you can suggest a reliable way to benchmark them, I would be glad to replicate these changes! (runtime uses a 4-ary heap, not a binary heap, so the code would have to be adapted.)

  1. My point is that it's not unthinkable that there are applications where the "atypical case" happens, say, 70% of the time.

Agreed, but the same rationale can be applied to over-optimize any function. It's not unthinkable that there are applications that would benefit from not having the heapSort fallback at all, to reduce binary size. It's not unthinkable that there are applications that would benefit from a memmove implementation tuned for 48-byte copies.

I agree with the general idea of what you said, but:

  1. The memmove example is hyperbole, at least compared to this heapSort debate.
  2. I considered the binary-size aspect, but then I realized that people who need small binaries should not be using Go. A simple hello world in Go produces a 2 MB ELF64 binary (compared with 8.2 KB in C, using gcc with no flags). After stripping, the binaries go to 1.2 MB and 6 KB. My point here is not that Go produces bloated binaries, but that Go was not intended for uses where binary size is that important. (I understand the performance benefits of small binaries, but one extra small function will not take them away.)

But adding more ifs to functions for specialized cases also has negative costs for other cases.

My code doesn't have an "if" to decide on the algorithm. It ALWAYS uses the new function to sift down after a pop. It's NOT a specialized solution for a particular scenario.

If there are specific applications that would benefit from a more tuned heapsort maybe they should use a tuned heapsort directly instead of sort.Sort?

I agree with this sentence; I just don't see it as an argument against making sort.Sort faster (on rare occasions) by adding one uncomplicated function.

Being locked in to an implementation because it has been overoptimized for edge cases is I think something to avoid because it blocks other implementations that give better performance overall from being introduced due to regressing some application benchmark.

I agree with this sentence, but I really don't think that it is a fair representation of what my CL proposes.

I would think that the general approach to the Go standard library should be to optimize for the common case while allowing more specialized implementations to live outside the standard library.

I agree with the sentence "the common case should not suffer in favor of the edge cases" (a little mantra in my field). I also agree with your comment about specialized implementations. That said, there is already an implementation of heapSort in sort.Sort. Why not make it faster? We are not adding any specialized implementations.

If the idea is just maintainability, then using container/heap to do the heapsort part of sort would make it much cleaner and leave fewer lines of code to maintain.

That's not to say I haven't over-optimized some functions myself in the past. But I'm calibrated differently nowadays (probably still too much on the performance side rather than the simplicity side) and would take some of those back if I could.

The solution is simple AND performant. :D

This is not to say there can't exist an application in the world that would benefit from this. But then there might be better ways to apply an application-specific optimization, such as its own heapsort implementation that doesn't go through an interface.

There will be applications that benefit from it, there will be no applications that suffer from it, and it is simple, small, self-contained code.

@martisch
Contributor

martisch commented Jun 18, 2020

There will be applications that benefit from it, there will be no applications that suffer from it, and it is simple, small, self-contained code.

That is where I disagree: there can be applications that suffer. It might not be a net negative here, but any change that increases instruction size has the potential to decrease overall performance (when those instructions are executed) by evicting more useful instructions from the CPU (i)caches or by needing to load more cache lines first (in this case this might be offset by the data being touched less). That is why I mentioned memmove before. Profiling across large pools of machines shows that over-optimized memmove implementations (with lots of special cases) perform better in microbenchmarks but then perform worse overall due to more icache misses (branch mispredicts can be kept lower with jump tables).

The memmove example is hyperbole, at least compared to this heapSort debate.

I might have reached for extremes because we talked about the extreme of unthinkables :)

My code doesn't have an "if" to decide on the algorithm. It ALWAYS uses the new function to sift down after a pop. It's NOT a specialized solution for a particular scenario.

True, the "if" was meant in general. For a specialization that makes the instruction footprint larger to handle the uncommon case better, there can still be negative effects.

@danielf
Contributor Author

danielf commented Jun 18, 2020

That is where I disagree: there can be applications that suffer. It might not be a net negative here, but any change that increases instruction size has the potential to decrease overall performance (when those instructions are executed) by evicting more useful instructions from the CPU (i)caches. That is why I mentioned memmove before. Profiling across large pools of machines shows that over-optimized memmove implementations (with lots of special cases) perform better in microbenchmarks but then perform worse overall due to more icache misses (branch mispredicts can be kept lower with jump tables).

Mmm, I think you are nitpicking a little bit... This code (when activated, which is rare) will be called at least 12 times (because of the insertionSort optimization), and likely considerably more (the adversarial case calls it for basically the full array). Code that is called 12 times (and likely far more) deserves to replace something else temporarily in the L1i cache. :)

Note that this code will be called all those "at least 12 but likely far more" times before any other code runs, so it's not as if it will be inserted into L1i and bumped out repeatedly. It will go in (bumping something else out), be executed a bunch of times, and then be bumped out again. The code that was bumped out will very likely still reside in L2 for the duration of the heapSort execution, so returning it to L1i shouldn't hurt the CPU pipeline too much.

@martisch
Contributor

martisch commented Jun 18, 2020

Mmm, I think that you are nitpicking a little bit...

It may very well be nit-picky, but then the argument was the absolute claim that this cannot harm any program :)

I think we largely agree on the effects that could potentially happen but disagree on where to draw the line for improving the uncommon cases in the algorithm. And I think that's fine. It would be boring if we all thought alike. Thanks for the constructive discussion.

@danielf
Contributor Author

danielf commented Jun 18, 2020

It may very well be nit-picky, but then the argument was the absolute claim that this cannot harm any program :)

I think we largely agree on the effects that can happen but disagree on where to draw the line for improving the uncommon cases in the algorithm.

I am good with this conclusion! Thanks also, this was great. :)

@Windsooon

@danicat @martisch Do you think we should rewrite the quicksort function in an iterative version? In most cases, it should be faster.

@martisch
Contributor

As long as it does not cause any more heap allocations, you could write a prototype and post the numbers; then we can start a discussion. If it's a very small win, it might not be worth breaking people's tests if the order of the sort changes, even though this is not a stable sort.
