
proposal: sort: Stable with better asymptotic time complexity #25137

Closed
nsajko opened this issue Apr 27, 2018 · 21 comments

Comments

@nsajko
Contributor

nsajko commented Apr 27, 2018

The proposed change does not add any new API and is backwards compatible.

The current implementation of Stable is very simple, but not satisfactorily performant: it runs in O(n*log(n)*log(n)) time.

I propose implementing Stable using an O(n*log(n)) algorithm, which is, however, much more complex. See https://go-review.googlesource.com/c/go/+/101415
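
(For comparison, and to make clear where the difficulty lies: the textbook O(n*log(n)) stable merge sort needs an O(n) auxiliary buffer, which Stable cannot use, since through sort.Interface it can only Swap elements, never copy them out. A minimal sketch of that textbook version on an int slice, purely for illustration and not part of the CL:)

    // Textbook stable merge sort: O(n*log(n)) time, O(n) extra space.
    // sort.Stable cannot allocate a buffer like this, which is why it
    // currently falls back to an in-place merge and ends up at
    // O(n*log(n)*log(n)).
    func mergeSort(a []int) {
        buf := make([]int, len(a))
        var rec func(lo, hi int)
        rec = func(lo, hi int) {
            if hi-lo < 2 {
                return
            }
            mid := lo + (hi-lo)/2
            rec(lo, mid)
            rec(mid, hi)
            // Merge the sorted halves a[lo:mid] and a[mid:hi] into buf.
            i, j, k := lo, mid, lo
            for i < mid && j < hi {
                if a[j] < a[i] { // strict < keeps equal elements in order (stability)
                    buf[k] = a[j]
                    j++
                } else {
                    buf[k] = a[i]
                    i++
                }
                k++
            }
            k += copy(buf[k:], a[i:mid])
            copy(buf[k:], a[j:hi])
            copy(a[lo:hi], buf[lo:hi])
        }
        rec(0, len(a))
    }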

Code complexity: the following functions, with their cyclomatic complexities noted, are added:

Cyclomatic complexity   Package   Function   Location
118 sort stable sort.go:918:1
19 sort merge sort.go:666:1
16 sort findBDSAndCountDistinctElementsNear sort.go:420:1
13 sort findBDSFarAndCountDistinctElements sort.go:497:1
8 sort simpleMergeBufBigSmall sort.go:625:1
4 sort distinctElementCount sort.go:389:1
3 sort extractDist sort.go:573:1
3 sort equalRange sort.go:562:1
3 sort max sort.go:333:1
3 sort lessBlocks sort.go:613:1
2 sort maxDepth sort.go:223:1
1 sort searchLess sort.go:328:1
1 sort isqrter sort.go:348:1

Performance:

Benchmark results using func bench (it sorts pseudorandom data) from sort_test.go (benchmark names include input array size):

Benchmark Improvement
Stable1e4 10% (from 110 ms to 100 ms)
Stable1e6 31% (from 23 s to 16 s)
Stable1e7 36% (from 1.9 min to 1.2 min)
Stable1e8 43% (from 64 min to 36 min)

The code in master is faster for input arrays with very many identical elements, but those are fast to sort anyway (in both the old and the new code they are orders of magnitude faster to sort than random data).
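
(For anyone wanting to reproduce numbers of this shape outside the sort_test.go harness, here is a rough, hypothetical benchmark in the same spirit; it is not the actual func bench, just a sketch that sorts slices of pseudorandom ints with sizes from size-3 to size+3:)

    package sort_test

    import (
        "math/rand"
        "sort"
        "testing"
    )

    // benchmarkStable is a rough stand-in for func bench: each iteration
    // sorts seven slices of pseudorandom ints whose sizes range from
    // size-3 to size+3 inclusive.
    func benchmarkStable(b *testing.B, size int) {
        b.StopTimer()
        for i := 0; i < b.N; i++ {
            for n := size - 3; n <= size+3; n++ {
                data := make([]int, n)
                r := rand.New(rand.NewSource(int64(n)))
                for j := range data {
                    data[j] = r.Intn(n)
                }
                b.StartTimer()
                sort.Stable(sort.IntSlice(data))
                b.StopTimer()
            }
        }
    }

    func BenchmarkStable1e4(b *testing.B) { benchmarkStable(b, 1e4) }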

@gopherbot gopherbot added this to the Proposal milestone Apr 27, 2018
@ianlancetaylor
Contributor

What are the effects on memory allocation? That is, is any extra memory allocated during the sort?

@nsajko
Contributor Author

nsajko commented Apr 27, 2018

What are the effects on memory allocation? That is, is any extra memory allocated during the sort?

There is an array of 5 * 256 uint8 (the constant is tunable if needed), and a logarithmic amount of stack space will be used if there are calls to symMerge or quickSort, because they are recursive (those two already exist in sort.go in master). Neither of those is essential to the algorithm; both were added as optimizations.

@ancaemanuel

ancaemanuel commented Apr 27, 2018

Why would somebody need this over the current standard library implementation?

Is it better in the general random case? Numbers, please.
Is it better in some particular case? And why is that case important?

@nsajko
Contributor Author

nsajko commented Apr 28, 2018

Is it better in the general random case? Numbers, please.

Yeah, look under "Benchmark results" in my first post here; I do not know how it could be missed. For more details see the linked CL on Go's Gerrit.

EDIT: sorry, I failed to realize that not everybody is familiar with sort_test.go: the benchmark results posted here (produced by func bench) are from sorting pseudorandom data.

@nsajko
Contributor Author

nsajko commented Apr 30, 2018

It is interesting to note that with gcc 7.3.1 my code beats the libstdc++ stable_sort, when it is limited to sublinear space/memory usage, starting from about array size 1e8! I roughly ported func bench from sort_test.go to C++, and when compiled with c++ c++StableSortBench.cc the C++ program takes 38 minutes/2277 seconds (at 1.2 GHz CPU frequency) to sort the arrays as in func bench, while my Go code takes just 36 minutes/2189 seconds.

(To clarify, both func bench and my C++ "port" sort seven arrays filled with pseudorandom keys and sizes ranging from 1e8-3 to 1e8+3 inclusive. The durations are the sums from sorting all seven arrays.)

I guess that means libstdc++ also uses an O(n*log(n)*log(n)) algorithm ...

NB: when compiled with compiler optimizations turned on, the C++ code is 4.7 times faster than when they are turned off (which is the default); thus much larger array sizes would be needed for my Go code to beat that.

See the C++ code here: https://github.com/nsajko/cxxStableSortBench

@ianlancetaylor
Contributor

Is this the same as the suggested algorithm in #7610?

@nsajko
Contributor Author

nsajko commented Apr 30, 2018

Is this the same as the suggested algorithm in #7610?

Yes and no.

#7610 proposes using the 2008 merging algorithm by Kim and Kutzner, which is the basis of my code. On the other hand, #7610 then links to Wikisort, which is not the same as my sort and in fact uses a substantially modified version of the merging algorithm from Kim and Kutzner.

@griesemer
Contributor

I'm copying from the CL description:

The new implementation is a merge sort with the merge based on Kim, P.S.
and Kutzner, A., 2008, April. Ratio based stable in-place merging. In
International Conference on Theory and Applications of Models of
Computation (pp. 246-257). Springer, Berlin, Heidelberg.

Having an asymptotically better stable sort seems like a good thing. At the same time there's a significant amount of non-trivial new code (~1000 lines + generated code) that needs thorough vetting.

@nsajko Is there anyone besides you who could help validate/review your implementation and confirm that it matches the above paper? Ideally we need another expert in sorting algorithms. Perhaps @vdobler?

@nsajko
Contributor Author

nsajko commented May 1, 2018

@nsajko Is there anyone besides you who could help validate/review your implementation and confirm that it matches the above paper? Ideally we need another expert in sorting algorithms. Perhaps @vdobler?

Sadly, I do not know anybody who could help. It would certainly be appreciated if @vdobler or anybody else could review the code.

@ancaemanuel

OK. The complexity of the algorithm looks good, and citing papers increases the confidence that you are doing a great job, but is the amount of code added really worth it?
I think in some cases it will be. But it is not my decision.

@vdobler
Contributor

vdobler commented May 2, 2018

I'll try to review it, but it is really difficult:

  • The main merging loop is simply too complex: it has too many live variables, with overly cryptic names that differ from the paper.
  • The implementation is not just a direct implementation of StableOptimalBlockMerge from Kim&Kutzner; it contains several (presumably necessary) optimizations.
  • The current CL uses SymMerge instead of Hwang&Lin, which makes the code much simpler, but I think this invalidates the complexity analysis of the paper, or at least requires work to re-prove it.

The first two points might pose a fundamental problem:
A CL including all the necessary optimizations will be un-reviewable, while a straight (and thus digestible) implementation of the algorithm will not show much (if any!) performance gain and thus won't get merged.

The current code of https://go-review.googlesource.com/c/go/+/101415/41
is neither maintainable nor production ready. I'm very interested in having an
asymptotically better Stable in the standard library and I'm going to invest
significant time in understanding Kim&Kutzner's paper as well as the implementation.

@griesemer
Contributor

@vdobler Thank you for offering to review!

One option to make progress, perhaps starting with a straight implementation of the algorithm that can be more easily reviewed and vetted as correct, would be to have this as a second implementation alongside what we have now, with an appropriately documented (constant) flag to enable/disable it. This would permit checking in code that is disabled by default, but that can nevertheless be reviewed and tested on individual machines simply by enabling the flag. Eventually, we will be able to switch the flag.
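
For illustration only, that could take a shape like the following (stableBlockMerge is a made-up name for the new entry point, not something in the CL):

    // A package-level constant selects the implementation; the new code
    // is checked in but stays disabled until it has been reviewed and vetted.
    const useBlockMergeStable = false

    func Stable(data Interface) {
        if useBlockMergeStable {
            stableBlockMerge(data, data.Len()) // hypothetical new O(n*log(n)) implementation
            return
        }
        stable(data, data.Len()) // current implementation
    }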

But before we go there I'd like to bring this up in the proposal review committee and solicit other opinions. For one, it's not clear to me that this should be a proposal in the first place.

Either way, this is certainly not for 1.11 (maybe for 1.12) so there is quite a bit of time here.

@nsajko
Contributor Author

nsajko commented May 2, 2018

Thanks for your criticisms, @vdobler; I will try to address just some of your points for now.

Before addressing your points: I just realized that I seemingly failed to mention one difference between the Kim&Kutzner paper and my code. K&K put the BDS and MIB on low-index edges of arrays, while I put them on high-index edges, and with that also come other changes, like (in func merge) rolling the subblocks of the high-index block instead of the low-index block, and a lot of index variables pointing to high-index edges where they point to low-index edges in the paper. I will document that in the next PS.

The main merging loop is simply too complex: it has too many live variables, with overly cryptic names that differ from the paper.

Do you mean the "Local merges" part in func merge or the entirety of func merge?

The implementation is not just a direct implementation of StableOptimalBlockMerge from Kim&Kutzner; it contains several (presumably necessary) optimizations.

Presumably you mean the "outOfInputData" MIB and BDS which cause func merge to take two Interface parameters?

The current CL uses SymMerge instead of Hwang&Lin, which makes the code much simpler, but I think this invalidates the complexity analysis of the paper, or at least requires work to re-prove it.

In general symMerge is (for sorted arrays u and v and m == |u| == |v|) O(m * log(m)), while Hwang&Lin (the version based on rotations, without a workspace buffer) is O(m * m) in general, with the "extended" version described in the paper being O(λ * m) (see Lemma 3 in the paper), where λ is the number of distinct elements in one of u or v; but AFAIU λ equals m-1 in the worst case when we use the algorithm for local merges (in func merge), making it O(m * m). O(m * m) is asymptotically greater than O(m * log(m)), so using symMerge instead of Hwang and Lin does not invalidate the complexity analysis. I hope I did not get something wrong.

(Sidenote: please take a look at Theorem 1 in the paper. It proves that the merge takes O(n) assignments, but the reasoning for the case where the rotation-based variant of Hwang&Lin should be used for local merges is left to the reader to prove at the end, with a pointer to Lemma 3. To be honest, it seems to me that the proof does not even work for that case, i.e. that the merge is O(n * sqrt(n)) in that case instead of O(n). So PTAL at the end of Theorem 1.)
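
In symbols (my own restatement, for sorted runs u and v with m = |u| = |v|):

    O(m \log m) \;\subseteq\; O(\lambda m)\big|_{\lambda = m-1} = O(m^2)

so the symMerge bound is contained in the worst-case bound of the rotation-based Hwang&Lin used in the paper's analysis.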

A CL including all the necessary optimizations will be un-reviewable, while a straight (and thus digestible) implementation of the algorithm will not show much (if any!) performance gain and thus won't get merged.

Actually, I think the outOfInputData-buffer optimization should not impact the performance of sorting large arrays (like 1e8). But I do not think removing that would simplify the code very much.

@vdobler
Contributor

vdobler commented May 2, 2018

@nsajko

K&K put the BDS and MIB on low-index edges of arrays, while I put them on high-index edges

Why? Is there a technical reason? The algorithm is complex and not widely known,
and the diagrams from the paper are helpful, but only if they match the
code.

Do you mean the "Local merges" part in func merge or the entirety of func merge?

The whole of it and any part of it. Let me explain what I would consider reviewable:

  • Merge calling a function stableOptimalBlockMerge that implements Stable-Optimal-Block-Merge from the paper.
  • stableOptimalBlockMerge calling functions, each of which implements one of the five steps of Stable-Optimal-Block-Merge from the paper.
  • Grouping all the MIB-related stuff (wherever it is located, be it part of the data or "outOfInput") into a struct with appropriate methods.
  • The same for the BDS: one BDS struct instead of dozens of integer variables (roughly as sketched below).
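
Something like the following, perhaps; all names and signatures here are made up just to show the shape I have in mind, and only some of the paper's steps are represented:

    // mib wraps everything about the movement imitation buffer; whether it
    // lives inside the input data or in a separate out-of-input buffer is
    // hidden behind its methods.
    type mib struct{ /* ... */ }

    // bds does the same for the block distribution storage.
    type bds struct{ /* ... */ }

    // stableOptimalBlockMerge corresponds to Stable-Optimal-Block-Merge in
    // the paper, with each step in its own function.
    func stableOptimalBlockMerge(data Interface, a, m, b int) {
        var mi mib
        var bd bds
        extractBuffers(data, a, m, b, &mi, &bd) // set up the MIB and BDS
        rearrangeBlocks(data, a, m, b, &mi)     // "Block rearrangement" step
        mergeLocally(data, a, m, b, &bd)        // "Local merges" step
        restoreBuffers(data, a, m, b, &mi, &bd) // undo the bookkeeping, sort the buffers back in
    }

    // Step bodies omitted; only the decomposition matters here.
    func extractBuffers(data Interface, a, m, b int, mi *mib, bd *bds) {}
    func rearrangeBlocks(data Interface, a, m, b int, mi *mib)         {}
    func mergeLocally(data Interface, a, m, b int, bd *bds)            {}
    func restoreBuffers(data Interface, a, m, b int, mi *mib, bd *bds) {}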

Presumably you mean the "outOfInputData" MIB and BDS which cause func merge to take two Interface parameters?

Yes, and every comment of the form "as an optimization...".
But this is not the place for detailed code review.
Please understand that I do have problems reviewing the code in its current form.

We need to agree on nomenclature. I think your comment is based on the specialization
of the paper to m = n, i.e. k = 1, and that the big-Os you give are for the number of assignments?
Doing the math will take me some time.

Actually, I think the outOfInputData-buffer optimization should not impact the performance of sorting large arrays (like 1e8). But I do not think removing that would simplify the code very much.

While reading the paper and your code I thought it might make the code
much simpler if there were an (opaque) movementImitationBuffer which does
its work, with "out-of-input" vs. "MIB as part of the data" being an internal detail.
In a first, simplified algorithm one could always use a dynamically allocated out-of-input
MIB, add a data-based MIB in a second step, and a mixture of
data-based and out-of-input MIB in a third step. All nicely encapsulated in
a type hiding the details, roughly like the sketch below.
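
As a sketch of the kind of encapsulation I mean (again, names made up; the dynamically allocated variant would be the first, simplified step):

    // movementImitationBuffer hides where the buffer actually lives.
    type movementImitationBuffer interface {
        Len() int
        Less(i, j int) bool
        Swap(i, j int)
    }

    // outOfInputMIB is the dynamically allocated out-of-input variant.
    type outOfInputMIB struct{ perm []int }

    func newOutOfInputMIB(n int) *outOfInputMIB {
        p := make([]int, n)
        for i := range p {
            p[i] = i
        }
        return &outOfInputMIB{perm: p}
    }

    func (b *outOfInputMIB) Len() int           { return len(b.perm) }
    func (b *outOfInputMIB) Less(i, j int) bool { return b.perm[i] < b.perm[j] }
    func (b *outOfInputMIB) Swap(i, j int)      { b.perm[i], b.perm[j] = b.perm[j], b.perm[i] }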

@vdobler
Contributor

vdobler commented May 3, 2018

@griesemer That seems like a very reasonable plan.
I'm curious about the review committee's opinion.

@nsajko
Contributor Author

nsajko commented May 3, 2018

@vdobler

K&K put the BDS and MIB on low-index edges of arrays, while I put them on high-index edges

Why? Is there a technical reason?

The natural way to implement a merge sort is to start from the low indices, possibly leaving an undersized block at the high-index end. The reason is to merge this undersized block more efficiently.

  • Merge calling a function stableOptimalBlockMerge that implements Stable-Optimal-Block-Merge from the paper.
  • stableOptimalBlockMerge calling functions, each of which implements one of the five steps of Stable-Optimal-Block-Merge from the paper.

func merge only implements steps 3 (Block rearrangement) and 4 (Local merges), with the corresponding code blocks marked with comments. I could separate most of the code of func merge into two separate functions if you really want, but I think it is simpler as it is now, because the integer variables can be used throughout the code instead of having to be recomputed.

  • Grouping all the MIB-related stuff (wherever it is located, be it part of the data or "outOfInput") into a struct with appropriate methods.
  • The same for the BDS: one BDS struct instead of dozens of integer variables.

But the paper also uses integer variables ...

While reading the paper and your code I thought it might make the code
much simpler if there were an (opaque) movementImitationBuffer which does
its work, with "out-of-input" vs. "MIB as part of the data" being an internal detail.

But that is what we already have in my code: outOfInputDataMIBuffer implements Interface and func merge does not ever know if aux Interface is internal MIB and BDS, or the outOfInputData variants.

@vdobler
Contributor

vdobler commented May 3, 2018

@nsajko

The reason is to merge this undersized block more efficiently.

That's a fine explanation and a valid reason. Now this must find its way into
the description of the algorithm.

For the rest: it seems I am totally inept at explaining the problem with
the code of patchset 41. Let me try to be very clear: it doesn't matter
if you find the code easy, simple, efficient and handy if I don't understand
it during review. Of course the paper uses integer indices, but there
are literally dozens of variables like bds0, bds1, backupBDS0, backupBDS1,
buf, bufLastDiEl, and of course I could make an educated guess at what
bds0 and bds1 are and see if the code matches my guess, but that's
simply not how code should look.

If we are following Robert's idea of iteratively building up the new algorithm
in small and reviewable steps, you will have to start over and come up
with code digestible for the ones reviewing it.

But that is what we already have in my code: outOfInputDataMIBuffer implements Interface and func merge does not ever know if aux Interface is internal MIB and BDS, or the outOfInputData variants.

Only partly: aux cannot live alone and does not abstract the whole
"keep track of what we have to undo later": all the indices have to be
passed and manipulated. The aux idea is clever and I think it might
be a good idea to take this idea further so that the code becomes simpler
for all who didn't write it.

@nsajko
Contributor Author

nsajko commented May 3, 2018

Please continue with code review here: nsajko/go#1

Also, all issues on that repo will be relevant: https://github.com/nsajko/go/issues

@rsc
Contributor

rsc commented May 7, 2018

Putting on hold until the code has been reviewed and cleaned up. Part of the decision here will be informed by how much complexity we are taking on, so it makes sense to wait to evaluate that until the code is as clean as it can be made.

@rsc
Contributor

rsc commented Nov 4, 2020

It's been a couple years and the CL has not been updated to try to clean it up. It's an enormous amount of code and in general there aren't many simple algorithms for this. It seems like we should move this to likely decline.

@nsajko
Contributor Author

nsajko commented Nov 4, 2020

Sorry for keeping this open. Yes, the code could be cleaned up, but there were other issues too.
During the review process, while trying to verify a case in a proof in Kim & Kutzner's paper that was left to the reader to prove, I got a strong suspicion that their algorithm actually has worse complexity than advertised in the case of arrays consisting of many identical elements. (It is of course possible that there was an error in my calculation and K&K's claims indeed hold; I haven't revisited the proof or tried to contact K&K yet.)

Also, I have since realized that standard libraries are not something that benefits from this kind of algorithm (even if its asymptotic complexity is as advertised). If someone really needs the extra performance for their specific distribution of inputs (very large ones, in this case), they wouldn't stay with the standard library anyway (in Go's case possibly because sort.Interface keeps implementations from using many nice efficiency tricks that are possible when sorting arrays, but this applies to standard libraries in general too); and on the other hand, nobody wants this much extra code in the standard library when it would benefit only a small number of users.

TBH, I think I was overly enamoured with the asymptotic complexity of algorithms in general back then, as opposed to real-world performance, and with this algorithm in particular, because it promised a worst-case complexity that a lot of people I interacted with back then actually thought was impossible for an in-place stable sort. That said, there surely are cases in which this algorithm could be useful, so anybody who is interested in it, please contact me; maybe I will do something about it sometime.

@nsajko nsajko closed this as completed Nov 4, 2020
@rsc rsc moved this from Likely Decline to Declined in Proposals (old) Nov 11, 2020
@golang golang locked and limited conversation to collaborators Nov 11, 2021