Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

slices: use dual-pivot quicksort instead of pdqsort #61027

Open
PeterRK opened this issue Jun 27, 2023 · 12 comments
Open

slices: use dual-pivot quicksort instead of pdqsort #61027

PeterRK opened this issue Jun 27, 2023 · 12 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@PeterRK
Copy link

PeterRK commented Jun 27, 2023

Reinplement #52789 based on go 1.21rc2. Here is new benchmark result.

Xeon-8374C

name                        old time/op  new time/op  delta
SlicesSortInts-4            7.67ms ± 0%  6.23ms ± 0%  -18.82%  (p=0.000 n=10+10)
SlicesSortInts_Sorted-4      113µs ± 1%    88µs ± 2%  -22.31%  (p=0.000 n=10+10)
SlicesSortInts_Reversed-4    148µs ± 0%   126µs ± 1%  -15.04%  (p=0.000 n=10+9)
SlicesSortStrings-4         28.0ms ± 1%  27.3ms ± 4%   -2.35%  (p=0.022 n=9+10)
SlicesSortStrings_Sorted-4   628µs ± 1%   567µs ± 0%   -9.80%  (p=0.000 n=10+9)
SortFuncStructs-4           14.1ms ± 2%  13.4ms ± 2%   -5.53%  (p=0.000 n=10+10)

EPYC-7K83

name                        old time/op  new time/op  delta
SlicesSortInts-4            7.10ms ± 0%  5.69ms ± 0%  -19.82%  (p=0.000 n=9+10)
SlicesSortInts_Sorted-4     96.7µs ± 1%  90.8µs ± 1%   -6.06%  (p=0.000 n=10+10)
SlicesSortInts_Reversed-4    133µs ± 2%   113µs ± 2%  -15.20%  (p=0.000 n=10+10)
SlicesSortStrings-4         26.9ms ± 4%  26.0ms ± 2%   -3.36%  (p=0.000 n=10+9)
SlicesSortStrings_Sorted-4   658µs ± 1%   644µs ± 0%   -2.20%  (p=0.000 n=10+8)
SortFuncStructs-4           13.5ms ± 8%  12.9ms ± 2%   -4.42%  (p=0.002 n=10+8)

Dual-pivot quicksort usually has better performance than current pdqsort. It can be even faster with api change.

Xeon-8374C (X86-64)

name               std time/op  new time/op  delta
Int/Small-1K       27.7µs ± 2%  24.1µs ± 3%  -12.92%  (p=0.008 n=5+5)
Int/Small-10K       315µs ± 0%   249µs ± 0%  -20.96%  (p=0.008 n=5+5)
Int/Small-100K     3.76ms ± 0%  2.86ms ± 0%  -24.06%  (p=0.008 n=5+5)
Int/Small-1M       44.6ms ± 0%  35.4ms ± 0%  -20.57%  (p=0.008 n=5+5)
Int/Random-1K      48.4µs ± 1%  38.8µs ± 1%  -19.69%  (p=0.008 n=5+5)
Int/Random-10K      614µs ± 1%   464µs ± 0%  -24.44%  (p=0.008 n=5+5)
Int/Random-100K    7.58ms ± 0%  5.61ms ± 0%  -26.07%  (p=0.008 n=5+5)
Int/Random-1M      90.3ms ± 1%  65.8ms ± 0%  -27.08%  (p=0.008 n=5+5)
Int/Constant-1K    1.33µs ± 1%  0.93µs ± 1%  -30.13%  (p=0.008 n=5+5)
Int/Constant-10K   11.1µs ± 4%   8.0µs ± 2%  -27.36%  (p=0.008 n=5+5)
Int/Constant-100K  98.0µs ± 5%  67.0µs ± 2%  -31.61%  (p=0.008 n=5+5)
Int/Constant-1M     932µs ± 1%   646µs ± 0%  -30.72%  (p=0.008 n=5+5)
Int/Ascent-1K      1.33µs ± 1%  0.93µs ± 1%  -30.03%  (p=0.008 n=5+5)
Int/Ascent-10K     10.8µs ± 3%   8.0µs ± 3%  -26.21%  (p=0.008 n=5+5)
Int/Ascent-100K    99.1µs ± 5%  66.2µs ± 1%  -33.20%  (p=0.008 n=5+5)
Int/Ascent-1M       926µs ± 0%   646µs ± 0%  -30.19%  (p=0.008 n=5+5)
Int/Descent-1K     1.85µs ± 1%  1.46µs ± 0%  -21.19%  (p=0.008 n=5+5)
Int/Descent-10K    14.8µs ± 3%  12.0µs ± 6%  -18.43%  (p=0.008 n=5+5)
Int/Descent-100K    132µs ± 1%   104µs ± 2%  -21.46%  (p=0.008 n=5+5)
Int/Descent-1M     1.32ms ± 0%  1.04ms ± 1%  -21.37%  (p=0.008 n=5+5)
Int/Mixed-1K       19.9µs ± 3%  17.1µs ± 3%  -13.87%  (p=0.008 n=5+5)
Int/Mixed-10K       219µs ± 1%   231µs ± 0%   +5.53%  (p=0.008 n=5+5)
Int/Mixed-100K     2.54ms ± 0%  2.55ms ± 0%     ~     (p=0.310 n=5+5)
Int/Mixed-1M       29.5ms ± 0%  28.0ms ± 0%   -5.11%  (p=0.008 n=5+5)
Hybrid/5%          4.09ms ± 0%  3.06ms ± 0%  -25.07%  (p=0.008 n=5+5)
Hybrid/10%         7.12ms ± 0%  5.33ms ± 0%  -25.04%  (p=0.008 n=5+5)
Hybrid/20%         13.1ms ± 0%   9.9ms ± 0%  -24.73%  (p=0.008 n=5+5)
Hybrid/30%         19.2ms ± 1%  14.4ms ± 0%  -25.00%  (p=0.008 n=5+5)
Hybrid/50%         31.2ms ± 1%  23.5ms ± 0%  -24.70%  (p=0.008 n=5+5)
Float/1K           58.2µs ± 2%  48.6µs ± 1%  -16.58%  (p=0.008 n=5+5)
Float/10K           751µs ± 0%   549µs ± 0%  -26.80%  (p=0.008 n=5+5)
Float/100K         9.28ms ± 0%  6.41ms ± 0%  -30.91%  (p=0.008 n=5+5)
Float/1M            110ms ± 0%    73ms ± 0%  -33.44%  (p=0.008 n=5+5)
Str/1K              130µs ± 0%   125µs ± 0%   -4.00%  (p=0.008 n=5+5)
Str/10K            1.69ms ± 0%  1.64ms ± 0%   -2.73%  (p=0.008 n=5+5)
Str/100K           21.6ms ± 1%  21.1ms ± 1%   -2.41%  (p=0.008 n=5+5)
Str/1M              272ms ± 1%   269ms ± 2%     ~     (p=0.095 n=5+5)
Struct/1K           104µs ± 1%    86µs ± 0%  -17.34%  (p=0.008 n=5+5)
Struct/10K         1.36ms ± 0%  1.07ms ± 0%  -20.93%  (p=0.008 n=5+5)
Struct/100K        16.8ms ± 0%  13.0ms ± 0%  -22.48%  (p=0.008 n=5+5)
Struct/1M           201ms ± 0%   152ms ± 0%  -24.21%  (p=0.008 n=5+5)
Stable/1K           183µs ± 1%    84µs ± 0%  -53.88%  (p=0.008 n=5+5)
Stable/10K         2.71ms ± 0%  1.14ms ± 0%  -58.16%  (p=0.008 n=5+5)
Stable/100K        37.9ms ± 0%  17.4ms ± 0%  -54.15%  (p=0.016 n=5+4)
Stable/1M           498ms ± 1%   175ms ± 1%  -64.74%  (p=0.008 n=5+5)
Pointer/1K         78.7µs ± 1%  66.1µs ± 1%  -15.96%  (p=0.008 n=5+5)
Pointer/10K        1.05ms ± 0%  0.90ms ± 0%  -14.87%  (p=0.008 n=5+5)
Pointer/100K       14.0ms ± 0%  12.1ms ± 0%  -13.52%  (p=0.008 n=5+5)
Pointer/1M          171ms ± 0%   159ms ± 0%   -6.82%  (p=0.008 n=5+5)

Yitian-710 (ARM64)

name               std time/op  new time/op  delta
Int/Small-1K       21.4µs ± 0%  17.4µs ± 0%  -18.95%  (p=0.008 n=5+5)
Int/Small-10K       253µs ± 0%   207µs ± 0%  -18.18%  (p=0.008 n=5+5)
Int/Small-100K     3.02ms ± 0%  2.47ms ± 0%  -18.42%  (p=0.008 n=5+5)
Int/Small-1M       35.7ms ± 0%  28.8ms ± 0%  -19.54%  (p=0.008 n=5+5)
Int/Random-1K      37.7µs ± 0%  27.6µs ± 0%  -26.84%  (p=0.008 n=5+5)
Int/Random-10K      492µs ± 0%   362µs ± 0%  -26.46%  (p=0.008 n=5+5)
Int/Random-100K    6.09ms ± 0%  4.51ms ± 0%  -25.96%  (p=0.008 n=5+5)
Int/Random-1M      72.5ms ± 0%  53.9ms ± 0%  -25.69%  (p=0.008 n=5+5)
Int/Constant-1K    1.08µs ± 1%  0.70µs ± 0%  -35.23%  (p=0.016 n=5+4)
Int/Constant-10K   9.60µs ± 0%  6.55µs ± 0%  -31.77%  (p=0.008 n=5+5)
Int/Constant-100K  93.1µs ± 0%  63.5µs ± 1%  -31.78%  (p=0.008 n=5+5)
Int/Constant-1M     929µs ± 0%   640µs ± 1%  -31.09%  (p=0.008 n=5+5)
Int/Ascent-1K      1.08µs ± 1%  0.71µs ± 4%  -34.36%  (p=0.008 n=5+5)
Int/Ascent-10K     9.61µs ± 0%  6.57µs ± 0%  -31.57%  (p=0.008 n=5+5)
Int/Ascent-100K    93.1µs ± 0%  63.1µs ± 0%  -32.20%  (p=0.008 n=5+5)
Int/Ascent-1M       929µs ± 0%   639µs ± 1%  -31.27%  (p=0.008 n=5+5)
Int/Descent-1K     1.45µs ± 0%  1.07µs ± 1%  -26.66%  (p=0.008 n=5+5)
Int/Descent-10K    13.2µs ± 0%   9.9µs ± 0%  -25.07%  (p=0.016 n=5+4)
Int/Descent-100K    130µs ± 0%   100µs ± 0%  -23.09%  (p=0.008 n=5+5)
Int/Descent-1M     1.30ms ± 0%  1.02ms ± 1%  -22.02%  (p=0.008 n=5+5)
Int/Mixed-1K       16.3µs ± 4%  12.1µs ± 0%  -25.65%  (p=0.008 n=5+5)
Int/Mixed-10K       187µs ± 0%   150µs ± 0%  -19.82%  (p=0.008 n=5+5)
Int/Mixed-100K     2.20ms ± 0%  1.77ms ± 0%  -19.24%  (p=0.008 n=5+5)
Int/Mixed-1M       25.3ms ± 0%  20.5ms ± 0%  -19.21%  (p=0.008 n=5+5)
Hybrid/5%          3.50ms ± 0%  2.59ms ± 0%  -26.14%  (p=0.008 n=5+5)
Hybrid/10%         5.92ms ± 0%  4.36ms ± 0%  -26.33%  (p=0.008 n=5+5)
Hybrid/20%         10.8ms ± 0%   7.9ms ± 0%  -26.48%  (p=0.008 n=5+5)
Hybrid/30%         15.6ms ± 0%  11.5ms ± 0%  -26.43%  (p=0.008 n=5+5)
Hybrid/50%         25.3ms ± 0%  18.6ms ± 0%  -26.49%  (p=0.008 n=5+5)
Float/1K           46.1µs ± 0%  35.6µs ± 0%  -22.75%  (p=0.008 n=5+5)
Float/10K           606µs ± 0%   467µs ± 0%  -22.85%  (p=0.008 n=5+5)
Float/100K         7.51ms ± 0%  5.79ms ± 0%  -22.83%  (p=0.008 n=5+5)
Float/1M           89.5ms ± 0%  69.0ms ± 0%  -22.84%  (p=0.008 n=5+5)
Str/1K              124µs ± 0%   119µs ± 0%   -4.40%  (p=0.008 n=5+5)
Str/10K            1.54ms ± 0%  1.48ms ± 0%   -3.70%  (p=0.008 n=5+5)
Str/100K           20.1ms ± 1%  19.6ms ± 0%   -2.37%  (p=0.008 n=5+5)
Str/1M              290ms ± 1%   286ms ± 1%   -1.10%  (p=0.032 n=5+5)
Struct/1K           100µs ± 0%    85µs ± 0%  -14.55%  (p=0.008 n=5+5)
Struct/10K         1.32ms ± 0%  1.07ms ± 0%  -19.19%  (p=0.008 n=5+5)
Struct/100K        16.4ms ± 0%  12.7ms ± 0%  -22.38%  (p=0.008 n=5+5)
Struct/1M           196ms ± 0%   141ms ± 2%  -28.06%  (p=0.008 n=5+5)
Stable/1K           186µs ± 0%    73µs ± 0%  -60.49%  (p=0.008 n=5+5)
Stable/10K         2.84ms ± 0%  1.22ms ± 0%  -57.16%  (p=0.008 n=5+5)
Stable/100K        40.5ms ± 0%  14.7ms ± 0%  -63.75%  (p=0.008 n=5+5)
Stable/1M           534ms ± 0%   163ms ± 0%  -69.43%  (p=0.016 n=5+4)
Pointer/1K         63.7µs ± 0%  55.3µs ± 0%  -13.18%  (p=0.008 n=5+5)
Pointer/10K         873µs ± 0%   775µs ± 1%  -11.23%  (p=0.008 n=5+5)
Pointer/100K       12.4ms ± 0%  11.3ms ± 0%   -9.58%  (p=0.008 n=5+5)
Pointer/1M          159ms ± 0%   147ms ± 0%   -7.59%  (p=0.008 n=5+5)

By the way, this new implement is also shorter and simpler.

PeterRK added a commit to PeterRK/go that referenced this issue Jun 27, 2023
Dual-pivot quicksort has smaller recursion depth.
Some pattern-defeating technique is added to make this variant.

Replace standalone rotate function in symmerge with rotateLeft.

Benchmark on Xeon-8374C
name                      old time/op  new time/op  delta
SlicesSortInts            7.67ms ± 0%  6.23ms ± 0%  -18.82%  (p=0.000 n=10+10)
SlicesSortInts_Sorted      113µs ± 1%    88µs ± 2%  -22.31%  (p=0.000 n=10+10)
SlicesSortInts_Reversed    148µs ± 0%   126µs ± 1%  -15.04%  (p=0.000 n=10+9)
SlicesSortStrings         28.0ms ± 1%  27.3ms ± 4%   -2.35%  (p=0.022 n=9+10)
SlicesSortStrings_Sorted   628µs ± 1%   567µs ± 0%   -9.80%  (p=0.000 n=10+9)
SortFuncStructs           14.1ms ± 2%  13.4ms ± 2%   -5.53%  (p=0.000 n=10+10)
SortStableFuncStructs     32.2ms ± 2%  33.3ms ± 1%   +3.43%  (p=0.000 n=9+10)

Benchmark on EPYC-7K83
name                      old time/op  new time/op  delta
SlicesSortInts            7.10ms ± 0%  5.69ms ± 0%  -19.82%  (p=0.000 n=9+10)
SlicesSortInts_Sorted     96.7µs ± 1%  90.8µs ± 1%   -6.06%  (p=0.000 n=10+10)
SlicesSortInts_Reversed    133µs ± 2%   113µs ± 2%  -15.20%  (p=0.000 n=10+10)
SlicesSortStrings         26.9ms ± 4%  26.0ms ± 2%   -3.36%  (p=0.000 n=10+9)
SlicesSortStrings_Sorted   658µs ± 1%   644µs ± 0%   -2.20%  (p=0.000 n=10+8)
SortFuncStructs           13.5ms ± 8%  12.9ms ± 2%   -4.42%  (p=0.002 n=10+8)
SortStableFuncStructs     35.5ms ± 7%  31.6ms ± 7%  -11.18%  (p=0.000 n=9+10)

For golang#61027
@ianlancetaylor
Copy link
Contributor

CC @eliben

@ianlancetaylor ianlancetaylor added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. and removed Proposal labels Jun 27, 2023
@ianlancetaylor ianlancetaylor added this to the Backlog milestone Jun 27, 2023
@rsc
Copy link
Contributor

rsc commented Jun 27, 2023

I think we already used this decade's "change to a new sort algorithm" coupon.

@eliben
Copy link
Member

eliben commented Jun 27, 2023

cc @zhangyunhao116

@PeterRK
Copy link
Author

PeterRK commented Jun 28, 2023

I think we already used this decade's "change to a new sort algorithm" coupon.

Dual-pivot quicksort is not a new sort algorithm. It can be found in Java standard library many years ago. New algorithm like pdqsort is not always best.

@rsc
Copy link
Contributor

rsc commented Jun 28, 2023

My point is we just put the ecosystem through the churn of a sort change in Go 1.19 (2022). I don't see a strong argument that slices.Sort is somehow fundamentally different from sort.Slice and merits a different algorithm, nor do I see a strong argument for invalidating our decision just last year to switch to pdqsort. We could spend lots of time flip-flopping between different sort algorithms in the standard library, or we could do other work. Having just changed the sort algorithm last year, I think we owe it to ourselves and our users not to change it again for at least five years (that is, Go 1.29 at the earliest) unless we have very good reasons.

We should definitely not make a change for Go 1.21 - it is too late in the cycle.

@PeterRK
Copy link
Author

PeterRK commented Jun 28, 2023

My point is we just put the ecosystem through the churn of a sort change in Go 1.19 (2022). I don't see a strong argument that slices.Sort is somehow fundamentally different from sort.Slice and merits a different algorithm, nor do I see a strong argument for invalidating our decision just last year to switch to pdqsort. We could spend lots of time flip-flopping between different sort algorithms in the standard library, or we could do other work. Having just changed the sort algorithm last year, I think we owe it to ourselves and our users not to change it again for at least five years (that is, Go 1.29 at the earliest) unless we have very good reasons.

We should definitely not make a change for Go 1.21 - it is too late in the cycle.

I agree that it’s not a good idea to make changes frequently. People now can use different sort algorithms with third-party libraries. Just let it fly.

@eliben
Copy link
Member

eliben commented Jun 28, 2023

To add to @rsc 's point, we're also putting the community through a "sorting change" right now, in the upcoming Go 1.21, where the slices package with its sorting functionality is entirely new. It provides better performance than the sort package in many cases, so we're going to recommend Go users to shift to it, if possible. This is a fairly major change.

pdqsort was a large performance improvement for some cases. The slices.Sort (vs. sort) is also a significant speedup. Compared to these, the quoted improvements with dual-pivot quicksort are relatively modest (except sorting a slice of ints). So I definitely agree that we should give the community time to settle with the recent changes, and if people require the absolutely fastest sort of an int slice, they can use a 3rd party package.

@PeterRK
Copy link
Author

PeterRK commented Jun 28, 2023

To add to @rsc 's point, we're also putting the community through a "sorting change" right now, in the upcoming Go 1.21, where the slices package with its sorting functionality is entirely new. It provides better performance than the sort package in many cases, so we're going to recommend Go users to shift to it, if possible. This is a fairly major change.

pdqsort was a large performance improvement for some cases. The slices.Sort (vs. sort) is also a significant speedup. Compared to these, the quoted improvements with dual-pivot quicksort are relatively modest (except sorting a slice of ints). So I definitely agree that we should give the community time to settle with the recent changes, and if people require the absolutely fastest sort of an int slice, they can use a 3rd party package.

Patern-defeating techique just makes fastest cases faster. It helps little in real hybrid workload. New apis are needed to achieve significant speedup. But I don't think it's worthy to change standard library apis for speed now.

@PeterRK
Copy link
Author

PeterRK commented Jun 28, 2023

I didn't really care which sort algorithm is used in go standard library until someone claimed pdqsort is fastest algorithm in Go. That's not the truth, but got credit from go tream. @eliben @rsc

@zhangyunhao116
Copy link
Contributor

@PeterRK

I'm the author of this article, and I don't think I claimed that "PDQSORT is the fastest sorting algorithm in Go".

We can see the summary of this article(translated from Chinese directly):

At present, most of the unstable sorting algorithms used in the industry have basically changed from a single sorting algorithm in textbooks to a hybrid sorting algorithm to deal with various sequences in practical scenarios.

Relying on its performance advantages in common scenarios compared with previous algorithms, pdqsort has gradually become the mainstream implementation of unstable sorting algorithms. Based on the generics brought by Go1.18, the implementation of sorting algorithms is greatly simplified, and it also gives us the possibility to implement new algorithms. But pdqsort is not a panacea. In some cases, other algorithms still maintain advantages (for example, timsort of the Python standard library is better than pdqsort in mixed ascending && descending scenarios). However, in most cases, pdqsort has become a better choice for unstable algorithms depending on its specific optimization for different situations.

The title of this article is "Building the fastest sorting algorithm"(it is ongoing), PDQSORT is just an important milestone on the road to building the fastest sorting algorithm. There are still many other languages or libraries that use this algorithm, and I still haven't found any algorithm that is faster than PDQSORT in all cases for now. PDQSORT is still one of the fastest hybrid sorting algorithms in some cases.

In addition, I am not the author of the PDQSORT algorithm itself. This implementation is first used internally at my work and then contributed to the community.

This article is more to let everyone understand the principles of the algorithm (in fact, this article is part of a course for college students within the company), and to attract more people to participate in community building, if you have a better idea for sorting algorithms, feel free.

@PeterRK
Copy link
Author

PeterRK commented Jun 29, 2023

Pattern-defeating technique can be found in timsort. Branch-eliminating technique comes from blockquicksort. Cache-exploiting technique comes from dual-pivot quicksort.
Pdqsort trys to combine those techniques, but fail to handle the conflict between branch-eliminating technique and cache-exploiting technique. The author of pdqsort explains in paper why he choose branch-eliminating technique after trade-off. Cache-exploiting technique should be used in where branch-eliminating technique doesn’t work.
Implement in go standard library distorts the idea of pdqsort and exaggerate its power.

Most worryingly, go team seems to support this mistake.

@zhangyunhao116
Copy link
Contributor

zhangyunhao116 commented Jun 29, 2023

In the process of implementation, we often need some engineering compromises, which means that the optimization of the paper wouldn't always work. For example, the implementation in the standard library is about 10%~ 20% slower than the original implementation I used. This is because it needs to support both interface-sorting and integer-sorting, and these algorithms need to be generated using the same file.

If the algorithm only needs to support integer sorting or doesn't need to be generated via the same file, many other optimizations can be used, but it doesn't mean these engineering compromises are a mistake.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

5 participants