slices: use dual-pivot quicksort instead of pdqsort #61027

PeterRK · 2023-06-27T13:33:58Z

Reinplement #52789 based on go 1.21rc2. Here is new benchmark result.

Xeon-8374C

name                        old time/op  new time/op  delta
SlicesSortInts-4            7.67ms ± 0%  6.23ms ± 0%  -18.82%  (p=0.000 n=10+10)
SlicesSortInts_Sorted-4      113µs ± 1%    88µs ± 2%  -22.31%  (p=0.000 n=10+10)
SlicesSortInts_Reversed-4    148µs ± 0%   126µs ± 1%  -15.04%  (p=0.000 n=10+9)
SlicesSortStrings-4         28.0ms ± 1%  27.3ms ± 4%   -2.35%  (p=0.022 n=9+10)
SlicesSortStrings_Sorted-4   628µs ± 1%   567µs ± 0%   -9.80%  (p=0.000 n=10+9)
SortFuncStructs-4           14.1ms ± 2%  13.4ms ± 2%   -5.53%  (p=0.000 n=10+10)

EPYC-7K83

name                        old time/op  new time/op  delta
SlicesSortInts-4            7.10ms ± 0%  5.69ms ± 0%  -19.82%  (p=0.000 n=9+10)
SlicesSortInts_Sorted-4     96.7µs ± 1%  90.8µs ± 1%   -6.06%  (p=0.000 n=10+10)
SlicesSortInts_Reversed-4    133µs ± 2%   113µs ± 2%  -15.20%  (p=0.000 n=10+10)
SlicesSortStrings-4         26.9ms ± 4%  26.0ms ± 2%   -3.36%  (p=0.000 n=10+9)
SlicesSortStrings_Sorted-4   658µs ± 1%   644µs ± 0%   -2.20%  (p=0.000 n=10+8)
SortFuncStructs-4           13.5ms ± 8%  12.9ms ± 2%   -4.42%  (p=0.002 n=10+8)

Dual-pivot quicksort usually has better performance than current pdqsort. It can be even faster with api change.

Xeon-8374C (X86-64)

name               std time/op  new time/op  delta
Int/Small-1K       27.7µs ± 2%  24.1µs ± 3%  -12.92%  (p=0.008 n=5+5)
Int/Small-10K       315µs ± 0%   249µs ± 0%  -20.96%  (p=0.008 n=5+5)
Int/Small-100K     3.76ms ± 0%  2.86ms ± 0%  -24.06%  (p=0.008 n=5+5)
Int/Small-1M       44.6ms ± 0%  35.4ms ± 0%  -20.57%  (p=0.008 n=5+5)
Int/Random-1K      48.4µs ± 1%  38.8µs ± 1%  -19.69%  (p=0.008 n=5+5)
Int/Random-10K      614µs ± 1%   464µs ± 0%  -24.44%  (p=0.008 n=5+5)
Int/Random-100K    7.58ms ± 0%  5.61ms ± 0%  -26.07%  (p=0.008 n=5+5)
Int/Random-1M      90.3ms ± 1%  65.8ms ± 0%  -27.08%  (p=0.008 n=5+5)
Int/Constant-1K    1.33µs ± 1%  0.93µs ± 1%  -30.13%  (p=0.008 n=5+5)
Int/Constant-10K   11.1µs ± 4%   8.0µs ± 2%  -27.36%  (p=0.008 n=5+5)
Int/Constant-100K  98.0µs ± 5%  67.0µs ± 2%  -31.61%  (p=0.008 n=5+5)
Int/Constant-1M     932µs ± 1%   646µs ± 0%  -30.72%  (p=0.008 n=5+5)
Int/Ascent-1K      1.33µs ± 1%  0.93µs ± 1%  -30.03%  (p=0.008 n=5+5)
Int/Ascent-10K     10.8µs ± 3%   8.0µs ± 3%  -26.21%  (p=0.008 n=5+5)
Int/Ascent-100K    99.1µs ± 5%  66.2µs ± 1%  -33.20%  (p=0.008 n=5+5)
Int/Ascent-1M       926µs ± 0%   646µs ± 0%  -30.19%  (p=0.008 n=5+5)
Int/Descent-1K     1.85µs ± 1%  1.46µs ± 0%  -21.19%  (p=0.008 n=5+5)
Int/Descent-10K    14.8µs ± 3%  12.0µs ± 6%  -18.43%  (p=0.008 n=5+5)
Int/Descent-100K    132µs ± 1%   104µs ± 2%  -21.46%  (p=0.008 n=5+5)
Int/Descent-1M     1.32ms ± 0%  1.04ms ± 1%  -21.37%  (p=0.008 n=5+5)
Int/Mixed-1K       19.9µs ± 3%  17.1µs ± 3%  -13.87%  (p=0.008 n=5+5)
Int/Mixed-10K       219µs ± 1%   231µs ± 0%   +5.53%  (p=0.008 n=5+5)
Int/Mixed-100K     2.54ms ± 0%  2.55ms ± 0%     ~     (p=0.310 n=5+5)
Int/Mixed-1M       29.5ms ± 0%  28.0ms ± 0%   -5.11%  (p=0.008 n=5+5)
Hybrid/5%          4.09ms ± 0%  3.06ms ± 0%  -25.07%  (p=0.008 n=5+5)
Hybrid/10%         7.12ms ± 0%  5.33ms ± 0%  -25.04%  (p=0.008 n=5+5)
Hybrid/20%         13.1ms ± 0%   9.9ms ± 0%  -24.73%  (p=0.008 n=5+5)
Hybrid/30%         19.2ms ± 1%  14.4ms ± 0%  -25.00%  (p=0.008 n=5+5)
Hybrid/50%         31.2ms ± 1%  23.5ms ± 0%  -24.70%  (p=0.008 n=5+5)
Float/1K           58.2µs ± 2%  48.6µs ± 1%  -16.58%  (p=0.008 n=5+5)
Float/10K           751µs ± 0%   549µs ± 0%  -26.80%  (p=0.008 n=5+5)
Float/100K         9.28ms ± 0%  6.41ms ± 0%  -30.91%  (p=0.008 n=5+5)
Float/1M            110ms ± 0%    73ms ± 0%  -33.44%  (p=0.008 n=5+5)
Str/1K              130µs ± 0%   125µs ± 0%   -4.00%  (p=0.008 n=5+5)
Str/10K            1.69ms ± 0%  1.64ms ± 0%   -2.73%  (p=0.008 n=5+5)
Str/100K           21.6ms ± 1%  21.1ms ± 1%   -2.41%  (p=0.008 n=5+5)
Str/1M              272ms ± 1%   269ms ± 2%     ~     (p=0.095 n=5+5)
Struct/1K           104µs ± 1%    86µs ± 0%  -17.34%  (p=0.008 n=5+5)
Struct/10K         1.36ms ± 0%  1.07ms ± 0%  -20.93%  (p=0.008 n=5+5)
Struct/100K        16.8ms ± 0%  13.0ms ± 0%  -22.48%  (p=0.008 n=5+5)
Struct/1M           201ms ± 0%   152ms ± 0%  -24.21%  (p=0.008 n=5+5)
Stable/1K           183µs ± 1%    84µs ± 0%  -53.88%  (p=0.008 n=5+5)
Stable/10K         2.71ms ± 0%  1.14ms ± 0%  -58.16%  (p=0.008 n=5+5)
Stable/100K        37.9ms ± 0%  17.4ms ± 0%  -54.15%  (p=0.016 n=5+4)
Stable/1M           498ms ± 1%   175ms ± 1%  -64.74%  (p=0.008 n=5+5)
Pointer/1K         78.7µs ± 1%  66.1µs ± 1%  -15.96%  (p=0.008 n=5+5)
Pointer/10K        1.05ms ± 0%  0.90ms ± 0%  -14.87%  (p=0.008 n=5+5)
Pointer/100K       14.0ms ± 0%  12.1ms ± 0%  -13.52%  (p=0.008 n=5+5)
Pointer/1M          171ms ± 0%   159ms ± 0%   -6.82%  (p=0.008 n=5+5)

Yitian-710 (ARM64)

name               std time/op  new time/op  delta
Int/Small-1K       21.4µs ± 0%  17.4µs ± 0%  -18.95%  (p=0.008 n=5+5)
Int/Small-10K       253µs ± 0%   207µs ± 0%  -18.18%  (p=0.008 n=5+5)
Int/Small-100K     3.02ms ± 0%  2.47ms ± 0%  -18.42%  (p=0.008 n=5+5)
Int/Small-1M       35.7ms ± 0%  28.8ms ± 0%  -19.54%  (p=0.008 n=5+5)
Int/Random-1K      37.7µs ± 0%  27.6µs ± 0%  -26.84%  (p=0.008 n=5+5)
Int/Random-10K      492µs ± 0%   362µs ± 0%  -26.46%  (p=0.008 n=5+5)
Int/Random-100K    6.09ms ± 0%  4.51ms ± 0%  -25.96%  (p=0.008 n=5+5)
Int/Random-1M      72.5ms ± 0%  53.9ms ± 0%  -25.69%  (p=0.008 n=5+5)
Int/Constant-1K    1.08µs ± 1%  0.70µs ± 0%  -35.23%  (p=0.016 n=5+4)
Int/Constant-10K   9.60µs ± 0%  6.55µs ± 0%  -31.77%  (p=0.008 n=5+5)
Int/Constant-100K  93.1µs ± 0%  63.5µs ± 1%  -31.78%  (p=0.008 n=5+5)
Int/Constant-1M     929µs ± 0%   640µs ± 1%  -31.09%  (p=0.008 n=5+5)
Int/Ascent-1K      1.08µs ± 1%  0.71µs ± 4%  -34.36%  (p=0.008 n=5+5)
Int/Ascent-10K     9.61µs ± 0%  6.57µs ± 0%  -31.57%  (p=0.008 n=5+5)
Int/Ascent-100K    93.1µs ± 0%  63.1µs ± 0%  -32.20%  (p=0.008 n=5+5)
Int/Ascent-1M       929µs ± 0%   639µs ± 1%  -31.27%  (p=0.008 n=5+5)
Int/Descent-1K     1.45µs ± 0%  1.07µs ± 1%  -26.66%  (p=0.008 n=5+5)
Int/Descent-10K    13.2µs ± 0%   9.9µs ± 0%  -25.07%  (p=0.016 n=5+4)
Int/Descent-100K    130µs ± 0%   100µs ± 0%  -23.09%  (p=0.008 n=5+5)
Int/Descent-1M     1.30ms ± 0%  1.02ms ± 1%  -22.02%  (p=0.008 n=5+5)
Int/Mixed-1K       16.3µs ± 4%  12.1µs ± 0%  -25.65%  (p=0.008 n=5+5)
Int/Mixed-10K       187µs ± 0%   150µs ± 0%  -19.82%  (p=0.008 n=5+5)
Int/Mixed-100K     2.20ms ± 0%  1.77ms ± 0%  -19.24%  (p=0.008 n=5+5)
Int/Mixed-1M       25.3ms ± 0%  20.5ms ± 0%  -19.21%  (p=0.008 n=5+5)
Hybrid/5%          3.50ms ± 0%  2.59ms ± 0%  -26.14%  (p=0.008 n=5+5)
Hybrid/10%         5.92ms ± 0%  4.36ms ± 0%  -26.33%  (p=0.008 n=5+5)
Hybrid/20%         10.8ms ± 0%   7.9ms ± 0%  -26.48%  (p=0.008 n=5+5)
Hybrid/30%         15.6ms ± 0%  11.5ms ± 0%  -26.43%  (p=0.008 n=5+5)
Hybrid/50%         25.3ms ± 0%  18.6ms ± 0%  -26.49%  (p=0.008 n=5+5)
Float/1K           46.1µs ± 0%  35.6µs ± 0%  -22.75%  (p=0.008 n=5+5)
Float/10K           606µs ± 0%   467µs ± 0%  -22.85%  (p=0.008 n=5+5)
Float/100K         7.51ms ± 0%  5.79ms ± 0%  -22.83%  (p=0.008 n=5+5)
Float/1M           89.5ms ± 0%  69.0ms ± 0%  -22.84%  (p=0.008 n=5+5)
Str/1K              124µs ± 0%   119µs ± 0%   -4.40%  (p=0.008 n=5+5)
Str/10K            1.54ms ± 0%  1.48ms ± 0%   -3.70%  (p=0.008 n=5+5)
Str/100K           20.1ms ± 1%  19.6ms ± 0%   -2.37%  (p=0.008 n=5+5)
Str/1M              290ms ± 1%   286ms ± 1%   -1.10%  (p=0.032 n=5+5)
Struct/1K           100µs ± 0%    85µs ± 0%  -14.55%  (p=0.008 n=5+5)
Struct/10K         1.32ms ± 0%  1.07ms ± 0%  -19.19%  (p=0.008 n=5+5)
Struct/100K        16.4ms ± 0%  12.7ms ± 0%  -22.38%  (p=0.008 n=5+5)
Struct/1M           196ms ± 0%   141ms ± 2%  -28.06%  (p=0.008 n=5+5)
Stable/1K           186µs ± 0%    73µs ± 0%  -60.49%  (p=0.008 n=5+5)
Stable/10K         2.84ms ± 0%  1.22ms ± 0%  -57.16%  (p=0.008 n=5+5)
Stable/100K        40.5ms ± 0%  14.7ms ± 0%  -63.75%  (p=0.008 n=5+5)
Stable/1M           534ms ± 0%   163ms ± 0%  -69.43%  (p=0.016 n=5+4)
Pointer/1K         63.7µs ± 0%  55.3µs ± 0%  -13.18%  (p=0.008 n=5+5)
Pointer/10K         873µs ± 0%   775µs ± 1%  -11.23%  (p=0.008 n=5+5)
Pointer/100K       12.4ms ± 0%  11.3ms ± 0%   -9.58%  (p=0.008 n=5+5)
Pointer/1M          159ms ± 0%   147ms ± 0%   -7.59%  (p=0.008 n=5+5)

By the way, this new implement is also shorter and simpler.

The text was updated successfully, but these errors were encountered:

Dual-pivot quicksort has smaller recursion depth. Some pattern-defeating technique is added to make this variant. Replace standalone rotate function in symmerge with rotateLeft. Benchmark on Xeon-8374C name old time/op new time/op delta SlicesSortInts 7.67ms ± 0% 6.23ms ± 0% -18.82% (p=0.000 n=10+10) SlicesSortInts_Sorted 113µs ± 1% 88µs ± 2% -22.31% (p=0.000 n=10+10) SlicesSortInts_Reversed 148µs ± 0% 126µs ± 1% -15.04% (p=0.000 n=10+9) SlicesSortStrings 28.0ms ± 1% 27.3ms ± 4% -2.35% (p=0.022 n=9+10) SlicesSortStrings_Sorted 628µs ± 1% 567µs ± 0% -9.80% (p=0.000 n=10+9) SortFuncStructs 14.1ms ± 2% 13.4ms ± 2% -5.53% (p=0.000 n=10+10) SortStableFuncStructs 32.2ms ± 2% 33.3ms ± 1% +3.43% (p=0.000 n=9+10) Benchmark on EPYC-7K83 name old time/op new time/op delta SlicesSortInts 7.10ms ± 0% 5.69ms ± 0% -19.82% (p=0.000 n=9+10) SlicesSortInts_Sorted 96.7µs ± 1% 90.8µs ± 1% -6.06% (p=0.000 n=10+10) SlicesSortInts_Reversed 133µs ± 2% 113µs ± 2% -15.20% (p=0.000 n=10+10) SlicesSortStrings 26.9ms ± 4% 26.0ms ± 2% -3.36% (p=0.000 n=10+9) SlicesSortStrings_Sorted 658µs ± 1% 644µs ± 0% -2.20% (p=0.000 n=10+8) SortFuncStructs 13.5ms ± 8% 12.9ms ± 2% -4.42% (p=0.002 n=10+8) SortStableFuncStructs 35.5ms ± 7% 31.6ms ± 7% -11.18% (p=0.000 n=9+10) For golang#61027

ianlancetaylor · 2023-06-27T17:46:02Z

CC @eliben

rsc · 2023-06-27T17:48:06Z

I think we already used this decade's "change to a new sort algorithm" coupon.

eliben · 2023-06-27T20:31:40Z

cc @zhangyunhao116

PeterRK · 2023-06-28T01:27:14Z

I think we already used this decade's "change to a new sort algorithm" coupon.

Dual-pivot quicksort is not a new sort algorithm. It can be found in Java standard library many years ago. New algorithm like pdqsort is not always best.

rsc · 2023-06-28T02:38:37Z

My point is we just put the ecosystem through the churn of a sort change in Go 1.19 (2022). I don't see a strong argument that slices.Sort is somehow fundamentally different from sort.Slice and merits a different algorithm, nor do I see a strong argument for invalidating our decision just last year to switch to pdqsort. We could spend lots of time flip-flopping between different sort algorithms in the standard library, or we could do other work. Having just changed the sort algorithm last year, I think we owe it to ourselves and our users not to change it again for at least five years (that is, Go 1.29 at the earliest) unless we have very good reasons.

We should definitely not make a change for Go 1.21 - it is too late in the cycle.

PeterRK · 2023-06-28T03:10:30Z

My point is we just put the ecosystem through the churn of a sort change in Go 1.19 (2022). I don't see a strong argument that slices.Sort is somehow fundamentally different from sort.Slice and merits a different algorithm, nor do I see a strong argument for invalidating our decision just last year to switch to pdqsort. We could spend lots of time flip-flopping between different sort algorithms in the standard library, or we could do other work. Having just changed the sort algorithm last year, I think we owe it to ourselves and our users not to change it again for at least five years (that is, Go 1.29 at the earliest) unless we have very good reasons.

We should definitely not make a change for Go 1.21 - it is too late in the cycle.

I agree that it’s not a good idea to make changes frequently. People now can use different sort algorithms with third-party libraries. Just let it fly.

eliben · 2023-06-28T03:41:42Z

To add to @rsc 's point, we're also putting the community through a "sorting change" right now, in the upcoming Go 1.21, where the slices package with its sorting functionality is entirely new. It provides better performance than the sort package in many cases, so we're going to recommend Go users to shift to it, if possible. This is a fairly major change.

pdqsort was a large performance improvement for some cases. The slices.Sort (vs. sort) is also a significant speedup. Compared to these, the quoted improvements with dual-pivot quicksort are relatively modest (except sorting a slice of ints). So I definitely agree that we should give the community time to settle with the recent changes, and if people require the absolutely fastest sort of an int slice, they can use a 3rd party package.

PeterRK · 2023-06-28T04:39:59Z

To add to @rsc 's point, we're also putting the community through a "sorting change" right now, in the upcoming Go 1.21, where the slices package with its sorting functionality is entirely new. It provides better performance than the sort package in many cases, so we're going to recommend Go users to shift to it, if possible. This is a fairly major change.

pdqsort was a large performance improvement for some cases. The slices.Sort (vs. sort) is also a significant speedup. Compared to these, the quoted improvements with dual-pivot quicksort are relatively modest (except sorting a slice of ints). So I definitely agree that we should give the community time to settle with the recent changes, and if people require the absolutely fastest sort of an int slice, they can use a 3rd party package.

Patern-defeating techique just makes fastest cases faster. It helps little in real hybrid workload. New apis are needed to achieve significant speedup. But I don't think it's worthy to change standard library apis for speed now.

PeterRK · 2023-06-28T04:59:03Z

I didn't really care which sort algorithm is used in go standard library until someone claimed pdqsort is fastest algorithm in Go. That's not the truth, but got credit from go tream. @eliben @rsc

zhangyunhao116 · 2023-06-28T07:30:44Z

@PeterRK

I'm the author of this article, and I don't think I claimed that "PDQSORT is the fastest sorting algorithm in Go".

We can see the summary of this article(translated from Chinese directly):

At present, most of the unstable sorting algorithms used in the industry have basically changed from a single sorting algorithm in textbooks to a hybrid sorting algorithm to deal with various sequences in practical scenarios.

Relying on its performance advantages in common scenarios compared with previous algorithms, pdqsort has gradually become the mainstream implementation of unstable sorting algorithms. Based on the generics brought by Go1.18, the implementation of sorting algorithms is greatly simplified, and it also gives us the possibility to implement new algorithms. But pdqsort is not a panacea. In some cases, other algorithms still maintain advantages (for example, timsort of the Python standard library is better than pdqsort in mixed ascending && descending scenarios). However, in most cases, pdqsort has become a better choice for unstable algorithms depending on its specific optimization for different situations.

The title of this article is "Building the fastest sorting algorithm"(it is ongoing), PDQSORT is just an important milestone on the road to building the fastest sorting algorithm. There are still many other languages or libraries that use this algorithm, and I still haven't found any algorithm that is faster than PDQSORT in all cases for now. PDQSORT is still one of the fastest hybrid sorting algorithms in some cases.

In addition, I am not the author of the PDQSORT algorithm itself. This implementation is first used internally at my work and then contributed to the community.

This article is more to let everyone understand the principles of the algorithm (in fact, this article is part of a course for college students within the company), and to attract more people to participate in community building, if you have a better idea for sorting algorithms, feel free.

PeterRK · 2023-06-29T02:19:32Z

Pattern-defeating technique can be found in timsort. Branch-eliminating technique comes from blockquicksort. Cache-exploiting technique comes from dual-pivot quicksort.
Pdqsort trys to combine those techniques, but fail to handle the conflict between branch-eliminating technique and cache-exploiting technique. The author of pdqsort explains in paper why he choose branch-eliminating technique after trade-off. Cache-exploiting technique should be used in where branch-eliminating technique doesn’t work.
Implement in go standard library distorts the idea of pdqsort and exaggerate its power.

Most worryingly, go team seems to support this mistake.

zhangyunhao116 · 2023-06-29T05:57:27Z

In the process of implementation, we often need some engineering compromises, which means that the optimization of the paper wouldn't always work. For example, the implementation in the standard library is about 10%~ 20% slower than the original implementation I used. This is because it needs to support both interface-sorting and integer-sorting, and these algorithms need to be generated using the same file.

If the algorithm only needs to support integer sorting or doesn't need to be generated via the same file, many other optimizations can be used, but it doesn't mean these engineering compromises are a mistake.

PeterRK added the Proposal label Jun 27, 2023

ianlancetaylor added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. and removed Proposal labels Jun 27, 2023

ianlancetaylor added this to the Backlog milestone Jun 27, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

slices: use dual-pivot quicksort instead of pdqsort #61027

slices: use dual-pivot quicksort instead of pdqsort #61027

PeterRK commented Jun 27, 2023 •

edited

ianlancetaylor commented Jun 27, 2023

rsc commented Jun 27, 2023

eliben commented Jun 27, 2023

PeterRK commented Jun 28, 2023

rsc commented Jun 28, 2023 •

edited

PeterRK commented Jun 28, 2023

eliben commented Jun 28, 2023

PeterRK commented Jun 28, 2023

PeterRK commented Jun 28, 2023

zhangyunhao116 commented Jun 28, 2023

PeterRK commented Jun 29, 2023

zhangyunhao116 commented Jun 29, 2023 •

edited

slices: use dual-pivot quicksort instead of pdqsort #61027

slices: use dual-pivot quicksort instead of pdqsort #61027

Comments

PeterRK commented Jun 27, 2023 • edited

ianlancetaylor commented Jun 27, 2023

rsc commented Jun 27, 2023

eliben commented Jun 27, 2023

PeterRK commented Jun 28, 2023

rsc commented Jun 28, 2023 • edited

PeterRK commented Jun 28, 2023

eliben commented Jun 28, 2023

PeterRK commented Jun 28, 2023

PeterRK commented Jun 28, 2023

zhangyunhao116 commented Jun 28, 2023

PeterRK commented Jun 29, 2023

zhangyunhao116 commented Jun 29, 2023 • edited

PeterRK commented Jun 27, 2023 •

edited

rsc commented Jun 28, 2023 •

edited

zhangyunhao116 commented Jun 29, 2023 •

edited