logoalt Hacker News

Branchless Quicksort faster than std:sort and pdqsort with C and C++ API

93 pointsby birdculturelast Tuesday at 8:00 PM14 commentsview on HN

Comments

orlpyesterday at 10:51 PM

Since pdqsort (an older project of mine) was mentioned, I felt it wouldn't be entirely inappropriate to mention that I've since then collaborated with Lukas Bergdoll to provide two high-quality sort implementations for the Rust standard library, ipnsort (unstable) and driftsort (stable).

So if you use Rust, you get these by simply calling [T]::sort(_unstable). Great performance out of the box :)

On my machine (Apple M2), using the benchmarks from the repository on Apple clang 17 and Rust 1.98 nightly:

    Sorting 50 million doubles:
    ipnsort             0.79s
    blqs                0.90s
    driftsort           1.13s   (stable)
    std::sort           1.22s
    std::stable_sort    4.64s   (stable)

    Sorting 50 million (i32, i32) structs:
    ipnsort             0.82s
    blqs                0.89s
    driftsort           1.07s   (stable)
    std::sort           3.09s
    std::stable_sort    3.15s   (stable)

And now for a cool party trick, let's repeat the 50 million doubles experiment again, but have the first 90% already sorted, last 10% random:

    driftsort           0.29s   (stable)
    ipnsort             0.81s
    std::sort           1.15s
    std::stable_sort    1.63s   (stable)
    blqs                1.89s
quuxplusonetoday at 1:38 AM

It's unfortunate that the C++ version of the code assumes the type T is default-constructible (and that the default constructor is cheap). It also assumes that the type T is copy-constructible; at a glance I can't tell if the algorithm depends on making a copy in every place that it does make a copy. E.g. in the `heap_sort` helper we have

    T k;                       // default-construct
    if (i > 0) k = left[--i];  // copy-assign
This fairly obviously could be replaced with "copy-construct." Could it be replaced with "move-construct"? I don't know. Again, in `partition_small`, we have

    T swbuf[SMALLPART];
which default-constructs a bunch of Ts. I think we're just going to overwrite that memory in a moment anyway, so constructing all those Ts is a waste of cycles; but I'm not sure.

All of my "I don't knows" and "I'm not sures" are due to my own lack of digging into the code; I'm sure one could find out if one really looked.

None of that matters if you're just sorting `int` or the benchmarked `struct entry`. But it matters a great deal if you're taking the README literally and trying to sort "types with higher copy costs [...] (such as strings)".

kvujtoday at 1:17 AM

>On modern CPUs, avoiding branch misprediction is a key technique to speed up programs. This branchless approach:

>

>for (int i = 0; i < 1000; i++) {

> small_numbers[smlen] = numbers[i];

> smlen += (numbers[i] < 500);

>}

Excuse my terrible ignorance but isn't there still a branch? If numbers[i] < 500 then 1 else 0? I would think something like addition plus a bit comparison would avoid said branch. Unless compilers already optimize the code, but then wouldn't they also optimize the naive piece of code?

show 2 replies
mgaunardyesterday at 10:11 PM

Aren't there several bitonic sort network implementations that are vectorized, Intel's in particular?

Why not compare against that?

show 3 replies
davidkwastyesterday at 10:07 PM

It is so simple that I had to look very slowly to understand. Nicely done.

show 1 reply