And similarly, entire generations of programmers were never taught Horner's scheme. You can see...

Sesse__ • today at 5:29 PM • 9 replies • view on HN

And similarly, entire generations of programmers were never taught Horner's scheme. You can see it in the article, where they write stuff like

  A * x * x * x * x * x * x + B * x * x * x * x + C * x * x + D

(10 muls, 3 muladds)

instead of the faster

  tmp = x * x;
  ((A * tmp + B) * tmp + C) * tmp + D

(1 mul, 3 muladds)

Replies

anematode • today at 7:35 PM

Yep, good stuff. Another nice trick to extract more ILP is to split it into even/odd exponents and then recombine at the end (not sure if this has a name). This also makes it amenable to SLP vectorization (although I doubt the compiler will do this nicely on its own). For example something like

    typedef double v2d __attribute__ ((vector_size (16)));

    v2d packed = { x, x };
    packed = fma(packed, As, Bs);
    packed = fma(packed, Cs, Ds);
    // ...
    return x * packed[0] + packed[1]

smth like that

Actually one project I was thinking of doing was creating SLP vectorized versions of libm functions. Since plenty of programs spend a lot of time in libm calling single inputs, but the implementation is usually a bunch of scalar instructions.

woadwarrior01 • today at 5:50 PM

The common subexpression elimination (CSE) pass in compilers takes care of that.

➕ show 1 reply

eska • today at 6:01 PM

The problem with Horner’s scheme is that it creates a long chain of data dependencies, instead of making full use of all execution units. Usually you’d want more of a binary tree than a chain.

➕ show 2 replies

def-pri-pub • today at 6:30 PM

The reason for writing out all of the x multiplications like that is that I was hoping the compiler detect such a pattern perform an optimization for me. Mat Godbolt's "Advent of Compiler Optimizations" series mentions some of these cases where the compiler can do more auto-optimizations for the developer.

➕ show 1 reply

arkmm • today at 5:44 PM

Didn't know this technique had a name, but I would think a modern compiler could make this optimization on its own, no?

➕ show 1 reply

3371 • today at 5:42 PM

Isn't that for... readability...?

zahlman • today at 5:38 PM

Is this outside of what compilers can do nowadays? (Or do they refuse because it's floating-point?)

owlbite • today at 6:32 PM

Not just for speed, Horner can also be essential for numerical stability.

boothby • today at 5:57 PM

Thinking about speed like this used to be necessary in C and C++ but these days you should feel free to write the most legible thing (Horner's form) and let the compiler find the optimal code for it (probably similar to Horner's form but broken up to have a shallower dependency chain).

But if you're writing in an interpreted language that doesn't have a good JIT, or for a platform with a custom compiler, it might be worth hand-tweaking expressions with an eye towards performance and precision.

➕ show 3 replies

alt Hacker News

Replies