And similarly, entire generations of programmers were never taught Horner's scheme. You can see it in the article, where they write stuff like
A * x * x * x * x * x * x + B * x * x * x * x + C * x * x + D
(9 muls, 3 muladds) instead of the faster
tmp = x * x;
((A * tmp + B) * tmp + C) * tmp + D
(1 mul, 3 muladds)

The common subexpression elimination (CSE) pass in compilers takes care of that.
The problem with Horner’s scheme is that it creates a long chain of data dependencies, instead of making full use of all execution units. Usually you’d want more of a binary tree than a chain.
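To make the tree shape concrete, here's an Estrin-style evaluation of a degree-7 polynomial (my illustration, not from the thread): the four leaf muladds are mutually independent, so they can issue in parallel, and the dependency chain is O(log n) deep instead of Horner's O(n).

```c
/* Estrin-style evaluation of c[0] + c[1]*x + ... + c[7]*x^7.
   p01..p67 have no dependence on each other, so a wide core can
   compute them concurrently before the final combine. */
static double estrin7(double x, const double c[8]) {
    double x2 = x * x;
    double x4 = x2 * x2;
    double p01 = c[1] * x + c[0];
    double p23 = c[3] * x + c[2];
    double p45 = c[5] * x + c[4];
    double p67 = c[7] * x + c[6];
    return (p67 * x2 + p45) * x4 + (p23 * x2 + p01);
}
```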
The reason for writing out all of the x multiplications like that is that I was hoping the compiler would detect such a pattern and perform the optimization for me. Matt Godbolt's "Advent of Compiler Optimizations" series mentions some of these cases where the compiler can do more auto-optimizations for the developer.
Didn't know this technique had a name, but I would think a modern compiler could make this optimization on its own, no?
Isn't that for... readability...?
Is this outside of what compilers can do nowadays? (Or do they refuse because it's floating-point?)
Not just for speed, Horner can also be essential for numerical stability.
Thinking about speed like this used to be necessary in C and C++ but these days you should feel free to write the most legible thing (Horner's form) and let the compiler find the optimal code for it (probably similar to Horner's form but broken up to have a shallower dependency chain).
But if you're writing in an interpreted language that doesn't have a good JIT, or for a platform with a custom compiler, it might be worth hand-tweaking expressions with an eye towards performance and precision.
Yep, good stuff. Another nice trick to extract more ILP is to split the polynomial into even and odd exponents and then recombine at the end (not sure if this has a name). This also makes it amenable to SLP vectorization (although I doubt the compiler will do this nicely on its own).
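The original example didn't survive; here's my reconstruction of the even/odd split for a degree-7 polynomial: two independent Horner chains in x^2, recombined as p(x) = even(x^2) + x * odd(x^2).

```c
/* Even/odd split of c[0] + c[1]*x + ... + c[7]*x^7.
   The two Horner chains below have no data dependence on each
   other, so they can run in parallel (or in two SIMD lanes). */
static double poly_evenodd(double x, const double c[8]) {
    double x2 = x * x;
    double even = ((c[6] * x2 + c[4]) * x2 + c[2]) * x2 + c[0];
    double odd  = ((c[7] * x2 + c[5]) * x2 + c[3]) * x2 + c[1];
    return odd * x + even;
}
```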
Actually, one project I was thinking of doing was creating SLP-vectorized versions of libm functions, since plenty of programs spend a lot of time in libm calls on single inputs, but the implementation is usually a bunch of scalar instructions.
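A tiny sketch of what an SLP/SIMD-friendly evaluation could look like, using GCC/Clang vector extensions (my illustration, not an actual libm implementation):

```c
/* Run the same Horner recurrence on four inputs at once.
   v4df is four doubles wide (32 bytes); each iteration does one
   muladd per lane, with all lanes advancing in parallel. */
typedef double v4df __attribute__((vector_size(32)));

static v4df horner4(v4df x, const double *c, int n) {
    v4df r = {c[n - 1], c[n - 1], c[n - 1], c[n - 1]};
    for (int i = n - 2; i >= 0; --i) {
        v4df ci = {c[i], c[i], c[i], c[i]}; /* broadcast coefficient */
        r = r * x + ci;
    }
    return r;
}
```

The dependency chain per lane is the same as scalar Horner, but four independent inputs amortize it, which is the point of batching libm-style calls.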