Faster asin() was hiding in plain sight

152 points • by def-pri-pub • today at 2:35 PM • 87 comments • view on HN

Comments

While I'm glad to see the OP got a good minimax solution at the end, it seems like the article missed clarifying one of the key points: error waveforms over a specified interval are critical, and if you don't see the characteristic minimax-like wiggle, you're wasting easy opportunity for improvement.

Taylor series in general are a poor choice, and Pade approximants of Taylor series are equally poor. If you're going to use Pade approximants, they should be of the original function.

I prefer Chebyshev approximation: https://www.embeddedrelated.com/showarticle/152.php which is often close enough to the more complicated Remez algorithm.

xt00 • today at 4:43 PM

To be accurate, this is originally from Hastings 1955, Princeton "APPROXIMATIONS FOR DIGITAL COMPUTERS BY CECIL HASTINGS", page 159-163, there are actually multiple versions of the approximation with different constants used. So the original work was done with the goal of being performant for computers of the 1950's. Then the famous Abramowitz and Stegun guys put that in formula 4.4.45 with permission, then the nvidia CG library wrote some code that was based upon the formula, likely with some optimizations.

LegionMammal978 • today at 3:28 PM

In general, I find that minimax approximation is an underappreciated tool, especially the quite simple Remez algorithm to generate an optimal polynomial approximation [0]. With some modifications, you can adapt it to optimize for either absolute or relative error within an interval, or even come up with rational-function approximations. (Though unfortunately, many presentations of the algorithm use overly-simple forms of sample point selection that can break down on nontrivial input curves, especially if they contain small oscillations.)

[0] https://en.wikipedia.org/wiki/Remez_algorithm

➕ show 2 replies

debo_ • today at 7:42 PM

https://bash-org-archive.com/?427792

cmovq • today at 5:44 PM

> After all of the above work and that talk in mind, I decided to ask an LLM.

Impressive that an LLM managed to produce the answer from a 7 year old stack overflow answer all on its own! [1] This would have been the first search result for “fast asin” before this article was published.

[1]: https://stackoverflow.com/a/26030435

➕ show 1 reply

exmadscientist • today at 4:06 PM

This line:

> This amazing snippet of code was languishing in the docs of dead software, which in turn the original formula was scrawled away in a math textbook from the 60s.

was kind of telling for me. I have some background in this sort of work (and long ago concluded that there was pretty much nothing you can do to improve on existing code, unless either you have some new specific hardware or domain constraint, or you're just looking for something quick-n-dirty for whatever reason, or are willing to invest research-paper levels of time and effort) and to think that someone would call Abramowitz and Stegun "a math textbook from the 60s" is kind of funny. It's got a similar level of importance to its field as Knuth's Art of Computer Programming or stuff like that. It's not an obscure text. Yeah, you might forget what all is in it if you don't use it often, but you'd go "oh, of course that would be in there, wouldn't it...."

➕ show 2 replies

AlotOfReading • today at 3:24 PM

I'm pretty sure it's not faster, but it was fun to write:

    float asin(float x) {
      float x2 = 1.0f-fabs(x);
      u32 i = bitcast(x2);
      i = 0x5f3759df - (i>>1);
      float inv = bitcast(i);
      return copysign(pi/2-pi/2*(x2*inv),x);
    }

Courtesy of evil floating point bithacks.

➕ show 5 replies

scottlamb • today at 3:16 PM

Isn't the faster approach SIMD [edit: or GPU]? A 1.05x to 1.90x speedup is great. A 16x speedup is better!

They could be orthogonal improvements, but if I were prioritizing, I'd go for SIMD first.

I searched for asin on Intel's intrinsics guide. They have a AVX-512 instrinsic `_mm512_asin_ps` but it says "sequence" rather than single-instruction. Presumably the actual sequence they use is in some header file somewhere, but I don't know off-hand where to look, so I don't know how it compares to a SIMDified version of `fast_asin_cg`.

https://www.intel.com/content/www/us/en/docs/intrinsics-guid...

➕ show 3 replies

orangepanda • today at 2:57 PM

> Nobody likes throwing away work they've done

I like throwing away work I've done. Frees up my mental capacity for other work to throw away.

sixo • today at 4:44 PM

It appears that the real lesson here was to lean quite a bit more on theory than a programmer's usual roll-your-own heuristic would suggest.

A fantastic amount of collective human thought has been dedicated to function approximations in the last century; Taylor methods are over 200 years old and unlikely to come close to state-of-the-art.

glitchc • today at 3:45 PM

The 4% improvement doesn't seem like it's worth the effort.

On a general note, instructions like division and square root are roughly equal to trig functions in cycle count on modern CPUs. So, replacing one with the other will not confer much benefit, as evidenced from the results. They're all typically implemented using LUTs, and it's hard to beat the performance of an optimized LUT, which is basically a multiplexer connected to some dedicated memory cells in hardware.

➕ show 4 replies

empiricus • today at 4:24 PM

Does anyone knows the resources for the algos used in the HW implementations of math functions? I mean the algos inside the CPUs and GPUs. How they make a tradeoff between transistor number, power consumption, cycles, which algos allow this.

erichocean • today at 2:51 PM

Ideal HN content, thanks!

varispeed • today at 7:06 PM

If you are interested in such "tricks", you should check out the classic Hacker's Delight by Henry Warren

ok123456 • today at 4:22 PM

Chebyshev approximation for asin is sum(2T_n(x) / (pi*n*n),n), the even terms are 0.

stephc_int13 • today at 3:43 PM

My favorite tool to experiment with math approximation is lolremez. And you can easily ask your llm to do it for you.

drsopp • today at 3:10 PM

Did some quick calculations, and at this precision, it seems a table lookup might be able to fit in the L1 cache depending on the CPU model.

➕ show 3 replies

adampunk • today at 3:07 PM

We love to leave faster functions languishing in library code. The basis for Q3A’s fast inverse square root had been sitting in fdlibm since 1986, on the net since 1993: https://www.netlib.org/fdlibm/e_sqrt.c

➕ show 1 reply

patchnull • today at 5:23 PM

This keeps happening across numerical computing. Abramowitz and Stegun is basically a cheat code that entire generations of developers forgot existed. I hit the same thing a few years back with a fast atan2 approximation buried in a 1971 naval research paper that turned out to be nearly identical to what a colleague spent two weeks deriving from scratch. The real problem is that modern CS education treats numerical methods as a solved problem delegated to libraries, so nobody learns to read the original tables and error bounds. Meanwhile game and graphics devs independently rediscover these approximations under deadline pressure and the knowledge stays siloed in engine-specific codebases instead of flowing back to the broader community.

➕ show 2 replies

patchnull • today at 3:07 PM

The huge gap between Intel (1.5x) and M4 (1.02x) speedups is the most interesting result here. Apple almost certainly uses similar polynomial approximations inside their libm already, tuned for the M-series pipeline. glibc on x86 tends to be more conservative with precision, leaving more room on the table. The Cg version is from Abramowitz and Stegun formula 4.4.45, which has been a staple in shader math for decades. Funny how knowledge gets siloed, game devs and GPU folks have known about this class of approximation forever but it rarely crosses into general systems programming.

➕ show 2 replies

alt Hacker News

Faster asin() was hiding in plain sight

Comments