These sorts of approximations (and more sophisticated methods) are fairly widely used in systems programming, as shown by the fact that Apple's asin is only a couple of percent slower and sub-ulp accurate (https://members.loria.fr/PZimmermann/papers/accuracy.pdf). I would expect similar performance on non-Apple x86 with Intel's math library (which does not seem to have been measured here), and significantly better performance at the same accuracy from a vectorized library call.
The approximation reported here is slightly faster but only accurate to about 2.7e11 ulp. That's totally appropriate for the graphics use in question, but no one would ever use it for a system library; less than half the bits are good.
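To put that figure in terms of bits, here is a quick back-of-the-envelope sketch. It assumes the 2.7e11 ulp number refers to IEEE 754 double precision (53-bit significand); that is my reading of the figure, and the variable names are only illustrative.

```python
import math

SIGNIFICAND_BITS = 53   # IEEE 754 double precision (assumed)
ERROR_ULP = 2.7e11      # error figure quoted above

# An error of N ulp costs about log2(N) bits of the significand.
bits_lost = math.log2(ERROR_ULP)
good_bits = SIGNIFICAND_BITS - bits_lost

# One ulp is between 2^-53 and 2^-52 in relative terms; use the lower end.
rel_error = ERROR_ULP * 2.0 ** -SIGNIFICAND_BITS

print(f"bits lost: {bits_lost:.1f}")
print(f"good bits: {good_bits:.1f}")
print(f"relative error: {rel_error:.1e}")
```

That works out to about 38 bits lost, leaving roughly 15 good bits out of 53, or a relative error around 3e-5.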
It's also worth noting that it's possible to go faster without further loss of accuracy: the approximation uses a correctly rounded square root, which is much more accurate than the rest of the approximation deserves. An approximate square root will deliver the same overall accuracy and much better vectorized performance.
Great point about the approximate sqrt being low-hanging fruit. The correctly rounded sqrt is doing far more work than the rest of the pipeline deserves at that error budget. I wonder if the author benchmarked with rsqrtss plus one Newton-Raphson refinement step. On x86 that gives you roughly 23 bits of precision for a fraction of the latency of sqrtss, which is still massive overkill for a 2.7e11 ulp result but would show an even bigger speedup.
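For anyone curious where the "roughly 23 bits" comes from, here is a small numeric sketch of one refinement step. It does not call the real instruction: the starting estimate is the exact value with a relative error of 2^-12 injected by hand, as a stand-in for a roughly 12-bit hardware estimate. The helper names refine_rsqrt and bits_of_accuracy are made up for the illustration.

```python
import math

def refine_rsqrt(x, y):
    """One Newton-Raphson step for y ~ 1/sqrt(x): y' = y * (1.5 - 0.5*x*y*y)."""
    return y * (1.5 - 0.5 * x * y * y)

def bits_of_accuracy(approx, exact):
    """Number of correct bits, from the relative error."""
    return -math.log2(abs(approx - exact) / abs(exact))

x = 2.0
exact = 1.0 / math.sqrt(x)

# Hypothetical ~12-bit estimate: exact value with relative error 2^-12 added.
# This is NOT the output of rsqrtss, only a stand-in with similar accuracy.
estimate = exact * (1.0 + 2.0 ** -12)

refined = refine_rsqrt(x, estimate)

# sqrt(x) can then be recovered as x * (1/sqrt(x)).
approx_sqrt = x * refined

bits_before = bits_of_accuracy(estimate, exact)
bits_after = bits_of_accuracy(refined, exact)

print(f"bits before refinement: {bits_before:.1f}")
print(f"bits after one step:    {bits_after:.1f}")
print(f"approx sqrt(2): {approx_sqrt!r} vs {math.sqrt(x)!r}")
```

Each step roughly squares the relative error (e becomes about 1.5*e^2), so 12 bits go to a bit over 23, which is still well beyond the roughly 15 bits the approximation itself delivers.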
Yeah, the one big problem with approximate sqrt is that results aren't consistent across systems: Intel and AMD, for example, implement RSQRT differently. Fine for graphics, but if you need consistent results from machine to machine, that messes things up.