The 4% improvement doesn't seem like it's worth the effort.
On a general note, instructions like division and square root are roughly equal to trig functions in cycle count on modern CPUs. So, replacing one with the other will not confer much benefit, as evidenced by the results. They're all typically implemented using LUTs, and it's hard to beat the performance of an optimized LUT, which is basically a multiplexer connected to some dedicated memory cells in hardware.
> The 4% improvement doesn't seem like it's worth the effort.
People have gotten PhDs for smaller optimizations. I know. I've worked with them.
> instructions like division and square root are roughly equal to trig functions in cycle count on modern CPUs.
What's the x86-64 opcode for arcsin?
> The 4% improvement doesn't seem like it's worth the effort.
I've spent the past few months improving the performance of some work thing by ~8% and the fun I've been having reminds me of the nineties, when I tried to squeeze every last % of performance out of the 3D graphics engine that I wrote as a hobby.
The effort of typing about 10 words into an LLM is minimal.
You'd be surprised how it actually is worth the effort, even just a 1% improvement. If you have the time, this is a great talk to listen to: https://www.youtube.com/watch?v=kPR8h4-qZdk
For a little toy ray tracer, it is pretty measly. But for a larger corporation (with a professional project) a 4% speed improvement can mean MASSIVE cost savings.
Some of these tiny improvements can also have a cascading effect. Imagine finding a +4%, a +2% somewhere else, +3% in neighboring code, and a bunch of +1%s here and there. Eventually you'll have built up something that is 15-20% faster. Down the road you'll come across those optimizations which can yield the big results too (e.g. the +25%).