In TFA it is said that the C version used "-ffast-math" so that the compiler would generate fused multiply-add (FMA) operations.
ML/AI is one of the few application domains where the use of "-ffast-math" may be acceptable, but in general one must not use "-ffast-math" just to get FMA.
To enable FMA generation by the compiler, the right flag for both gcc and clang is "-ffp-contract=fast".
"-ffast-math" enables "-ffp-contract=fast", but it also enables a bunch of other code transformations that are very undesirable in any application where numerical accuracy matters, and which seldom bring any noticeable performance improvement.
Outside of ML/AI and graphics/games, "-ffast-math" should be used only by experts who fully understand the implications. Actually, even for experts it is rarely useful to reach for "-ffast-math" instead of selectively enabling only some of the many options it aggregates.
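For reference, this is roughly what "-ffast-math" aggregates in GCC (clang's set differs slightly; check your compiler's manual), and what a selective alternative could look like:

```shell
# What -ffast-math turns on in GCC (approximately; see the GCC manual):
#   -fno-math-errno -funsafe-math-optimizations -ffinite-math-only
#   -fno-rounding-math -fno-signaling-nans -fcx-limited-range
#   -fexcess-precision=fast
# (-funsafe-math-optimizations in turn implies -fno-signed-zeros,
#  -fno-trapping-math, -fassociative-math and -freciprocal-math.)

# A selective alternative: FMA contraction plus skipping errno updates
# for math functions, without the value-changing transformations
# ("prog.c" is a placeholder file name):
gcc -O2 -ffp-contract=fast -fno-math-errno prog.c -o prog
```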
The fact that most compilers still do not generate fused multiply-add operations by default in 2026, 36 years after the invention of this operation at IBM, is quite dumb.
In the overwhelming majority of cases, using FMA produces more accurate results than not using FMA. (The only cases where this is not true are certain expressions computed without FMA in which some rounding errors happen to cancel each other.)
The reason it has not been the default is that the numeric results differ from those obtained on legacy computers without FMA, which surprised naive users. So FMA was disabled to reproduce the old results, even though the old results were less accurate.
This policy of mimicking legacy systems, just to avoid user confusion, should have become obsolete a long time ago.