Understanding Fast-Math (pspdfkit.com)
78 points by ingve on Dec 9, 2021 | 40 comments


Nitpicky, but saying

> It turns out that, like with almost anything else relating to IEEE floating-point math, it’s a rabbit hole full of surprising behaviors.

immediately before describing that they disabled IEEE floating-point math is a bit funny. The standard isn't surprising; floating-point numbers (arguably) are. The whole point of standardizing floating-point math was to reduce those surprises. You can't complain about IEEE floating-point numbers if you tell the compiler not to use them.


It takes some time to learn how floating point and numerical calculations work. Not too long, but more than one evening. If you take a numerical analysis course in a university, the first 2 or 3 lectures might be about floating point, error propagation and error analysis. Or the first chapter of a numerical analysis textbook.

But almost nobody spends this much effort to familiarize themselves with the floating point system. So it keeps surprising people.


At least it should be a bit less surprising now that every CPU has SSE instructions, where a 32-bit float is actually stored in a 32-bit vector register instead of an 80-bit FPU register that may be spilled to a 32-bit memory location whenever your compiler feels like it. Makes the results a lot more consistent.


I am shocked by the plethora of posts by people surprised when -ffast-math breaks their code (no insult to the pspdfkit folks intended).

If it were a harmless flag you'd think it would be enabled by default. That's a clue that you should look before you leap.

We have a few files in our code base that compile with -ffast-math but the rest don't. Those files were written with that flag in mind.


No, it’s easy to assume (even after reading the documentation) that -ffast-math just makes some calculations less precise or otherwise not strictly IEEE 754 compliant. That alone would be reason enough not to enable it by default, while still making it available for those who want maximum performance at the expense of precision (or accuracy), which can be a perfectly reasonable tradeoff in cases like computer graphics. It’s very understandable that people are surprised by the fact that -ffast-math can actually break code in unintuitive ways.


The documentation literally says that the results may be incorrect, not to mention documenting that it enables a flag with "unsafe" in its name (-funsafe-math-optimizations).

I feel like a lot of programmers just type stuff and assume if it compiles it should ship.


I was definitely not expecting it to completely break std::isnan and std::isinf.
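To illustrate (a minimal sketch of my own, relying on the documented behavior of -ffinite-math-only, which -ffast-math implies: the compiler may assume NaNs never occur):

    #include <cmath>
    #include <cstdio>

    int main() {
        volatile double zero = 0.0;  // volatile so the NaN is produced at runtime
        double x = zero / zero;      // a genuine NaN in the register
        // Under -ffast-math the compiler is allowed to fold this check to
        // false and drop the branch entirely.
        if (std::isnan(x)) {
            std::puts("caught a NaN");
        } else {
            std::puts("no NaN detected");  // the likely output with -ffast-math
        }
    }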


People sometimes have unreasonable faith in compilers or libraries. It's a common adage to look at bugs in application code before suspecting these lower layers. That is not the same thing as the lower layers never producing unexpected results. So when people get evidence that a compiler feature may be rough around the edges, they resist it.


At Cygnus it felt like the opposite: more than half the reports were from people blaming the compiler when it was their own code that was at fault.

But that perspective aside, I agree: generally people correctly assume that their tools are OK and their code is at fault.

I wouldn't consider -ffast-math a bug: it's a sharp tool that should only be used by experienced users. If there's a bug at all it's that, in retrospect, the flag should have had a different name.


> To my surprise, there were no measurable differences outside of the standard deviation we already see when repeating tests multiple times.

This should be the main takeaway. Don't enable -ffast-math if floating-point calculations aren't a bottleneck, and especially not if you don't understand all the other optimizations it enables.


Counter-argument: always enable fast-math unless your application has a demonstrable need for deterministically consistent floating-point math. If you're just trying to get a "good enough" calculation, which is the vast majority of floating-point work, e.g. physics and rendering calculations for 3D graphics, there's no reason to leave the performance on the floor.


> Always enable fast-math unless your application has a demonstrable need for deterministically consistent floating point math.

As far as I remember, fast-math also breaks things like NaN and infinity handling, which makes filtering out invalid values before they hit something important "fun": the invalid values will obviously still exist and mess up your results, but you can no longer check for them.


No tight rendering loop building a transform matrix to ship off to a graphics card has an isnan() in it to begin with. If your code cares about isnan() or infinities, that would be an example of a demonstrable need not to use fast-math.


If the compiler is going to assume that all FP values and calculations are finite, then you really need to be able to check for any non-finite input values, which you can’t do with -ffast-math because it silently breaks isnan and isinf. WTF.

Also, there are other uses of FP besides games.
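For what it's worth, the workaround people usually reach for is testing the IEEE 754 bit pattern directly, so the check doesn't depend on floating-point semantics the optimizer is allowed to assume away (rough sketch; whether a given compiler leaves even this alone is its own question):

    #include <cstdint>
    #include <cstring>

    bool bitwise_isnan(double d) {
        std::uint64_t bits;
        std::memcpy(&bits, &d, sizeof bits);        // well-defined way to get the bits
        const std::uint64_t exponent = (bits >> 52) & 0x7FF;
        const std::uint64_t mantissa = bits & 0xFFFFFFFFFFFFFull;
        return exponent == 0x7FF && mantissa != 0;  // all-ones exponent, nonzero mantissa
    }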


> there's no reason to leave the performance on the floor.

As the article demonstrates, there is: with -ffast-math, floating point subtly behaves in ways that don't match how you've been taught it behaves.


And my point is that the majority of floating-point applications, which is to say _not_ scientific computing, don't care about subtle rounding-error differences in results.


You cannot even stop a NaN from propagating; I would think that affects a lot more than just scientific computing. Of course, games probably don't check: I ran into having NaN money in at least one game I played.


This is the approach that I take! (Note: I write graphics and rendering code for a living. "Good enough" for me tends to mean quantizing to either identical pixels at the display bit depth or at least perceptually identical pixels. Also, I usually do see a measurable performance benefit to -Ofast over -O3 or -O2. YMMV.)

Just like cranking the warning levels as high as I can at the beginning of a project, I also like to build and test with -ffast-math (really -Ofast) from the very beginning. Keeping it warning-free and working under -ffast-math as I go is a lot easier than trying to do it all at once later!

And much like the warnings, I find that any new code that fails under -ffast-math tends to be a bit suspect. I've found that stuff that tends to break under -ffast-math will also frequently break with a different compiler or on a different hardware architecture. So -ffast-math is a nice canary for that.


Unless you upgrade your compiler and now your climate model produces different results... I think "good enough" is actually pretty rare unless it's for games or something.


"Games or something" is the majority of floating point work. There are far more video games and 3D renderings and people working on these technologies than climate models or scientific computing.


[citation needed], at least for the "people working on them" part.


“There are far more people doing [thing that I personally do] than [thing that I do not personally do, and thus know fewer people who do it].” — everyone ever


https://simonbyrne.github.io/notes/fastmath/ also has a nice discussion about the possible pitfalls of "fast"-math.


"-fno-math-errno" and "-fno-signed-zeros" can be turned on without any problems.

I got a four times speedup on <cmath> functions with no loss in accuracy.

See also "Why Standard C++ Math Functions Are Slow":

https://medium.com/@ryan.burn/why-standard-c-math-functions-...
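The classic illustration (my own sketch, not taken from the linked post): with errno-setting semantics the compiler has to keep a path that calls the libm sqrt so errno can be set for negative inputs, and that side effect also blocks auto-vectorization; with -fno-math-errno it can just emit the hardware sqrt instruction.

    #include <cmath>
    #include <cstddef>

    // Default math-errno semantics: a libm fallback per element, no vectorization.
    // With -fno-math-errno: typically a single sqrt instruction, and the loop
    // becomes a candidate for auto-vectorization.
    void sqrt_all(const double* in, double* out, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            out[i] = std::sqrt(in[i]);
        }
    }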


There are a few critical algorithms where FP error cancellation or simple ifs get optimized out if you disable signed zeros. Typically you would know which ones these are: they tend to appear in statistical machine learning code that uses the sign or expects monotonicity near zero, or in filters with coefficients that are near zero (and that filter out NaNs explicitly).
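A tiny illustration of the information a signed zero carries (a sketch of my own, not one of those algorithms specifically): the sign of a zero decides which infinity a division returns, so logic built on that distinction can flip under -fno-signed-zeros.

    #include <cstdio>

    int main() {
        volatile double pos_zero = 0.0;
        volatile double neg_zero = -0.0;
        // Under IEEE 754 these print +inf and -inf respectively. With
        // -fno-signed-zeros the compiler may treat +0.0 and -0.0 as
        // interchangeable, so code relying on the difference can change behavior.
        std::printf("1/+0 = %f\n", 1.0 / pos_zero);
        std::printf("1/-0 = %f\n", 1.0 / neg_zero);
    }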


What would be an example of a filter with coefficients near zero that would be adversely affected by the loss of signed-zero support?

You're already in mortal peril if you're working with "coefficients near zero" because of denormals, another bad idea that should have been disabled by default and turned on only in the vanishingly-few applications that benefit from them.


Sorry if it's a basic question, but does recompiling C/C++ code (with/without flags) produce more efficient code most of the time? For example, let's say I am using a binary that was compiled on a processor that didn't have support for SIMD. Assuming the program is capable of taking advantage of SIMD instructions, and also assuming my processor supports SIMD, would it make sense to recompile the C/C++ code on my system, hoping the newer binary would run faster?


The more CPU-bound your program, the more benefit you'll see from the optimization flags. If your program is constantly waiting around for input on a network channel, then it may not help as much.

My currently used optimization flags are: -O3 -fno-math-errno -fno-signed-zeros -march=native -flto

Only use -march=native if the program is only intended to run on your own machine. It carries out architecture-specific optimizations that make the program non-portable.

Also look into profile-guided optimization, where you compile and run your program, automatically generate a statistical report, then recompile using that generated information. It can result in some dramatic speedups.

https://ddmler.github.io/compiler/2018/06/29/profile-guided-...


Thank you. I want to make sure I understood it clearly...

I mostly use Java, and my impression is that the JIT inside the JVM introduces hardware-specific optimizations without any user intervention. But for C/C++, if a dependency is included as source, I can use compiler flags to enable platform-specific optimizations. But if the dependency was included in the form of a pre-compiled binary, such as a .dll or .so, I am probably not using the most optimally compiled version of the dependency. Am I right so far?


3 thoughts:

1. Using SIMD can be a big win, so yes.

2. SIMD (vectorization) is not the only optimization your compiler can do; the compiler has a model of the processor, so it can pick the right instructions and lay them out properly, with as many tricks as the compiler writers can describe generically.

3. Compilers have PGO. Use it (if you can). Compilers without PGO are a bit like an engine management unit with no sensors - all the gear, no idea. The compiler has to assume a hazy middle-of-the-road estimate of what branches will be exercised, whereas with PGO enabled your compiler can make the cold code smaller, and be more aggressive with hot code etc. etc.


> all the gear, no idea

I like this because it only makes sense in some accents. For example it wouldn't work in Boston where the r would only be pronounced on one of the words (idea).


Yes. Careful selection of compilation flags can greatly improve performance.

My employer spends many millions of dollars annually running numerical simulations using double precision floating point numbers. Some years ago when we retired the last machines that didn't support SSE2, adding a flag to allow the compiler to generate SSE2 instructions had a big time and cost savings for our simulations.


> Some years ago when we retired the last machines that didn't support SSE2, adding a flag to allow the compiler to generate SSE2 instructions had a big time and cost savings for our simulations.

That's kind of a special case, though. Without SSE2, you're using x87 for floating-point numbers, and even using scalar floating point on x87 is going to be a fair bit slower than using scalar floating point SSE instructions. Of course, enabling SSE also allows you to vectorize floating point at all, but you'll still be seeing improvements just from scalar SSE instead of x87.


gcc has the ability to target different architectures (look up the -march and -mtune flags for example). Linux distributions are typically set to be compatible with a pretty wide range of devices, so they often don't take advantage of recent instructions.

Compiling a big program can be a bit of a pain, though, so it is probably only worthwhile if you have a program that you use very frequently. Also compilers aren't magic, the bottleneck in the program you want to run could be various things: CPU stuff, memory bandwidth, weird memory access patterns, disk access, network access, etc. The compiler mostly just helps with the first one.

Also, note that some libraries, like Intel's MKL, are able to check what processor you are using and just dispatch the appropriate code (your mileage may vary, they sometimes don't keep up with changes in AMD processors, causing great annoyance).


> "-fno-math-errno" ... can be turned on without any problems.

There is even a good reason for that: math errno is a POSIX requirement and completely optional in both the C and C++ standards. If your code is intended to be portable, it should avoid relying on this anti-feature anyway.


Recent post on the same topic that gets the same information across with fewer words: https://kristerw.github.io/2021/10/19/fast-math/


Once spent 2 days trying to find out why a Python prototype yielded different results on a different platform (in this case Solaris). Eventually discovered that an unnamed culprit implemented the cunning plan of compiling the interpreter with `-ffast-math`. Fun times.


I sort of assume the main benefit of assuming no NaNs is being able to replace `a == a` with `true`?

The handling of isnan is certainly a big question. I can see wanting to respect that, but I can also see littering your code with assertions that isnan is false and compiling with normal optimizations and then hoping that later recompiling with fast-math and all its attendant "fun, safe optimizations" will let you avoid any performance penalty for all those asserts.


There are also several cases where handling NaNs "correctly" may require a fair amount of extra work (C99's behavior for complex multiplication is far more complicated when NaNs are involved).


I’ve long wondered if changing the name to something like -ffast_inaccurate_math might help stem the tide of surprises and mistakes. Having only “fast” in the name makes it sound like a good thing, rather than a tradeoff to consider.



