Hacker News

You Can't Fool the Optimizer

141 points by HeliumHydride · today at 12:14 PM · 73 comments

Comments

stabbles · today at 1:13 PM

If you enjoy blog posts like this, you would definitely like the Julia REPL as well. I used to play with it a lot to discover things about the compiler.

For example:

    $ julia
    julia> function f(n)
             total = 0
             for x in 1:n
               total += x
             end
             return total
           end
    julia> @code_native f(10)
        ...
        sub    x9, x0, #2
        mul    x10, x8, x9
        umulh    x8, x8, x9
        extr    x8, x8, x10, #1
        add    x8, x8, x0, lsl #1
        sub    x0, x8, #1
        ret
        ...
It shows this with nice colors right in the REPL.

In the example above, you see that LLVM figured out the arithmetic series and replaced the loop with a simple multiplication.
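
The same transformation is easy to reproduce in C++; a minimal analogue of the Julia function above (function name mine):

    unsigned sum_to(unsigned n) {
        unsigned total = 0;
        for (unsigned x = 0; x < n; ++x)
            total += x + 1;  // sum of 1..n
        return total;        // clang/gcc at -O2 emit the closed form n*(n+1)/2
    }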

Findecanor · today at 3:58 PM

I'm wondering how the compiler optimised add_v3() and add_v4() though.

Was it through "idiom detection", i.e. by recognising those specific patterns, or did the compiler deduce the answers through some more involved analysis?

anon-3988 · today at 3:16 PM

What I am curious about is: is the compiler smart enough to be lazy with computations and/or variables? For example, consider:

    let a = expr;
    let b = expr2;

    if (a || b) { return true; }

Is the compiler allowed to compute this lazily if that is indeed faster? Or take declaring a bunch of variables that may or may not be used in all of the branches: is the compiler smart enough to only compute them when necessary? AFAIK this is not allowed in C-like languages; things have to materialize. Another one: I like to do a memcpy every single time, even though it might not even be used, or might be overwritten by other memcpys. Is the compiler smart enough to not perform those, and to reorder my program so that only the last relevant memcpy is performed?

A lot of times, my code becomes ugly because I don't trust that it does any of this. I would like to write code in consistent and simple ways, but I need compilers to be much smarter than they are today.
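
Concretely, for the memcpy case, something like this is what I have in mind (names hypothetical); dead-store elimination should delete the first copy:

    #include <cstring>

    void copy_twice(char *dst, const char *a, const char *b) {
        std::memcpy(dst, a, 64);  // fully overwritten below, never read
        std::memcpy(dst, b, 64);  // DSE should remove the first memcpy
    }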

A recent bad example is something like:

    const S * s =;

    let a = constant; let b = constant; let c = constant;
    let d = constant; let e = constant; let f = constant;
    let g = constant; let h = constant; let i = constant;
    let j = constant; let k = constant; let l = constant;

    if (s->a == a && s->b == b /* etc */ ) { return true; }

It did not turn all of this into a SIMD mask or something like that.
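
A sketch of the kind of rewrite that does get the wide comparison, assuming a padding-free struct (names and constants hypothetical):

    #include <cstring>

    struct S { int a, b, c, d; };  // no padding between fields

    bool matches(const S *s) {
        static const S expected = {1, 2, 3, 4};
        // One 16-byte compare instead of field-by-field tests;
        // valid only because S has no padding bytes.
        return std::memcmp(s, &expected, sizeof(S)) == 0;
    }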

jagged-chisel · today at 12:35 PM

I always code with the mindset “the compiler is smarter than me.” No need to twist my logic around attempting to squeeze performance out of the processor - write something understandable to humans, let the computer do what computers do.

senfiaj · today at 3:45 PM

I wonder if compilers do multiple passes over the intermediate code in order to optimize / simplify it. For example, during each pass the optimizer searches for known hardcoded patterns, replaces them with something else, and repeats until no further improvement is found.
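
Something like this toy sketch is what I have in mind (not real compiler internals; in practice LLVM runs a mostly fixed-order pipeline, though passes like InstCombine do iterate internally until nothing changes):

    #include <vector>

    struct IR { /* toy stand-in for the intermediate representation */ };

    using Pass = bool (*)(IR &);  // a pass reports whether it changed the IR

    // Rerun every pass until a full sweep changes nothing, since one
    // pass's rewrite can expose opportunities for another.
    void optimize(IR &ir, const std::vector<Pass> &passes) {
        bool changed = true;
        while (changed) {
            changed = false;
            for (Pass p : passes)
                changed |= p(ir);
        }
    }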

Also, optimizers have limits; they can't reason as abstractly as humans can. For example:

  bool is_divisible_by_6(int x) {
      return x % 2 == 0 && x % 3 == 0;
  }

  bool is_divisible_by_6_optimal(int x) {
      return x % 6 == 0;
  }
I tried with both gcc and clang; the asm for is_divisible_by_6 is still less optimal. So no, there are many easy ways to fool the optimizer by obfuscation.

The moral is that you still have to optimize algorithms (big-O) and math operations / expressions.

derefr · today at 3:51 PM

Even better / potentially more surprising:

    unsigned mult(unsigned x, unsigned y) {
        unsigned y0 = y;
        while (x--) y = add_v1(y, y0);
        return y;
    }
optimizes to:

    mult(unsigned int, unsigned int):
      madd w0, w1, w0, w1
      ret
(and this produces the same result when substituting any of the `add_vN`s from TFA)
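
(For context, TFA defines several obfuscated additions; a plausible stand-in for add_v1, not the article's exact code, is a plain increment loop:)

    // Hypothetical stand-in for TFA's add_v1
    unsigned add_v1(unsigned x, unsigned y) {
        while (y--) x++;  // add one, y times
        return x;
    }
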
alecco · today at 3:53 PM

Optimizing some critical code section and beating the compiler gives you a rush. But the last time for me was around 8 years ago. They've gotten really good.

Only higher-level stuff like data reordering and latency hiding is left, at least for me. And even some of that can be automated with profile-guided optimization.

Scene_Cast2 · today at 1:11 PM

This post assumes C/C++ style business logic code.

Anything HPC will benefit from thinking about how things map onto hardware (or, in case of SQL, onto data structures).

I think way too few people use profilers. If your code is slow, profiling is the first tool you should reach for. Unfortunately, the state of profiling tools outside of NSight and Visual Studio (non-Code) is pretty disappointing.
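
On Linux, for example, perf makes that first pass cheap (binary name hypothetical):

    $ perf record -g ./your_program   # sample with call stacks
    $ perf report                     # interactive hotspot browser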

matja · today at 1:29 PM

You can fool the optimizer, but you have to work harder to do so:

    unsigned add(unsigned x, unsigned y) {
        unsigned a, b;
        do {
            a = x & y;
            b = x ^ y;
            x = a << 1;
            y = b;
        } while (a);
        return b;
    }
becomes (with armv8-a clang 21.1.0 -O3):

    add(unsigned int, unsigned int):
    .LBB0_1:
            ands    w8, w0, w1
            eor     w1, w0, w1
            lsl     w0, w8, #1
            b.ne    .LBB0_1
            mov     w0, w1
            ret
gpderetta · today at 3:44 PM

Interesting, even this can't fool the optimizer (tried with a recent gcc and clang):

  #include <vector>

  unsigned add(unsigned x, unsigned y) {
      std::vector vx{x};  // CTAD deduces std::vector<unsigned>
      std::vector vy{y};
      auto res = vx[0] + vy[0];
      return res;
  }
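
Presumably that works because the allocations aren't observable: since C++14 the compiler is allowed to elide new/delete pairs entirely. The same effect in miniature (name mine):

    // The new/delete pair is not observable, so the optimizer may remove
    // it along with the heap traffic; this typically compiles to x + 1.
    unsigned via_heap(unsigned x) {
        unsigned *p = new unsigned(x);
        unsigned r = *p + 1;
        delete p;
        return r;
    }
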
jmcomets · today at 2:49 PM

Obvious caveat: pushing this a bit further, it can quickly fall back to the default case. The optimizer is a superpower, but you still need to try to write efficient code.

    unsigned add_v5(unsigned x, unsigned y) {
      if (x == y) return 2 * x;
      return x + y;
    }
Results in:

    add_v5(unsigned int, unsigned int):
      lsl w8, w0, #1
      add w9, w1, w0
      cmp w0, w1
      csel w0, w8, w9, eq
      ret
(armv8-a clang 21.1.0 with -O3)

If any compiler folks can chime in: I'm curious why incrementing in a loop can be unrolled and analyzed down to an addition, but doubling the number when both operands are equal can't be folded away?
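
For what it's worth, the two branches are semantically identical: under the guard x == y, 2*x == x + x == x + y, so the whole function could in principle fold to a plain addition. The compiler would have to propagate the equality from the branch condition into the arithmetic to see it. The equivalent (name mine):

    // Equivalent to add_v5 for all inputs: when x == y,
    // 2*x == x + x == x + y, so both branches compute x + y.
    unsigned add_v5_folded(unsigned x, unsigned y) {
        return x + y;
    }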

toonewbie · today at 12:35 PM

Sometimes you can fool the compiler :-)

See "Example 2: Tricking the compiler" in my blog post about O3 sometimes being slower than O2: https://barish.me/blog/cpp-o3-slower/

sureglymop · today at 12:47 PM

With this one I instead wondered: if there are 4 functions doing exactly the same thing, couldn't the compiler generate the code for just one of them?

E.g. if in `main` you called two different add functions, couldn't it optimize one of them away completely?

It probably shouldn't do that if you're creating a dynamic library that needs a symbol table, but for a plain ELF executable it could, no? Why doesn't it do that?
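
(One likely reason it doesn't happen by default: C++ guarantees that distinct functions have distinct addresses, so merging bodies is observable through function pointers. Linkers will do it if you opt in to identical code folding; a sketch assuming lld, file name hypothetical:)

    $ clang++ -O2 -ffunction-sections -fuse-ld=lld -Wl,--icf=all main.cpp
    # --icf=safe is the conservative variant that preserves address identity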

Scubabear68 · today at 3:42 PM

I liked the idea behind this post, but in my opinion the author missed the mark fairly widely.

The extent to which you can "fool the optimizer" is highly dependent on the language and the code you're talking about. Python is a great example of a language that is devilishly hard to optimize, precisely because of its semantics. C and C++ are entirely different examples with entirely different optimization issues, which usually have to do with pointers and references and what the compiler is allowed to infer.

The point? Don't just assume your compiler will magically make all your performance issues go away and produce optimal code. Maybe it will, maybe it won't.

As always, the main performance lessons should always be "1) Don't prematurely optimize", and "2) If you see perf issues, run profilers to try to definitively nail where the perf issue is".

317070 · today at 12:43 PM

"The compiler" and "The optimizer" are doing a lot of the heavy lifting here in the argument. I definitely know compilers and optimizers which are not that great. Then again, they are not turning C++ code into ARM instructions.

You absolutely can fool a lot of compilers out there! And I am not only looking at you, NVCC.

torginus · today at 1:46 PM

Awesome blog post. Thanks to it I found out that you can view what the LLVM optimizer pipeline does, and which pass is actually responsible for which transformation.

It's super cool to see this in practice, and it helps me put more trust in the compiler to do the right thing, rather than trying to micro-optimize my code and peppering inline qualifiers everywhere.
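
For anyone wanting to try this locally rather than on Compiler Explorer, one way with an LLVM toolchain (file name hypothetical):

    $ clang++ -O2 -c file.cpp -mllvm -print-after-all 2> passes.log

passes.log then contains the IR after every pass, so you can see which pass performed each transformation.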

Joker_vD · today at 1:45 PM

Wait, why does GAS use Intel syntax for ARM instead of AT&T? Or something that looks very much like it: the destination is the first operand, not the last, and there is no "%" prefix for the register names?

amelius · today at 12:47 PM

One undesirable property of optimizers is that in theory one day they produce good code and the next day they don't.

asah · today at 1:11 PM

I want an AI optimization helper that recognizes patterns that could almost be optimized if I gave it a little help, e.g. hints about usage, types, etc.

dlenski · today at 1:54 PM

Today I learned that Matt Godbolt is British!

mkornaukhov · today at 1:07 PM

Better tell me how to make the compiler not fool me!

raverbashing · today at 1:37 PM

I'm curious what the theorem-proving magic behind add_v4 is, and whether it happens prior to LLVM IR.

daft_pink · today at 12:47 PM

Is this an argument for compiled code?
