logoalt Hacker News

When O3 is 2x slower than O2

95 pointsby keylelast Tuesday at 11:29 PM87 commentsview on HN

Comments

toonewbieyesterday at 6:38 PM

Nice read. Last week I wrote a blog post about two noteworthy cases of O3 being slower than O2 in C++: https://barish.me/blog/cpp-o3-slower/

spacecadet_yesterday at 5:24 PM

In the branchy function, id is only compared if distance is equal, and since distance is a random float, this almost never happens and the corresponding branch is nearly perfectly predicted. The branchless function always compares both id and distance, effectively doing twice the work. It's only part of the reason why there's a 2x performance difference, but I thought it was interesting.

I really don't know how LLVM picks between branches or conditional moves, but my guess is that it doesn't assume that float equality is any less likely than other conditions, and some optimization pass in O3 turns unpredictable branches into conditional moves. I base this on the fact that adding std::hint::unlikely to the "equal" branch produces the same assembly for the function in both modes.

https://godbolt.org/z/erGPKaPcx

Whether it's safe to assume in general that float equality is unlikely for the purpose of program optimization, I'll leave to the compiler engineers. If you know the data your program will be handling, adding hints will avoid these surprises.

show 1 reply
fifilurayesterday at 2:57 PM

It seems to me that if there is one way that is always faster it would hardly deserve a secret setting, it should be the default?

show 1 reply
YesBoxyesterday at 3:53 PM

While working on my game, I tried implementing a custom priority queue.

It ended up being > 2x faster in debug build, but 2x-5x slower in the release build (??!!?) [1]. I haven't learned much about compilers/"lower level" C++, so I moved on at that point.

How it worked:

1.) The P.Q. created a vector and resized it to known bounds.

2.) The P.Q. kept tract of and updated the "active sorting range" each time an element was inserted or popped.

2B.) So each time an element is added, it uses the closest unused vector element and updates the ".end" range to sort

2C.) Each time an element was removed, it updated the ".start" range

3.) In theory this should have saved reallocation overhead.

[1] I believe Visual Studio uses -O0 for debug, and -O2 for release.

show 2 replies
cat_plus_plusyesterday at 11:51 AM

As a denser gas, Ozone would have greater friction getting through small pores, so that would be one example?

show 1 reply
pclmulqdqyesterday at 12:46 PM

-O3 -march=haswell

The second one is your problem. Haswell is 15 years old now. Almost nobody owns a CPU that old. -O3 makes a lot of architecture-dependent decisions, and tying yourself to an antique architecture gives you very bad results.

show 5 replies
johnisgoodyesterday at 11:02 AM

Off-topic, but I seriously dislike the syntax of Rust. It is chaotic and mind-boggling to me. See the "insert" function.

Good post though.

show 11 replies