It's not about outperforming the compiler; it's about being comfortable measuring where your clock cycles are spent, and for that you first need to be comfortable thinking about timing at the scale of clock cycles. You're not expected to rewrite the program in assembly. But given an instruction, you should have a general idea of what its execution entails and where the data is actually coming from. Reads that go over different buses have different timings.
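To make the "where is the data coming from" point concrete, here is a minimal C++ sketch that times the same sum over the same array twice, changing only the order in which memory is visited. The names `strided_sum`, `time_sum` and `Timing` are made up for this illustration, and `std::chrono::steady_clock` is only a coarse stand-in for real cycle counters or a profiler.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: sums every element of `data` exactly once, visiting
// indices start, start + stride, start + 2 * stride, ... for each start in
// [0, stride). stride == 1 walks memory sequentially; a large stride jumps
// across cache lines on every access.
std::uint64_t strided_sum(const std::vector<std::uint32_t>& data, std::size_t stride) {
    assert(stride >= 1 && "stride must be at least 1");
    std::uint64_t total = 0;
    for (std::size_t start = 0; start < stride; ++start) {
        for (std::size_t i = start; i < data.size(); i += stride) {
            total += data[i];
        }
    }
    return total;
}

// Hypothetical result type: the computed sum plus the average wall-clock
// time spent per element.
struct Timing {
    std::uint64_t sum;
    double nanoseconds_per_element;
};

// Times one call to strided_sum with a monotonic clock. A single run is
// noisy; repeat it and compare medians before drawing any conclusion.
Timing time_sum(const std::vector<std::uint32_t>& data, std::size_t stride) {
    const auto begin = std::chrono::steady_clock::now();
    const std::uint64_t sum = strided_sum(data, stride);
    const auto end = std::chrono::steady_clock::now();
    const double ns = std::chrono::duration<double, std::nano>(end - begin).count();
    return {sum, data.empty() ? 0.0 : ns / static_cast<double>(data.size())};
}
```

Calling `time_sum(data, 1)` and `time_sum(data, 4096)` on a vector much larger than your last-level cache does the same arithmetic and returns the same sum, so any difference in `nanoseconds_per_element` comes from how the reads are served. How large that difference is depends on the machine, the compiler and the optimization flags, so treat the numbers as something to measure rather than assume.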
Compilers make mistakes too, and they can emit badly wrong code. But that's a different topic.
Excellent corrective summary.
"Compilers can do all these great transformations, but they can also be incredibly dumb"
- Mike Acton, CppCon 2014