And if you know in advance that a function will be in the critical path, and it needs to perform some operation on N items, and N will be large, it’s not premature to consider the speed of that loop.
Another thought: many (most?) of these "rules" predate widespread distributed computing. I don't think Knuth had in mind a loop that reads from a database at 100ms per iteration.
I've seen people write some really head-shaking code that makes remote calls in a loop that don't actually depend on each other. I wonder to what extent they are thinking "don't bother with optimization / speed for now".
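The fix for that pattern is usually trivial: let the independent calls overlap instead of serializing them. A minimal Python sketch; `fetch` and its 100ms `sleep` are made-up stand-ins for a real remote call:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(item):
    """Stand-in for a remote call; the sleep models ~100ms of network latency."""
    time.sleep(0.1)
    return item * 2

items = list(range(10))

# Sequential loop: ~10 * 100ms of wall time, even though no call
# depends on any other.
start = time.perf_counter()
sequential = [fetch(i) for i in items]
seq_elapsed = time.perf_counter() - start

# Concurrent version: the independent calls overlap, so wall time
# is roughly one call's latency.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    concurrent = list(pool.map(fetch, items))
conc_elapsed = time.perf_counter() - start

print(f"sequential: {seq_elapsed:.2f}s, concurrent: {conc_elapsed:.2f}s")
```

Same results either way; only the wall-clock time changes, which is exactly why writing the sequential version isn't "avoiding premature optimization", it's just leaving latency on the table.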
Just remember Rob Pike's first rule: don't assume where bottlenecks will occur; verify it.
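In Python, verifying is a few lines with the standard-library profiler. A sketch with a made-up workload (`slow_part` and `fast_part` are illustrative, not from any real codebase):

```python
import cProfile
import io
import pstats

def slow_part():
    # Deliberately heavy: this is where the time actually goes.
    return sum(i * i for i in range(200_000))

def fast_part():
    # Cheap, even if it "looks" like the hot spot at a glance.
    return sum(range(1_000))

def workload():
    slow_part()
    fast_part()

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Sort by cumulative time: the real bottleneck shows up at the top,
# regardless of where we guessed it would be.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
print(buf.getvalue())
```

The point is the workflow, not this toy: profile first, then optimize the function the report actually names.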
From my understanding you still need to care about algorithms and architecture. If N is sufficiently large, you should pick an O(N) algorithm over an O(N^2) one. But there is usually a tradeoff: simple code (or hiding something behind an abstraction) might be easier to understand and maintain but slower on large inputs, and vice versa. I would rather write code that is easy to optimize once a bottleneck shows up than optimize it over-aggressively up front.

Also, different code needs different kinds of optimization. Sometimes the code is IO heavy (disk / DB or network); for that kind of code, planning the IO operations and caching are more critical than shaving raw CPU time. Sometimes the input is too small to have any significant performance effect, and, paradoxically, choosing smarter algorithms might even hurt performance (alongside maintainability). For example, for 10-100 items a simple linear scan over an array might be faster than an O(log n) binary search tree.

It's also well known that faster executable code (whether hand-written or machine-generated, high-level or machine code) usually has a larger size, mostly because it's more "unrolled", duplicated, and more complex when advanced algorithms are used. If you optimize for speed everywhere, the binary size tends to increase, causing more cache misses, which might hurt performance more than help it. This is why large software often needs some profiling rather than simply passing -O3.
Charles Eames said that "design depends largely on constraints".
If you know in advance, then one of the constraints on the design of that function is that it's on the critical path and that it has to meet the specification.
You're stating the obvious: you need to consider its implementation.
But that's not the same as "I have a function that is not on the critical path and the performance constraint is not the most important, but I'll still spend time on optimizing it instead of making it clear and easy to understand."