Agreed. Strong emphasis on "profiling and identifying the actual bottleneck". Every benchmark will show a nested stack of performance offenders, but interpreting it soundly requires a much deeper understanding of systems in general. My biggest aha moment, years ago, was realizing that even if I removed the function I was trying to optimize, the benchmark output would still show a list of top offenders. Without going into too many details, that minor shift paid dividends: it helped me rebuild my understanding of what benchmarks actually tell us.
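To make the "there is always a top offender" point concrete, here is a minimal sketch with made-up sample counts (the function names and numbers are hypothetical, not taken from any real profile). A sampling profile reports shares of whatever time was spent, so deleting the top entry just promotes the next one:

```python
def shares(samples):
    """Return (name, percent) pairs sorted by descending share of total samples."""
    total = sum(samples.values())
    ranked = sorted(samples.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, 100.0 * count / total) for name, count in ranked]


# Hypothetical sample counts per function (illustration only).
profile = {"parse": 500, "hash_lookup": 300, "memcpy": 150, "log": 50}

before = shares(profile)

# "Optimize away" the top offender entirely.
without_parse = {k: v for k, v in profile.items() if k != "parse"}
after = shares(without_parse)

print(before[0])  # ('parse', 50.0)
print(after[0])   # ('hash_lookup', 60.0)
```

The new top entry looks even worse at 60% than the old one did at 50%, although nothing about it changed. The ranking alone says nothing about whether an entry is actually a problem.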
Yeah ... and it often turns out that a particular function in the profile is just a symptom: a single data point observing system behavior under a given workload, not the root cause. Say a load instruction burns 90% of the CPU cycles waiting on data from memory. The profile pins those cycles on the function containing the load, which gives you the wrong clue about the code that is actually creating the memory bus contention.
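A small sketch of the symptom-versus-cause idea, using only the Python standard library (the function names and the size `n` are made up for illustration). Both runs execute the same `sum_indexed` code, so a profiler would charge the time to that function either way; what differs is the access order chosen by the caller, which is where any extra cache misses come from. In CPython the interpreter overhead hides much of the gap, so treat the timings as illustrative, not as a measurement.

```python
import random
import time
from array import array


def sum_indexed(data, order):
    """Sum data[i] for each i in order; the loads happen here in both runs."""
    total = 0
    for i in order:
        total += data[i]
    return total


def make_orders(n, seed=0):
    """Return a sequential and a shuffled visiting order over range(n)."""
    sequential = array("q", range(n))
    shuffled = array("q", range(n))
    random.Random(seed).shuffle(shuffled)
    return sequential, shuffled


if __name__ == "__main__":
    # Assumption: 2**21 eight-byte items (16 MiB) is larger than many CPU caches;
    # adjust n for your machine.
    n = 1 << 21
    data = array("q", range(n))
    sequential, shuffled = make_orders(n)
    for label, order in (("sequential", sequential), ("shuffled", shuffled)):
        start = time.perf_counter()
        sum_indexed(data, order)
        print(label, round(time.perf_counter() - start, 3), "s")
```

If the shuffled run comes out slower on your machine, the profile will still only say "`sum_indexed` is hot". Nothing in it points at the caller that picked the access order.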
I have to say that until I had a pretty good understanding of CPU internals, the memory subsystem, the kernel, and the hardware in general, reading perf profiles was just a fun exercise that gave me almost no meaningful results.