The same idea applies to other parts of the system: compressing the kernel makes it load into RAM faster, even though the CPU still has to decompress it afterwards. Why?
Because reading from disk into RAM is a bigger bottleneck than decompressing on the CPU.
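A back-of-envelope sketch makes the trade-off concrete. All the numbers here are illustrative assumptions (throughputs, sizes, and the compression ratio vary a lot by hardware and compressor), not measurements:

```python
# Illustrative back-of-envelope numbers, not measurements.
DISK_MBPS = 200      # assumed sequential disk read throughput, MB/s
DECOMP_MBPS = 2000   # assumed CPU decompression throughput (output MB/s)

kernel_mb = 30       # assumed uncompressed kernel size
ratio = 0.4          # assumed compressed/uncompressed size ratio

# Plain load: read the full uncompressed image from disk.
plain = kernel_mb / DISK_MBPS

# Compressed load: read fewer bytes from disk, then pay CPU time to decompress.
compressed = (kernel_mb * ratio) / DISK_MBPS + kernel_mb / DECOMP_MBPS

print(f"plain load:      {plain * 1000:.0f} ms")       # ~150 ms
print(f"compressed load: {compressed * 1000:.0f} ms")  # ~75 ms
```

As long as decompression throughput comfortably exceeds disk throughput, the compressed path wins despite the extra work.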
The same goes for algorithms: find the largest bottleneck in your chain of dependent operations and apply changes there, since the rest of the pipeline is waiting on it. Often picking the right algorithm “solves it”, but the bottleneck may be something else entirely, like waiting for I/O or coordination between actors (mutexes, if concurrency is done the traditional way). A cheap first step is simply timing each stage, as in the sketch below.
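A minimal sketch of that first step (the three stages here are hypothetical placeholders for whatever your pipeline actually does):

```python
import time

def timed(stage, fn, *args):
    """Run one pipeline stage and report how long it took."""
    t0 = time.perf_counter()
    result = fn(*args)
    print(f"{stage}: {time.perf_counter() - t0:.3f}s")
    return result

# Each stage depends on the previous one's output, so the slowest
# stage gates the whole run; optimize that one first.
data = timed("load",      lambda: list(range(1_000_000)))
data = timed("transform", lambda d: [x * 2 for x in d], data)
total = timed("reduce",   sum, data)
```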
That’s also part of the counterintuitive take that more concurrency brings more overhead, and not necessarily faster execution (a topic widely discussed a few years ago around async concurrency and immutable data structures).
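A toy demonstration of that overhead (CPython-specific because of the GIL, and the exact timings will vary by machine): splitting lock-guarded work across threads often runs slower than doing it on one thread, because the coordination itself becomes the bottleneck.

```python
import threading
import time

N = 1_000_000
counter = 0
lock = threading.Lock()

def work(n):
    global counter
    for _ in range(n):
        with lock:  # coordination cost paid on every increment
            counter += 1

# One thread: same total work, no contention on the lock.
t0 = time.perf_counter()
work(N)
print(f"1 thread:  {time.perf_counter() - t0:.2f}s")

# Four threads: same total work, but now they contend for the lock
# (and for the GIL), which typically makes this version slower.
counter = 0
t0 = time.perf_counter()
threads = [threading.Thread(target=work, args=(N // 4,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"4 threads: {time.perf_counter() - t0:.2f}s")
```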