I think these are two different layers of "latency". The latency in the article refers to the transport of the audio stream itself, while the latency in your scenario is about how quickly to start responding within that audio stream.
I think he’s saying they are adding an insane level of complexity to shave ~100ms off response times in a scenario where that isn’t important and might even be a negative.
They are orthogonal.
Suppose you have 100ms of audio latency and no wait time. Then a natural pause will trigger a response immediately, but you won't notice it has started until ~200ms later (the round-trip time). Twice as annoying.
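
To make the arithmetic concrete, here is a rough sketch of how the two delays add up. The function name and the model are my own assumptions, not anything from the article: it treats latency as symmetric and ignores processing time entirely.

```python
def perceived_delay_ms(one_way_latency_ms, wait_time_ms):
    """Time from the end of your speech until you hear the reply begin.

    Assumed model (hypothetical, for illustration only):
      - your audio takes one_way_latency_ms to reach the server
      - the server waits wait_time_ms after detecting the pause
      - the reply audio takes one_way_latency_ms to get back to you
    Processing/generation time is ignored.
    """
    return 2 * one_way_latency_ms + wait_time_ms


# The scenario above: 100ms audio latency, no wait time.
print(perceived_delay_ms(100, 0))  # 200
```

The wait time and the transport latency are separate terms in the sum, which is the sense in which they are orthogonal: setting the wait time to zero doesn't remove the round trip, it just means any false trigger goes unnoticed for that long.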