Considering metrics other than p99 for user impact is unwise. Every user will at some point experience a <1% request. It's not as if half of all users only send requests that land under your median latency; some of their requests will hit your worst case.
By focusing on the tail and optimizing worst cases you help users more than by improving your median latency.
This article contains very little substance. Show me the math!
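For what it's worth, the claim above that nearly every user eventually hits a tail request can be checked with a back-of-the-envelope calculation. This is only a sketch: it assumes each request independently has a 1% chance of landing beyond p99, which real traffic won't satisfy exactly, and the function name is made up for illustration.

```python
def p_at_least_one_tail(n_requests: int, tail_fraction: float = 0.01) -> float:
    """Probability that at least one of n requests falls in the tail.

    Assumes requests are independent and each has the same chance
    (tail_fraction) of being slower than the chosen percentile.
    """
    return 1.0 - (1.0 - tail_fraction) ** n_requests


if __name__ == "__main__":
    # How quickly "1% of requests" turns into "most users".
    for n in (1, 10, 100, 500):
        print(f"{n:>4} requests -> {p_at_least_one_tail(n):.3f}")
```

Under those assumptions, a user who makes 100 requests has roughly a 63% chance of seeing at least one request slower than p99, and at 500 requests it is above 99%.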
I've grown to dislike the typical tail measurements completely. What I usually look at these days is what share of unique users experience an "unacceptable experience" over a measurement period instead.
I find it much more insightful and visceral, to the extent that p99 now boggles my mind. Two nines (2N) would be dreadful as an availability figure, yet for UX it's treated very differently. My measurements corroborate exactly that: good UX requires the same many-nines reliability as e.g. DCs, not one or two.
I wonder if p90 and p99 are to blame for the shoddy services we have, in a way. It's pretty hard to argue for fixing something when it's presented as only going wrong 0.5% of the time or less, after all, even if at scale that means most of your users are experiencing it weekly.
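A minimal sketch of the per-user metric described above, in case it helps make the difference concrete. Everything here is an assumption for illustration: the `(user_id, latency_ms)` log format, the 1000 ms threshold, and counting a user as affected if any single request in the period exceeds it. The actual definition of an "unacceptable experience" may well be different.

```python
def share_of_users_affected(requests, threshold_ms):
    """Fraction of unique users with at least one request over threshold_ms.

    `requests` is an iterable of (user_id, latency_ms) pairs covering one
    measurement period. Both the format and the threshold are hypothetical.
    """
    seen = set()
    affected = set()
    for user_id, latency_ms in requests:
        seen.add(user_id)
        if latency_ms > threshold_ms:
            affected.add(user_id)
    return len(affected) / len(seen) if seen else 0.0


if __name__ == "__main__":
    sample = [
        ("a", 120), ("a", 2500),
        ("b", 90), ("b", 110),
        ("c", 95), ("c", 3000),
        ("d", 80),
    ]
    slow_requests = sum(1 for _, ms in sample if ms > 1000) / len(sample)
    print(f"slow requests:  {slow_requests:.1%}")
    print(f"affected users: {share_of_users_affected(sample, 1000):.1%}")
```

In this toy sample 2 of 7 requests are slow, but 2 of 4 users are affected, which is the kind of gap between the per-request and per-user views that the comment is pointing at.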
Is the formula for E_a[X] trivial? I don't see it immediately...
Interesting that you work at Amazon and show how end-user experience is weighted toward the pessimal experience.
So, apply that to Amazon design heuristics like author-name search on books, where Amazon returns "in the style of" and "not a book but this guy called Charles Dickens makes jigsaws" as high-order matches. Consider how the end-user experience is weighted toward the pessimal, yet Amazon can show that on average they make more money doing this.
(Understood that engineers and AWS don't influence UX in the storefront or search)