I've grown to dislike the typical tail measurements completely. What I usually look at instead these days is the share of unique users who have an "unacceptable experience" over a measurement period.
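To make the distinction concrete, here is a minimal sketch of one way such a figure could be computed next to the usual per-request view. It is an assumption, not a description of any particular setup: the `(user_id, latency_ms)` log format, the 1000 ms threshold, and treating "unacceptable experience" as "at least one request over the threshold" are all made up for illustration.

```python
from collections import defaultdict

def share_of_requests_affected(requests, threshold_ms):
    """Fraction of individual requests slower than threshold_ms (the per-request view)."""
    requests = list(requests)
    if not requests:
        return 0.0
    slow = sum(1 for _, latency_ms in requests if latency_ms > threshold_ms)
    return slow / len(requests)

def share_of_users_affected(requests, threshold_ms):
    """Fraction of unique users with at least one request slower than threshold_ms."""
    worst_latency = defaultdict(float)  # worst latency seen per user in the period
    for user_id, latency_ms in requests:
        worst_latency[user_id] = max(worst_latency[user_id], latency_ms)
    if not worst_latency:
        return 0.0
    affected = sum(1 for worst in worst_latency.values() if worst > threshold_ms)
    return affected / len(worst_latency)

# Hypothetical sample: two users, eight requests, one slow request.
sample = [
    ("alice", 120), ("alice", 90), ("alice", 2500), ("alice", 110),
    ("bob", 100), ("bob", 95), ("bob", 130), ("bob", 105),
]

print(share_of_requests_affected(sample, 1000))  # 1 of 8 requests
print(share_of_users_affected(sample, 1000))     # 1 of 2 users
```

The same single slow request shows up as 12.5% of requests but 50% of users, and the gap widens as the number of requests per user grows.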
I find it much more informative and visceral, to the extent that p99 now boggles my mind. Two nines (2N) would be dreadful as an availability figure, yet for UX it's treated very differently. My measurements corroborate exactly that: good UX requires the same many-nines reliability as e.g. DCs, not one or two.
I wonder if p90 and p99 are to blame for the shoddy services we have, in a way. After all, it's pretty hard to argue for fixing something when it's presented as only going wrong 0.5% of the time or less, even if at scale that means most of your users are experiencing it weekly.
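A rough back-of-the-envelope for that last point, assuming (purely for illustration) that each request fails independently at the same 0.5% rate; real failures tend to be correlated, so treat the numbers as indicative only. The function name and the request counts are hypothetical.

```python
def chance_of_at_least_one_failure(per_request_failure_rate, requests_per_period):
    """P(a user sees >= 1 bad request) = 1 - (1 - p)^n, assuming independent requests."""
    return 1.0 - (1.0 - per_request_failure_rate) ** requests_per_period

# How the per-user figure grows with weekly activity at a 0.5% per-request failure rate.
for requests_per_week in (10, 100, 500, 1000):
    chance = chance_of_at_least_one_failure(0.005, requests_per_week)
    print(f"{requests_per_week:>5} requests/week -> {chance:.1%} chance of at least one bad request")
```

Under these assumptions, a user making roughly 140 requests a week is already more likely than not to hit at least one bad request.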
How does one measure unique users here in a way that differs from classic p99? I usually associate p99 with an SLO of some kind, and treat each request as a "unique user" of the service, so at first it seems like the same thing: a p99 SLO says that 1% of users are allowed to experience a response time longer than our maximum acceptable time T, and you're measuring the percentage of requests ("users") that exceed T and trying to keep it below, say, 1%.
Is the difference more about measuring a request "across services"? That is, linking all of a user's requests together and measuring that, so that the total cumulative p99 across services must be small? Or is the difference elsewhere?
If the former: are you taking traces and graphing that? What's your methodology?