Hacker News

fragmede, yesterday at 5:41 PM

What if you're the SRE and the code fixes mean the site goes from 99% uptime to 99.9% up? How do you measure the revenue from that?


Replies

kevin_nisbet, yesterday at 6:34 PM

On this side of the equation, I think you start pulling in customer context and downside risk analysis: what is the churn risk of operating at 99% versus 99.9% availability?

If your site is B2B and affects customers' own operations or revenue, you'll likely want to chase the 99.9%: customers won't tolerate the roughly 1.7 hours per week of downtime that 99% implies, and they will churn.
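As a rough check on the downtime figures, here is a minimal sketch that converts an availability target into expected downtime per week. It assumes a 168-hour week and downtime spread evenly over time; the function name is made up for illustration. It puts 99% at about 1.7 hours per week and 99.9% at about 10 minutes.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours


def weekly_downtime_hours(availability: float) -> float:
    """Expected downtime per week, in hours, for an availability fraction (0.99 = 99%)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_WEEK


if __name__ == "__main__":
    for target in (0.99, 0.999):
        hours = weekly_downtime_hours(target)
        print(f"{target:.1%} availability: {hours:.2f} h/week ({hours * 60:.0f} min)")
```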

However, if the value your site creates is tolerant of those sorts of disruptions, where someone is merely inconvenienced and can come back later, a large investment to move from 99% to 99.9% wouldn't be justified; the investment would have essentially no impact. The harder part is the reality that most investments fall somewhere in the middle, with ambiguity about the impact. IIRC, SRE principles do talk about this, in different terms, when setting SLOs.

I've heard some companies refer to the concept as economical thinking, which I think is a great way to think about it. It doesn't mean you'll always get it right; it's more that we embed being conscious of ROI in our work.

I also believe this is an area where several engineers I've observed really struggle, especially when moving from big tech to startups, where it's really easy to import culture from another company. In the earlier stages of startup life, if you don't have product-market fit, it doesn't matter how good your availability is. Attention is a resource; make sure it's allocated to what creates value for the customer.

linkregister, yesterday at 6:14 PM

Depending on whether the site has a direct competitor and non-sticky customers, you can often get accurate loss estimates from outages. For example, friends of mine at DoorDash would know when Uber Eats was down by the corresponding spike in traffic to their app. The competitor captures all the lost traffic.

Most enterprises will have a harder time quantifying losses, as some percentage of customers will come back later. To understand that, you need to look for a drop in completed purchase rates compared to site visits.
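One way to sketch that comparison: take the completed-purchase rate for a normal baseline window and for the incident window, and treat the drop in rate, multiplied by incident-window visits, as a rough estimate of lost purchases. The `Window` type, the function name, and the numbers are all hypothetical, and the estimate ignores customers who come back and buy later.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Traffic totals for a time window (hypothetical shape, for illustration only)."""
    visits: int
    completed_purchases: int

    @property
    def conversion_rate(self) -> float:
        # Completed purchases per site visit; 0.0 when there were no visits.
        return self.completed_purchases / self.visits if self.visits else 0.0


def estimated_lost_purchases(baseline: Window, incident: Window) -> float:
    """Purchases missing in the incident window relative to the baseline rate."""
    drop = baseline.conversion_rate - incident.conversion_rate
    # A negative drop means the incident window converted better; report no loss.
    return max(drop, 0.0) * incident.visits


if __name__ == "__main__":
    baseline = Window(visits=10_000, completed_purchases=300)  # made-up numbers
    incident = Window(visits=2_000, completed_purchases=40)    # made-up numbers
    print(f"estimated lost purchases: {estimated_lost_purchases(baseline, incident):.1f}")
```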

For a SaaS, it's even more difficult, as customers are often held captive by long contracts and might tolerate SLA breaches up to a certain point. A reasonable, though fictional, proxy would be the revenue for the contract pro-rated against the uptime during that period.
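A minimal sketch of that proxy, assuming it means attributing a contract's revenue for a period to downtime in proportion to the fraction of the period the service was unavailable. The function name and the figures are illustrative, not from any real contract.

```python
def prorated_downtime_revenue(contract_revenue: float, uptime: float) -> float:
    """Share of a period's contract revenue attributed to downtime.

    contract_revenue: revenue for the contract over the period.
    uptime: measured availability over the same period, as a fraction (0.99 = 99%).
    """
    if not 0.0 <= uptime <= 1.0:
        raise ValueError("uptime must be between 0 and 1")
    return contract_revenue * (1.0 - uptime)


if __name__ == "__main__":
    # Hypothetical: a 120,000 annual contract measured at 99% vs 99.9% uptime.
    for uptime in (0.99, 0.999):
        print(f"{uptime:.1%} uptime: {prorated_downtime_revenue(120_000, uptime):,.0f} attributed to downtime")
```

Under this proxy, the gap between the two uptime levels is the revenue figure an SRE could point to, with the caveat that it is an accounting fiction, not money the customer actually withheld.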
