logoalt Hacker News

rdohertytoday at 9:02 PM2 repliesview on HN

This is probably one of the best summarizations of the past 10 years of my career in SRE. Once your systems get complex enough, something is always broken and you have to prepare for that. Detection & response become just as critical as pre-deploy testing.

I do worry about all the automation being another failure point, along with the IaC stuff. That is all software too! How do you update that safely? It's turtles all the way down!


Replies

wparadtoday at 9:52 PM

Thank you!

One of the question I frequently get is "do you automatically rollback". And I have hide in the corner and say "not really". Often, if you knew a rollback would work, you probably could also have known to not roll out in the first place. I've seen a lot of failures that only got worse when automation attempted to turn the thing on and off again.

Luckily from an automation roll-out standpoint, it's not that much harder to test in isolation. The harder parts to validate are things like "Does a Route 53 Failover Record really work in practice at the moment we actually need it to work?"

Usually the answer is yes, but then there's always the "but it too could be broken", and as you said, it's turtles all the way down.

The nice part is realistically, the automation for dealing with rollout and IaC is small and simple. We've split up our infrastructure to go with individual services, so each piece of infra is also straight forward.

In practice, our infra is less DRY and more repeated, which has the benefit of avoiding complexity that often comes from attempting to reduce code duplication. The ancillary benefit is that, simple stuff changes less frequently. Less frequent changes because less opportunity for issues.

Not-surprisingly, most incidents comes from changes humans make. Where the second most amount of incidents come from assumptions humans make about how a system operates in edge conditions. If you know these two things to be 100% true, you spend more time designing simple systems and attempting to avoid making changes as much as possible, unless it is absolutely required.

evanmorantoday at 9:16 PM

Iac is definitely a failure point, but the manual alternative is much worse! I’ve had a lot of benefit from using pulumi, simply because the code can be more compact than the terraform hcl was.

For example, for the fall over regions (from the article) you could make a pulumi function that parameterizes only the n things that are different per fall over env and guarantee / verify the scripts are nearly identical. Of course, many people use modules / terragrunt for similar reasons, but it ends up being quite powerful.

show 3 replies