Hacker News

zmgsabst, last Sunday at 11:38 AM

I think it’s more subtle:

Imagine that you’re calling an API and the call fails for some reason, say it times out.

“Let it crash” isn’t an argument against handling the timeout. It’s an argument that you should retry only a small, bounded number of times rather than (e.g.) backing off exponentially forever.
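To make the bounded-retry point concrete, here is a minimal Python sketch. The names (`TransientError`, `call_with_bounded_retries`, `max_attempts`, `base_delay`) are made up for illustration and don't come from any particular library; the only point is that the backoff has a hard ceiling, after which the error propagates instead of being swallowed.

```python
import time


class TransientError(Exception):
    """Hypothetical error type for failures worth retrying (e.g. a timeout)."""


def call_with_bounded_retries(call, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Try `call` at most `max_attempts` times, then let the last error propagate."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts:
                # Out of attempts: stop handling it here and crash upward.
                raise
            # Exponential backoff is fine, as long as the attempt count is bounded.
            sleep(base_delay * 2 ** (attempt - 1))
```

Anything that isn't a `TransientError` isn't caught at all, so it crashes on the first attempt.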

When you design from that perspective, you just fail your request processing (returning the request to the queue) and make that your manager’s problem. Your managing process can then restart you, reassign the work to healthy workers, etc. If your manager can’t get things working and the queue overflows, it moves the failed work to a dead-letter queue and crashes itself. That might restart the server, it might page on-call, etc.
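Roughly, the manager side of the paragraph above could look like this. It's a single-process toy, not how any real supervisor works: `supervise`, `ManagerCrashed`, `max_deliveries` and `dead_letter_limit` are all invented for the sketch, a "crash" is just an exception from the worker, and "queue overflows" is approximated by a cap on dead-lettered tasks.

```python
from collections import deque


class ManagerCrashed(Exception):
    """Raised when the manager gives up, escalating to whatever supervises it."""


def supervise(tasks, worker, max_deliveries=3, dead_letter_limit=2):
    """Run `worker` on each task; requeue on a crash, dead-letter after too many tries."""
    queue = deque((task, 0) for task in tasks)
    done, dead_letters = [], []
    while queue:
        task, deliveries = queue.popleft()
        try:
            done.append(worker(task))
        except Exception:
            # The worker "crashed". A real manager would also restart it here.
            if deliveries + 1 < max_deliveries:
                queue.append((task, deliveries + 1))  # return the task to the queue
            else:
                dead_letters.append(task)
                if len(dead_letters) > dead_letter_limit:
                    # Too much failing work: the manager crashes too.
                    raise ManagerCrashed(f"{len(dead_letters)} tasks dead-lettered")
    return done, dead_letters
```

Note the worker contains no health logic at all. It either returns a result or blows up, and everything about retries and escalation lives one level up.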

The core idea is that business logic is the wrong place to handle system health, and that many problems can be solved by routing around them (i.e., giving the task to a healthy worker) or by restarting a process. A process should crash when it isn’t scoped to handle the problem it’s facing (e.g., server OOM, critical dependency offline, bad permissions). Crashing escalates the problem until it reaches somebody, or something, that can resolve it.