When a process crashes, its supervisor restarts it according to a restart strategy. The strategy specifies whether to restart only the crashed process, or also its sibling processes in their startup order.
But a supervisor also sets limits, like “10 restarts within 1 second.” Once the limit is exceeded, the supervisor itself crashes. And supervisors have supervisors of their own.
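The section doesn’t name a runtime, but this is the vocabulary of Erlang/OTP supervision. Here is a minimal sketch in Elixir of a supervisor that sets both a restart strategy and a restart limit; the module and worker names are hypothetical:

```elixir
defmodule MyApp.WorkerSupervisor do
  use Supervisor

  def start_link(opts) do
    Supervisor.start_link(__MODULE__, :ok, opts)
  end

  @impl true
  def init(:ok) do
    children = [
      MyApp.WorkerA,
      MyApp.WorkerB
    ]

    # :rest_for_one restarts the crashed child plus every sibling started
    # after it, in startup order; :one_for_one would restart only the
    # crashed child. More than 10 restarts within 1 second and this
    # supervisor gives up and crashes, escalating to its own supervisor.
    Supervisor.init(children, strategy: :rest_for_one, max_restarts: 10, max_seconds: 1)
  end
end
```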
When that happens, the fault cascades upward through the system, triggering progressively broader restarts and state reinitializations, until the top-level supervisor crashes and takes the entire system down with it.
An example might be losing the connection to the database. Failing mid-query is not an expected fault, so you let it crash. That kills the web request, but then the web server ends up crashing too because too many requests failed, and a task runner fails for similar reasons. The logger keeps reporting all of this because it lives in a separate process tree. Eventually the top-level app supervisor restarts the entire thing: it shuts everything down and tries to restart the database connection first. If that works, everything continues; if not, the system crashes completely.
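A hypothetical top-level tree matching that scenario might look like the following (all module names are made up). With :rest_for_one, a crash in the database subtree takes down everything started after it, while the logger subtree, started first, survives the cascade:

```elixir
defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      MyApp.LoggerSupervisor, # separate subtree; keeps running during the cascade
      MyApp.Repo,             # database connection pool
      MyApp.WebServer,        # crashes once too many requests fail
      MyApp.TaskRunner        # fails for the same reason
    ]

    # :rest_for_one: when MyApp.Repo's subtree finally gives up, everything
    # started after it is shut down, then restarted in startup order, with
    # the database connection first. If the connection can't come back, this
    # supervisor exhausts its own restart limit and the whole app dies.
    Supervisor.start_link(children, strategy: :rest_for_one, name: MyApp.Supervisor)
  end
end
```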
Expected faults are not part of “let it crash”: if a user supplies a bad file path or an unreachable network resource, you handle that explicitly. The distinction is subjective and depends on the expectations of the given app. Failing to read an asset bundled with the distribution is both unlikely and unrecoverable, so “let it crash” keeps the happy-path code simpler without giving up fault handling or burying errors deeper in the app or its data.
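A sketch of how that distinction shows up in code, with hypothetical paths: Elixir’s File module offers both a tuple-returning File.read/1 and a raising File.read!/1, so expected faults get an explicit branch while unexpected ones crash the calling process:

```elixir
defmodule MyApp.Assets do
  # Expected fault: user input can be wrong, so handle both outcomes
  # explicitly rather than crashing.
  def read_user_file(path) do
    case File.read(path) do
      {:ok, contents} -> {:ok, contents}
      {:error, reason} -> {:error, "could not read #{path}: #{inspect(reason)}"}
    end
  end

  # Unexpected fault: an asset bundled with the release should always be
  # readable. The bang variant raises on failure, crashing the calling
  # process so its supervisor handles the fault.
  def read_bundled_template! do
    File.read!(Application.app_dir(:my_app, "priv/template.eex"))
  end
end
```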