Both of your examples look like infinite crash loops if your work needs to be correct more than it needs to be available. E.g. there aren't any known good states prior to an unexpected crash; you're just throwing a Hail Mary because the alternatives are impractical.
> there aren't any known good states prior to an unexpected crash
If there aren't any good states then the program straight up doesn't work in the first place, which gets diagnosed pretty quickly before it hits the field.
> your work needs to be correct more than it needs to be available.
"correctness over availability" tends to not be a thing, if you assume you can reach perfect and full correctness then either you never release or reality quickly proves you wrong in the field. So maximally resilient and safe systems generally plan for errors happening and how to recover from them instead of assuming they don't. There are very few fully proven non-trivial programs, and there were even less 40 years ago.
And Erlang / BEAM was designed in a telecom context, so availability is the prime directive. Which is also why distribution is built-in: if you have a single machine and it crashes you have nothing.
If it has no good states, you probably know it before deploying to production.
When a process crashes, its supervisor restarts it according to a restart policy. The policy specifies whether to restart only the crashed process or also its sibling processes, in their startup order.
But a supervisor also sets limits, like “10 restarts in a timespan of 1 second.” Once the limits are exceeded, the supervisor itself crashes. And supervisors have supervisors.
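To make the policy and limits concrete, here is a minimal OTP supervisor sketch (the worker module names are invented for illustration). The strategy picks which children get restarted, and `intensity`/`period` encode the “10 restarts within 1 second” limit from above; exceeding it crashes this supervisor and escalates to its own supervisor.

```erlang
-module(worker_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    SupFlags = #{strategy => one_for_one,  % or rest_for_one | one_for_all
                 intensity => 10,          % at most 10 restarts...
                 period => 1},             % ...within any 1-second window
    %% Hypothetical workers; one_for_one restarts only the crashed one,
    %% rest_for_one would also restart the siblings started after it,
    %% one_for_all would restart all of them.
    Children = [#{id => worker_a, start => {worker_a, start_link, []}},
                #{id => worker_b, start => {worker_b, start_link, []}}],
    {ok, {SupFlags, Children}}.
```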
In this scenario the fault cascades upward through the system, triggering broader restarts and state reinitializations until the top-level supervisor crashes and takes the entire system down with it.
An example might be losing the connection to the database. Failing while querying it isn’t an expected fault, so you let it crash. That kills the web request, but then the web server ends up crashing too because too many requests failed, and then a task runner fails for similar reasons. The logger is still reporting all of this because it’s a separate process tree, and the top-level app supervisor ends up restarting the entire thing. It shuts everything off and tries to restart the database connection; if that works, everything continues, but if not, the system crashes completely.
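A hypothetical shape for that top-level tree (all module names are made up): each child supervisor has its own restart limits, and when the web supervisor exhausts them it exits; `one_for_all` here means the app supervisor shuts the whole branch down and restarts it in startup order, database connection first. The logger lives under a different branch of the root, which is why it keeps reporting throughout.

```erlang
-module(app_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% If web_sup or task_sup crash repeatedly, this supervisor restarts
    %% everything together; if *it* exceeds its own limits, the failure
    %% escalates further up and takes the whole app down.
    {ok, {#{strategy => one_for_all, intensity => 3, period => 30},
          [#{id => db_conn,  start => {db_conn, start_link, []}},
           #{id => web_sup,  start => {web_sup, start_link, []},  type => supervisor},
           #{id => task_sup, start => {task_sup, start_link, []}, type => supervisor}]}}.
```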
Expected faults are not part of “let it crash”: e.g. a user supplying a bad file path or an unreachable network resource. The distinction is subjective and based on the expectations of the given app. Failure to read some asset included in the distribution is both unlikely and unrecoverable, so “let it crash” allows the code to be simpler in the happy path without giving up fault handling or burying errors deeper in the app or data.
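A small sketch of that distinction, assuming a hypothetical app `my_app`: a user-supplied path is an expected fault and gets handled explicitly, while a missing bundled asset is unexpected and unrecoverable, so the code just pattern-matches on success and lets the badmatch crash reach the supervisor.

```erlang
%% Expected fault: the user may well give us a bad path, so handle it.
load_user_file(Path) ->
    case file:read_file(Path) of
        {ok, Bin}       -> {ok, Bin};
        {error, Reason} -> {error, {cannot_read, Path, Reason}}
    end.

%% Unexpected fault: an asset shipped with the app should always be there.
%% If it isn't, crash ("let it crash") instead of threading an error through.
load_bundled_asset(Name) ->
    Path = filename:join(code:priv_dir(my_app), Name),
    {ok, Bin} = file:read_file(Path),  % badmatch -> crash if missing
    Bin.
```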