There are a few stages, and each improves on the previous ones:
1. Detect crashes at runtime and by default stop/crash to prevent continuing with invalid program state
3. Detect crashes at runtime and handle them according to the business context (e.g. crash, retry, fall back, ...) to prevent crashes from causing bad UX.
3. Detect potential crashes at compile-time to prevent the dev from forgetting to handle them according to the business context
4. Detect not just the possibility of a crash but also its specific type and context, to prevent the dev from making a logical mistake while handling it according to the business context and introducing a new potential runtime error
An example of stage 4 would be the compiler checking that a fallback option will actually always resolve the error and not potentially introduce a new error or error type. Falling back to another URL, for instance, does not always resolve the problem: there still needs to be handling for when the request to the alternative URL fails.
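A rough Elixir sketch of that stage-4 point (the module, function names and URLs are all invented here, and :inets must be started for :httpc to work): falling back to a second URL only changes which request fails, so the caller is still left with an {:error, _} case to handle, which is exactly what a stage-4 compiler would force you to notice.

```elixir
defmodule Fallback do
  def fetch_with_fallback(primary_url, fallback_url) do
    case fetch(primary_url) do
      {:ok, body} ->
        {:ok, body}

      {:error, _primary_reason} ->
        # Falling back only changes *which* request failed; callers still
        # have to handle an {:error, _} coming out of this clause.
        fetch(fallback_url)
    end
  end

  defp fetch(url) do
    # Uses Erlang's built-in HTTP client; requires :inets.start() beforehand.
    case :httpc.request(:get, {String.to_charlist(url), []}, [], []) do
      {:ok, {{_, 200, _}, _headers, body}} -> {:ok, body}
      {:ok, {{_, status, _}, _headers, _body}} -> {:error, {:http_status, status}}
      {:error, reason} -> {:error, reason}
    end
  end
end
```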
The philosophy described in the article is basically just stage 1 and a (partial) default restart instead of a default crash, which is maybe a slight improvement but not really sufficient, at least not by my personal standards.
https://erlang.org/pipermail/erlang-questions/2003-March/007...
The origin, as far as I know. I think it still holds and is insightful as a general case. "Let it heal" seems pretty close to what Joe was getting at.
How does restarting the process fix the crash? If the process crashed because a file was missing, it will still be missing when the process is restarted. Is an infinite crash-loop considered success in Erlang?
"Let it crash" is a sentence that gets attention. It makes a person want to know more about it, as it sounds controversial and different. "Let it heal" doesn't have that.
It is very common to interpret taglines at face value, and I believe the author did just that, although the point brought up is valid.
In order to “let it crash”, we must design the system in a way that crashes would not be catastrophic, stability wise. Letting it crash is not a commandment, though: it is a reminder that, in most cases, a smart healing strategy might be overkill.
Ah this makes sense. I always thought "let it crash" made it sound like Elixir devs just don't bother with error checking, like writing Java without any `catch`es, or writing Rust that only uses `.unwrap()`.
If they just mean "processes should be restartable" then that sounds way more reasonable. Similar idea to this but less fancy: https://flawless.dev/
It's a pretty terrible slogan if it makes your language sound worse than it actually is.
This is great, thanks for sharing! I've been thinking about improving error handling in my liveview app and this might be a nice way to start.
A condition that "should not happen" might still be a problem specific to a particular request. If you "just crash", this turns a request that would only trigger an HTTP 500 response into one that crashes the process. That increases the risk of Query of Death scenarios where the frontend that needs to serve this particular request starts retrying it against different backends and triggers restarts faster than the processes come back up.
So being too eager to "just crash" may turn a scenario where you fail to serve 1% of requests into a scenario where you serve none because all your processes keep restarting.
Question as a complete outsider: if I run idempotent Python applications in Kubernetes containers and they crash, Kubernetes will eventually restart them. Of course, knowing what to do on IO errors is nicer than destroying and restarting everything with a much bigger hammer (as the article also mentions, you can serve a better error message for whoever has to “deal” with the problem), but eventually they should end up in the same workable state.
Is this conceptually similar, but perhaps at code-level instead?
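For comparison, a minimal Elixir sketch of the code-level version (the :counter name is made up): a supervisor restarting a single crashed process plays roughly the role Kubernetes plays for a crashed pod, just per lightweight process and within the same VM, so the restart is nearly instant.

```elixir
children = [
  %{
    id: :counter,
    # restart: :permanent is roughly a per-process restartPolicy: Always
    start: {Agent, :start_link, [fn -> 0 end, [name: :counter]]},
    restart: :permanent
  }
]

{:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_one)

# Kill the process on purpose; the supervisor restarts it with fresh state.
Process.exit(Process.whereis(:counter), :kill)
Process.sleep(100)
Agent.get(:counter, & &1)   # => 0 again, now served by a brand-new process
```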
This seems specific to BEAM, as crashing a FastCGI process is fine and the response will still be handled correctly by Apache or nginx.
Unix/BSD -> Crash, fix, restart.
GNU/MIT/Lisp -> Detect, offer a fix, continue.
I don't code in Erlang or Elixir, aside from messing about. But I've found that letting an entire application crash is something that I can do under certain circumstances, especially when "you have a very big problem and will not go to space today". For example, if there's an error reading some piece of data that's in the application bundle and is needed to legitimately start up in the first place (assets for my game for instance). Then upon error it just "screams and dies" (spits out a stack trace and terminates).
“Reset on error” might be a better phrasing.
I think a lot of folks who have never looked at Erlang or Elixir and BEAM before misunderstand this concept because they don't understand how fine-grained processes are, or can be, in Erlang. A very important note: Processes in BEAM languages are cheap, both to create and for context switching, compared to OS threads. While design-wise they offer similar capabilities, this cost difference results in a substantially different approach to design in Erlang than in systems where the cost of introducing and switching between threads is more expensive.
In a more conventional language where concurrency is relatively expensive, and assuming you're not an idiot who writes 1-10k SLOC functions, you end up with functions that have a "single responsibility" (maybe not actually a single responsibility, but closer to it than having 100 duties in one function) near the bottom of your call tree, but they all exist in one thread of execution. In a hypothetical system built in this model, if your lowest-level function is something like:
retrieve_data(db_connection, query_parameters) -> data
And the database connection fails: would you attempt to restart the database connection in this function? Maybe, but that'd be bad design. You'd most likely raise an exception or change the signature so you could express an error return; in Rust and similar languages it would become something like:
retrieve_data(db_connection, query_parameters) -> Result<data, error>
Somewhere higher in the call stack you have a handler which will catch the exception or process the error and determine what to do. That is, the function `retrieve_data` crashes: it fails to achieve its objective and does not attempt any corrective action (beyond maybe a few retries in case the error is transient).

In Erlang, you have a supervision tree which corresponds to this call tree concept, but for processes. The process handling data retrieval, having been given some db_conn handle and the parameters, will fail for some reason. Instead of handling the error in this process, the process crashes. The failure condition is passed to the supervisor, which may or may not have a handler for this situation.
You might put the simple retry policy in the supervisor (that basic assumption of transient errors: maybe a second or third attempt will succeed). It might have other retry policies, like trying the request again but with a different db_connection (that other one must be bad for some reason, perhaps the db instance it references is down). If it continues to fail, then this supervisor will either handle the error some other way (signaling to another process that the db is down, so it can fix it or tell the supervisor what to do) or perhaps crash itself. This repeats all the way up the supervision tree; ultimately it could mean bringing down the whole system if the error propagates to a high enough level.
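A hedged sketch of what that could look like in Elixir (FakeDB, DataFetcher and DataSupervisor are invented names, and FakeDB always fails to keep the example short): the worker matches only on success, so a failed query crashes it, and the supervisor's restart limits act as the basic retry policy before the failure escalates further up the tree.

```elixir
defmodule FakeDB do
  # Stand-in for a real connection; a real system would use a db driver.
  def query(_conn, _params), do: {:error, :connection_closed}
end

defmodule DataFetcher do
  use GenServer

  def start_link(conn), do: GenServer.start_link(__MODULE__, conn, name: __MODULE__)

  @impl true
  def init(conn), do: {:ok, conn}

  @impl true
  def handle_call({:retrieve, params}, _from, conn) do
    # "Let it crash": match only on success. An {:error, _} result raises a
    # MatchError, this process exits, and the supervisor decides what's next.
    # (The caller of GenServer.call sees the exit too and can handle it there.)
    {:ok, data} = FakeDB.query(conn, params)
    {:reply, data, conn}
  end
end

defmodule DataSupervisor do
  use Supervisor

  def start_link(conn), do: Supervisor.start_link(__MODULE__, conn, name: __MODULE__)

  @impl true
  def init(conn) do
    children = [{DataFetcher, conn}]

    # The restart policy *is* the retry policy: up to 3 restarts in 5 seconds,
    # after which this supervisor itself exits and its own parent takes over.
    Supervisor.init(children, strategy: :one_for_one, max_restarts: 3, max_seconds: 5)
  end
end
```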
This is conceptually no different than how errors and exceptions are handled in sequential, non-concurrent systems. You have handlers that provide mechanisms for retrying or dealing with the errors, and if you don't the error is propagated up (hopefully you don't continue running in a known-bad state) until it is handled or the program crashes entirely.
In languages that offer more expensive concurrency (traditional OS threads), the cost of concurrency (in memory and time) means you end up with a policy that sits somewhere between Erlang's and that of a straight-line sequential program. Your threads will be larger than Erlang processes so they'll include more error handling within themselves, but ultimately they can still fail and you'll have a supervisor of some sort that determines what happens next (hopefully).
As more languages move to cheap concurrency (Go's goroutines, Java's virtual threads), system designs have a chance to shift closer to Erlang than that straight-line sequential approach if people are willing to take advantage of it.
Hackers also love auto-restarting services.
Exploitation of vulnerabilities isn’t always 100% reliable. Heap grooming might be limited or otherwise inadequate.
A quick automatic restart keeps them in business without any other human interaction involved.
There's really not much more that's useful to say than what the relevant section (4.4) of Joe Armstrong's thesis says:
>How does our philosophy of handling errors fit in with coding practices? What kind of code must the programmer write when they find an error? The philosophy is let some other process fix the error, but what does this mean for their code? The answer is let it crash. By this I mean that in the event of an error, then the program should just crash. But what is an error? For programming purpose we can say that:
>• exceptions occur when the run-time system does not know what to do.
>• errors occur when the programmer doesn’t know what to do.
>If an exception is generated by the run-time system, but the programmer had foreseen this and knows what to do to correct the condition that caused the exception, then this is not an error. For example, opening a file which does not exist might cause an exception, but the programmer might decide that this is not an error. They therefore write code which traps this exception and takes the necessary corrective action.
>Errors occur when the programmer does not know what to do. Programmers are supposed to follow specifications, but often the specification does not say what to do and therefore the programmer does not know what to do.
>[...]
>The defensive code detracts from the pure case and confuses the reader—the diagnostic is often no better than the diagnostic which the compiler supplies automatically.
Note that this "program" is a process. For a process doing work, encountering something it can't handle is an error per the above definitions, and the process should just die, since there's nothing better for it to do; for a supervisor process supervising such processes-doing-work, "my child process exited" is an exception at worst, and usually not even an exception since the standard library supervisor code already handles that.
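In Elixir terms, Armstrong's file example might look roughly like this (the file names are made up for illustration):

```elixir
defmodule ConfigLoader do
  # Foreseen and specified: a missing override file is an exception we trap,
  # not an error, because the spec says to fall back to defaults.
  def load_override do
    case File.read("config.override.json") do
      {:ok, contents} -> contents
      {:error, :enoent} -> "{}"
    end
  end

  # Unforeseen / unspecified: the spec assumes this file exists, so don't
  # guess. Match on success only and let the process crash if it's missing.
  def load_required do
    {:ok, contents} = File.read("config.required.json")
    contents
  end
end
```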
https://fsharpforfunandprofit.com/rop/
Railway oriented programming to the rescue?
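Elixir's closest everyday analog to that two-track style is arguably `with`: each step returns {:ok, _} or {:error, _}, and the first {:error, _} short-circuits. A small sketch with invented module and function names:

```elixir
defmodule Signup do
  def run(params) do
    with {:ok, email} <- validate_email(params),
         {:ok, user} <- create_user(email) do
      {:ok, user}
    end
    # Any non-matching clause (i.e. {:error, reason}) falls through unchanged,
    # which is the "failure track".
  end

  defp validate_email(%{"email" => email}) when is_binary(email), do: {:ok, email}
  defp validate_email(_), do: {:error, :invalid_email}

  defp create_user(email), do: {:ok, %{email: email}}
end
```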
The truth is that different errors have to lead to different results if you want a good organisational outcome. These could be:
- Fundamental/Fatal error: something without which the process cannot function, e.g. we are missing an essential config option. Exiting with an error is totally adequate. You can't just heal from that, as it would involve guessing information you don't have. Admins need to fix it.
- Critical error: something that should never occur, e.g. having an active user without a password and email. You don't exit; you skip it if that is possible and ensure the first occurrence is logged and admins are contacted.
- Expected/Regular error: something that is expected to happen during normal operation of the service, e.g. the other server you make requests to is being restarted and is thus unreachable. Here the strategy may vary, but it could be something like retrying with random exponential backoff. Or you could briefly accept that the values provided by that server are unknown and periodically retry to fill in the unknown values. Or you could escalate it into a critical error after a certain number of retries.
- Warnings: these are usually about something not being exactly ideal, but they do not impede the flow of the program at all. Usually they have to do with bad data quality.
If you can proceed without degrading the integrity of the system, you should; the next thing is to decide how important it is for humans to hear about it.
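A rough sketch of routing those classes differently in Elixir (the classification atoms and return values are invented; real code would plug this into whatever retry and alerting machinery already exists):

```elixir
defmodule ErrorPolicy do
  require Logger

  def handle(:fatal, reason) do
    # Missing essential config: stop, a human has to fix it.
    Logger.error("fatal: #{inspect(reason)}")
    System.stop(1)
  end

  def handle(:critical, reason) do
    # "Should never happen": skip the record, make sure it gets seen.
    Logger.error("critical: #{inspect(reason)}")
    :skip
  end

  def handle(:expected, reason) do
    # Normal operational failure: let the caller retry with backoff.
    Logger.warning("expected: #{inspect(reason)}")
    :retry_with_backoff
  end

  def handle(:warning, reason) do
    # Doesn't block the flow; just make it visible.
    Logger.warning("warning: #{inspect(reason)}")
    :continue
  end
end
```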
>When people say “let it crash”, they are referring to the fact that practically any exited process in your application will be subsequently restarted. Because of this, you can often be much less defensive around unexpected errors. You will see far fewer try/rescue, or matching on error states in Elixir code.
I just threw up in my mouth when I read this. I've never used this language, so maybe my experience doesn't apply here, but I'm imagining all the different security implications that I've seen arise from failing to check error codes.
It is very strange that a post trying to explain the concept of "let it crash" in Elixir (which runs on the BEAM VM) does not mention the doctoral thesis of Joe Armstrong: "Making reliable distributed systems in the presence of software errors".
It should be compulsory reading for anybody interested in reliable systems, even if they do not use the BEAM VM.
https://www.diva-portal.org/smash/record.jsf?pid=diva2%3A104...