Ah this makes sense. I always thought "let it crash" made it sound like Elixir devs just don't bother with error checking, like writing Java without any `catch`es, or writing Rust that only uses `.unwrap()`.
If they just mean "processes should be restartable" then that sounds way more reasonable. Similar idea to this but less fancy: https://flawless.dev/
It's a pretty terrible slogan if it makes your language sound worse than it actually is.
I think the slogan was meant to be provocative but unfortunately it has been misinterpreted more often than not.
For example, imagine you're working with a 3rd party API and, according to the documentation, it is supposed to return responses in a certain format. What if suddenly that API stops working? Or what if the format changes?
You could write code to handle that "what if" scenario, but if you try to handle every hypothetical, your code becomes bloated, more complicated, and harder to understand.
So in these cases, you accept that the system will crash. But to ensure reliability, you don't want to bring down the whole system. So there are primitives that let you control the blast radius of the crash if something unexpected happens.
Let it crash does not mean you skip validating user input. Those are issues that you expect to happen. You handle those just as you would in any programming language.
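To make the "control the blast radius" idea concrete: in Erlang/OTP this is what supervisors do natively. Here's a rough Python analogue of a one-for-one supervisor with a bounded restart budget (the function names and the flaky worker are made up for illustration, not any real API):

```python
import time

def supervise(worker, max_restarts=3, backoff=0.0):
    """Run `worker`; restart it on crash, up to `max_restarts` times.

    A crude analogue of an OTP supervisor: crashes are contained and
    retried here, and only escalate (re-raise) once the restart budget
    is exhausted.
    """
    failures = 0
    while True:
        try:
            return worker()
        except Exception:
            failures += 1
            if failures > max_restarts:
                raise  # escalate: beyond this supervisor's scope
            time.sleep(backoff)

# A worker that crashes twice before succeeding, simulating a flaky
# 3rd party API that briefly returns an unexpected format.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("unexpected response format")
    return "ok"

print(supervise(flaky))  # the first two crashes are absorbed, prints "ok"
```

The worker itself contains no defensive handling for the malformed response; the recovery policy lives entirely in the layer above it, which is the point of the slogan.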
Flawless is interesting.
It can't work in the general case, because replaying a sequence of syscalls is not sufficient to put the machine back in the same state it was in last time. E.g. the second time around, open behaves differently, so you need to follow the error-handling path instead.
However sometimes that approach would work. I wonder how wide the area of effective application is. It might be wide enough to be very useful. The all or nothing database transaction model fits it well.
I've been seeing a lot of these durable workflow engines around lately, for some reason. I'm not sure I understand the pitch. It just seems like a thin wrapper around some very normal patterns for running background jobs. Persist your jobs in a db, checkpoint as necessary, periodically retry. I guess they're meant to be a low-code alternative to writing the db tables yourself, but it seems like you're not saving much code in practice.
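For scale, the "very normal patterns" mentioned above really are not much code. A minimal sketch in Python with SQLite, where the table layout, column names, and the at-most-N-attempts policy are all invented for the example:

```python
import sqlite3

# Jobs live in a table; a worker claims one, runs it, and either marks
# it done or leaves it pending for a later retry.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, payload TEXT, "
           "state TEXT DEFAULT 'pending', attempts INTEGER DEFAULT 0)")
db.execute("INSERT INTO jobs (payload) VALUES ('send-email')")

def run_once(handler, max_attempts=3):
    """Claim one pending job and attempt it; failures stay pending."""
    row = db.execute("SELECT id, payload FROM jobs "
                     "WHERE state='pending' AND attempts < ? LIMIT 1",
                     (max_attempts,)).fetchone()
    if row is None:
        return False
    job_id, payload = row
    db.execute("UPDATE jobs SET attempts = attempts + 1 WHERE id=?",
               (job_id,))
    try:
        handler(payload)
        db.execute("UPDATE jobs SET state='done' WHERE id=?", (job_id,))
    except Exception:
        pass  # job stays 'pending'; picked up on the next pass
    db.commit()
    return True

# Simulate a handler that fails on its first attempt, then succeeds.
calls = []
def handler(payload):
    calls.append(payload)
    if len(calls) == 1:
        raise RuntimeError("downstream API returned garbage")

run_once(handler)  # first attempt crashes; job remains pending
run_once(handler)  # second attempt succeeds
print(db.execute("SELECT state FROM jobs").fetchone()[0])  # prints "done"
```

The durable-workflow engines are essentially selling this loop plus the checkpointing bookkeeping, which is where the "are you actually saving much code" question comes from.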
As someone has linked it: https://erlang.org/pipermail/erlang-questions/2003-March/007...
It is about self-healing, too.
I think it’s more subtle:
Imagine that you’re trying to access an API, which for some reason fails.
“Let it crash” isn’t an argument against handling the timeout, but rather that you should only retry a bounded number of times rather than (eg) backing off exponentially forever.
When you design from that perspective, you just fail your request processing (returning the request to the queue) and make that your manager’s problem. Your managing process can then restart you, reassign the work to healthy workers, etc. If your manager can’t get things working and the queue overflows, it throws it into dead letters and crashes. That might restart the server, it might page oncall, etc.
The core idea is that within your business logic is the wrong place to handle system health — and that many problems can be solved by routing around problems (ie, give task to a healthy worker) or restarting a process. A process should crash when it isn’t scoped to handle the problem it’s facing (eg, server OOM, critical dependency offline, bad permissions). Crashing escalates the problem until somebody can resolve it.
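The escalation chain described here can be sketched outside of Erlang too. A toy Python version (every name is invented; the dead-letter list stands in for paging oncall), where failed tasks go back on the queue and only repeat offenders escalate past the manager:

```python
from collections import deque

MAX_TASK_RETRIES = 2

def manage(tasks, worker):
    """Route tasks to a worker; requeue failures, dead-letter repeat
    offenders. The worker holds no system-health logic at all."""
    queue = deque((task, 0) for task in tasks)
    dead_letters = []
    done = []
    while queue:
        task, failures = queue.popleft()
        try:
            done.append(worker(task))
        except Exception:
            if failures + 1 > MAX_TASK_RETRIES:
                dead_letters.append(task)  # escalate past this manager
            else:
                queue.append((task, failures + 1))
    return done, dead_letters

# A worker that is simply not scoped to handle the "bad" task.
def worker(task):
    if task == "bad":
        raise RuntimeError("critical dependency offline")
    return task.upper()

done, dead = manage(["a", "bad", "b"], worker)
print(done, dead)  # ['A', 'B'] ['bad']
```

Note that the worker just raises; deciding whether to retry, reassign, or give up is the manager's problem, exactly as the comment describes.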