> Fundamentally, if there is nothing useful your code can do on a failure, no mitigation, no meaningful fallback, then you might as well have it blow up and rely on the underlying tree to make whatever is most useful out of it.
I would add that this is possible only because process heaps are isolated. Other systems with traditional threads/coroutines/green threads can mimic supervision trees, but unless they have isolated heaps and can crash safely at any time without affecting any other process, it would be hard to achieve the same safety properties.
> Scheduling and preempting was intended to produce consistently low latencies but it also protects you from heavy workloads bogging things down, it prevents infinitely looping bugs from slowing the system to a crawl and in general makes things resilient to things not being ideal all the time.
Well put. Reliability was one of the initial requirements of Erlang, but soft real-time behaviour was another. This is hard to do, but the BEAM VM does a great job there. To get it right, the VM has to be able to preempt processes even when they run a tight CPU-bound loop. A misbehaving process may not crash outright; it could recurse endlessly, burning CPU cycles, and it still shouldn't bring the system down.
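To make the preemption point concrete, here's a minimal sketch (module and variable names are mine, purely illustrative): a process stuck in a tight CPU-bound loop gets preempted after its reduction budget runs out, so the rest of the system stays responsive with no cooperative yield anywhere in the loop.

```elixir
# Hypothetical illustration: an endlessly recursing, CPU-bound process.
defmodule BusyLoop do
  def spin, do: spin()
end

# Spawn the tight loop; the scheduler preempts it on reduction counts,
# so other processes (like this one) still get scheduled promptly.
pid = spawn(BusyLoop, :spin, [])
:timer.sleep(100)

responsive = Process.alive?(pid)
IO.puts("system still responsive, loop still running: #{responsive}")
Process.exit(pid, :kill)
```

Run as a script (`elixir busy_loop.exs`): the sleep returns on time even though the spinner never yields, which is exactly the property a cooperative runtime can't give you.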
> The strategies are normally one_for_one, one_for_all and one_for_rest
Minor correction, the last one is rest_for_one https://www.erlang.org/doc/man/supervisor.html#supervision-p...
Obviously it depends on what you’re building and you can craft your own supervision structures if you want to. But most of the time you can get the benefits of BEAM resiliency without having to drop down into the details.
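As a sketch of the difference between the strategies (the child ids :a, :b, :c are throwaway Agents I made up for illustration): under :rest_for_one, killing :b restarts :b and every child started after it, but leaves :a alone.

```elixir
# Three trivially simple children, started in order :a, :b, :c.
children = [
  %{id: :a, start: {Agent, :start_link, [fn -> 0 end, [name: :a]]}},
  %{id: :b, start: {Agent, :start_link, [fn -> 0 end, [name: :b]]}},
  %{id: :c, start: {Agent, :start_link, [fn -> 0 end, [name: :c]]}}
]

# :one_for_one  -> only the crashed child is restarted
# :one_for_all  -> all children are restarted when one crashes
# :rest_for_one -> the crashed child and those started after it restart
{:ok, _sup} = Supervisor.start_link(children, strategy: :rest_for_one)

a = Process.whereis(:a)
Process.exit(Process.whereis(:b), :kill)
:timer.sleep(50)

# :a kept its original pid; :b and :c came back under new pids.
IO.inspect(Process.whereis(:a) == a)
```

Swapping the strategy atom is the whole API surface here, which is why most apps never need to hand-roll their own supervision logic.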
I can give an example of where this could be useful. I once worked on a Node app that couldn't use clustering mode (because reasons), and we had only one server running this particular piece of code. It turned out we had a bug that made the app crash whenever a user posted some invalid data. The problem was that the app was a SPA, so users could just keep on posting even after it failed, and they did, in frustration.
So the app crashed, took a few seconds to reload, then crashed again. This meant the entire API went down while the user was posting, and it could not respond to any other requests. This would never happen in Elixir; the rest of the load would stay fine even if 100 users kept posting bad data at the same time.
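In BEAM terms that isolation falls out of the process model: each request runs in its own process, so a crash on bad input kills only that request. A hedged sketch (the parsing step and supervisor setup are mine, not from the story above):

```elixir
# One process per "request": a user's bad input crashes only that
# request's process, never the server handling everyone else.
{:ok, sup} = Task.Supervisor.start_link()

handle_post = fn input ->
  Task.Supervisor.start_child(sup, fn ->
    String.to_integer(input)  # raises ArgumentError on bad data
  end)
end

handle_post.("not a number")  # this request's process crashes...
handle_post.("42")            # ...while other requests are unaffected
:timer.sleep(50)

IO.puts("server still up: #{Process.alive?(sup)}")
```

The crashed task gets an error report in the logs and nothing else happens, which is the behaviour the Node setup above couldn't give without clustering.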
The bad thing about Elixir resilience is that it only applies to application logic. Everything else that can go wrong is the same as in any other app, since most Elixir projects use the same kind of tooling (Postgres, some web server in front, etc.). Not many seem to use the built-in Mnesia database, no-downtime deployments, and so on. The BEAM comes with many cool features in theory, but very few people actually utilize them, so that 99.99999% uptime rarely comes into effect. The number of times I've had downtime because of application logic, like the first story I mentioned, has been very, very small; most of the time it's something else entirely, and Elixir doesn't really help with that.
Sure, you could utilize all the cool features of the BEAM, but in the vast majority of cases the amount of work seems simply too great to be worth the time investment.
There was an article about the implementation of supervisors: you can't use the "let it crash" philosophy throughout the whole Erlang/BEAM stack, because there is a small (in LOC) implementation in C that has to be proven safe in order for "let it crash" to work at the layers above it.
I don't know whether it was about the definition and implementation of supervisors themselves or of processes themselves.
Does anybody know what I mean and which article I am referring to? If so, I would be glad if you could post a link to it.