Hacker News

j45 yesterday at 4:13 PM

Respectfully, this type of "high availability" strawman is a dated take.

This is a general response to it.

I have run hosting on bare metal for millions of users a day, with tens of thousands of concurrent connections. It can scale way up by doing the same thing you do in a cloud: provisioning more resources.

For "downtime" you do the same thing with metal as you do with Digital Ocean: get a second server and have them fail over.
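The active/passive pattern being described can be sketched in a few lines. This is a minimal illustration only, not any particular tool's implementation; the `promote()` hook is hypothetical (in practice it would move a floating IP via something like keepalived, or let Proxmox HA restart the VM elsewhere):

```python
import time

HEARTBEAT_TIMEOUT = 3.0  # seconds of silence from the primary before failing over

class Standby:
    """Watches the primary's heartbeats and promotes itself on silence."""

    def __init__(self):
        self.last_heartbeat = time.monotonic()
        self.is_primary = False

    def on_heartbeat(self):
        # Called whenever the primary checks in (e.g. over a TCP health port).
        self.last_heartbeat = time.monotonic()

    def check(self, now=None):
        # Promote if the primary has been silent longer than the timeout.
        now = time.monotonic() if now is None else now
        if not self.is_primary and now - self.last_heartbeat > HEARTBEAT_TIMEOUT:
            self.promote()
        return self.is_primary

    def promote(self):
        # Hypothetical hook: claim the floating IP / start serving traffic.
        self.is_primary = True

standby = Standby()
standby.check(now=standby.last_heartbeat + 1.0)  # primary healthy, stays standby
standby.check(now=standby.last_heartbeat + 5.0)  # timed out -> promotes itself
```

Real tools add the hard parts this sketch omits: fencing the old primary so both machines never serve at once, and quorum so a network split doesn't promote both sides.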

You can run hypervisors to split and manage a metal server just like Digital Ocean does. Except you're not vulnerable to shared-memory and CPU exploits on shared hosting like Digital Ocean. When Intel CPU or memory flaws or kernel exploits come out, as they have, one VM user can read the memory and data of processes belonging to other users.

Both Digital Ocean and other IaaS/PaaS providers are still running similar Linux technologies to do the failover. There are tools that even handle it automatically, like Proxmox. This level of production-grade failover and simplicity was point and click ten years ago. Except no one's kept up with it.

The cloud is convenient. Convenience can make anyone comfortable. Comfort always costs way more.

It's relatively trivial to put the same web app on a metal server, with a hypervisor/IaaS/PaaS behind the same Cloudflare, to get "scale".

Digital Ocean and Cloud providers run on metal servers just like Hetzner.

The software to manage it all is becoming more and more trivial.


Replies

nh2 yesterday at 8:06 PM

While I generally agree, this is an exaggeration:

> This level of production grade fail over and simplicity was point and click, 10 years ago.

While some of the tools are _designed_ for point and click, they don't always work, mostly because of bugs.

We run Ceph clusters under our product, and have seen a fair share of non-recoveries after temporary connection loss [1], kernel crashes [2], performance degradations on many small files, and so on.

Similarly, we run HA postgres (Stolon), and found bugs in its Go error handling that cause failures to recover from crashes and full-disk conditions [3] [4]. This week, we found that full-disk situations will not necessarily trigger failovers. We also found that if DB connections are exhausted, the daemon that's supposed to trigger postgres failover cannot connect to do that (currently testing the fix).
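The full-disk failure mode is worth watching for independently of the HA daemon. A hedged sketch of such a check (the threshold and function are hypothetical, not Stolon's API); note it deliberately needs no DB connection, since the comment's other finding is that a failover daemon which must connect to the database can be locked out when connections are exhausted (Postgres's `superuser_reserved_connections` setting exists to mitigate exactly that):

```python
FREE_BYTES_MIN = 1 * 1024**3  # hypothetical threshold: fail over below 1 GiB free

def disk_needs_failover(free_bytes, free_bytes_min=FREE_BYTES_MIN):
    """Return True when free space on the DB volume is below the threshold.

    In production you'd feed this shutil.disk_usage("/var/lib/postgresql").free
    from a watchdog that runs outside the database's connection pool.
    """
    return free_bytes < free_bytes_min

disk_needs_failover(512 * 1024**2)  # 512 MiB free -> True, trigger failover
disk_needs_failover(10 * 1024**3)   # 10 GiB free  -> False, healthy
```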

I believe most of these things are better ironed out in hosted cloud solutions.

I agree that self-hosting HA with open-source software is the way to go. This software is good, and the more people use it, the fewer bugs it will have.

But I wouldn't call it "trivial".

If you have large data, it is also brutally cheaper: the difference between hosting on AWS and doing our own Hetzner HA with Free Software would pay for 10 full-time sysadmins, and we only need ~0.2. And it still has higher uptime than AWS.
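The arithmetic behind that claim, with purely hypothetical round numbers (none of these figures are from the comment):

```python
# All figures hypothetical, chosen only to illustrate the shape of the claim.
sysadmin_salary = 100_000    # per sysadmin, per year
aws_cost        = 1_100_000  # per year, large-data workload on AWS
hetzner_cost    = 100_000    # per year, same workload self-hosted on Hetzner

savings = aws_cost - hetzner_cost              # what moving off AWS frees up
sysadmins_affordable = savings / sysadmin_salary
actual_staffing_cost = 0.2 * sysadmin_salary   # the ~0.2 sysadmins actually needed

print(sysadmins_affordable)   # 10.0 -> "could hire 10 full-time sysadmins"
print(actual_staffing_cost)   # 20000.0 -> the real extra operational cost
```

The point is the ratio: the staffing needed to self-host is a small fraction of the staffing the cloud premium would fund.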

It is true that Proxmox is easy to set up and operate. For many people it will probably work well for a long time. But when things aren't working, it's not so easy anymore.

[1]: "Ceph does not recover from 5 minute network outage because OSDs exit with code 0" - https://tracker.ceph.com/issues/73136

[2]: "Kernel null pointer dereference during kernel mount fsync on Linux 5.15" - https://tracker.ceph.com/issues/53819

[3]: https://github.com/sorintlab/stolon/issues/359#issuecomment-...

[4]: https://github.com/sorintlab/stolon/issues/247

grey-area yesterday at 4:28 PM

I'm not arguing for cloud or against bare metal hosting, just saying there is a broad range of requirements in hosting, and not everyone needs or wants load balancers etc. It clearly will cost more than this particular poster wants to pay, as they want to pay the bare minimum to host quite a large setup.
