I don't understand why, if you are creating a distributed db, you wouldn't at least try using e.g. aphyr's Jepsen library (1).
The story seems to repeat itself for distributed databases: the documentation reads more like advertising. It promises a lot but contains multiple errors, and failure modes that can corrupt data. It's great that Jepsen does the work they do!
While Jepsen (and this article) is focused on behavior under node failure and network partitions, this caught my eye:
> It also exhibits Stale Read, Lost Update, and other forms of G-single in healthy clusters
This looks like quite a fundamental issue.
I would kinda expect that. MySQL wasn't designed as a distributed database from the beginning, and it's usually hard to retrofit that later on.
I really like Galera for low-volume clustering because of its true multi-master nature. I've been using it for over a decade on a clustered mail server to store account information, and more recently I've pumped log data into it so each user can see their related log messages, for a user base of around 6,000 users. It's been a real workhorse.
I realize that we like to use the page title here on HN, but this really should be something like "Data loss cases with MariaDB Galera Cluster 12.1.2".
It's the "healthy cluster" aspect that makes this scary. Errors under partition are expected; that's exactly what Jepsen tests. But stale reads during normal operation mean that most Galera deployments behind a round-robin load balancer silently hit this problem. The classic scenario: you create a user on node A, the next request lands on node B, and the user doesn't exist there yet. The fix is wsrep_sync_wait or pinning reads to the writer node, but most setups use neither, because they assume a healthy cluster means consistent reads.
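As a sketch of the wsrep_sync_wait mitigation: the variable is a real Galera session/global setting (a bitmask, where bit 1 covers READ statements); the table and query here are made up for illustration.

```sql
-- Per-session: make SELECTs wait until this node has applied all
-- writesets it knows about at query start (bitmask: 1 = READ statements).
SET SESSION wsrep_sync_wait = 1;

-- A read that follows a write on another node now sees that write,
-- at the cost of added read latency (hypothetical table/query):
SELECT * FROM users WHERE email = 'new@example.com';

-- Or enforce it cluster-wide for all new sessions:
SET GLOBAL wsrep_sync_wait = 1;
```

The trade-off is latency: each covered statement blocks until causality checks complete, which is why many deployments leave it off and unknowingly accept stale reads.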