
tetha · 12/10/2024

A simple way to remember it, I think from DevOps Borat: redundancy/replication fixes hardware problems; backups fix stupid human problems.

> And, does restoring from backups always mean losing more _recent_ data than replication?

This depends on the archiving technology and what you're archiving.

Our file and object stores take one full backup every day, so we could lose up to 24 hours of changes on those stores if something goes wrong between backups. Whether that is acceptable depends on the RPO - the recovery point objective, or the "maximum acceptable data loss". For documents especially, 24 hours can be acceptable, because users and customers tend to still have the files they uploaded lying around for a few days - especially if you can identify which documents were lost.
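As a rough illustration (not our actual tooling - assume a plain POSIX filestore, tar and cron), a daily full backup can be as simple as a dated archive shipped to a backup mount:

    # /etc/cron.d/filestore-backup - hypothetical daily full backup at 02:00
    # (paths, schedule and user are made up for illustration)
    0 2 * * * backup tar -czf /mnt/backup/filestore-$(date +\%F).tar.gz /srv/filestore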

Both on MySQL with the InnoDB storage engine and on Postgres, you can use PITR backup solutions - point-in-time recovery. With these, pgbackrest or e.g. xtrabackup take a full backup of the database (once a day in our setup) and then keep archiving the WAL / transaction logs of the system. We, in turn, archive snapshots of these into long-term archiving once a day.
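For Postgres, a minimal pgbackrest sketch of that setup could look roughly like this (stanza name, paths and schedule are invented for illustration, not our actual config):

    # postgresql.conf - ship each completed WAL segment to the backup repository
    archive_mode = on
    archive_command = 'pgbackrest --stanza=main archive-push %p'

    # /etc/pgbackrest/pgbackrest.conf
    [global]
    repo1-path=/var/lib/pgbackrest

    [main]
    pg1-path=/var/lib/postgresql/16/main

    # cron entry: one full backup per day; WAL archiving runs continuously in between
    0 1 * * * postgres pgbackrest --stanza=main --type=full backup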

If we need a restore, we'd first restore a pgbackrest or xtrabackup state from the long-term archive onto a system, and then use the PITR recovery mechanisms to restore to a specific point in time.
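With pgbackrest, that second step looks roughly like this (target time, stanza and paths are placeholders; a sketch, not a full runbook):

    # restore the daily full backup plus archived WAL, replaying up to a target time
    pgbackrest --stanza=main --delta --type=time \
        --target="2024-12-09 03:00:00+00" --target-action=promote restore

    # on startup, Postgres replays WAL until the target and then promotes
    pg_ctl -D /var/lib/postgresql/16/main start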

Technically, we could recover precisely down to the last transaction before the disastrous one to minimize data loss. In fact, I've done so once or twice after some database migrations went haywire. That involved scrolling through the transaction logs with a viewer to identify when the migration tool started running, noting down the transaction ID of the migration tool starting its checks, and then restoring to the transaction just before it. Very cool tbh.
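On Postgres, for example, pg_waldump can serve as that viewer, and pgbackrest can then target the transaction ID directly (the WAL segment name and xid below are invented for illustration):

    # inspect an archived WAL segment to find the migration's first transaction ID
    pg_waldump 0000000100000002000000A3 | less

    # restore and replay up to, but not including, that transaction
    pgbackrest --stanza=main --delta --type=xid \
        --target=1234567 --target-exclusive restore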

This is important for an RDBMS, because data in the relational database tends to be much more volatile than data in a file or object store. With a file store, users upload a file, move it to their recycle bin or a "done" folder on their local system, and can easily drag it back out tomorrow. With the database, a user spent 30 minutes to an hour writing up some text or a comment and expects it to be safe and sound once they hit "Reply". Losing that kind of data creates a lot more work and effort for our customers, because they then have to figure out what state the data is in and what to redo. It can also send their business processes haywire and... it's not great.