What about ZFS Snapshots and send/recv for backup and restore?. For us this is the cleanest approach, since we use it not only for PostgreSQL, but for all the data in our organization. Of course, the underlying filesystem must be ZFS.
pg_bulkload[1] has saved me so much time cold restoring large (1+ TB) databases. It went from 24-72 hours to an hour or two.
I also recommend pg_repack[2] to squash tables on a live system and reclaim disk space. It has saved me so much space.
Postgres backups are tricky for sure. Even if you have a DR plan you should assume your incremental backups are no good and you need to restore the whole thing from scratch. That’s your real DR SLA.
If things go truly south, just hope you have a read replica you can use as your new master. Most SLAs are not written with 72h+ of downtime. Have you tried the nuclear recovery plan, from scratch? Does it work?
While these optimizations are solid improvements, I was hoping to see more advanced techniques beyond the standard bulk insert and deferred constraint patterns. These are well-established PostgreSQL best practices - would love to see how pgstream handles more complex scenarios like parallel workers with partition-aware loading, or custom compression strategies for specific data types.
Slightly related but how does WAL-G stack up as far as backup/restoration options go for Postgres? https://github.com/wal-g/wal-g
One thing that's sorely needed in the official documentation is a "best practice" for backup/restore from "cold and dark" where you lose your main db in a fire and are now restoring from offsite backups for business continuity. Particularly in the 100-2TB range where probably most businesses lie, and backup/restore can take anywhere from 6 to 72 hours, often in less than ideal conditions. Like many things with SQL there's many ways to do it, but an official roadmap for order of operations would be very useful for backup/restore of roles/permissions, schema etc. You will figure it out eventually, but in my experience the dev and prod db size delta is so large many things that "just work" in the sub-1gb scale really trip you up over 200-500gb. Finding out you did one step out of order (manually, or badly written script) halfway through the restore process can mean hours and hours of rework. Heaven help you if you didn't start a screen session on your EC2 instance when you logged in.