> What we’re doing here is instantaneous point-in-time recovery (PITR), expressed simply in SQL and SQLite pragmas.
> Ever wanted to do a quick query against a prod dataset, but didn’t want to shell into a prod server and fumble with the sqlite3 terminal command like a hacker in an 80s movie? Or needed to do a quick sanity check against yesterday’s data, but without doing a full database restore? Litestream VFS makes that easy. I’m so psyched about how it turned out.
Man this is cool. I love the unix ethos of Litestream's design. SQLite works as normal and Litestream operates transparently on that process.
This is such a clean interface design:
export LITESTREAM_REPLICA_URL="s3://my-bucket/my.db"
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
sqlite3
.load litestream.so
.open file:///my.db?vfs=litestream
PRAGMA litestream_time = '5 minutes ago';
select * from sandwich_ratings limit 3;
This is great... just got it working using bun:sqlite! Just need to have "LITESTREAM_REPLICA_URL" and the key id and secret env vars set when running the script.
import { Database } from "bun:sqlite";
Database.setCustomSQLite("/opt/homebrew/opt/sqlite/lib/libsqlite3.dylib");
// Load extension first with a temp db
const temp = new Database(":memory:");
temp.loadExtension("/path/to/litestream.dylib", "sqlite3_litestreamvfs_init");
// Now open with litestream VFS
const db = new Database("file:my.db?vfs=litestream");
const fruits = db.query("SELECT * FROM fruits;").all();
console.log(fruits);
Love the progress being made here. I've been really enjoying learning about another embedded database - DuckDB - the OLAP to SQLite's OLTP.
DuckDB has a lakehouse extension called "DuckLake" which generates "snapshots" for every transaction and lets you "time travel" through your database. It feels kind of analogous to Litestream VFS PITR, and it's fascinating to see the different nomenclature used for similar features. The OLTP world calls it Point In Time Recovery, while the OLAP/data lake world calls it Time Travel, where it feels like a first-class feature.
In SQLite Litestream VFS, you use `PRAGMA litestream_time = '5 minutes ago'` (or a timestamp), and in DuckLake, you use `SELECT * FROM tbl AT (VERSION => 3);` (or a timestamp).
DuckDB (unlike SQLite) doesn't allow other processes to read while one process is writing to the same file - all processes get locked out during writes. DuckLake solves this by using an external catalog database (PostgreSQL, MySQL, or SQLite) to coordinate concurrent access across multiple processes, while storing the actual data as Parquet files. It's a clever architecture for "multiplayer DuckDB", deliciously dependent on an OLTP database to manage its distributed multi-user OLAP. Delta Lake uses uploaded JSON files to manage the metadata, skipping the OLTP database.
Another interesting comparison is the Parquet files used in the OLAP world - they’re immutable, column oriented and contain summaries of the content in the footers. LTX files seem analogous - they’re immutable and stored on shared storage (S3), allowing multiple database readers. No doubt they’re row oriented, being from the OLTP world.
Parquet files (in DuckLake) can be "merged" together - with DuckLake tracking this in its PostgreSQL/SQLite catalog - and in SQLite Litestream, the LTX files get “compacted” by the Litestream daemon, and read by the Litestream VFS client. They both use range requests on S3 to retrieve the headers so they can efficiently download only the needed pages.
Both worlds are converging on immutable files hosted on shared storage + metadata + compaction for handling versioned data.
I'd love to see more cross-pollination between these projects!
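To make the "immutable files + metadata + compaction" idea above concrete, here is a minimal sketch in TypeScript. It assumes a deliberately simplified, hypothetical model: the `ChangeFile` shape, `compact`, and `readPageAt` are invented for illustration and are not the real LTX or Parquet layouts, nor Litestream's or DuckLake's actual code.

```typescript
// Hypothetical, simplified model of an immutable change file.
// This is NOT the real LTX format; it only captures the general idea.
interface ChangeFile {
  minTxid: number; // first transaction covered by this file
  maxTxid: number; // last transaction covered by this file
  pages: Record<number, string>; // page number -> page contents as of maxTxid
}

// Compaction: merge several files into one new file.
// Files are applied oldest first, so the newest version of each page wins.
function compact(files: ChangeFile[]): ChangeFile {
  if (files.length === 0) throw new Error("nothing to compact");
  const sorted = files.slice().sort((a, b) => a.minTxid - b.minTxid);
  const merged: ChangeFile = { minTxid: Infinity, maxTxid: -Infinity, pages: {} };
  for (const f of sorted) {
    merged.minTxid = Math.min(merged.minTxid, f.minTxid);
    merged.maxTxid = Math.max(merged.maxTxid, f.maxTxid);
    for (const key of Object.keys(f.pages)) {
      const pageNo = Number(key);
      merged.pages[pageNo] = f.pages[pageNo];
    }
  }
  return merged;
}

// Versioned read: return a page as it looked at transaction `txid`
// by applying only the files that end at or before that transaction.
function readPageAt(files: ChangeFile[], pageNo: number, txid: number): string | undefined {
  const sorted = files.slice().sort((a, b) => a.minTxid - b.minTxid);
  let data: string | undefined;
  for (const f of sorted) {
    if (f.maxTxid > txid) break; // this file holds changes newer than the target
    if (pageNo in f.pages) data = f.pages[pageNo];
  }
  return data;
}
```

In this toy model the input files are never modified: compaction only produces a new file, and a reader picks which files to apply. One consequence of the sketch is that a compacted file can only answer reads at or after its `maxTxid`, so merging trades away the restore points inside the merged range.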
This is awesome. Especially for sqlite db’s that are read only from a website user perspective. My use case would be an sqlite DB that would live on S3 and get updated by cron or some other task runner/automation means (eg some other facility independent of the website that is using the db), and the website would use litestream vfs and just make use of that “read only” (the website will never change or modify the db) db straightup. Can it be used in this described fashion? Also/if so, how will litestream vfs react to the remote db updating itself within this scenario? Will it be cool with that? Also I’m assuming there is or will be Python modules/integration for doing the needful around Litestream VFS?
Currently on this app, I have the Python/flask app just refreshing the sqlite db from a Google spreadsheet as the auth source (via dataframe then convert to sqlite) for the sqlite db on a daily scheduled basis done within the app.
For reference this is the current app: (yes the app is kinda shite but I’m just a sysadmin trying to learn Python!) https://github.com/jgbrwn/my-upc/blob/main/app.py
I also have this implemented and ready to go in my Go SQLite driver: https://github.com/ncruces/go-sqlite3/blob/main/litestream/e...
Slightly different API (programmatic, no env variables, works with as many databases as you may want), but otherwise, everything should work.
Note that PRAGMA litestream_time is per connection, so some care is necessary when using a connection pool.
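A rough sketch of the kind of care a pool needs, written in TypeScript rather than Go. The `Conn` interface, the array-as-pool, and the `withSnapshot` helper are all hypothetical stand-ins, not any real driver's API; the only thing taken from the thread is the `PRAGMA litestream_time` statement itself. The idea is to set the PRAGMA every time a connection is borrowed, so a connection never silently reuses a timestamp left behind by an earlier borrower.

```typescript
// Hypothetical minimal connection shape; real drivers will differ.
interface Conn {
  exec(sql: string): void;
}

// Build the PRAGMA statement, doubling single quotes so the value
// stays a valid SQL string literal.
function litestreamTimePragma(when: string): string {
  return `PRAGMA litestream_time = '${when.replace(/'/g, "''")}';`;
}

// Borrow a connection, pin it to a point in time, run the work, return it.
// The PRAGMA is issued on every checkout because the setting belongs to
// the connection, not to the database or the pool.
function withSnapshot<T>(pool: Conn[], when: string, work: (conn: Conn) => T): T {
  const conn = pool.pop();
  if (!conn) throw new Error("pool exhausted");
  try {
    conn.exec(litestreamTimePragma(when));
    return work(conn);
  } finally {
    pool.push(conn); // always hand the connection back
  }
}
```

Usage would look something like `withSnapshot(pool, "5 minutes ago", (c) => c.exec("select * from sandwich_ratings limit 3;"))`. Any code path that borrows a connection without going through a helper like this would see whatever time the previous borrower set.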
As a sandwich enthusiast, I would like to know more about these sandwich ratings.
This sounds pretty cool, but I’m confused about what software is being announced. Is there a new release of Litestream?
Does this work with sqlite extensions? If I were using e.g. sqlite-vec or -vss or some other vector search extension would I be able to use litestream to back it up to S3 live, and then litestream-vfs to query it remotely without downloading the whole thing?
I remember when Litestream not being a VFS was a plus https://news.ycombinator.com/item?id=29461406 ;)
I'm glad they did this! I've always thought VFS was a better fit for the objectives of Litestream than the original design.
SQLite VFS is really cool tech, and pretty easy to work with (IMO easier than FUSE).
I had made a _somewhat similar_ VFS [1] (with a totally different set of guarantees), and it felt pretty magical how it "just worked" with normal SQLite
Been tinkering with litestream... the read-only VFS is neat but I'm curious about eventual write capabilities... using VFS for distributed DBs could unlock some interesting patterns.
ALSO I'm thinking about mixing this with object store caching... maybe combining memfs with remote metadata; would love to see more details on performance.
BUT I might be overthinking it... just excited to see SQLite exploring beyond local files...
I work with many distributed, often offline, hosts with varied levels of internet speeds. Does this do any offline caching? Like if I load a vfs litestream database on one of my nodes and it goes offline can it still query or will it fall over unless the data was recently fetched?
From a Litestream user’s perspective:
Litestream continues to work as always, making continuous backups to S3.
Like always, I can restore from those backups to my local system.
But now I have the option of doing “virtual restores” where I can query a database backup directly on S3.
I don't fully understand this. Would this be useful for scaling SQLite on systems that have really high read needs and a single writer? I thought that was what LiteFS was for, or am I off on that too?
Does this mean that I can run an application in K8s via one or many horizontally scaled pods all running off DB in s3? No StatefulSet required?
more goodies nice!
I am going to integrate Litestream into the thing I am going to be building [1]. I experimented with a lot of approaches, and it turns out WebDAV support was recently merged, though it is not in the docs.
Dumb question: can this be used for versioned tables then? Say I want to see the state of a table 1 hour ago?
Would this work with other object stores or is it s3 specific?
So much fun streaming/sync/cdc stuff happening, all so cool. Having an underlying FUSE driver doing the Change Data Capture is really neat. This looks like such an incredibly lightweight way to remote-connect to sqlite. And to add a sort of exterior transaction management.
Different use case, but it makes me think of Turso, the SQLite rewrite-it-in-Rust project, announcing AgentFS. Here the roles are flipped: SQLite acts as a file store backing FUSE, which allows watching and transaction-managing the filesystem and what agents are doing. Turso also has a sick CDC system built in that just writes all changes to a CDC table, which relates to this whole meta question of what is happening to my SQLite DB. https://turso.tech/blog/agentfs
Really nice. We should have this as an add-on to https://app.codecrafters.io/courses/sqlite/overview It can probably teach one a lot about the value of good replication and data formats.
If you are not familiar with data systems, have a read of DDIA (Designing Data-Intensive Applications), Chapter 3, especially the part on building a database from the ground up. It starts with something like "What's the simplest key-value store?": `echo` (an O(1) write to the end of a file, super fast) and `grep` (an O(n) read, slow), and then builds all the way up to LSM-trees and B-trees. After that it makes a lot more sense why this preserves so many of those ideas.
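For anyone who hasn't read that chapter, the "simplest key-value store" idea can be sketched in a few lines of TypeScript. This is a loose in-memory paraphrase of the echo/grep approach, not the book's code: the `LogStore` class is a made-up name, an array stands in for the append-only file, and keys are assumed not to contain commas.

```typescript
// Append-only key-value store: writes append to a log, reads scan it.
class LogStore {
  // Stands in for an append-only file on disk.
  private log: string[] = [];

  // O(1) write: append a "key,value" record, like `echo "key,value" >> file`.
  // Old records for the same key are never overwritten.
  set(key: string, value: string): void {
    this.log.push(`${key},${value}`);
  }

  // O(n) read: scan every record; the last match wins,
  // like `grep "^key," file | tail -n 1`.
  get(key: string): string | undefined {
    let found: string | undefined;
    for (const record of this.log) {
      const comma = record.indexOf(",");
      if (record.slice(0, comma) === key) found = record.slice(comma + 1);
    }
    return found;
  }

  // Number of records in the log, including superseded ones.
  size(): number {
    return this.log.length;
  }
}
```

Because nothing is ever rewritten, the log keeps growing with superseded records, which is exactly the problem that indexes and compaction exist to solve.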
Are people still trying to shoehorn sqlite to run in a server-side context? I thought that was a fad that everyone gave up on.
Oh hey this is using my go sqlite vfs module[0]. I love it when I find out some code I wrote is useful to others!
[0]: https://github.com/psanford/sqlite3vfs