Hi HN! We're Evgeny and Arjun, and we’re building a better way to do analytics with Postgres.
We love Postgres for its simplicity, power, and rich ecosystem. But engineers still get bogged down with heavyweight, expensive OLAP systems when wiring up an analytics stack.
Postgres is amazing at OLTP queries, but not at OLAP queries (large data scans and aggregations). Even so, we've heard from countless scaling startups that they try to get by with just a read replica for analytics workloads because they don't want to deal with the data engineering complexity of the alternative. This actually works surprisingly well at first, but starts to break down as they scale or when they integrate multiple data sources. Adding lots of indexes to support analytics also slows down their transactional write performance.
When growing out of “just use Postgres”, companies have to understand and wrangle complex ETL pipelines, CDC processes, and data warehouses, adding layers of complexity that defeat the simplicity that drove their choice of Postgres as their data store in the first place.
We thought there had to be a better way, so we're building BemiDB. It's designed to handle complex analytical queries at scale without the usual overhead. It's a single binary that automatically syncs data from Postgres and is Postgres-compatible, so querying it feels like querying standard Postgres and it works with all existing tools.
Under the hood, we use Apache Iceberg (with Parquet data files) stored in S3. This allows for bottomless inexpensive storage, compressed data in columnar files, and an open format that guarantees compatibility with other data tools.
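Because the data is plain Parquet in an open table format, other engines can read it directly. Here's a rough sketch with DuckDB's Python client; the bucket, paths, and table layout below are made up for illustration, not our actual layout:

```python
import duckdb

con = duckdb.connect()
# httpfs lets DuckDB scan Parquet files straight from S3
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("SET s3_region = 'us-east-1'")
con.execute("SET s3_access_key_id = '...'; SET s3_secret_access_key = '...'")  # placeholder credentials

# Any tool that speaks Parquet can scan the same files (hypothetical path)
result = con.execute("""
    SELECT count(*) AS rows, avg(total) AS avg_total
    FROM read_parquet('s3://my-bemidb-bucket/iceberg/orders/data/*.parquet')
""").fetchone()
print(result)
```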
We embed DuckDB as the query engine for in-memory analytics on complex queries. With efficient columnar storage and vectorized execution, we're aiming for faster results without heavy infra. BemiDB communicates over the Postgres wire protocol, so all querying stays Postgres-compatible.
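In practice that means any Postgres client can connect and run queries. A minimal sketch with psycopg2; the host, port, database, and table names here are placeholders:

```python
import psycopg2

# Connection details are hypothetical; point them at wherever BemiDB is listening
conn = psycopg2.connect(host="localhost", port=54321, dbname="bemidb", user="postgres")
cur = conn.cursor()

# A typical analytical query: a large scan plus aggregation, served by the embedded DuckDB engine
cur.execute("""
    SELECT date_trunc('month', created_at) AS month, count(*) AS orders
    FROM orders
    GROUP BY 1
    ORDER BY 1
""")
for row in cur.fetchall():
    print(row)
```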
We want to simplify data stacks for companies that use Postgres by reducing complexity (single binary and S3), using non-proprietary data formats (Iceberg open tables), and removing vendor lock-in (open source). We'd love to hear your feedback! What do you think?
Really cool! I have an IoT use case where I ingest data; I want to keep roughly the last 3 months in PostgreSQL and then store the older data as Parquet files on S3.
I initially planned to store chunks on S3 and run the analytical queries with DuckDB (roughly the approach sketched below); I'm wondering if your tool would be a good replacement?
For now I don't have that many analytical queries; I'm mostly doing visualization of the data points by querying a range (e.g. the last 2 weeks of data for a device).
Does it then make sense to use columnar storage or am I better off with "regular Postgres"?
Or, in my case, does your approach provide the "best of both worlds", in the sense that I could run occasional analytical queries on past data stored on S3, and regularly access the "last 3 months" of data for visualization from the regular Postgres?
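For context, the plan I mention above would look roughly like this (simplified sketch; the bucket layout and column names are made up):

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")  # read Parquet chunks straight from S3
con.execute("SET s3_region = 'us-east-1'")   # plus placeholder credentials as needed

# Hypothetical layout: Parquet chunks on S3, e.g. one file per device per day
rows = con.execute("""
    SELECT date_trunc('hour', ts) AS hour, avg(value) AS avg_value
    FROM read_parquet('s3://iot-archive/2024/*/*.parquet')
    WHERE device_id = ? AND ts >= now() - INTERVAL 14 DAYS
    GROUP BY 1
    ORDER BY 1
""", ["device-123"]).fetchall()
```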
This is probably the most streamlined/all-inclusive solution out of all that I've seen, but this has definitely been an extremely saturated space in 2024
Question: is it possible to use BemiDB in ... don't know what to call it, maybe read-only mode? By that I mean one Bemi instance that is connected to the Postgres source, and others that use the produced Iceberg tables to answer queries. A poor man's scalability for the query engine :) I could also imagine having one instance that is write-only (reads from Postgres and produces the Iceberg tables) and one or many query-only engines. Other than that, great work, will definitely start using it!
We do something quite similar, but through a Debezium/Kafka CDC pipeline into ClickHouse. That way the primary database is protected. Directly querying files synced from Postgres might eventually get slower by comparison, IMO.
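For anyone curious, the source side of such a pipeline is typically a Debezium Postgres connector registered with Kafka Connect, something like this sketch (hosts, credentials, and table names are placeholders; the ClickHouse sink side is omitted):

```python
import requests

# Register a Debezium Postgres source connector with Kafka Connect.
# ClickHouse then consumes the resulting CDC topics, so the primary
# Postgres only has to serve logical replication.
connector = {
    "name": "orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "database.hostname": "pg-primary",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "secret",
        "database.dbname": "app",
        "topic.prefix": "app",
        "table.include.list": "public.orders",
    },
}
resp = requests.post("http://kafka-connect:8083/connectors", json=connector)
resp.raise_for_status()
```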
Moving data between systems is problematic. Where this product is actually needed (multi-TB databases under load) is where logical replication won't be able to sync your tables in time. Conversely, small databases where this will work don't really need columnar storage optimizations.
How do updates or continuous inserts get written to the Parquet files? The architecture diagram doesn't show it, and there's nothing in the docs.
1. Most benchmarks (and most companies) assume a one-time, static dataset and then compare querying/compressing it in different formats, which is far from reality.
2. Do you rewrite the Parquet data every time new data comes in, or is it partitioned by something? There are no examples.
3. How do updates/deletes work? Updates might be a niche case, but deletion/data retention/truncation is a must, and I don't see how you support that.
Definitely checking this out today! I use postgres for ~30 GB of machine learning data (object detection) and have a couple workflows which go through the Postgres->Parquet->DuckDB processing route.
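That route, roughly (a simplified sketch, not my exact pipeline; the connection string and table names are placeholders):

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL postgres; LOAD postgres;")
# Attach the transactional Postgres database read-only
con.execute("ATTACH 'dbname=ml host=localhost user=ml' AS pg (TYPE postgres, READ_ONLY)")

# Snapshot one table to Parquet, then run the heavy aggregations against the Parquet file
con.execute("""
    COPY (SELECT * FROM pg.public.detections)
    TO 'detections.parquet' (FORMAT parquet)
""")
print(con.execute("""
    SELECT label, count(*) AS n, avg(confidence) AS avg_conf
    FROM 'detections.parquet'
    GROUP BY label
    ORDER BY n DESC
""").fetchall())
```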
A couple questions, if you have time:
1. How do you guys handle multi-dimensional arrays? I've had issues with a few postgres-facing interfaces (libraries or middleware) where they believe everything is a 1D array!
2. I saw you are using pg_duckdb/duckdb under the hood. I've had issues calling plain-SQL functions defined on the postgres server, when duckdb is involved. Does BemiDB support them?
Thanks for sharing, and good luck with it!
So it looks like you don't use Postgres extensions, which means you could run this on an EC2 instance against an Aurora Postgres instance and dump the files to S3 as Iceberg, right?
And can you then have Glue Catalog auto-crawl them and expose them in Athena? Or are they DuckDB-managed Iceberg tables essentially?
Cool. Every database or data source (e.g. CRM) should produce Iceberg format for you.
Though I'm a little sceptical of embedding DuckDB. It's easy (and better) to isolate the read and write paths, which has a lot of other benefits.
The AGPL license is a no-go for me.
While it's technically true that it's an OSI-approved license, it's mostly used to scare away competing cloud vendors from hosting the software, which isn't in the spirit of OSS.
Have you looked into the more modern choices?
Like the Business Source License that MariaDB created and uses or the Functional Source License that Sentry created as an improvement over the Business Source License? https://fsl.software/
Both those licenses have a fair source phase that automatically resolves into an open source phase over time.
Thus one gets the best of both worlds: an honest, descriptive license for protecting one's business model, plus a normal permissive OSS license that ensures longevity and prevents lock-in.
How does this replicate Postgres data? I glanced at the code and saw that it exports to a CSV file then writes out an Iceberg table for an initial snapshot--does it use Postgres logical replication?
I'm looking for low latency queries over not-very-big data (40-100M rows) in user-facing dashboards
How does the latency of Iceberg-on-S3 compare to say an EBS volume?
Query Engine: embeds the DuckDB query engine to run analytical queries. Storage Layer: uses the Iceberg table format to store data in columnar compressed Parquet files.
Smart. Imma test this out for sure.
Congratulations! I was looking at pg_analytics from ParadeDB hoping this use case would be solved (the dump from Postgres to Parquet part), but it doesn't do it yet.
How does it handle updates?
Can you give an example? If I have 5 GB (2 million rows), how will it be stored differently for columnar access?
Amazing work, guys! Looking forward to seeing where BemiDB is going.
How does this compare to ParadeDB? Seems to occupy the same space
I don't get how this would do away with the need for some kind of ETL. Most apps use highly normalised schemas that are completely unsuitable for analytical users. Not to mention you wouldn't want to couple your app schema to your warehouse schema. Am I missing something? How would this replace traditional data warehousing?
As much as DuckDB is cute I've mostly come to believe that Clickhouse is the perfect thing to pair Postgres with. This is especially true now that they've acquired PeerDB and are integrating it into the Clickpipes cloud product.
DuckDB is neat, and I understand why a company like BemiDB would build their product on top of it, but as a prospective customer embedded databases are a weird choice for serious workloads when there are other good open-source solutions like Clickhouse available.
I don’t totally understand the fascination with storing analytical data on S3. It’s not fast, and if you’re in a write heavy environment it’s definitely not cheap either.
What’s with the avoidance of clickhouse or duckdb paired with insanely fast EBS or even physically attached storage? You can still backup to s3, but using s3 for live analytics queries is missing out on so much of the speed.
What I would really love is a dead simple way to: 1) connect to my transactional Postgres db 2) define my materialized views 3) have these views update in realtime 4) query these views with a fast engine
And ideally have the whole thing open source and be able to run it in CI
We tried peerdb + clickhouse but Clickhouse materialized views are not refreshed when joining tables.
Right now we're back to standard materialized views inside Postgres, refreshed once a day (roughly the setup sketched below), but the full refreshes are pretty slow… the operational side is great though, a single DB to manage.
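Something like this, run from a daily cron job (simplified; the view definition and connection details are placeholders):

```python
import psycopg2

# Simplified daily refresh job; the view and connection string are made up for illustration
DDL = """
CREATE MATERIALIZED VIEW IF NOT EXISTS daily_revenue AS
SELECT date_trunc('day', created_at) AS day, sum(amount) AS revenue
FROM orders
GROUP BY 1
"""

with psycopg2.connect("dbname=app user=app") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
        # A full refresh rescans the source tables, which is why it gets slow as data grows
        cur.execute("REFRESH MATERIALIZED VIEW daily_revenue")
```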
Very cool!! We have the same vision with pg_mooncake: https://github.com/Mooncake-Labs/pg_mooncake/tree/main
From what I understand, the BemiDB experience is akin to PeerDB + Clickhouse. It's not really a Postgres extension?
Glad open table formats are becoming mainstream, for everyone.
Looking at the syncer, it seems like it copies the whole table to CSV every time (?), as sketched below. Code: https://github.com/BemiHQ/BemiDB/blob/6d6689b392ce6192fe521a...
I can't imagine up to what scale you can keep doing this. Is there anything better we can do before resorting to Debezium to sync the data via CDC?
Edit: add code permalink
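What I mean by "copying the whole table every time", as a sketch (this is the general pattern, not BemiDB's actual code; table and connection details are placeholders):

```python
import psycopg2

# Naive full-table sync: every run re-exports the entire table as CSV
with psycopg2.connect("dbname=app user=app") as conn:
    with conn.cursor() as cur:
        with open("orders.csv", "w") as f:
            cur.copy_expert("COPY public.orders TO STDOUT WITH CSV HEADER", f)
# ...the CSV then gets converted to Parquet/Iceberg. The cost grows with total table size
# on every sync, whereas CDC (e.g. Debezium) only ships the changes.
```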