Hi HN! We're Evgeny and Arjun, and we’re building a better way to do analytics with Postgres.
We love Postgres for its simplicity, power, and rich ecosystem. But engineers still get bogged down with heavyweight, expensive OLAP systems when wiring up an analytics stack.
Postgres is amazing at OLTP queries, but not at OLAP queries (large data scans and aggregations). Even so, we've heard from countless scaling startups that they try to get by with just a read replica for analytics workloads because they don't want to deal with the data engineering complexity of the alternative. This actually works surprisingly well at first, but starts to break down as they scale or when they integrate multiple data sources. Adding lots of indexes to support analytics also slows down their transactional write performance.
When growing out of “just use Postgres”, companies have to understand and wrangle complex ETL pipelines, CDC processes, and data warehouses, adding layers of complexity that defeat the simplicity that drove their choice of Postgres as their data store in the first place.
We thought there had to be a better way, so we're building BemiDB. It's designed to handle complex analytical queries at scale without the usual overhead. It's a single binary that automatically syncs data from Postgres and is Postgres-compatible, so querying it feels like querying standard Postgres and it works with all existing tools.
Under the hood, we use Apache Iceberg (with Parquet data files) stored in S3. This allows for bottomless inexpensive storage, compressed data in columnar files, and an open format that guarantees compatibility with other data tools.
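Because the data is plain Parquet in an open table format, other engines can read it directly. Here's a rough sketch with DuckDB's Python client; the bucket, paths, and table layout below are made up for illustration, not our actual layout:

```python
import duckdb

con = duckdb.connect()
# httpfs lets DuckDB scan Parquet files straight from S3
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("SET s3_region = 'us-east-1'")
con.execute("SET s3_access_key_id = '...'; SET s3_secret_access_key = '...'")  # placeholder credentials

# Any tool that speaks Parquet can scan the same files (hypothetical path)
result = con.execute("""
    SELECT count(*) AS rows, avg(total) AS avg_total
    FROM read_parquet('s3://my-bemidb-bucket/iceberg/orders/data/*.parquet')
""").fetchone()
print(result)
```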
We embed DuckDB as the query engine for in-memory analytics on complex queries. With efficient columnar storage and vectorized execution, we're aiming for faster results without heavy infra. BemiDB communicates over the Postgres wire protocol, so all querying stays Postgres-compatible.
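In practice that means any Postgres client can connect and run queries. A minimal sketch with psycopg2; the host, port, database, and table names here are placeholders:

```python
import psycopg2

# Connection details are hypothetical; point them at wherever BemiDB is listening
conn = psycopg2.connect(host="localhost", port=54321, dbname="bemidb", user="postgres")
cur = conn.cursor()

# A typical analytical query: a large scan plus aggregation, served by the embedded DuckDB engine
cur.execute("""
    SELECT date_trunc('month', created_at) AS month, count(*) AS orders
    FROM orders
    GROUP BY 1
    ORDER BY 1
""")
for row in cur.fetchall():
    print(row)
```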
We want to simplify data stacks for companies that use Postgres by reducing complexity (single binary and S3), using non-proprietary data formats (Iceberg open tables), and removing vendor lock-in (open source). We'd love to hear your feedback! What do you think?
Really cool! I have an IoT use case where I ingest data; I want to keep roughly the last 3 months in PostgreSQL and then store the older data as Parquet files on S3.
I initially planned to store chunks on S3 and run the analytical queries with DuckDB (roughly the approach sketched below); I'm wondering if your tool would be a good replacement?
For now I don't have that many analytical queries; I'm mostly doing visualization of the data points by querying a range (e.g. the last 2 weeks of data for a device).
Does it then make sense to use columnar storage or am I better off with "regular Postgres"?
Or, in my case, does your approach provide the "best of both worlds", in the sense that I could run occasional analytical queries on past data stored on S3, and regularly access the "last 3 months" of data for visualization from the regular Postgres?
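For context, the plan I mention above would look roughly like this (simplified sketch; the bucket layout and column names are made up):

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")  # read Parquet chunks straight from S3
con.execute("SET s3_region = 'us-east-1'")   # plus placeholder credentials as needed

# Hypothetical layout: Parquet chunks on S3, e.g. one file per device per day
rows = con.execute("""
    SELECT date_trunc('hour', ts) AS hour, avg(value) AS avg_value
    FROM read_parquet('s3://iot-archive/2024/*/*.parquet')
    WHERE device_id = ? AND ts >= now() - INTERVAL 14 DAYS
    GROUP BY 1
    ORDER BY 1
""", ["device-123"]).fetchall()
```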
This is probably the most streamlined/all-inclusive solution out of all that I've seen, but this has definitely been an extremely saturated space in 2024
Question: is it possible to use BemiDB in ... don't know what to call it, maybe read-only mode? By that I mean one Bemi instance that is connected to the Postgres source, and others that use the produced Iceberg tables to answer queries. A poor man's scalability for the query engine :) I could also imagine having one instance that is write-only (reads from Postgres and produces the Iceberg tables) and one or many query-only engines. Other than that, great work, will definitely start using it!
We do something quite similar, but through a Debezium/Kafka CDC pipeline into ClickHouse. That way the primary database is protected. Directly querying files synced from Postgres might eventually get slower by comparison, IMO.
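For anyone curious, the source side of such a pipeline is typically a Debezium Postgres connector registered with Kafka Connect, something like this sketch (hosts, credentials, and table names are placeholders; the ClickHouse sink side is omitted):

```python
import requests

# Register a Debezium Postgres source connector with Kafka Connect.
# ClickHouse then consumes the resulting CDC topics, so the primary
# Postgres only has to serve logical replication.
connector = {
    "name": "orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "database.hostname": "pg-primary",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "secret",
        "database.dbname": "app",
        "topic.prefix": "app",
        "table.include.list": "public.orders",
    },
}
resp = requests.post("http://kafka-connect:8083/connectors", json=connector)
resp.raise_for_status()
```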
Moving data between systems is problematic. Where this product is actually needed (multi-TB databases under load) is where logical replication won't be able to sync your tables in time. Conversely, small databases where this will work don't really need columnar storage optimizations.
How do updates or continuous inserts get written to the Parquet files? The architecture diagram doesn't show it, and there's nothing in the docs.
1. Most benchmarks (and most companies) assume a one-time, static dataset and then compare querying/compressing it in different formats, which is far from reality.
2. Do you rewrite the Parquet data every time new data comes in, or is it partitioned by something? There are no examples.
3. How do updates/deletes work? Updates might be a niche case, but deletion/data retention/truncation is a must, and I don't see how you support that.
Definitely checking this out today! I use postgres for ~30 GB of machine learning data (object detection) and have a couple workflows which go through the Postgres->Parquet->DuckDB processing route.
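That route, roughly (a simplified sketch, not my exact pipeline; the connection string and table names are placeholders):

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL postgres; LOAD postgres;")
# Attach the transactional Postgres database read-only
con.execute("ATTACH 'dbname=ml host=localhost user=ml' AS pg (TYPE postgres, READ_ONLY)")

# Snapshot one table to Parquet, then run the heavy aggregations against the Parquet file
con.execute("""
    COPY (SELECT * FROM pg.public.detections)
    TO 'detections.parquet' (FORMAT parquet)
""")
print(con.execute("""
    SELECT label, count(*) AS n, avg(confidence) AS avg_conf
    FROM 'detections.parquet'
    GROUP BY label
    ORDER BY n DESC
""").fetchall())
```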
A couple questions, if you have time:
1. How do you guys handle multi-dimensional arrays? I've had issues with a few postgres-facing interfaces (libraries or middleware) where they believe everything is a 1D array!
2. I saw you are using pg_duckdb/duckdb under the hood. I've had issues calling plain-SQL functions defined on the postgres server, when duckdb is involved. Does BemiDB support them?
Thanks for sharing, and good luck with it!
So it looks like you don't use Postgres extensions, which means you could run this on an EC2 instance against an Aurora Postgres instance and dump the files to S3 as Iceberg, right?
And can you then have Glue Catalog auto-crawl them and expose them in Athena? Or are they DuckDB-managed Iceberg tables essentially?
Cool. Every database or data source (e.g. CRM) should produce Iceberg format for you.
Though I'm a little sceptical of embedding DuckDB. It's easy (and better) to isolate the read and write paths, which has a lot of other benefits.
The AGPL license is a no-go for me.
While it's technically true that it's an OSI-approved license, it's mostly used to scare away competing cloud vendors from hosting the software, which isn't in the spirit of OSS.
Have you looked into the more modern choices?
Like the Business Source License that MariaDB created and uses or the Functional Source License that Sentry created as an improvement over the Business Source License? https://fsl.software/
Both those licenses have a fair source phase that automatically resolves into an open source phase over time.
Thus one gets the best of both worlds: an honest, descriptive license for protecting one's business model, plus a normal permissive OSS license that ensures longevity and prevents lock-in.
How does this replicate Postgres data? I glanced at the code and saw that it exports to a CSV file then writes out an Iceberg table for an initial snapshot--does it use Postgres logical replication?
I'm looking for low latency queries over not-very-big data (40-100M rows) in user-facing dashboards
How does the latency of Iceberg-on-S3 compare to say an EBS volume?
Query Engine: embeds the DuckDB query engine to run analytical queries. Storage Layer: uses the Iceberg table format to store data in columnar compressed Parquet files.
Smart. Imma test this out for sure.
Congratulations! I was looking at pg_analytics from ParadeDB hoping this use case would be solved (the dump from Postgres to Parquet part), but it doesn't do it yet.
How does it handle updates?
Can you give an example? If I have 5 GB (2 million rows), how will it be stored differently for columnar access?
Amazing work, guys! Looking forward to seeing where BemiDB is going.
How does this compare to ParadeDB? Seems to occupy the same space
I don't get how this would do away with the need for some kind of ETL. Most apps use highly normalised schemas that are completely unsuitable for analytical users. Not to mention you wouldn't want to couple your app schema to your warehouse schema. Am I missing something? How would this replace traditional data warehousing?
As much as DuckDB is cute I've mostly come to believe that Clickhouse is the perfect thing to pair Postgres with. This is especially true now that they've acquired PeerDB and are integrating it into the Clickpipes cloud product.
DuckDB is neat, and I understand why a company like BemiDB would build their product on top of it, but as a prospective customer embedded databases are a weird choice for serious workloads when there are other good open-source solutions like Clickhouse available.
I don’t totally understand the fascination with storing analytical data on S3. It’s not fast, and if you’re in a write heavy environment it’s definitely not cheap either.
What’s with the avoidance of clickhouse or duckdb paired with insanely fast EBS or even physically attached storage? You can still backup to s3, but using s3 for live analytics queries is missing out on so much of the speed.
What I would really love is a dead simple way to: 1) connect to my transactional Postgres db 2) define my materialized views 3) have these views update in realtime 4) query these views with a fast engine
And ideally have the whole thing open source and be able to run it in CI
We tried peerdb + clickhouse but Clickhouse materialized views are not refreshed when joining tables.
Right now we're back to standard materialized views inside Postgres, refreshed once a day (roughly the setup sketched below), but the full refreshes are pretty slow… the operational side is great though, a single DB to manage.
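Something like this, run from a daily cron job (simplified; the view definition and connection details are placeholders):

```python
import psycopg2

# Simplified daily refresh job; the view and connection string are made up for illustration
DDL = """
CREATE MATERIALIZED VIEW IF NOT EXISTS daily_revenue AS
SELECT date_trunc('day', created_at) AS day, sum(amount) AS revenue
FROM orders
GROUP BY 1
"""

with psycopg2.connect("dbname=app user=app") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
        # A full refresh rescans the source tables, which is why it gets slow as data grows
        cur.execute("REFRESH MATERIALIZED VIEW daily_revenue")
```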
Very cool!! We have the same vision with pg_mooncake: https://github.com/Mooncake-Labs/pg_mooncake/tree/main
From what I understand, the BemiDB experience is akin to PeerDB + Clickhouse. It's not really a Postgres extension?
Glad open table formats are becoming mainstream, for everyone.
Looking at the syncer, it seems like it copies the whole table to CSV every time (?), as sketched below. Code: https://github.com/BemiHQ/BemiDB/blob/6d6689b392ce6192fe521a...
I can't imagine up to what scale you can keep doing this. Is there anything better we can do before resorting to Debezium to sync the data via CDC?
Edit: add code permalink
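What I mean by "copying the whole table every time", as a sketch (this is the general pattern, not BemiDB's actual code; table and connection details are placeholders):

```python
import psycopg2

# Naive full-table sync: every run re-exports the entire table as CSV
with psycopg2.connect("dbname=app user=app") as conn:
    with conn.cursor() as cur:
        with open("orders.csv", "w") as f:
            cur.copy_expert("COPY public.orders TO STDOUT WITH CSV HEADER", f)
# ...the CSV then gets converted to Parquet/Iceberg. The cost grows with total table size
# on every sync, whereas CDC (e.g. Debezium) only ships the changes.
```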