logoalt Hacker News

oulipo11/07/20243 repliesview on HN

Really cool! I have an IoT use-case where I ingest data, I want to keep like the last 3 months or so in Postgresql, and then store the old data as parquet files on S3

I planned initially to do chunks on S3 and do the analytical queries using duckdb, I'm wondering if your tool would be a good replacement?

For now I don't have that many analytical queries, I'm mostly doing visualization of the data points by querying a range (eg last 2 weeks of data for a device)

Does it then make sense to use columnar storage or am I better off with "regular Postgres"?

Or in my case does your approach provide "best of both worlds" in the sense that I could do some occasional analytical queries on past data stored on S3, and regularly access "last 3 months" data for visualization using the data stored in the regular Postgres?


Replies

Onavo11/07/2024

Use a clickhouse FDW or something similar, clickhouse has excellent integration with postgres. They also have a great embedded Python version. Their marketing isn't as good as Duckdb but in terms of stability and performance they are so much better. Duckdb is very very buggy and full of sharp edges but because of their VC funded developer marketing, you don't really hear people talking about it.

show 2 replies
jconline54311/08/2024

If you just need the Postgres -> S3 archival pattern you described, I built a simpler focused tool: pg-archiver (https://github.com/johnonline35/pg-archiver)

It:

- Auto-archives old Postgres data to Parquet files on S3

- Keeps recent data (default 90 days) in Postgres for fast viz queries

- Uses year/month partitioning in S3 for basic analytical queries

- Configures with just PG connection string and S3 bucket

Currently batch-only archival (no real-time sync yet). Much lighter than running a full analytical DB if you mainly need timeseries visualization with occasional historical analysis.

Let me know if you try it out!

show 1 reply
exAspArk11/07/2024

Thank you!

Yes, absolutely!

1) You could use BemiDB to sync your Postgres data (e.g., partition time-series tables) to S3 in Iceberg format. Iceberg is essentially a "table" abstraction on top of columnar Parquet data files with a schema, history, etc.

2) If you don't need strong consistency and fine with delayed data (the main trade-off), you can use just BemiDB to query and visualize all data directly from S3. From a query perspective, it's like DuckDB that talks Postgres (wire protocol).

Feel free to give it a try! And although it's a new project, we plan to keep building and improving it based on user feedback.

show 1 reply