This is huge!
When people ask me what’s missing in the Postgres market, I used to tell them “open source Snowflake.”
Crunchy’s Postgres extension is by far the most advanced solution on the market.
Huge congrats to Snowflake and the Crunchy team on open sourcing this.
When Snowflake bought Crunchy Data I was hoping they were going to offer a managed version of this.
It's great that I can run this locally in a Docker container; I'd love to be able to run a managed instance on AWS, billed through our existing Snowflake account.
I’m not a data engineer but work in an adjacent role. Is there anyone here who could dumb the use case down? Maybe an example of a problem this solves. I am struggling to understand the value proposition here.
With S3 Table Buckets, Cloudflare R2 Data Catalog and now this, Iceberg seems to be winning.
How do you use your data lake? For me it is much more than just storing data; it is just as much for crunching numbers in unpredictable ways.
And this is where Postgres does not cut it.
You need more CPU and RAM than what you pay for in your Postgres instance, i.e. a distributed engine where you don't have to worry about how big your database instance is today.
I like data lakes and their SQL-like query languages. They feel like an advanced version of the "Everything is a file" philosophy.
Under "Everything is a file", you can read or manipulate a wide variety of information via simple, open/read/write() APIs. Linux provides APIs to modify system settings via filesystem. Get the screen brightness setting? `cat /sys/class/backlight/device0/brightness`. Update? `echo 500 > /sys/class/backlight/device0/brightness`. No need for special APIs, just generic file operations and the kernel handles everything.
FUSE (Filesystem in Userspace) provides even more flexibility by allowing user-space programs to implement their own drivers that handle any data operation via the filesystem. You can mount remote systems (via SSH) and Google Drive, and copying files is as easy as `cp /mnt/remote/data/origin /mnt/googledrive/data/`. Or with niche FUSE filesystems like pgfuse and redisfs, updating a Redis value from Postgres data is just `cat /mnt/postgres/users/100/full_name > /mnt/redis/user_100_full_name`.
But filesystems are only good for hierarchical data, while a lot of real-world data is relational. Many FUSE filesystems try hard to represent inherently non-hierarchical data in a filesystem. A data lake lets you use SQL, the elegant abstraction for relational data, across different underlying data sources, which can be physically distant and have different structures. A lot of real-world applications are just CRUD on relational data. You can accomplish much more, much more easily, if that data looks like one big database.
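To make that concrete, here's a rough sketch of the kind of thing I mean: one query joining an ordinary Postgres table with a table whose data physically lives as Parquet/Iceberg files in object storage. The table names are made up, and how the lake-backed table gets attached depends on the extension, so don't take this as pg_lake's actual DDL:

```sql
-- Illustrative only: `users` is a normal Postgres heap table holding app data,
-- while `events_archive` stands in for a lake-backed table whose files live in S3.
-- Once both look like tables, cross-source questions are plain relational SQL:
SELECT u.plan,
       count(*)           AS events,
       avg(e.duration_ms) AS avg_duration_ms
FROM   users u
JOIN   events_archive e ON e.user_id = u.id
WHERE  e.event_date >= DATE '2025-01-01'
GROUP  BY u.plan
ORDER  BY events DESC;
```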
Why would Snowflake develop and release this? Doesn't this cannibalize their main product?
Do I understand it correctly that DuckDB would run embedded on the machine running Postgres (i.e. through the extension), and this limits query processing to whatever that machine can comfortably handle?
What are the deployment implications if one wants to integrate this in production? Surely they'd need a much larger Postgres machine at a minimum.
Is there concern re: "noisy neighbour" problems if the DuckDB queries get too heavy? How is that sort of issue handled? I understood from another comment that DuckDB is run in a separate process, so there is room to potentially throttle it.
This is so cool! We have data in Iceberg that we move to/from a PG db using a custom utility. It always felt more like a workaround that didn't fully use the capabilities of either technology. Can't wait to try this out.
Two questions:
(1) Are there any plans to make this compatible with the DuckLake specification? Meaning: instead of using Iceberg in the background, you would use DuckLake with its SQL tables? My knowledge is very limited, but to me, besides leveraging DuckDB, another big point of DuckLake is that it uses SQL for the catalog instead of a confusing mixture of files, thereby offering a bunch of advantages like not having to care about the number of snapshots and better concurrent writes (rough sketch of what I mean after question 2).
(2) Might it be possible that pg_duckdb will achieve the same thing at some point, or do things not work like that?
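To illustrate what I mean by "SQL for the catalog stuff", here's a rough sketch based on my reading of the DuckLake docs; the attach options and paths are assumptions on my part and may not be exact:

```sql
-- Sketch of the DuckLake model (paths/options illustrative, check the docs).
INSTALL ducklake;
LOAD ducklake;
-- The catalog (snapshots, schema history, file lists) lives in ordinary SQL
-- tables in the metadata database, not in a pile of JSON/Avro manifest files.
ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 's3://my-bucket/lake/');
CREATE TABLE lake.events (id BIGINT, payload VARCHAR);
INSERT INTO lake.events VALUES (1, 'hello');
```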
I was going to ask if you could then put DuckDB over Postgres for the OLAP query engine -- looks like that's already what it does! Very interesting development in the data lake space, alongside DuckLake and things.
RDS really needs to make it easy to install your own PG modules.
This is cool to see! Looks like it competes with pg_mooncake, which Databricks acquired. But how is this different from pg_duckdb?
A use case I see for this: personally I have encountered a lot of "hot cache for some time, then offload for historical queries" scenarios, which I have built by hand multiple times. This should be a great fit. E.g. write to Postgres, then periodically offload to the lakehouse, and even query them together (if needed). Very cool!
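A minimal sketch of that offload pattern, assuming the lake-backed table can be written with plain INSERT ... SELECT; the table names are made up, and the DDL to create the Iceberg-backed table is whatever pg_lake actually provides:

```sql
-- Illustrative only: `events` is the hot Postgres table,
-- `events_archive` a lake-backed (Iceberg/Parquet) table created via the extension.
BEGIN;
INSERT INTO events_archive
SELECT * FROM events
WHERE  created_at < now() - interval '30 days';

DELETE FROM events
WHERE  created_at < now() - interval '30 days';
COMMIT;

-- Query hot and cold together when needed:
SELECT * FROM events          WHERE user_id = 42
UNION ALL
SELECT * FROM events_archive  WHERE user_id = 42;
```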
Very cool. One question that comes up for me is whether pg_lake expects to control the Iceberg metadata, or whether it can be used purely as a read layer. If I make schema updates and partition changes to Iceberg directly, without going through pg_lake, will pg_lake's catalog correctly reflect them right away?
For anyone looking to easily ingest data into a Postgres wire-compatible database, check out https://github.com/slingdata-io/sling-cli. Use the CLI, YAML, or Python to run ETL jobs.
More integrations are great. Anyway, the "this is awesome" moment (for me) will be when you can mix row- and column-oriented tables in Postgres, a bit like Timescale but native Postgres and well done. Hopefully one day.
Nice! Does this also allow me to write to Parquet from my Postgres table?
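I haven't checked the docs, but I'd guess at something along these lines; the path, the option names, and whether COPY is even the right verb are assumptions on my part:

```sql
-- Hypothetical sketch: exporting a Postgres table to Parquet in object storage.
-- Syntax is assumed, not taken from pg_lake's docs.
COPY my_table
TO 's3://my-bucket/exports/my_table.parquet'
WITH (format 'parquet');
```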
This is awesome, I will be trying this out in the coming months. It's just made it to the top of my R&D shortlist for things that could massively simplify our data stack for a B2B SaaS.
This is really nice, though looking at the code, a lot of the Postgres types are missing, as well as a lot of the newer Parquet logical types - but this is a great start and a nice use of FDWs.
I love this. There are definitely shops where the data is a bit too much for postgres but something like Snowflake would be overkill. Wish this was around a couple years ago lol
I’m not super into the data sphere, but my company relies heavily on Snowflake, which is becoming an issue.
This announcement seems huge to me, no?!
Is this really an open source Snowflake covering most use cases?
Very cool! Was there any inherent limitation with PostgreSQL or its extension system that forced pg_lake to use DuckDB as the query engine?
Can someone dumb this down a bit for a non data-engineer? Hard to fully wrap my head around who this is/isn’t best suited for.
Does anyone know how access control works for the underlying S3 objects? I didn’t see anything regarding grants in the docs.
I love postgres and have created my own "data lake" sorta systems -- what would this add to my workflows?
If anyone from Supabase is reading, it would be awesome to have this extension!
It’s amazing to see Postgres growing into something this powerful
Oh, data lakes. The most ridiculous idea in data processing, right after data frames in Python.
We had this discussion like a week ago about how stupid it is to use a filesystem for this kind of data storage, and here we go again. I actually had to implement this "idea" in practice. What nonsense.
Why not just use DuckLake?[1] That reduces complexity[2], since only DuckDB and PostgreSQL with pg_duckdb are required.
[1] https://ducklake.select/
[2] DuckLake - The SQL-Powered Lakehouse Format for the Rest of Us by Prof. Hannes Mühleisen: https://www.youtube.com/watch?v=YQEUkFWa69o