Hacker News

Pg_lake: Postgres with Iceberg and data lake access

371 points · by plaur782 · 11/04/2025 · 118 comments

Comments

boshomi · 11/04/2025

Why not just use DuckLake?[1] That reduces complexity[2], since only DuckDB and PostgreSQL with pg_duckdb are required.

[1] https://ducklake.select/

[2] DuckLake - The SQL-Powered Lakehouse Format for the Rest of Us by Prof. Hannes Mühleisen: https://www.youtube.com/watch?v=YQEUkFWa69o

ozgune · 11/04/2025

This is huge!

When people asked me what was missing in the Postgres market, I used to tell them “open source Snowflake.”

Crunchy’s Postgres extension is by far the most advanced solution on the market.

Huge congrats to Snowflake and the Crunchy team on open sourcing this.

anentropic · 11/04/2025

When Snowflake bought Crunchy Data, I was hoping they were going to offer a managed version of this.

It's great that I can run this locally in a Docker container, but I'd love to be able to run a managed instance on AWS, billed through our existing Snowflake account.

gajus · 11/04/2025

Man, we are living in the golden era of PostgreSQL.

NeutralCrane · 11/04/2025

I’m not a data engineer but work in an adjacent role. Is there anyone here who could dumb the use case down? Maybe an example of a problem this solves. I am struggling to understand the value proposition here.

ayhanfuat · 11/04/2025

With S3 Table Buckets, Cloudflare R2 Data Catalog and now this, Iceberg seems to be winning.

fifilura · 11/05/2025

How do you use your data lake? For me it is much more than just storing data; it is just as much about crunching numbers in unpredictable ways.

And this is where Postgres does not cut it.

You need more CPU and RAM than what you pay for in your Postgres instance, i.e. a distributed engine where you don't have to worry about how big your database instance is today.

hamasho · 11/05/2025

I like data lakes and their SQL-like query languages. They feel like an advanced version of the "Everything is a file" philosophy.

Under "Everything is a file", you can read or manipulate a wide variety of information via simple, open/read/write() APIs. Linux provides APIs to modify system settings via filesystem. Get the screen brightness setting? `cat /sys/class/backlight/device0/brightness`. Update? `echo 500 > /sys/class/backlight/device0/brightness`. No need for special APIs, just generic file operations and the kernel handles everything.

FUSE (Filesystem in Userspace) provides even more flexibility by letting user-space programs implement their own filesystems that handle any data operation. You can mount remote systems (via SSH) or Google Drive, and copying files is as easy as `cp /mnt/remote/data/origin /mnt/googledrive/data/`. Or, with niche FUSE filesystems like pgfuse and redisfs, updating a Redis value from Postgres data is just `cat /mnt/postgres/users/100/full_name > /mnt/redis/user_100_full_name`.

But filesystems are only good for hierarchical data, while a lot of real-world data is relational. Many FUSE filesystems try hard to represent inherently non-hierarchical data in a filesystem. Data lakes let you use SQL, the elegant abstraction for relational data, across different underlying data sources, even ones that are physically distant and differently structured. A lot of real-world applications are just CRUD on relational data; you can accomplish much more, much more easily, when all of that data behaves like one big database.
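
To make that concrete, here's a rough sketch of what this looks like with pg_lake (hypothetical table names; `USING iceberg` is the access method described in the announcement, so check the docs for exact DDL): an ordinary heap table and an Iceberg table in object storage, joined in one query.

```sql
-- Sketch only: "users" is a normal Postgres heap table, while "events"
-- lives in object storage as an Iceberg table managed by pg_lake.
CREATE TABLE users (id bigint PRIMARY KEY, full_name text);

CREATE TABLE events (
    user_id    bigint,
    action     text,
    created_at timestamptz
) USING iceberg;

-- One SQL statement spans both storage worlds.
SELECT u.full_name, count(*) AS actions
FROM users u
JOIN events e ON e.user_id = u.id
GROUP BY u.full_name
ORDER BY actions DESC;
```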

dharbin · 11/04/2025

Why would Snowflake develop and release this? Doesn't this cannibalize their main product?

enether · 11/07/2025

Do I understand correctly that DuckDB would run embedded on the machine running Postgres (i.e. through the extension), and that this limits query processing to whatever that machine can comfortably handle?

What are the deployment implications if one wants to integrate this in production? Surely they'd need a much larger Postgres machine at a minimum.

Is there concern re: "noisy neighbour" problems if the DuckDB queries get too heavy? How is that sort of issue potentially handled? I understood from another comment that DuckDB is run in a separate process, so there is room to potentially throttle it.

darth_avocado · 11/04/2025

This is so cool! We have files in Iceberg, and we move data to and from a PG db using a custom utility. It always felt like a workaround that didn't fully use the capabilities of either technology. Can't wait to try this out.
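
If the announcement delivers, that kind of utility should collapse into plain SQL. A hedged sketch, with hypothetical names, and the `load_from` option taken from my reading of the pg_lake README (verify before use):

```sql
-- Hypothetical: point an Iceberg table at existing Parquet data,
-- then move rows into a regular heap table with ordinary SQL.
CREATE TABLE lake_orders ()
    USING iceberg
    WITH (load_from = 's3://my-bucket/warehouse/orders.parquet');

-- "pg_orders" is assumed to be an existing heap table with a matching schema.
INSERT INTO pg_orders
SELECT * FROM lake_orders
WHERE ordered_at >= now() - interval '7 days';
```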

max_streese · 11/05/2025

Two questions:

(1) Are there any plans to make this compatible with the DuckLake specification? Meaning: instead of using Iceberg in the background, you would use DuckLake with its SQL tables? My knowledge is very limited, but to me, besides leveraging DuckDB, another big point of DuckLake is that it uses SQL for the catalog instead of a confusing mixture of files, which offers a bunch of advantages, like not having to care about the number of snapshots, and better concurrent writes.

(2) Might it be possible that pg_duckdb will achieve the same thing in time, or do things not work like that?

oulipo2 · 11/04/2025

Interesting! How does it compare with DuckLake?

dkdcio · 11/04/2025

I was going to ask if you could then put DuckDB over Postgres as the OLAP query engine -- looks like that's already what it does! Very interesting development in the data lake space, alongside DuckLake and the like.

whalesalad · 11/05/2025

RDS really needs to make it easy to install your own PG modules.

harisund1990 · 11/04/2025

This is cool to see! Looks like a competitor to pg_mooncake, which Databricks acquired. But how is this different from pg_duckdb?

lysecret · 11/05/2025

A use case I see for this: I have personally encountered a lot of "hot cache for some time, then offload for historical queries" scenarios, which I have built by hand multiple times. This should be a great fit: write to Postgres, then periodically offload to the lakehouse, and even query the two together if needed. Very cool!
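
A hedged sketch of that offload pattern with pg_lake (hypothetical tables; assumes Iceberg tables accept plain INSERT/DELETE, as the announcement suggests):

```sql
-- Hypothetical periodic job: move rows older than 30 days from the
-- hot heap table into an Iceberg archive table, in one transaction.
BEGIN;

INSERT INTO metrics_archive            -- created with USING iceberg
SELECT * FROM metrics_hot
WHERE created_at < now() - interval '30 days';

DELETE FROM metrics_hot
WHERE created_at < now() - interval '30 days';

COMMIT;

-- And query hot and cold together when needed:
SELECT * FROM metrics_hot
UNION ALL
SELECT * FROM metrics_archive;
```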

spenczar5 · 11/04/2025

Very cool. One question that comes up for me is whether pg_lake expects to control the Iceberg metadata, or whether it can be used purely as a read layer. If I make schema updates and partition changes to Iceberg directly, without going through pg_lake, will pg_lake's catalog reflect them correctly right away?

flarco · 11/05/2025

For anyone looking to easily ingest data into a Postgres wire-compatible database, check out https://github.com/slingdata-io/sling-cli. Use the CLI, YAML, or Python to run ETL jobs.

drchaim · 11/04/2025

More integrations are great. That said, the "this is awesome" moment (for me) will be when you can mix row- and column-oriented tables in Postgres, a bit like Timescale, but native Postgres and well done. Hopefully one day.

lysecret · 11/04/2025

Nice! Does this also allow me to write to Parquet from my Postgres table?
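
From the docs as I read them, it looks like COPY with an object-store URL does this, with the format inferred from the file extension. Something like this sketch (bucket and table names hypothetical):

```sql
-- Hedged sketch: export a heap table to Parquet in S3.
COPY my_table TO 's3://my-bucket/exports/my_table.parquet';
```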

pjd7 · 11/04/2025

This is awesome; I will be trying this out in the coming months. It's just made it to the top of my R&D shortlist of things that could massively simplify our data stack for a B2B SaaS.

inglor · 11/04/2025

This is really nice, though looking at the code, a lot of the Postgres types are missing, as well as a lot of the newer Parquet logical types. Still, this is a great start and a nice use of FDW.

fridder · 11/04/2025

I love this. There are definitely shops where the data is a bit too much for Postgres but something like Snowflake would be overkill. Wish this had been around a couple of years ago lol

apexalpha · 11/04/2025

I’m not super into the data sphere, but my company relies heavily on Snowflake, which is becoming an issue.

This announcement seems huge to me, no?!

Is this really an open-source Snowflake covering most use cases?

iamcreasy · 11/04/2025

Very cool! Was there any inherent limitation in PostgreSQL or its extension system that forced pg_lake to use DuckDB as its query engine?

claudeomusic · 11/04/2025

Can someone dumb this down a bit for a non-data-engineer? Hard to fully wrap my head around who this is and isn't best suited for.

mberning · 11/04/2025

Does anyone know how access control works for the underlying S3 objects? I didn't see anything regarding grants in the docs.

chaps · 11/04/2025

I love Postgres and have created my own "data lake" sort of systems -- what would this add to my workflows?

iamcreasy · 11/04/2025

If anyone from Supabase is reading, it would be awesome to have this extension!

logicartisan · 11/05/2025

It’s amazing to see Postgres growing into something this powerful

scirob · 11/04/2025

Crunchy Data did it first :) but it's nice to get more options

beoberha · 11/04/2025

Curious why pgduck_server is a totally separate process?

hexo · 11/05/2025

Oh, data lakes. The most ridiculous idea in data processing, right after data frames in Python.

We had this discussion like a week ago about how stupid it is to use a filesystem for this kind of data storage, and here we go again. I actually had to implement this "idea" in practice. What nonsense.