This is huge!
When people ask me what’s missing in the Postgres market, I used to tell them “open source Snowflake.”
Crunchy’s Postgres extension is by far the most advanced solution on the market.
Huge congrats to Snowflake and the Crunchy team on open sourcing this.
When Snowflake bought Crunchy Data I was hoping they were going to offer a managed version of this.
It's great that I can run this locally in a Docker container; I'd love to be able to run a managed instance on AWS, billed through our existing Snowflake account.
I’m not a data engineer but work in an adjacent role. Is there anyone here who could dumb the use case down? Maybe an example of a problem this solves. I am struggling to understand the value proposition here.
With S3 Table Buckets, Cloudflare R2 Data Catalog and now this, Iceberg seems to be winning.
How do you use your data lake? For me it is much more than just storing data; it is just as much for crunching numbers in unpredictable ways.
And this is where Postgres does not cut it.
You need more CPU and RAM than what you pay for in your Postgres instance, i.e. a distributed engine where you don't have to worry about how big your database instance is today.
I like data lakes and their SQL-like query languages. They feel like an advanced version of the "Everything is a file" philosophy.
Under "Everything is a file", you can read or manipulate a wide variety of information via simple, open/read/write() APIs. Linux provides APIs to modify system settings via filesystem. Get the screen brightness setting? `cat /sys/class/backlight/device0/brightness`. Update? `echo 500 > /sys/class/backlight/device0/brightness`. No need for special APIs, just generic file operations and the kernel handles everything.
FUSE (Filesystem in Userspace) provides even more flexibility by allowing user-space programs to implement their own drivers that handle any data operation via the filesystem. You can mount remote systems (via SSH) and Google Drive, and copying files is as easy as `cp /mnt/remote/data/origin /mnt/googledrive/data/`. Or with niche FUSE filesystems like pgfuse and redisfs, updating a Redis value from Postgres data is just `cat /mnt/postgres/users/100/full_name > /mnt/redis/user_100_full_name`.
But filesystems are only good for hierarchical data, while a lot of real-world data is relational. Many FUSE filesystems try hard to represent inherently non-hierarchical data in a filesystem. A data lake lets you use SQL, the elegant abstraction for relational data, across different underlying data sources, which can be physically distant and have different structures. A lot of real-world applications are just CRUD on relational data. You can accomplish much more, much more easily, if that data looks like one big database.
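To make that concrete, here's a rough sketch of the kind of thing I mean: one query joining an ordinary Postgres table with a table whose data physically lives as Parquet/Iceberg files in object storage. The table names are made up, and how the lake-backed table gets attached depends on the extension, so don't take this as pg_lake's actual DDL:

```sql
-- Illustrative only: `users` is a normal Postgres heap table holding app data,
-- while `events_archive` stands in for a lake-backed table whose files live in S3.
-- Once both look like tables, cross-source questions are plain relational SQL:
SELECT u.plan,
       count(*)           AS events,
       avg(e.duration_ms) AS avg_duration_ms
FROM   users u
JOIN   events_archive e ON e.user_id = u.id
WHERE  e.event_date >= DATE '2025-01-01'
GROUP  BY u.plan
ORDER  BY events DESC;
```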
Why would Snowflake develop and release this? Doesn't this cannibalize their main product?
Do I understand it correctly that DuckDB would run embedded on the machine running Postgres (i.e. through the extension), and this limits query processing to whatever that machine can comfortably handle?
What are the deployment implications if one wants to integrate this in production? Surely they'd need a much larger Postgres machine at a minimum.
Is there concern re: "noisy neighbour" problems if the DuckDB queries get too heavy? How is that sort of issue handled? I understood from another comment that DuckDB is run in a separate process, so there is room to potentially throttle it.
This is so cool! We have data in Iceberg that we move to/from a PG db using a custom utility. It always felt more like a workaround that didn't fully use the capabilities of either technology. Can't wait to try this out.
Two questions:
(1) Are there any plans to make this compatible with the DuckLake specification? Meaning: instead of using Iceberg in the background, you would use DuckLake with its SQL tables? My knowledge is very limited, but to me, besides leveraging DuckDB, another big point of DuckLake is that it uses SQL for the catalog instead of a confusing mixture of files, thereby offering a bunch of advantages like not having to care about the number of snapshots and better concurrent writes (rough sketch of what I mean after question 2).
(2) Might it be possible that pg_duckdb will achieve the same thing at some point, or do things not work like that?
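To illustrate what I mean by "SQL for the catalog stuff", here's a rough sketch based on my reading of the DuckLake docs; the attach options and paths are assumptions on my part and may not be exact:

```sql
-- Sketch of the DuckLake model (paths/options illustrative, check the docs).
INSTALL ducklake;
LOAD ducklake;
-- The catalog (snapshots, schema history, file lists) lives in ordinary SQL
-- tables in the metadata database, not in a pile of JSON/Avro manifest files.
ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 's3://my-bucket/lake/');
CREATE TABLE lake.events (id BIGINT, payload VARCHAR);
INSERT INTO lake.events VALUES (1, 'hello');
```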
I was going to ask if you could then put DuckDB over Postgres for the OLAP query engine -- looks like that's already what it does! Very interesting development in the data lake space, alongside DuckLake and things.
RDS really needs to make it easy to install your own PG modules.
This is cool to see! Looks like it competes with pg_mooncake, which Databricks acquired. But how is this different from pg_duckdb?
A use case I see for this: personally I have encountered a lot of "hot cache for some time, then offload for historical queries" scenarios, which I have built by hand multiple times. This should be a great fit. E.g. write to Postgres, then periodically offload to the lakehouse, and even query them together (if needed). Very cool!
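A minimal sketch of that offload pattern, assuming the lake-backed table can be written with plain INSERT ... SELECT; the table names are made up, and the DDL to create the Iceberg-backed table is whatever pg_lake actually provides:

```sql
-- Illustrative only: `events` is the hot Postgres table,
-- `events_archive` a lake-backed (Iceberg/Parquet) table created via the extension.
BEGIN;
INSERT INTO events_archive
SELECT * FROM events
WHERE  created_at < now() - interval '30 days';

DELETE FROM events
WHERE  created_at < now() - interval '30 days';
COMMIT;

-- Query hot and cold together when needed:
SELECT * FROM events          WHERE user_id = 42
UNION ALL
SELECT * FROM events_archive  WHERE user_id = 42;
```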
Very cool. One question that comes up for me is whether pg_lake expects to control the Iceberg metadata, or whether it can be used purely as a read layer. If I make schema updates and partition changes to Iceberg directly, without going through pg_lake, will pg_lake's catalog correctly reflect them right away?
For anyone looking to easily ingest data into a Postgres wire-compatible database, check out https://github.com/slingdata-io/sling-cli. Use the CLI, YAML, or Python to run ETL jobs.
More integrations are great. Anyway, the "this is awesome" moment (for me) will be when you can mix row- and column-oriented tables in Postgres, a bit like Timescale but native Postgres and well done. Hopefully one day.
Nice! Does this also allow me to write to Parquet from my Postgres table?
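I haven't checked the docs, but I'd guess at something along these lines; the path, the option names, and whether COPY is even the right verb are assumptions on my part:

```sql
-- Hypothetical sketch: exporting a Postgres table to Parquet in object storage.
-- Syntax is assumed, not taken from pg_lake's docs.
COPY my_table
TO 's3://my-bucket/exports/my_table.parquet'
WITH (format 'parquet');
```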
This is awesome, I will be trying this out in the coming months. It's just made it to the top of my R&D shortlist for things that could massively simplify our data stack for a B2B SaaS.
This is really nice, though looking at the code, a lot of the Postgres types are missing, as well as a lot of the newer Parquet logical types - but this is a great start and a nice use of FDWs.
I love this. There are definitely shops where the data is a bit too much for postgres but something like Snowflake would be overkill. Wish this was around a couple years ago lol
I’m not super into the data sphere, but my company relies heavily on Snowflake, which is becoming an issue.
This announcement seems huge to me, no?!
Is this really an open source Snowflake covering most use cases?
Very cool! Was there any inherent limitation with PostgreSQL or its extension system that forced pg_lake to use DuckDB as the query engine?
Can someone dumb this down a bit for a non data-engineer? Hard to fully wrap my head around who this is/isn’t best suited for.
Does anyone know how access control works for the underlying S3 objects? I didn’t see anything regarding grants in the docs.
I love postgres and have created my own "data lake" sorta systems -- what would this add to my workflows?
If anyone from Supabase is reading, it would be awesome to have this extension!
It’s amazing to see Postgres growing into something this powerful
Oh, data lakes. The most ridiculous idea in data processing, right after data frames in Python.
We had this discussion like a week ago about how stupid it is to use a filesystem for this kind of data storage, and here we go again. I actually had to implement this "idea" in practice. What nonsense.
Why not just use DuckLake?[1] That reduces complexity[2], since only DuckDB and PostgreSQL with pg_duckdb are required.
[1] https://ducklake.select/
[2] DuckLake - The SQL-Powered Lakehouse Format for the Rest of Us by Prof. Hannes Mühleisen: https://www.youtube.com/watch?v=YQEUkFWa69o