The better question is, why is DuckDB so popular when one can use Polars which has a sane, lintable, typesafe API compared to the mess that is SQL:
WITH lagged AS (
SELECT
*,
LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) AS prev_time
FROM events
),
sessions AS (
SELECT
*,
SUM(COALESCE((date_diff('minute', prev_time, event_time) > 30)::INT, 1))
OVER (PARTITION BY user_id ORDER BY event_time) AS session_id
FROM lagged
)
SELECT
user_id,
session_id,
MIN(event_time) AS session_start,
MAX(event_time) AS session_end,
COUNT(*) AS event_count
FROM sessions
GROUP BY ALL
ORDER BY user_id, session_start;
vs result = (
df.sort(["user_id", "event_time"])
.with_columns(
session_id=(
pl.when(pl.col("event_time").diff().is_null())
.then(1)
.when(pl.col("event_time").diff().dt.total_minutes() > 30)
.then(1)
.otherwise(0)
.cum_sum()
.over("user_id")
)
)
.group_by(["user_id", "session_id"])
.agg(
session_start=pl.col("event_time").min(),
session_end=pl.col("event_time").max(),
event_count=pl.col("event_time").count(),
)
.sort(["user_id", "session_start"])
)That does look nicer if you have a Parquet file and want to analyze it. But DuckDB is also a database - if you want a persistent, reliable and mutable data store I don't think Polars would be suitable would it? (Genuine question - you sound like an expert and I'm not.)
I understand the linting aspect but not gonna lie I understood the first one immediately way more than the 2nd one due to knowing SQL well.
"Languages come and go, but SQL is forever"
Polars typesafe? It doesn't show you any errors until runtime right? Kusto query language is the best I've seen at type safety and I wish open source DBs would steal some ideas from it.