
tmoertel · yesterday at 4:39 PM

This optimization should provide dramatic speed-ups when taking random samples from massive data sets, especially when the wanted columns can contain large values. That's because the basic SQL recipe relies on a LIMIT clause to determine which rows are in the sample (see query below), and this new optimization promises to defer reading the big columns until the LIMIT clause has filtered the data set down to a tiny number of lucky rows.

    SELECT *
    FROM Population
    WHERE weight > 0
    ORDER BY -LN(1.0 - RANDOM()) / weight
    LIMIT 100  -- Sample size.
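For readers unfamiliar with the recipe: the ORDER BY expression gives each row an exponentially distributed key, and keeping the smallest keys is the Efraimidis–Spirakis method for weighted sampling without replacement. In short:

    % Row i draws u_i ~ Uniform(0,1); its sort key is Exp(w_i)-distributed:
    k_i = \frac{-\ln(1 - u_i)}{w_i} \sim \operatorname{Exp}(w_i)
    % For independent exponentials, the smallest key belongs to row i
    % with probability proportional to its weight:
    \Pr\left[ k_i = \min_j k_j \right] = \frac{w_i}{\sum_j w_j}

So ORDER BY ... LIMIT 100 keeps the 100 rows with the smallest keys, i.e. a weight-proportional sample of 100 rows without replacement.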
Can anyone from ClickHouse verify that the lazy-materialization optimization speeds up queries like this one? (I want to make sure the randomization in the ORDER BY clause doesn't prevent the optimization.)

Replies

zX41ZdbW · yesterday at 5:14 PM

I checked, and yes - it works: https://pastila.nl/?002a2e01/31807bae7e114ca343577d263be7845...

tschreiber · yesterday at 5:12 PM

Verified:

  EXPLAIN plan actions = 1
  SELECT *
  FROM amazon.amazon_reviews
  WHERE helpful_votes > 0
  ORDER BY -log(1 - (rand() / 4294967296.0)) / helpful_votes
  LIMIT 3
  Lazily read columns: review_body, review_headline, verified_purchase, vine, total_votes, marketplace, star_rating, product_category, customer_id, product_title, product_id, product_parent, review_date, review_id

Note that there is a setting query_plan_max_limit_for_lazy_materialization (default value 10) that controls the maximum n for which lazy materialization kicks in for LIMIT n.
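So for a 100-row sample like the one in the original query, you would need to raise that setting. Untested sketch, reusing the Population table from the parent comment:

  SELECT *
  FROM Population
  WHERE weight > 0
  ORDER BY -log(1 - (rand() / 4294967296.0)) / weight
  LIMIT 100  -- Sample size above the default threshold of 10.
  SETTINGS query_plan_max_limit_for_lazy_materialization = 100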

ethan_smith · yesterday at 10:08 PM

The optimization should work well for your sampling query, since the ORDER BY and LIMIT operations happen before the large columns are materialized, but the randomization function might force early evaluation; worth benchmarking both approaches.
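A minimal way to check, assuming the feature can be toggled per query via a setting named query_plan_optimize_lazy_materialization (that name is an assumption; verify it against your ClickHouse version):

  -- Same sampling query with lazy materialization disabled, to compare
  -- wall time and bytes read against the default (enabled) run:
  SELECT *
  FROM amazon.amazon_reviews
  WHERE helpful_votes > 0
  ORDER BY -log(1 - (rand() / 4294967296.0)) / helpful_votes
  LIMIT 3
  SETTINGS query_plan_optimize_lazy_materialization = 0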