This optimization should provide dramatic speed-ups when taking random samples from massive data sets, especially when the selected columns can contain large values. That's because the basic SQL recipe relies on a LIMIT clause to determine which rows end up in the sample (see the query below), and this new optimization promises to defer reading the big columns until the LIMIT clause has filtered the data set down to a tiny number of lucky rows.
SELECT *
FROM Population
WHERE weight > 0
ORDER BY -LN(1.0 - RANDOM()) / weight
LIMIT 100 -- Sample size.
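(For reference, in ClickHouse's dialect I'd expect the ORDER BY key to use rand() and log() instead of RANDOM() and LN; rand() returns a UInt32, so it has to be scaled into [0, 1) first. Untested sketch, assuming a Population table with a weight column:)

SELECT *
FROM Population
WHERE weight > 0
ORDER BY -log(1 - (rand() / 4294967296.0)) / weight
LIMIT 100 -- Sample size.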
Can anyone from ClickHouse verify that the lazy-materialization optimization speeds up queries like this one? (I want to make sure the randomization in the ORDER BY clause doesn't prevent the optimization.)

Verified:
EXPLAIN plan actions = 1
SELECT *
FROM amazon.amazon_reviews
WHERE helpful_votes > 0
ORDER BY -log(1 - (rand() / 4294967296.0)) / helpful_votes
LIMIT 3
Lazily read columns: review_body, review_headline, verified_purchase, vine, total_votes, marketplace, star_rating, product_category, customer_id, product_title, product_id, product_parent, review_date, review_id

Note that there is a setting query_plan_max_limit_for_lazy_materialization (default value 10) that controls the maximum n for which lazy materialization kicks in for LIMIT n.
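For a LIMIT 100 sample like the one in the original question, the default threshold of 10 would keep lazy materialization from kicking in, so the setting needs to be raised per query to at least the sample size. Rough, untested sketch reusing the table from the EXPLAIN above:

SELECT *
FROM amazon.amazon_reviews
WHERE helpful_votes > 0
ORDER BY -log(1 - (rand() / 4294967296.0)) / helpful_votes
LIMIT 100
SETTINGS query_plan_max_limit_for_lazy_materialization = 100 -- raise the LIMIT threshold for this query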
The optimization should work well for your sampling query, since the ORDER BY and LIMIT operations would happen before the large columns are materialized, but the randomization function might force early evaluation - worth benchmarking it with the optimization on and off to confirm.
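One way to benchmark it both ways is to run the identical query with lazy materialization toggled off and compare elapsed time and bytes read. Sketch only: I'm assuming the on/off switch is called query_plan_optimize_lazy_materialization, which isn't confirmed above, so check the exact setting name in your ClickHouse version first.

SELECT *
FROM amazon.amazon_reviews
WHERE helpful_votes > 0
ORDER BY -log(1 - (rand() / 4294967296.0)) / helpful_votes
LIMIT 3
SETTINGS query_plan_optimize_lazy_materialization = 0 -- assumed setting name: disables the optimization for a baseline run
-- Re-run with the setting at its default (enabled) and compare the timings.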
I checked, and yes - it works: https://pastila.nl/?002a2e01/31807bae7e114ca343577d263be7845...