You can mount read replicas on object storage in ClickHouse.
Example:
CREATE DATABASE test;
USE test;
CREATE TABLE hackernews_history UUID '66491946-56e3-4790-a112-d2dc3963e68a'
(
`update_time` DateTime DEFAULT now(),
`id` UInt32,
`deleted` UInt8,
`type` Enum8('story' = 1, 'comment' = 2, 'poll' = 3, 'pollopt' = 4, 'job' = 5),
`by` LowCardinality(String),
`time` DateTime,
`text` String,
`dead` UInt8,
`parent` UInt32,
`poll` UInt32,
`kids` Array(UInt32),
`url` String,
`score` Int32,
`title` String,
`parts` Array(UInt32),
`descendants` Int32
)
ENGINE = ReplacingMergeTree(update_time)
ORDER BY id
SETTINGS disk = disk(
    readonly = true,
    type = 's3_plain_rewritable',
    endpoint = 'https://clicklake-test-2.s3.eu-central-1.amazonaws.com/',
    use_environment_credentials = false);
And you can try it right now. Install ClickHouse:
curl https://clickhouse.com/ | sh
./clickhouse local
Run the query above to attach the table. The table is updated in real time. For example, here is your comment:
:) SELECT * FROM hackernews_history WHERE text LIKE '%Clickhouse is amazing%' ORDER BY update_time \G
Row 1:
──────
update_time: 2024-04-06 16:35:28
id: 39785472
deleted: 0
type: comment
by: mightybyte
time: 2024-03-21 22:59:20
text: I'll second this. Clickhouse is amazing. I was actually using it today to query some CSV files. I had to refresh my memory on the syntax so if anyone is interested:<p><pre><code> clickhouse local -q "SELECT foo, sum(bar) FROM file('foobar.csv', CSV) GROUP BY foo FORMAT Pretty"
</code></pre>
Way easier than opening in Excel and creating a pivot table which was my previous workflow.<p>Here's a list of the different input and output formats that it supports.<p><a href="https://clickhouse.com/docs/en/interfaces/formats" rel="nofollow">https://clickhouse.com/docs/en/interfaces/formats</a>
dead: 0
parent: 39784942
poll: 0
kids: [39788575]
url:
score: 0
title:
parts: []
descendants: 0
Row 2:
──────
update_time: 2024-04-06 18:07:34
id: 31334599
deleted: 0
type: comment
by: richieartoul
time: 2022-05-11 00:54:31
text: Not really. Clickhouse is amazing, but if you want to run it at massive scale you’ll have to invest a lot into sharding and clustering and all that. Druid is more distributed by default, but doesn’t support as sophisticated of queries as Clickhouse does.<p>Neither Clickhouse nor Druid can hold a candle to what Snowflake can do in terms of query capabilities, as well as the flexibility and richness of their product.<p>That’s just scratching the surface. They’re completely different product categories IMO, although they have a lot of technical / architectural overlap depending on how much you squint.<p>Devil is in the details basically.
dead: 0
parent: 31334527
poll: 0
kids: [31334736]
url:
score: 0
title:
parts: []
descendants: 0
Row 3:
──────
update_time: 2024-11-07 22:29:09
id: 42081672
deleted: 0
type: comment
by: maxmcd
time: 2024-11-07 22:13:12
text: Using duckdb and apache iceberg means that you can run read replicas without any operational burden. Clickhouse is amazing, but they do not allow you to mount dumb read replicas to object storage (yet).<p>I can imagine this product is a very elegant solution for many types of companies/teams/workloads.
dead: 0
parent: 42080385
poll: 0
kids: []
url:
score: 0
title:
parts: []
descendants: 0
3 rows in set. Elapsed: 3.981 sec. Processed 42.27 million rows, 14.45 GB (10.62 million rows/s., 3.63 GB/s.)
Peak memory usage: 579.26 MiB.
Very interesting. Can you give more info on how this could be used, for instance, in my IoT case, where I want to keep the last 3 months (say) of data in Postgres, dump older data as Parquet/Iceberg on S3, and be able to run analytical queries on the past data? Would that be hard to do?
And how does the real-time update work? Could I make it so that my latest data is incrementally synced to S3 (e.g. "the last 3-month block" is incrementally updated efficiently each time there is new data)?
Do you have example code / setup for this?
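For the "analytical queries on past data" part, here is a minimal sketch of reading archived Parquet files straight from S3 with the s3 table function. The bucket, path, and the event_time column are hypothetical, and NOSIGN assumes a public bucket (pass credentials instead for a private one):
-- Sketch only: bucket, path, and event_time are placeholders for illustration.
SELECT
    toStartOfMonth(event_time) AS month,
    count() AS events
FROM s3('https://my-iot-archive.s3.eu-central-1.amazonaws.com/telemetry/*.parquet', NOSIGN, 'Parquet')
GROUP BY month
ORDER BY month;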
> 3 rows in set. Elapsed: 3.981 sec. Processed 42.27 million rows, 14.45 GB (10.62 million rows/s., 3.63 GB/s.)
Since you were running $(./clickhouse local), does that mean the query downloaded 14.45 GB out of S3 to your machine? The 3.981 sec seems to imply "no," but I struggle to think what meaning that output would otherwise convey.
Whoops, I forgot that tables maintained in the db are not the same as remote archives. :|
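One way to check how much data a query actually pulled from S3 (a sketch; counter names vary between ClickHouse versions) is to look at the S3-related counters in system.events right after running it:
-- Cumulative S3 counters for the current server/local session.
SELECT event, value
FROM system.events
WHERE event LIKE '%S3%'
ORDER BY event;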
How about the case of a ClickHouse cluster that uses ZooKeeper and a Distributed table?
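For reference, a Distributed table over such a read-only replica could be declared roughly like this. This is a sketch: 'my_cluster' is a hypothetical entry in remote_servers, and every node would first need the local hackernews_history table attached on its own read-only S3 disk:
-- Sketch: assumes 'my_cluster' is defined and test.hackernews_history exists on each node.
CREATE TABLE hackernews_history_dist AS hackernews_history
ENGINE = Distributed('my_cluster', 'test', 'hackernews_history');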
When I try your code, I get this. Any idea?
Query id: daa202a3-874c-4a68-9e3c-974560ba4624
Elapsed: 0.092 sec.
Received exception: Code: 499. DB::Exception: The AWS Access Key Id you provided does not exist in our records. (Code: 23, S3 exception: 'InvalidAccessKeyId'): While processing disk(readonly = true, type = 's3_plain_rewritable', endpoint = 'https://clicklake-test-2.s3.eu-central-1.amazonaws.com/', use_environment_credentials = false). (S3_ERROR)