
zX41ZdbW · 11/07/2024 · 5 replies

You can mount read replicas on object storage in ClickHouse.

Example:

    CREATE DATABASE test;
    USE test;

    CREATE TABLE hackernews_history UUID '66491946-56e3-4790-a112-d2dc3963e68a'
    (
        `update_time` DateTime DEFAULT now(),
        `id` UInt32,
        `deleted` UInt8,
        `type` Enum8('story' = 1, 'comment' = 2, 'poll' = 3, 'pollopt' = 4, 'job' = 5),
        `by` LowCardinality(String),
        `time` DateTime,
        `text` String,
        `dead` UInt8,
        `parent` UInt32,
        `poll` UInt32,
        `kids` Array(UInt32),
        `url` String,
        `score` Int32,
        `title` String,
        `parts` Array(UInt32),
        `descendants` Int32
    )
    ENGINE = ReplacingMergeTree(update_time)
    ORDER BY id
    SETTINGS disk = disk(readonly = true, type = 's3_plain_rewritable', endpoint = 'https://clicklake-test-2.s3.eu-central-1.amazonaws.com/', use_environment_credentials = false);

And you can try it right now.

Install ClickHouse:

    curl https://clickhouse.com/ | sh
    ./clickhouse local

Run the query above to attach the table.
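
As a quick sanity check that the table attached, you can run a simple count (the exact number will keep changing, since the dataset is continuously updated):

    SELECT count() FROM hackernews_history;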

The table is updated in real time. For example, here is your comment:

    :) SELECT * FROM hackernews_history WHERE text LIKE '%Clickhouse is amazing%' ORDER BY update_time \G

    Row 1:
    ──────
    update_time: 2024-04-06 16:35:28
    id:          39785472
    deleted:     0
    type:        comment
    by:          mightybyte
    time:        2024-03-21 22:59:20
    text:        I'll second this. Clickhouse is amazing. I was actually using it today to query some CSV files. I had to refresh my memory on the syntax so if anyone is interested:
        clickhouse local -q "SELECT foo, sum(bar) FROM file('foobar.csv', CSV) GROUP BY foo FORMAT Pretty"
    Way easier than opening in Excel and creating a pivot table which was my previous workflow. Here's a list of the different input and output formats that it supports: https://clickhouse.com/docs/en/interfaces/formats
    dead:        0
    parent:      39784942
    poll:        0
    kids:        [39788575]
    url:         
    score:       0
    title:       
    parts:       []
    descendants: 0

    Row 2:
    ──────
    update_time: 2024-04-06 18:07:34
    id:          31334599
    deleted:     0
    type:        comment
    by:          richieartoul
    time:        2022-05-11 00:54:31
    text:        Not really. Clickhouse is amazing, but if you want to run it at massive scale you’ll have to invest a lot into sharding and clustering and all that. Druid is more distributed by default, but doesn’t support as sophisticated of queries as Clickhouse does. Neither Clickhouse nor Druid can hold a candle to what Snowflake can do in terms of query capabilities, as well as the flexibility and richness of their product. That’s just scratching the surface. They’re completely different product categories IMO, although they have a lot of technical / architectural overlap depending on how much you squint. Devil is in the details basically.
    dead:        0
    parent:      31334527
    poll:        0
    kids:        [31334736]
    url:         
    score:       0
    title:       
    parts:       []
    descendants: 0

    Row 3:
    ──────
    update_time: 2024-11-07 22:29:09
    id:          42081672
    deleted:     0
    type:        comment
    by:          maxmcd
    time:        2024-11-07 22:13:12
    text:        Using duckdb and apache iceberg means that you can run read replicas without any operational burden. Clickhouse is amazing, but they do not allow you to mount dumb read replicas to object storage (yet). I can imagine this product is a very elegant solution for many types of companies/teams/workloads.
    dead:        0
    parent:      42080385
    poll:        0
    kids:        []
    url:         
    score:       0
    title:       
    parts:       []
    descendants: 0

    3 rows in set. Elapsed: 3.981 sec. Processed 42.27 million rows, 14.45 GB (10.62 million rows/s., 3.63 GB/s.)
    Peak memory usage: 579.26 MiB.
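
The LIKE filter has to scan the entire text column, which is why all ~42 million rows are read. A lookup by the primary key column id should read only a small fraction of the data (a minimal sketch, reusing one of the ids above; timings will vary):

    SELECT by, time, parent FROM hackernews_history WHERE id = 42081672 FORMAT Vertical;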

Replies

oulipo · 11/08/2024

When I try your code I get the following error. Any idea?

    Query id: daa202a3-874c-4a68-9e3c-974560ba4624

    Elapsed: 0.092 sec.

    Received exception: Code: 499. DB::Exception: The AWS Access Key Id you provided does not exist in our records. (Code: 23, S3 exception: 'InvalidAccessKeyId'): While processing disk(readonly = true, type = 's3_plain_rewritable', endpoint = 'https://clicklake-test-2.s3.eu-central-1.amazonaws.com/', use_environment_credentials = false). (S3_ERROR)

oulipo · 11/08/2024

Very interesting. Can you give more info on how this could be used, for instance, in my IoT case, where I want to keep the last 3 months (say) of data in Postgres, dump older data as Parquet/Iceberg on S3, and still be able to run analytical queries on the past data? Would that be hard to do?

And how does the real-time update work? Could I make it so that my latest data is incrementally synced to S3 (e.g. the "last 3 months" block is updated efficiently each time new data arrives)?

Do you have example code / setup for this?
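
For querying the archived part, would it be something along these lines (a rough sketch using the s3 table function, with a made-up bucket path and column names)?

    SELECT toStartOfDay(event_time) AS day, device_id, avg(temperature) AS avg_temp
    FROM s3('https://my-iot-archive.s3.eu-central-1.amazonaws.com/data/*.parquet', 'Parquet')
    WHERE event_time < now() - INTERVAL 3 MONTH
    GROUP BY day, device_id
    ORDER BY day;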

mdaniel · 11/08/2024

> 3 rows in set. Elapsed: 3.981 sec. Processed 42.27 million rows, 14.45 GB (10.62 million rows/s., 3.63 GB/s.)

Since you were running $(./clickhouse local), does that mean the query downloaded 14.45 GB out of S3 to your machine? The 3.981s seems to imply "no," but I struggle to think what meaning that output would otherwise try to convey.

maxmcd · 11/08/2024

Whoops, I forgot that tables maintained in the db are not the same as remote archives. :|

nmduc9988 · 11/08/2024

How about the case of a ClickHouse cluster that uses ZooKeeper and a Distributed table?