
oulipo · 11/08/2024

Really cool! I'll take a look!

- Can you then easily query it with DuckDB / ClickHouse / something else? What do you use yourself? Do you have a tutorial / toy example to check out?

- Would it be complicated to also store the real-time data on S3 somehow, so that querying historical data that includes today's data would be "transparent"?

- What typical "batch data" size makes sense? I guess "day batches" might be a bit small and will incur too many "read" operations (if I have a moderate amount of daily data), so "week batches" instead? But then the time lag increases?


Replies

jconline543 · 11/08/2024

Thanks for your interest! Let me address your questions:

Querying the data: Yes, you can easily query the Parquet files with DuckDB. The files are stored in a year/month partitioned structure (e.g., year=2024/month=03/iot_data_20240315_143022.parquet), which makes it efficient to query specific time ranges. I personally use DuckDB for ad-hoc analysis since it works great with Parquet. Here's a quick example:

    SELECT *
    FROM read_parquet('s3://my-iot-archive/year=*/month=*/iot_data_*.parquet')
    WHERE timestamp BETWEEN '2023-01-01' AND '2023-12-31';
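To query only the partitions a time range actually touches, it helps to expand the range into the year=/month= prefixes up front. Here's a small helper sketching that idea in Python; the function name and bucket are just illustrative, not part of the tool:

```python
from datetime import date

def partition_prefixes(start: date, end: date, base: str = "s3://my-iot-archive"):
    """Expand a date range into the year=/month= S3 prefixes it spans."""
    prefixes = []
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):  # iterate month by month, inclusive
        prefixes.append(f"{base}/year={y}/month={m:02d}")
        m += 1
        if m > 12:
            y, m = y + 1, 1
    return prefixes
```

You can then pass the resulting prefixes (plus a filename glob) to read_parquet instead of scanning the whole archive.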

Real-time data on S3: Currently, the tool is batch-focused. Adding real-time sync would require some architectural changes - either using CDC (Change Data Capture) or implementing a dual-write pattern. I kept it simple for now since most IoT visualization use cases I've seen focus on recent data in Postgres. If you need this feature, I'd be happy to hear more about your use case.
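For intuition, the dual-write pattern could look roughly like this. This is a sketch, not the tool's code: pg_insert and the buffer/flush names are hypothetical stand-ins for a real database client and an S3 uploader:

```python
def flush_to_s3(buffer: list) -> list:
    """Drain the staging buffer.

    In a real system this would serialize the batch to Parquet and upload
    it under a year=/month= prefix; here we just empty the buffer.
    """
    batch, buffer[:] = list(buffer), []
    return batch

def dual_write(record: dict, pg_insert, s3_buffer: list, flush_threshold: int = 1000):
    """Write a record to Postgres and stage a copy for S3 (hypothetical sketch)."""
    pg_insert(record)        # primary write: recent data stays queryable in Postgres
    s3_buffer.append(record) # secondary write: staged for object storage
    if len(s3_buffer) >= flush_threshold:
        flush_to_s3(s3_buffer)
```

The trade-off versus CDC is that dual-write is simpler but can drift if one write fails, so you'd want idempotent flushes or a reconciliation pass.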

For data processing, the tool works like this:

It identifies records older than 90 days (configurable retention period)

Processes these records in batches of 100 (also configurable) to manage memory usage

Creates Parquet files partitioned by year/month in S3

Deletes the archived records from Postgres
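The four steps above can be sketched in pure Python. This is not the actual implementation - an in-memory list stands in for the Postgres table and the returned dict stands in for Parquet files on S3 - but the flow is the same:

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

def archive_old_records(rows, retention_days=90, batch_size=100, now=None):
    """Sketch of one archival pass (illustrative, not the tool's code)."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    old = [r for r in rows if r["timestamp"] < cutoff]        # 1. find expired rows
    partitions = defaultdict(list)
    for i in range(0, len(old), batch_size):                  # 2. process in batches
        for r in old[i:i + batch_size]:
            ts = r["timestamp"]
            key = f"year={ts.year}/month={ts.month:02d}"      # 3. partition by year/month
            partitions[key].append(r)
    rows[:] = [r for r in rows if r["timestamp"] >= cutoff]   # 4. delete from "Postgres"
    return dict(partitions)
```

Batching (step 2) is what keeps memory bounded when the expired set is large.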

The key is that you always have your recent 90 days in Postgres for fast querying, while maintaining older data in a cost-effective S3 storage that you can still query when needed. You can adjust both the retention period and batch size based on your specific needs.

Let me know if you'd like me to clarify anything or if you have other questions!
