
jconline543 · 11/08/2024

Thanks for your interest! Let me address your questions:

Querying the data: Yes, you can easily query the Parquet files with DuckDB. The files are stored in a year/month partitioned structure (e.g., year=2024/month=03/iot_data_20240315_143022.parquet), which makes it efficient to query specific time ranges. I personally use DuckDB for ad-hoc analysis since it works great with Parquet. Here's a quick example:

```sql
SELECT *
FROM read_parquet('s3://my-iot-archive/year=*/month=*/iot_data_*.parquet')
WHERE timestamp BETWEEN '2023-01-01' AND '2023-12-31';
```

Real-time data on S3: Currently, the tool is batch-focused. Adding real-time sync would require some architectural changes - either using CDC (Change Data Capture) or implementing a dual-write pattern. I kept it simple for now since most IoT visualization use cases I've seen focus on recent data in Postgres. If you need this feature, I'd be happy to look at your use case.

For data processing, the tool works like this:

It identifies records older than 90 days (configurable retention period)

Processes these records in batches of 100 (also configurable) to manage memory usage

Creates Parquet files partitioned by year/month in S3

Deletes the archived records from Postgres

The key is that you always have your recent 90 days in Postgres for fast querying, while keeping older data in cost-effective S3 storage that you can still query when needed. You can adjust both the retention period and batch size based on your specific needs.

Let me know if you'd like me to clarify anything or if you have other questions!


Replies

jconline543 · 11/08/2024

Also, for querying both recent and historical data together, you wouldn't need to modify this tool at all. You could just add a separate periodic job (e.g. hourly/daily) that copies recent data to S3:

```sql
COPY (
    SELECT * FROM iot_data
    WHERE timestamp > current_date - interval '90 days'
)
TO 's3://bucket/recent/iot_data.parquet' (FORMAT 'parquet');
```

Then query everything together in DuckDB:

```sql
SELECT * FROM read_parquet([
    's3://bucket/year=*/month=*/iot_data_*.parquet',  -- archived data
    's3://bucket/recent/iot_data.parquet'             -- recent data
]);
```

Much simpler than implementing real-time sync, and you still get a unified view of all your data for analysis (just with a small delay on recent data).