[Author here] The whole pipeline runs on a single ~$10/month VPS with just 12 GB of RAM and a 200 GB SSD, yet it can process hundreds of TB.
The main reason I built this was to have HN data that is easy to query and always up to date, without needing to run your own pipeline first. There are also some interesting ideas in the pipeline, like what I call "auto-heal". Happy to share more if anyone is interested :)
A lot of the choices are trade-offs, as usual with data pipelines. I chose Parquet because it is columnar and compressed, so tools like DuckDB or Polars can read only the columns they need, which matters more and more as the dataset grows. I went with Hugging Face mainly because it is simple and already handles distribution and versioning: I can just push data as commits and get a built-in history without managing extra infrastructure. It is also convenient on the reading side: as the README shows, you can query the dataset directly from Python or DuckDB.
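To make the columnar point concrete, here is a toy Python sketch. It is not Parquet's actual on-disk format (no compression, no row groups), and the field names and sample rows are made up for illustration. It only shows why a query that needs one column touches far fewer bytes when each column is stored separately.

```python
import json

# Illustrative sample records; the field names are assumptions for this sketch.
rows = [
    {"id": 1, "by": "alice", "title": "Show HN: A", "text": "x" * 500},
    {"id": 2, "by": "bob", "title": "Ask HN: B", "text": "y" * 500},
]

# Row-oriented layout: every record is stored whole.
row_store = [json.dumps(r).encode() for r in rows]

# Column-oriented layout: each field is stored as its own blob.
column_store = {
    name: json.dumps([r[name] for r in rows]).encode() for name in rows[0]
}


def titles_from_rows():
    """Read titles from the row layout; every byte of every record is read."""
    bytes_read = sum(len(blob) for blob in row_store)
    titles = [json.loads(blob)["title"] for blob in row_store]
    return titles, bytes_read


def titles_from_columns():
    """Read titles from the column layout; only the 'title' blob is read."""
    blob = column_store["title"]
    return json.loads(blob), len(blob)


if __name__ == "__main__":
    print(titles_from_rows())
    print(titles_from_columns())
```

Both functions return the same titles, but the columnar version never touches the large `text` field. Real Parquet readers do the same kind of column selection, with compression on top.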
The pipeline is incremental. Instead of rebuilding everything, it appends small batches every few minutes using the API, which keeps the data fresh while staying cheap to run. The data is also partitioned by time, so queries do not need to scan the entire dataset. The tech is very simple too: just a Go binary running in a "screen" session, using only a few MB of RAM for the whole pipeline.
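Roughly, the incremental and partitioning ideas can be sketched like this. It is a simplified Python illustration, not the actual Go code: the function names, the ID-based checkpoint, the batch size, and the year/month layout are all assumptions for the example.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_key(unix_ts):
    """Map a Unix timestamp to a hypothetical year/month partition path."""
    t = datetime.fromtimestamp(unix_ts, tz=timezone.utc)
    return f"year={t.year:04d}/month={t.month:02d}"


def next_batch(last_id, max_id, batch_size):
    """IDs to fetch in this run: everything after the checkpoint, capped at batch_size."""
    return range(last_id + 1, min(max_id, last_id + batch_size) + 1)


def group_by_partition(items):
    """Bucket fetched items by partition so each batch is appended to the right place."""
    buckets = defaultdict(list)
    for item in items:
        buckets[partition_key(item["time"])].append(item)
    return dict(buckets)


def partitions_between(start_ts, end_ts):
    """List only the partitions a time-bounded query has to read."""
    s = datetime.fromtimestamp(start_ts, tz=timezone.utc)
    e = datetime.fromtimestamp(end_ts, tz=timezone.utc)
    keys = []
    year, month = s.year, s.month
    while (year, month) <= (e.year, e.month):
        keys.append(f"year={year:04d}/month={month:02d}")
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return keys


if __name__ == "__main__":
    # Checkpoint at ID 100, newest known ID 1000, at most 3 items per run.
    print(list(next_batch(100, 1000, 3)))
    print(partitions_between(0, 86400 * 40))
```

Each run only fetches what is new since the last checkpoint, and a query over a date range only needs the partitions returned by `partitions_between`, not the whole dataset.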