
voxic11 yesterday at 7:29 PM

That is just the archive part; if you just would finish reading the paragraph, you would know that updates since 2026-03-16 23:55 UTC "are fetched every 5 minutes and committed directly as individual Parquet files through an automated live pipeline, so the dataset stays current with the site itself."

So to get all the data, you need to grab the archive and all the 5-minute update files.

Archive data is here: https://huggingface.co/datasets/open-index/hacker-news/tree/...

Update files are here (I know that it's called "today", but it actually includes all the update files, which span multiple days at this point): https://huggingface.co/datasets/open-index/hacker-news/tree/...
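
For anyone who wants to combine the two, here is a rough sketch of the "archive plus all update files" step. It assumes you have already downloaded both folders to local directories; the directory names, the function names, and the idea that sorting by file path gives a sensible order are all my assumptions, not anything the dataset documents.

```python
from pathlib import Path


def collect_parquet_files(archive_dir, updates_dir):
    """List archive files first, then update files, each group sorted by path.

    Assumption: sorting by path roughly matches chronological order.
    """
    archive = sorted(Path(archive_dir).rglob("*.parquet"))
    updates = sorted(Path(updates_dir).rglob("*.parquet"))
    return archive + updates


def load_all(archive_dir, updates_dir):
    """Concatenate every file into one DataFrame (needs pandas and pyarrow)."""
    import pandas as pd  # imported lazily so the file listing works without it

    files = collect_parquet_files(archive_dir, updates_dir)
    # Read each Parquet file and stack the rows in listing order.
    return pd.concat([pd.read_parquet(f) for f in files], ignore_index=True)
```

Something like `load_all("archive", "today")` would then give one table, provided the archive and update files share a schema (which I have not checked).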


Replies

mlhpdx today at 4:10 AM

That paragraph doesn’t make it clear (to me) that it’s a snapshot with incremental updates, if that’s what it is. Sorry if my obtuse read offended. I just figured it was edge-cached HTML, and less likely that it was actually broken.

john_strinlai yesterday at 7:33 PM

>if you just would finish reading the paragraph

probably uncalled for
