Community, all the HN belong to you. This is an archive of Hacker News that fits in your browser. When I made HN Made of Primes I realized I could probably do this offline SQLite/WASM thing with the whole GBs of archive. The whole dataset. So I tried it, and this is it. Have Hacker News on your device.
Go to this repo (https://github.com/DOSAYGO-STUDIO/HackerBook): you can download it. BigQuery -> ETL -> npx serve docs - that's it. 20 years of HN arguments and beauty can be yours forever. So they'll never die. Ever. It's the unkillable static archive of HN and it's in your hands. That's my year-end gift to you all. Thank you for a wonderful year, and have a happy and wonderful 2026. Make something of it.
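For the curious about that first arrow: the BigQuery part is basically a query over the public hacker_news dataset, something like this (heavily simplified - the real ETL does more, and double-check the column names against the dataset's schema):

```python
from google.cloud import bigquery

# Queries the public HN dataset; billing goes to your own GCP project.
client = bigquery.Client()
sql = """
    SELECT id, type, `by`, title, url, text, parent, score, time
    FROM `bigquery-public-data.hacker_news.full`
    WHERE type IN ('story', 'comment')
"""
for row in client.query(sql).result():
    ...  # hand each row to the ETL step that writes the static SQLite shards
```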
I wonder how much smaller it could get with some compression. You could probably encode "This website hijacks the scrollbar and I don't like it" comments into just a few bits.
It'd be great if you could add it to Kiwix[1] somehow (not sure what the process is for that, but 100rabbits figured it out for their site). I use it all the time now that I have a dumb phone; I have the entirety of Wikipedia, Wiktionary and 100rabbits all offline.
Awesome work.
Minor bug/suggestion: right-aligned text inputs (eg the username input on the “me” page) aren’t ideal since they are often obscured by input helpers (autocomplete or form fill helper icons).
Similar to single-page applications (SPA), single-table applications (STA) might become a thing. Just shard a table on multiple keys and serve the shards as static files, provided the data is OK to share, similar to serving static HTML content.
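A minimal sketch of what I mean, with a hypothetical "items" table sharded on id modulo a fixed shard count (a real layout would pick shard keys to match access patterns):

```python
import sqlite3

SHARDS = 16  # hypothetical shard count

def shard_items(src_path: str, out_dir: str) -> None:
    """Split one big items table into SHARDS standalone SQLite files, keyed by id."""
    src = sqlite3.connect(src_path)
    for n in range(SHARDS):
        dst = sqlite3.connect(f"{out_dir}/items-{n:04d}.sqlite")
        dst.execute(
            "CREATE TABLE items (id INTEGER PRIMARY KEY, type TEXT, title TEXT, text TEXT)"
        )
        rows = src.execute(
            "SELECT id, type, title, text FROM items WHERE id % ? = ?", (SHARDS, n)
        )
        dst.executemany("INSERT INTO items VALUES (?, ?, ?, ?)", rows)
        dst.commit()
        dst.close()
    src.close()
```

The shard files are then plain static assets (served with npx serve or anything similar), and a client only downloads the shard whose key space covers what it needs.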
That's pretty neat!
I did something similar. I built a tool[1] to import the Project Arctic Shift dumps[2] of Reddit into SQLite. It was mostly an exercise to experiment with Rust and SQLite (HN's two favorite topics). If you don't build an FTS5 index and import without WAL (--unsafe-mode), importing every Reddit comment and submission takes a bit over 24 hours and produces a ~10TB DB.
SQLite offers a lot of cool JSON features that would let you store the raw JSON and operate on that, but I eschewed them in favor of parsing only once at load time. That also lets me normalize the data a bit.
I find that building the DB is pretty "fast", but queries run much faster if I immediately vacuum the DB after building it. The vacuum operation is actually slower than the original import, taking a few days to finish (rough sketch of the trade-off below).
[1] https://github.com/Paul-E/Pushshift-Importer
[2] https://github.com/ArthurHeitmann/arctic_shift/blob/master/d...
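Not the actual importer (that's Rust), but the pragma trade-off in a nutshell, sketched with Python's sqlite3 module and placeholder columns:

```python
import sqlite3

db = sqlite3.connect("reddit.sqlite")
# "Unsafe mode" in spirit: no rollback journal and no fsyncs between batches.
# A crash mid-import can corrupt the file, so only do this for data you can rebuild.
db.execute("PRAGMA journal_mode = OFF")
db.execute("PRAGMA synchronous = OFF")
db.execute(
    "CREATE TABLE IF NOT EXISTS comments (id TEXT PRIMARY KEY, author TEXT, body TEXT)"
)

def import_batches(batches):
    # batches: an iterable of lists of (id, author, body) tuples parsed from the dumps
    for batch in batches:
        db.executemany("INSERT OR IGNORE INTO comments VALUES (?, ?, ?)", batch)
    db.commit()

def finalize():
    # VACUUM rewrites the whole file into a compact, defragmented layout;
    # it is slow, but subsequent queries read far fewer pages.
    db.execute("VACUUM")
```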
I tried "select * from items limit 10" and it is slowly iterating through the shards without returning. I got up to 60 shards before I stopped. Selecting just one shard makes that query return instantly. As mentioned elsewhere I think duckdb can work faster by only reading the part of a parquet file it needs over http.
I was getting an error that the users and user_domains tables aren't available, but you just need to change the shard filter to the user stats shard.
That repo is throwing up a 404 for me.
Question - did you consider tradeoffs between duckdb (or other columnar stores) and SQLite?
Similar in spirit to a tool I recently posted a Show HN about, https://exopriors.com/scry. You can use Claude Code to run SQL+vector queries over Hacker News and many other high-quality public-commons sites. They're exceptionally well indexed and usually have 5+ minute query timeout limits, so you can run seriously large research queries and rapidly refine your worldview (particularly because exhaustive exploration becomes easy).
Looks like the repo was taken down (404).
That's too bad; I'd like to see the inner workings with a subset of data, even with placeholders for the posts and comments.
Absolutely love this!! I have been analyzing a lot of HN data lately [1] so I backtested my hypothesis on your dataset and ran some stats: https://philippdubach.com/standalone/hackerbook-stats/
I threw together some heatmaps of post volume and average score by day and time (15-minute intervals):
story volume (all time): https://ibb.co/pBTTRznP
average score (all time): https://ibb.co/KcvVjx8p
story volume (since 2020): https://ibb.co/cKC5d7Pp
average score (since 2020): https://ibb.co/WpN20kfh
The query tab looks quite complex with all these content shards: https://hackerbook.dosaygo.com/?view=query
I have a much simpler database: https://play.clickhouse.com/play?user=play#U0VMRUNUIHRpbWUsI...
What a reminder of how much more efficient text is than video, it's crazy! Could you imagine the same amount of knowledge (or drivel) in video form? I wonder how large that would be.
The site does not load on Firefox; the console error says 'Uncaught (in promise) TypeError: can't access property "wasm", sqlite3 is null'.
Guess it's common knowledge that SharedArrayBuffer (which SQLite WASM uses) is blocked unless the page is cross-origin isolated, as a mitigation against cross-origin attacks (I just found out ;).
Once the initial chunk of data loads, the rest loads almost instantly on Chrome. Can you please fix the GitHub link (currently a 404)? I'd like to peek at the code. Thank you!
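For reference, SharedArrayBuffer is only exposed on cross-origin-isolated pages, so the host has to send COOP/COEP headers. A minimal local-serving sketch that adds them (the SQLite WASM build also has non-SharedArrayBuffer fallbacks, so this may not be the whole story for this site):

```python
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

class IsolatedHandler(SimpleHTTPRequestHandler):
    def end_headers(self):
        # These two headers opt the page into cross-origin isolation,
        # which is what unlocks SharedArrayBuffer in Firefox and Chrome.
        self.send_header("Cross-Origin-Opener-Policy", "same-origin")
        self.send_header("Cross-Origin-Embedder-Policy", "require-corp")
        super().end_headers()

if __name__ == "__main__":
    # Serve the current directory (e.g. the docs/ folder) with isolation headers.
    ThreadingHTTPServer(("127.0.0.1", 8000), IsolatedHandler).serve_forever()
```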
Wonder if you could turn this into a .zim file for offline browsing with an offline browser like Kiwix, etc. [0]
I've been taking frequent "offline-only-day" breaks to consolidate whatever I've been learning, and Kiwix has been a great tool for reference (offline Wikipedia, StackOverflow and whatnot).
[0] https://kiwix.org/en/the-new-kiwix-library-is-available/
link no workie: https://github.com/DOSAYGO-STUDIO/HackerBook
Nice. I wonder if there’s any way to quickly get a view for a whole year.
Is it just me, or is the design almost unusable on a mobile phone? The tech making this possible is beyond cool, but it's presented in such a brutal way for phone users, even though fixing it would be super simple.
Neat. I keep wanting to build something like this for GitHub audit logs, but at ~5 TB it's probably a little much.
It's really a shame that comment scores are hidden forever. Would the admins consider publishing them after stories are old enough that voting is closed? It would be great to have them for archives and search indices and projects like this.
Did anyone get a copy of this before it was pulled? If GitHub is not keen, could it be uploaded to HuggingFace or some other service which hosts large assets?
I have always known I could scrape HN, but I would much rather take a neat little package.
Is there a public dump of the data anywhere that this is based upon, or have they scraped it themselves?
Such a DB might be entertaining to play with, and the threadedness of comments would be useful for beginners to practise efficient recursive queries (more so than the StackExchange dumps, for instance).
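A rough sketch of the kind of recursive query I mean, assuming an items table with HN-style id and parent columns (the real schema may differ):

```python
import sqlite3

THREAD_SQL = """
WITH RECURSIVE thread(id, parent, text, depth) AS (
    SELECT id, parent, text, 0 FROM items WHERE id = :root   -- the story itself
    UNION ALL
    SELECT i.id, i.parent, i.text, t.depth + 1
    FROM items AS i
    JOIN thread AS t ON i.parent = t.id                      -- walk down the replies
)
SELECT id, depth, text FROM thread ORDER BY depth, id
"""

def fetch_thread(db_path: str, root_id: int):
    con = sqlite3.connect(db_path)
    return con.execute(THREAD_SQL, {"root": root_id}).fetchall()
```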
1 hour passed and it's already nuked?
Thank you btw
This is pretty neat! The calendar didn't work well for me. I could only seem to navigate by month. And when I selected the earliest day (after much tapping), nothing seemed to be updated.
Nonetheless, random access history is cool.
It suddenly occurs to me that it would be neat to pair a small LLM (3-7B) with an HN dataset.
22 GB for mostly text? I tried loading the site, and it's pretty slow. Curious how the query performance is with this much data in SQLite.
Apparently the comment counts only include top-level comments?
It would be nice for the thread pages to show a comment count.
Is this updated regularly? The GitHub repo is a 404 for me too, as in the other comments.
With all due respect, it would be great if there were an official HN public dump available (one that doesn't require something like BigQuery, which is expensive).
How do I download it? That repo is a 404.
Beautiful!
2026 prayer: for all you AI junkies, please don't pollute HN with your dirty AI gaming.
Don't bot posts, comments, or upvotes/downvotes just to maximize karma. Please.
We can't tell anymore who's a bot and who's human. I just want to hang out with real humans here.
Hahaha, now you can be prepared for the apocalypse when the internet disappears. ;)
> Community, all the HN belong to you. This is an archive of Hacker News that fits in your browser.
> 20 years of HN arguments and beauty can be yours forever. So they'll never die. Ever. It's the unkillable static archive of HN and it's in your hands
I'm really sorry to have to ask this, but this really feels like you had an LLM write it?
How much space is needed for the data? I'm wondering if it would work on a tablet.
Alas, HN does not belong to us, and the existence of projects like this is subject to the whims of the legal owners of HN.
From the terms of use [0]:
"""
Commercial Use: Unless otherwise expressly authorized herein or in the Site, you agree not to display, distribute, license, perform, publish, reproduce, duplicate, copy, create derivative works from, modify, sell, resell, exploit, transfer or upload for any commercial purposes, any portion of the Site, use of the Site, or access to the Site. The buying, exchanging, selling and/or promotion (commercial or otherwise) of upvotes, comments, submissions, accounts (or any aspect of your account or any other account), karma, and/or content is strictly prohibited, constitutes a material breach of these Terms of Use, and could result in legal liability.
"""
Don't miss how this works. It's not a server-side application: the code runs entirely in your browser using SQLite compiled to WASM, but rather than fetching the full 22 GB database it uses a clever hack that retrieves just the "shards" of the SQLite database needed for the page you are viewing.
I watched the browser network panel and saw it fetch individual shard files as I paginated to previous days. It's reminiscent of that brilliant sql.js HTTP VFS trick from a few years ago: https://github.com/phiresky/sql.js-httpvfs - only that one used HTTP range headers, while this one uses sharded files instead.
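The range-header variant is easy to poke at outside the browser. A minimal sketch with a placeholder URL, pulling a single page of a remote SQLite file the way an HTTP VFS would:

```python
import urllib.request

DB_URL = "https://example.com/archive.sqlite"  # placeholder URL for a remote SQLite file
PAGE_SIZE = 4096                               # assumes the default SQLite page size

def read_page(page_number: int) -> bytes:
    start = page_number * PAGE_SIZE
    req = urllib.request.Request(
        DB_URL, headers={"Range": f"bytes={start}-{start + PAGE_SIZE - 1}"}
    )
    with urllib.request.urlopen(req) as resp:  # a range-aware server answers 206 Partial Content
        return resp.read()

# Page 0 starts with the 16-byte magic header "SQLite format 3\x00".
print(read_page(0)[:16])
```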
The interactive SQL query interface at https://hackerbook.dosaygo.com/?view=query asks you to select which shards to run the query against; there are 1636 in total.