
Show HN: 22 GB of Hacker News in SQLite

667 points by keepamovin · yesterday at 5:01 PM · 201 comments

Community, All the HN belong to you. This is an archive of hacker news that fits in your browser. When I made HN Made of Primes, I realized I could probably do this offline SQLite/WASM thing with the whole multi-GB archive. The whole dataset. So I tried it, and this is it. Have Hacker News on your device.

Go to this repo (https://github.com/DOSAYGO-STUDIO/HackerBook): you can download it. BigQuery -> ETL -> npx serve docs - that's it. 20 years of HN arguments and beauty, can be yours forever. So they'll never die. Ever. It's the unkillable static archive of HN and it's your hands. That's my year-end gift to you all. Thank you for a wonderful year; have a happy and wonderful 2026. Make something of it.
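For anyone who hasn't touched the first stage of that pipeline: the HN dump lives in BigQuery's public datasets. A sketch of the extract step, assuming the bigquery-public-data.hacker_news.full table and the @google-cloud/bigquery client; the shard-building ETL after this is whatever the repo does:

  // Hypothetical extract step: pull HN items from BigQuery's public dataset.
  // Assumes default Google Cloud credentials are configured.
  import { BigQuery } from "@google-cloud/bigquery";

  const bq = new BigQuery();
  const [rows] = await bq.query({
    query: `
      SELECT id, type, \`by\`, time, title, text, parent, score
      FROM \`bigquery-public-data.hacker_news.full\`
      WHERE timestamp >= TIMESTAMP('2024-01-01')
    `,
  });
  console.log(`${rows.length} items fetched`);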


Comments

simonw · yesterday at 7:27 PM

Don't miss how this works. It's not a server-side application - the code runs entirely in your browser using SQLite compiled to WASM. But rather than fetching the full 22 GB database, it uses a clever hack that retrieves just the "shards" of the SQLite database needed for the page you are viewing.

I watched it in the browser network panel and saw it fetch:

  https://hackerbook.dosaygo.com/static-shards/shard_1636.sqlite.gz
  https://hackerbook.dosaygo.com/static-shards/shard_1635.sqlite.gz
  https://hackerbook.dosaygo.com/static-shards/shard_1634.sqlite.gz
as I paginated to previous days.

It's reminiscent of that brilliant sql.js HTTP VFS trick from a few years ago: https://github.com/phiresky/sql.js-httpvfs - only that one used HTTP range requests, while this one uses sharded files instead.

The interactive SQL query interface at https://hackerbook.dosaygo.com/?view=query asks you to select which shards to run the query against; there are 1636 in total.
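For the curious, that fetch-decompress-query loop can be sketched in a few lines. This is an illustration under stated assumptions, not the site's actual loader: it assumes the sql.js WASM build (https://github.com/sql-js/sql.js), and the items table name is borrowed from a query another commenter runs further down the thread.

  // Hypothetical shard loader; only the URL pattern above is observed,
  // everything else is a guess at how such a loader might work.
  import initSqlJs from "sql.js";

  async function loadShard(shardId: number) {
    const SQL = await initSqlJs();
    // Fetch only the gzipped shard needed for the page being viewed.
    const res = await fetch(
      `https://hackerbook.dosaygo.com/static-shards/shard_${shardId}.sqlite.gz`
    );
    // Decompress in the browser with the native DecompressionStream API.
    const gunzipped = res.body!.pipeThrough(new DecompressionStream("gzip"));
    const bytes = new Uint8Array(await new Response(gunzipped).arrayBuffer());
    // Open the decompressed bytes as an in-memory SQLite database.
    return new SQL.Database(bytes);
  }

  // const db = await loadShard(1636);
  // db.exec("SELECT * FROM items LIMIT 10");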

yread · yesterday at 9:17 PM

I wonder how much smaller it could get with some compression. You could probably encode "This website hijacks the scrollbar and I don't like it" comments into just a few bits.

kamranjon · today at 12:58 AM

It'd be great if you could add it to Kiwix[1] somehow (not sure what the process is for that, but 100rabbits figured it out for their site). I use it all the time now that I have a dumb phone - I have the entirety of Wikipedia, Wiktionary, and 100rabbits all offline.

https://kiwix.org/en/

ComputerGuru · today at 5:42 PM

Awesome work.

Minor bug/suggestion: right-aligned text inputs (e.g. the username input on the “me” page) aren’t ideal, since they are often obscured by input helpers (autocomplete or form-fill helper icons).

zkmon · yesterday at 7:41 PM

Similar to single-page applications (SPAs), the single-table application (STA) might become a thing: just shard a table on multiple keys and serve the shards as static files, provided the data is OK to share, similar to serving static HTML content.
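A minimal sketch of the idea, assuming Node with better-sqlite3 and a time column stored as Unix seconds - every name here is illustrative, not HackerBook's actual ETL:

  // Split one big table into per-day static shard files.
  import Database from "better-sqlite3";

  function shardByDay(srcPath: string, outDir: string) {
    const src = new Database(srcPath, { readonly: true });
    const days = src
      .prepare("SELECT DISTINCT date(time, 'unixepoch') AS day FROM items")
      .all() as { day: string }[];
    src.close();
    for (const { day } of days) {
      const shard = new Database(`${outDir}/shard_${day}.sqlite`);
      // ATTACH the source so one CREATE TABLE ... AS SELECT copies the rows.
      shard.exec(`ATTACH '${srcPath}' AS src`);
      shard.exec(
        `CREATE TABLE items AS
         SELECT * FROM src.items WHERE date(time, 'unixepoch') = '${day}'`
      );
      shard.exec("DETACH src");
      shard.close();
    }
  }

  shardByDay("hn.sqlite", "docs/static-shards");

Each shard file is then immutable and cacheable forever behind any static host or CDN.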

Paul-E · yesterday at 7:17 PM

That's pretty neat!

I did something similar. I built a tool[1] to import the Project Arctic Shift dumps[2] of Reddit into SQLite. It was mostly an exercise to experiment with Rust and SQLite (HN's two favorite topics). If you don't build an FTS5 index and import without WAL (--unsafe-mode), importing every Reddit comment and submission takes a bit over 24 hours and produces a ~10 TB DB.

SQLite offers a lot of cool JSON features that would let you store the raw JSON and operate on that, but I eschewed them in favor of parsing only once at load time. That also lets me normalize the data a bit.

I find that building the DB is pretty "fast", but queries run much faster if I vacuum the DB immediately after building it. The vacuum operation is actually slower than the original import, taking a few days to finish.

[1] https://github.com/Paul-E/Pushshift-Importer

[2] https://github.com/ArthurHeitmann/arctic_shift/blob/master/d...
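For anyone curious what an unsafe-mode bulk import amounts to, the standard SQLite recipe looks roughly like this - a sketch with better-sqlite3 rather than the actual Rust importer, with a made-up comments table:

  // Classic fast-bulk-import settings: no journal, no fsync, batched
  // transactions. A crash mid-import can corrupt the file, so only use
  // this when you can rebuild from the source dumps.
  import Database from "better-sqlite3";

  const db = new Database("reddit.sqlite");
  db.pragma("journal_mode = OFF");  // skip the WAL/rollback journal
  db.pragma("synchronous = OFF");   // skip fsync on every commit
  db.exec(
    "CREATE TABLE IF NOT EXISTS comments (id TEXT PRIMARY KEY, author TEXT, body TEXT)"
  );

  const insert = db.prepare("INSERT INTO comments VALUES (?, ?, ?)");
  // One transaction per batch instead of one per row.
  const insertMany = db.transaction((rows: [string, string, string][]) => {
    for (const row of rows) insert.run(...row);
  });
  insertMany([["c1", "alice", "hello"], ["c2", "bob", "world"]]);

  // VACUUM rewrites the whole file contiguously - slow, as noted above,
  // but subsequent queries benefit from the defragmented layout.
  db.exec("VACUUM");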

kristianp · yesterday at 11:13 PM

I tried "select * from items limit 10" and it is slowly iterating through the shards without returning. I got up to 60 shards before I stopped. Selecting just one shard makes that query return instantly. As mentioned elsewhere I think duckdb can work faster by only reading the part of a parquet file it needs over http.

I was getting an error that the users and user_domains tables aren't available, but you just need to change the shard filter to the user stats shard.

carbocation · yesterday at 6:37 PM

That repo is throwing up a 404 for me.

Question - did you consider the tradeoffs between DuckDB (or other columnar stores) and SQLite?

Xyra · today at 7:16 AM

Similar in spirit to a tool I recently posted a Show HN about: https://exopriors.com/scry. You can use Claude Code to run SQL+vector queries against Hacker News and many other high-quality public-commons sites, all exceptionally well indexed and usually with 5+ minute query timeout limits, so you can run seriously large research queries and rapidly refine your worldview (particularly because exhaustive exploration becomes easy).

m-p-3 · yesterday at 10:40 PM

Looks like the repo was taken down (404).

That's too bad; I'd like to see the inner workings with a subset of the data, even with placeholders for the posts and comments.

7777777phil · today at 11:43 AM

Absolutely love this!! I have been analyzing a lot of HN data lately [1], so I backtested my hypothesis on your dataset and ran some stats: https://philippdubach.com/standalone/hackerbook-stats/

[1] https://news.ycombinator.com/item?id=46434575

WadeGrimridge · today at 10:34 AM

Threw some heatmaps together of post volume and average score by day and time (15-minute intervals):

story volume (all time): https://ibb.co/pBTTRznP

average score (all time): https://ibb.co/KcvVjx8p

story volume (since 2020): https://ibb.co/cKC5d7Pp

average score (since 2020): https://ibb.co/WpN20kfh

zX41ZdbW · yesterday at 6:49 PM

The query tab looks quite complex with all these content shards: https://hackerbook.dosaygo.com/?view=query

I have a much simpler database: https://play.clickhouse.com/play?user=play#U0VMRUNUIHRpbWUsI...

sieep · yesterday at 7:31 PM

What a reminder of how much more efficient text is than video - it's crazy! Could you imagine the same amount of knowledge (or drivel) but in video form? I wonder how large that would be.
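A back-of-envelope answer, with every rate a rough assumption:

  22 GB of text  ~ 2.2e10 chars ~ 3.7e9 words   (at ~6 chars/word)
  read aloud     ~ 150 wpm      ~ 2.4e7 minutes (~46 years of speech)
  as 1080p video ~ 5 Mbps       ~ 1.44e9 s * 0.625 MB/s ~ 0.9 PB

So on the order of a petabyte: roughly 40,000 times larger than the text.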

Sn0wCoder · yesterday at 9:57 PM

Site does not load on Firefox; the console error says 'Uncaught (in promise) TypeError: can't access property "wasm", sqlite3 is null'.

Guess it's common knowledge that SharedArrayBuffer (which SQLite WASM uses) is disabled as a mitigation for cross-origin attacks unless the site opts into cross-origin isolation (I just found out ;).

Once the initial chunk of data loads, the rest loads almost instantly on Chrome. Can you please fix the GitHub link (currently a 404)? I'd like to peek at the code. Thank you!
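For reference, SharedArrayBuffer does work in Firefox once a page is cross-origin isolated; it's two opt-in response headers that gate it. A hypothetical Node server showing them (not the project's actual setup):

  // Serve ./docs with the headers that enable cross-origin isolation,
  // which SharedArrayBuffer (and thus multi-threaded SQLite WASM) needs.
  import { createServer } from "node:http";
  import { readFile } from "node:fs/promises";

  createServer(async (req, res) => {
    res.setHeader("Cross-Origin-Opener-Policy", "same-origin");
    res.setHeader("Cross-Origin-Embedder-Policy", "require-corp");
    try {
      const path = req.url === "/" ? "/index.html" : req.url!;
      res.end(await readFile(`./docs${path}`));
    } catch {
      res.statusCode = 404;
      res.end("not found");
    }
  }).listen(8080);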

abixb · yesterday at 7:12 PM

Wonder if you could turn this into a .zim file for offline browsing with Kiwix or another offline reader. [0]

I've been taking frequent "offline-only-day" breaks to consolidate whatever I've been learning, and Kiwix has been a great tool for reference (offline Wikipedia, StackOverflow and whatnot).

[0] https://kiwix.org/en/the-new-kiwix-library-is-available/

diyseguy · yesterday at 11:20 PM

link no workie: https://github.com/DOSAYGO-STUDIO/HackerBook

tevon · yesterday at 7:34 PM

The link seems to be down, was it taken down?

rcarmo · today at 1:40 PM

Nice. I wonder if there’s any way to quickly get a view for a whole year.

adamszakal · today at 12:14 PM

Is it just me, or is the design almost unusable on a mobile phone? The tech making this possible is beyond cool, but it's presented in such a brutal way for phone users, even though fixing it would be super simple.

RyJones · today at 10:12 AM

Neat. I keep wanting to build something like this for GitHub audit logs, but at ~5 TB it's probably a little much.

modeless · today at 1:45 AM

It's really a shame that comment scores are hidden forever. Would the admins consider publishing them after stories are old enough that voting is closed? It would be great to have them for archives and search indices and projects like this.

3eb7988a1663 · today at 1:41 AM

Did anyone get a copy of this before it was pulled? If GitHub is not keen, could it be uploaded to HuggingFace or some other service which hosts large assets?

I have always known I could scrape HN, but I would much rather take a neat little package.

dspillett · yesterday at 10:15 PM

Is there a public dump of the data anywhere that this is based upon, or have they scraped it themselves?

Such a DB might be entertaining to play with, and the threadedness of comments would be useful for beginners to practise efficient recursive queries (more so than the StackExchange dumps, for instance).
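For instance, a recursive CTE that walks one story's comment tree - assuming the items table carries (id, parent) columns as in the HN API, and a decompressed shard file; better-sqlite3 here is just for illustration:

  // Walk one story's comment tree top-down with a recursive CTE.
  import Database from "better-sqlite3";

  const db = new Database("shard_1636.sqlite", { readonly: true });
  const thread = db.prepare(`
    WITH RECURSIVE thread(id, parent, depth) AS (
      SELECT id, parent, 0 FROM items WHERE id = ?       -- the story itself
      UNION ALL
      SELECT i.id, i.parent, t.depth + 1
      FROM items i JOIN thread t ON i.parent = t.id      -- its descendants
    )
    SELECT * FROM thread ORDER BY depth
  `).all(8863);  // 8863 is just an example story id
  console.log(`${thread.length} items in the thread`);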

yupyupyups · yesterday at 6:51 PM

1 hour passed and it's already nuked?

Thank you btw

spit2wind · yesterday at 9:40 PM

This is pretty neat! The calendar didn't work well for me, though: I could only seem to navigate by month, and when I selected the earliest day (after much tapping), nothing seemed to update.

Nonetheless, random access history is cool.

fouc · today at 3:26 AM

It suddenly occurs to me that it would be neat to pair a small LLM (3-7B) with an HN dataset.

dmarwicke · yesterday at 9:19 PM

22 GB for mostly text? I tried loading the site and it's pretty slow. Curious what query performance is like with this much data in SQLite.

layer8 · yesterday at 10:12 PM

Apparently the comment counts include only top-level comments?

It would be nice for the thread pages to show a comment count.

wslh · yesterday at 6:44 PM

Is this updated regularly? I get a 404 on GitHub like the other commenters.

With all due respect, it would be great if there were an official HN public dump available (one that doesn't require stuff such as BigQuery, which is expensive).

joshcsimmons · yesterday at 11:26 PM

Link appears broken

KomoD · yesterday at 10:45 PM

How do I download it? That repo is a 404.

sirjaz · yesterday at 7:33 PM

This would be awesome as a cross platform app.

solarized · yesterday at 11:12 PM

Beautiful!

2026 prayer: for all you AI junkies, please don't pollute HN with your dirty AI gaming.

Don't bot posts, comments, or upvotes/downvotes just to maximize karma. Please.

We can't tell who's a bot and who's human anymore. I just want to hang out with real humans here.

DenisDolya · today at 7:57 AM

Hahaha, now you can be prepared for the apocalypse when the internet disappears. ;)

fao_ · yesterday at 7:24 PM

> Community, All the HN belong to you. This is an archive of hacker news that fits in your browser.

> 20 years of HN arguments and beauty, can be yours forever. So they'll never die. Ever. It's the unkillable static archive of HN and it's your hands

I'm really sorry to have to ask this, but it really feels like you had an LLM write it?

asdefghyk · yesterday at 5:37 PM

How much space is needed for the data? I'm wondering if it would work on a tablet.

abetusk · yesterday at 10:34 PM

Alas, HN does not belong to us, and the existence of projects like this is subject to the whims of the legal owners of HN.

From the terms of use [0]:

"""

Commercial Use: Unless otherwise expressly authorized herein or in the Site, you agree not to display, distribute, license, perform, publish, reproduce, duplicate, copy, create derivative works from, modify, sell, resell, exploit, transfer or upload for any commercial purposes, any portion of the Site, use of the Site, or access to the Site. The buying, exchanging, selling and/or promotion (commercial or otherwise) of upvotes, comments, submissions, accounts (or any aspect of your account or any other account), karma, and/or content is strictly prohibited, constitutes a material breach of these Terms of Use, and could result in legal liability.

"""

[0] https://www.ycombinator.com/legal/#tou
