Community, all the HN belong to you. This is an archive of Hacker News that fits in your browser. When I made HN Made of Primes I realized I could probably do this offline SQLite/WASM thing with the whole GBs of archive. The whole dataset. So I tried it, and this is it. Have Hacker News on your device.
Go to this repo (https://github.com/DOSAYGO-STUDIO/HackerBook): you can download it. BigQuery -> ETL -> npx serve docs - that's it. 20 years of HN arguments and beauty can be yours forever. So they'll never die. Ever. It's the unkillable static archive of HN and it's in your hands. That's my year-end gift to you all. Thank you for a wonderful year, and have a happy and wonderful 2026. Make something of it.
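For the curious about that first arrow: the BigQuery part is basically a query over the public hacker_news dataset, something like this (heavily simplified - the real ETL does more, and double-check the column names against the dataset's schema):

```python
from google.cloud import bigquery

# Queries the public HN dataset; billing goes to your own GCP project.
client = bigquery.Client()
sql = """
    SELECT id, type, `by`, title, url, text, parent, score, time
    FROM `bigquery-public-data.hacker_news.full`
    WHERE type IN ('story', 'comment')
"""
for row in client.query(sql).result():
    ...  # hand each row to the ETL step that writes the static SQLite shards
```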
I wonder how much smaller it could get with some compression. You could probably encode "This website hijacks the scrollbar and I don't like it" comments into just a few bits.
It'd be great if you could add it to Kiwix[1] somehow (not sure what the process is for that, but 100rabbits figured it out for their site). I use it all the time now that I have a dumb phone; I have the entirety of Wikipedia, Wiktionary and 100rabbits all offline.
Awesome work.
Minor bug/suggestion: right-aligned text inputs (eg the username input on the “me” page) aren’t ideal since they are often obscured by input helpers (autocomplete or form fill helper icons).
Similar to single-page applications (SPA), single-table applications (STA) might become a thing. Just shard a table on multiple keys and serve the shards as static files, provided the data is OK to share, similar to serving static HTML content.
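A minimal sketch of what I mean, with a hypothetical "items" table sharded on id modulo a fixed shard count (a real layout would pick shard keys to match access patterns):

```python
import sqlite3

SHARDS = 16  # hypothetical shard count

def shard_items(src_path: str, out_dir: str) -> None:
    """Split one big items table into SHARDS standalone SQLite files, keyed by id."""
    src = sqlite3.connect(src_path)
    for n in range(SHARDS):
        dst = sqlite3.connect(f"{out_dir}/items-{n:04d}.sqlite")
        dst.execute(
            "CREATE TABLE items (id INTEGER PRIMARY KEY, type TEXT, title TEXT, text TEXT)"
        )
        rows = src.execute(
            "SELECT id, type, title, text FROM items WHERE id % ? = ?", (SHARDS, n)
        )
        dst.executemany("INSERT INTO items VALUES (?, ?, ?, ?)", rows)
        dst.commit()
        dst.close()
    src.close()
```

The shard files are then plain static assets (served with npx serve or anything similar), and a client only downloads the shard whose key space covers what it needs.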
That's pretty neat!
I did something similar. I built a tool[1] to import the Project Arctic Shift dumps[2] of Reddit into SQLite. It was mostly an exercise to experiment with Rust and SQLite (HN's two favorite topics). If you don't build an FTS5 index and import without WAL (--unsafe-mode), importing every Reddit comment and submission takes a bit over 24 hours and produces a ~10TB DB.
SQLite offers a lot of cool JSON features that would let you store the raw JSON and operate on that, but I eschewed them in favor of parsing only once at load time. That also lets me normalize the data a bit.
I find that building the DB is pretty "fast", but queries run much faster if I immediately vacuum the DB after building it. The vacuum operation is actually slower than the original import, taking a few days to finish (rough sketch of the trade-off below).
[1] https://github.com/Paul-E/Pushshift-Importer
[2] https://github.com/ArthurHeitmann/arctic_shift/blob/master/d...
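Not the actual importer (that's Rust), but the pragma trade-off in a nutshell, sketched with Python's sqlite3 module and placeholder columns:

```python
import sqlite3

db = sqlite3.connect("reddit.sqlite")
# "Unsafe mode" in spirit: no rollback journal and no fsyncs between batches.
# A crash mid-import can corrupt the file, so only do this for data you can rebuild.
db.execute("PRAGMA journal_mode = OFF")
db.execute("PRAGMA synchronous = OFF")
db.execute(
    "CREATE TABLE IF NOT EXISTS comments (id TEXT PRIMARY KEY, author TEXT, body TEXT)"
)

def import_batches(batches):
    # batches: an iterable of lists of (id, author, body) tuples parsed from the dumps
    for batch in batches:
        db.executemany("INSERT OR IGNORE INTO comments VALUES (?, ?, ?)", batch)
    db.commit()

def finalize():
    # VACUUM rewrites the whole file into a compact, defragmented layout;
    # it is slow, but subsequent queries read far fewer pages.
    db.execute("VACUUM")
```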
I tried "select * from items limit 10" and it is slowly iterating through the shards without returning. I got up to 60 shards before I stopped. Selecting just one shard makes that query return instantly. As mentioned elsewhere I think duckdb can work faster by only reading the part of a parquet file it needs over http.
I was getting an error that the users and user_domains tables aren't available, but you just need to change the shard filter to the user stats shard.
That repo is throwing up a 404 for me.
Question - did you consider tradeoffs between duckdb (or other columnar stores) and SQLite?
Similar in spirit to a tool I recently posted a Show HN about, https://exopriors.com/scry. You can use Claude Code to run SQL+vector queries over Hacker News and many other high-quality public-commons sites. They're exceptionally well indexed and usually have 5+ minute query timeout limits, so you can run seriously large research queries and rapidly refine your worldview (particularly because exhaustive exploration becomes easy).
Looks like the repo was taken down (404).
That's too bad; I'd like to see the inner workings with a subset of data, even with placeholders for the posts and comments.
Absolutely love this!! I have been analyzing a lot of HN data lately [1] so I backtested my hypothesis on your dataset and ran some stats: https://philippdubach.com/standalone/hackerbook-stats/
I threw together some heatmaps of post volume and average score by day and time (15-minute intervals):
story volume (all time): https://ibb.co/pBTTRznP
average score (all time): https://ibb.co/KcvVjx8p
story volume (since 2020): https://ibb.co/cKC5d7Pp
average score (since 2020): https://ibb.co/WpN20kfh
The query tab looks quite complex with all these content shards: https://hackerbook.dosaygo.com/?view=query
I have a much simpler database: https://play.clickhouse.com/play?user=play#U0VMRUNUIHRpbWUsI...
What a reminder of how much more efficient text is than video, it's crazy! Could you imagine the same amount of knowledge (or drivel) in video form? I wonder how large that would be.
The site does not load on Firefox; the console error says 'Uncaught (in promise) TypeError: can't access property "wasm", sqlite3 is null'.
Guess it's common knowledge that SharedArrayBuffer (which SQLite WASM uses) is blocked unless the page is cross-origin isolated, as a mitigation against cross-origin attacks (I just found out ;).
Once the initial chunk of data loads, the rest loads almost instantly on Chrome. Can you please fix the GitHub link (currently a 404)? I'd like to peek at the code. Thank you!
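For reference, SharedArrayBuffer is only exposed on cross-origin-isolated pages, so the host has to send COOP/COEP headers. A minimal local-serving sketch that adds them (the SQLite WASM build also has non-SharedArrayBuffer fallbacks, so this may not be the whole story for this site):

```python
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

class IsolatedHandler(SimpleHTTPRequestHandler):
    def end_headers(self):
        # These two headers opt the page into cross-origin isolation,
        # which is what unlocks SharedArrayBuffer in Firefox and Chrome.
        self.send_header("Cross-Origin-Opener-Policy", "same-origin")
        self.send_header("Cross-Origin-Embedder-Policy", "require-corp")
        super().end_headers()

if __name__ == "__main__":
    # Serve the current directory (e.g. the docs/ folder) with isolation headers.
    ThreadingHTTPServer(("127.0.0.1", 8000), IsolatedHandler).serve_forever()
```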
Wonder if you could turn this into a .zim file for offline browsing with an offline browser like Kiwix, etc. [0]
I've been taking frequent "offline-only-day" breaks to consolidate whatever I've been learning, and Kiwix has been a great tool for reference (offline Wikipedia, StackOverflow and whatnot).
[0] https://kiwix.org/en/the-new-kiwix-library-is-available/
link no workie: https://github.com/DOSAYGO-STUDIO/HackerBook
Nice. I wonder if there’s any way to quickly get a view for a whole year.
Is it just me, or is the design almost unusable on a mobile phone? The tech making this possible is beyond cool, but it's presented in such a brutal way for phone users, even though fixing it would be super simple.
Neat. I keep wanting to build something like this for GitHub audit logs, but at ~5 TB it's probably a little much.
It's really a shame that comment scores are hidden forever. Would the admins consider publishing them after stories are old enough that voting is closed? It would be great to have them for archives and search indices and projects like this.
Did anyone get a copy of this before it was pulled? If GitHub is not keen, could it be uploaded to HuggingFace or some other service which hosts large assets?
I have always known I could scrape HN, but I would much rather take a neat little package.
Is there a public dump of the data anywhere that this is based upon, or have they scraped it themselves?
Such a DB might be entertaining to play with, and the threadedness of comments would be useful for beginners to practise efficient recursive queries (more so than the StackExchange dumps, for instance).
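A rough sketch of the kind of recursive query I mean, assuming an items table with HN-style id and parent columns (the real schema may differ):

```python
import sqlite3

THREAD_SQL = """
WITH RECURSIVE thread(id, parent, text, depth) AS (
    SELECT id, parent, text, 0 FROM items WHERE id = :root   -- the story itself
    UNION ALL
    SELECT i.id, i.parent, i.text, t.depth + 1
    FROM items AS i
    JOIN thread AS t ON i.parent = t.id                      -- walk down the replies
)
SELECT id, depth, text FROM thread ORDER BY depth, id
"""

def fetch_thread(db_path: str, root_id: int):
    con = sqlite3.connect(db_path)
    return con.execute(THREAD_SQL, {"root": root_id}).fetchall()
```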
1 hour passed and it's already nuked?
Thank you btw
This is pretty neat! The calendar didn't work well for me. I could only seem to navigate by month. And when I selected the earliest day (after much tapping), nothing seemed to be updated.
Nonetheless, random access history is cool.
It suddenly occurs to me that it would be neat to pair a small LLM (3-7B) with an HN dataset.
22 GB for mostly text? I tried loading the site, and it's pretty slow. Curious how the query performance is with this much data in SQLite.
Apparently the comment counts only include top-level comments?
It would be nice for the thread pages to show a comment count.
Is this updated regularly? The GitHub repo is a 404 for me too, as in the other comments.
With all due respect, it would be great if there were an official HN public dump available (one that doesn't require something like BigQuery, which is expensive).
How do I download it? That repo is a 404.
Beautiful!
2026 prayer: for all you AI junkies, please don't pollute HN with your dirty AI gaming.
Don't bot posts, comments, or upvotes/downvotes just to maximize karma. Please.
We can't tell anymore who's a bot and who's human. I just want to hang out with real humans here.
Hahaha, now you can be prepared for the apocalypse when the internet disappears. ;)
> Community, all the HN belong to you. This is an archive of Hacker News that fits in your browser.
> 20 years of HN arguments and beauty can be yours forever. So they'll never die. Ever. It's the unkillable static archive of HN and it's in your hands
I'm really sorry to have to ask this, but this really feels like you had an LLM write it?
How much space is needed for the data? I'm wondering if it would work on a tablet.
Alas, HN does not belong to us, and the existence of projects like this is subject to the whims of the legal owners of HN.
From the terms of use [0]:
"""
Commercial Use: Unless otherwise expressly authorized herein or in the Site, you agree not to display, distribute, license, perform, publish, reproduce, duplicate, copy, create derivative works from, modify, sell, resell, exploit, transfer or upload for any commercial purposes, any portion of the Site, use of the Site, or access to the Site. The buying, exchanging, selling and/or promotion (commercial or otherwise) of upvotes, comments, submissions, accounts (or any aspect of your account or any other account), karma, and/or content is strictly prohibited, constitutes a material breach of these Terms of Use, and could result in legal liability.
"""
Don't miss how this works. It's not a server-side application: the code runs entirely in your browser using SQLite compiled to WASM, but rather than fetching the full 22 GB database it uses a clever hack that retrieves just the "shards" of the SQLite database needed for the page you are viewing.
I watched the browser network panel and saw it fetch individual shard files as I paginated to previous days. It's reminiscent of that brilliant sql.js HTTP VFS trick from a few years ago: https://github.com/phiresky/sql.js-httpvfs - only that one used HTTP range headers, while this one uses sharded files instead.
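The range-header variant is easy to poke at outside the browser. A minimal sketch with a placeholder URL, pulling a single page of a remote SQLite file the way an HTTP VFS would:

```python
import urllib.request

DB_URL = "https://example.com/archive.sqlite"  # placeholder URL for a remote SQLite file
PAGE_SIZE = 4096                               # assumes the default SQLite page size

def read_page(page_number: int) -> bytes:
    start = page_number * PAGE_SIZE
    req = urllib.request.Request(
        DB_URL, headers={"Range": f"bytes={start}-{start + PAGE_SIZE - 1}"}
    )
    with urllib.request.urlopen(req) as resp:  # a range-aware server answers 206 Partial Content
        return resp.read()

# Page 0 starts with the 16-byte magic header "SQLite format 3\x00".
print(read_page(0)[:16])
```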
The interactive SQL query interface at https://hackerbook.dosaygo.com/?view=query asks you to select which shards to run the query against; there are 1636 in total.