I think there are a couple of ways to improve it:
1. There are a lot of variants of the same book, and we only need one for the index. Perhaps for each ISBN, select the format that is easiest to parse (see the sketch after this list).
2. We can download, convert, and index the top 100K books first, launch with those, and then continue indexing and adding the rest.
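A minimal sketch of the first step, assuming a catalog dump with hypothetical isbn13 and format fields (not the actual Libgen/AA metadata schema), keeping only the easiest-to-parse file per ISBN:

    # Keep one file per ISBN, preferring formats that convert cleanly to text.
    # Field names (isbn13, format) are placeholders, not a real schema.
    FORMAT_RANK = {"epub": 0, "txt": 1, "html": 2, "pdf": 3, "djvu": 4}

    def pick_one_per_isbn(records):
        best = {}
        for rec in records:
            rank = FORMAT_RANK.get(rec["format"], len(FORMAT_RANK))
            cur = best.get(rec["isbn13"])
            if cur is None or rank < FORMAT_RANK.get(cur["format"], len(FORMAT_RANK)):
                best[rec["isbn13"]] = rec
        return list(best.values())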
The thing is, an ISBN identifies one edition from one publisher, and the same text can easily appear under three different ISBNs from a single publisher (hardcover, trade paperback, mass-market paperback).
I count 80+ editions of J.R.R. Tolkien's _The Hobbit_ at:
https://tolkienlibrary.com/booksbytolkien/hobbit/editions.ph...
Granted, some predate ISBNs, one is the 3D pop-up version rather than a traditional text, and so forth, but filtering by ISBN will _not_ filter out duplicates.
There is also the problem of the same work being published under multiple titles (and therefore multiple ISBNs): Hal Clement's _Small Changes_ was re-published as _Space Lash_, and that short story collection is now included in:
https://www.goodreads.com/book/show/939760.Music_of_Many_Sph...
along with others.
There should be a way to leverage compression when storing multiple editions of the same book.
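For instance (a rough sketch, assuming the python-zstandard package and two hypothetical plain-text files for two editions), one edition can serve as a raw dictionary/prefix when compressing the others, so the shared text is stored roughly once:

    import zstandard as zstd

    # One edition acts as a raw "dictionary" for the others; zstd then encodes
    # the second edition largely as back-references into the first.
    with open("hobbit_ed1.txt", "rb") as f:
        base = f.read()
    with open("hobbit_ed2.txt", "rb") as f:
        other = f.read()

    dict_data = zstd.ZstdCompressionDict(base)
    compressed = zstd.ZstdCompressor(level=19, dict_data=dict_data).compress(other)

    # Decompression needs the same base edition available.
    restored = zstd.ZstdDecompressor(dict_data=dict_data).decompress(compressed)
    assert restored == other

The catch is that an edition can only be decompressed while the base edition is still around, which is fine for an archive that keeps everything anyway.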
How are you going to download the top 100k? The only reasonable way to download that many books from AA or Libgen is to use the torrents, which are sorted sequentially by upload date.
I tried to automate downloading just a thousand books, and it was unbearably slow from both IPFS and the mirrors. I ended up picking the individual files out of the torrents. Even just identifying or deduping the top 100k would be a significant task.
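A rough sketch of that kind of cherry-picking, using the libtorrent Python bindings (the torrent name and the set of wanted filenames below are placeholders):

    import time
    import libtorrent as lt

    # Download only selected files from a batch torrent; everything else is
    # set to priority 0 (skip).
    info = lt.torrent_info("libgen_batch_0042.torrent")   # placeholder name
    wanted = {"1234567.epub", "7654321.pdf"}               # placeholder filenames

    ses = lt.session()
    handle = ses.add_torrent({"ti": info, "save_path": "./books"})

    fs = info.files()
    priorities = [1 if fs.file_path(i).rsplit("/", 1)[-1] in wanted else 0
                  for i in range(fs.num_files())]
    handle.prioritize_files(priorities)

    # progress is relative to the wanted files only
    while handle.status().progress < 1.0:
        time.sleep(10)

Since priorities are applied per file but data is fetched per piece, you still pull a little extra data where wanted and unwanted files share a piece, but it beats downloading whole multi-gigabyte batches.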