Wouldn't this basically give us Google Books and searchable Scihub at the same time?
What would it cost?
They did! They conducted a competition https://annas-archive.org/blog/all-isbns-winners.html , in which a few submissions exceeded the minimum requirements and implemented a good search tool & visualiser.
You must mean free-text search and page-level return, because it already has full metadata indexing.
The thing is, AA doesn't hold the texts. They're disputed IP, and even a derived work would be a legal target.
There’s an Android app called Openlib. [1]
Description:
Openlib is an open-source app to download and read books from the shadow library Anna’s Archive. The app has a built-in reader.
As Anna’s Archive doesn't have an API, the app works by sending requests to Anna’s Archive and parsing the responses into objects. It extracts the mirror links from the responses, downloads the book, and stores it in the application's documents directory.
Note: The app requires a VPN to function properly. Without a VPN it might show the captcha-required page even after you complete the captcha.
Main Features:
Trending Books
Download And Read Books With In-Built Viewer
Supports Epub And Pdf Formats
Open Books With Your Favourite Ebooks Reader
Filter Books
Sort Books
Facebook said they leeched it, and Anna once mentioned that a few companies, most of them from China, paid for it. So I assume the answer is yes: someone has the data and has very likely built the search, but no one will open it up given the legal and reputational risk.
Z-Library has a keyword search. Personally, I didn't find it too useful, especially given that Google Books exists. It's not easy to create a quality book search engine.
As far as I know, no one has fully implemented full-text search directly over Anna's Archive. Technically it’s feasible with tools like Meilisearch, Elasticsearch, or Lucene (see the sketch after this list), but the main challenges are:
Converting all documents (PDFs, EPUBs, etc.) to clean plaintext.
Indexing at scale efficiently.
Managing potential legal issues.
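To make the first two challenges concrete, here is a minimal, untested sketch in Python. It assumes the pypdf, ebooklib, beautifulsoup4, and elasticsearch packages plus a local Elasticsearch node; the books/ directory and index name are hypothetical, not anything AA provides.

    from pathlib import Path
    from bs4 import BeautifulSoup
    from ebooklib import epub, ITEM_DOCUMENT
    from elasticsearch import Elasticsearch
    from pypdf import PdfReader

    es = Elasticsearch("http://localhost:9200")

    def pdf_to_text(path: Path) -> str:
        # Extract whatever text layer the PDF already has (scanned PDFs need OCR instead).
        reader = PdfReader(str(path))
        return "\n".join(page.extract_text() or "" for page in reader.pages)

    def epub_to_text(path: Path) -> str:
        # Strip the HTML out of each chapter document in the EPUB.
        book = epub.read_epub(str(path))
        return "\n".join(
            BeautifulSoup(item.get_content(), "html.parser").get_text()
            for item in book.get_items_of_type(ITEM_DOCUMENT)
        )

    def index_file(path: Path) -> None:
        text = pdf_to_text(path) if path.suffix == ".pdf" else epub_to_text(path)
        es.index(index="books", document={"filename": path.name, "body": text})

    # Hypothetical directory of already-downloaded files.
    for f in Path("books").iterdir():
        if f.suffix in (".pdf", ".epub"):
            index_file(f)

Meilisearch would look much the same; the hard part isn't the indexing call, it's getting clean plaintext out of scanned or badly produced files in the first place.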
Z-Library already does something like this, to some extent: basic full-text queries do search inside the body of books and articles. But it's smaller in scope and doesn't integrate AA's full catalog.
It's a bit smaller than Anna's Archive, as they actually host their own collections. From some locations, it's only easily accessible through Tor.
Related question: has Anna's Archive been thoroughly filtered for non-copyright-related illegal material? Pedo stuff, terrorism, etc. I've considered downloading a few chunks of it, but I'm worried about ending up with content I really don't want to be anywhere near.
Probably this was already done at Google, Meta, X and OpenAI, before training their LLMs.
There's only a small number of people willing to put in significant engineering hours for something that would be illegal and non-monetizable.
There is a search solution for zipped fb2 files. Not exactly what you need, but it has potential.
The project has a similar story to Anna's Archive. There is 0.5 TB of archived books, and the project builds an index of all the books with text, title, and author search, and provides an HTML UI for search and reading. On a weak machine it takes about 2 hours to build that index.
So if you have zipped archives of fb2 files, you can use the project to create a web UI with search over those files, without needing enough space to unpack them all.
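For a sense of how reading fb2 straight out of a zip works, here's a tiny standard-library Python sketch (the archive name is hypothetical, and it assumes every entry is a bare .fb2 file; the actual project does much more than this):

    import zipfile
    import xml.etree.ElementTree as ET

    # FictionBook 2.0 XML namespace.
    NS = {"fb2": "http://www.gribuser.ru/xml/fictionbook/2.0"}

    def titles_in_archive(zip_path):
        """Yield (entry name, book title) pairs without unpacking the archive to disk."""
        with zipfile.ZipFile(zip_path) as zf:
            for name in zf.namelist():
                if not name.endswith(".fb2"):
                    continue
                with zf.open(name) as fh:
                    root = ET.parse(fh).getroot()
                    title = root.find(".//fb2:book-title", NS)
                    yield name, title.text if title is not None else "(untitled)"

    # Hypothetical archive name.
    for entry, title in titles_in_archive("fb2-books.zip"):
        print(entry, "-", title)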
You'll have to translate some Russian, though, to get instructions on how to set it up.
https://gitlab.com/opennota/fb2index/-/blob/master/README.ru...
I have found some search engines, but I do not think they're for Anna's.
Seeing as the models from OpenAI & co. were trained on torrented books from similar places, I'm sure ChatGPT provides an adequate search layer on top of Anna's Archive, though it is not as free of confabulations as one might hope for in a search engine.
Edit: grammar
Has anyone explored a different angle — like mapping out the 1,000 most frequently mentioned or cited books (across HN, Substack, Twitter, etc.), then turning their raw content into clean, structured data optimized for LLMs? Imagine curating these into thematic shelves — say, “Bill Gates’ Bookshelf” or “HN Canon” — and building an indie portal where anyone can semantically search across these high-signal texts. Kind of like an AI-searchable personal library of the internet’s favorite books.
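A rough sketch of the semantic-search piece of that idea, assuming the sentence-transformers and numpy packages; the model name is just a common default, and the passages are hypothetical cleaned excerpts, not real book data.

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Hypothetical cleaned passages from a curated "shelf" of books.
    passages = [
        "Passage from book A about compounding and patience ...",
        "Passage from book B about startup hiring ...",
    ]
    passage_vecs = model.encode(passages, normalize_embeddings=True)

    def search(query, k=3):
        qvec = model.encode([query], normalize_embeddings=True)[0]
        scores = passage_vecs @ qvec  # cosine similarity, since vectors are normalized
        top = np.argsort(-scores)[:k]
        return [(float(scores[i]), passages[i]) for i in top]

    print(search("what do these books say about compound interest?"))

The curation (which 1,000 books, which editions, how to clean them) would be the real work; the retrieval layer itself is mostly off-the-shelf.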
Facebook did; its AI is trained on it, so you can use that.
Yes, every major LLM company did it:
illegally using Anna's Archive, The Pile, Common Crawl, their own crawls, Books2, LibGen, etc., embedding it all into a high-dimensional space, and doing next-token prediction on it.
A functional full-text search of the shadow libraries would be massive. It would have an impact on humanity comparable to the impact AI will have. And it's probably not technically difficult. Let's start a project to get this done!
Edit: I have had this exact project as my dream for a couple of years, and even experimented a little bit. But I'm not a programmer, so I can only understand theoretically what would be needed for this to work.
Anybody with the same dream, send me an e-mail to [email protected] and let's see what we can do to get the ball rolling!
No, because you can't avoid the legal issues of doing that.
This works in various search engines
site:annas-archive.org avocado
Don't do it. Just because you can doesn't mean you should. Do you know whether they have anywhere near the legal muscle to push back against the flood of legal notices if you did this? Assume it survives because it doesn't have a wide-open barn door to the public.
Mebbe easier to just search Amazon or Goodreads. Like site:amazon.ca <query words> as someone has mentioned below.
Every book has a 10- or 13-digit ISBN to identify it. Unless it's some self-pub/amateur-hour situation by some paranoid prepper living in a Faraday-cage-protected bunker in Arkansas or Florida, it's likely a publication with a title, an author, and an ISBN.
Honestly I don't think it would be that costly, but it would take a pretty long time to put together. I have a (few years old) copy of Library Genesis converted to plaintext and it's around 1TB. I think libgen proper was 50-100TB at the time, so we can probably assume that AA (~1PB) would be around 10-20TB when converted to plaintext. You'd probably spend several weeks torrenting a chunk of the archive, converting everything in it to plaintext, deleting the originals, then repeating with a new chunk until you have plaintext versions of everything in the archive. Then indexing all that for full text search would take even more storage and even more time, but still perfectly doable on commodity hardware.
The main barriers are going to be reliably extracting plaintext from the myriad of formats in the archive, cleaning up the data, and selecting a decent full text search database (god help you if you pick wrong and decide you want to switch and re-index everything later).
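A rough sketch of that chunk-at-a-time conversion loop, assuming poppler's pdftotext and Calibre's ebook-convert are installed; the directory layout is hypothetical.

    import subprocess
    from pathlib import Path

    def convert_chunk(chunk_dir: Path, out_dir: Path) -> None:
        """Convert every PDF/EPUB in one downloaded chunk to plaintext, deleting originals as we go."""
        out_dir.mkdir(parents=True, exist_ok=True)
        for src in chunk_dir.iterdir():
            dst = out_dir / (src.stem + ".txt")
            if src.suffix == ".pdf":
                cmd = ["pdftotext", str(src), str(dst)]
            elif src.suffix == ".epub":
                cmd = ["ebook-convert", str(src), str(dst)]
            else:
                continue
            if subprocess.run(cmd).returncode == 0:
                src.unlink()  # free the disk before moving on to the next chunk

    convert_chunk(Path("chunks/chunk-0001"), Path("plaintext/chunk-0001"))

Peak disk usage then stays at roughly one chunk plus its plaintext, which is what makes the whole thing feasible on commodity hardware.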