I built my own web search index on bare metal, index now up to 34m docs: https://greppr.org/
People rely too much on other people's infra and services, which can be decommissioned anytime. The Google Graveyard is real.
The input on the results page doesn't work; you always have to return to the start page, where browser history is disabled. That's just confusing behaviour.
Unfortunately the index is the easy part. Transforming user input into a series of tokens which get used to rank possible matches and return the top N, based on likely relevance, is the hard part, and I'm afraid this doesn't appear to do an acceptable job with any of the queries I tested.
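To make the tokenize, rank, return-top-N step concrete, here is a toy sketch using plain TF-IDF scoring. It is an illustration only: the `tokenize` and `top_n` names and the in-memory `docs` dict are assumptions for the example, not how greppr or any production engine works, and real relevance ranking needs far more (link signals, spam filtering, query understanding).

```python
import heapq
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_n(query, docs, n=3):
    """Return up to n doc ids from docs (id -> text), best TF-IDF score first."""
    # Term counts per document (hypothetical in-memory stand-in for an index).
    doc_tokens = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    total = len(docs)
    scores = {}
    for term in set(tokenize(query)):
        # Document frequency: how many documents contain this term.
        df = sum(1 for counts in doc_tokens.values() if term in counts)
        if df == 0:
            continue
        idf = math.log(1 + total / df)
        for doc_id, counts in doc_tokens.items():
            if term in counts:
                scores[doc_id] = scores.get(doc_id, 0.0) + counts[term] * idf
    # Documents that match no query term never enter scores, so they are not returned.
    return heapq.nlargest(n, scores, key=scores.get)
```

Even on a tiny corpus this shows why raw term counts are not enough: a page that merely repeats a word outranks everything else, which is exactly what SEO spam exploits.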
There's a reason Google became so popular as quickly as it did. It's even harder to compete in this space nowadays, as the volume of junk and SEO spam is many orders of magnitude worse as a percentage of the corpus than it was back then.
This is pretty cool. Don't let the naysayers stop you. Taking a stab at beating Google at their core product is bravery in my book. The best of luck to you!
You should consider filtering by input language. Showing the same Wikipedia article in different languages is not helpful when I am searching in English. You could also unify entries by URL: it currently shows the same URL several times, just with different publish dates. That is interesting and might be useful, but should maybe be behind a toggle, as it is confusing at first.
I also made something for my own search needs. It's just an SQLite table of domains and places. I have your search engine in there too ;-)
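The language filter and URL unification suggested above could look roughly like this. A minimal sketch, assuming each result carries a URL, a language code, and a publish date; the `Result` type and `filter_and_dedupe` function are hypothetical names for illustration, not anything from the site's actual code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Result:
    url: str
    lang: str        # language code such as "en" (assumed field)
    published: date  # publish date of this crawl entry (assumed field)
    title: str


def filter_and_dedupe(results, query_lang="en"):
    """Keep results in query_lang, one per URL (the most recently published)."""
    newest = {}
    for r in results:
        if r.lang != query_lang:
            continue  # drop results in other languages
        kept = newest.get(r.url)
        if kept is None or r.published > kept.published:
            newest[r.url] = r
    # dict preserves first-seen order of URLs, so the original ranking order is kept
    return list(newest.values())
```

A toggle for "show all publish dates" would then simply skip the dedupe step and return the language-filtered list as is.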
https://github.com/rumca-js/Internet-Places-Database
Demo for the most important ones: https://rumca-js.github.io/search
I tested it using a local keyword, as I normally do, and it took me to a Wikipedia page I didn’t know existed. So thanks for that.
Lol, a GooglePlus URL was mentioned on a webpage I browsed this week. #blastFromThePast
Number of docs isn’t the limiting factor.
I just searched for “stackoverflow” and the first result was this: https://www.perl.com/tags/stackoverflow/
The actual Stack Overflow site was ranked way down, below some weird Twitter accounts.