Hacker News

smallerfish 10/01/2024 (3 replies)

I wrote a prototype of a browser extension that scraped your bookmarks + 1 degree, and indexed everything into an in-memory search index (which gets persisted in localStorage). I took over the new tab page with a simple search UI, with instant type-ahead search.
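For illustration, here is a minimal sketch of what such an index could look like. It is not the extension's actual code: the `Page` shape, the `PrefixIndex` class, and the tokenizer are all hypothetical, and it only indexes titles. It shows the two pieces described above: prefix (type-ahead) lookup over an in-memory structure, and a round trip through a JSON string of the kind that could be handed to `localStorage.setItem`.

```typescript
// Hypothetical page record; the field names are illustrative only.
interface Page {
  url: string;
  title: string;
}

// Split text into lowercase alphanumeric tokens.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

class PrefixIndex {
  // token -> set of URLs whose title contains that token
  private postings = new Map<string, Set<string>>();

  add(page: Page): void {
    tokenize(page.title).forEach((token) => {
      let urls = this.postings.get(token);
      if (!urls) {
        urls = new Set<string>();
        this.postings.set(token, urls);
      }
      urls.add(page.url);
    });
  }

  // Return URLs that have at least one token starting with the typed prefix.
  // This scans every token, which is fine for a sketch but not for a large index.
  search(prefix: string): string[] {
    const needle = prefix.trim().toLowerCase();
    if (needle === "") return [];
    const hits = new Set<string>();
    this.postings.forEach((urls, token) => {
      if (token.startsWith(needle)) urls.forEach((u) => hits.add(u));
    });
    return Array.from(hits).sort();
  }

  // Flatten to a JSON string so the index can be stored as a single value.
  serialize(): string {
    const entries: [string, string[]][] = [];
    this.postings.forEach((urls, token) => entries.push([token, Array.from(urls)]));
    return JSON.stringify(entries);
  }

  // Rebuild the in-memory structure from a previously serialized string.
  static deserialize(json: string): PrefixIndex {
    const index = new PrefixIndex();
    (JSON.parse(json) as [string, string[]][]).forEach(([token, urls]) => {
      index.postings.set(token, new Set(urls));
    });
    return index;
  }
}
```

Note that `deserialize` has to parse the whole blob and rebuild every set before the first query can run, so a design like this pays its full load cost up front.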

Rough aspects:

a) It requires a _lot_ of browser permissions to install the extension, and I figured the audience who might be interested in their own search index would likely be put off by intrusive perms.

b) Loading the search index from localStorage on browser startup took 10-15s with a moderate number of sites; not great. It might be a fit for PouchDB or something else that makes IndexedDB tolerable. (Or wasm SQLite, if it's mature enough.)

c) A lot of sites didn't like being scraped (even with rate limiting and back-off), and I ended up being served an annoying number of captchas in my regular everyday browsing.

d) Some walled garden sites seem completely unscrapable (even in the browser) - e.g. LinkedIn.
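As a rough sketch of the "rate limiting and back-off" mentioned in (c), assuming a per-host minimum gap between requests and an exponential delay after each blocked response: the names `backoffDelayMs` and `HostThrottle`, and the default values, are hypothetical and not taken from the extension.

```typescript
// Exponential back-off: baseMs * 2^attempt, capped at maxMs.
// The default values are arbitrary placeholders.
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 60000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Tracks, per host, the earliest time the next request is allowed.
class HostThrottle {
  private minGapMs: number;
  private nextAllowed = new Map<string, number>();
  private failures = new Map<string, number>();

  constructor(minGapMs: number) {
    this.minGapMs = minGapMs;
  }

  canFetch(host: string, nowMs: number): boolean {
    return nowMs >= (this.nextAllowed.get(host) ?? 0);
  }

  // A normal response resets the failure count and applies the minimum gap.
  recordSuccess(host: string, nowMs: number): void {
    this.failures.delete(host);
    this.nextAllowed.set(host, nowMs + this.minGapMs);
  }

  // A blocked response (for example HTTP 429 or a captcha page) pushes the
  // next attempt further out each time it happens in a row.
  recordBlocked(host: string, nowMs: number): void {
    const previous = this.failures.get(host) ?? 0;
    this.failures.set(host, previous + 1);
    const delay = Math.max(this.minGapMs, backoffDelayMs(previous));
    this.nextAllowed.set(host, nowMs + delay);
  }
}
```

A crawler loop would call `canFetch` before each request and skip hosts that are still cooling down. As (c) notes, politeness of this kind did not stop the captchas in practice.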


Replies

changing1999 10/01/2024

In my experience building a browser-based scraper I preferred scraping pages by a direct in-browser visit rather than a fetch request. A direct visit from a real browser is basically undetectable by anti-bot software (unless you try to do something funny like automated deep crawling and scraping). So applied to your use case it would have to go through every bookmark + 1 degree to index it. Maybe even in an offscreen canvas (haven't tried that though, could be detectable).

8chanAnon 10/01/2024

>Some walled garden sites seem completely unscrapable

Any examples besides LinkedIn? Tell me what sites you're trying to target and I'll have a look to see what can be done with them. It takes some pretty evil JavaScript obfuscation to block me and only one site has been able to do that. I doubt that the sites you're hitting are anywhere near that evil, lol. I would appreciate it if you have a good example that I could use in a future article.

paulryanrogers 10/01/2024

How often did it crawl? Once per day shouldn't trigger any blockers.