Web scraping with your web browser: Why not?

148 points by 8chanAnon | 10/01/2024 | 73 comments

Includes working code. First article in a planned series.


Comments

joshdavham 10/01/2024

> can you write a web scraper in your browser? The answer is: YES, you can! So why is nobody doing it?

Completely agree with this sentiment.

I just spent the last couple of months developing a Chrome extension, but recently also did an unrelated web scraping project where I looked into all the common tools like Beautiful Soup, Selenium, Playwright, Puppeteer, etc.

All of these tools were needlessly complicated and I was having a ton of trouble with sites that required authentication. I then realized it would be way easier to write some javascript and paste it in my browser to do the scraping. Worked like a charm!
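
A minimal sketch of the paste-into-the-console approach described above. The helper name `collectLinks` and the `a[href]` selector are illustrative assumptions, not the commenter's actual script. Because the script runs inside a page you are already logged in to, it needs no separate authentication step. The document is passed in as a parameter so the function can also be exercised outside a browser.

```javascript
// Hypothetical helper: collect the text and URL of every element matching
// `selector` under `root`. In the browser console, pass `document` as `root`.
function collectLinks(root, selector = "a[href]") {
  const rows = [];
  for (const el of root.querySelectorAll(selector)) {
    rows.push({
      text: (el.textContent || "").trim(),
      href: el.getAttribute("href"),
    });
  }
  return rows;
}

// Assumed usage in the console, on the page you want to scrape:
//   console.table(collectLinks(document));
//   copy(JSON.stringify(collectLinks(document))); // copy() is a DevTools console helper
```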

smallerfish 10/01/2024

I wrote a prototype of a browser extension that scraped your bookmarks + 1 degree, and indexed everything into an in-memory search index (which gets persisted in localStorage). I took over the new tab page with a simple search UI, with instant type-ahead search.
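
A rough sketch of what an in-memory index with type-ahead search could look like. This is not the commenter's code: the class name `TinyIndex`, the tokenizer, and the storage key are all assumptions made for illustration.

```javascript
// Hypothetical in-memory inverted index: token -> Set of page URLs.
class TinyIndex {
  constructor() {
    this.postings = new Map();
  }

  // Tokenize `text` into lowercase alphanumeric words and record `url` for each.
  add(url, text) {
    for (const token of text.toLowerCase().match(/[a-z0-9]+/g) || []) {
      if (!this.postings.has(token)) this.postings.set(token, new Set());
      this.postings.get(token).add(url);
    }
  }

  // Type-ahead lookup: URLs of pages containing any token that starts with `prefix`.
  search(prefix) {
    const p = prefix.toLowerCase();
    const hits = new Set();
    for (const [token, urls] of this.postings) {
      if (token.startsWith(p)) for (const url of urls) hits.add(url);
    }
    return [...hits];
  }

  // Plain-array form so JSON.stringify can serialize the index.
  toJSON() {
    return [...this.postings].map(([token, urls]) => [token, [...urls]]);
  }

  static fromJSON(data) {
    const index = new TinyIndex();
    for (const [token, urls] of data) index.postings.set(token, new Set(urls));
    return index;
  }
}

// Assumed persistence in the browser (the key name is hypothetical):
//   localStorage.setItem("searchIndex", JSON.stringify(index));
//   const restored = TinyIndex.fromJSON(JSON.parse(localStorage.getItem("searchIndex")));
```

The linear scan in `search` is fine for a sketch; a real type-ahead over many pages would more likely use a sorted token list or a trie.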

Rough aspects:

a) It requires a _lot_ of browser permissions to install the extension, and I figured the audience who might be interested in their own search index would likely be put off by intrusive perms.

b) Loading the search index from localStorage on browser startup took 10-15s with a moderate number of sites; not great. It might be a fit for PouchDB or something else that makes IndexedDB tolerable. (Or wasm SQLite, if it's mature enough.)

c) A lot of sites didn't like being scraped (even with rate limiting and back-off), and I ended up being served an annoying number of captchas in my regular everyday browsing.

d) Some walled garden sites seem completely unscrapable (even in the browser) - e.g. LinkedIn.
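
For point (c), a minimal sketch of retrying with exponential back-off. The function names, the default delays, and the choice of status codes 429 and 503 as "slow down" signals are assumptions, not details from the prototype. As the comment notes, backing off does not guarantee a site will tolerate the crawl.

```javascript
// Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at capMs.
function backoffDelay(attempt, baseMs = 1000, capMs = 60000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Fetch `url`, retrying while the server answers 429 or 503.
// `fetchFn` and `sleep` are injectable so the logic can run without a network.
async function fetchWithBackoff(url, {
  retries = 5,
  fetchFn = fetch,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
} = {}) {
  for (let attempt = 0; ; attempt++) {
    const response = await fetchFn(url);
    const throttled = response.status === 429 || response.status === 503;
    // Give up and return the last response once the retry budget is spent.
    if (!throttled || attempt >= retries) return response;
    await sleep(backoffDelay(attempt));
  }
}
```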

gmac 10/02/2024

Yes: I find it surprising that this isn't a more widespread approach. It's how I've taught web scraping to my PhD students for some years.

https://github.com/jawj/web-scraping-for-researchers

hildenae 10/03/2024

I understand that "with/in your web browser" implies an extension or similar, but I have good experience using Selenium and Python to scrape websites. Some sites are trickier than others, and when you are instrumenting a browser it easily triggers bot prevention, but you are also able to easily scrape pages that build the DOM using JS and similar. I have considered, but not looked into, compiling my own Firefox to disable e.g. navigator.webdriver, but it feels like a bit too much work.

This is my project for extracting my (your) webshop order & item data https://gitlab.com/Kagee/webshop-order-scraper

seanwilson 10/02/2024

> So the question is: can you write a web scraper in your browser? The answer is: YES, you can! So why is nobody doing it?

> One of the issues is what is called CORS (Cross-Origin Resource Sharing) which is a set of protocols which may forbid or allow access to a web resource by Javascript. There are two possible workarounds: a browser extension or a proxy server. The first choice is fairly limited since some security restrictions still apply.

I'm doing this for a browser extension that crawls a website from page to page checking for SEO/speed/security problems (https://www.checkbot.io/). It's been flexible enough, and it's nice not to have to maintain and scale servers for the web crawling. https://browserflow.app/ is another extension I know of that, I think, does scraping and other automation within the browser.

simlan 10/01/2024

I also did something similar for my spring project. The idea was to buy a used car and I was frustrated with the BS the listing sites claimed as fair price etc.

I went the browser extension route and used Greasemonkey to inject custom JavaScript. I patched window.fetch, and because it was a React page it did most of the work for me, providing me with a slightly convoluted JSON doc every time I scrolled. Getting the data extracted was only a question of getting a Flask API with correct CORS settings running.
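
A sketch of the fetch-patching idea, under stated assumptions: `wrapFetch` and the `captured` array are hypothetical names, and this is not the commenter's script. The wrapper hands a copy of every JSON response to a callback while the page's own code still receives the original response.

```javascript
// Wrap a fetch-like function so each JSON response is also passed to `onJson`.
function wrapFetch(originalFetch, onJson) {
  return async function patchedFetch(...args) {
    const response = await originalFetch(...args);
    // Read from a clone: a response body can only be consumed once,
    // and the page still needs the original.
    response.clone().json()
      .then((data) => onJson(data, args[0]))
      .catch(() => {}); // ignore bodies that are not JSON
    return response;
  };
}

// Assumed usage in a userscript. Depending on the userscript manager, the
// page's own window may have to be reached through `unsafeWindow` instead.
//   const captured = [];
//   window.fetch = wrapFetch(window.fetch.bind(window), (data, input) => {
//     captured.push({ input, data });
//   });
```

From the callback, the captured data could be forwarded to a local server such as the Flask API the comment mentions, which is where the CORS settings come in.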

Thanks for posting. Using a local proxy for even more control could be helpful in the future.

linsomniac 10/01/2024

There is an extension called "Amazon Order History Reporter" that will scrape Amazon to download your order history. I've used it a couple times and it works brilliantly.

ggorlen 10/02/2024

I wrote a similar post on in-browser scraping: https://serpapi.com/blog/dynamic-scraping-without-libraries/

My approach is a step or two more automated (optionally using a userscript and a backend) and runs in the console on the site under automation rather than cross-origin, as shown in OP.

In addition to being simple for one-off scripts and avoiding the learning curve of Selenium, Playwright or Puppeteer, scraping in-browser avoids a good deal of potential bot detection issues, and is useful for constantly polling a site to wait for something to happen (for example, a specific message or article to appear).

You can still use a backend and write to file, trigger an email or SMS, etc. Just have your userscript make requests to a server you're running.
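
The polling pattern can be sketched roughly as follows. The helper name `pollUntil`, the default interval, the CSS selector, and the local endpoint are all assumptions for illustration, not taken from the linked post.

```javascript
// Call `check` every `intervalMs` until it returns a truthy value,
// or return null once `maxAttempts` checks have failed.
async function pollUntil(check, { intervalMs = 30000, maxAttempts = 100 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await check();
    if (result) return result;
    if (attempt < maxAttempts - 1) {
      await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
  }
  return null;
}

// Assumed usage in a userscript (hypothetical selector and endpoint; the
// local server has to accept the cross-origin request):
//   const el = await pollUntil(() => document.querySelector(".new-article"));
//   if (el) await fetch("http://localhost:3000/notify", { method: "POST", body: el.textContent });
```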

ljw1004 10/01/2024

In my web-scraping I've gravitated towards the "cheerio" library for javascript.

I kind of don't want to use DOMParser because it's browser-only... my web-scrapers have to evolve every few years as the underlying web pages change, so I really want CI tests, so it's easiest to have something that works in node.

gabrielsroka 10/01/2024

Why do you need a proxy or to worry about CORS? Why not just point your browser to rumble.com and start from there?

I've posted here about scraping for example HN with JavaScript. It's certainly not a new idea.

2020: https://news.ycombinator.com/item?id=22788236

ricardo81 10/03/2024

I've found using a local proxy helps when using Puppeteer and a proxy. The way Chrome authenticates to a proxy keeps the connection open, which can sometimes mess up rotating proxy endpoints, and having to close/re-open browsers per page is just too inefficient.

flashgordon 10/02/2024

Ah, I remember doing this almost 20 years ago and even rotating through 1500 proxies to not get tripped up by DDoS detectors :). A plugin is one of the ways to scrape as it also looks like a human (i.e. more JS run, more divs loaded and so on).

datadrivenangel 10/01/2024

I've been playing around with this idea lately as well! There are a lot of web interfaces that are hostile to scraping, and I see no reason why we shouldn't be able to use the data we have access to for our own purposes. CUSTOMIZE YOUR INTERFACES

acheong08 10/02/2024

I actually did that with a Firefox extension + containers to scrape ChatGPT a long while back (before the APIs).

https://github.com/acheong08/ChatGPT-API-agent

Worked pretty well, but browsers took up too much memory per tab, so automating thousands of accounts (what I wanted) was infeasible.

pimlottc 10/02/2024

When I have to do some really quick ad-hoc webscraping, I often just select all text on the page, copy it, and then switch to a terminal window where I build a pipeline that extracts the part I need (using pbpaste to access the clipboard). Very quick and dirty for when you just need to hit a few pages.

turingfeel 10/02/2024

If you want to get your personal IP and fingerprint blacklisted across major providers and large ranges, unfortunately this is how you do it. Just keep the rates low.

changing1999 10/01/2024

> can you write a web scraper in your browser? The answer is: YES, you can! So why is nobody doing it?

My guess would be that some companies are doing it (I worked at a major tech company that is/was), just not publicizing this fact as crawling/scraping is such a gray legal area.

chaosharmonic 10/01/2024

> You can find plenty of tutorials on the Internet about the art of web scraping... and the first things you will learn about are Python and Beautiful Soup. There is no tutorial on web scraping with Javascript in a web browser...

Um... [0]

[0] https://bhmt.dev/blog/scraping

dewey 10/01/2024

I've read through that (hard to read, because of the bad formatting) but I still don't understand why you would do that instead of Playwright, Puppeteer etc. - The only reason seems to be "This technique certainly has its limits.".

spullara 10/01/2024

I love it when something like this reminds me of a project from forever ago...

https://github.com/spullara/browsercrawler

nsonha 10/02/2024

Sorry, the format of this site is just too annoying for me to bother to read it. If this is about the shocking revelation that you can paste some code into the browser console, aka manually extracting information, then manually putting that into whatever workflow you need that information for, then I don't think that is called web scraping; it's just browsing the web with code.

micahdeath 10/02/2024

Excel/Word macro using a WebBrowser object in a Form (old IE did this nicely; I haven't done that since Edge came out).

ttshaw1 10/02/2024

How is this different from scraping in, say, Selenium in non-headless mode?

deisteve 10/01/2024

Is there anything that runs on WASM for scraping? The issue is that you need to enable flags and turn off other security features to scrape in your web browser, and this is why it's not popular, but with WASM that might change.

welder 10/01/2024

Neo already did that in the Matrix:

https://www.youtube.com/watch?v=sjoad6gcRzs

squigz 10/02/2024

This horrendous color scheme makes it impossible for me to read this.