Hacker News

radium3d · today at 12:24 AM

Instead of "should have been an email," this is "should have been a prompt," and it can be run locally. There are a number of ways to do this from a Linux terminal.

```
Write a custom crawler that will crawl every page on a site (internal links to the original domain only), scroll down to mimic a human, and save the output as a WebP screenshot, HTML, Markdown, and structured JSON. Design it to run locally in a terminal on a Linux machine using headless Google Chrome, and take advantage of multiple cores to process multiple pages simultaneously, keeping in mind that it may have to throttle if the server gets hit too fast from the same IP.
```

It might use available open source software such as Python, Playwright, BeautifulSoup4, Pillow, aiofiles, and Trafilatura.
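Two of the trickier requirements in that prompt, restricting the crawl to internal links and throttling requests per host, are small enough to sketch with the standard library alone, before Playwright or any of the libraries above enter the picture. The names `internal_links` and `Throttle` are illustrative, not from any of those packages:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def internal_links(page_url, html):
    """Resolve hrefs against page_url and keep only same-domain links."""
    parser = LinkExtractor()
    parser.feed(html)
    host = urlparse(page_url).netloc
    seen, out = set(), []
    for href in parser.links:
        absolute = urljoin(page_url, href)  # handles relative links
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc == host:
            clean = parsed._replace(fragment="").geturl()  # drop #anchors
            if clean not in seen:
                seen.add(clean)
                out.append(clean)
    return out


class Throttle:
    """Track last-request time per host; report how long to wait."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self.last = {}

    def delay(self, host, now):
        """Seconds to sleep before hitting `host` again at time `now`."""
        wait = max(0.0, self.min_interval + self.last.get(host, -1e9) - now)
        self.last[host] = now + wait
        return wait
```

A worker pool running one headless-Chrome page per core would call `Throttle.delay` before each fetch and `time.sleep` for the returned value, which keeps parallelism without hammering a single origin from one IP.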


Replies

Normal_gaussian · today at 12:55 AM

This presumably is going to be cheap and effective. It's much easier to wrap a prompt around this and know it works than to mess around with crawling it all yourself.

You'll still be hand-rolling it if you want to disrespect crawling requirements though.
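Respecting those crawling requirements is cheap in a hand-rolled crawler too: Python ships `urllib.robotparser` for exactly this. A minimal sketch, with the robots.txt rules inlined here for illustration (a real crawler would fetch `/robots.txt` from the target host):

```python
from urllib import robotparser

# Parse inlined rules; in practice you would call rp.set_url(...) and
# rp.read() against the site's real /robots.txt instead.
rp = robotparser.RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /private/
""".splitlines())


def allowed(url, agent="*"):
    """True if the parsed robots.txt permits `agent` to fetch `url`."""
    return rp.can_fetch(agent, url)
```

Gating every fetch through a check like `allowed(url)` is the difference between a polite crawler and the kind this thread is about disrespecting.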

supermdguy · today at 1:09 AM

I’ve actually written a crawler like that before, and I still ended up going with Firecrawl for a more recent project. There are just so many headaches at scale: OOMs from heavy pages, proxies for sites that block cloud IPs, handling nested iframes, etc.