Of course you wouldn't do it for every single page. If I were designing this crawler I'd have it sample a percentage of pages, starting at a 100% sample rate for a completely unknown website and decreasing the rate over time as more "good" pages are found relative to "bad" pages.
After a "good" page percentage threshold is exceeded, stop sampling entirely and just crawl, assuming that all content is good. After a "bad" page percentage threshold is exceeded, stop wasting your time and drop that domain entirely.
With modern models the sampling should be quite cheap, especially since Nepenthes serves really small pages. Now if a page were humongous, that would make it harder and more expensive to put through an LLM.
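Roughly what I have in mind, as a Python sketch. The `DomainSampler` class, the 20-page warm-up, and the 95%/50% thresholds are all made-up illustration values, and the demo swaps the actual LLM quality check for a random stand-in:

```python
import random


class DomainSampler:
    """Per-domain policy for deciding when to pay for an LLM quality check.

    Starts at a 100% sample rate for an unknown domain, backs off as "good"
    pages pile up relative to "bad" ones, stops sampling entirely once the
    domain looks trustworthy, and flags the domain for abandonment once bad
    pages dominate. All thresholds here are illustrative, not tuned.
    """

    def __init__(self, warmup=20, stop_sampling_at=0.95, abandon_at=0.50, min_rate=0.01):
        self.good = 0
        self.bad = 0
        self.warmup = warmup                      # pages to check before trusting any ratio
        self.stop_sampling_at = stop_sampling_at  # good fraction above which we just crawl
        self.abandon_at = abandon_at              # bad fraction above which we give up on the domain
        self.min_rate = min_rate                  # floor so borderline domains still get spot checks

    @property
    def total(self):
        return self.good + self.bad

    def should_abandon(self):
        # Too many tarpit/garbage pages: stop crawling this domain at all.
        return self.total >= self.warmup and self.bad / self.total > self.abandon_at

    def should_sample(self):
        if self.total < self.warmup:
            return True  # completely unknown domain: check every page
        good_fraction = self.good / self.total
        if good_fraction > self.stop_sampling_at:
            return False  # trusted domain: skip the LLM, assume content is good
        # Otherwise decay the sample rate as the good/bad ratio improves.
        rate = max(self.min_rate, 1.0 - good_fraction)
        return random.random() < rate

    def record(self, is_good):
        if is_good:
            self.good += 1
        else:
            self.bad += 1


if __name__ == "__main__":
    # Simulate crawling 1000 pages from a domain that is ~98% good content.
    sampler = DomainSampler()
    checked = 0
    for _ in range(1000):
        if sampler.should_abandon():
            print("abandoning domain")
            break
        if sampler.should_sample():
            checked += 1
            sampler.record(random.random() < 0.98)  # stand-in for the LLM verdict
    print(f"LLM checks spent: {checked} of 1000 pages")
```

In practice you'd persist the counters per domain so the crawler doesn't re-learn a site from scratch on every run.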