Hacker News

smcin 12/09/2024

That's what I just concluded. I think the OP was oversold on the idea of using AI to do scraping, NLP and summarization, all in one go.


Replies

smcin 12/11/2024

Best practice (for many reasons) is to separate scraping (and OCR) from everything downstream: store the raw text or raw HTML/JS, and also the parsed intermediate result (cleaned scraped text or HTML, with all the useless parts/tags removed). That intermediate result is then the input to the rest of the pipeline. You really want to separate those stages, both to minimize costs and to prevent breakage when the site format changes, anti-scraping heuristics change, etc. And not exposing garbage tags to the AI saves you time and money.
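A minimal sketch of that separation, using only the Python standard library (the file names, `cache` directory, and `TextExtractor` class are my own illustrative choices, not anything from the thread): the raw HTML is persisted exactly as fetched, the cleaned text is persisted as a second artifact, and only the cleaned text would ever be sent to a paid AI step.

```python
from html.parser import HTMLParser
from pathlib import Path

class TextExtractor(HTMLParser):
    """Collects visible text, skipping script/style/nav-type tags."""
    SKIP = {"script", "style", "noscript", "nav", "footer"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # >0 while inside a tag we want to drop
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def store_stages(page_id, raw_html, out_dir="cache"):
    # Stage 1: persist the raw HTML exactly as fetched, so a later
    # parser change or cleaning bug never forces a re-scrape.
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    (out / f"{page_id}.raw.html").write_text(raw_html)

    # Stage 2: persist the cleaned intermediate text; only this
    # (much smaller) artifact feeds the downstream NLP/AI steps.
    parser = TextExtractor()
    parser.feed(raw_html)
    cleaned = "\n".join(parser.chunks)
    (out / f"{page_id}.clean.txt").write_text(cleaned)
    return cleaned

raw = ("<html><head><style>p{}</style></head><body>"
       "<nav>menu</nav><p>Article text.</p>"
       "<script>track()</script></body></html>")
print(store_stages("demo", raw))  # prints "Article text."
```

If the site's markup changes, only the stage-2 cleaner needs updating, and it can be re-run against the stored raw HTML without touching the scraper at all.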