Hacker News

rybosworld 12/09/2024

> The data collection process involved a daily ritual of manually visiting the Tagesschau website to capture links to both the COVID and later Ukraine war newstickers. While this manual approach constituted the bulk of the project’s effort, it was necessitated by Tagesschau’s unstructured URL schema, which made automated link collection impractical.

> The emphasis on preserving raw HTML proved vital when Tagesschau repeatedly altered their newsticker DOM structure throughout Q2 2020.

Another big takeaway is that it's not sustainable to rely on this type of data source. Your data source should be stable. If the site offers APIs, that's almost always better than parsing HTML.

Website developers do not consider scrapers when they make changes. Why would they? So if you are ever trying to collect some unique dataset, it doesn't hurt to reach out to the web devs to see if they can provide a public API.
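
If you are stuck with scraping anyway, the raw-HTML point from the quoted post is worth copying. Here is a minimal sketch of it, assuming Python and only the standard library. The names `archive_raw`, `HeadlineParser`, and `parse_archive` are hypothetical, and the `<h2>` rule is a placeholder, not Tagesschau's actual markup. The idea is just that fetching and parsing are separate steps, so a DOM change only means editing the parser and re-running it over the files already saved.

```python
import hashlib
from datetime import datetime, timezone
from html.parser import HTMLParser
from pathlib import Path


def archive_raw(html: str, url: str, root: Path, fetched_at: datetime) -> Path:
    """Write the untouched HTML to disk before any parsing happens."""
    stamp = fetched_at.strftime("%Y%m%dT%H%M%SZ")
    # A short hash of the URL keeps file names unique without encoding the URL.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
    root.mkdir(parents=True, exist_ok=True)
    path = root / f"{stamp}_{digest}.html"
    path.write_text(html, encoding="utf-8")
    return path


class HeadlineParser(HTMLParser):
    """Collects text inside <h2> tags (placeholder rule, swap when the DOM changes)."""

    def __init__(self) -> None:
        super().__init__()
        self._in_h2 = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self.headlines.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2:
            self.headlines[-1] += data


def parse_archive(root: Path) -> dict:
    """Re-run the current parser over every archived page."""
    result = {}
    for path in sorted(root.glob("*.html")):
        parser = HeadlineParser()
        parser.feed(path.read_text(encoding="utf-8"))
        parser.close()
        result[path.name] = [h.strip() for h in parser.headlines]
    return result


if __name__ == "__main__":
    # Example with an inline page; a real run would pass in fetched HTML.
    page = "<html><body><h2>Example headline</h2></body></html>"
    saved = archive_raw(page, "https://example.org/ticker", Path("archive"),
                        datetime.now(timezone.utc))
    print(saved, parse_archive(Path("archive")))
```

When the markup changes, only `HeadlineParser` needs to change; the archive stays as it is.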


Replies

abirch 12/09/2024

If you can pay a nominal amount for an API instead of spending your time scraping, please consider it an early Christmas present to yourself, unless you enjoy the scraping.