Point number 2 is super important for non-hobby projects. Collect a bit of data, even if you have to do it manually at first, and do a "dry run" / first cut of whatever analysis you're planning, so you can confirm you're actually collecting what you need and that what you're doing is even going to work. Watching a pipeline get built, run for about two months, and then have the data scientist come along and say "this isn't what we needed" was a complete goddamn shitshow. I'm just glad I was only a spectator to it.
They touch on something relevant here, and it's a great point to emphasise:
> The emphasis on preserving raw HTML proved vital when Tagesschau repeatedly altered their newsticker DOM structure throughout Q2 2020. This experience underscored a fundamental data engineering principle: raw data is king. While parsers can be rewritten, lost data is irretrievable.
I've done this before, keeping full, timestamped, versioned raw HTML. That still risks breakage from shifts to JavaScript-based rendering, but keeping your collection and your processing as separate as you can, so you can rerun the processing later, is incredibly helpful.
Usually, processing raw data is cheap. Recovering raw data is expensive or impossible.
As a bonus, collecting raw data is usually easier than collecting and processing it, so you might as well start there. Maybe you'll find out you were missing something, but you're no worse off than if you'd tied the two together.
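For what it's worth, the split I mean looks roughly like this. A minimal Python sketch; the URL, the storage path, and the trivial parser are all made up for illustration, not taken from the article:

```python
# Sketch of "store raw first, parse later". The fetch URL, storage
# layout, and the toy parser below are hypothetical examples.
import pathlib
import time

import requests

RAW_DIR = pathlib.Path("raw_html")


def collect(url: str) -> pathlib.Path:
    """Fetch a page and store the untouched response body with a timestamp."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    ts = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    RAW_DIR.mkdir(parents=True, exist_ok=True)
    out = RAW_DIR / f"{ts}.html"
    out.write_bytes(resp.content)  # raw bytes, no parsing at collection time
    return out


def process(path: pathlib.Path) -> dict:
    """Parse a stored snapshot. This part can be rewritten and rerun over
    all the old files whenever the site's DOM changes."""
    from bs4 import BeautifulSoup  # parsing dependency lives only here

    soup = BeautifulSoup(path.read_bytes(), "html.parser")
    return {"title": soup.title.string if soup.title else None}
```

The point of the shape is that `collect` has no idea what `process` does, so a DOM change only ever costs you a parser rewrite, never the data.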
Edit:
> Huh? To find the specific dates new item corresponding to a given topic? Why not just predict the date-range e.g. "Apr-Aug 2022"
They say they had to manually find the links to the right liveblog subpage. So they had to go to the main page, find the link and then store it.
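If you wanted to automate that step, it would look something like the sketch below. Heavily hedged: the entry URL and the "liveblog" href pattern are my guesses about the site, not selectors from the post.

```python
# Rough sketch of automating "go to the main page, find the liveblog link".
# The entry URL and the href filter are assumptions, not the article's code.
import requests
from bs4 import BeautifulSoup

MAIN_PAGE = "https://www.tagesschau.de/"  # assumed entry point


def find_liveblog_links() -> list[str]:
    resp = requests.get(MAIN_PAGE, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Assume liveblog subpages are identifiable by "liveblog" in the href;
    # in practice you'd have to check what the site actually uses.
    return sorted(
        {a["href"] for a in soup.find_all("a", href=True) if "liveblog" in a["href"]}
    )
```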