
smcin · 12/09/2024 · 1 reply · view on HN

> Nothing stopping you from releasing the raw dataset and calling it a success!

Right. OP: release it as a Kaggle Dataset (https://www.kaggle.com/datasets) and invite people to collaboratively figure out how to automate the analyses. (Do you just want sentiment on a specific topic, e.g. vaccination, German energy supplies, German government approval? Or quantitative predictions?) Start with something easy.
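
To make "start with something easy" concrete, here's a rough sketch (not OP's code) of the kind of first analysis collaborators could run once the raw dataset is on Kaggle. The file name, column names, and the sentiment model are all assumptions; swap in whatever the released dataset actually uses:

    # Hypothetical first pass: monthly sentiment on one topic.
    # Assumes a CSV with columns "date" and "headline" (placeholders).
    import pandas as pd
    from transformers import pipeline

    # Any German sentiment model will do; this is just a common public one.
    sentiment = pipeline("text-classification",
                         model="oliverguhr/german-sentiment-bert")

    df = pd.read_csv("tagesschau_headlines.csv")  # placeholder filename
    energie = df[df["headline"].str.contains("Energie", case=False, na=False)]

    labels = [r["label"] for r in sentiment(energie["headline"].tolist(),
                                            truncation=True)]
    energie = energie.assign(
        sentiment=labels,
        month=pd.to_datetime(energie["date"]).dt.to_period("M"),
    )
    print(energie.groupby(["month", "sentiment"]).size())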

> for example, I would try to find a foundation model to do the job of for example finding the right link on the Tagesschau website, which was by far the most draining part of the whole project.

Huh? To find the news item on a specific date corresponding to a given topic? Why not just predict the date range, e.g. "Apr-Aug 2022"?

> and yeah, the web scraping part is still the worst.

Sounds wrong. OP, fix your scraping (unless it was anti-AI heuristics that kept breaking it, which I doubt, since it's Tagesschau). But Tagesschau has RSS feeds, so why are you blocked on scraping? https://www.tagesschau.de/infoservices/rssfeeds
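
For what it's worth, pulling headlines from a feed is only a few lines. A minimal sketch with feedparser; the feed URL below is illustrative, pick the real one from the infoservices page above:

    # Grab recent items from a Tagesschau RSS feed instead of scraping article pages.
    import feedparser

    # Illustrative feed URL; use one listed on the infoservices/rssfeeds page.
    feed = feedparser.parse("https://www.tagesschau.de/index~rss2.xml")
    for entry in feed.entries[:5]:
        print(entry.get("published", ""), entry.title, entry.link)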

Compare to the Kaggle Dataset "10k German News Articles for topic classification" (Schabus, Skowron, Trapp, SIGIR 2017) [https://www.kaggle.com/datasets/abhishek/10k-german-news-art...]


Replies

IanCal · 12/09/2024

I'll put in a shoutout for https://zenodo.org/ and https://figshare.com/ as places to put your data: you'll get a DOI and can let someone who isn't a company look after hosting and backing it up. Zenodo is hosted for as long as CERN is around (that's the promise), and figshare is backed by the CLOCKSS archive (multiple geographically distributed universities).
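
If Zenodo is the destination, the upload can also be scripted through its REST API (documented at https://developers.zenodo.org/). A hedged sketch; the token, filename, and metadata are placeholders, so check the docs for current field names before relying on it:

    import os
    import requests

    token = {"access_token": os.environ["ZENODO_TOKEN"]}  # personal access token

    # 1. Create an empty deposition.
    dep = requests.post("https://zenodo.org/api/deposit/depositions",
                        params=token, json={}).json()

    # 2. Upload the data file into the deposition's bucket.
    with open("tagesschau_dataset.csv", "rb") as fh:  # placeholder filename
        requests.put(f"{dep['links']['bucket']}/tagesschau_dataset.csv",
                     params=token, data=fh)

    # 3. Attach minimal metadata, then publish to mint the DOI.
    meta = {"metadata": {"title": "Tagesschau news dataset",
                         "upload_type": "dataset",
                         "description": "Raw scraped dataset (placeholder).",
                         "creators": [{"name": "Lastname, Firstname"}]}}
    requests.put(dep["links"]["self"], params=token, json=meta)
    requests.post(dep["links"]["publish"], params=token)

Figshare has its own API too, but for a one-off dataset a manual upload through either site's web UI is just as fine.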