I’m not sure I would call this a failure; it’s more something you tried out of curiosity and abandoned. Happens to literally everyone. “Failed” to me would imply there was something fundamentally broken about the approach or the dataset, or that the unrealized result had some actual negative impact. It’s very hard to finish long-running side projects that aren’t generating income or attention, or that aren’t driven by some quasi-pathological obsession. The fact you even blogged about it and made the HN front page qualifies as a success in my book.
> If I would have finished the project, this dataset would then have been released and used for a number of analyses using Python.
Nothing stopping you from releasing the raw dataset and calling it a success!
> Back then, I would have trained a specialised model (or used a pretrained specialised model) but since LLMs made so much progress during the runtime of this project from 2020-Q1 to 2024-Q4, I would now rather consider a foundational model wrapped as an AI agent instead; for example, I would try to find a foundation model to do the job of for example finding the right link on the Tagesschau website, which was by far the most draining part of the whole project.
I actually just started (and subsequently --abandoned-- paused) my own news analysis side project leveraging LLMs for consolidation/aggregation, and yeah, the web scraping part is still the worst. I’ve had the same thought that feeding raw HTML to the LLM might be an easier way of parsing web pages now. The problem is that most sites are wise to scraping efforts, so it’s not so much a matter of finding the right element as it is bypassing the weird click-thru screens, tricking the site into thinking you’re a real browser, etc.
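For what it’s worth, the “let the LLM pick the link” idea can be pretty small in code. A minimal sketch, assuming the OpenAI Python SDK and BeautifulSoup; the model name, prompt, and example URL are illustrative, not anything from the article:

    # Sketch only, not the author's setup: let an LLM pick the right link
    # from a scraped page instead of hand-writing selectors.
    import requests
    from bs4 import BeautifulSoup
    from openai import OpenAI

    def candidate_links(url: str) -> str:
        # Feed the LLM only the anchors, not the full raw HTML, to keep the prompt small.
        html = requests.get(url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        return "\n".join(f"{a.get_text(strip=True)} -> {a['href']}"
                         for a in soup.find_all("a", href=True))

    def pick_link(url: str, description: str) -> str:
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model choice
            messages=[{
                "role": "user",
                "content": f"From these links:\n{candidate_links(url)}\n\n"
                           f"Return only the href that best matches: {description}",
            }],
        )
        return resp.choices[0].message.content.strip()

    # pick_link("https://www.tagesschau.de/archiv",
    #           "the 20:00 Tagesschau broadcast page for 2022-04-15")

It doesn’t solve the anti-bot problem, but it does get rid of the brittle selector-maintenance part.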
> Nothing stopping you from releasing the raw dataset and calling it a success!
Right. OP: release it as a Kaggle Dataset (https://www.kaggle.com/datasets) and invite people to collaboratively figure out how to automate the analyses. (Do you just want to get sentiment on a specific topic (e.g. vaccination, German energy supplies, German govt approval)? Or quantitative predictions?) Start with something easy.
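As a concrete “easy” starting point: per-headline sentiment with an off-the-shelf German model is only a few lines. The model below is just a suggestion, not something from the original project, and the headlines are made-up examples:

    # Sketch: headline sentiment with a pretrained German model from the
    # Hugging Face hub (model choice is a suggestion, not from the article).
    from transformers import pipeline

    classifier = pipeline("sentiment-analysis",
                          model="oliverguhr/german-sentiment-bert")

    headlines = [
        "Bundesregierung beschließt neues Entlastungspaket",
        "Energiepreise steigen weiter stark an",
    ]
    for headline, result in zip(headlines, classifier(headlines)):
        print(result["label"], round(result["score"], 3), headline)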
> for example, I would try to find a foundation model to do the job of for example finding the right link on the Tagesschau website, which was by far the most draining part of the whole project.
Huh? To find the specific date’s news item corresponding to a given topic? Why not just predict the date range instead, e.g. "Apr-Aug 2022"?
> and yeah, the web scraping part is still the worst.
Sounds wrong. OP, fix your scraping (unless it was anti-AI heuristics that kept breaking it, which I doubt since it's Tagesschau). But Tagesschau has RSS feeds, so why are you blocked on scraping? https://www.tagesschau.de/infoservices/rssfeeds
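A sketch of the RSS route with feedparser; the feed URL below is my assumption about their main news feed, so check the overview page above for the current list:

    # Sketch: pull structured headlines from a Tagesschau RSS feed instead of
    # scraping HTML. FEED_URL is assumed; verify against the RSS overview page.
    import feedparser

    FEED_URL = "https://www.tagesschau.de/xml/rss2"  # assumed main feed

    feed = feedparser.parse(FEED_URL)
    for entry in feed.entries:
        # Title, link and publication date come pre-structured -- no HTML parsing needed.
        print(entry.get("published", ""), entry.title, entry.link)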
Compare to: Kaggle Datasets “10k German News Articles for topic classification”, Schabus, Skowron & Trapp, SIGIR 2017 [https://www.kaggle.com/datasets/abhishek/10k-german-news-art...]
Personally, I think it's helpful to feel disappointment and insufficiency when those emotions pop up. They are the voices of certain preferences, needs, and/or desires that work to enrich our lives. Recontextualizing the world into some kind of positive success story can often gaslight those emotions out of existence, which can, paradoxically, be self-sabotaging.
The piece reads to me like a direct and honest confrontation with failure. It means the author thinks they can do better and is working to identify unhelpful subconscious patterns and overcome them.
Personally, I found the author's laser focus on "data science projects" intriguing. I have a tendency to immediately go meta which biases towards eliding detail; however, even if overly narrow, the author's focus does end up precipitating out concrete, actionable hypotheses for improvement.
Bravo, IMHO.