Hacker News

Show HN: Real-time system that tracks how news spreads across 200k websites

37 points by antiochIst last Wednesday at 1:27 AM | 8 comments

I built a system that monitors ~200,000 news RSS feeds in near real-time and clusters related articles to show how stories spread across the web.

It uses Snowflake’s Arctic model for embeddings and HNSW for fast similarity search. Each “story cluster” shows who published first, how fast it propagated, and how the narrative evolved as more outlets picked it up.
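As an illustration only, and not the actual implementation, the per-cluster facts mentioned above (who published first, how fast the story spread) could be derived roughly like this once a cluster has been formed. The `Article` fields and the `summarize_cluster` name are hypothetical; the sketch assumes each article carries an outlet name and a publish timestamp.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Article:
    outlet: str          # hypothetical field: publisher or feed name
    published: datetime  # hypothetical field: publish time from the feed


def summarize_cluster(articles):
    """Summarize one story cluster: first publisher and pickup delays."""
    if not articles:
        raise ValueError("cluster is empty")

    # Record the earliest article per outlet, in chronological order.
    first_seen = {}
    for art in sorted(articles, key=lambda a: a.published):
        first_seen.setdefault(art.outlet, art.published)

    # Dicts keep insertion order, so the first entry is the origin.
    origin_outlet, origin_time = next(iter(first_seen.items()))

    # Minutes between the origin and each outlet's first article.
    delays = {
        outlet: (seen - origin_time).total_seconds() / 60
        for outlet, seen in first_seen.items()
    }
    return {
        "first": origin_outlet,
        "outlets": len(first_seen),
        "delays_min": delays,
        "spread_min": max(delays.values()),
    }
```

Sorting once and keeping only each outlet's earliest article means repeat posts by the same outlet do not inflate the propagation numbers.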

Would love feedback on the architecture, scaling approach, and any ways to make the clusters more accurate or useful.

Live demo: https://yandori.io/news-flow/


Comments

Havoc today at 12:34 PM

That's really cool!

Curious how you sourced the feeds. It seems to lean towards Indian, Sri Lankan, Iranian, Indonesian, Turkish, etc. outlets, i.e. not the traditional Western-centric reporting. I'm always interested in getting a more balanced news diet, so anything you could share about that would be interesting. Most out-of-the-box news tools seem to lean Western automatically.

FYI layout sometimes breaks like so:

https://i.imgur.com/FXeqB9R.png

juujian today at 12:28 PM

Very cool. Our lab will want to do something like this eventually. Do you have a repo?

hmokiguess today at 11:43 AM

Cool idea! What I liked the most was the breakdown into categories like “breaking” and “trending” plus the number of sources.

The view showing the flow with a play animation was a nice concept, but I couldn’t see much value in it. I wonder if you could show more aggregate stats that connect these different flows. Maybe they follow a pattern, like ad-based campaigns or publishers who own several of these domains, which would explain things. Expanding on this idea, you could even set up different scores and metrics based on major groups and on sponsored content versus organic spread.

KomoD today at 11:29 AM

I think the idea is interesting, but it includes a lot of spam and non-news sources (e.g. archive.fo, .vn, .today).

masterphai last Wednesday at 1:34 AM

Interesting project - it’s rare to see news-flow tracking done in real time at this scale. One thing you may want to stress-test is how stable the clustering remains when stories evolve semantically over a few hours. Embeddings tend to drift as outlets rewrite or localize a piece, and HNSW can sometimes over-merge when the centroid shifts.

A trick that helped in a similar system I built was doing a second-pass “temporal coherence” check: if two articles are close in embedding space but far apart in publish time or share no common entities, keep them in adjacent clusters rather than forcing a merge. It reduced false positives significantly.
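A minimal sketch of that kind of second-pass check, under assumptions: the `Article` fields, the `should_merge` name, and both thresholds (`min_sim`, `max_gap`) are made up for illustration and would need tuning on real data. It uses a plain cosine similarity instead of an HNSW lookup to stay self-contained.

```python
import math
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Article:
    embedding: list[float]   # hypothetical: vector from the embedding model
    published: datetime      # hypothetical: publish time
    entities: set[str] = field(default_factory=set)  # hypothetical: named entities


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def should_merge(a, b, min_sim=0.85, max_gap=timedelta(hours=12)):
    """Merge only if close in embedding space AND temporally coherent."""
    if cosine(a.embedding, b.embedding) < min_sim:
        return False  # not similar enough to consider at all
    too_far_apart = abs(a.published - b.published) > max_gap
    no_shared_entities = not (a.entities & b.entities)
    # Similar but failing either check: keep in adjacent clusters.
    return not (too_far_apart or no_shared_entities)
```

The similarity test runs first because it is the cheap filter; the time and entity checks only veto merges that the embedding distance would otherwise allow.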

Also curious how you handle deduping syndicated content - AP/Reuters can dominate the embedding space unless you weight publisher identity or canonical URLs.
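One possible way to do this, sketched under assumptions: collapse articles that share a canonical URL or identical normalized text into a single record before clustering, and keep a copy count. The dict keys (`url`, `canonical_url`, `text`), the helper names, and the exact-match fingerprint are all hypothetical; real wire copy is often lightly edited, so a fuzzy fingerprint would likely be needed in practice.

```python
import hashlib
import re
from urllib.parse import urlsplit


def canonical_key(url):
    """Normalize a URL to host + path, ignoring scheme, 'www.', and query."""
    parts = urlsplit(url)
    host = parts.netloc.lower().removeprefix("www.")
    return host + parts.path.rstrip("/")


def text_fingerprint(text):
    """Hash of the text with case, punctuation, and spacing normalized."""
    normalized = re.sub(r"\W+", " ", text.lower()).strip()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def dedupe(articles):
    """Keep the first copy of each story; count later syndicated copies."""
    seen_urls, seen_texts, kept = {}, {}, []
    for art in articles:
        url_key = canonical_key(art.get("canonical_url") or art["url"])
        text_key = text_fingerprint(art["text"])
        original = seen_urls.get(url_key) or seen_texts.get(text_key)
        if original is not None:
            original["copies"] += 1
            # Remember the new keys so later copies match too.
            seen_urls.setdefault(url_key, original)
            seen_texts.setdefault(text_key, original)
            continue
        record = {**art, "copies": 1}
        seen_urls[url_key] = record
        seen_texts[text_key] = record
        kept.append(record)
    return kept
```

The `copies` count keeps the syndication signal (how widely a wire story was picked up) without letting hundreds of identical texts dominate the embedding space.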

Overall, really nice work. The propagation timeline is especially useful.

jMyles today at 12:24 PM

Just tried it, and clicking on the stories doesn't seem to do anything. Console shows "TypeError: can't access property "time", flowData[Math.min(...)] is undefined"

Ubuntu 24.04, Firefox 145.0.1 (64-bit)

psychoslave today at 11:35 AM

Can it be tuned to get a sense of how stories reach Wikimedia projects?

Oras last Wednesday at 10:09 AM

I really like the idea. I would love a feature to add keywords and see related news.