I would like to share an open database focused on link-level metadata extraction and aggregation, which may be of interest to researchers.
The project maintains a structured dataset of links enriched with metadata such as:
- page title
- description / summary
- publication date (when available)
- thumbnail / preview image
- other fields, where available
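For a concrete picture, a single record could look roughly like this minimal Python sketch (the field names here are mine for illustration, not the repository's actual schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LinkRecord:
    """One enriched link entry. Field names are illustrative,
    not the repository's actual schema."""
    url: str
    title: str
    description: Optional[str] = None     # summary, when the source provides one
    date_published: Optional[str] = None  # ISO 8601, when available
    thumbnail: Optional[str] = None       # preview image URL

record = LinkRecord(
    url="https://example.com/article",
    title="Example article",
    description="A short summary of the page.",
    date_published="2025-01-15T09:00:00Z",
)
```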
The goal is to provide a reusable, inspectable set of link metadata for experiments in areas such as:
- RSS and feed analysis
- news analysis
- link rot analysis
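As a taste of that last one, a first link rot pass could be as simple as the sketch below. This is a hypothetical helper: I'm not assuming anything about the dataset's file layout, only that a list of URLs can be extracted from it.

```python
import requests

def check_link_rot(urls):
    """Flag URLs that no longer resolve. Hypothetical helper: 'urls'
    would come from the dataset, whose file layout is not assumed here."""
    dead = []
    for url in urls:
        try:
            # HEAD keeps the pass cheap; some servers reject HEAD, so a
            # GET fallback would make a real crawl more robust.
            resp = requests.head(url, timeout=10, allow_redirects=True)
            if resp.status_code >= 400:
                dead.append((url, resp.status_code))
        except requests.RequestException as exc:
            dead.append((url, type(exc).__name__))
    return dead
```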
The database is publicly available here:
https://github.com/rumca-js/RSS-Link-Database-2025
There are also databases for previous years.
Curious how you handle feed evolution over time. When an RSS source changes structure (fields added/removed, summaries truncated, etc.), do you normalize to a fixed schema or store the raw payload alongside a best-effort normalized version? Longitudinal datasets tend to get tricky there.
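To make the question concrete, this is the raw-plus-normalized pattern I have in mind, as a minimal Python sketch (the field list and function are hypothetical, not your project's code):

```python
import hashlib
import json
from datetime import datetime, timezone

# Fixed schema the consumer queries against; raw stays the source of truth.
NORMALIZED_FIELDS = ("link", "title", "summary", "published")

def store_entry(raw_entry: dict) -> dict:
    """Keep the untouched payload for later reprocessing, plus a
    best-effort projection onto a fixed schema."""
    raw_json = json.dumps(raw_entry, sort_keys=True)
    return {
        "raw": raw_json,
        "raw_sha256": hashlib.sha256(raw_json.encode()).hexdigest(),
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        # Missing upstream fields become None, so schema drift shows up
        # as NULLs in queries instead of ingestion failures.
        "normalized": {f: raw_entry.get(f) for f in NORMALIZED_FIELDS},
    }
```

The upside of keeping the raw payload is that you can re-run normalization retroactively whenever the fixed schema turns out to be too narrow.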