
aminerj today at 12:24 AM

The SEC/public records context is where this gets genuinely hard — you can't vet the source the way you can with internal Confluence. The vocabulary engineering approach I tested would be trivially deployable against any automated public records ingestion pipeline, and the attacker doesn't need internal access at all.

Per-source scoring is the right direction. The way I'd frame it: assign a trust tier at ingestion time, not just at retrieval time. For example, official regulatory filings would get a different embedding treatment and prompt context tag than user-generated content from a public portal.
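
To make the ingestion-time tiering concrete, here's a rough sketch of what I mean. Everything in it is hypothetical: the tier names, the source-type labels, the index names, and the tag format are placeholders I made up for illustration, not part of any real pipeline or library. The only point is that the tier is fixed when the document enters the system and then travels with the chunk.

```python
from dataclasses import dataclass
from enum import IntEnum


class TrustTier(IntEnum):
    """Hypothetical trust tiers; a real system would likely have more levels."""
    UNTRUSTED = 0  # e.g. user-generated content from a public portal
    OFFICIAL = 1   # e.g. official regulatory filings


# Assumed mapping from source type to tier, decided once at ingestion.
SOURCE_TIERS = {
    "regulatory_filing": TrustTier.OFFICIAL,
    "public_portal": TrustTier.UNTRUSTED,
}


@dataclass(frozen=True)
class Chunk:
    text: str
    source_type: str
    tier: TrustTier


def ingest(text: str, source_type: str) -> Chunk:
    """Attach a trust tier when the document is ingested.

    Unknown source types fall back to the lowest tier.
    """
    tier = SOURCE_TIERS.get(source_type, TrustTier.UNTRUSTED)
    return Chunk(text=text, source_type=source_type, tier=tier)


def index_for(chunk: Chunk) -> str:
    """Pick a separate (hypothetical) vector index per tier, standing in
    for 'different embedding treatment'."""
    return "idx_official" if chunk.tier is TrustTier.OFFICIAL else "idx_untrusted"


def render_for_prompt(chunk: Chunk) -> str:
    """Prefix the chunk with a context tag so the prompt carries the tier."""
    return f"[source={chunk.source_type} trust={chunk.tier.name}]\n{chunk.text}"
```

Usage would be something like `render_for_prompt(ingest(filing_text, "regulatory_filing"))`. Defaulting unknown sources to the lowest tier is a deliberate choice here: a new public feed gets no elevated trust until someone classifies it.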