logoalt Hacker News

delichonyesterday at 7:57 PM1 replyview on HN

How do you handle entity clustering/deduplication?


Replies

segmentayesterday at 8:10 PM

We use a two-layer approach.

The raw sync layer (Gmail, calendar, transcripts, etc.) is idempotent and file-based. Each thread, event, or transcript is stored as its own Markdown file keyed by the source ID, and we track sync state to avoid re-ingesting the same item. That layer is append-only and not deduplicated.

Entity consolidation happens in a separate graph-building step. An LLM processes batches of those raw files along with an index of existing entities (people, orgs, projects and their aliases). Instead of relying on string matching, the model decides whether a mention like “Sarah” maps to an existing “Sarah Chen” node or represents a new entity, and then either updates the existing note or creates a new one.

show 1 reply