How do you handle entity clustering/deduplication?

delichon • yesterday at 7:57 PM

Replies

We use a two-layer approach.

The raw sync layer (Gmail, calendar, transcripts, etc.) is idempotent and file-based. Each thread, event, or transcript is stored as its own Markdown file keyed by the source ID, and we track sync state to avoid re-ingesting the same item. That layer is append-only and not deduplicated.
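To make the idempotent part concrete, here is a minimal sketch of what such a sync layer could look like. It is not our actual code: the function name `sync_items`, the `sync_state.json` file, the item fields (`source`, `id`, `title`, `body`), and the directory layout are all assumptions made for illustration.

```python
import json
from pathlib import Path


def sync_items(items, root: Path) -> list[str]:
    """Write each unseen item to <root>/<source>/<id>.md and return the new keys.

    Hypothetical layout: one Markdown file per thread/event/transcript,
    plus a JSON file recording which source IDs were already ingested.
    Assumes source IDs are safe to use as file names.
    """
    root.mkdir(parents=True, exist_ok=True)
    state_path = root / "sync_state.json"
    seen = set(json.loads(state_path.read_text())) if state_path.exists() else set()

    written = []
    for item in items:
        key = f"{item['source']}/{item['id']}"
        if key in seen:
            continue  # already ingested, so re-running the sync is a no-op
        path = root / item["source"] / f"{item['id']}.md"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(f"# {item['title']}\n\n{item['body']}\n")
        seen.add(key)
        written.append(key)

    # Persist sync state so the next run skips everything written so far.
    state_path.write_text(json.dumps(sorted(seen)))
    return written
```

Running it twice over the same batch writes files only the first time. Nothing here tries to merge or deduplicate entities; the only check is on the source ID.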

Entity consolidation happens in a separate graph-building step. An LLM processes batches of those raw files along with an index of existing entities (people, orgs, projects and their aliases). Instead of relying on string matching, the model decides whether a mention like “Sarah” maps to an existing “Sarah Chen” node or represents a new entity, and then either updates the existing note or creates a new one.
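Roughly, the consolidation step has the shape sketched below. Every name here is hypothetical (`Entity`, `resolve_mention`, the `decide` callback); `decide` stands in for the LLM call, which sees the mention, its surrounding context, and the existing index, and returns either an existing entity ID or `None` for "new entity". The prompt and the real model call are not shown.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Entity:
    id: str
    name: str
    kind: str  # e.g. "person", "org", "project"
    aliases: set[str] = field(default_factory=set)
    notes: list[str] = field(default_factory=list)


# Stand-in for the model: returns a matching entity id, or None if new.
Decider = Callable[[str, str, dict[str, Entity]], Optional[str]]


def resolve_mention(mention: str, kind: str, context: str,
                    index: dict[str, Entity], decide: Decider) -> Entity:
    """Attach a mention to an existing entity or create a new one."""
    match_id = decide(mention, context, index)
    if match_id is not None and match_id in index:
        entity = index[match_id]
        if mention != entity.name:
            entity.aliases.add(mention)  # remember "Sarah" as an alias
    else:
        # New entity: derive an id that does not collide with existing ones.
        base = mention.lower().replace(" ", "-")
        new_id, n = base, 2
        while new_id in index:
            new_id, n = f"{base}-{n}", n + 1
        entity = Entity(new_id, mention, kind)
        index[new_id] = entity
    entity.notes.append(context)  # update the existing note or start a new one
    return entity
```

The point of routing the decision through `decide` is that the code never compares strings itself; whether "Sarah" in a given thread is "Sarah Chen" is left to the model, which can use the context and the alias list.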
