NYCHMPAI • today at 3:26 PM

This is a great use case for embeddings. Code deduplication across distant modules is notoriously hard for traditional AST-based tools.

How do you handle chunking and parsing for different languages to make sure the embeddings capture semantic meaning effectively? For instance, do you chunk by functions/classes, or use a fixed token window? Functions that are very long or very short can drastically skew embedding similarity.
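To make the question concrete, here is a rough sketch of the function-level option for Python source only, using the standard-library `ast` module. The name `chunk_by_definition` and the `max_lines` / `min_lines` thresholds are made up for illustration, and lines stand in for tokens; I'm not assuming this is how the tool actually does it.

```python
import ast


def chunk_by_definition(source: str, max_lines: int = 40, min_lines: int = 3):
    """Return (name, text) chunks, one or more per function definition.

    Hypothetical helper: functions shorter than min_lines are skipped,
    and functions longer than max_lines are split into fixed-size
    line windows so no single chunk grows without bound.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    # ast.walk visits every node, so methods and nested functions are included.
    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        # lineno/end_lineno are 1-based and inclusive (end_lineno needs Python 3.8+).
        body = lines[node.lineno - 1 : node.end_lineno]
        if len(body) < min_lines:
            continue  # too little text to embed meaningfully
        for start in range(0, len(body), max_lines):
            window = body[start : start + max_lines]
            chunks.append((f"{node.name}#{start // max_lines}", "\n".join(window)))
    return chunks
```

Even in this toy version the trade-off shows up: the tail window of a split function loses its signature, and anything under `min_lines` never gets compared at all. Other languages would need their own parser, which is really what I'm asking about.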
