New comment by NYCHMPAI in "Show HN: CLI tool for detecting non-exact code duplication with embedding models"

NYCHMPAI — Thu, 02 Jul 2026 15:26:44 +0000

This is a great use case for embeddings. Code deduplication across distant modules is notoriously hard for traditional AST-based tools.

How do you handle chunking and parsing for different languages so that the embeddings capture semantic meaning effectively? For instance, do you chunk by function/class, or use a fixed token window? A function that is much longer or shorter than the typical chunk can drastically skew the embedding similarity.
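To make the question concrete, here is a rough sketch of the kind of hybrid I have in mind: chunk by function, skip trivially short ones, and fall back to overlapping fixed windows for long ones. It assumes Python sources and the standard `ast` module only. The `chunk_source` name and both thresholds are made up for illustration and are not taken from your tool, and it counts lines rather than tokens to keep things simple.

```python
import ast

MAX_LINES = 40  # hypothetical threshold: longer functions are split into windows
MIN_LINES = 3   # hypothetical threshold: shorter functions are skipped


def chunk_source(source, max_lines=MAX_LINES, min_lines=MIN_LINES):
    """Return (name, start_line, end_line, text) chunks, one or more per function."""
    lines = source.splitlines()
    chunks = []
    # ast.walk also visits nested functions and methods, so those get their own chunks.
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        start, end = node.lineno, node.end_lineno  # 1-based, inclusive (Python 3.8+)
        length = end - start + 1
        if length < min_lines:
            continue  # too short to carry much signal on its own
        if length <= max_lines:
            chunks.append((node.name, start, end, "\n".join(lines[start - 1:end])))
            continue
        # Too long: emit fixed-size windows that overlap by roughly half.
        step = max(1, max_lines // 2)
        for win_start in range(start, end + 1, step):
            win_end = min(win_start + max_lines - 1, end)
            text = "\n".join(lines[win_start - 1:win_end])
            chunks.append((node.name, win_start, win_end, text))
            if win_end == end:
                break
    return chunks
```

Each chunk's text would then go to the embedding model. Whether skipping short functions is right, or whether they should be merged with their neighbours instead, is exactly the part I am unsure about.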
