New comment by nttylock in "Show HN: CLI tool for detecting non-exact code duplication with embedding models"

nttylock — Thu, 02 Jul 2026 19:36:53 +0000

The false positive rate you're describing matches what we see running similarity detection on generated text instead of code: cosine similarity alone flags a lot of same-topic pairs that aren't actually duplicates. What helped was combining the embedding score with a structural signal (AST edit distance for code, overlapping headings and citations for text) so no single metric makes the call. Also worth surfacing the raw similarity score in the CLI output instead of just a binary duplicate flag, since people will want to tune the threshold per codebase.
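
A rough sketch of what combining the two signals could look like, in Python. Everything here is an assumption for illustration, not the tool's actual code: the names (`compare`, `Verdict`, `structural_similarity`) and both thresholds are made up, the embeddings are taken as plain lists of floats from whatever model you use, and the structural score is a `difflib` ratio over AST node types, which is a cheap stand-in for a real AST edit distance.

```python
import ast
import difflib
import math
from dataclasses import dataclass


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (plain lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def node_types(source):
    """Flatten a Python snippet into the sequence of its AST node type names."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


def structural_similarity(src_a, src_b):
    """Similarity ratio in [0, 1] over AST node types.

    Assumption: this approximates AST edit distance; it is not a tree edit distance.
    """
    matcher = difflib.SequenceMatcher(
        None, node_types(src_a), node_types(src_b), autojunk=False
    )
    return matcher.ratio()


@dataclass
class Verdict:
    embedding_score: float   # raw cosine similarity, kept for the CLI output
    structural_score: float  # raw structural ratio, kept for the CLI output
    is_duplicate: bool


def compare(emb_a, emb_b, src_a, src_b, emb_threshold=0.85, struct_threshold=0.7):
    """Flag a pair only when both signals agree; thresholds are placeholder values."""
    e = cosine_similarity(emb_a, emb_b)
    s = structural_similarity(src_a, src_b)
    return Verdict(e, s, e >= emb_threshold and s >= struct_threshold)


if __name__ == "__main__":
    a = "def f(x):\n    return x + 1\n"
    b = "import os\nprint(os.getcwd())\n"
    v = compare([1.0, 0.0], [1.0, 0.0], a, b)
    # Print the raw scores next to the flag so the thresholds can be tuned per codebase.
    print(f"embedding={v.embedding_score:.3f} structural={v.structural_score:.3f} "
          f"duplicate={v.is_duplicate}")
```

The point of returning both raw scores alongside the flag is that a same-topic pair with a high embedding score but a low structural score is visibly rejected by the second signal, and users can see how close each pair was to either threshold.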

Hacker News: nttylock
