Detect duplicate and unused Python code via AST hashing, Jaccard similarity, and semantic embeddings (ModernBERT, C2LLM, EmbeddingGemma). CLI and Python API with hybrid synthesis.

pszemraj/codedupes

codedupes

codedupes detects duplicate and potentially unused Python code with:

  • Traditional AST/token matching (exact + Jaccard near-duplicate)
  • Semantic matching with model-profile embeddings (default gte-modernbert-base)
  • Heuristic unused-code detection
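The traditional matching named in the list above can be illustrated with a small standard-library sketch. This is not codedupes' own implementation: the function names `ast_fingerprint`, `token_set`, and `jaccard` are hypothetical, and the choice of which token types to compare is an assumption made for the example.

```python
import ast
import hashlib
import io
import tokenize


def ast_fingerprint(source: str) -> str:
    """Hash the AST dump so comments and layout do not affect the result."""
    tree = ast.parse(source)
    # ast.dump omits line/column attributes by default, so formatting is ignored.
    return hashlib.sha256(ast.dump(tree).encode("utf-8")).hexdigest()


def token_set(source: str) -> set[str]:
    """Collect the distinct name, operator, number, and string tokens."""
    keep = {tokenize.NAME, tokenize.OP, tokenize.NUMBER, tokenize.STRING}
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return {tok.string for tok in tokens if tok.type in keep}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Illustrative inputs (hypothetical, not taken from the project).
original = "def add(x, y):\n    return x + y\n"
reformatted = "def add(x, y):  # sum two values\n\n    return x + y\n"
near_copy = "def add(x, y):\n    return x + y + 1\n"

exact_match = ast_fingerprint(original) == ast_fingerprint(reformatted)
similarity = jaccard(token_set(original), token_set(near_copy))
```

Here `exact_match` is true because the two snippets differ only in a comment and a blank line, while `similarity` is high but below 1.0 because `near_copy` introduces one extra token. A real tool would pick a threshold above which such pairs are reported as near-duplicates.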

Install

pip install "codedupes @ git+https://github.com/pszemraj/codedupes.git"

Optional GPU extras:

pip install "codedupes[gpu] @ git+https://github.com/pszemraj/codedupes.git"

Requires Python 3.11+. Details are in docs/install.md.

Quick Start

codedupes check ./src
codedupes search ./src "normalize request payload"
codedupes info

codedupes check defaults to a hybrid-first report:

  • one combined duplicate list (Hybrid Duplicates)
  • likely dead code (potentially_unused)

Use --show-all to include raw traditional + raw semantic duplicate lists.
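To give a feel for what a heuristic behind a potentially_unused list might look like, here is a minimal sketch, assuming a single-file, name-based check. It is not the project's algorithm (which works from a call graph); the function name `potentially_unused` below is hypothetical and merely echoes the report label.

```python
import ast


def potentially_unused(source: str) -> list[str]:
    """Return names of functions/classes that are defined but never referenced.

    Assumption: any occurrence of the name as a variable or attribute counts
    as a use, which errs on the side of reporting fewer items.
    """
    tree = ast.parse(source)
    defined: set[str] = set()
    referenced: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.Name):
            referenced.add(node.id)
        elif isinstance(node, ast.Attribute):
            referenced.add(node.attr)
    # Dunder methods are invoked implicitly, so never flag them.
    return sorted(
        name
        for name in defined
        if name not in referenced
        and not (name.startswith("__") and name.endswith("__"))
    )


# Illustrative input (hypothetical).
sample = (
    "def used():\n"
    "    return 1\n"
    "\n"
    "def unused():\n"
    "    return 2\n"
    "\n"
    "print(used())\n"
)
report = potentially_unused(sample)
```

A name-based check like this misses dynamic access (`getattr`, string dispatch) and cross-module imports, which is one reason results of this kind are best treated as candidates for review rather than as proof of dead code.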

Documentation

Primary docs live under docs/.

Notes and limits

  • Call graph and unused detection are heuristic and conservative by default.
  • Semantic model-profile defaults and task behavior are defined in docs/model-profiles.md.
  • Analysis defaults (semantic candidate scope, tiny-traditional filtering, hybrid gates) are defined in docs/analysis-defaults.md.
  • Semantic analysis may download model weights on first use.
  • Extraction skips common artifact/cache directories by default (__pycache__, .venv, etc.).
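The directory skipping mentioned in the last note can be sketched as follows. This is an illustration only: `SKIP_DIRS` and `iter_python_files` are hypothetical names, and the skip set contains just the two directories named above, not the tool's actual default list.

```python
import os
from pathlib import Path
from typing import Iterator

# Assumption: only the two directories named in the notes; the real defaults may differ.
SKIP_DIRS = {"__pycache__", ".venv"}


def iter_python_files(root: str) -> Iterator[Path]:
    """Yield .py files under root, never descending into skipped directories."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place stops os.walk from entering those directories.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in sorted(filenames):
            if name.endswith(".py"):
                yield Path(dirpath) / name
```

Pruning during the walk, rather than filtering paths afterwards, avoids traversing large virtual environments at all.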
