codedupes detects duplicate and potentially unused Python code with:
- Traditional AST/token matching (exact + Jaccard near-duplicate)
- Semantic matching with model-profile embeddings (default gte-modernbert-base)
- Heuristic unused-code detection
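
As a rough illustration of the "exact + Jaccard near-duplicate" idea, the sketch below uses only the Python standard library. It is not codedupes' implementation: the helper names (`normalized_ast`, `token_set`, `jaccard`) are hypothetical, and the tool's real normalization and thresholds may differ.

```python
import ast
import io
import tokenize


def normalized_ast(source: str) -> str:
    """Dump the AST without position info, so comments and spacing are ignored."""
    return ast.dump(ast.parse(source), include_attributes=False)


def token_set(source: str) -> set[str]:
    """Collect distinct code tokens, skipping comments and layout-only tokens."""
    skip = {
        tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
        tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
    }
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return {tok.string for tok in tokens if tok.type not in skip}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


first = "def add(a, b):\n    return a + b\n"
reformatted = "def add(a,b):  # sum\n    return a+b\n"
renamed = "def add(x, b):\n    return x + b\n"

# Exact match: identical structure despite different formatting and comments.
print(normalized_ast(first) == normalized_ast(reformatted))  # True

# Near-duplicate: one renamed parameter lowers the score below 1.0.
print(jaccard(token_set(first), token_set(renamed)))
```

A pair would count as a near-duplicate when its score exceeds some chosen threshold; the threshold codedupes uses is documented in docs/cli.md and docs/analysis-defaults.md, not assumed here.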
Install from GitHub:

    pip install "codedupes @ git+https://github.com/pszemraj/codedupes.git"

Optional GPU extras:

    pip install "codedupes[gpu] @ git+https://github.com/pszemraj/codedupes.git"

Requires Python 3.11+. Details are in docs/install.md.
Basic commands:

    codedupes check ./src
    codedupes search ./src "normalize request payload"
    codedupes info

codedupes check defaults to a hybrid-first report:
- one combined duplicate list (Hybrid Duplicates)
- likely dead code (potentially_unused)
Use --show-all to include raw traditional + raw semantic duplicate lists.
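
To show why a potentially_unused list can only be heuristic, here is a minimal sketch of one simple approach: collect defined names, collect referenced names, and report the difference. This is an assumption-laden illustration, not how codedupes builds its call graph; `potentially_unused` here is a hypothetical helper that happens to share the report's label.

```python
import ast


def potentially_unused(source: str) -> set[str]:
    """Return function/class names that are defined but never referenced."""
    tree = ast.parse(source)

    defined = {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    }

    referenced = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            referenced.add(node.id)      # plain names, e.g. helper()
        elif isinstance(node, ast.Attribute):
            referenced.add(node.attr)    # attribute access, e.g. obj.helper()

    # Skip dunder methods: the interpreter calls them implicitly.
    return {
        name
        for name in defined - referenced
        if not (name.startswith("__") and name.endswith("__"))
    }


sample = '''
def helper():
    return 1

def orphan():
    return 2

def main():
    return helper()
'''

print(sorted(potentially_unused(sample)))  # ['main', 'orphan']
```

Note that `main` is flagged even though it is a plausible entry point. Static name matching cannot see external callers, dynamic dispatch, or framework hooks, which is why results of this kind should be read as candidates for review.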
Primary docs live under docs/:
- docs/index.md: documentation map and ownership
- docs/cli.md: commands, flags, and defaults
- docs/model-profiles.md: semantic model aliases, profile defaults, and task behavior
- docs/analysis-defaults.md: analysis-behavior defaults and heuristics
- docs/output.md: JSON schemas and exit codes
- docs/usage.md: practical workflows and tuning examples
- docs/python-api.md: programmatic API usage
- docs/hybrid-tuning.md: hybrid gate tuning workflow

Notes:
- Call graph and unused detection are heuristic and conservative by default.
- Semantic model-profile defaults and task behavior are defined in docs/model-profiles.md.
- Analysis defaults (semantic candidate scope, tiny-traditional filtering, hybrid gates) are defined in docs/analysis-defaults.md.
- Semantic analysis may download model weights on first use.
- Extraction skips common artifact/cache directories by default (__pycache__, .venv, etc.).
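
Directory skipping of this kind is commonly done by pruning the walk in place. The sketch below shows that pattern with the standard library; `SKIP_DIRS` and `iter_python_files` are hypothetical names, and the set lists only the two directories named above, not codedupes' full default list.

```python
import os
from pathlib import Path

# Assumed subset for illustration; the tool's real default list is longer.
SKIP_DIRS = {"__pycache__", ".venv"}


def iter_python_files(root):
    """Yield .py files under root, never descending into skipped directories."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Mutating dirnames in place tells os.walk not to visit those subtrees.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in sorted(filenames):
            if name.endswith(".py"):
                yield Path(dirpath) / name
```

Pruning during the walk avoids reading large virtual-environment trees at all, which is cheaper than filtering paths after they have been collected.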