CodeClone is a Python code clone detector based on normalized AST and Control Flow Graphs (CFG).
It discovers architectural duplication and prevents new copy-paste from entering your codebase via CI.
CodeClone focuses on architectural duplication, not text similarity. It detects structural patterns through:
- Normalized AST analysis — robust to renaming, formatting, and minor refactors
- Control Flow Graphs — captures execution logic, not just syntax
- Strict, explainable matching — clear signals, not fuzzy heuristics
Unlike token-based tools, CodeClone compares structure and control flow, making it ideal for finding:
- Repeated service/orchestration patterns
- Duplicated guard/validation blocks
- Copy-pasted handler logic across modules
- Recurring internal segments in large functions
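To make "robust to renaming" concrete, here is a minimal sketch using Python's standard `ast` module. It is not CodeClone's actual normalizer: the `RenameIdentifiers` class, the placeholder scheme, and the sample functions are assumptions made for illustration only.

```python
import ast


class RenameIdentifiers(ast.NodeTransformer):
    """Illustrative normalizer: map each identifier to a positional placeholder."""

    def __init__(self):
        self._names = {}

    def _placeholder(self, name):
        # The first distinct name becomes v0, the second v1, and so on.
        return self._names.setdefault(name, f"v{len(self._names)}")

    def visit_FunctionDef(self, node):
        node.name = "_fn_"
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self._placeholder(node.arg)
        self.generic_visit(node)
        return node

    def visit_Name(self, node):
        node.id = self._placeholder(node.id)
        return node


def normalized_dump(source):
    """Return a structural dump that ignores names, formatting, and comments."""
    tree = ast.parse(source)
    RenameIdentifiers().visit(tree)
    return ast.dump(tree)  # line/column attributes are excluded by default


ORIGINAL = """
def total_price(items, tax):
    result = 0
    for item in items:
        result += item.price
    return result * (1 + tax)
"""

RENAMED = """
def sum_cost(products, rate):
    acc = 0
    for p in products:
        acc += p.price  # comment and names differ
    return acc * (1 + rate)
"""

DIFFERENT = """
def total_price(items, tax):
    if not items:
        return 0
    return sum(i.price for i in items) * (1 + tax)
"""

same_structure = normalized_dump(ORIGINAL) == normalized_dump(RENAMED)
other_structure = normalized_dump(ORIGINAL) == normalized_dump(DIFFERENT)
```

Under this sketch the renamed copy compares equal to the original, while the version with a different control structure does not. A token-based comparison would treat the first pair as different text.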
Three Detection Levels:
- Function clones (CFG fingerprint): strong structural signal for cross-layer duplication
- Block clones (statement windows): detects repeated local logic patterns
- Segment clones (report-only): internal function repetition for explainability; not used for baseline gating
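The idea of statement windows can be sketched as a sliding window over each function body. This is an illustration under stated assumptions, not CodeClone's implementation: the window size, the helper names, and the use of a raw `ast.dump` (without normalization) are all chosen here for brevity.

```python
import ast
import hashlib
from collections import defaultdict


def window_hashes(source, size=3):
    """Yield (function name, first line, digest) for every window of `size` statements."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            body = node.body
            for i in range(len(body) - size + 1):
                window = body[i:i + size]
                text = "\n".join(ast.dump(stmt) for stmt in window)
                digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
                yield node.name, window[0].lineno, digest


def repeated_windows(source, size=3):
    """Return only the windows whose digest occurs more than once."""
    seen = defaultdict(list)
    for name, line, digest in window_hashes(source, size):
        seen[digest].append((name, line))
    return {d: locs for d, locs in seen.items() if len(locs) > 1}


# Two hypothetical handlers that share the same three-statement guard block.
SOURCE = '''
def create(user, payload):
    if user is None:
        raise ValueError("no user")
    if not payload:
        raise ValueError("empty")
    log("validated")
    return save(payload)

def update(user, payload):
    if user is None:
        raise ValueError("no user")
    if not payload:
        raise ValueError("empty")
    log("validated")
    return patch(payload)
'''

repeats = repeated_windows(SOURCE)
```

Here the shared guard block is reported once, with one location per function, while the windows that include the differing `return` statements are not.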
CI-Ready Features:
- Deterministic output with stable ordering
- Reproducible artifacts for audit trails
- Baseline-driven gating to prevent new duplication
- Fast incremental analysis with intelligent caching
pip install codeclone

Requirements: Python 3.10+
# Analyze current directory
codeclone .
# Check version
codeclone --version

# Generate HTML, JSON, and text reports
codeclone . \
  --html .cache/codeclone/report.html \
  --json .cache/codeclone/report.json \
  --text .cache/codeclone/report.txt

# 1. Generate baseline once (commit to repo)
codeclone . --update-baseline
# 2. Add to CI pipeline
codeclone . --ci

The --ci preset is equivalent to --fail-on-new --no-color --quiet.
Baselines capture the current state of duplication in your codebase. Once committed, they serve as the reference point for CI checks.
Key points (contract-level):
- Baseline file is versioned (codeclone.baseline.json) and used to classify clones as NEW vs KNOWN.
- Compatibility is gated by schema_version, fingerprint_version, and python_tag.
- Baseline trust is gated by meta.generator.name (codeclone) and integrity (payload_sha256).
- In CI preset (--ci), an untrusted baseline is a contract error (exit 2).
Full contract details: docs/book/06-baseline.md
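The trust and NEW vs KNOWN rules above can be pictured roughly as follows. This is a sketch only: the baseline layout (`meta` plus `payload`), the location of `payload_sha256`, and the canonical JSON form used for hashing are assumptions, so consult the baseline contract for the real schema.

```python
import hashlib
import json


def payload_sha256(payload):
    # Assumption: canonical form is JSON with sorted keys and compact separators.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_trusted(baseline):
    """Trust requires the expected generator name and a matching payload hash."""
    meta = baseline.get("meta", {})
    generator_ok = meta.get("generator", {}).get("name") == "codeclone"
    integrity_ok = meta.get("payload_sha256") == payload_sha256(baseline.get("payload", {}))
    return generator_ok and integrity_ok


def classify(current, known):
    """Split current clone fingerprints into NEW and KNOWN against the baseline."""
    return {
        "new": sorted(set(current) - set(known)),
        "known": sorted(set(current) & set(known)),
    }


# Hypothetical baseline with two known function fingerprints.
EXAMPLE_PAYLOAD = {"functions": ["fp-a", "fp-b"]}
EXAMPLE_BASELINE = {
    "meta": {
        "generator": {"name": "codeclone"},
        "payload_sha256": payload_sha256(EXAMPLE_PAYLOAD),
    },
    "payload": EXAMPLE_PAYLOAD,
}

split = classify(["fp-b", "fp-c"], EXAMPLE_BASELINE["payload"]["functions"])
```

In this picture, editing the payload by hand without updating the hash makes `is_trusted` return `False`, which corresponds to the contract error described for the --ci preset.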
CodeClone uses a deterministic exit code contract:
| Code | Meaning |
|---|---|
| 0 | Success: run completed without gating failures |
| 2 | Contract error: baseline missing/untrusted, invalid output extensions, incompatible versions, unreadable source files in CI/gating |
| 3 | Gating failure: new clones detected or threshold exceeded |
| 5 | Internal error: unexpected exception |
Priority: Contract errors (2) override gating failures (3) when both occur.
Full contract details: docs/book/03-contracts-exit-codes.md
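A wrapper script might interpret these exit codes as sketched below. The helper names are hypothetical and the precedence function only restates the priority rule above; `run_ci` assumes the codeclone executable is on PATH.

```python
import subprocess

# Exit codes as documented in the table above.
EXIT_MEANINGS = {
    0: "success",
    2: "contract error",
    3: "gating failure",
    5: "internal error",
}


def describe_exit(code):
    """Translate an exit code into its documented meaning."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def resolve_exit(contract_error, gating_failure):
    """Restate the priority rule: a contract error (2) wins over a gating failure (3)."""
    if contract_error:
        return 2
    if gating_failure:
        return 3
    return 0


def run_ci(path="."):
    """Run `codeclone <path> --ci` and describe the result."""
    completed = subprocess.run(["codeclone", path, "--ci"], check=False)
    return describe_exit(completed.returncode)
```

Keeping the mapping in one place lets a CI job distinguish "fix your baseline" (2) from "you introduced duplication" (3) when reporting failures.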
Debug Support:
# Show detailed error information
codeclone . --debug
# Or via environment variable
CODECLONE_DEBUG=1 codeclone .

Report formats:
- HTML (--html): interactive web report with filtering
- JSON (--json): machine-readable structured data
- Text (--text): plain text summary
The JSON report uses a compact deterministic layout:
- Top-level: meta, files, groups, groups_split, group_item_layout
- Optional top-level: facts
- groups_split provides explicit NEW / KNOWN separation per section
- meta.groups_counts provides deterministic per-section aggregates
- meta follows a shared canonical contract across HTML/JSON/TXT
Canonical report contract: docs/book/08-report.md
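A consumer of the JSON report could expand the compact layout along these lines. This sketch assumes that each group maps a key to a list of positional arrays ordered as in group_item_layout, and that file_i is an index into files; neither is confirmed here, and the sample data is invented, so check the report contract before relying on it.

```python
def decode_items(report, section):
    """Expand positional items of one section into dictionaries with file paths."""
    layout = report["group_item_layout"][section]
    files = report["files"]
    decoded = {}
    for group_key, items in report["groups"][section].items():
        rows = []
        for item in items:
            row = dict(zip(layout, item))
            # Assumption: file_i indexes into the top-level files list.
            row["file"] = files[row.pop("file_i")]
            rows.append(row)
        decoded[group_key] = rows
    return decoded


# Invented sample data, shaped after the blocks layout for illustration.
SAMPLE_REPORT = {
    "files": ["app/handlers.py", "app/services.py"],
    "group_item_layout": {
        "blocks": ["file_i", "qualname", "start", "end", "size"],
    },
    "groups": {
        "blocks": {
            "group-1": [
                [0, "handlers:create", 10, 14, 5],
                [1, "services:update", 42, 46, 5],
            ],
        },
    },
}

decoded_blocks = decode_items(SAMPLE_REPORT, "blocks")
```

Reading the field order from group_item_layout rather than hard-coding positions keeps such a consumer tolerant of layout changes between report schema versions.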
Minimal shape (v1.1):
{
"meta": {
"report_schema_version": "1.1",
"codeclone_version": "1.4.0",
"python_version": "3.13",
"python_tag": "cp313",
"baseline_path": "/path/to/codeclone.baseline.json",
"baseline_fingerprint_version": "1",
"baseline_schema_version": "1.0",
"baseline_python_tag": "cp313",
"baseline_generator_name": "codeclone",
"baseline_generator_version": "1.4.0",
"baseline_payload_sha256": "<sha256>",
"baseline_payload_sha256_verified": true,
"baseline_loaded": true,
"baseline_status": "ok",
"cache_path": "/path/to/.cache/codeclone/cache.json",
"cache_used": true,
"cache_status": "ok",
"cache_schema_version": "1.2",
"files_skipped_source_io": 0,
"groups_counts": {
"functions": {
"total": 0,
"new": 0,
"known": 0
},
"blocks": {
"total": 0,
"new": 0,
"known": 0
},
"segments": {
"total": 0,
"new": 0,
"known": 0
}
}
},
"files": [],
"groups": {
"functions": {},
"blocks": {},
"segments": {}
},
"groups_split": {
"functions": {
"new": [],
"known": []
},
"blocks": {
"new": [],
"known": []
},
"segments": {
"new": [],
"known": []
}
},
"group_item_layout": {
"functions": [
"file_i",
"qualname",
"start",
"end",
"loc",
"stmt_count",
"fingerprint",
"loc_bucket"
],
"blocks": [
"file_i",
"qualname",
"start",
"end",
"size"
],
"segments": [
"file_i",
"qualname",
"start",
"end",
"size",
"segment_hash",
"segment_sig"
]
},
"facts": {
"blocks": {}
}
}

Cache is an optimization layer only and is never a source of truth.
- Default path: <root>/.cache/codeclone/cache.json
- Schema version: v1.2
- Invalid or oversized cache is ignored with a warning and rebuilt (fail-open)
Full contract details: docs/book/07-cache.md
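Fail-open behaviour can be sketched as a loader that never raises on a bad cache. The size limit, the `schema_version` key, and the function name below are assumptions for illustration; the real limits and schema are defined in the cache contract.

```python
import json
import warnings
from pathlib import Path

MAX_CACHE_BYTES = 50 * 1024 * 1024  # hypothetical limit, not CodeClone's actual value
EXPECTED_SCHEMA = "1.2"


def load_cache(path):
    """Return cached data, or an empty dict when the cache cannot be trusted."""
    try:
        if path.stat().st_size > MAX_CACHE_BYTES:
            raise ValueError("cache file is oversized")
        data = json.loads(path.read_text(encoding="utf-8"))
        # Assumption: the schema version is stored under a top-level key.
        if not isinstance(data, dict) or data.get("schema_version") != EXPECTED_SCHEMA:
            raise ValueError("unexpected cache schema")
        return data
    except FileNotFoundError:
        return {}  # a missing cache is normal on the first run
    except (OSError, ValueError) as exc:
        # Fail open: warn, ignore the cache, and let the analysis rebuild it.
        warnings.warn(f"ignoring cache at {path}: {exc}")
        return {}
```

Because the cache is only an optimization, returning an empty dict costs time but never changes results, which is why a corrupt cache warrants a warning rather than an error.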
Pre-commit integration:
repos:
  - repo: local
    hooks:
      - id: codeclone
        name: CodeClone
        entry: codeclone
        language: system
        pass_filenames: false
        args: [ ".", "--ci" ]
        types: [ python ]

CodeClone is:
- A structural clone detector for Python
- A CI guard against new duplication
- A deterministic analysis tool with auditable outputs

CodeClone is not:
- A linter or code formatter
- A semantic equivalence prover
- A runtime execution analyzer
High-level Pipeline:
- Parse — Python source → AST
- Normalize — AST → canonical structure
- CFG Construction — per-function control flow graph
- Fingerprinting — stable hash computation
- Grouping — function/block/segment clone groups
- Determinism — stable ordering for reproducibility
- Baseline Comparison — new vs known clones (when requested)
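The fingerprinting, grouping, and determinism steps can be approximated in a few lines. This is a deliberately crude stand-in: it hashes the sequence of AST node types rather than a normalized AST or CFG, and the function names and sample sources are invented for the example.

```python
import ast
import hashlib
from collections import defaultdict


def shape_fingerprint(func):
    """Hash the sequence of node types; a crude stand-in for a structural fingerprint."""
    shape = ",".join(type(node).__name__ for node in ast.walk(func))
    return hashlib.sha256(shape.encode("utf-8")).hexdigest()[:16]


def clone_groups(sources):
    """Group functions by fingerprint and return groups in a stable order."""
    groups = defaultdict(list)
    for path in sorted(sources):  # iterate files in a fixed order
        for node in ast.walk(ast.parse(sources[path])):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                groups[shape_fingerprint(node)].append((path, node.name))
    # Keep only real groups and sort both members and groups for reproducibility.
    return sorted(
        (fingerprint, sorted(members))
        for fingerprint, members in groups.items()
        if len(members) > 1
    )


# Invented in-memory "files" keyed by path.
SOURCES = {
    "b.py": (
        "def fetch_order(oid):\n"
        "    if oid is None:\n"
        "        return None\n"
        "    return store.find(oid)\n"
        "\n"
        "def total(xs):\n"
        "    return sum(xs)\n"
    ),
    "a.py": (
        "def load_user(uid):\n"
        "    if uid is None:\n"
        "        return None\n"
        "    return db.get(uid)\n"
    ),
}

found = clone_groups(SOURCES)
```

Sorting at every step means the result does not depend on dictionary insertion order or file discovery order, which is the property the determinism step is meant to guarantee.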
Learn more:
- Architecture: docs/architecture.md
- CFG semantics: docs/cfg.md
Use this map to pick the right level of detail:
- Contract book (canonical contracts/specs): docs/book/
  - Start here: docs/book/00-intro.md
  - Exit codes and precedence: docs/book/03-contracts-exit-codes.md
  - Baseline contract (schema/trust/integrity): docs/book/06-baseline.md
  - Cache contract (schema/integrity/fail-open): docs/book/07-cache.md
  - Report contract (schema v1.1 + NEW/KNOWN split): docs/book/08-report.md
  - CLI behavior: docs/book/09-cli.md
  - HTML rendering: docs/book/10-html-render.md
  - Determinism policy: docs/book/12-determinism.md
  - Compatibility/versioning rules: docs/book/14-compatibility-and-versioning.md
- Deep dives:
  - Architecture narrative: docs/architecture.md
  - CFG semantics: docs/cfg.md