55[![Python](https://img.shields.io/pypi/pyversions/codeclone.svg)](https://pypi.org/project/codeclone/)
66[![License](https://img.shields.io/pypi/l/codeclone.svg)](LICENSE)
77
8- **CodeClone** is an AST-based Python code clone detector that helps teams find architectural duplication and prevent new
9- copy-paste via CI.
8+ **CodeClone** is a Python code clone detector based on **normalized AST and control-flow graphs (CFG)**.
9+ It helps teams discover architectural duplication and prevent new copy-paste from entering the codebase via CI.
1010
11- It is designed to help teams:
11+ CodeClone is designed to help teams:
1212
13- - discover structural and logical code duplication,
14- - understand architectural hotspots,
15- - and prevent *new* duplication from entering the codebase via CI.
13+ - discover **structural and control-flow duplication**,
14+ - identify architectural hotspots,
15+ - prevent *new* duplication via CI and pre-commit hooks.
1616
17- Unlike token- or text-based tools, CodeClone works on **normalized Python AST**, which makes it robust against renaming,
17+ Unlike token- or text-based tools, CodeClone operates on **normalized Python AST and CFG**, making it robust against renaming,
1818formatting, and minor refactoring.
1919
2020---
2121
2222## Why CodeClone?
2323
2424Most existing tools detect *textual* duplication.
25- CodeClone detects **structural and block-level duplication** that usually indicates missing abstractions or
26- architectural drift.
25+ CodeClone detects **structural and block-level duplication**, which usually signals missing abstractions or architectural drift.
2726
2827Typical use cases:
2928
30- - duplicated service logic across layers (API ↔ application),
29+ - duplicated service or orchestration logic across layers (API ↔ application),
3130- repeated validation or guard blocks,
32- - copy-pasted request/ handler flows,
33- - duplicated orchestration logic in routers, handlers, or services.
31+ - copy-pasted request / handler flows,
32+ - duplicated control-flow logic in routers, handlers, or services.
3433
3534---
3635
3736## Features
3837
39- ### Function-level clone detection (Type-2)
38+ ### Function-level clone detection (Type-2, CFG-based)
4039
41- - Detects functions and methods with identical structure.
40+ - Detects functions and methods with identical **control-flow structure**.
41+ - Based on **Control Flow Graph (CFG)** fingerprinting.
4242- Robust to:
43- - variable renaming,
44- - constant changes,
45- - formatting differences.
46- - Ideal for spotting architectural duplication between layers.
43+ - variable renaming,
44+ - constant changes,
45+ - attribute renaming,
46+ - formatting differences,
47+ - docstrings and type annotations.
48+ - Ideal for spotting architectural duplication across layers.
4749
4850### Block-level clone detection (Type-3-lite)
4951
5052- Detects repeated **statement blocks** inside larger functions.
53+ - Uses sliding windows over CFG-normalized statement sequences.
5154- Targets:
52- - validation blocks,
53- - guard clauses,
54- - repeated orchestration logic.
55- - Carefully filtered to avoid noise:
56- - no overlapping windows,
57- - no clones inside the same function,
58- - no ` __init__ ` noise.
55+ - validation blocks,
56+ - guard clauses,
57+ - repeated orchestration logic.
58+ - Carefully filtered to reduce noise:
59+ - no overlapping windows,
60+ - no clones inside the same function,
61+ - no ` __init__ ` noise,
62+ - size and statement-count thresholds.
63+
64+ ### Control-Flow Awareness (CFG v1)
65+
66+ - Each function is converted into a **Control Flow Graph**.
67+ - CFG nodes contain normalized AST statements.
68+ - CFG edges represent structural control flow (`if`, `for`, `while`).
69+ - Current CFG semantics (v1):
70+ - `break` and `continue` are treated as statements (no jump targets),
71+ - after-blocks are explicit and always present,
72+ - focus is on **structural similarity**, not precise runtime semantics.
73+
74+ This design keeps clone detection **stable, deterministic, and low-noise**.
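The v1 semantics described above can be sketched as a small structural CFG builder. This is an illustration under stated assumptions, not CodeClone's code: `Block`, `CFGBuilder`, and `cfg_shape` are hypothetical names, statements are reduced to their AST node type, and loop `else` clauses are ignored.

```python
import ast
from dataclasses import dataclass, field


@dataclass
class Block:
    id: int
    stmts: list = field(default_factory=list)  # statement type names
    succ: list = field(default_factory=list)   # successor block ids


class CFGBuilder:
    def __init__(self):
        self.blocks = []

    def new_block(self):
        block = Block(id=len(self.blocks))
        self.blocks.append(block)
        return block

    def build(self, func):
        self.visit_body(func.body, self.new_block())
        return self.blocks

    def visit_body(self, stmts, current):
        """Append statements to `current`; return the block where flow continues."""
        for stmt in stmts:
            if isinstance(stmt, ast.If):
                current.stmts.append("If")
                after = self.new_block()  # the after-block always exists
                then_block = self.new_block()
                current.succ.append(then_block.id)
                self.visit_body(stmt.body, then_block).succ.append(after.id)
                if stmt.orelse:
                    else_block = self.new_block()
                    current.succ.append(else_block.id)
                    self.visit_body(stmt.orelse, else_block).succ.append(after.id)
                else:
                    current.succ.append(after.id)
                current = after
            elif isinstance(stmt, (ast.For, ast.While)):
                header = self.new_block()
                header.stmts.append(type(stmt).__name__)
                body = self.new_block()
                after = self.new_block()
                current.succ.append(header.id)
                header.succ += [body.id, after.id]
                # The end of the loop body flows back to the header.
                self.visit_body(stmt.body, body).succ.append(header.id)
                current = after
            else:
                # break/continue land here too: plain statements, no jump edges.
                current.stmts.append(type(stmt).__name__)
        return current


def cfg_shape(source: str):
    """Return a comparable description of the CFG of the first function in `source`."""
    func = ast.parse(source).body[0]
    return [(b.stmts, b.succ) for b in CFGBuilder().build(func)]
```

Because block ids are assigned in creation order, two functions with the same control structure yield the same shape regardless of identifier names, which is what makes a fingerprint over it deterministic.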
5975
6076### Low-noise by design
6177
62- - AST normalization instead of token matching.
63- - Size and statement-count thresholds.
78+ - AST + CFG normalization instead of token matching.
6479- Conservative defaults tuned for real-world Python projects.
80+ - Explicit thresholds for size and statement count.
81+ - Focus on *architectural duplication*, not micro-similarities.
6582
6683### CI-friendly baseline mode
6784
6885- Establish a baseline of existing clones.
6986- Fail CI **only when new clones are introduced**.
70- - Safe for legacy codebases.
87+ - Safe for legacy codebases and incremental refactoring.
7188
7289---
7390
@@ -77,11 +94,11 @@ Typical use cases:
7794pip install codeclone
7895```
7996
80- Python 3.10+ is required.
97+ Python **3.10+** is required.
8198
82- ⸻
99+ ---
83100
84- Quick Start
101+ ## Quick Start
85102
86103Run on a project:
87104
@@ -91,9 +108,10 @@ codeclone .
91108
92109This will:
93110
94- * scan Python files,
95- * detect function-level and block-level clones,
96- * print a summary to stdout.
111+ - scan Python files,
112+ - build CFGs for functions,
113+ - detect function-level and block-level clones,
114+ - print a summary to stdout.
97115
98116Generate reports:
99117
@@ -103,81 +121,97 @@ codeclone . \
103121 --text-out .cache/codeclone/report.txt
104122```
105123
106- ⸻
124+ Generate an HTML report:
125+
126+ ``` bash
127+ codeclone . --html-out .cache/codeclone/report.html
128+ ```
129+
130+ ---
107131
108- Baseline Workflow (Recommended)
132+ ## Baseline Workflow (Recommended)
109133
110- 1. Create a baseline
134+ ### 1. Create a baseline
111135
112136Run once on your current codebase:
113137
114138``` bash
115139codeclone . --update-baseline
116140```
117141
118- This creates a file:
142+ Commit the generated baseline file to the repository.
143+
144+ ### 2. Use in CI
119145
120146``` bash
121- .codeclone-baseline.json
147+ codeclone . --fail-on-new
122148```
123149
124- Commit this file to the repository.
125-
126- ⸻
150+ Behavior:
127151
128- 2. Use in CI
152+ - ✅ existing clones are allowed,
153+ - ❌ build fails if *new* clones appear,
154+ - ✅ refactoring that removes duplication is always allowed.
129155
130- In CI, run:
156+ ---
131157
132- ``` bash
133- codeclone . --fail-on-new
158+ ## Using with pre-commit
159+
160+ ``` yaml
161+ repos:
162+   - repo: local
163+     hooks:
164+       - id: codeclone
165+         name: CodeClone
166+         entry: codeclone
167+         language: python
168+         args: [".", "--fail-on-new"]
169+         types: [python]
134170```
135171
136- Behavior:
172+ ---
137173
138- * ✅ existing clones are allowed,
139- * ❌ build fails if new function or block clones appear,
140- * ✅ refactoring that removes duplication is always allowed.
174+ ## What CodeClone Is (and Is Not)
141175
142- This enables gradual improvement without breaking existing development flow.
176+ ### CodeClone **is**
143177
144- ⸻
178+ - an architectural analysis tool,
179+ - a duplication radar,
180+ - a CI guard against copy-paste,
181+ - a control-flow-aware clone detector.
145182
146- What CodeClone Is (and Is Not)
183+ ### CodeClone **is not**
147184
148- CodeClone is
185+ - a linter,
186+ - a formatter,
187+ - a semantic equivalence prover,
188+ - a runtime analyzer.
149189
150- * an architectural analysis tool,
151- * a duplication radar,
152- * a CI guard against copy-paste.
190+ ---
191+
192+ ## How It Works (High Level)
153193
154- CodeClone is not
194+ 1. Parse Python source into AST.
195+ 2. Normalize AST (names, constants, attributes, annotations).
196+ 3. Build a **Control Flow Graph (CFG)** per function.
197+ 4. Compute stable CFG fingerprints.
198+ 5. Detect function-level and block-level clones.
199+ 6. Apply conservative filters to suppress noise.
155200
156- * a linter,
157- * a formatter,
158- * a replacement for SonarQube or static analyzers,
159- * a semantic equivalence prover.
201+ ---
160202
161- It intentionally focuses on high-signal duplication.
203+ ## Control Flow Graph (CFG)
162204
163- ⸻
205+ Starting from **version 1.1.0**, CodeClone uses a **Control Flow Graph (CFG)**
206+ to improve structural clone detection robustness.
164207
165- How It Works (High Level)
208+ The CFG is a **structural abstraction**, not a runtime execution model.
166209
167- * Parses Python source into AST.
168- * Normalizes:
169- - variable names,
170- - constants,
171- - attributes,
172- - docstrings and annotations.
173- * Computes stable structural fingerprints.
174- * Detects:
175- - identical function structures,
176- - repeated statement blocks across functions.
177- * Applies filters to suppress noise.
210+ See full design and semantics:
211+ - [docs/cfg.md](docs/cfg.md)
178212
179- ⸻
213+ ---
180214
181- License
215+ ## License
182216
183- MIT License
217+ MIT License