Commit 0ac6f0c

committed
Merge branch 'feat/1.1.0'
2 parents 5833b6e + 1c176fa commit 0ac6f0c

26 files changed

Lines changed: 2162 additions & 260 deletions

.pre-commit-config.yaml

Lines changed: 9 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,9 @@
1+
repos:
2+
  - repo: local
3+
    hooks:
4+
      - id: codeclone
5+
        name: CodeClone
6+
        entry: codeclone
7+
        language: python
8+
        args: [".", "--fail-on-new"]
9+
        types: [python]

CHANGELOG.md

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
1+
# Changelog
2+
3+
## [1.1.0] — 2026-01-19
4+
5+
### Added
6+
7+
- Control Flow Graph (CFG v1) for structural clone detection
8+
- Deterministic CFG-based function fingerprints
9+
- Interactive HTML report with syntax highlighting
10+
- Dark/light theme toggle in HTML report
11+
- Block-level clone visualization
12+
13+
### Changed
14+
15+
- Function clone detection now based on CFG instead of pure AST
16+
- Improved robustness against refactoring and control-flow changes
17+
18+
### Documentation
19+
20+
- Added `docs/cfg.md` with CFG semantics and limitations
21+
- Added `docs/architecture.md` describing system design
22+
23+
---
24+
25+
## [1.0.0] — 2026-01-17
26+
27+
### Initial release
28+
29+
- AST-based function clone detection
30+
- Block-level clone detection (Type-3-lite)
31+
- Baseline workflow for CI
32+
- JSON and text reports

README.md

Lines changed: 113 additions & 79 deletions
@@ -5,69 +5,86 @@
55
[![Python](https://img.shields.io/pypi/pyversions/codeclone.svg)](https://pypi.org/project/codeclone/)
66
[![License](https://img.shields.io/pypi/l/codeclone.svg)](LICENSE)
77

8-
**CodeClone** is an AST-based Python code clone detector that helps teams find architectural duplication and prevent new
9-
copy-paste via CI.
8+
**CodeClone** is a Python code clone detector based on **normalized AST and control-flow graphs (CFG)**.
9+
It helps teams discover architectural duplication and prevent new copy-paste from entering the codebase via CI.
1010

11-
It is designed to help teams:
11+
CodeClone is designed to help teams:
1212

13-
- discover structural and logical code duplication,
14-
- understand architectural hotspots,
15-
- and prevent *new* duplication from entering the codebase via CI.
13+
- discover **structural and control-flow duplication**,
14+
- identify architectural hotspots,
15+
- prevent *new* duplication via CI and pre-commit hooks.
1616

17-
Unlike token- or text-based tools, CodeClone works on **normalized Python AST**, which makes it robust against renaming,
17+
Unlike token- or text-based tools, CodeClone operates on **normalized Python AST and CFG**, making it robust against renaming,
1818
formatting, and minor refactoring.
1919

2020
---
2121

2222
## Why CodeClone?
2323

2424
Most existing tools detect *textual* duplication.
25-
CodeClone detects **structural and block-level duplication** that usually indicates missing abstractions or
26-
architectural drift.
25+
CodeClone detects **structural and block-level duplication**, which usually signals missing abstractions or architectural drift.
2726

2827
Typical use cases:
2928

30-
- duplicated service logic across layers (API ↔ application),
29+
- duplicated service or orchestration logic across layers (API ↔ application),
3130
- repeated validation or guard blocks,
32-
- copy-pasted request/handler flows,
33-
- duplicated orchestration logic in routers, handlers, or services.
31+
- copy-pasted request / handler flows,
32+
- duplicated control-flow logic in routers, handlers, or services.
3433

3534
---
3635

3736
## Features
3837

39-
### Function-level clone detection (Type-2)
38+
### Function-level clone detection (Type-2, CFG-based)
4039

41-
- Detects functions and methods with identical structure.
40+
- Detects functions and methods with identical **control-flow structure**.
41+
- Based on **Control Flow Graph (CFG)** fingerprinting.
4242
- Robust to:
43-
- variable renaming,
44-
- constant changes,
45-
- formatting differences.
46-
- Ideal for spotting architectural duplication between layers.
43+
- variable renaming,
44+
- constant changes,
45+
- attribute renaming,
46+
- formatting differences,
47+
- docstrings and type annotations.
48+
- Ideal for spotting architectural duplication across layers.
4749

4850
### Block-level clone detection (Type-3-lite)
4951

5052
- Detects repeated **statement blocks** inside larger functions.
53+
- Uses sliding windows over CFG-normalized statement sequences.
5154
- Targets:
52-
- validation blocks,
53-
- guard clauses,
54-
- repeated orchestration logic.
55-
- Carefully filtered to avoid noise:
56-
- no overlapping windows,
57-
- no clones inside the same function,
58-
- no `__init__` noise.
55+
- validation blocks,
56+
- guard clauses,
57+
- repeated orchestration logic.
58+
- Carefully filtered to reduce noise:
59+
- no overlapping windows,
60+
- no clones inside the same function,
61+
- no `__init__` noise,
62+
- size and statement-count thresholds.
63+
64+
### Control-Flow Awareness (CFG v1)
65+
66+
- Each function is converted into a **Control Flow Graph**.
67+
- CFG nodes contain normalized AST statements.
68+
- CFG edges represent structural control flow (`if`, `for`, `while`).
69+
- Current CFG semantics (v1):
70+
- `break` and `continue` are treated as statements (no jump targets),
71+
- after-blocks are explicit and always present,
72+
- focus is on **structural similarity**, not precise runtime semantics.
73+
74+
This design keeps clone detection **stable, deterministic, and low-noise**.
5975

6076
### Low-noise by design
6177

62-
- AST normalization instead of token matching.
63-
- Size and statement-count thresholds.
78+
- AST + CFG normalization instead of token matching.
6479
- Conservative defaults tuned for real-world Python projects.
80+
- Explicit thresholds for size and statement count.
81+
- Focus on *architectural duplication*, not micro-similarities.
6582

6683
### CI-friendly baseline mode
6784

6885
- Establish a baseline of existing clones.
6986
- Fail CI **only when new clones are introduced**.
70-
- Safe for legacy codebases.
87+
- Safe for legacy codebases and incremental refactoring.
7188

7289
---
7390

@@ -77,11 +94,11 @@ Typical use cases:
7794
pip install codeclone
7895
```
7996

80-
Python 3.10+ is required.
97+
Python **3.10+** is required.
8198

82-
99+
---
83100

84-
Quick Start
101+
## Quick Start
85102

86103
Run on a project:
87104

@@ -91,9 +108,10 @@ codeclone .
91108

92109
This will:
93110

94-
* scan Python files,
95-
* detect function-level and block-level clones,
96-
* print a summary to stdout.
111+
- scan Python files,
112+
- build CFGs for functions,
113+
- detect function-level and block-level clones,
114+
- print a summary to stdout.
97115

98116
Generate reports:
99117

@@ -103,81 +121,97 @@ codeclone . \
103121
--text-out .cache/codeclone/report.txt
104122
```
105123

106-
124+
Generate an HTML report:
125+
126+
```bash
127+
codeclone . --html-out .cache/codeclone/report.html
128+
```
129+
130+
---
107131

108-
Baseline Workflow (Recommended)
132+
## Baseline Workflow (Recommended)
109133

110-
1. Create a baseline
134+
### 1. Create a baseline
111135

112136
Run once on your current codebase:
113137

114138
```bash
115139
codeclone . --update-baseline
116140
```
117141

118-
This creates a file:
142+
Commit the generated baseline file to the repository.
143+
144+
### 2. Use in CI
119145

120146
```bash
121-
.codeclone-baseline.json
147+
codeclone . --fail-on-new
122148
```
123149

124-
Commit this file to the repository.
125-
126-
150+
Behavior:
127151

128-
2. Use in CI
152+
- ✅ existing clones are allowed,
153+
- ❌ build fails if *new* clones appear,
154+
- ✅ refactoring that removes duplication is always allowed.
129155

130-
In CI, run:
156+
---
131157

132-
```bash
133-
codeclone . --fail-on-new
158+
## Using with pre-commit
159+
160+
```yaml
161+
repos:
162+
  - repo: local
163+
    hooks:
164+
      - id: codeclone
165+
        name: CodeClone
166+
        entry: codeclone
167+
        language: python
168+
        args: [".", "--fail-on-new"]
169+
        types: [python]
134170
```
135171
136-
Behavior:
172+
---
137173
138-
* ✅ existing clones are allowed,
139-
* ❌ build fails if new function or block clones appear,
140-
* ✅ refactoring that removes duplication is always allowed.
174+
## What CodeClone Is (and Is Not)
141175
142-
This enables gradual improvement without breaking existing development flow.
176+
### CodeClone **is**
143177
144-
178+
- an architectural analysis tool,
179+
- a duplication radar,
180+
- a CI guard against copy-paste,
181+
- a control-flow-aware clone detector.
145182
146-
What CodeClone Is (and Is Not)
183+
### CodeClone **is not**
147184
148-
CodeClone is
185+
- a linter,
186+
- a formatter,
187+
- a semantic equivalence prover,
188+
- a runtime analyzer.
149189
150-
* an architectural analysis tool,
151-
* a duplication radar,
152-
* a CI guard against copy-paste.
190+
---
191+
192+
## How It Works (High Level)
153193
154-
CodeClone is not
194+
1. Parse Python source into AST.
195+
2. Normalize AST (names, constants, attributes, annotations).
196+
3. Build a **Control Flow Graph (CFG)** per function.
197+
4. Compute stable CFG fingerprints.
198+
5. Detect function-level and block-level clones.
199+
6. Apply conservative filters to suppress noise.
155200
156-
* a linter,
157-
* a formatter,
158-
* a replacement for SonarQube or static analyzers,
159-
* a semantic equivalence prover.
201+
---
160202
161-
It intentionally focuses on high-signal duplication.
203+
## Control Flow Graph (CFG)
162204
163-
205+
Starting from **version 1.1.0**, CodeClone uses a **Control Flow Graph (CFG)**
206+
to improve structural clone detection robustness.
164207
165-
How It Works (High Level)
208+
The CFG is a **structural abstraction**, not a runtime execution model.
166209
167-
* Parses Python source into AST.
168-
* Normalizes:
169-
- variable names,
170-
- constants,
171-
- attributes,
172-
- docstrings and annotations.
173-
* Computes stable structural fingerprints.
174-
* Detects:
175-
- identical function structures,
176-
- repeated statement blocks across functions.
177-
* Applies filters to suppress noise.
210+
See full design and semantics:
211+
- [docs/cfg.md](docs/cfg.md)
178212
179-
213+
---
180214
181-
License
215+
## License
182216
183-
MIT License
217+
MIT License
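
The "Control-Flow Awareness (CFG v1)" bullets in the README diff above can be illustrated with a minimal sketch. This is not the code from this commit: `Block`, `CfgSketch`, and `cfg_fingerprint` are hypothetical names, statements are reduced to their AST node type (a much coarser normalization than the README describes), and the `else` clause of loops is ignored. It only shows the stated v1 semantics: `if`, `for`, and `while` create edges, an after-block is always created, and `break` / `continue` are recorded as ordinary statements.

```python
import ast
import hashlib
from dataclasses import dataclass, field


@dataclass
class Block:
    index: int
    stmts: list[str] = field(default_factory=list)  # normalized statement labels
    succ: list[int] = field(default_factory=list)   # indices of successor blocks


class CfgSketch:
    """Hypothetical, simplified CFG builder (assumption, not the project's API)."""

    def __init__(self) -> None:
        self.blocks: list[Block] = []

    def new_block(self) -> Block:
        block = Block(index=len(self.blocks))
        self.blocks.append(block)
        return block

    def build(self, body: list[ast.stmt], current: Block) -> Block:
        """Add `body` starting at `current`; return the block where flow continues."""
        for stmt in body:
            label = type(stmt).__name__
            current.stmts.append(label)
            if isinstance(stmt, ast.If):
                then_b, else_b, after = self.new_block(), self.new_block(), self.new_block()
                current.succ += [then_b.index, else_b.index]
                self.build(stmt.body, then_b).succ.append(after.index)
                self.build(stmt.orelse, else_b).succ.append(after.index)
                current = after  # the after-block always exists
            elif isinstance(stmt, (ast.For, ast.While)):
                body_b, after = self.new_block(), self.new_block()
                current.succ += [body_b.index, after.index]
                # Back edge from the end of the loop body to the loop header.
                self.build(stmt.body, body_b).succ.append(current.index)
                current = after
            # Everything else, including break/continue, is just a statement.
        return current


def cfg_fingerprint(source: str) -> str:
    """Fingerprint the first function defined in `source`."""
    func = ast.parse(source).body[0]
    cfg = CfgSketch()
    cfg.build(func.body, cfg.new_block())
    text = ";".join(f"{b.index}:{','.join(b.stmts)}->{b.succ}" for b in cfg.blocks)
    return hashlib.sha1(text.encode("utf-8")).hexdigest()
```

Because block indices are assigned in a fixed traversal order, two functions with the same control-flow shape produce the same fingerprint regardless of identifiers or constants, which is the determinism property the changelog mentions.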

codeclone/__init__.py

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,16 @@
1+
"""
2+
CodeClone — AST and CFG-based code clone detector for Python
3+
focused on architectural duplication.
4+
5+
Copyright (c) 2026 Den Rozhnovskiy
6+
Licensed under the MIT License.
7+
"""
8+
9+
from importlib.metadata import version, PackageNotFoundError
10+
11+
try:
12+
    __version__ = version("codeclone")
13+
except PackageNotFoundError:
14+
    __version__ = "dev"
15+
16+
__all__ = ["__version__"]

codeclone/baseline.py

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,11 @@
1+
"""
2+
CodeClone — AST and CFG-based code clone detector for Python
3+
focused on architectural duplication.
4+
5+
Copyright (c) 2026 Den Rozhnovskiy
6+
Licensed under the MIT License.
7+
"""
8+
19
from __future__ import annotations
210

311
import json

codeclone/blockhash.py

Lines changed: 10 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,12 +1,21 @@
1+
"""
2+
CodeClone — AST and CFG-based code clone detector for Python
3+
focused on architectural duplication.
4+
5+
Copyright (c) 2026 Den Rozhnovskiy
6+
Licensed under the MIT License.
7+
"""
8+
19
from __future__ import annotations
210

311
import ast
412
import hashlib
513

614
from .normalize import NormalizationConfig, AstNormalizer
715

16+
817
def stmt_hash(stmt: ast.stmt, cfg: NormalizationConfig) -> str:
918
    normalizer = AstNormalizer(cfg)
1019
    stmt = ast.fix_missing_locations(normalizer.visit(stmt))
1120
    dump = ast.dump(stmt, annotate_fields=True, include_attributes=False)
12-
    return hashlib.sha1(dump.encode("utf-8")).hexdigest()
21+
    return hashlib.sha1(dump.encode("utf-8")).hexdigest()
