Commit 3e47109

Merge pull request #2 from orenlab/feat/1.2.1
Merge feat/1.2.1: security hardening, deterministic baseline checks, UX polish
2 parents 5769d1c + d8139ad commit 3e47109

45 files changed

Lines changed: 5273 additions & 1213 deletions

.github/actions/codeclone/action.yml

Lines changed: 44 additions & 4 deletions
@@ -10,6 +10,16 @@ branding:
1010
color: blue
1111

1212
inputs:
13+
python-version:
14+
description: "Python version to use"
15+
required: false
16+
default: "3.13"
17+
18+
package-version:
19+
description: "CodeClone version from PyPI (empty = latest)"
20+
required: false
21+
default: ""
22+
1323
path:
1424
description: "Path to the project root"
1525
required: false
@@ -20,20 +30,50 @@ inputs:
2030
required: false
2131
default: "true"
2232

33+
no-progress:
34+
description: "Disable progress output"
35+
required: false
36+
default: "true"
37+
38+
require-baseline:
39+
description: "Fail if codeclone.baseline.json is missing"
40+
required: false
41+
default: "true"
42+
2343
runs:
2444
using: composite
2545
steps:
46+
- name: Set up Python
47+
uses: actions/setup-python@v5
48+
with:
49+
python-version: ${{ inputs.python-version }}
50+
cache: pip
51+
2652
- name: Install CodeClone
2753
shell: bash
2854
run: |
2955
python -m pip install --upgrade pip
30-
pip install codeclone
56+
if [ -n "${{ inputs.package-version }}" ]; then
57+
pip install "codeclone==${{ inputs.package-version }}"
58+
else
59+
pip install codeclone
60+
fi
61+
62+
- name: Verify baseline exists
63+
if: ${{ inputs.require-baseline == 'true' }}
64+
shell: bash
65+
run: |
66+
test -f "${{ inputs.path }}/codeclone.baseline.json"
3167
3268
- name: Run CodeClone
3369
shell: bash
3470
run: |
71+
extra=""
72+
if [ "${{ inputs.no-progress }}" = "true" ]; then
73+
extra="--no-progress"
74+
fi
3575
if [ "${{ inputs.fail-on-new }}" = "true" ]; then
36-
codeclone "${{ inputs.path }}" --fail-on-new
76+
codeclone "${{ inputs.path }}" --fail-on-new $extra
3777
else
38-
codeclone "${{ inputs.path }}"
39-
fi
78+
codeclone "${{ inputs.path }}" $extra
79+
fi

.github/workflows/tests.yml

Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
1+
name: tests
2+
3+
on:
4+
push:
5+
branches: [ "**" ]
6+
pull_request:
7+
8+
permissions:
9+
contents: read
10+
11+
concurrency:
12+
group: tests-${{ github.ref }}
13+
cancel-in-progress: true
14+
15+
jobs:
16+
test:
17+
runs-on: ubuntu-latest
18+
strategy:
19+
fail-fast: false
20+
matrix:
21+
python-version: [ "3.10", "3.11", "3.12", "3.13", "3.14" ]
22+
steps:
23+
- name: Checkout
24+
uses: actions/checkout@v6.0.2
25+
26+
- name: Set up Python
27+
uses: actions/setup-python@v6.2.0
28+
with:
29+
python-version: ${{ matrix.python-version }}
30+
allow-prereleases: true
31+
32+
- name: Set up uv
33+
uses: astral-sh/setup-uv@v5
34+
with:
35+
enable-cache: true
36+
37+
- name: Install dependencies
38+
run: uv sync --all-extras --dev
39+
40+
- name: Run tests
41+
run: uv run pytest --cov=codeclone --cov-report=term-missing --cov-fail-under=98
42+
43+
- name: Verify baseline exists
44+
if: ${{ matrix.python-version == '3.13' }}
45+
run: test -f codeclone.baseline.json
46+
47+
- name: Check for new clones vs baseline
48+
if: ${{ matrix.python-version == '3.13' }}
49+
run: uv run codeclone . --fail-on-new --no-progress
50+
51+
lint:
52+
runs-on: ubuntu-latest
53+
steps:
54+
- name: Checkout
55+
uses: actions/checkout@v6.0.2
56+
57+
- name: Set up Python
58+
uses: actions/setup-python@v6.2.0
59+
with:
60+
python-version: "3.13"
61+
62+
- name: Set up uv
63+
uses: astral-sh/setup-uv@v5
64+
with:
65+
enable-cache: true
66+
67+
- name: Install dependencies
68+
run: uv sync --all-extras --dev
69+
70+
- name: Ruff
71+
run: uv run ruff check .
72+
73+
- name: Mypy
74+
run: uv run mypy .

CHANGELOG.md

Lines changed: 153 additions & 38 deletions
@@ -1,71 +1,186 @@
11
# Changelog
22

3+
## [1.2.1] - 2026-02-02
4+
5+
### Overview
6+
7+
This release focuses on security hardening, robustness, and long-term maintainability.
8+
No breaking API changes were introduced.
9+
10+
The goal of this release is to provide users with a safe, deterministic, and CI-friendly
11+
tool suitable for security-sensitive and large-scale environments.
12+
13+
### Security & Robustness
14+
15+
- **Path Traversal Protection**
16+
Implemented strict path validation to prevent scanning outside the project root or
17+
accessing sensitive system directories, including macOS `/private` paths.
18+
19+
- **Cache Integrity Protection**
20+
Added HMAC-SHA256 signing for cache files to prevent cache poisoning and detect tampering.
21+
22+
- **Parser Safety Limits**
23+
Introduced AST parsing time limits to mitigate risks from pathological or adversarial inputs.
24+
25+
- **Resource Exhaustion Protection**
26+
Enforced a maximum file size limit (10MB) and a maximum file count per scan to prevent
27+
excessive memory or CPU usage.
28+
29+
- **Structured Error Handling**
30+
Introduced a dedicated exception hierarchy (`ParseError`, `CacheError`, etc.) and replaced
31+
broad exception handling with graceful, user-friendly failure reporting.
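The cache-signing item above can be illustrated with a minimal sketch. It assumes the cache is a JSON document wrapped in an envelope with a `payload` and a `sig` field; the envelope layout, the function names, and the key handling are all hypothetical and are not taken from the codeclone source.

```python
from __future__ import annotations

import hashlib
import hmac
import json


def _canonical(payload: dict) -> bytes:
    # Stable serialization so the same payload always yields the same bytes.
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_payload(payload: dict, key: bytes) -> dict:
    """Wrap a payload with an HMAC-SHA256 signature (hypothetical envelope layout)."""
    sig = hmac.new(key, _canonical(payload), hashlib.sha256).hexdigest()
    return {"payload": payload, "sig": sig}


def verify_payload(envelope: dict, key: bytes) -> dict | None:
    """Return the payload if the signature matches, otherwise None."""
    payload = envelope.get("payload")
    sig = envelope.get("sig")
    if not isinstance(payload, dict) or not isinstance(sig, str):
        return None
    expected = hmac.new(key, _canonical(payload), hashlib.sha256).hexdigest()
    # Compare as bytes in constant time; a str comparison would raise on non-ASCII input.
    if hmac.compare_digest(sig.encode("utf-8"), expected.encode("utf-8")):
        return payload
    return None
```

A caller that gets `None` back would discard the cache and start fresh, which matches the "warn instead of silently ignoring" behavior described in this release.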
32+
33+
### Performance Improvements
34+
35+
- **Optimized AST Normalization**
36+
Replaced expensive `deepcopy` operations with in-place AST normalization, significantly
37+
reducing CPU and memory overhead.
38+
39+
- **Improved Memory Efficiency**
40+
Added an LRU cache for file reading and optimized string concatenation during fingerprint
41+
generation.
42+
43+
- **HTML Report Memory Bounds**
44+
HTML reports now read only the required line ranges instead of entire files, reducing peak
45+
memory usage on large codebases.
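As a rough illustration of the bounded report reads described above, the sketch below returns only a requested line range. The function name and the 1-based inclusive convention are assumptions for this example, not the project's actual API.

```python
from itertools import islice
from pathlib import Path


def read_line_range(path: Path, start: int, end: int) -> list[str]:
    """Return lines start..end (1-based, inclusive) without holding the whole file."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    with path.open("r", encoding="utf-8", errors="replace") as fh:
        # islice streams through the file and keeps only the requested window.
        return [line.rstrip("\n") for line in islice(fh, start - 1, end)]
```

The file is still scanned up to `end`, but at most `end - start + 1` lines are kept in memory at once.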
46+
47+
### Architecture & Maintainability
48+
49+
- **Strict Type Safety**
50+
Migrated all optional typing to Python 3.10+ `| None` syntax and achieved 100% `mypy` strict
51+
compliance.
52+
53+
- **Modular CFG Design**
54+
Split CFG data structures and builder logic into separate modules (`cfg_model.py` and
55+
`cfg.py`) for improved clarity and extensibility.
56+
57+
- **Template Extraction**
58+
Extracted HTML templates into a dedicated `templates.py` module.
59+
60+
- Added a `py.typed` marker for downstream type checkers.
61+
- Added `__slots__` to performance-critical classes to reduce per-object memory overhead.
62+
63+
### CLI & User Experience
64+
65+
- Added a sequential execution fallback when process pools are unavailable (for example, in
66+
restricted or sandboxed environments).
67+
- Emit clear, user-visible warnings when cache validation fails instead of silently ignoring
68+
corrupted state.
69+
- Hardened HTML report template to safely embed JavaScript template literals and aligned it
70+
with linting requirements.
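The sequential fallback mentioned in the first item of this list might look roughly like the sketch below. The function name, the injectable `executor_factory` parameter, and the exact set of caught exceptions are assumptions made for illustration; the real CLI may detect pool failures differently.

```python
from __future__ import annotations

from concurrent.futures import BrokenExecutor, Executor, ProcessPoolExecutor
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def map_with_fallback(
    func: Callable[[T], R],
    items: Iterable[T],
    executor_factory: Callable[[], Executor] = ProcessPoolExecutor,
) -> list[R]:
    """Map func over items in a pool, or in-process if the pool is unavailable."""
    work = list(items)  # materialize so the fallback can iterate again
    try:
        with executor_factory() as pool:
            return list(pool.map(func, work))
    except (OSError, NotImplementedError, ImportError, BrokenExecutor):
        # Sandboxes often forbid the semaphores or forks a process pool needs.
        # Note: an OSError raised by func itself also lands here and is retried.
        return [func(item) for item in work]
```

Making the factory injectable keeps the fallback path testable without needing an environment where process pools actually fail.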
71+
72+
### Testing & Quality
73+
74+
- Expanded unit and integration test coverage across the CLI, CFG construction, cache
75+
handling, scanner, and HTML reporting paths.
76+
- Added security regression tests for dot-dot traversal and symlinked sensitive directories.
77+
- Tightened cache mismatch assertions to verify full state reset.
78+
- Achieved and enforced 98%+ line coverage, with coverage configuration added to
79+
`pyproject.toml`.
80+
- Added GitHub Actions workflow with Python 3.10–3.14 test matrix, including `ruff` and
81+
`mypy` checks.
82+
- CI baseline enforcement now runs on a single pinned Python version to avoid AST dump
83+
differences across interpreter versions.
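The traversal regression tests listed above need some containment predicate to exercise. A minimal sketch of one is shown here, assuming the check is done by resolving both paths; `is_within_root` is a hypothetical helper, and the actual validation in codeclone (including its handling of sensitive system directories) is not shown in this diff.

```python
from pathlib import Path


def is_within_root(candidate: Path, root: Path) -> bool:
    """True if candidate stays under root after resolving '..' and symlinks."""
    resolved_root = root.resolve()
    resolved = candidate.resolve()  # non-strict: also works for missing paths
    return resolved == resolved_root or resolved_root in resolved.parents
```

Resolving both sides first matters on macOS, where temporary directories under `/var` are symlinks into `/private/var`.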
84+
85+
### Python Version Consistency for Baseline Checks
86+
87+
Due to inherent differences in Python’s AST between interpreter versions, baseline
88+
generation and verification must be performed using the same Python version.
89+
90+
The baseline file now stores the Python version (`major.minor`) used during generation.
91+
When running with `--fail-on-new`, codeclone verifies that the current interpreter version
92+
matches the baseline and exits with code 2 if they differ.
93+
94+
This design ensures deterministic and reproducible clone detection results while preserving
95+
support for Python 3.10–3.14 across the test matrix.
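The version check described above can be sketched as follows. The `python_version` key name and the function names are hypothetical; only the `major.minor` format and exit code 2 come from the changelog text.

```python
from __future__ import annotations

import sys

EXIT_VERSION_MISMATCH = 2  # exit code stated in the changelog


def current_python_tag() -> str:
    """Interpreter version in major.minor form, e.g. '3.13'."""
    return f"{sys.version_info.major}.{sys.version_info.minor}"


def check_baseline_python(baseline: dict, current: str | None = None) -> int:
    """Return 0 if the baseline's recorded version matches, else 2."""
    current = current or current_python_tag()
    recorded = baseline.get("python_version")  # hypothetical key name
    if recorded != current:
        # Assumption: a missing key is treated as a mismatch in this sketch;
        # the real tool may handle older baselines differently.
        print(
            f"baseline was generated with Python {recorded}, running {current}",
            file=sys.stderr,
        )
        return EXIT_VERSION_MISMATCH
    return 0
```

This is also why the CI workflow in this commit runs the baseline check only on the 3.13 matrix entry.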
96+
97+
### Fixed
98+
99+
- **CFG Exception Handling**
100+
Fixed incorrect control-flow linking for `try`/`except` blocks.
101+
102+
- **Pattern Matching Support**
103+
Added missing structural handling for `match`/`case` statements in the CFG.
104+
105+
- **Block Detection Scaling**
106+
Made `MIN_LINE_DISTANCE` dynamic based on block size to improve clone detection accuracy
107+
across differently sized functions.
108+
109+
---
110+
3111
## [1.2.0] - 2026-02-02
4112

5113
### BREAKING CHANGES
6114

7-
- **CLI Arguments**: Renamed output flags for brevity and consistency:
115+
- **CLI Arguments**
116+
Renamed output flags for brevity and consistency:
8117
- `--json-out` → `--json`
9118
- `--text-out` → `--text`
10119
- `--html-out` → `--html`
11120
- `--cache` → `--cache-dir`
12-
- **Baseline Behavior**:
13-
- The default baseline file location has changed from `~/.config/codeclone/baseline.json` to
14-
`./codeclone.baseline.json`. This encourages committing the baseline file to the repository, simplifying CI/CD
15-
integration.
16-
- The CLI now warns if a baseline file is expected but missing (unless `--update-baseline` is used).
121+
122+
- **Baseline Behavior**
123+
- The default baseline file location changed from
124+
`~/.config/codeclone/baseline.json` to `./codeclone.baseline.json`.
125+
- The CLI now warns if a baseline file is expected but missing (unless
126+
`--update-baseline` is used).
17127

18128
### Added
19129

20-
- **Detection Engine**:
21-
- **Deep CFG Analysis**: Added support for constructing control flow graphs for `try`/`except`/`finally`, `with`/
22-
`async with`, and `match`/`case` (Python 3.10+) statements. The tool now analyzes the internal structure of these
23-
blocks instead of treating them as opaque statements.
24-
- **Normalization**: Implemented normalization for Augmented Assignments. Code using `x += 1` is now detected as a
25-
clone of `x = x + 1`.
26-
- **Rich Output**: Integrated `rich` library for professional CLI output, including:
27-
- Color-coded status messages (Success/Warning/Error).
28-
- Progress bars and spinners for long-running tasks.
130+
- **Detection Engine**
131+
- Deep CFG analysis for `try`/`except`/`finally`, `with`/`async with`, and
132+
`match`/`case` (Python 3.10+) statements.
133+
- Normalization for augmented assignments (`x += 1` vs `x = x + 1`).
134+
135+
- **Rich Output**
136+
- Color-coded status messages.
137+
- Progress indicators for long-running tasks.
29138
- Formatted summary tables.
30-
- **CI/CD Improvements**: Clearer separation of arguments in `--help` output (Target, Tuning, Baseline, Reporting).
139+
140+
- **CI/CD Improvements**
141+
- Clearer argument grouping in `--help` output.
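The augmented-assignment normalization listed under Detection Engine above can be illustrated with a small `ast.NodeTransformer`. This is a sketch under stated assumptions, not the project's normalizer: the class and function names are hypothetical, and it treats `x += 1` as `x = x + 1` purely for fingerprinting, even though the two can differ at runtime for mutable objects.

```python
import ast
import copy


class AugAssignNormalizer(ast.NodeTransformer):
    """Rewrite `target op= value` as `target = target op value`."""

    def visit_AugAssign(self, node: ast.AugAssign) -> ast.AST:
        self.generic_visit(node)
        # The right-hand side reads the target, so it needs a Load context.
        load_target = copy.deepcopy(node.target)
        load_target.ctx = ast.Load()
        assign = ast.Assign(
            targets=[node.target],
            value=ast.BinOp(left=load_target, op=node.op, right=node.value),
            type_comment=None,
        )
        return ast.fix_missing_locations(ast.copy_location(assign, node))


def normalized_dump(source: str) -> str:
    """Parse source, normalize augmented assignments, and dump the tree."""
    tree = AugAssignNormalizer().visit(ast.parse(source))
    return ast.dump(tree)
```

With this transformer, `normalized_dump("x += 1")` and `normalized_dump("x = x + 1")` produce the same string, so both forms hash to the same fingerprint.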
31142

32143
### Improved
33144

34-
- **Baseline**: Enhanced `Baseline` class with safer JSON loading (error handling for corrupted files), better typing (
35-
using `set` instead of `Set`), and cleaner API for creating instances (`from_groups` accepts path).
36-
- **Cache**: Refactored `Cache` to handle corrupted cache files gracefully by starting fresh instead of crashing.
37-
Updated typing to modern standards.
38-
- **Normalization**: Added `copy.deepcopy` to AST normalization to prevent side effects on the original AST nodes during
39-
fingerprinting. This ensures the AST remains intact for any subsequent operations.
40-
- **Typing**: General typing improvements across `report.py` and other modules to align with Python 3.10+ practices.
145+
- **Baseline**
146+
- Safer JSON loading.
147+
- Improved typing and cleaner construction API.
148+
149+
- **Cache**
150+
- Graceful recovery from corrupted cache files.
151+
- Updated typing to modern Python standards.
152+
153+
- **Typing**
154+
- General typing improvements across reporting and normalization modules.
155+
156+
---
41157

42-
## [1.1.0] 2026-01-19
158+
## [1.1.0] - 2026-01-19
43159

44160
### Added
45161

46-
- Control Flow Graph (CFG v1) for structural clone detection
47-
- Deterministic CFG-based function fingerprints
48-
- Interactive HTML report with syntax highlighting
49-
- Dark/light theme toggle in HTML report
50-
- Block-level clone visualization
162+
- Control Flow Graph (CFG v1) for structural clone detection.
163+
- Deterministic CFG-based function fingerprints.
164+
- Interactive HTML report with syntax highlighting.
165+
- Block-level clone visualization.
51166

52167
### Changed
53168

54-
- Function clone detection now based on CFG instead of pure AST
55-
- Improved robustness against refactoring and control-flow changes
169+
- Function clone detection now based on CFG instead of pure AST.
170+
- Improved robustness against refactoring and control-flow changes.
56171

57172
### Documentation
58173

59-
- Added `docs/cfg.md` with CFG semantics and limitations
60-
- Added `docs/architecture.md` describing system design
174+
- Added `docs/cfg.md` with CFG semantics and limitations.
175+
- Added `docs/architecture.md` describing system design.
61176

62177
---
63178

64-
## [1.0.0] 2026-01-17
179+
## [1.0.0] - 2026-01-17
65180

66181
### Initial release
67182

68-
- AST-based function clone detection
69-
- Block-level clone detection (Type-3-lite)
70-
- Baseline workflow for CI
71-
- JSON and text reports
183+
- AST-based function clone detection.
184+
- Block-level clone detection (Type-3-lite).
185+
- Baseline workflow for CI.
186+
- JSON and text reports.
