Description
Problem
The baseline format in CodeClone currently risks breaking CI for users on upgrade, because schema/metadata evolution can force baseline regeneration too often.
Even if behavior remains deterministic, users perceive baseline changes as unstable CI friction.
We need a baseline contract that can survive patch/minor releases across the whole 1.x line, and require regeneration only when detection fingerprint semantics actually change.
Goal
Introduce a stable, long-lived Baseline v1 schema contract:
- deterministic
- reproducible
- audit-friendly
- CI-safe
- forward extensible inside v1
Baseline regeneration must be required only when fingerprint_version changes, not on patch/minor releases.
Proposed Baseline v1 Structure
{
  "meta": {
    "generator": "codeclone",
    "format_version": "1.0",
    "schema_version": "1.0",
    "fingerprint_version": "1",
    "python_tag": "cp313",
    "created_at": "2026-02-08T11:43:16Z",
    "payload_sha256": "..."
  },
  "clones": {
    "functions": [],
    "blocks": []
  }
}
Compatibility Rules
Must NOT require baseline regeneration
- patch/minor CodeClone releases (1.4.1, 1.5.0, etc.)
- CLI/report/UI changes
- performance refactors
- cache changes
Must require baseline regeneration
- any change that affects clone IDs / grouping semantics:
  - CFG semantics
  - AST normalization rules
  - hashing / fingerprint computation
  - clone extraction rules
This is tracked by fingerprint_version.
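As a minimal sketch of this rule, the regeneration decision can depend on fingerprint_version alone. The function name and the constant below are hypothetical and are not taken from CodeClone's code:

```python
# Hypothetical constant: the fingerprint version implemented by the running tool.
CURRENT_FINGERPRINT_VERSION = "1"


def needs_regeneration(baseline_meta: dict,
                       current: str = CURRENT_FINGERPRINT_VERSION) -> bool:
    """Return True only when clone-ID semantics differ from the baseline's."""
    return baseline_meta.get("fingerprint_version") != current


# A baseline written by an older 1.x release stays valid as long as
# the fingerprint version still matches.
old_meta = {"generator": "codeclone", "schema_version": "1.0", "fingerprint_version": "1"}
print(needs_regeneration(old_meta))  # False
```

Under this sketch, the CodeClone release number never enters the decision, so patch and minor upgrades cannot trigger regeneration by themselves.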
Baseline Validation Pipeline (deterministic order)
Validation must always run in this strict order:
- size guard (--max-baseline-size-mb)
- JSON parse
- strict schema/type validation
- required fields present
- compatibility checks (generator, format_version, schema_version, fingerprint_version, python_tag)
- payload integrity check (payload_sha256)
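The ordering above could be sketched as a single function that returns at the first failing step. This is an illustration under assumptions: the status strings, the `validate_baseline` name, and the injected `hash_payload` callable are all hypothetical, and the type checks only cover the fields shown in the proposed structure.

```python
import json
from typing import Callable

COMPAT_FIELDS = ("generator", "format_version", "schema_version",
                 "fingerprint_version", "python_tag")
REQUIRED_META = COMPAT_FIELDS + ("payload_sha256",)


def validate_baseline(raw: bytes, *, max_size_mb: float, expected: dict,
                      hash_payload: Callable[[dict], str]) -> str:
    """Return a status string (hypothetical names); the first failing step wins."""
    # 1. Size guard runs before any parsing.
    if len(raw) > max_size_mb * 1024 * 1024:
        return "too_large"
    # 2. JSON parse.
    try:
        doc = json.loads(raw)
    except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
        return "invalid_json"
    # 3. Strict schema/type validation of the known fields.
    if not isinstance(doc, dict):
        return "invalid_type"
    meta, clones = doc.get("meta"), doc.get("clones")
    if not isinstance(meta, dict) or not isinstance(clones, dict):
        return "invalid_type"
    if any(k in meta and not isinstance(meta[k], str) for k in REQUIRED_META):
        return "invalid_type"
    if any(not isinstance(clones.get(k), list) for k in ("functions", "blocks")):
        return "invalid_type"
    # 4. Required fields present.
    for key in REQUIRED_META:
        if key not in meta:
            return f"missing_field:{key}"
    # 5. Compatibility checks, always in the same field order.
    for key in COMPAT_FIELDS:
        if key in expected and meta[key] != expected[key]:
            return f"mismatch:{key}"
    # 6. Payload integrity check.
    if hash_payload(doc) != meta["payload_sha256"]:
        return "integrity_mismatch"
    return "ok"
```

Because every step returns immediately, a baseline with several problems always reports the same status, which keeps the result deterministic.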
Integrity Model
payload_sha256 must be computed from a canonical JSON payload containing only:
- clones.functions
- clones.blocks
- meta.format_version
- meta.schema_version
- meta.fingerprint_version
- meta.python_tag
No paths, timestamps, or CodeClone version included.
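One way to sketch the integrity hash is shown below. The issue does not specify a canonicalization, so sorted keys, compact separators, and UTF-8 encoding are assumptions here, as is the `compute_payload_sha256` name.

```python
import hashlib
import json

HASHED_META_FIELDS = ("format_version", "schema_version",
                      "fingerprint_version", "python_tag")


def compute_payload_sha256(baseline: dict) -> str:
    """Hash only the fields listed in the integrity model."""
    payload = {
        "clones": {
            "functions": baseline["clones"]["functions"],
            "blocks": baseline["clones"]["blocks"],
        },
        "meta": {k: baseline["meta"][k] for k in HASHED_META_FIELDS},
    }
    # Assumed canonical form: sorted keys, no insignificant whitespace.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


baseline = {
    "meta": {"generator": "codeclone", "format_version": "1.0",
             "schema_version": "1.0", "fingerprint_version": "1",
             "python_tag": "cp313", "created_at": "2026-02-08T11:43:16Z"},
    "clones": {"functions": [], "blocks": []},
}
digest = compute_payload_sha256(baseline)

# Changing the timestamp must not change the digest.
baseline["meta"]["created_at"] = "2030-01-01T00:00:00Z"
assert compute_payload_sha256(baseline) == digest
```

Building the payload from an explicit allow-list, rather than deleting excluded fields, means fields added to meta later in v1 cannot silently change the digest.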
⸻
Behavior Rules
Normal mode
If baseline is untrusted → warn and compare against empty baseline.
Gating mode (--fail-on-new / --ci)
If baseline is untrusted → fail-fast with exit code 2.
Exit codes must remain stable:
- 0 success
- 2 baseline/contract failure in gating
- 3 new clones detected
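The two modes and the exit codes could be combined roughly as follows. This is a sketch: the `ExitCode` and `resolve_run` names are hypothetical, and it assumes that new clones produce exit code 3 only in gating mode, which the rules above do not state explicitly.

```python
from enum import IntEnum
from typing import List, Set, Tuple


class ExitCode(IntEnum):
    SUCCESS = 0
    CONTRACT_ERROR = 2   # baseline/contract failure in gating
    NEW_CLONES = 3       # new clones detected


def resolve_run(baseline_trusted: bool, gating: bool,
                baseline_ids: Set[str], current_ids: Set[str]
                ) -> Tuple[ExitCode, List[str]]:
    """Return the exit code plus any warnings to show the user."""
    warnings: List[str] = []
    if not baseline_trusted:
        if gating:
            # Fail fast: never gate against a baseline we cannot trust.
            return ExitCode.CONTRACT_ERROR, ["baseline untrusted"]
        warnings.append("baseline untrusted; comparing against empty baseline")
        baseline_ids = set()
    new_clones = current_ids - baseline_ids
    # Assumption: only gating mode turns new clones into a failing exit code.
    if gating and new_clones:
        return ExitCode.NEW_CLONES, warnings
    return ExitCode.SUCCESS, warnings


code, notes = resolve_run(False, True, {"a"}, {"a", "b"})
print(int(code))  # 2
```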
Migration
Existing baseline formats (pre-v1 schema) must be treated as legacy/untrusted with clear regeneration guidance.
No automatic silent migration in default CLI flow.
Required Tests
- happy path baseline load
- generator mismatch
- python_tag mismatch
- fingerprint_version mismatch
- integrity missing/mismatch
- corrupted JSON
- oversized baseline handling
- atomic baseline write regression
- unsorted / duplicate clone list detection
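For the last item, a small helper like the one below could back the test. It is a sketch that assumes clone entries are plain string IDs in ascending order, which the issue does not specify; `find_list_violation` is a hypothetical name.

```python
from typing import List, Optional


def find_list_violation(ids: List[str]) -> Optional[str]:
    """Return 'duplicate', 'unsorted', or None for a strictly ascending list."""
    for prev, cur in zip(ids, ids[1:]):
        if prev == cur:
            return "duplicate"
        if prev > cur:
            return "unsorted"
    return None


def test_clone_list_must_be_sorted_and_unique():
    assert find_list_violation(["a1", "b2", "c3"]) is None
    assert find_list_violation(["a1", "a1"]) == "duplicate"
    assert find_list_violation(["b2", "a1"]) == "unsorted"
```

The test function follows pytest naming, but it only uses plain assertions and can be called directly.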
Acceptance Criteria
- baseline v1 format is documented and treated as stable across 1.x
- CI breakage happens only when fingerprint_version changes
- baseline validation statuses are deterministic and visible in reports
- tests cover all trusted/untrusted baseline states
Notes
Suppressions are explicitly out of scope for baseline v1.
They should be considered a separate artifact or separate feature area later.