Baseline schema stabilization: introduce long-lived v1 contract (meta + clones) and decouple from CodeClone versions #4

@orenlab

Problem

The baseline format in CodeClone currently risks breaking CI for users on upgrade, because schema or metadata evolution can force baseline regeneration too often.

Even if behavior remains deterministic, users perceive baseline changes as unstable CI friction.

We need a baseline contract that survives patch/minor releases across the whole 1.x line and requires regeneration only when detection fingerprint semantics actually change.


Goal

Introduce a stable, long-lived Baseline v1 schema contract:

  • deterministic
  • reproducible
  • audit-friendly
  • CI-safe
  • forward extensible inside v1

Baseline regeneration must be required only when fingerprint_version changes, not on patch/minor releases.


Proposed Baseline v1 Structure

{
  "meta": {
    "generator": "codeclone",
    "format_version": "1.0",
    "schema_version": "1.0",
    "fingerprint_version": "1",
    "python_tag": "cp313",
    "created_at": "2026-02-08T11:43:16Z",
    "payload_sha256": "..."
  },
  "clones": {
    "functions": [],
    "blocks": []
  }
}
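
For reference, the structure above could be described with typed dictionaries roughly as follows. This is only a sketch: the class names are hypothetical, and the element type of `functions` / `blocks` is assumed to be a string ID, since the proposal shows only empty lists.

```python
from typing import TypedDict


class BaselineMeta(TypedDict):
    generator: str
    format_version: str
    schema_version: str
    fingerprint_version: str
    python_tag: str
    created_at: str        # ISO 8601 UTC timestamp, e.g. "2026-02-08T11:43:16Z"
    payload_sha256: str


class BaselineClones(TypedDict):
    # Assumption: each entry is a clone ID string; the issue does not define the entry shape.
    functions: list[str]
    blocks: list[str]


class BaselineV1(TypedDict):
    meta: BaselineMeta
    clones: BaselineClones
```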

Compatibility Rules

Must NOT require baseline regeneration

  • patch/minor CodeClone releases (1.4.1, 1.5.0, etc.)
  • CLI/report/UI changes
  • performance refactors
  • cache changes

Must require baseline regeneration

  • any change that affects clone IDs / grouping semantics:
      • CFG semantics
      • AST normalization rules
      • hashing / fingerprint computation
      • clone extraction rules

This is tracked by fingerprint_version.
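
A minimal sketch of that rule, assuming the tool carries its own fingerprint version as a constant (the constant and function names here are hypothetical). The CodeClone release version is deliberately not an input:

```python
# Hypothetical constant: bumped only when clone IDs / grouping semantics change,
# never for patch/minor releases, CLI/report changes, refactors, or cache changes.
FINGERPRINT_VERSION = "1"


def regeneration_required(baseline_meta: dict,
                          fingerprint_version: str = FINGERPRINT_VERSION) -> bool:
    """Return True only when the baseline was built under other fingerprint semantics."""
    return baseline_meta.get("fingerprint_version") != fingerprint_version
```

Under this sketch, a baseline written by 1.4.1 and read by 1.5.0 stays valid as long as both releases share the same fingerprint version.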


Baseline Validation Pipeline (deterministic order)

Validation must always run in this strict order:

  • size guard (--max-baseline-size-mb)
  • JSON parse
  • strict schema/type validation
  • required fields present
  • compatibility checks (generator, format_version, schema_version, fingerprint_version, python_tag)
  • payload integrity check (payload_sha256)

Integrity Model

payload_sha256 must be computed from a canonical JSON payload containing only:

  • clones.functions
  • clones.blocks
  • meta.format_version
  • meta.schema_version
  • meta.fingerprint_version
  • meta.python_tag

Paths, timestamps, and the CodeClone version are not included.
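
One possible way to compute the hash, as a sketch. The proposal does not yet pin down what "canonical JSON" means, so the choice here (sorted keys, compact separators, UTF-8) is an assumption, and the function name is hypothetical.

```python
import hashlib
import json


def compute_payload_sha256(baseline: dict) -> str:
    """Hash only the fields named in the integrity model."""
    meta = baseline["meta"]
    clones = baseline["clones"]
    payload = {
        "clones": {
            "functions": clones["functions"],
            "blocks": clones["blocks"],
        },
        "meta": {
            "format_version": meta["format_version"],
            "schema_version": meta["schema_version"],
            "fingerprint_version": meta["fingerprint_version"],
            "python_tag": meta["python_tag"],
        },
    }
    # Assumed canonical form: key order and whitespace must not affect the digest.
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because `created_at` and `generator` are left out of the payload, regenerating an identical baseline at a different time yields the same digest.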

Behavior Rules

Normal mode

If baseline is untrusted → warn and compare against empty baseline.

Gating mode (--fail-on-new / --ci)

If baseline is untrusted → fail-fast with exit code 2.

Exit codes must remain stable:

  • 0 success
  • 2 baseline/contract failure in gating
  • 3 new clones detected
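
The two modes could be sketched as below. The function name and the clone-ID sets are hypothetical. The sketch also assumes that normal mode exits with 0 even when new clones are found (since `--fail-on-new` is what enables gating), which this issue does not state explicitly.

```python
import sys


def decide(baseline_ids, trusted, gating, current_ids):
    """Return (exit_code, new_ids) for one run."""
    if not trusted:
        if gating:
            # Gating mode: fail fast, do not compare at all.
            return 2, set()
        # Normal mode: warn and fall back to an empty baseline.
        print("warning: baseline is untrusted; comparing against an empty baseline",
              file=sys.stderr)
        baseline_ids = set()
    new_ids = set(current_ids) - set(baseline_ids)
    if gating and new_ids:
        return 3, new_ids
    return 0, new_ids
```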

Migration

Existing baseline formats (pre-v1 schema) must be treated as legacy/untrusted with clear regeneration guidance.

The default CLI flow must not perform any automatic silent migration.


Required Tests

  • happy path baseline load
  • generator mismatch
  • python_tag mismatch
  • fingerprint_version mismatch
  • integrity missing/mismatch
  • corrupted JSON
  • oversized baseline handling
  • atomic baseline write regression
  • unsorted / duplicate clone list detection
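
As an illustration of the last item, a sketch of the check and its tests. It assumes clone list entries are comparable ID strings and that "sorted" means strictly ascending; the helper name is hypothetical.

```python
def is_sorted_unique(items: list[str]) -> bool:
    """True when items are strictly ascending, i.e. sorted with no duplicates."""
    return all(a < b for a, b in zip(items, items[1:]))


def test_sorted_unique_list_is_accepted():
    assert is_sorted_unique(["a", "b", "c"])


def test_unsorted_list_is_rejected():
    assert not is_sorted_unique(["b", "a"])


def test_duplicate_entries_are_rejected():
    assert not is_sorted_unique(["a", "a", "b"])


def test_empty_list_is_accepted():
    assert is_sorted_unique([])
```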

Acceptance Criteria

  • baseline v1 format is documented and treated as stable across 1.x
  • CI breakage happens only when fingerprint_version changes
  • baseline validation statuses are deterministic and visible in reports
  • tests cover all trusted/untrusted states

Notes

Suppressions are explicitly out of scope for baseline v1.

They should be considered a separate artifact or separate feature area later.

Labels: bug, enhancement
