
claude-code-chat-browser: schema drift detection for upstream JSONL c…#108

Open
clean6378-max-it wants to merge 2 commits into master from feat/schema-drift-detection

Conversation

clean6378-max-it commented Jul 2, 2026

Collaborator

Closes #103

Summary

Detect upstream Claude Code JSONL schema drift by fingerprinting field paths during parse_session() and diffing against a committed schema_baseline.json. New or missing required paths emit warnings on the claude_code_chat_browser.schema_drift logger; the web UI shows a dismissible amber banner on the session list page; GET /api/schema-report returns {known_fields, new_fields, missing_fields, has_drift}.
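
For readers who want a concrete picture of the fingerprint-and-diff step, here is a minimal sketch. It is not the code in utils/schema_drift.py: the helper names `collect_paths` and `diff_paths`, the `[]` marker for list elements, and the baseline shape (path mapped to a required flag) are assumptions made for illustration. Only the four report keys come from the description above.

```python
from typing import Any


def collect_paths(value: Any, prefix: str = "") -> set[str]:
    """Recursively collect dotted field paths from one decoded JSONL entry.

    Assumption: every list element shares the parent path plus "[]".
    """
    paths: set[str] = set()
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= collect_paths(child, path)
    elif isinstance(value, list):
        for item in value:
            paths |= collect_paths(item, f"{prefix}[]")
    return paths


def diff_paths(observed: set[str], baseline: dict[str, bool]) -> dict[str, Any]:
    """Compare observed paths with a hypothetical baseline of {path: required}."""
    baseline_paths = set(baseline)
    new = sorted(observed - baseline_paths)
    # Only paths flagged as required count as missing.
    missing = sorted(p for p, required in baseline.items() if required and p not in observed)
    return {
        "known_fields": sorted(observed & baseline_paths),
        "new_fields": new,
        "missing_fields": missing,
        "has_drift": bool(new or missing),
    }


# Illustrative entry with one key the baseline does not know about.
entry = {"type": "assistant", "message": {"content": [{"type": "tool_use", "futureKey": 1}]}}
baseline = {
    "type": True,
    "message": False,
    "message.content": False,
    "message.content[].type": False,
    "uuid": True,
}
report = diff_paths(collect_paths(entry), baseline)
```

In this sketch `futureKey` surfaces as a new field and the required `uuid` path as a missing one, so `has_drift` is true. Optional baseline paths that are absent do not count as drift.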

Closes sprint item 5 (July Week 1 Thursday, 5 pt).

Changes

  • utils/schema_drift.py — baseline loader, recursive path collection, drift diff, process-wide report
  • utils/jsonl_parser.py — _collect_field_paths() wrapper; drift check at parse completion
  • schema_baseline.json — 98 known (json_path, expected_type) pairs; only type is required
  • api/schema_report.py — GET /api/schema-report
  • static/js/sessions.js + static/css/style.css — dismissible amber warning banner (sessionStorage fingerprint)
  • tests/fixtures/jsonl/unknown_field.jsonl — synthetic FutureToolXYZ / unknown tool key fixture
  • tests/test_schema_drift.py — 10 tests covering warnings, API, false-positive guard, merge behavior
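
The "process-wide report" and "merge behavior" items could look roughly like the sketch below. It assumes each parsed session contributes its known, new, and missing sets and that the report is a plain union of them. The names `DriftReport`, `merge_session_diff`, `reset_report`, and `current_payload` are hypothetical, and the real module may merge differently.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class DriftReport:
    """Accumulated drift across every session parsed in this process."""

    known_fields: set[str] = field(default_factory=set)
    new_fields: set[str] = field(default_factory=set)
    missing_fields: set[str] = field(default_factory=set)

    def to_payload(self) -> dict[str, object]:
        # Sorted lists keep the JSON response stable between requests.
        return {
            "known_fields": sorted(self.known_fields),
            "new_fields": sorted(self.new_fields),
            "missing_fields": sorted(self.missing_fields),
            "has_drift": bool(self.new_fields or self.missing_fields),
        }


_lock = threading.Lock()
_report = DriftReport()


def merge_session_diff(known: set[str], new: set[str], missing: set[str]) -> None:
    """Fold one session's diff into the process-wide report (assumed: simple union)."""
    with _lock:
        _report.known_fields |= known
        _report.new_fields |= new
        _report.missing_fields |= missing


def reset_report() -> None:
    """Clear accumulated state, for example between tests."""
    global _report
    with _lock:
        _report = DriftReport()


def current_payload() -> dict[str, object]:
    """Return the shape served by the report endpoint."""
    with _lock:
        return _report.to_payload()


reset_report()
merge_session_diff({"type"}, {"toolUseResult.extra"}, set())
merge_session_diff({"type", "uuid"}, set(), {"cwd"})
payload = current_payload()
```

The lock matters only if sessions are parsed from more than one thread. With a single-threaded server it is harmless overhead.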

Out of scope

  • Auto-updating baseline on every parse (updates are deliberate)
  • Blocking parse on drift
  • CLI command (API endpoint chosen instead)

Test plan

  • pytest tests/ -k schema -q
  • pytest -q (443 passed)
  • mypy -p api -p utils -p models
  • ruff check .
  • Manual: parse tests/fixtures/jsonl/unknown_field.jsonl — confirm warning logged and amber banner in UI
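
The manual "confirm warning logged" step can also be checked in code. The sketch below attaches a small list-backed handler to the claude_code_chat_browser.schema_drift logger named in the summary. The `warn_on_drift` helper and the message wording are assumptions, not the project's actual output.

```python
import logging

logger = logging.getLogger("claude_code_chat_browser.schema_drift")


def warn_on_drift(new_fields: set[str], missing_fields: set[str]) -> None:
    """Emit one warning per drift category (hypothetical message wording)."""
    if new_fields:
        logger.warning("New field paths not in baseline: %s", ", ".join(sorted(new_fields)))
    if missing_fields:
        logger.warning("Required field paths missing: %s", ", ".join(sorted(missing_fields)))


class ListHandler(logging.Handler):
    """Collects formatted messages so a test can inspect them."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


handler = ListHandler()
logger.addHandler(handler)
try:
    warn_on_drift({"message.content[].futureKey"}, set())
finally:
    # Always detach so later tests do not see stale handlers.
    logger.removeHandler(handler)
```

Under pytest the built-in `caplog` fixture does the same job without a custom handler.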

Summary by CodeRabbit

  • New Features
    • Added a schema drift warning banner in the workspace to surface newly detected or missing session fields (up to 5 new and 5 missing required items).
    • Added a schema drift report endpoint so the UI can display current drift status.
  • Bug Fixes
    • Banner dismissals are remembered per drift state, preventing repeated pop-ups.
    • If the schema baseline is invalid, drift tracking is skipped without breaking session handling.
  • Style
    • Added warning-themed banner styling in both light and dark modes, including a dismiss button.

…hanges (#5)

Fingerprint known Claude Code JSONL field paths against a committed
schema_baseline.json, warn on drift during parsing, expose GET
/api/schema-report, and surface a dismissible amber banner on the
session list page. Warnings only - parsing is never blocked.

coderabbitai Bot commented Jul 2, 2026



No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ee9dded3-eb93-44e7-8e9f-4f0bca3b9425

📥 Commits

Reviewing files that changed from the base of the PR and between 8a7f170 and 252429c.

📒 Files selected for processing (5)
  • benchmarks/baselines.json
  • static/js/sessions.js
  • static/js/sessions.test.js
  • tests/test_schema_drift.py
  • utils/schema_drift.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/test_schema_drift.py
  • static/js/sessions.js

📝 Walkthrough


Adds JSONL schema drift tracking from parser to API and UI, backed by a committed baseline, updated tests, and refreshed benchmark baselines.

Changes

Schema Drift Detection

| Layer | File(s) | Summary |
|---|---|---|
| Baseline field catalog | schema_baseline.json | Defines the committed field-path catalog, including required flags and typed entries for message, tool, and metadata fields. |
| Schema drift detection module | utils/schema_drift.py | Collects dotted field paths, loads the baseline, computes known/new/missing sets, accumulates reports, and supports reset/cache clearing. |
| Parser and API wiring | utils/jsonl_parser.py, api/schema_report.py, app.py | Records observed field paths during session parsing, exposes /api/schema-report, and registers the new blueprint in the app. |
| UI drift banner | static/js/sessions.js, static/css/style.css | Fetches the schema report, renders a dismissible banner when drift exists, persists dismissal state, and adds warning styles. |
| Tests and benchmarks | tests/test_schema_drift.py, static/js/sessions.test.js, tests/fixtures/jsonl/unknown_field.jsonl, benchmarks/baselines.json | Adds fixture coverage, schema drift tests, UI call-order/banner tests, and updated benchmark baselines. |

Estimated code review effort: 3 (Moderate) | ~25 minutes

Possibly related PRs

Suggested reviewers: timon0305, wpak-ai

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title is specific and matches the main change: schema drift detection for Claude Code JSONL parsing. |
| Linked Issues check | ✅ Passed | The PR meets the linked issue goals: field-path fingerprinting, baseline comparison, warnings, banner, API report, fixture, and non-blocking parsing. |
| Out of Scope Changes check | ✅ Passed | The changes appear scoped to schema-drift detection and related tests, UI, baseline, and benchmark updates; no unrelated code changes stand out. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
utils/schema_drift.py (2)

90-121: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

Baseline JSON is re-read and re-parsed from disk on every parse_session() call.

load_baseline_fields() does file I/O + json.loads and is invoked unconditionally by diff_against_baseline(), which per the parser integration snippet runs once per session file. On a session-list page rendering many sessions, this repeats disk I/O + parsing for a file that never changes at runtime. Consider loading/parsing the baseline once (module import or lazily-cached, e.g. via functools.lru_cache) instead of on every parse.

♻️ Example: cache the parsed baseline
+from functools import lru_cache
+
+
+@lru_cache(maxsize=1)
+def load_baseline_fields() -> dict[str, SchemaFieldSpec]:
     """Load ``schema_baseline.json`` field specs keyed by dotted path."""
     raw = json.loads(BASELINE_PATH.read_text(encoding="utf-8"))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@utils/schema_drift.py` around lines 90 - 121, The baseline JSON is being
re-read and re-parsed on every call through diff_against_baseline() and
load_baseline_fields(), which adds repeated disk I/O for data that does not
change at runtime. Cache the parsed baseline once instead of loading it per
session, either by memoizing load_baseline_fields() with a lazy cache such as
functools.lru_cache or by initializing the baseline at module import, and keep
diff_against_baseline() using the cached result.
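
One practical consequence of memoizing the loader is that tests which swap the baseline file must clear the cache first. The sketch below shows that with `functools.lru_cache` and its `cache_clear()` method. The temporary file, the `fields` list layout, the `read_count` counter, and the `load_baseline_paths` name are stand-ins for this example and may not match the real baseline format.

```python
import json
import tempfile
from functools import lru_cache
from pathlib import Path

# Hypothetical stand-in for the real BASELINE_PATH constant.
BASELINE_PATH = Path(tempfile.mkdtemp()) / "schema_baseline.json"
BASELINE_PATH.write_text(
    json.dumps({"fields": [{"json_path": "type", "required": True}]}),
    encoding="utf-8",
)

read_count = 0  # Counts real disk reads, for demonstration only.


@lru_cache(maxsize=1)
def load_baseline_paths() -> frozenset[str]:
    """Read and parse the baseline once; later calls reuse the cached result."""
    global read_count
    read_count += 1
    raw = json.loads(BASELINE_PATH.read_text(encoding="utf-8"))
    # A frozenset keeps callers from mutating the shared cached value.
    return frozenset(item["json_path"] for item in raw["fields"])


load_baseline_paths()
load_baseline_paths()
reads_before_clear = read_count  # Second call was served from the cache.

load_baseline_paths.cache_clear()  # What a test would do after replacing the file.
load_baseline_paths()
reads_after_clear = read_count
```

Returning an immutable value is a deliberate choice here. A cached dict that one caller mutates would silently change the result for every later caller.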

71-87: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

collect_field_paths_with_types and baseline expected_type are unused in the drift diff.

diff_against_baseline only compares field paths, and nothing in this module reads observed types. If type-drift checks aren’t coming next, remove the helper and expected_type plumbing for now to keep this focused.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@utils/schema_drift.py` around lines 71 - 87, The observed type-tracking code
in collect_field_paths_with_types is not used by diff_against_baseline, so the
drift logic should be simplified to only compare field paths. Remove the unused
type-collection helper and any expected_type plumbing from the baseline handling
unless you are adding type-drift checks in this same change, and keep the
remaining symbols centered around collect_field_paths, baseline, and
diff_against_baseline.
static/js/sessions.js (1)

85-88: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

Parallelize independent fetches to reduce latency.

fetchSchemaDriftBannerHtml() and the sessions fetch are independent; awaiting them sequentially adds the schema-report round trip to the workspace load time.

⚡ Suggested refactor
-        const schemaBannerHtml = await fetchSchemaDriftBannerHtml();
-
-        const res = await fetch(`/api/projects/${encodeURIComponent(projectName)}/sessions`);
-        state.cachedSessions = await res.json();
+        const [schemaBannerHtml, res] = await Promise.all([
+            fetchSchemaDriftBannerHtml(),
+            fetch(`/api/projects/${encodeURIComponent(projectName)}/sessions`),
+        ]);
+        state.cachedSessions = await res.json();
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@static/js/sessions.js` around lines 85 - 88, The workspace load is waiting on
two independent requests in sequence, which adds unnecessary latency. Update the
sessions-loading flow in sessions.js so fetchSchemaDriftBannerHtml() and the
/api/projects/${encodeURIComponent(projectName)}/sessions fetch start in
parallel, then await both results together before using schemaBannerHtml and
state.cachedSessions. Keep the change localized to the sessions-loading logic
and preserve the existing assignment behavior once both promises resolve.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@utils/jsonl_parser.py`:
- Around line 236-239: `parse_session` currently calls
`record_parse_drift(observed_field_paths)` unguarded, so drift tracking can
abort parsing if `diff_against_baseline()` or `load_baseline_fields()` fails on
a missing or malformed `schema_baseline.json`. Update `record_parse_drift` (or
its call site in `parse_session`) to catch and suppress non-fatal drift-tracking
errors, while still allowing `validate_session_dict(...)` and the rest of
parsing to proceed normally; keep the fix localized around `record_parse_drift`,
`diff_against_baseline`, and `load_baseline_fields`.
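
A minimal sketch of the guard this comment asks for, assuming the drift step can be wrapped as a callable. The wrapper name `record_parse_drift_safely`, the exception list, and the log message are illustrative choices, not the project's code.

```python
import json
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("claude_code_chat_browser.schema_drift")


def record_parse_drift_safely(
    observed: set[str],
    diff: Callable[[set[str]], dict[str, Any]],
) -> Optional[dict[str, Any]]:
    """Run the drift diff, but never let a baseline problem abort parsing."""
    try:
        return diff(observed)
    except (OSError, ValueError, KeyError, TypeError) as exc:
        # OSError covers a missing or unreadable baseline file;
        # json.JSONDecodeError is a ValueError subclass (malformed JSON);
        # KeyError and TypeError cover an unexpected baseline structure.
        logger.warning("Schema drift tracking skipped: %s", exc)
        return None


def broken_diff(observed: set[str]) -> dict[str, Any]:
    """Simulates a malformed schema_baseline.json."""
    return json.loads("{not valid json")


result = record_parse_drift_safely({"type"}, broken_diff)
healthy = record_parse_drift_safely({"type"}, lambda paths: {"has_drift": False})
```

Catching a named set of exceptions rather than a bare `Exception` keeps genuine programming errors visible while still treating baseline I/O and parse failures as non-fatal.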


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 118cf0f8-8beb-4e8a-ab11-f5c572bfcdac

📥 Commits

Reviewing files that changed from the base of the PR and between 4345d69 and 8a7f170.

📒 Files selected for processing (9)
  • api/schema_report.py
  • app.py
  • schema_baseline.json
  • static/css/style.css
  • static/js/sessions.js
  • tests/fixtures/jsonl/unknown_field.jsonl
  • tests/test_schema_drift.py
  • utils/jsonl_parser.py
  • utils/schema_drift.py

Comment thread utils/jsonl_parser.py
…108)

Cache schema_baseline.json with lru_cache and make record_parse_drift
non-fatal on baseline I/O or parse errors so parsing never aborts.
Fetch /api/schema-report after sessions load so the banner reflects
drift from the current parse run. Add vitest coverage for banner
rendering and fetch ordering; extend pytest for malformed baseline.
Raise benchmark baselines for per-entry field-path fingerprinting.


Development

Successfully merging this pull request may close these issues.

claude-code-chat-browser: Schema drift detection — mechanism to detect upstream Claude Code JSONL changes

1 participant