claude-code-chat-browser: schema drift detection for upstream JSONL c…#108
claude-code-chat-browser: schema drift detection for upstream JSONL c…#108clean6378-max-it wants to merge 2 commits into
Conversation
…hanges (#5) Fingerprint known Claude Code JSONL field paths against a committed schema_baseline.json, warn on drift during parsing, expose GET /api/schema-report, and surface a dismissible amber banner on the session list page. Warnings only - parsing is never blocked.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (5)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughAdds JSONL schema drift tracking from parser to API and UI, backed by a committed baseline, updated tests, and refreshed benchmark baselines. ChangesSchema Drift Detection
Estimated code review effort: 3 (Moderate) | ~25 minutes Possibly related PRs
Suggested reviewers: 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
utils/schema_drift.py (2)
90-121: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick winBaseline JSON is re-read and re-parsed from disk on every
parse_session()call.
load_baseline_fields()does file I/O +json.loadsand is invoked unconditionally bydiff_against_baseline(), which per the parser integration snippet runs once per session file. On a session-list page rendering many sessions, this repeats disk I/O + parsing for a file that never changes at runtime. Consider loading/parsing the baseline once (module import or lazily-cached, e.g. viafunctools.lru_cache) instead of on every parse.♻️ Example: cache the parsed baseline
+from functools import lru_cache + + +@lru_cache(maxsize=1) +def load_baseline_fields() -> dict[str, SchemaFieldSpec]: """Load ``schema_baseline.json`` field specs keyed by dotted path.""" raw = json.loads(BASELINE_PATH.read_text(encoding="utf-8"))🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@utils/schema_drift.py` around lines 90 - 121, The baseline JSON is being re-read and re-parsed on every call through diff_against_baseline() and load_baseline_fields(), which adds repeated disk I/O for data that does not change at runtime. Cache the parsed baseline once instead of loading it per session, either by memoizing load_baseline_fields() with a lazy cache such as functools.lru_cache or by initializing the baseline at module import, and keep diff_against_baseline() using the cached result.
71-87: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value
collect_field_paths_with_typesand baselineexpected_typeare unused in the drift diff.
diff_against_baselineonly compares field paths, and nothing in this module reads observed types. If type-drift checks aren’t coming next, remove the helper andexpected_typeplumbing for now to keep this focused.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@utils/schema_drift.py` around lines 71 - 87, The observed type-tracking code in collect_field_paths_with_types is not used by diff_against_baseline, so the drift logic should be simplified to only compare field paths. Remove the unused type-collection helper and any expected_type plumbing from the baseline handling unless you are adding type-drift checks in this same change, and keep the remaining symbols centered around collect_field_paths, baseline, and diff_against_baseline.static/js/sessions.js (1)
85-88: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick winParallelize independent fetches to reduce latency.
fetchSchemaDriftBannerHtml()and the sessions fetch are independent; awaiting them sequentially adds the schema-report round trip to the workspace load time.⚡ Suggested refactor
- const schemaBannerHtml = await fetchSchemaDriftBannerHtml(); - - const res = await fetch(`/api/projects/${encodeURIComponent(projectName)}/sessions`); - state.cachedSessions = await res.json(); + const [schemaBannerHtml, res] = await Promise.all([ + fetchSchemaDriftBannerHtml(), + fetch(`/api/projects/${encodeURIComponent(projectName)}/sessions`), + ]); + state.cachedSessions = await res.json();🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@static/js/sessions.js` around lines 85 - 88, The workspace load is waiting on two independent requests in sequence, which adds unnecessary latency. Update the sessions-loading flow in sessions.js so fetchSchemaDriftBannerHtml() and the /api/projects/${encodeURIComponent(projectName)}/sessions fetch start in parallel, then await both results together before using schemaBannerHtml and state.cachedSessions. Keep the change localized to the sessions-loading logic and preserve the existing assignment behavior once both promises resolve.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@utils/jsonl_parser.py`:
- Around line 236-239: `parse_session` currently calls
`record_parse_drift(observed_field_paths)` unguarded, so drift tracking can
abort parsing if `diff_against_baseline()` or `load_baseline_fields()` fails on
a missing or malformed `schema_baseline.json`. Update `record_parse_drift` (or
its call site in `parse_session`) to catch and suppress non-fatal drift-tracking
errors, while still allowing `validate_session_dict(...)` and the rest of
parsing to proceed normally; keep the fix localized around `record_parse_drift`,
`diff_against_baseline`, and `load_baseline_fields`.
---
Nitpick comments:
In `@static/js/sessions.js`:
- Around line 85-88: The workspace load is waiting on two independent requests
in sequence, which adds unnecessary latency. Update the sessions-loading flow in
sessions.js so fetchSchemaDriftBannerHtml() and the
/api/projects/${encodeURIComponent(projectName)}/sessions fetch start in
parallel, then await both results together before using schemaBannerHtml and
state.cachedSessions. Keep the change localized to the sessions-loading logic
and preserve the existing assignment behavior once both promises resolve.
In `@utils/schema_drift.py`:
- Around line 90-121: The baseline JSON is being re-read and re-parsed on every
call through diff_against_baseline() and load_baseline_fields(), which adds
repeated disk I/O for data that does not change at runtime. Cache the parsed
baseline once instead of loading it per session, either by memoizing
load_baseline_fields() with a lazy cache such as functools.lru_cache or by
initializing the baseline at module import, and keep diff_against_baseline()
using the cached result.
- Around line 71-87: The observed type-tracking code in
collect_field_paths_with_types is not used by diff_against_baseline, so the
drift logic should be simplified to only compare field paths. Remove the unused
type-collection helper and any expected_type plumbing from the baseline handling
unless you are adding type-drift checks in this same change, and keep the
remaining symbols centered around collect_field_paths, baseline, and
diff_against_baseline.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 118cf0f8-8beb-4e8a-ab11-f5c572bfcdac
📒 Files selected for processing (9)
api/schema_report.pyapp.pyschema_baseline.jsonstatic/css/style.cssstatic/js/sessions.jstests/fixtures/jsonl/unknown_field.jsonltests/test_schema_drift.pyutils/jsonl_parser.pyutils/schema_drift.py
…108) Cache schema_baseline.json with lru_cache and make record_parse_drift non-fatal on baseline I/O or parse errors so parsing never aborts. Fetch /api/schema-report after sessions load so the banner reflects drift from the current parse run. Add vitest coverage for banner rendering and fetch ordering; extend pytest for malformed baseline. Raise benchmark baselines for per-entry field-path fingerprinting.
Closes #103
Summary
Detect upstream Claude Code JSONL schema drift by fingerprinting field paths during
parse_session()and diffing against a committedschema_baseline.json. New or missing required paths emit warnings on theclaude_code_chat_browser.schema_driftlogger; the web UI shows a dismissible amber banner on the session list page;GET /api/schema-reportreturns{known_fields, new_fields, missing_fields, has_drift}.Closes sprint item 5 (July Week 1 Thursday, 5 pt).
Changes
utils/schema_drift.py— baseline loader, recursive path collection, drift diff, process-wide reportutils/jsonl_parser.py—_collect_field_paths()wrapper; drift check at parse completionschema_baseline.json— 98 known(json_path, expected_type)pairs; onlytypeis requiredapi/schema_report.py—GET /api/schema-reportstatic/js/sessions.js+static/css/style.css— dismissible amber warning banner (sessionStorage fingerprint)tests/fixtures/jsonl/unknown_field.jsonl— syntheticFutureToolXYZ/ unknowntoolkey fixturetests/test_schema_drift.py— 10 tests covering warnings, API, false-positive guard, merge behaviorOut of scope
Test plan
pytest tests/ -k schema -qpytest -q(443 passed)mypy -p api -p utils -p modelsruff check .tests/fixtures/jsonl/unknown_field.jsonl— confirm warning logged and amber banner in UISummary by CodeRabbit