fix(knowledge): repair schema, dedup sections, normalize responses#3
fix(knowledge): repair schema, dedup sections, normalize responses#3ohdearquant wants to merge 11 commits into
Conversation
…min, 20-word atom min, SQL files, response normalization
Schema:
- knowledge_atoms: remove content column (content lives in sections)
- knowledge_sections: add content_hash column, UNIQUE(atom_id, content_hash)
- DDL moved to crates/khive-db/sql/knowledge_schema.sql
- V22-V24 collapsed into no-ops (fresh start, no legacy DBs)
Validators:
- Section content minimum: 80 characters
- Atom description minimum: 20 words
- content_hash: sha256(content)[:16], computed at write time
Write paths:
- upsert_atoms routes content param into knowledge_section
- edit handler uses explicit SELECT + UPDATE/INSERT (no ON CONFLICT)
Response normalization:
- Timestamps: epoch microseconds → ISO 8601 via micros_to_iso
- search/suggest envelope: {status, data: {results, count}} → {results, total}
- topic result key: items → results
CI:
- Golden help file updated for v0.2.8
- All test envelope references updated
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4964156 to
169bb18
Compare
…om content model
Fresh-start repo (v0.2.8) — collapse all incremental migrations into one.
- schema.sql loaded as V1 via include_str!; run_migrations simplified
- knowledge_atoms uses a single `content` column (content IS the description)
- knowledge_sections: content_hash + UNIQUE(atom_id, content_hash)
- atom content >= 20 words; section content >= 80 chars; content_hash=sha256[:16]
- response normalization (ISO timestamps, {results,total}, topic items->results)
- ADR-048 updated
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…odex r5) - Delete superseded sql/knowledge_schema.sql (had old description column) - Delete V17 preserving-rebuild contract test (fresh-start: old vec0 unsupported) - backend.rs vec0 error: pre-v0.2.8 schema unsupported, recreate DB (was: run V17) - migrations.rs: drop build_v14 reference from EMBEDDING_MODELS_DDL comment - ADR-048: update all DDL blocks to content_hash + UNIQUE(atom_id, content_hash) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…mmit sqlfluff cannot parse the fts5/vec0 virtual-table extension syntax our DDL uses, so the meaningful check is execution: scripts/lint-sql.sh loads every crates/**/*.sql into an in-memory SQLite database (fts5 is built into stdlib sqlite3) and flags trailing whitespace / tabs. Wired into both the CI script and a pre-commit hook scoped to crates/**/*.sql. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ab1bdb0414
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| pub const MIGRATIONS: &[VersionedMigration] = &[VersionedMigration { | ||
| version: 1, | ||
| name: "initial_schema", | ||
| up: V1_UP, | ||
| }]; |
There was a problem hiding this comment.
Append an upgrade migration instead of resetting to V1
On any existing khive database whose _schema_migrations already records the old latest version (for example 22), run_migrations treats this single V1 entry as already older than the current version and skips it, so the new schema changes are never applied. That leaves deployed DBs without knowledge_sections.content_hash and with the old UNIQUE(atom_id, section_type)/FTS shape, while the new knowledge handlers now insert and update content_hash, causing section edits/imports to fail after upgrade. This also violates the repo guidance in CLAUDE.md to append schema versions rather than editing V1.
Useful? React with 👍 / 👎.
… scoring field Atom content has a single source — the `content` column. Remove the leftover `description` debt from the prior two-column model: - AtomInput: drop the `description` input alias and the fuller-wins selection logic; `content` is the sole, required body field. - Scoring: the Candidate `description` token field duplicated `content` (both filled from atom.content), double-counting body text under w_description + w_content. Collapse to a single `content` field and fold the body weight into D_W_CONTENT (1.0 -> 1.5). - Output: rename ScoredHit.description -> content; emit `content` in the results JSON for atoms and domains alike. - Drop w_description from the public weights-override surface and help. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…eeds AtomInput no longer has a `description` field, so these keys were silently dropped by serde — dead test cruft. Every affected atom already carries a `content` field. Domain seeds (which legitimately retain `description`) and learn/create entity calls are untouched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Record the policy already implemented by the schema consolidation in this PR: - DDL is authored in `crates/khive-db/sql/*.sql` and pulled into migrations.rs via `include_str!` — not inline Rust string literals. Tooling (lint-sql.sh, wired into CI + pre-commit) can lint/load real SQL; inline strings drift silently. - The V1–V22 ledger is marked historical: the fresh-start v0.2.8 reset collapsed it into a single V1 = initial_schema. Future changes append V2, V3, … each sourcing its own .sql file. - Directional preference: extract hot/tuned query SQL into .sql files where it makes sense — lintable, EXPLAIN-able, tunable without recompiling. Mirror the policy in CLAUDE.md (Schema changes + "What lives where"). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The consolidated schema.sql was generated from a migration dump: several CREATE TABLE statements were comma-jammed onto a single line and the triggers had stray blank lines and column-0 indentation. Reformat to one column per line with consistent indentation and logical grouping (core tables, packs, knowledge, FTS virtual tables, indexes, triggers). Verified semantically identical to the prior schema: every table/column/index/ FK/virtual-table fingerprint matches; the only delta is cosmetic `values(` -> `values (` in trigger bodies. 121 khive-db tests pass. The lint missed the jamming because it only checked execution + whitespace. Add a format check that flags multi-column CREATE TABLE on a single line — no auto-formatter exists for the fts5/vec0 dialect, so this guards against regression. Confirmed it catches all 9 jammed tables in the pre-cleanup file. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
@codex review, note that, the clean break this time is intended due to a new repo |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9dc569b696
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| sql: "SELECT id, status FROM knowledge_sections \ | ||
| WHERE atom_id = ?1 AND section_type = ?2 \ | ||
| ORDER BY sort_order ASC, created_at ASC LIMIT 1" |
There was a problem hiding this comment.
Preserve repeated section types instead of overwriting them
When an import or edit supplies two sections with the same section_type but different content, the second one finds the first via this lookup and updates it, even though the new schema now deduplicates on (atom_id, content_hash) and explicitly allows same-type sibling sections. In section-based atlas imports with repeated headings (for example multiple references or examples blocks), this silently drops all but one block instead of inserting the distinct section rows that the new content_hash key was added to support.
Useful? React with 👍 / 👎.
| /// The atom's content text (>= 20 words). | ||
| #[serde(default)] | ||
| pub content: Option<String>, |
There was a problem hiding this comment.
Keep accepting description-only atom upserts
The updated ADR says knowledge.upsert_atoms still accepts description as a legacy synonym for atom content, but this input struct no longer deserializes it. Any caller that sends only description now has content == None, so upsert_atoms validates an empty string and rejects the request; callers that send both also do not get the documented “fuller of the two wins” behavior. Please retain a description field here and normalize it with content before validation.
Useful? React with 👍 / 👎.
knowledge.edit resolved the target section by (atom_id, section_type) and updated it in place. With the new UNIQUE(atom_id, content_hash) key — added to allow same-type sibling sections — that lookup silently overwrote distinct content: importing two `references` (or `examples`) blocks kept only the last. Resolve the upsert by (atom_id, content_hash) instead: - byte-identical content → idempotent metadata refresh (status + embedding kept) - new content → INSERT a fresh row, leaving siblings (incl. verified) untouched This makes the content_hash key meaningful for the write path and matches the ADR's "multiple same-type sections are legitimate" guarantee. Behavior change: the old "editing a verified section downgrades it to reviewed" no longer applies — content is never mutated in place, so there is nothing to downgrade. Reworked the two W8 tests to the content-addressed model (idempotent re-import preserves verification; distinct same-type content creates a sibling without disturbing a verified one). Also scrub the stale ADR-048 claim that knowledge.upsert_atoms still accepts a `description` alias — the input struct is content-only (the alias was removed as tech debt in an earlier commit). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: de5d12dc19
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| pub const MIGRATIONS: &[VersionedMigration] = &[VersionedMigration { | ||
| version: 1, | ||
| name: "initial_schema", | ||
| up: V1_UP, | ||
| }]; |
There was a problem hiding this comment.
Reject pre-consolidation ledgers explicitly
For any database that already has the old shipped ledger (for example _schema_migrations max version 22), run_migrations computes current_version = 22 and skips the only remaining migration (version: 1), so the consolidated schema.sql changes are never applied. The process then runs against the old schema (e.g. knowledge_sections lacks content_hash, while the new edit path inserts that column), producing runtime SQL failures instead of a clear reset/migration path; if the baseline reset is intentional, this should detect current_version > latest and fail with an explicit recreate/upgrade error rather than silently accepting the stale schema.
Useful? React with 👍 / 👎.
| let content = atom_in.content.as_deref().unwrap_or("").trim().to_string(); | ||
| validate_atom_content(&content)?; |
There was a problem hiding this comment.
Preserve section-only markdown imports
When knowledge.import ingests an atlas-style markdown file whose useful text is entirely under ## sections, parse_atlas_md passes only the pre-section body into upsert_atoms; this new 20-word validation rejects an empty/short preamble before the later section import can run. That makes otherwise valid section-chunked imports fail for files like # Title\n\n## Overview\n..., so import should synthesize/allow atom content from sections or bypass this atom-level minimum for sectioned imports.
Useful? React with 👍 / 👎.
…kdown Two correctness gaps surfaced by review of the consolidation: 1. run_migrations silently accepted pre-consolidation databases. With the V1..V22 ledger collapsed into a single V1, a database recorded at version 22 computed current_version=22, skipped the only migration, and ran against the stale schema (no content_hash column) — producing runtime SQL failures on the new section path. Now run_migrations detects current_version > latest and fails with an explicit recreate error instead of corrupting silently. 2. knowledge.import rejected section-only markdown. parse_atlas_md feeds only the pre-section body into upsert_atoms; for `# Title\n\n## Overview\n...` that body is empty and the 20-word atom-content minimum rejected the file before the section import ran. Import now synthesizes atom content from the section bodies when the pre-section body is too short. Regression tests: rejects_pre_consolidation_ledger (khive-db), w10_import_section_only_markdown_synthesizes_atom_content (pack-knowledge). ADR-015 and ADR-048 updated to document both behaviors. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The consolidated schema (single V1) is incompatible with pre-consolidation databases, which run_migrations now rejects. Move the default path to a new file name so a fresh install starts clean on the new schema instead of colliding with an existing legacy khive-graph.db. Bench fixture comments that reference the historical file by name are left unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
contentcolumn from knowledge_atoms) with idempotency guardcontentcolumn references from INSERT/UPDATE SQL in crud.rsatom.contentnow returns description (content column removed from DB){results, total}items→resultsTest plan
knowledge.upsert_atomssucceeds (was: "no column named content")knowledge.editsucceeds (was: "ON CONFLICT does not match")knowledge.listreturns ISO timestamps (was: epoch microseconds)knowledge.topicreturnsresultskey (was:items)🤖 Generated with Claude Code