fix(knowledge): repair schema, dedup sections, normalize responses by ohdearquant · Pull Request #3 · ohdearquant/khive

ohdearquant · 2026-06-07T04:46:02Z

Summary

Register V23 migration (drop content column from knowledge_atoms) with idempotency guard
Add V24 migration: dedup knowledge_sections by content — preserves distinct content, removes true duplicates
Remove all content column references from INSERT/UPDATE SQL in crud.rs
Change edit handler from ON CONFLICT upsert to explicit SELECT+UPDATE/INSERT (no UNIQUE constraint needed)
atom.content now returns description (content column removed from DB)
Normalize timestamps from epoch microseconds → ISO 8601
Flatten search/suggest response envelope to {results, total}
Rename topic result key items → results

Test plan

knowledge.upsert_atoms succeeds (was: "no column named content")
knowledge.edit succeeds (was: "ON CONFLICT does not match")
knowledge.list returns ISO timestamps (was: epoch microseconds)
knowledge.topic returns results key (was: items)
cargo check + clippy + fmt clean

🤖 Generated with Claude Code

…min, 20-word atom min, SQL files, response normalization Schema: - knowledge_atoms: remove content column (content lives in sections) - knowledge_sections: add content_hash column, UNIQUE(atom_id, content_hash) - DDL moved to crates/khive-db/sql/knowledge_schema.sql - V22-V24 collapsed into no-ops (fresh start, no legacy DBs) Validators: - Section content minimum: 80 characters - Atom description minimum: 20 words - content_hash: sha256(content)[:16], computed at write time Write paths: - upsert_atoms routes content param into knowledge_section - edit handler uses explicit SELECT + UPDATE/INSERT (no ON CONFLICT) Response normalization: - Timestamps: epoch microseconds → ISO 8601 via micros_to_iso - search/suggest envelope: {status, data: {results, count}} → {results, total} - topic result key: items → results CI: - Golden help file updated for v0.2.8 - All test envelope references updated Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…om content model Fresh-start repo (v0.2.8) — collapse all incremental migrations into one. - schema.sql loaded as V1 via include_str!; run_migrations simplified - knowledge_atoms uses a single `content` column (content IS the description) - knowledge_sections: content_hash + UNIQUE(atom_id, content_hash) - atom content >= 20 words; section content >= 80 chars; content_hash=sha256[:16] - response normalization (ISO timestamps, {results,total}, topic items->results) - ADR-048 updated Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…odex r5) - Delete superseded sql/knowledge_schema.sql (had old description column) - Delete V17 preserving-rebuild contract test (fresh-start: old vec0 unsupported) - backend.rs vec0 error: pre-v0.2.8 schema unsupported, recreate DB (was: run V17) - migrations.rs: drop build_v14 reference from EMBEDDING_MODELS_DDL comment - ADR-048: update all DDL blocks to content_hash + UNIQUE(atom_id, content_hash) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…mmit sqlfluff cannot parse the fts5/vec0 virtual-table extension syntax our DDL uses, so the meaningful check is execution: scripts/lint-sql.sh loads every crates/**/*.sql into an in-memory SQLite database (fts5 is built into stdlib sqlite3) and flags trailing whitespace / tabs. Wired into both the CI script and a pre-commit hook scoped to crates/**/*.sql. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ohdearquant · 2026-06-07T14:44:20Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ab1bdb0414

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-07T14:47:13Z

+pub const MIGRATIONS: &[VersionedMigration] = &[VersionedMigration {
+    version: 1,
+    name: "initial_schema",
+    up: V1_UP,
+}];


Append an upgrade migration instead of resetting to V1

On any existing khive database whose _schema_migrations already records the old latest version (for example 22), run_migrations treats this single V1 entry as already older than the current version and skips it, so the new schema changes are never applied. That leaves deployed DBs without knowledge_sections.content_hash and with the old UNIQUE(atom_id, section_type)/FTS shape, while the new knowledge handlers now insert and update content_hash, causing section edits/imports to fail after upgrade. This also violates the repo guidance in CLAUDE.md to append schema versions rather than editing V1.

Useful? React with 👍 / 👎.

@codex that is intentional, a clean break

… scoring field Atom content has a single source — the `content` column. Remove the leftover `description` debt from the prior two-column model: - AtomInput: drop the `description` input alias and the fuller-wins selection logic; `content` is the sole, required body field. - Scoring: the Candidate `description` token field duplicated `content` (both filled from atom.content), double-counting body text under w_description + w_content. Collapse to a single `content` field and fold the body weight into D_W_CONTENT (1.0 -> 1.5). - Output: rename ScoredHit.description -> content; emit `content` in the results JSON for atoms and domains alike. - Drop w_description from the public weights-override surface and help. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…eeds AtomInput no longer has a `description` field, so these keys were silently dropped by serde — dead test cruft. Every affected atom already carries a `content` field. Domain seeds (which legitimately retain `description`) and learn/create entity calls are untouched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Record the policy already implemented by the schema consolidation in this PR: - DDL is authored in `crates/khive-db/sql/*.sql` and pulled into migrations.rs via `include_str!` — not inline Rust string literals. Tooling (lint-sql.sh, wired into CI + pre-commit) can lint/load real SQL; inline strings drift silently. - The V1–V22 ledger is marked historical: the fresh-start v0.2.8 reset collapsed it into a single V1 = initial_schema. Future changes append V2, V3, … each sourcing its own .sql file. - Directional preference: extract hot/tuned query SQL into .sql files where it makes sense — lintable, EXPLAIN-able, tunable without recompiling. Mirror the policy in CLAUDE.md (Schema changes + "What lives where"). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The consolidated schema.sql was generated from a migration dump: several CREATE TABLE statements were comma-jammed onto a single line and the triggers had stray blank lines and column-0 indentation. Reformat to one column per line with consistent indentation and logical grouping (core tables, packs, knowledge, FTS virtual tables, indexes, triggers). Verified semantically identical to the prior schema: every table/column/index/ FK/virtual-table fingerprint matches; the only delta is cosmetic `values(` -> `values (` in trigger bodies. 121 khive-db tests pass. The lint missed the jamming because it only checked execution + whitespace. Add a format check that flags multi-column CREATE TABLE on a single line — no auto-formatter exists for the fts5/vec0 dialect, so this guards against regression. Confirmed it catches all 9 jammed tables in the pre-cleanup file. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

ohdearquant · 2026-06-07T15:11:33Z

@codex review, note that, the clean break this time is intended due to a new repo

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9dc569b696

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-07T15:15:46Z

+                    sql: "SELECT id, status FROM knowledge_sections \
+                          WHERE atom_id = ?1 AND section_type = ?2 \
+                          ORDER BY sort_order ASC, created_at ASC LIMIT 1"


Preserve repeated section types instead of overwriting them

When an import or edit supplies two sections with the same section_type but different content, the second one finds the first via this lookup and updates it, even though the new schema now deduplicates on (atom_id, content_hash) and explicitly allows same-type sibling sections. In section-based atlas imports with repeated headings (for example multiple references or examples blocks), this silently drops all but one block instead of inserting the distinct section rows that the new content_hash key was added to support.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-07T15:15:46Z

+    /// The atom's content text (>= 20 words).
    #[serde(default)]
    pub content: Option<String>,


Keep accepting description-only atom upserts

The updated ADR says knowledge.upsert_atoms still accepts description as a legacy synonym for atom content, but this input struct no longer deserializes it. Any caller that sends only description now has content == None, so upsert_atoms validates an empty string and rejects the request; callers that send both also do not get the documented “fuller of the two wins” behavior. Please retain a description field here and normalize it with content before validation.

Useful? React with 👍 / 👎.

knowledge.edit resolved the target section by (atom_id, section_type) and updated it in place. With the new UNIQUE(atom_id, content_hash) key — added to allow same-type sibling sections — that lookup silently overwrote distinct content: importing two `references` (or `examples`) blocks kept only the last. Resolve the upsert by (atom_id, content_hash) instead: - byte-identical content → idempotent metadata refresh (status + embedding kept) - new content → INSERT a fresh row, leaving siblings (incl. verified) untouched This makes the content_hash key meaningful for the write path and matches the ADR's "multiple same-type sections are legitimate" guarantee. Behavior change: the old "editing a verified section downgrades it to reviewed" no longer applies — content is never mutated in place, so there is nothing to downgrade. Reworked the two W8 tests to the content-addressed model (idempotent re-import preserves verification; distinct same-type content creates a sibling without disturbing a verified one). Also scrub the stale ADR-048 claim that knowledge.upsert_atoms still accepts a `description` alias — the input struct is content-only (the alias was removed as tech debt in an earlier commit). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

ohdearquant · 2026-06-07T15:32:04Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: de5d12dc19

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-07T15:36:23Z

+pub const MIGRATIONS: &[VersionedMigration] = &[VersionedMigration {
+    version: 1,
+    name: "initial_schema",
+    up: V1_UP,
+}];


Reject pre-consolidation ledgers explicitly

For any database that already has the old shipped ledger (for example _schema_migrations max version 22), run_migrations computes current_version = 22 and skips the only remaining migration (version: 1), so the consolidated schema.sql changes are never applied. The process then runs against the old schema (e.g. knowledge_sections lacks content_hash, while the new edit path inserts that column), producing runtime SQL failures instead of a clear reset/migration path; if the baseline reset is intentional, this should detect current_version > latest and fail with an explicit recreate/upgrade error rather than silently accepting the stale schema.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-07T15:36:23Z

+            let content = atom_in.content.as_deref().unwrap_or("").trim().to_string();
+            validate_atom_content(&content)?;


Preserve section-only markdown imports

When knowledge.import ingests an atlas-style markdown file whose useful text is entirely under ## sections, parse_atlas_md passes only the pre-section body into upsert_atoms; this new 20-word validation rejects an empty/short preamble before the later section import can run. That makes otherwise valid section-chunked imports fail for files like # Title\n\n## Overview\n..., so import should synthesize/allow atom content from sections or bypass this atom-level minimum for sectioned imports.

Useful? React with 👍 / 👎.

…kdown Two correctness gaps surfaced by review of the consolidation: 1. run_migrations silently accepted pre-consolidation databases. With the V1..V22 ledger collapsed into a single V1, a database recorded at version 22 computed current_version=22, skipped the only migration, and ran against the stale schema (no content_hash column) — producing runtime SQL failures on the new section path. Now run_migrations detects current_version > latest and fails with an explicit recreate error instead of corrupting silently. 2. knowledge.import rejected section-only markdown. parse_atlas_md feeds only the pre-section body into upsert_atoms; for `# Title\n\n## Overview\n...` that body is empty and the 20-word atom-content minimum rejected the file before the section import ran. Import now synthesizes atom content from the section bodies when the pre-section body is too short. Regression tests: rejects_pre_consolidation_ledger (khive-db), w10_import_section_only_markdown_synthesizes_atom_content (pack-knowledge). ADR-015 and ADR-048 updated to document both behaviors. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The consolidated schema (single V1) is incompatible with pre-consolidation databases, which run_migrations now rejects. Move the default path to a new file name so a fresh install starts clean on the new schema instead of colliding with an existing legacy khive-graph.db. Bench fixture comments that reference the historical file by name are left unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

ohdearquant force-pushed the fix/knowledge-pack-schema branch from 4964156 to 169bb18 Compare June 7, 2026 13:58

ohdearquant and others added 3 commits June 7, 2026 10:28

chatgpt-codex-connector Bot reviewed Jun 7, 2026

View reviewed changes

ohdearquant and others added 4 commits June 7, 2026 10:58

chatgpt-codex-connector Bot reviewed Jun 7, 2026

View reviewed changes

ohdearquant and others added 2 commits June 7, 2026 11:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(knowledge): repair schema, dedup sections, normalize responses#3

fix(knowledge): repair schema, dedup sections, normalize responses#3
ohdearquant wants to merge 11 commits into
mainfrom
fix/knowledge-pack-schema

ohdearquant commented Jun 7, 2026

Uh oh!

ohdearquant commented Jun 7, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 7, 2026

Uh oh!

ohdearquant Jun 7, 2026 •

edited

Loading

Uh oh!

ohdearquant commented Jun 7, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 7, 2026

Uh oh!

chatgpt-codex-connector Bot Jun 7, 2026

Uh oh!

ohdearquant commented Jun 7, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 7, 2026

Uh oh!

chatgpt-codex-connector Bot Jun 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		let content = atom_in.content.as_deref().unwrap_or("").trim().to_string();
		validate_atom_content(&content)?;

Conversation

ohdearquant commented Jun 7, 2026

Summary

Test plan

Uh oh!

ohdearquant commented Jun 7, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 7, 2026

Choose a reason for hiding this comment

Uh oh!

ohdearquant Jun 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ohdearquant commented Jun 7, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 7, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 7, 2026

Choose a reason for hiding this comment

Uh oh!

ohdearquant commented Jun 7, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 7, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 7, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ohdearquant Jun 7, 2026 •

edited

Loading