Skip to content

fix(knowledge): repair schema, dedup sections, normalize responses#3

Open
ohdearquant wants to merge 11 commits into
mainfrom
fix/knowledge-pack-schema
Open

fix(knowledge): repair schema, dedup sections, normalize responses#3
ohdearquant wants to merge 11 commits into
mainfrom
fix/knowledge-pack-schema

Conversation

@ohdearquant
Copy link
Copy Markdown
Owner

Summary

  • Register V23 migration (drop content column from knowledge_atoms) with idempotency guard
  • Add V24 migration: dedup knowledge_sections by content — preserves distinct content, removes true duplicates
  • Remove all content column references from INSERT/UPDATE SQL in crud.rs
  • Change edit handler from ON CONFLICT upsert to explicit SELECT+UPDATE/INSERT (no UNIQUE constraint needed)
  • atom.content now returns description (content column removed from DB)
  • Normalize timestamps from epoch microseconds → ISO 8601
  • Flatten search/suggest response envelope to {results, total}
  • Rename topic result key itemsresults

Test plan

  • knowledge.upsert_atoms succeeds (was: "no column named content")
  • knowledge.edit succeeds (was: "ON CONFLICT does not match")
  • knowledge.list returns ISO timestamps (was: epoch microseconds)
  • knowledge.topic returns results key (was: items)
  • cargo check + clippy + fmt clean

🤖 Generated with Claude Code

…min, 20-word atom min, SQL files, response normalization

Schema:
- knowledge_atoms: remove content column (content lives in sections)
- knowledge_sections: add content_hash column, UNIQUE(atom_id, content_hash)
- DDL moved to crates/khive-db/sql/knowledge_schema.sql
- V22-V24 collapsed into no-ops (fresh start, no legacy DBs)

Validators:
- Section content minimum: 80 characters
- Atom description minimum: 20 words
- content_hash: sha256(content)[:16], computed at write time

Write paths:
- upsert_atoms routes content param into knowledge_section
- edit handler uses explicit SELECT + UPDATE/INSERT (no ON CONFLICT)

Response normalization:
- Timestamps: epoch microseconds → ISO 8601 via micros_to_iso
- search/suggest envelope: {status, data: {results, count}} → {results, total}
- topic result key: items → results

CI:
- Golden help file updated for v0.2.8
- All test envelope references updated

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ohdearquant ohdearquant force-pushed the fix/knowledge-pack-schema branch from 4964156 to 169bb18 Compare June 7, 2026 13:58
ohdearquant and others added 3 commits June 7, 2026 10:28
…om content model

Fresh-start repo (v0.2.8) — collapse all incremental migrations into one.

- schema.sql loaded as V1 via include_str!; run_migrations simplified
- knowledge_atoms uses a single `content` column (content IS the description)
- knowledge_sections: content_hash + UNIQUE(atom_id, content_hash)
- atom content >= 20 words; section content >= 80 chars; content_hash=sha256[:16]
- response normalization (ISO timestamps, {results,total}, topic items->results)
- ADR-048 updated

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…odex r5)

- Delete superseded sql/knowledge_schema.sql (had old description column)
- Delete V17 preserving-rebuild contract test (fresh-start: old vec0 unsupported)
- backend.rs vec0 error: pre-v0.2.8 schema unsupported, recreate DB (was: run V17)
- migrations.rs: drop build_v14 reference from EMBEDDING_MODELS_DDL comment
- ADR-048: update all DDL blocks to content_hash + UNIQUE(atom_id, content_hash)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…mmit

sqlfluff cannot parse the fts5/vec0 virtual-table extension syntax our DDL uses,
so the meaningful check is execution: scripts/lint-sql.sh loads every
crates/**/*.sql into an in-memory SQLite database (fts5 is built into stdlib
sqlite3) and flags trailing whitespace / tabs. Wired into both the CI script and
a pre-commit hook scoped to crates/**/*.sql.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ohdearquant
Copy link
Copy Markdown
Owner Author

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ab1bdb0414

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +138 to +142
pub const MIGRATIONS: &[VersionedMigration] = &[VersionedMigration {
version: 1,
name: "initial_schema",
up: V1_UP,
}];
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Append an upgrade migration instead of resetting to V1

On any existing khive database whose _schema_migrations already records the old latest version (for example 22), run_migrations treats this single V1 entry as already older than the current version and skips it, so the new schema changes are never applied. That leaves deployed DBs without knowledge_sections.content_hash and with the old UNIQUE(atom_id, section_type)/FTS shape, while the new knowledge handlers now insert and update content_hash, causing section edits/imports to fail after upgrade. This also violates the repo guidance in CLAUDE.md to append schema versions rather than editing V1.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

@ohdearquant ohdearquant Jun 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@codex that is intentional, a clean break

ohdearquant and others added 4 commits June 7, 2026 10:58
… scoring field

Atom content has a single source — the `content` column. Remove the
leftover `description` debt from the prior two-column model:

- AtomInput: drop the `description` input alias and the fuller-wins
  selection logic; `content` is the sole, required body field.
- Scoring: the Candidate `description` token field duplicated `content`
  (both filled from atom.content), double-counting body text under
  w_description + w_content. Collapse to a single `content` field and
  fold the body weight into D_W_CONTENT (1.0 -> 1.5).
- Output: rename ScoredHit.description -> content; emit `content` in the
  results JSON for atoms and domains alike.
- Drop w_description from the public weights-override surface and help.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…eeds

AtomInput no longer has a `description` field, so these keys were silently
dropped by serde — dead test cruft. Every affected atom already carries a
`content` field. Domain seeds (which legitimately retain `description`) and
learn/create entity calls are untouched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Record the policy already implemented by the schema consolidation in this PR:

- DDL is authored in `crates/khive-db/sql/*.sql` and pulled into migrations.rs
  via `include_str!` — not inline Rust string literals. Tooling (lint-sql.sh,
  wired into CI + pre-commit) can lint/load real SQL; inline strings drift
  silently.
- The V1–V22 ledger is marked historical: the fresh-start v0.2.8 reset collapsed
  it into a single V1 = initial_schema. Future changes append V2, V3, … each
  sourcing its own .sql file.
- Directional preference: extract hot/tuned query SQL into .sql files where it
  makes sense — lintable, EXPLAIN-able, tunable without recompiling.

Mirror the policy in CLAUDE.md (Schema changes + "What lives where").

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The consolidated schema.sql was generated from a migration dump: several
CREATE TABLE statements were comma-jammed onto a single line and the triggers
had stray blank lines and column-0 indentation. Reformat to one column per
line with consistent indentation and logical grouping (core tables, packs,
knowledge, FTS virtual tables, indexes, triggers).

Verified semantically identical to the prior schema: every table/column/index/
FK/virtual-table fingerprint matches; the only delta is cosmetic `values(` ->
`values (` in trigger bodies. 121 khive-db tests pass.

The lint missed the jamming because it only checked execution + whitespace.
Add a format check that flags multi-column CREATE TABLE on a single line —
no auto-formatter exists for the fts5/vec0 dialect, so this guards against
regression. Confirmed it catches all 9 jammed tables in the pre-cleanup file.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@ohdearquant
Copy link
Copy Markdown
Owner Author

@codex review, note that, the clean break this time is intended due to a new repo

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9dc569b696

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +233 to +235
sql: "SELECT id, status FROM knowledge_sections \
WHERE atom_id = ?1 AND section_type = ?2 \
ORDER BY sort_order ASC, created_at ASC LIMIT 1"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve repeated section types instead of overwriting them

When an import or edit supplies two sections with the same section_type but different content, the second one finds the first via this lookup and updates it, even though the new schema now deduplicates on (atom_id, content_hash) and explicitly allows same-type sibling sections. In section-based atlas imports with repeated headings (for example multiple references or examples blocks), this silently drops all but one block instead of inserting the distinct section rows that the new content_hash key was added to support.

Useful? React with 👍 / 👎.

Comment on lines +224 to 226
/// The atom's content text (>= 20 words).
#[serde(default)]
pub content: Option<String>,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep accepting description-only atom upserts

The updated ADR says knowledge.upsert_atoms still accepts description as a legacy synonym for atom content, but this input struct no longer deserializes it. Any caller that sends only description now has content == None, so upsert_atoms validates an empty string and rejects the request; callers that send both also do not get the documented “fuller of the two wins” behavior. Please retain a description field here and normalize it with content before validation.

Useful? React with 👍 / 👎.

knowledge.edit resolved the target section by (atom_id, section_type) and
updated it in place. With the new UNIQUE(atom_id, content_hash) key — added to
allow same-type sibling sections — that lookup silently overwrote distinct
content: importing two `references` (or `examples`) blocks kept only the last.

Resolve the upsert by (atom_id, content_hash) instead:
- byte-identical content → idempotent metadata refresh (status + embedding kept)
- new content → INSERT a fresh row, leaving siblings (incl. verified) untouched

This makes the content_hash key meaningful for the write path and matches the
ADR's "multiple same-type sections are legitimate" guarantee.

Behavior change: the old "editing a verified section downgrades it to reviewed"
no longer applies — content is never mutated in place, so there is nothing to
downgrade. Reworked the two W8 tests to the content-addressed model (idempotent
re-import preserves verification; distinct same-type content creates a sibling
without disturbing a verified one).

Also scrub the stale ADR-048 claim that knowledge.upsert_atoms still accepts a
`description` alias — the input struct is content-only (the alias was removed as
tech debt in an earlier commit).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@ohdearquant
Copy link
Copy Markdown
Owner Author

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: de5d12dc19

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +138 to +142
pub const MIGRATIONS: &[VersionedMigration] = &[VersionedMigration {
version: 1,
name: "initial_schema",
up: V1_UP,
}];
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject pre-consolidation ledgers explicitly

For any database that already has the old shipped ledger (for example _schema_migrations max version 22), run_migrations computes current_version = 22 and skips the only remaining migration (version: 1), so the consolidated schema.sql changes are never applied. The process then runs against the old schema (e.g. knowledge_sections lacks content_hash, while the new edit path inserts that column), producing runtime SQL failures instead of a clear reset/migration path; if the baseline reset is intentional, this should detect current_version > latest and fail with an explicit recreate/upgrade error rather than silently accepting the stale schema.

Useful? React with 👍 / 👎.

Comment on lines +51 to +52
let content = atom_in.content.as_deref().unwrap_or("").trim().to_string();
validate_atom_content(&content)?;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve section-only markdown imports

When knowledge.import ingests an atlas-style markdown file whose useful text is entirely under ## sections, parse_atlas_md passes only the pre-section body into upsert_atoms; this new 20-word validation rejects an empty/short preamble before the later section import can run. That makes otherwise valid section-chunked imports fail for files like # Title\n\n## Overview\n..., so import should synthesize/allow atom content from sections or bypass this atom-level minimum for sectioned imports.

Useful? React with 👍 / 👎.

ohdearquant and others added 2 commits June 7, 2026 11:40
…kdown

Two correctness gaps surfaced by review of the consolidation:

1. run_migrations silently accepted pre-consolidation databases. With the
   V1..V22 ledger collapsed into a single V1, a database recorded at version 22
   computed current_version=22, skipped the only migration, and ran against the
   stale schema (no content_hash column) — producing runtime SQL failures on the
   new section path. Now run_migrations detects current_version > latest and
   fails with an explicit recreate error instead of corrupting silently.

2. knowledge.import rejected section-only markdown. parse_atlas_md feeds only the
   pre-section body into upsert_atoms; for `# Title\n\n## Overview\n...` that body
   is empty and the 20-word atom-content minimum rejected the file before the
   section import ran. Import now synthesizes atom content from the section bodies
   when the pre-section body is too short.

Regression tests: rejects_pre_consolidation_ledger (khive-db),
w10_import_section_only_markdown_synthesizes_atom_content (pack-knowledge).
ADR-015 and ADR-048 updated to document both behaviors.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The consolidated schema (single V1) is incompatible with pre-consolidation databases, which run_migrations now rejects. Move the default path to a new file name so a fresh install starts clean on the new schema instead of colliding with an existing legacy khive-graph.db. Bench fixture comments that reference the historical file by name are left unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant