Add SQL query interface and Claude Code skill by wesm · Pull Request #249 · wesm/msgvault

wesm · 2026-04-08T16:21:15Z

Summary

Supersedes #236 with a different approach: instead of shelling out to the duckdb CLI via bash scripts, this builds a SQL view layer inside msgvault itself.

DuckDB views over Parquet files (RegisterViews), registered at engine startup — 8 base views + 5 convenience views (v_messages, v_senders, v_domains, v_labels, v_threads)
msgvault query CLI command with --format json|csv|table output
POST /api/v1/query HTTP endpoint on the existing serve daemon (503 when Parquet cache unavailable)
Claude Code skill that teaches agents to use msgvault query "SELECT ..."

No bash wrapper scripts, no external duckdb CLI dependency, no Parquet path knowledge leaked to consumers.

Changes

Component	Files
View layer	`internal/query/views.go` (base + convenience views, `RegisterViews`)
QuerySQL	`internal/query/duckdb.go` (method + read-only validation + view wiring)
CLI	`cmd/msgvault/cmd/query.go`
HTTP	`internal/api/handlers.go`, `internal/api/server.go`
Skill	`skills/claude-code/SKILL.md`, `skills/claude-code/references/views.md`
Tests	`internal/query/views_test.go`, `cmd/msgvault/cmd/query_test.go`, `internal/api/handlers_test.go`

🤖 Generated with Claude Code

Adds a Claude Code skill for msgvault that covers the full CLI surface and includes direct DuckDB queries against the Parquet analytics cache for operations the CLI search can't handle (boolean logic, multi-domain, aggregations, thread analysis). Includes: - SKILL.md with verified JSON output shapes, search strategy, safety rules - scripts/query.sh helper wrapping common DuckDB patterns (9 subcommands) - references/duckdb-queries.md with full Parquet schema and query patterns - references/workflows.md with multi-step analysis patterns Tested against a ~755k message archive. All documented commands, jq patterns, and DuckDB queries verified against live data. Ref: #230

… query - Add input validation to all query.sh subcommands (integers, dates, domains, emails, labels) to prevent SQL injection via crafted arguments - Fix senders subcommand to accept flags before or after optional limit - Fix thread analysis workflow to use query.sh instead of search --json (search does not return to/cc fields) - Guard all search-to-jq pipelines against non-JSON empty results - Add note about sql subcommand passing input unvalidated

The & character in the bash regex character class caused a parse error. Switched to denylist approach (reject single quotes, semicolons, backslashes) which is more robust for label names containing special characters like &.

Security: - Add duckdb binary existence check before running queries - Tighten domain validation: reject underscores, require start/end with alphanumeric (closes injection via underscore identifiers) - Add write-operation guard to sql subcommand: blocks DROP, DELETE, INSERT, UPDATE, CREATE, ALTER, COPY TO - Add security note to SKILL.md about sql subcommand risks Correctness: - Replace shift || true with explicit guard (prevents masked errors) - Add bounds check to validate_int (1-100000) Completeness: - Add build-cache and sync-full to SKILL.md Quick Reference - Add MSGVAULT_HOME path note to duckdb-queries.md - Document analytics cache prerequisite in DuckDB section

The write-operation guard used a case-sensitive blacklist that could be bypassed with lowercase or mixed-case statements. Replace with a strict allowlist that normalizes input to uppercase and only permits SELECT, WITH, EXPLAIN, DESCRIBE, SHOW, and PRAGMA statements.

- Reject semicolons in sql subcommand input to prevent multi-statement bypass (e.g. "SELECT 1; DROP TABLE messages") - Remove PRAGMA from allowlist (can modify DuckDB state) - Clarify threads subcommand matches any participant role (from/to/cc/bcc) not just senders — this is intentional for "who else is on threads involving this person" use case. Updated help text to document this.

EXPLAIN ANALYZE executes the underlying statement, so allowing EXPLAIN breaks the read-only guarantee. Remove EXPLAIN from the allowlist entirely — agents rarely need it and can use DESCRIBE/SHOW instead. Allowlist is now: SELECT, WITH, DESCRIBE, SHOW.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Drop HTTP endpoint from V1 scope (CLI-only), document security requirements for future remote access - Fix connection model: require single-connection pinning for view registration, matching existing DuckDBEngine pattern - Fix v_messages sender resolution to include dual-path logic (message_recipients + sender_id) and phone_number for chat sources - Document serve fallback behavior as future HTTP concern Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Add POST /api/v1/query endpoint back in scope (users responsible for securing their installations) - Define labels/participant lists as JSON text via to_json(list(...)) in view contract, matching existing DuckDB engine pattern - 503 when Parquet cache unavailable (no SQLite fallback for query) - HTTP handler reuses DuckDBEngine connection (views already registered) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

7 tasks: base views, convenience views, QuerySQL method, CLI command, HTTP endpoint, Claude Code skill, final verification. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Create reusable DuckDB views over the 8 Parquet tables (messages, participants, message_recipients, labels, message_labels, attachments, conversations, sources). Each view normalises column types with CAST and handles optional columns (attachment_count, sender_id, message_type, phone_number, title, conversation_type, source_type) via probing and COALESCE fallbacks for older cache files. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ls, v_threads) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

v_threads: COUNT(*) overcounted because LEFT JOINs on message_recipients and participants multiply rows per message. Changed to COUNT(DISTINCT m.id). v_senders: FIRST(mr.display_name) could return NULL/empty. Now uses COALESCE(NULLIF(TRIM(...), ''), email_address) to guarantee a non-empty from_name. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Register SQL views automatically during NewDuckDBEngine so they are available without a separate RegisterViews call. Add QuerySQL method and SQLQuerier interface for raw SQL access over the views. Deduplicate probeParquetColumns by delegating to the standalone probeColumns. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

NewDuckDBEngine probed Parquet schemas to populate engine.optionalCols, then called RegisterViews which probed the same schemas again inside createBaseViews. Extract probeAllOptionalColumns and RegisterViewsWithColumns so NewDuckDBEngine can pass its already-computed map and skip the redundant I/O. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Adds handleQuery handler that type-asserts s.engine to query.SQLQuerier at runtime. DuckDB engines support it; SQLite engines return 503. Route registered alongside other engine-dependent endpoints in the authenticated API v1 group. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace the previous ops-heavy skill with a thin query-focused skill that teaches Claude to use `msgvault query` with SQL views. Add a full schema reference in references/views.md derived from views.go. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Add union_by_name=true to probeColumns for mixed-schema Parquet caches - Guard QuerySQL with read-only prefix check (SELECT/WITH/DESCRIBE/EXPLAIN) - Normalize SQL NULLs to empty string in scanRow for clean CSV/table output - Fix v_senders from_name to prefer mr.display_name (consistent with v_messages) - Fix SKILL.md "Large attachments" example to use v_messages (has from_email) - Make views.md type descriptions open-ended for message_type, conversation_type, source_type - Add tests for DDL rejection, v_senders from_name, and exact v_threads message_count Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

roborev-ci · 2026-04-08T16:25:32Z

roborev: Combined Review (`c7c5d47`)

Verdict: Changes introduce a read-only query bypass and a couple of correctness regressions that should be addressed before merge.

High

Read-only SQL guard is bypassable
Locations: internal/query/duckdb.go:190, internal/query/duckdb.go:228, internal/query/duckdb.go:233, internal/api/handlers.go:631
The read-only check relies on prefix matching, which is not sufficient to guarantee safety for attacker-controlled SQL sent through POST /api/v1/query. Queries such as SELECT 1; DROP VIEW messages can bypass the intended restriction if multi-statement execution is accepted, and EXPLAIN ANALYZE ... can execute the wrapped statement rather than just describe it. This breaks the documented read-only contract and can mutate the shared DuckDB session.
Suggested fix: Validate the full parsed statement list rather than the first keyword, reject multi-statement input, remove or tightly constrain EXPLAIN, and add regression tests covering semicolon-separated statements and EXPLAIN ANALYZE bypasses.

Medium

CLI query path loses SQL NULL semantics
Location: cmd/msgvault/cmd/query.go:129
scanRow now converts every SQL NULL into "", so nullable fields and expressions no longer round-trip correctly in JSON/CSV/table output. That changes result meaning for callers by collapsing null and empty string into the same value.
Suggested fix: Preserve nil during scanning and handle display formatting per output mode instead of normalizing at scan time.
v_senders / v_domains undercount senders by ignoring the sender_id fallback path
Locations: internal/query/views.go:409, internal/query/views.go:431
These views aggregate only message_recipients rows where recipient_type = 'from', so messages that rely on the messages.sender_id fallback already used in v_messages are omitted. That makes sender/domain analytics inconsistent with the rest of the view layer and undercounts chat-style data.
Suggested fix: Build these aggregates from v_messages or union in the sender_id fallback before grouping.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Match the TUI pattern: check cacheNeedsBuild and auto-rebuild before querying instead of telling the user to run build-cache manually. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

roborev-ci · 2026-04-08T16:30:41Z

roborev: Combined Review (`5771e7d`)

Verdict: Changes are not ready to merge; the new query surface has two high-severity read-only bypasses and two medium-severity correctness regressions.

High

Read-only bypass via EXPLAIN ANALYZE
Location: internal/query/duckdb.go:192, internal/query/duckdb.go:236
QuerySQL/isReadOnlySQL allows statements starting with EXPLAIN, but in DuckDB EXPLAIN ANALYZE executes the wrapped statement. That lets /api/v1/query run mutating SQL against the shared session and regresses the prior read-only hardening.
Fix: Remove EXPLAIN from the allowlist, or at minimum block EXPLAIN ANALYZE, and add a regression test covering destructive input behind EXPLAIN ANALYZE.
Multi-statement SQL bypass
Location: internal/query/duckdb.go (isReadOnlySQL)
The read-only check only validates the prefix of the full SQL string. A query like SELECT 1; DROP VIEW messages can pass the initial check and then execute a destructive trailing statement.
Fix: Reject semicolons / multiple statements, or parse and validate every statement in the SQL input.

Medium

Sender/domain analytics drop messages that resolve sender via messages.sender_id
Location: internal/query/views.go:423, internal/query/views.go:446
v_senders and v_domains only aggregate message_recipients rows with recipient_type = 'from'. That excludes non-email/chat messages whose sender comes from messages.sender_id, even though v_messages already supports that fallback, so analytics are incomplete for mixed archives.
Fix: Reuse the same dual-path sender resolution as v_messages, or union in the messages.sender_id path, and add coverage for chat-style messages.
CLI query JSON output collapses SQL NULL into empty string
Location: cmd/msgvault/cmd/query.go:153
scanRow rewrites all SQL NULL values to "", which changes result semantics and makes NULL indistinguishable from an actual empty string in default JSON output.
Fix: Preserve nil for JSON results and only apply display formatting in CSV/table renderers.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

- JSON output now preserves nil (SQL NULL) instead of collapsing to empty string. CSV/table still display NULL as empty. - Remove isReadOnlySQL allowlist — users with CLI/API access are privileged and the prefix check was bypassable anyway. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

CLI/API users are privileged — don't flag SQL injection, DDL bypass, or statement validation on the query interface. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

roborev-ci · 2026-04-08T16:38:41Z

roborev: Combined Review (`8fc23d2`)

Verdict: Changes introduce a high-severity raw-SQL execution path in the API that enables server-side tampering and unintended DuckDB access.

High

Raw SQL over /api/v1/query is not restricted to a read-only subset
Files: internal/api/server.go (added /query route), internal/api/handlers.go:643, internal/query/duckdb.go:187
The new endpoint forwards attacker-controlled SQL to DuckDB without enforcing a single-statement, read-only query shape. As implemented, callers can issue non-SELECT statements and DuckDB-specific primitives (COPY, ATTACH, INSTALL, LOAD, PRAGMA, SET, DDL/DML), which expands this from an analytics endpoint into arbitrary database/file access from the server process. This is also a regression from the previous read-only guard.
Fix: Restrict the API to parsed, single-statement SELECT/WITH queries over an explicit allowlist of views/functions, and reject non-read-only or multi-statement input. If raw SQL must remain, keep it local-only and behind explicit opt-in.
API queries run on the shared long-lived DuckDB session, allowing persistent state tampering and DoS
Files: internal/api/handlers.go:643, internal/query/duckdb.go:187
Requests execute against the same reused DuckDB session, so statements like DROP VIEW messages, CREATE OR REPLACE VIEW ..., SET threads = 1, or temp-object creation can alter behavior for subsequent requests until restart. Because the engine is single-connection, expensive queries can also pin the only connection and degrade the entire service.
Fix: Execute API SQL on a fresh per-request DuckDB connection/session initialized only with approved views, or enforce a disposable read-only context. Add test coverage for rejected mutating statements and consider query/result/time limits.

Medium

v_senders and v_domains omit chat/non-email senders
Files: internal/query/views.go:423, internal/query/views.go:446
These aggregates only derive senders from message_recipients rows where recipient_type = 'from'. Chat and other non-email messages use messages.sender_id instead, so they disappear from both views even though v_messages already supports that fallback.
Fix: Reuse the same sender-resolution logic as v_messages, or union in a messages.sender_id path before grouping. Add tests for chat-only messages.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

roborev-ci · 2026-04-08T16:43:26Z

roborev: Combined Review (`9beb2e4`)

Code review verdict: no medium-or-higher issues found.

All provided reviews agree the diff is clean at Medium, High, and Critical severity.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

0xDarkMatter and others added 23 commits April 8, 2026 07:31

fix: label validator regex syntax error in bash

8573ac0

The & character in the bash regex character class caused a parse error. Switched to denylist approach (reject single quotes, semicolons, backslashes) which is more robust for label names containing special characters like &.

docs: add design spec for SQL analytical query interface

b956ef8

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

docs: add implementation plan for SQL query interface

4838a9e

7 tasks: base views, convenience views, QuerySQL method, CLI command, HTTP endpoint, Claude Code skill, final verification. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: add convenience views (v_messages, v_senders, v_domains, v_labe…

df763c8

…ls, v_threads) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: correct convenience view columns and aliases per spec

8aaf7b9

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: add 'msgvault query' CLI command with json/csv/table output

be0363c

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove specs

a103a4f

fix: remove old bash-script skill files replaced by SQL query interface

c7c5d47

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

wesm mentioned this pull request Apr 8, 2026

Add Claude Code skill with DuckDB query layer #236

Closed

fix: auto-build stale analytics cache in query command

5771e7d

Match the TUI pattern: check cacheNeedsBuild and auto-rebuild before querying instead of telling the user to run build-cache manually. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

wesm and others added 2 commits April 8, 2026 11:34

fix: add query trust model to roborev review guidelines

9beb2e4

CLI/API users are privileged — don't flag SQL injection, DDL bypass, or statement validation on the query interface. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

wesm merged commit 479e309 into main Apr 8, 2026
3 of 4 checks passed

wesm mentioned this pull request Apr 8, 2026

Consider a Claude Code skill as alternative to MCP server #230

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add SQL query interface and Claude Code skill#249

Add SQL query interface and Claude Code skill#249
wesm merged 26 commits intomainfrom
0xdarkmatter/feat/claude-code-skill

wesm commented Apr 8, 2026

Uh oh!

roborev-ci bot commented Apr 8, 2026

Uh oh!

roborev-ci bot commented Apr 8, 2026

Uh oh!

roborev-ci bot commented Apr 8, 2026

Uh oh!

roborev-ci bot commented Apr 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

wesm commented Apr 8, 2026

Summary

Changes

Uh oh!

roborev-ci bot commented Apr 8, 2026

roborev: Combined Review (c7c5d47)

High

Medium

Uh oh!

roborev-ci bot commented Apr 8, 2026

roborev: Combined Review (5771e7d)

High

Medium

Uh oh!

roborev-ci bot commented Apr 8, 2026

roborev: Combined Review (8fc23d2)

High

Medium

Uh oh!

roborev-ci bot commented Apr 8, 2026

roborev: Combined Review (9beb2e4)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

roborev: Combined Review (`c7c5d47`)

roborev: Combined Review (`5771e7d`)

roborev: Combined Review (`8fc23d2`)

roborev: Combined Review (`9beb2e4`)