Add SQL query interface and Claude Code skill #249
Adds a Claude Code skill for msgvault that covers the full CLI surface and includes direct DuckDB queries against the Parquet analytics cache for operations the CLI search can't handle (boolean logic, multi-domain, aggregations, thread analysis).

Includes:
- SKILL.md with verified JSON output shapes, search strategy, safety rules
- scripts/query.sh helper wrapping common DuckDB patterns (9 subcommands)
- references/duckdb-queries.md with full Parquet schema and query patterns
- references/workflows.md with multi-step analysis patterns

Tested against a ~755k message archive. All documented commands, jq patterns, and DuckDB queries verified against live data.

Ref: #230
… query
- Add input validation to all query.sh subcommands (integers, dates, domains, emails, labels) to prevent SQL injection via crafted arguments
- Fix senders subcommand to accept flags before or after optional limit
- Fix thread analysis workflow to use query.sh instead of search --json (search does not return to/cc fields)
- Guard all search-to-jq pipelines against non-JSON empty results
- Add note about sql subcommand passing input unvalidated
The & character in the bash regex character class caused a parse error. Switched to a denylist approach (reject single quotes, semicolons, backslashes), which is more robust for label names containing special characters like &.
Security:
- Add duckdb binary existence check before running queries
- Tighten domain validation: reject underscores, require start/end with alphanumeric (closes injection via underscore identifiers)
- Add write-operation guard to sql subcommand: blocks DROP, DELETE, INSERT, UPDATE, CREATE, ALTER, COPY TO
- Add security note to SKILL.md about sql subcommand risks

Correctness:
- Replace shift || true with explicit guard (prevents masked errors)
- Add bounds check to validate_int (1-100000)

Completeness:
- Add build-cache and sync-full to SKILL.md Quick Reference
- Add MSGVAULT_HOME path note to duckdb-queries.md
- Document analytics cache prerequisite in DuckDB section
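The domain and integer rules listed above can be sketched in Go. This is only an illustration of the stated rules, not the code in scripts/query.sh (which is bash); the names `validDomain`, `validLimit`, and the exact regular expression are assumptions made for the example.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// domainRe is a hypothetical pattern for the rules in the commit message:
// only letters, digits, dots and hyphens; no underscores; the first and
// last characters must be alphanumeric.
var domainRe = regexp.MustCompile(`^[A-Za-z0-9]([A-Za-z0-9.-]*[A-Za-z0-9])?$`)

// validDomain reports whether s passes the coarse domain check.
func validDomain(s string) bool {
	return domainRe.MatchString(s)
}

// validLimit parses s as an integer and enforces the 1-100000 bounds.
func validLimit(s string) (int, error) {
	n, err := strconv.Atoi(s)
	if err != nil {
		return 0, fmt.Errorf("not an integer: %q", s)
	}
	if n < 1 || n > 100000 {
		return 0, fmt.Errorf("out of range (1-100000): %d", n)
	}
	return n, nil
}

func main() {
	fmt.Println(validDomain("example.com"))    // accepted
	fmt.Println(validDomain("bad_domain.com")) // rejected: underscore
	_, err := validLimit("0")
	fmt.Println(err) // out-of-range error
}
```

The pattern is deliberately coarse: it does not check label lengths or consecutive dots, only the properties the commit message names.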
The write-operation guard used a case-sensitive blacklist that could be bypassed with lowercase or mixed-case statements. Replace with a strict allowlist that normalizes input to uppercase and only permits SELECT, WITH, EXPLAIN, DESCRIBE, SHOW, and PRAGMA statements.
- Reject semicolons in sql subcommand input to prevent multi-statement bypass (e.g. "SELECT 1; DROP TABLE messages")
- Remove PRAGMA from allowlist (can modify DuckDB state)
- Clarify threads subcommand matches any participant role (from/to/cc/bcc), not just senders. This is intentional for the "who else is on threads involving this person" use case. Updated help text to document this.
EXPLAIN ANALYZE executes the underlying statement, so allowing EXPLAIN breaks the read-only guarantee. Remove EXPLAIN from the allowlist entirely — agents rarely need it and can use DESCRIBE/SHOW instead. Allowlist is now: SELECT, WITH, DESCRIBE, SHOW.
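A minimal Go sketch of the guard as it stands after these commits: normalize to uppercase, reject semicolons, and allow only SELECT, WITH, DESCRIBE, and SHOW as the leading keyword. The function name `isReadOnly` is hypothetical, and the original guard lives in a bash script, so this shows the logic rather than the actual implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// allowed holds the statement keywords from the final allowlist.
var allowed = []string{"SELECT", "WITH", "DESCRIBE", "SHOW"}

// isReadOnly reports whether sql passes the coarse read-only check:
// no semicolons anywhere, and an allowlisted first keyword.
func isReadOnly(sql string) bool {
	// Rejecting every semicolon blocks "SELECT 1; DROP TABLE messages",
	// at the cost of also rejecting semicolons inside string literals.
	if strings.Contains(sql, ";") {
		return false
	}
	// Uppercasing first makes the keyword comparison case-insensitive.
	fields := strings.Fields(strings.ToUpper(sql))
	if len(fields) == 0 {
		return false
	}
	for _, kw := range allowed {
		if fields[0] == kw {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isReadOnly("select count(*) from messages"))
	fmt.Println(isReadOnly("SELECT 1; DROP TABLE messages"))
	fmt.Println(isReadOnly("explain analyze delete from messages"))
}
```

A prefix check like this only inspects the first keyword, so it should be read as a coarse filter rather than a full SQL parser.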
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Drop HTTP endpoint from V1 scope (CLI-only), document security requirements for future remote access
- Fix connection model: require single-connection pinning for view registration, matching existing DuckDBEngine pattern
- Fix v_messages sender resolution to include dual-path logic (message_recipients + sender_id) and phone_number for chat sources
- Document serve fallback behavior as future HTTP concern

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add POST /api/v1/query endpoint back in scope (users responsible for securing their installations)
- Define labels/participant lists as JSON text via to_json(list(...)) in view contract, matching existing DuckDB engine pattern
- 503 when Parquet cache unavailable (no SQLite fallback for query)
- HTTP handler reuses DuckDBEngine connection (views already registered)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
7 tasks: base views, convenience views, QuerySQL method, CLI command, HTTP endpoint, Claude Code skill, final verification. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Create reusable DuckDB views over the 8 Parquet tables (messages, participants, message_recipients, labels, message_labels, attachments, conversations, sources). Each view normalises column types with CAST and handles optional columns (attachment_count, sender_id, message_type, phone_number, title, conversation_type, source_type) via probing and COALESCE fallbacks for older cache files. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
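One way to picture the optional-column handling is a helper that emits a different SELECT expression depending on whether a probe found the column. The sketch below is an assumption-laden illustration: the helper name `columnExpr`, the column types, and the fallback values are invented for the example and are not taken from views.go.

```go
package main

import "fmt"

// columnExpr returns the SELECT expression for one optional column.
// present is the result of probing the Parquet schema (hypothetical shape).
// When the column exists it is cast and given a COALESCE fallback; when it
// is missing (older cache files) the fallback literal stands in for it.
func columnExpr(present map[string]bool, col, sqlType, fallback string) string {
	if present[col] {
		return fmt.Sprintf("COALESCE(CAST(%s AS %s), %s) AS %s", col, sqlType, fallback, col)
	}
	return fmt.Sprintf("CAST(%s AS %s) AS %s", fallback, sqlType, col)
}

func main() {
	// Pretend the probe found attachment_count but not message_type.
	probed := map[string]bool{"attachment_count": true}

	fmt.Println(columnExpr(probed, "attachment_count", "INTEGER", "0"))
	fmt.Println(columnExpr(probed, "message_type", "VARCHAR", "''"))
}
```

Either branch yields a column with the same name and type, so queries against the view do not need to know which cache version they are reading.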
…ls, v_threads) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
v_threads: COUNT(*) overcounted because LEFT JOINs on message_recipients and participants multiply rows per message. Changed to COUNT(DISTINCT m.id).

v_senders: FIRST(mr.display_name) could return NULL/empty. Now uses COALESCE(NULLIF(TRIM(...), ''), email_address) to guarantee a non-empty from_name.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
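Both fixes can be modelled in plain Go to show why they matter. In this sketch the joined result set is a slice with one row per (message, recipient) pair, and SQL NULL is approximated by the empty string; the type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// joinedRow models one row produced by joining messages to recipients:
// a message with three recipients appears three times.
type joinedRow struct {
	MessageID int64
	Recipient string
}

// countRows mirrors COUNT(*): it counts joined rows, not messages.
func countRows(rows []joinedRow) int {
	return len(rows)
}

// countDistinctMessages mirrors COUNT(DISTINCT m.id).
func countDistinctMessages(rows []joinedRow) int {
	seen := make(map[int64]struct{})
	for _, r := range rows {
		seen[r.MessageID] = struct{}{}
	}
	return len(seen)
}

// fromName mirrors COALESCE(NULLIF(TRIM(display_name), ''), email_address),
// with NULL approximated by the empty string.
func fromName(displayName, email string) string {
	if n := strings.TrimSpace(displayName); n != "" {
		return n
	}
	return email
}

func main() {
	rows := []joinedRow{
		{1, "a@example.com"}, {1, "b@example.com"}, {1, "c@example.com"},
		{2, "a@example.com"},
	}
	fmt.Println(countRows(rows), countDistinctMessages(rows))
	fmt.Println(fromName("   ", "sender@example.com"))
}
```

With two messages and four recipient rows, the row count and the distinct message count diverge, which is the overcount the commit describes.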
Register SQL views automatically during NewDuckDBEngine so they are available without a separate RegisterViews call. Add QuerySQL method and SQLQuerier interface for raw SQL access over the views. Deduplicate probeParquetColumns by delegating to the standalone probeColumns. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
NewDuckDBEngine probed Parquet schemas to populate engine.optionalCols, then called RegisterViews which probed the same schemas again inside createBaseViews. Extract probeAllOptionalColumns and RegisterViewsWithColumns so NewDuckDBEngine can pass its already-computed map and skip the redundant I/O. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds handleQuery handler that type-asserts s.engine to query.SQLQuerier at runtime. DuckDB engines support it; SQLite engines return 503. Route registered alongside other engine-dependent endpoints in the authenticated API v1 group. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
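The runtime type assertion described above might look roughly like the following. Everything here is a stand-in: the `Engine` interface, the `QuerySQL` signature, the `{"sql": ...}` request body, and the error messages are assumptions for the sketch, not msgvault's actual API.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// Engine is a stand-in for the server's general engine interface.
type Engine interface{ Name() string }

// SQLQuerier is the optional capability only some engines provide.
// The signature is hypothetical.
type SQLQuerier interface {
	QuerySQL(sql string) ([]map[string]any, error)
}

type duckEngine struct{}

func (duckEngine) Name() string { return "duckdb" }
func (duckEngine) QuerySQL(sql string) ([]map[string]any, error) {
	return []map[string]any{{"n": 1}}, nil // canned result for the sketch
}

type sqliteEngine struct{} // has no QuerySQL method

func (sqliteEngine) Name() string { return "sqlite" }

type Server struct{ engine Engine }

// handleQuery returns 503 unless the engine also implements SQLQuerier.
func (s *Server) handleQuery(w http.ResponseWriter, r *http.Request) {
	q, ok := s.engine.(SQLQuerier)
	if !ok {
		http.Error(w, "SQL queries are unavailable on this engine", http.StatusServiceUnavailable)
		return
	}
	var req struct {
		SQL string `json:"sql"`
	}
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, "invalid JSON body", http.StatusBadRequest)
		return
	}
	rows, err := q.QuerySQL(req.SQL)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(rows)
}

// status runs one request through the handler and returns the HTTP code.
func status(e Engine) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/api/v1/query", strings.NewReader(`{"sql":"SELECT 1"}`))
	(&Server{engine: e}).handleQuery(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(status(duckEngine{}), status(sqliteEngine{}))
}
```

Checking the capability at request time keeps the route registered for every engine while letting engines without SQL support fail with a clear status.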
Replace the previous ops-heavy skill with a thin query-focused skill that teaches Claude to use `msgvault query` with SQL views. Add a full schema reference in references/views.md derived from views.go. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add union_by_name=true to probeColumns for mixed-schema Parquet caches
- Guard QuerySQL with read-only prefix check (SELECT/WITH/DESCRIBE/EXPLAIN)
- Normalize SQL NULLs to empty string in scanRow for clean CSV/table output
- Fix v_senders from_name to prefer mr.display_name (consistent with v_messages)
- Fix SKILL.md "Large attachments" example to use v_messages (has from_email)
- Make views.md type descriptions open-ended for message_type, conversation_type, source_type
- Add tests for DDL rejection, v_senders from_name, and exact v_threads message_count

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
roborev: Combined Review (
Match the TUI pattern: check cacheNeedsBuild and auto-rebuild before querying instead of telling the user to run build-cache manually. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- JSON output now preserves nil (SQL NULL) instead of collapsing to empty string. CSV/table still display NULL as empty.
- Remove isReadOnlySQL allowlist: users with CLI/API access are privileged and the prefix check was bypassable anyway.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
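The per-format NULL handling can be illustrated with two small helpers: one that renders a cell for CSV/table output and one that builds a JSON row. The names `displayCell` and `jsonRow` are hypothetical, and a scanned SQL NULL is assumed to arrive as a Go `nil`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// displayCell renders a value for CSV/table output, where NULL shows as empty.
func displayCell(v any) string {
	if v == nil {
		return ""
	}
	return fmt.Sprint(v)
}

// jsonRow keeps nil as-is so that it is encoded as JSON null.
func jsonRow(cols []string, vals []any) (string, error) {
	if len(cols) != len(vals) {
		return "", fmt.Errorf("got %d columns but %d values", len(cols), len(vals))
	}
	row := make(map[string]any, len(cols))
	for i, c := range cols {
		row[c] = vals[i]
	}
	b, err := json.Marshal(row)
	return string(b), err
}

func main() {
	cols := []string{"subject", "from_name"}
	vals := []any{"hi", nil} // from_name is SQL NULL

	out, _ := jsonRow(cols, vals)
	fmt.Println(out)                         // from_name appears as null
	fmt.Printf("%q\n", displayCell(vals[1])) // empty string for CSV/table
}
```

Keeping `null` in JSON lets consumers distinguish a missing value from an empty string, while the text formats stay uncluttered.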
CLI/API users are privileged — don't flag SQL injection, DDL bypass, or statement validation on the query interface. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Supersedes #236 with a different approach: instead of shelling out to the `duckdb` CLI via bash scripts, this builds a SQL view layer inside msgvault itself.
- SQL view layer (`RegisterViews`), registered at engine startup: 8 base views + 5 convenience views (v_messages, v_senders, v_domains, v_labels, v_threads)
- `msgvault query` CLI command with `--format json|csv|table` output
- `POST /api/v1/query` HTTP endpoint on the existing serve daemon (503 when Parquet cache unavailable)
- Claude Code skill built around `msgvault query "SELECT ..."`

No bash wrapper scripts, no external `duckdb` CLI dependency, no Parquet path knowledge leaked to consumers.

Changes
- `internal/query/views.go` (base + convenience views, `RegisterViews`)
- `internal/query/duckdb.go` (`QuerySQL` method + read-only validation + view wiring)
- `cmd/msgvault/cmd/query.go`
- `internal/api/handlers.go`, `internal/api/server.go`
- `skills/claude-code/SKILL.md`, `skills/claude-code/references/views.md`
- `internal/query/views_test.go`, `cmd/msgvault/cmd/query_test.go`, `internal/api/handlers_test.go`

🤖 Generated with Claude Code