Skip to content

Add SQL query interface and Claude Code skill#249

Merged
wesm merged 26 commits intomainfrom
0xdarkmatter/feat/claude-code-skill
Apr 8, 2026
Merged

Add SQL query interface and Claude Code skill#249
wesm merged 26 commits intomainfrom
0xdarkmatter/feat/claude-code-skill

Conversation

@wesm
Copy link
Copy Markdown
Owner

@wesm wesm commented Apr 8, 2026

Summary

Supersedes #236 with a different approach: instead of shelling out to the duckdb CLI via bash scripts, this builds a SQL view layer inside msgvault itself.

  • DuckDB views over Parquet files (RegisterViews), registered at engine startup — 8 base views + 5 convenience views (v_messages, v_senders, v_domains, v_labels, v_threads)
  • msgvault query CLI command with --format json|csv|table output
  • POST /api/v1/query HTTP endpoint on the existing serve daemon (503 when Parquet cache unavailable)
  • Claude Code skill that teaches agents to use msgvault query "SELECT ..."

No bash wrapper scripts, no external duckdb CLI dependency, no Parquet path knowledge leaked to consumers.

Changes

Component Files
View layer internal/query/views.go (base + convenience views, RegisterViews)
QuerySQL internal/query/duckdb.go (method + read-only validation + view wiring)
CLI cmd/msgvault/cmd/query.go
HTTP internal/api/handlers.go, internal/api/server.go
Skill skills/claude-code/SKILL.md, skills/claude-code/references/views.md
Tests internal/query/views_test.go, cmd/msgvault/cmd/query_test.go, internal/api/handlers_test.go

🤖 Generated with Claude Code

0xDarkMatter and others added 23 commits April 8, 2026 07:31
Adds a Claude Code skill for msgvault that covers the full CLI surface
and includes direct DuckDB queries against the Parquet analytics cache
for operations the CLI search can't handle (boolean logic, multi-domain,
aggregations, thread analysis).

Includes:
- SKILL.md with verified JSON output shapes, search strategy, safety rules
- scripts/query.sh helper wrapping common DuckDB patterns (9 subcommands)
- references/duckdb-queries.md with full Parquet schema and query patterns
- references/workflows.md with multi-step analysis patterns

Tested against a ~755k message archive. All documented commands, jq
patterns, and DuckDB queries verified against live data.

Ref: #230
… query

- Add input validation to all query.sh subcommands (integers, dates,
  domains, emails, labels) to prevent SQL injection via crafted arguments
- Fix senders subcommand to accept flags before or after optional limit
- Fix thread analysis workflow to use query.sh instead of search --json
  (search does not return to/cc fields)
- Guard all search-to-jq pipelines against non-JSON empty results
- Add note about sql subcommand passing input unvalidated
The & character in the bash regex character class caused a parse error.
Switched to denylist approach (reject single quotes, semicolons, backslashes)
which is more robust for label names containing special characters like &.
Security:
- Add duckdb binary existence check before running queries
- Tighten domain validation: reject underscores, require start/end
  with alphanumeric (closes injection via underscore identifiers)
- Add write-operation guard to sql subcommand: blocks DROP, DELETE,
  INSERT, UPDATE, CREATE, ALTER, COPY TO
- Add security note to SKILL.md about sql subcommand risks

Correctness:
- Replace shift || true with explicit guard (prevents masked errors)
- Add bounds check to validate_int (1-100000)

Completeness:
- Add build-cache and sync-full to SKILL.md Quick Reference
- Add MSGVAULT_HOME path note to duckdb-queries.md
- Document analytics cache prerequisite in DuckDB section
The write-operation guard used a case-sensitive blacklist that could be
bypassed with lowercase or mixed-case statements. Replace with a strict
allowlist that normalizes input to uppercase and only permits SELECT,
WITH, EXPLAIN, DESCRIBE, SHOW, and PRAGMA statements.
- Reject semicolons in sql subcommand input to prevent multi-statement
  bypass (e.g. "SELECT 1; DROP TABLE messages")
- Remove PRAGMA from allowlist (can modify DuckDB state)
- Clarify threads subcommand matches any participant role (from/to/cc/bcc)
  not just senders — this is intentional for "who else is on threads
  involving this person" use case. Updated help text to document this.
EXPLAIN ANALYZE executes the underlying statement, so allowing EXPLAIN
breaks the read-only guarantee. Remove EXPLAIN from the allowlist
entirely — agents rarely need it and can use DESCRIBE/SHOW instead.

Allowlist is now: SELECT, WITH, DESCRIBE, SHOW.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Drop HTTP endpoint from V1 scope (CLI-only), document security
  requirements for future remote access
- Fix connection model: require single-connection pinning for view
  registration, matching existing DuckDBEngine pattern
- Fix v_messages sender resolution to include dual-path logic
  (message_recipients + sender_id) and phone_number for chat sources
- Document serve fallback behavior as future HTTP concern

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add POST /api/v1/query endpoint back in scope (users responsible
  for securing their installations)
- Define labels/participant lists as JSON text via to_json(list(...))
  in view contract, matching existing DuckDB engine pattern
- 503 when Parquet cache unavailable (no SQLite fallback for query)
- HTTP handler reuses DuckDBEngine connection (views already registered)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
7 tasks: base views, convenience views, QuerySQL method,
CLI command, HTTP endpoint, Claude Code skill, final verification.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Create reusable DuckDB views over the 8 Parquet tables (messages,
participants, message_recipients, labels, message_labels, attachments,
conversations, sources). Each view normalises column types with CAST
and handles optional columns (attachment_count, sender_id, message_type,
phone_number, title, conversation_type, source_type) via probing and
COALESCE fallbacks for older cache files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ls, v_threads)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
v_threads: COUNT(*) overcounted because LEFT JOINs on
message_recipients and participants multiply rows per message.
Changed to COUNT(DISTINCT m.id).

v_senders: FIRST(mr.display_name) could return NULL/empty.
Now uses COALESCE(NULLIF(TRIM(...), ''), email_address) to
guarantee a non-empty from_name.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Register SQL views automatically during NewDuckDBEngine so they are
available without a separate RegisterViews call. Add QuerySQL method
and SQLQuerier interface for raw SQL access over the views. Deduplicate
probeParquetColumns by delegating to the standalone probeColumns.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
NewDuckDBEngine probed Parquet schemas to populate engine.optionalCols,
then called RegisterViews which probed the same schemas again inside
createBaseViews. Extract probeAllOptionalColumns and RegisterViewsWithColumns
so NewDuckDBEngine can pass its already-computed map and skip the
redundant I/O.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds handleQuery handler that type-asserts s.engine to
query.SQLQuerier at runtime. DuckDB engines support it;
SQLite engines return 503. Route registered alongside
other engine-dependent endpoints in the authenticated
API v1 group.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the previous ops-heavy skill with a thin query-focused skill
that teaches Claude to use `msgvault query` with SQL views. Add a full
schema reference in references/views.md derived from views.go.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add union_by_name=true to probeColumns for mixed-schema Parquet caches
- Guard QuerySQL with read-only prefix check (SELECT/WITH/DESCRIBE/EXPLAIN)
- Normalize SQL NULLs to empty string in scanRow for clean CSV/table output
- Fix v_senders from_name to prefer mr.display_name (consistent with v_messages)
- Fix SKILL.md "Large attachments" example to use v_messages (has from_email)
- Make views.md type descriptions open-ended for message_type, conversation_type, source_type
- Add tests for DDL rejection, v_senders from_name, and exact v_threads message_count

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@roborev-ci
Copy link
Copy Markdown

roborev-ci bot commented Apr 8, 2026

roborev: Combined Review (c7c5d47)

Verdict: Changes introduce a read-only query bypass and a couple of correctness regressions that should be addressed before merge.

High

  • Read-only SQL guard is bypassable
    Locations: internal/query/duckdb.go:190, internal/query/duckdb.go:228, internal/query/duckdb.go:233, internal/api/handlers.go:631
    The read-only check relies on prefix matching, which is not sufficient to guarantee safety for attacker-controlled SQL sent through POST /api/v1/query. Queries such as SELECT 1; DROP VIEW messages can bypass the intended restriction if multi-statement execution is accepted, and EXPLAIN ANALYZE ... can execute the wrapped statement rather than just describe it. This breaks the documented read-only contract and can mutate the shared DuckDB session.
    Suggested fix: Validate the full parsed statement list rather than the first keyword, reject multi-statement input, remove or tightly constrain EXPLAIN, and add regression tests covering semicolon-separated statements and EXPLAIN ANALYZE bypasses.

Medium

  • CLI query path loses SQL NULL semantics
    Location: cmd/msgvault/cmd/query.go:129
    scanRow now converts every SQL NULL into "", so nullable fields and expressions no longer round-trip correctly in JSON/CSV/table output. That changes result meaning for callers by collapsing null and empty string into the same value.
    Suggested fix: Preserve nil during scanning and handle display formatting per output mode instead of normalizing at scan time.

  • v_senders / v_domains undercount senders by ignoring the sender_id fallback path
    Locations: internal/query/views.go:409, internal/query/views.go:431
    These views aggregate only message_recipients rows where recipient_type = 'from', so messages that rely on the messages.sender_id fallback already used in v_messages are omitted. That makes sender/domain analytics inconsistent with the rest of the view layer and undercounts chat-style data.
    Suggested fix: Build these aggregates from v_messages or union in the sender_id fallback before grouping.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Match the TUI pattern: check cacheNeedsBuild and auto-rebuild before
querying instead of telling the user to run build-cache manually.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@roborev-ci
Copy link
Copy Markdown

roborev-ci bot commented Apr 8, 2026

roborev: Combined Review (5771e7d)

Verdict: Changes are not ready to merge; the new query surface has two high-severity read-only bypasses and two medium-severity correctness regressions.

High

  • Read-only bypass via EXPLAIN ANALYZE
    Location: internal/query/duckdb.go:192, internal/query/duckdb.go:236
    QuerySQL/isReadOnlySQL allows statements starting with EXPLAIN, but in DuckDB EXPLAIN ANALYZE executes the wrapped statement. That lets /api/v1/query run mutating SQL against the shared session and regresses the prior read-only hardening.
    Fix: Remove EXPLAIN from the allowlist, or at minimum block EXPLAIN ANALYZE, and add a regression test covering destructive input behind EXPLAIN ANALYZE.

  • Multi-statement SQL bypass
    Location: internal/query/duckdb.go (isReadOnlySQL)
    The read-only check only validates the prefix of the full SQL string. A query like SELECT 1; DROP VIEW messages can pass the initial check and then execute a destructive trailing statement.
    Fix: Reject semicolons / multiple statements, or parse and validate every statement in the SQL input.

Medium

  • Sender/domain analytics drop messages that resolve sender via messages.sender_id
    Location: internal/query/views.go:423, internal/query/views.go:446
    v_senders and v_domains only aggregate message_recipients rows with recipient_type = 'from'. That excludes non-email/chat messages whose sender comes from messages.sender_id, even though v_messages already supports that fallback, so analytics are incomplete for mixed archives.
    Fix: Reuse the same dual-path sender resolution as v_messages, or union in the messages.sender_id path, and add coverage for chat-style messages.

  • CLI query JSON output collapses SQL NULL into empty string
    Location: cmd/msgvault/cmd/query.go:153
    scanRow rewrites all SQL NULL values to "", which changes result semantics and makes NULL indistinguishable from an actual empty string in default JSON output.
    Fix: Preserve nil for JSON results and only apply display formatting in CSV/table renderers.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

wesm and others added 2 commits April 8, 2026 11:34
- JSON output now preserves nil (SQL NULL) instead of collapsing to
  empty string. CSV/table still display NULL as empty.
- Remove isReadOnlySQL allowlist — users with CLI/API access are
  privileged and the prefix check was bypassable anyway.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CLI/API users are privileged — don't flag SQL injection,
DDL bypass, or statement validation on the query interface.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@roborev-ci
Copy link
Copy Markdown

roborev-ci bot commented Apr 8, 2026

roborev: Combined Review (8fc23d2)

Verdict: Changes introduce a high-severity raw-SQL execution path in the API that enables server-side tampering and unintended DuckDB access.

High

  • Raw SQL over /api/v1/query is not restricted to a read-only subset
    Files: internal/api/server.go (added /query route), internal/api/handlers.go:643, internal/query/duckdb.go:187
    The new endpoint forwards attacker-controlled SQL to DuckDB without enforcing a single-statement, read-only query shape. As implemented, callers can issue non-SELECT statements and DuckDB-specific primitives (COPY, ATTACH, INSTALL, LOAD, PRAGMA, SET, DDL/DML), which expands this from an analytics endpoint into arbitrary database/file access from the server process. This is also a regression from the previous read-only guard.
    Fix: Restrict the API to parsed, single-statement SELECT/WITH queries over an explicit allowlist of views/functions, and reject non-read-only or multi-statement input. If raw SQL must remain, keep it local-only and behind explicit opt-in.

  • API queries run on the shared long-lived DuckDB session, allowing persistent state tampering and DoS
    Files: internal/api/handlers.go:643, internal/query/duckdb.go:187
    Requests execute against the same reused DuckDB session, so statements like DROP VIEW messages, CREATE OR REPLACE VIEW ..., SET threads = 1, or temp-object creation can alter behavior for subsequent requests until restart. Because the engine is single-connection, expensive queries can also pin the only connection and degrade the entire service.
    Fix: Execute API SQL on a fresh per-request DuckDB connection/session initialized only with approved views, or enforce a disposable read-only context. Add test coverage for rejected mutating statements and consider query/result/time limits.

Medium

  • v_senders and v_domains omit chat/non-email senders
    Files: internal/query/views.go:423, internal/query/views.go:446
    These aggregates only derive senders from message_recipients rows where recipient_type = 'from'. Chat and other non-email messages use messages.sender_id instead, so they disappear from both views even though v_messages already supports that fallback.
    Fix: Reuse the same sender-resolution logic as v_messages, or union in a messages.sender_id path before grouping. Add tests for chat-only messages.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@roborev-ci
Copy link
Copy Markdown

roborev-ci bot commented Apr 8, 2026

roborev: Combined Review (9beb2e4)

Code review verdict: no medium-or-higher issues found.

All provided reviews agree the diff is clean at Medium, High, and Critical severity.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@wesm wesm merged commit 479e309 into main Apr 8, 2026
3 of 4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants