feat: multi-table support and data dictionary by cpsievert · Pull Request #195 · posit-dev/querychat

cpsievert · 2026-01-15T01:42:44Z

For users

querychat has been single-table: one data frame, one schema, one conversation. If you had orders and customers and products, your options were to pre-join everything into one wide frame (losing the relational structure the LLM could reason about) or to spin up separate querychat sessions (losing the ability to ask cross-table questions). This PR removes that limitation.

You can now register multiple tables and the LLM will reason across all of them — joins, cross-table filters, aggregations:

qc = QueryChat(orders_df, "orders")
qc.add_table(customers_df, "customers")
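
To make the cross-table case concrete, here is a rough sketch of the kind of join the LLM could write once both tables are registered. It uses the standard library's sqlite3 as a stand-in for the real backend, and the table contents and column names are made up for illustration; none of it is querychat's actual API.

```python
import sqlite3

# Stand-in backend: stdlib sqlite3 rather than querychat's shared connection.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'West'), (2, 'East');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 75.0), (12, 2, 40.0);
    """
)

# A question like "total order amount per region" needs both tables:
# the amount lives in orders, the region lives in customers.
cross_table_sql = """
    SELECT c.region, SUM(o.amount) AS total_amount
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
"""
totals = conn.execute(cross_table_sql).fetchall()
print(totals)
```

With a single pre-joined frame this query would be a plain GROUP BY, but the relationship between the two tables would no longer be visible to the model.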

If your data lives in a SQLAlchemy engine (Python) or DBI connection (R), add_tables() bulk-registers all tables — or a named subset — in a single call, and builds the system prompt exactly once rather than once per table:

qc = QueryChat()
qc.add_tables(engine)                          # all tables
qc.add_tables(engine, ["orders", "customers"]) # specific subset

The LLM's SQL queries can reference any registered table. To consume per-table filter state in a Shiny app, call qc.server() and use table() on the returned values — each table gets its own reactive sql/df/title:

# In your Shiny server function:
qc_vals = qc.server()
qc_vals.table("orders").df()
qc_vals.table("customers").sql()

As a side note: qc_vals.current_table() (R: qc_vals$current_table()) is also available. It is a reactive that returns the name of the table the LLM most recently queried, or None/NULL before any query. This is useful when you want an output to follow whatever table the LLM just acted on without hard-coding a name.

Under the hood, a new QueryExecutor layer dispatches queries to the right backend. DataFrame sources (pandas, polars, pyarrow) get a shared DuckDB connection; Polars LazyFrames get a shared SQLContext; SQLAlchemy/Ibis tables use the existing shared backend directly. All tables in a session must be the same source type.

But multi-table introduces a scaling problem. The old design embedded full column statistics in the system prompt at startup — every conversation paid the cost of computing stats for every column of every table, whether the LLM needed them or not. With multiple tables, that cost multiplies. So schema fetching is now on-demand: the system prompt contains only a lightweight table overview, and the LLM calls the new querychat_get_schema tool to get column details for specific tables when it actually needs them.
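
A minimal sketch of the on-demand idea, assuming a hypothetical `LazySchemaCatalog` helper and stdlib sqlite3 as the backend (neither is part of querychat): the overview that goes into the prompt lists table names only, and per-column statistics are computed the first time a table's schema is requested.

```python
import sqlite3


class LazySchemaCatalog:
    """Hypothetical helper: cheap overview up front, column stats only on request."""

    def __init__(self, conn: sqlite3.Connection, tables: list[str]) -> None:
        self.conn = conn
        self.tables = tables
        self.stats_queries = 0  # counts the expensive queries actually run
        self._cache: dict[str, dict[str, tuple]] = {}

    def tables_overview(self) -> str:
        # Goes into the system prompt: names only, no per-column statistics.
        return "\n".join(f"- {name}" for name in self.tables)

    def get_schema(self, table: str) -> dict[str, tuple]:
        # Plays the role of the schema tool: runs only when the model asks.
        if table not in self.tables:
            raise KeyError(f"Unknown table: {table}")
        if table not in self._cache:
            info = self.conn.execute(f"PRAGMA table_info({table})").fetchall()
            stats = {}
            for row in info:
                col = row[1]  # second field of table_info is the column name
                self.stats_queries += 1
                stats[col] = self.conn.execute(
                    f'SELECT MIN("{col}"), MAX("{col}") FROM "{table}"'
                ).fetchone()
            self._cache[table] = stats
        return self._cache[table]


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER, amount REAL);
    CREATE TABLE customers (customer_id INTEGER, region TEXT);
    INSERT INTO orders VALUES (1, 10.0), (2, 30.0);
    INSERT INTO customers VALUES (1, 'West');
    """
)
catalog = LazySchemaCatalog(conn, ["orders", "customers"])
overview = catalog.tables_overview()
orders_schema = catalog.get_schema("orders")  # customers is never scanned
```

In this sketch a conversation that only touches orders pays for two stats queries instead of four; the saving grows with the number of registered tables.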

And when you have several tables, column names get ambiguous. What does status mean in the orders table vs. the customers table? A new DataDict type — integrating with the https://data-dict.tidyverse.org/ spec — lets you annotate tables and columns with plain-English descriptions, loaded from a YAML file:

QueryChat(data_dict="data_dict.yaml")

The schema tool uses these annotations directly, skipping the live stats query for any column that already has metadata. A fully-documented table incurs zero stats overhead. DataDict also supports relationships (JOIN hints for multi-table) and a glossary (domain term definitions), both surfaced in the system prompt so the LLM understands your domain language.

These three pieces — multi-table, on-demand schema, and data dictionary — aren't independent features. They're designed as layers: multi-table creates the need for smarter schema handling, on-demand fetching solves the cost problem, and DataDict solves the precision problem.

Breaking changes

| Before | After |
| --- | --- |
| `qc.data_source` | `qc.table("name").data_source` |
| `qc.data_source = new_df` | `qc.add_table(new_df, "name", replace=True)` |
| `qc.server(data_source=df)` (Shiny) | `qc.add_table(df, "name")` before `server()` |
| `qc.df()` / `qc.sql()` / `qc.title()` (multi-table) | `qc_vals.table("name").df()` etc. via server return |

Bookmark keys changed: Python uses flat per-table keys (querychat_sql_{name}, querychat_title_{name}); R uses a nested querychat_tables list. Old single-table flat keys (querychat_sql) are no longer written.

data_description is deprecated in favor of data_dict.

For reviewers

The code mirrors the same layered story. Each layer depends only on the ones below it, so reviewing bottom-up avoids forward references:

Layer 1 — The data contract (_datasource.py). DataSource gained two methods: get_column_metas() (cheap — column names and types via LIMIT 0) and populate_column_stats() (expensive — min/max/categories). This split is the foundation everything else builds on: it's what makes on-demand fetching possible and what lets DataDict selectively skip stats queries. ColumnMeta also gained a description field for DataDict annotations.
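
The cheap/expensive split can be illustrated with a small sketch. It assumes stdlib sqlite3 and a hypothetical `ColumnInfo` dataclass standing in for ColumnMeta; sqlite3's cursor description carries column names but not types, so only names are captured here.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnInfo:  # hypothetical stand-in for ColumnMeta
    name: str
    min_value: object = None
    max_value: object = None
    description: Optional[str] = None


def get_column_infos(conn: sqlite3.Connection, table: str) -> list:
    # Cheap: LIMIT 0 returns no rows, but the cursor still describes the columns.
    cursor = conn.execute(f'SELECT * FROM "{table}" LIMIT 0')
    return [ColumnInfo(name=d[0]) for d in cursor.description]


def populate_stats(conn: sqlite3.Connection, table: str, columns: list) -> None:
    # Expensive: one aggregate query over the whole table per column.
    for col in columns:
        col.min_value, col.max_value = conn.execute(
            f'SELECT MIN("{col.name}"), MAX("{col.name}") FROM "{table}"'
        ).fetchone()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 10.0), (2, 30.0);
    """
)
columns = get_column_infos(conn, "orders")
names_before_stats = [(c.name, c.min_value) for c in columns]
populate_stats(conn, "orders", columns)
```

Because the second step takes an explicit list of columns, a caller can pass only the subset it actually needs stats for.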

Layer 2 — Multi-table dispatch (_query_executor.py, new). A QueryExecutor sits between the tools and the data sources. It exposes the same get_column_metas(table) / populate_column_stats(table) / execute_query() surface but routes to the right backend. Three implementations: DataSourceExecutor for single-table or shared-backend multi-table (SQLAlchemy, Ibis); DuckDBExecutor for multi-table DataFrames (registers all tables in one in-memory DuckDB); and PolarsSQLExecutor for multi-table Polars LazyFrames (shared SQLContext). check_source_compatibility() enforces that all tables in a session share the same source type and backend.

Layer 3 — Column metadata (_data_dict.py, new). A Pydantic model hierarchy: DataDict → TableSpec → ColumnSpec. The key method, DataDict.get_table_schema(), merges static annotations onto live ColumnMeta objects and only sends undocumented columns to populate_column_stats(). This is where the "skip stats for documented columns" optimization lives.
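
As a rough illustration of the merge step, the sketch below attaches descriptions to documented columns and hands only the remainder to a stats callback. The `Column` dataclass, `merge_annotations`, and the callback signature are hypothetical; the real get_table_schema() may be structured differently.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Column:  # hypothetical stand-in for ColumnMeta
    name: str
    description: Optional[str] = None
    stats: Optional[dict] = None


def merge_annotations(
    columns: list,
    annotations: dict,
    fetch_stats: Callable[[list], None],
) -> list:
    """Attach descriptions where available; fetch stats only for the rest."""
    undocumented = []
    for col in columns:
        if col.name in annotations:
            col.description = annotations[col.name]
        else:
            undocumented.append(col)
    if undocumented:  # a fully documented table never triggers a stats query
        fetch_stats(undocumented)
    return columns


stats_requested = []


def fake_fetch_stats(cols: list) -> None:
    # Records which columns would have hit the database.
    for col in cols:
        stats_requested.append(col.name)
        col.stats = {"min": 0, "max": 1}


merged = merge_annotations(
    [Column("status"), Column("amount")],
    {"status": "Fulfilment state of the order"},
    fake_fetch_stats,
)
```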

Layer 4 — The schema tool (tools.py, prompts/tool-get-schema.md). querychat_get_schema dispatches to either DataDict.get_table_schema() or executor.get_schema() depending on whether a dict is present. It returns a GetSchemaResult subclass that renders a compact status line in Shiny chat rather than dumping the raw schema.

Layer 5 — Prompt rewiring (_system_prompt.py, prompts/prompt.md). The old {{schema}} Mustache variable is gone. Replaced by {{tables_overview}} (lightweight table list), {{relationships}}, {{glossary}}, and a {{multi_table}} flag that drives conditional instructions requiring table= parameters in tool calls.

Layer 6 — Orchestration (_querychat_base.py). The base class now holds dict[str, DataSource]. add_table() uses a staged-rebuild pattern: it builds the new system prompt and executor against a copy of the sources dict, committing only if both succeed, to avoid partial state on failure. add_tables() (Python: SQLAlchemy engine; R: DBI connection) bulk-stages all tables and calls build_system_prompt() exactly once, avoiding the N-1 spurious rebuilds that calling add_table() in a loop would cause. table(name) returns a config-only TableAccessor — data_source and table_name work, but df()/sql()/title() raise with guidance to use the server return value instead.
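
The staged-rebuild pattern described for add_table() can be sketched as follows. `TableRegistry` and its prompt builder are hypothetical simplifications; the point is only that derived state is built against a copy and nothing is assigned until the build succeeds.

```python
class TableRegistry:
    """Hypothetical registry illustrating stage-then-commit."""

    def __init__(self) -> None:
        self.sources: dict = {}
        self.prompt = ""

    def _build_prompt(self, sources: dict) -> str:
        # Stand-in for prompt/executor construction; may raise.
        for name, rows in sources.items():
            if not rows:
                raise ValueError(f"Table {name!r} is empty")
        return "Tables: " + ", ".join(sorted(sources))

    def add_table(self, name: str, rows: list, replace: bool = False) -> None:
        if name in self.sources and not replace:
            raise ValueError(f"Table {name!r} already exists")
        staged = dict(self.sources)  # work on a copy
        staged[name] = rows
        prompt = self._build_prompt(staged)  # raises before any state changes
        self.sources, self.prompt = staged, prompt  # commit


registry = TableRegistry()
registry.add_table("orders", [{"order_id": 1}])
try:
    registry.add_table("customers", [])  # fails during staging
except ValueError as err:
    failure = str(err)
```

After the failed call, the registry still holds only the orders table and the prompt built for it.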

Layer 7 — Shiny reactive state (_shiny_module.py). ServerValues has per-table state in tables: dict[str, TableState], each with its own reactive sql, title, and df. It also gained a table(name) method that returns a TableAccessor backed by the per-session TableState — this is the correct way to consume per-table reactive results. Also added: current_table(), which tracks the most recently LLM-touched table name across update_dashboard, reset_dashboard, chat_update, and bookmark restore. Single-table apps still get top-level .df, .sql, .title on ServerValues for backward compatibility; multi-table replaces them with _MultiTableBlockedReactive sentinels that raise with migration guidance. Bookmark keys are per-table (Python: flat querychat_sql_{name}; R: nested querychat_tables).

Layer 8 — Public accessors and framework adapters (_table_accessor.py new, _dash.py, _gradio.py, _streamlit.py). TableAccessor is a thin proxy that can be constructed in two modes: config-only (from qc.table(), raises on reactive methods) or state-backed (from qc_vals.table(), fully reactive). StateDictTableAccessor is the Dash/Gradio variant that always raises on reactive methods with instructions to use the state-dict API instead. Framework adapters gained table= parameters on df(), sql(), title().

Test plan

make py-check passes (format, types, tests)
Multi-table Shiny app: LLM correctly joins/filters across tables
Single-table apps still work with no API changes
DataDict: YAML loads, enriched descriptions reach the LLM, documented columns skip stats queries
Bookmark round-trip: new per-table keys save/restore correctly; old flat keys still load

- format_tool_result falls back to str(value) so querychat_get_schema output is visible in console/Streamlit/Gradio/Dash UIs
- cleanup_failed_staged_source now covers PinSource (DuckDB connection) in addition to DataFrameSource, preventing a resource leak on failed staged add/replace
- RuntimeError message in mod_server updated to reference add_table() instead of the removed data_source property

…ndering Multiple DataDicts can now be passed to QueryChat, each acting as a named domain namespace. The system prompt renders each dict as a named XML tag wrapping a YAML body, giving the LLM clear domain boundaries without the verbosity of raw XML or the ambiguity of a flat table list. DataDict gains optional name/description fields; from_yaml() derives name from the file stem when not set explicitly. A UserWarning nudges multi-table apps that have no dict toward providing one.

querychat.js was sending {query, title} but omitting `table` when the Apply Filter button was clicked. The Shiny handler silently ignored every click because update.get("table", "") returned "" (falsy). Adds a Playwright test that reproduces the failure before the fix.
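
The failure mode is easy to reproduce in isolation. The handler below is a hypothetical reduction of the Shiny side, not the actual code: it shows how a payload without a table key is dropped without any error.

```python
def apply_filter(update: dict, table_sql: dict) -> bool:
    """Hypothetical handler: returns True if the click changed any state."""
    table = update.get("table", "")
    if not table:
        # "" is falsy, so a payload without "table" is dropped silently.
        return False
    table_sql[table] = update.get("query", "")
    return True


state: dict = {}

# Payload shape before the fix: no "table" key.
ignored = apply_filter({"query": "SELECT * FROM orders", "title": "All"}, state)

# Payload shape after the fix: the client includes the target table.
applied = apply_filter(
    {"query": "SELECT * FROM orders", "title": "All", "table": "orders"}, state
)
```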

- Update prompt templates for multi-table support
- Add DataDict S3 constructors with YAML loading
- Add QueryExecutor private R6 classes for multi-table
- Add multi-table support to QueryChat, tools, and module
- Resolve integration issues

DataDict S3 constructors (data_dict, table_spec, column_spec, etc.) are removed in favour of read_data_dict() returning a plain YAML-parsed list. This avoids an API commitment to a still-evolving spec and removes ~250 lines of constructor/parser scaffolding. The get_schema tool now actually uses data_dicts: it finds the first dict whose tables key covers the requested table and passes that table spec to get_schema_impl, which bypasses live MIN/MAX queries for columns with a DataDict range and bypasses SELECT DISTINCT for columns with a DataDict values list. Description, type, units, and constraints from the dict are surfaced directly in the schema output. Prompt instruction for get_schema also strengthened to always call it before writing SQL rather than treating it as optional.

…ompt

- R add_table() now calls check_source_compatibility() before committing, matching Python's validation that prevents mixing incompatible source types
- Python tool_query() now passes multi_table to the prompt template so multi-table query guidance renders when multiple tables are registered

…n schema ColumnMeta now carries units and constraints fields. DataDict.get_table_schema copies type, units, and constraints from the ColumnSpec onto the ColumnMeta, and format_schema renders them — matching R's get_schema_impl output.

Replaces the private shinychat:::new_tool_card call with a plain shiny::tags$p(), matching the Python implementation's ChatMessage approach.

TableAccessor no longer reaches into QueryChat's private fields via .__enclos_env__. Instead it takes table_name, data_source, and an optional state list. Config-only accessors (from qc$table()) error on reactive methods with guidance to use the server return value.

The schema popover UI previously re-parsed the LLM-facing text representation of schema info to build its display. This required a fragile regex that broke on types with nested parens (e.g. number(id)). Now the server serializes structured column metadata (ColumnMeta in Python, named lists in R) directly to JSON and attaches it to the sentinel element as data-schema-json. The JS reads JSON and never touches the text. The text parser and its regex are gone. Also fixes XML attribute escaping for data-dict name/description fields, and tightens the RuntimeError message in _shiny_module.py.
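
A small sketch of the structured-metadata approach: column metadata is serialized to JSON and escaped before being placed in the data-schema-json attribute, so the client can parse it directly. The `schema_sentinel` function, the span element, and its class name are assumptions for illustration; only the attribute name comes from the description above.

```python
import html
import json
import re


def schema_sentinel(columns: list) -> str:
    """Hypothetical renderer: embed column metadata as an escaped JSON attribute."""
    payload = json.dumps(columns, separators=(",", ":"))
    # quote=True also escapes " and ', which matters inside an attribute value.
    escaped = html.escape(payload, quote=True)
    return f'<span class="schema-sentinel" data-schema-json="{escaped}"></span>'


columns = [
    {"name": "id", "type": "number(id)"},
    {"name": "note", "type": "text", "description": 'He said "hi" & <left>'},
]
markup = schema_sentinel(columns)

# What a client would do: read the attribute, unescape, parse JSON.
raw_attr = re.search(r'data-schema-json="([^"]*)"', markup).group(1)
decoded = json.loads(html.unescape(raw_attr))
```

Nested parentheses in a type such as number(id) need no special handling here, since nothing parses the type string.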

The schema tool executes quickly, so the <shiny-tool-request> element flashed briefly before disappearing. Other tools hide the request via <shiny-tool-result>'s useEffect, which runs in the same React render cycle as the result appearing. The schema tool bypasses ToolResultComponent entirely, so it relied on a separate hide_tool_request server action — which could arrive in a different render cycle and cause a visible flash. Fix: intercept the ContentToolRequest dispatch for querychat_get_schema in both Python and R and return empty/null content so the request element never renders in the first place.

- Escape HTML attributes (query, title, table) in Apply/Reset Filter buttons to prevent attribute injection
- Pass multi_table flag to tool_update_dashboard() and tool_visualize() so prompt templates render multi-table guidance
- Normalize empty strings to NULL/None in chat_update observers so Reset Filter button correctly clears state
- Remove @export/@keywords internal from read_data_dict (unexported internal)

Conflicts resolved:
- NEWS.md / CHANGELOG.md: kept both multi-table entries (branch) and PinSource entry (main)
- DataFrameSource.R: used new_dataframe_connection() helper from main, kept DuckDB cleanup method from branch
- QueryChat.R: added setBookmarkExclude(c("close_btn", "reset_query")) from main
- querychat_module.R: kept restore_ui=FALSE from branch, added setBookmarkExclude("chat_update") from main
- _shiny.py: kept _mark_server_initialized() from branch, added bookmark exclude for reset_query from main

Replace the mixin-based approach (which required ~10 type: ignore annotations for attributes the mixin couldn't own) with a proper subclass of QueryChatBase. Dash and Gradio QueryChat classes now use single inheritance from StateDictQueryChat instead of dual inheritance.

…rnings Instead of raising AttributeError, calling .df()/.sql()/.title() without a table name now warns once and delegates to the primary table. This gives users a migration path rather than an immediate hard break. Also fixes the stacklevel so warnings point at user call sites, removes the spurious warning on .set(), and replaces the _df_warned list trick with nonlocal.
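
The warn-once-and-delegate behaviour might look roughly like the sketch below. The `Values` class is hypothetical and uses plain lists instead of reactive data frames; the warning category is taken from the review reply further down.

```python
import warnings


class Values:
    """Hypothetical multi-table result holder."""

    def __init__(self, frames: dict, primary: str) -> None:
        self._frames = frames
        self._primary = primary
        self._warned = False

    def table_df(self, name: str):
        return self._frames[name]

    def df(self):
        if not self._warned:
            self._warned = True
            warnings.warn(
                "df() without a table name is deprecated in multi-table mode; "
                f"falling back to the primary table {self._primary!r}.",
                FutureWarning,
                stacklevel=2,  # attribute the warning to the caller of df()
            )
        return self._frames[self._primary]


vals = Values({"orders": [1, 2], "customers": [3]}, primary="orders")
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    first = vals.df()
    second = vals.df()  # no second warning
```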

Extract the shared require_all_columns check from DuckDBExecutor and PolarsSQLExecutor into a single static helper on the abstract base class.

Adds QueryChat$add_tables() (R) and QueryChatBase.add_tables() (Python) to register multiple tables from a single connection/engine in one call. Compared to calling add_table() in a loop, this builds the system prompt exactly once after all tables are staged, avoiding N-1 intermediate rebuilds. Includes validation, replace support, cleanup of displaced sources, and tests.
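
To see where the N-1 figure comes from, the sketch below counts prompt rebuilds for a loop of single adds versus one bulk call. `PromptRegistry` is a hypothetical stand-in that ignores validation, replace handling, and cleanup.

```python
class PromptRegistry:
    """Hypothetical registry that counts system-prompt rebuilds."""

    def __init__(self) -> None:
        self.sources: dict = {}
        self.prompt = ""
        self.prompt_builds = 0

    def _rebuild_prompt(self, sources: dict) -> str:
        self.prompt_builds += 1
        return "Tables: " + ", ".join(sorted(sources))

    def add_table(self, name: str, source: object) -> None:
        staged = {**self.sources, name: source}
        self.prompt = self._rebuild_prompt(staged)  # one rebuild per call
        self.sources = staged

    def add_tables(self, new_sources: dict) -> None:
        staged = {**self.sources, **new_sources}  # stage everything first
        self.prompt = self._rebuild_prompt(staged)  # then rebuild once
        self.sources = staged


tables = {"orders": object(), "customers": object(), "products": object()}

looped = PromptRegistry()
for name, source in tables.items():
    looped.add_table(name, source)

bulk = PromptRegistry()
bulk.add_tables(tables)

print(looped.prompt_builds, bulk.prompt_builds)
```

Both registries end with the same prompt; with three tables the loop performs three builds, two of which are immediately discarded.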

Prompt caching established over prior turns is silently invalidated when add_table() rebuilds the system prompt mid-conversation.

cpsievert

Thanks for the detailed review. I've addressed the near-term items in sections 2 and 3; thoughts on section 1 below.

Sections 3A–3D

All four simplification items landed:

905a8db — replaces StateDictAccessorMixin with StateDictQueryChat, eliminating all 15 type: ignore[attr-defined] comments
53a7788 — softens the flat-accessor error to FutureWarning with primary-table fallback, with session-scoped deduplication so Shiny reactives don't spam
f36db45 — extracts _validate_missing_columns to the QueryExecutor base, removing the ~19-line duplication
47d8fae — warns when add_table rebuilds the system prompt after a client already has chat history, covering both the _client_spec and _client_console paths

Section 2 — Bulk table registration

99e011d adds add_tables() for bulk registration in both R and Python, which handles the many-add_table-calls problem. Auto-discovery on construction is a natural next step I'd consider separately.

Section 1 — Coalescing source types into shared DuckDB

I've filed #256 to track loosening the class-identity check so PinSource and DataFrameSource can share a DuckDBExecutor — both already materialize an eager frame, so this is largely a routing change.

For steps 2 and 3 I'm more cautious:

Snapshot mode is hard to recommend as a general solution. Materializing a full remote table into local DuckDB memory at startup works for small tables and silently breaks for production ones, with no reliable way to gate on table size upfront.

Live attach is more interesting but comes with real costs: DuckDB's SQL dialect diverges from Postgres/MySQL enough to introduce compatibility risk (the LLM is already told what flavor to use, and "DuckDB mediating Postgres" isn't the same as talking to Postgres directly); ATTACH scanners only cover Postgres, MySQL, and SQLite, so Snowflake/Databricks users still fall back to snapshots; per-dialect connection-string translation from SQLAlchemy URLs is non-trivial glue; and relaxing enable_external_access is a meaningful security posture change that warrants its own rollout.

I'd keep the register_into contract extensible so steps 2/3 remain additive, but I'd want concrete user demand before pursuing remote-source coalescing.

…able Combined multi-table per-table reactive state (this branch) with the new greeting API and bookmark persistence (main). Resolutions:
- _shiny_module.py / querychat_module.R: replace has_greeted with current_greeting; keep per-table table_states; bookmark saves per-table SQL/title + greeting; restore restores both + calls set_greeting()
- _shiny.py: pass greeting=self.greeting to mod_ui() + keep _ensure_server_started()
- NEWS.md / CHANGELOG.md: include both improvement entries

Extract _df_for_source and _title_for_table helpers to eliminate duplicated logic in df() and title(). Combine nested with-statements in tests to fix SIM117.

cpsievert · 2026-06-22T22:14:06Z

@gadenbuie thanks for the review, ok to merge?

cpsievert marked this pull request as draft January 15, 2026 21:10

Base automatically changed from feat/py-ibis-source to main January 16, 2026 22:41

cpsievert had a problem deploying to pypi January 16, 2026 23:05 — with GitHub Actions Error

cpsievert force-pushed the feat/py-multi-table branch 2 times, most recently from 76e009c to 193c2e1 Compare January 26, 2026 17:39

cpsievert temporarily deployed to pypi January 26, 2026 17:39 — with GitHub Actions Inactive

cpsievert force-pushed the feat/py-multi-table branch from 193c2e1 to 96c1c1d Compare January 26, 2026 18:00

cpsievert temporarily deployed to pypi January 26, 2026 18:00 — with GitHub Actions Inactive

cpsievert force-pushed the feat/py-multi-table branch from 96c1c1d to 4654e57 Compare June 17, 2026 14:22

cpsievert changed the title ~~feat(pkg-py): add multi-table support~~ feat(pkg-py): multi-table support and data dictionary Jun 17, 2026

cpsievert force-pushed the feat/py-multi-table branch from 7f0b8d7 to 3c54a32 Compare June 17, 2026 17:33

cpsievert changed the title ~~feat(pkg-py): multi-table support and data dictionary~~ feat: multi-table support and data dictionary Jun 17, 2026

cpsievert added 4 commits June 17, 2026 16:48

feat: add multi-table support

1d5d103

feat: add DataDict for per-column metadata

6daa693

feat: add per-table accessor API

a7f9184

feat: show schema-fetch status and address review feedback

89361d8

cpsievert force-pushed the feat/py-multi-table branch from 9d5151e to 89361d8 Compare June 17, 2026 21:51

cpsievert requested a review from Copilot June 17, 2026 21:52

Copilot started reviewing on behalf of cpsievert June 17, 2026 21:52

cpsievert added 5 commits June 17, 2026 17:00

feat(r): add multi-table support

14b4d32

cpsievert force-pushed the feat/py-multi-table branch from 7bba78a to f03c0cb Compare June 18, 2026 15:31

cpsievert added 4 commits June 18, 2026 10:48

fix(r): render get_schema tool result as inline HTML, not a card

af34732

cpsievert and others added 8 commits June 19, 2026 14:17

fix remote

db01377

air format (GitHub Actions)

a46281d

usethis::use_tidy_description() (GitHub Actions)

686e71c

devtools::document() (GitHub Actions)

fcc4c71

air format (GitHub Actions)

c33ed26

fix test mock

eb61d09

cpsievert marked this pull request as ready for review June 19, 2026 20:55

cpsievert requested a review from Copilot June 19, 2026 20:55

Copilot started reviewing on behalf of cpsievert June 19, 2026 20:55

cpsievert and others added 2 commits June 19, 2026 16:23

devtools::document() (GitHub Actions)

9e4e7a6

cpsievert requested a review from gadenbuie June 22, 2026 16:19

cpsievert added 5 commits June 22, 2026 13:18

refactor(pkg-py): deduplicate missing-column validation in QueryExecutor

f36db45

fix(pkg-py): warn when system prompt rebuilds after chat history exists

47d8fae

cpsievert mentioned this pull request Jun 22, 2026

Allow mixing DataFrameSource and PinSource in multi-table sessions #256

Open

cpsievert commented Jun 22, 2026

cpsievert and others added 4 commits June 22, 2026 16:02

air format (GitHub Actions)

c750ec1

devtools::document() (GitHub Actions)

84cc802

fix(pkg-py): reduce return statement count to satisfy ruff PLR0911

ea9b977

cpsievert wants to merge 73 commits into main from feat/py-multi-table

3 participants
