feat: multi-table support and data dictionary#195

Open
cpsievert wants to merge 73 commits into main from feat/py-multi-table

@cpsievert cpsievert commented Jan 15, 2026

Contributor

For users

querychat has been single-table: one data frame, one schema, one conversation. If you had orders and customers and products, your options were to pre-join everything into one wide frame (losing the relational structure the LLM could reason about) or to spin up separate querychat sessions (losing the ability to ask cross-table questions). This PR removes that limitation.

You can now register multiple tables and the LLM will reason across all of them — joins, cross-table filters, aggregations:

qc = QueryChat(orders_df, "orders")
qc.add_table(customers_df, "customers")

If your data lives in a SQLAlchemy engine (Python) or DBI connection (R), add_tables() bulk-registers all tables — or a named subset — in a single call, and builds the system prompt exactly once rather than once per table:

qc = QueryChat()
qc.add_tables(engine)                          # all tables
qc.add_tables(engine, ["orders", "customers"]) # specific subset

The LLM's SQL queries can reference any registered table. To consume per-table filter state in a Shiny app, call qc.server() and use table() on the returned values — each table gets its own reactive sql/df/title:

# In your Shiny server function:
qc_vals = qc.server()
qc_vals.table("orders").df()
qc_vals.table("customers").sql()

qc_vals.current_table() (R: qc_vals$current_table()) is also available: a reactive that returns the name of the table the LLM most recently queried, or None/NULL before any query. It is useful when an output should follow whatever table the LLM just acted on without hard-coding a name.

Under the hood, a new QueryExecutor layer dispatches queries to the right backend. DataFrame sources (pandas, polars, pyarrow) get a shared DuckDB connection; Polars LazyFrames get a shared SQLContext; SQLAlchemy/Ibis tables use the existing shared backend directly. All tables in a session must be the same source type.

But multi-table introduces a scaling problem. The old design embedded full column statistics in the system prompt at startup — every conversation paid the cost of computing stats for every column of every table, whether the LLM needed them or not. With multiple tables, that cost multiplies. So schema fetching is now on-demand: the system prompt contains only a lightweight table overview, and the LLM calls the new querychat_get_schema tool to get column details for specific tables when it actually needs them.
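
As a rough illustration of the on-demand design described above, the sketch below keeps the prompt to a one-line-per-table overview and defers column details to a tool call. The names `tables_overview`, `SchemaTool`, and `describe` are hypothetical stand-ins for illustration, not querychat's actual API.

```python
from typing import Callable, Dict, List


def tables_overview(tables: Dict[str, List[str]]) -> str:
    """Cheap prompt section: one line per table, no column statistics."""
    return "\n".join(
        f"- {name} ({len(cols)} columns)" for name, cols in sorted(tables.items())
    )


class SchemaTool:
    """Hypothetical stand-in for a schema tool: details are fetched per request."""

    def __init__(self, describe: Callable[[str], str]) -> None:
        self._describe = describe  # the expensive per-table lookup
        self.requested: List[str] = []  # records which tables were actually fetched

    def get_schema(self, table: str) -> str:
        # Nothing is computed until the model asks for a specific table.
        self.requested.append(table)
        return self._describe(table)
```

With this shape, a conversation that only ever touches `orders` never pays for column details of the other tables.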

And when you have several tables, column names get ambiguous. What does status mean in the orders table vs. the customers table? A new DataDict type — integrating with the https://data-dict.tidyverse.org/ spec — lets you annotate tables and columns with plain-English descriptions, loaded from a YAML file:

QueryChat(data_dict="data_dict.yaml")

The schema tool uses these annotations directly, skipping the live stats query for any column that already has metadata. A fully-documented table incurs zero stats overhead. DataDict also supports relationships (JOIN hints for multi-table) and a glossary (domain term definitions), both surfaced in the system prompt so the LLM understands your domain language.

These three pieces — multi-table, on-demand schema, and data dictionary — aren't independent features. They're designed as layers: multi-table creates the need for smarter schema handling, on-demand fetching solves the cost problem, and DataDict solves the precision problem.

Breaking changes

| Before | After |
| --- | --- |
| qc.data_source | qc.table("name").data_source |
| qc.data_source = new_df | qc.add_table(new_df, "name", replace=True) |
| qc.server(data_source=df) (Shiny) | qc.add_table(df, "name") before server() |
| qc.df() / qc.sql() / qc.title() (multi-table) | qc_vals.table("name").df() etc. via server return |

Bookmark keys changed: Python uses flat per-table keys (querychat_sql_{name}, querychat_title_{name}); R uses a nested querychat_tables dict. Old single-table flat keys (querychat_sql) are no longer written.
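
To make the Python key layout concrete, here is a minimal sketch of saving and restoring per-table state under the flat `querychat_sql_{name}` / `querychat_title_{name}` keys. The helper names and the plain-dict state shape are assumptions for illustration; the real bookmark hooks live in the Shiny module.

```python
from typing import Dict, Iterable, Optional

# Assumed state shape for this sketch: {"sql": ..., "title": ...} per table.
TableState = Dict[str, Optional[str]]


def save_bookmark(states: Dict[str, TableState]) -> Dict[str, Optional[str]]:
    """Flatten per-table state into one key per table and field."""
    values: Dict[str, Optional[str]] = {}
    for name, state in states.items():
        values[f"querychat_sql_{name}"] = state["sql"]
        values[f"querychat_title_{name}"] = state["title"]
    return values


def restore_bookmark(
    values: Dict[str, Optional[str]], table_names: Iterable[str]
) -> Dict[str, TableState]:
    """Rebuild per-table state; tables missing from the bookmark restore as None."""
    return {
        name: {
            "sql": values.get(f"querychat_sql_{name}"),
            "title": values.get(f"querychat_title_{name}"),
        }
        for name in table_names
    }
```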

data_description is deprecated in favor of data_dict.


For reviewers

The code mirrors the same layered story. Each layer depends only on the ones below it, so reviewing bottom-up avoids forward references:

Layer 1 — The data contract (_datasource.py). DataSource gained two methods: get_column_metas() (cheap — column names and types via LIMIT 0) and populate_column_stats() (expensive — min/max/categories). This split is the foundation everything else builds on: it's what makes on-demand fetching possible and what lets DataDict selectively skip stats queries. ColumnMeta also gained a description field for DataDict annotations.
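
The cheap/expensive split can be sketched against the standard-library `sqlite3` module. This is only an illustration under stated assumptions: the `ColumnMeta` fields and both function bodies are hypothetical, and because `sqlite3` leaves type codes empty in `cursor.description`, the sketch reads declared types from `PRAGMA table_info` rather than from the `LIMIT 0` query itself.

```python
import sqlite3
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ColumnMeta:
    # Hypothetical fields; the real ColumnMeta may differ.
    name: str
    sql_type: str
    description: Optional[str] = None
    min_val: object = None
    max_val: object = None
    categories: Optional[list] = None


def get_column_metas(conn: sqlite3.Connection, table: str) -> List[ColumnMeta]:
    """Cheap: no rows are scanned, only column names and declared types."""
    cur = conn.execute(f'SELECT * FROM "{table}" LIMIT 0')
    names = [col[0] for col in cur.description]
    types = {row[1]: row[2] for row in conn.execute(f'PRAGMA table_info("{table}")')}
    return [ColumnMeta(name=n, sql_type=types.get(n, "")) for n in names]


def populate_column_stats(
    conn: sqlite3.Connection, table: str, metas: List[ColumnMeta]
) -> List[ColumnMeta]:
    """Expensive: one full-table query per column."""
    for meta in metas:
        if meta.sql_type.upper() in ("INTEGER", "REAL"):
            meta.min_val, meta.max_val = conn.execute(
                f'SELECT MIN("{meta.name}"), MAX("{meta.name}") FROM "{table}"'
            ).fetchone()
        else:
            rows = conn.execute(
                f'SELECT DISTINCT "{meta.name}" FROM "{table}" ORDER BY 1'
            )
            meta.categories = [row[0] for row in rows]
    return metas
```

Keeping the two steps separate is what lets a caller stop after the first one, or run the second only for a subset of columns.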

Layer 2 — Multi-table dispatch (_query_executor.py, new). A QueryExecutor sits between the tools and the data sources. It exposes the same get_column_metas(table) / populate_column_stats(table) / execute_query() surface but routes to the right backend. Three implementations: DataSourceExecutor for single-table or shared-backend multi-table (SQLAlchemy, Ibis); DuckDBExecutor for multi-table DataFrames (registers all tables in one in-memory DuckDB); and PolarsSQLExecutor for multi-table Polars LazyFrames (shared SQLContext). check_source_compatibility() enforces that all tables in a session share the same source type and backend.
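
A minimal sketch of the "same source type" rule, assuming a plain class-identity comparison. The function name `check_same_source_type` and the two placeholder classes are hypothetical; the real `check_source_compatibility()` also compares backends, which this sketch leaves out.

```python
from typing import Dict


class DataFrameSource:
    """Placeholder for an eager data frame source."""


class PolarsLazySource:
    """Placeholder for a Polars LazyFrame source."""


def check_same_source_type(sources: Dict[str, object]) -> type:
    """Return the single shared source class, or raise if the session mixes types."""
    kinds = {type(src) for src in sources.values()}
    if len(kinds) != 1:
        names = sorted(kind.__name__ for kind in kinds)
        raise ValueError(f"All tables must share one source type, got: {names}")
    return kinds.pop()
```

The returned class is what a dispatcher could then use to choose between a DuckDB-backed, SQLContext-backed, or pass-through executor.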

Layer 3 — Column metadata (_data_dict.py, new). A Pydantic model hierarchy: DataDict → TableSpec → ColumnSpec. The key method, DataDict.get_table_schema(), merges static annotations onto live ColumnMeta objects and only sends undocumented columns to populate_column_stats(). This is where the "skip stats for documented columns" optimization lives.
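
The merge step can be sketched as follows, assuming annotations and stats are plain dicts keyed by column name. `merge_schema` and its parameters are hypothetical names; the real method works on ColumnMeta and ColumnSpec objects.

```python
from typing import Callable, Dict, List


def merge_schema(
    columns: List[str],
    annotations: Dict[str, str],
    populate_stats: Callable[[List[str]], Dict[str, str]],
) -> Dict[str, str]:
    """Use the annotation where one exists; query stats only for the rest."""
    undocumented = [col for col in columns if col not in annotations]
    # A fully documented table never reaches the stats callback.
    stats = populate_stats(undocumented) if undocumented else {}
    return {
        col: annotations[col] if col in annotations else stats[col]
        for col in columns
    }
```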

Layer 4 — The schema tool (tools.py, prompts/tool-get-schema.md). querychat_get_schema dispatches to either DataDict.get_table_schema() or executor.get_schema() depending on whether a dict is present. It returns a GetSchemaResult subclass that renders a compact status line in Shiny chat rather than dumping the raw schema.

Layer 5 — Prompt rewiring (_system_prompt.py, prompts/prompt.md). The old {{schema}} Mustache variable is gone. Replaced by {{tables_overview}} (lightweight table list), {{relationships}}, {{glossary}}, and a {{multi_table}} flag that drives conditional instructions requiring table= parameters in tool calls.

Layer 6 — Orchestration (_querychat_base.py). The base class now holds dict[str, DataSource]. add_table() uses a staged-rebuild pattern: it builds the new system prompt and executor against a copy of the sources dict, committing only if both succeed, to avoid partial state on failure. add_tables() (Python: SQLAlchemy engine; R: DBI connection) bulk-stages all tables and calls build_system_prompt() exactly once, avoiding the N-1 spurious rebuilds that calling add_table() in a loop would cause. table(name) returns a config-only TableAccessor: data_source and table_name work, but df()/sql()/title() raise with guidance to use the server return value instead.
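
A simplified sketch of the staged-rebuild idea, assuming only the prompt needs rebuilding (the executor rebuild is omitted). `TableRegistry` and `build_prompt` are hypothetical names used for illustration.

```python
from typing import Callable, Dict


class TableRegistry:
    def __init__(self, build_prompt: Callable[[Dict[str, object]], str]) -> None:
        self._build_prompt = build_prompt
        self.sources: Dict[str, object] = {}
        self.system_prompt = ""

    def _commit(self, staged: Dict[str, object]) -> None:
        # Build against the staged copy first; if this raises, nothing changes.
        prompt = self._build_prompt(staged)
        self.sources, self.system_prompt = staged, prompt

    def add_table(self, source: object, name: str, replace: bool = False) -> None:
        if name in self.sources and not replace:
            raise ValueError(f"Table {name!r} already registered; pass replace=True")
        staged = dict(self.sources)
        staged[name] = source
        self._commit(staged)

    def add_tables(self, sources: Dict[str, object]) -> None:
        # Stage every table, then rebuild the prompt a single time.
        staged = dict(self.sources)
        staged.update(sources)
        self._commit(staged)
```

Because both entry points funnel through one commit step, a failed build leaves the previous sources and prompt untouched, and a bulk add costs one rebuild regardless of how many tables it stages.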

Layer 7 — Shiny reactive state (_shiny_module.py). ServerValues has per-table state in tables: dict[str, TableState], each with its own reactive sql, title, and df. It also gained a table(name) method that returns a TableAccessor backed by the per-session TableState — this is the correct way to consume per-table reactive results. Also added: current_table(), which tracks the most recently LLM-touched table name across update_dashboard, reset_dashboard, chat_update, and bookmark restore. Single-table apps still get top-level .df, .sql, .title on ServerValues for backward compatibility; multi-table replaces them with _MultiTableBlockedReactive sentinels that raise with migration guidance. Bookmark keys are per-table (Python: flat querychat_sql_{name}; R: nested querychat_tables).

Layer 8 — Public accessors and framework adapters (_table_accessor.py new, _dash.py, _gradio.py, _streamlit.py). TableAccessor is a thin proxy that can be constructed in two modes: config-only (from qc.table(), raises on reactive methods) or state-backed (from qc_vals.table(), fully reactive). StateDictTableAccessor is the Dash/Gradio variant that always raises on reactive methods with instructions to use the state-dict API instead. Framework adapters gained table= parameters on df(), sql(), title().
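
The two construction modes might look roughly like this. `TableAccessorSketch` is a hypothetical simplification in which the per-table state is a dict of zero-argument callables standing in for reactives; the error wording is invented for the example.

```python
from typing import Callable, Dict, Optional


class TableAccessorSketch:
    def __init__(
        self,
        table_name: str,
        data_source: object,
        state: Optional[Dict[str, Callable[[], object]]] = None,
    ) -> None:
        self.table_name = table_name
        self.data_source = data_source
        self._state = state  # None means config-only

    def _reactive(self, key: str) -> object:
        if self._state is None:
            raise RuntimeError(
                f"{key}() needs session state; use the value returned by "
                f"qc.server() and call .table({self.table_name!r}).{key}() on it."
            )
        return self._state[key]()

    def df(self) -> object:
        return self._reactive("df")

    def sql(self) -> object:
        return self._reactive("sql")

    def title(self) -> object:
        return self._reactive("title")
```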

Test plan

  • make py-check passes (format, types, tests)
  • Multi-table Shiny app: LLM correctly joins/filters across tables
  • Single-table apps still work with no API changes
  • DataDict: YAML loads, enriched descriptions reach the LLM, documented columns skip stats queries
  • Bookmark round-trip: new per-table keys save/restore correctly; old flat keys still load

@cpsievert cpsievert marked this pull request as draft January 15, 2026 21:10
Base automatically changed from feat/py-ibis-source to main January 16, 2026 22:41
@cpsievert cpsievert force-pushed the feat/py-multi-table branch 2 times, most recently from 76e009c to 193c2e1 Compare January 26, 2026 17:39
@cpsievert cpsievert force-pushed the feat/py-multi-table branch from 193c2e1 to 96c1c1d Compare January 26, 2026 18:00
@cpsievert cpsievert force-pushed the feat/py-multi-table branch from 96c1c1d to 4654e57 Compare June 17, 2026 14:22
@cpsievert cpsievert changed the title feat(pkg-py): add multi-table support feat(pkg-py): multi-table support and data dictionary Jun 17, 2026
@cpsievert cpsievert force-pushed the feat/py-multi-table branch from 7f0b8d7 to 3c54a32 Compare June 17, 2026 17:33
@cpsievert cpsievert changed the title feat(pkg-py): multi-table support and data dictionary feat: multi-table support and data dictionary Jun 17, 2026

This comment was marked as resolved.

- format_tool_result falls back to str(value) so querychat_get_schema
  output is visible in console/Streamlit/Gradio/Dash UIs
- cleanup_failed_staged_source now covers PinSource (DuckDB connection)
  in addition to DataFrameSource, preventing a resource leak on failed
  staged add/replace
- RuntimeError message in mod_server updated to reference add_table()
  instead of the removed data_source property
…ndering

Multiple DataDicts can now be passed to QueryChat, each acting as a named
domain namespace. The system prompt renders each dict as a named XML tag
wrapping a YAML body, giving the LLM clear domain boundaries without the
verbosity of raw XML or the ambiguity of a flat table list.

DataDict gains optional name/description fields; from_yaml() derives name
from the file stem when not set explicitly. A UserWarning nudges multi-table
apps that have no dict toward providing one.

querychat.js was sending {query, title} but omitting `table` when the
Apply Filter button was clicked. The Shiny handler silently ignored
every click because update.get("table", "") returned "" (falsy).

Adds a Playwright test that reproduces the failure before the fix.

- Update prompt templates for multi-table support
- Add DataDict S3 constructors with YAML loading
- Add QueryExecutor private R6 classes for multi-table
- Add multi-table support to QueryChat, tools, and module
- Resolve integration issues

DataDict S3 constructors (data_dict, table_spec, column_spec, etc.) are
removed in favour of read_data_dict() returning a plain YAML-parsed list.
This avoids an API commitment to a still-evolving spec and removes ~250
lines of constructor/parser scaffolding.

The get_schema tool now actually uses data_dicts: it finds the first dict
whose tables key covers the requested table and passes that table spec to
get_schema_impl, which bypasses live MIN/MAX queries for columns with a
DataDict range and bypasses SELECT DISTINCT for columns with a DataDict
values list. Description, type, units, and constraints from the dict are
surfaced directly in the schema output.

Prompt instruction for get_schema also strengthened to always call it
before writing SQL rather than treating it as optional.
@cpsievert cpsievert force-pushed the feat/py-multi-table branch from 7bba78a to f03c0cb Compare June 18, 2026 15:31
…ompt

- R add_table() now calls check_source_compatibility() before committing,
  matching Python's validation that prevents mixing incompatible source types
- Python tool_query() now passes multi_table to the prompt template so
  multi-table query guidance renders when multiple tables are registered
…n schema

ColumnMeta now carries units and constraints fields. DataDict.get_table_schema
copies type, units, and constraints from the ColumnSpec onto the ColumnMeta,
and format_schema renders them, matching R's get_schema_impl output.

Replaces the private shinychat:::new_tool_card call with a plain
shiny::tags$p(), matching the Python implementation's ChatMessage approach.

TableAccessor no longer reaches into QueryChat's private fields via
.__enclos_env__. Instead it takes table_name, data_source, and an
optional state list. Config-only accessors (from qc$table()) error
on reactive methods with guidance to use the server return value.

This comment was marked as resolved.

cpsievert and others added 8 commits June 19, 2026 14:17
The schema popover UI previously re-parsed the LLM-facing text
representation of schema info to build its display. This required a
fragile regex that broke on types with nested parens (e.g. number(id)).

Now the server serializes structured column metadata (ColumnMeta in
Python, named lists in R) directly to JSON and attaches it to the
sentinel element as data-schema-json. The JS reads JSON and never
touches the text. The text parser and its regex are gone.

Also fixes XML attribute escaping for data-dict name/description fields,
and tightens the RuntimeError message in _shiny_module.py.
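
The JSON hand-off described above can be sketched as below. The element, class name, and helper names are hypothetical; only the `data-schema-json` attribute name comes from the commit message. `read_schema` stands in for what the client-side JS does conceptually.

```python
import html
import json
import re
from typing import Dict, List


def schema_sentinel(columns: List[Dict[str, str]]) -> str:
    """Serialize column metadata to JSON and embed it in an escaped attribute."""
    payload = html.escape(json.dumps(columns), quote=True)
    return f'<span class="schema-sentinel" data-schema-json="{payload}"></span>'


def read_schema(markup: str) -> List[Dict[str, str]]:
    """Read the attribute back and parse it as JSON; no text parsing of types."""
    match = re.search(r'data-schema-json="([^"]*)"', markup)
    if match is None:
        raise ValueError("no data-schema-json attribute found")
    return json.loads(html.unescape(match.group(1)))
```

Since the type string travels as a JSON value, a type such as `number(id)` needs no special handling, and escaping the attribute keeps quotes or angle brackets in descriptions from breaking the markup.
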
The schema tool executes quickly, so the <shiny-tool-request> element
flashed briefly before disappearing. Other tools hide the request via
<shiny-tool-result>'s useEffect, which runs in the same React render
cycle as the result appearing. The schema tool bypasses ToolResultComponent
entirely, so it relied on a separate hide_tool_request server action —
which could arrive in a different render cycle and cause a visible flash.

Fix: intercept the ContentToolRequest dispatch for querychat_get_schema
in both Python and R and return empty/null content so the request element
never renders in the first place.
@cpsievert cpsievert marked this pull request as ready for review June 19, 2026 20:55
@cpsievert cpsievert requested a review from Copilot June 19, 2026 20:55

This comment was marked as resolved.

cpsievert and others added 2 commits June 19, 2026 16:23
- Escape HTML attributes (query, title, table) in Apply/Reset Filter buttons to prevent attribute injection
- Pass multi_table flag to tool_update_dashboard() and tool_visualize() so prompt templates render multi-table guidance
- Normalize empty strings to NULL/None in chat_update observers so Reset Filter button correctly clears state
- Remove @export/@keywords internal from read_data_dict (unexported internal)
@cpsievert cpsievert requested a review from gadenbuie June 22, 2026 16:19
Conflicts resolved:
- NEWS.md / CHANGELOG.md: kept both multi-table entries (branch) and PinSource entry (main)
- DataFrameSource.R: used new_dataframe_connection() helper from main, kept DuckDB cleanup method from branch
- QueryChat.R: added setBookmarkExclude(c("close_btn", "reset_query")) from main
- querychat_module.R: kept restore_ui=FALSE from branch, added setBookmarkExclude("chat_update") from main
- _shiny.py: kept _mark_server_initialized() from branch, added bookmark exclude for reset_query from main
gadenbuie

This comment was marked as resolved.

Replace the mixin-based approach (which required ~10 type: ignore annotations
for attributes the mixin couldn't own) with a proper subclass of QueryChatBase.
Dash and Gradio QueryChat classes now use single inheritance from
StateDictQueryChat instead of dual inheritance.
…rnings

Instead of raising AttributeError, calling .df()/.sql()/.title() without
a table name now warns once and delegates to the primary table. This gives
users a migration path rather than an immediate hard break.

Also fixes the stacklevel so warnings point at user call sites, removes
the spurious warning on .set(), and replaces the _df_warned list trick
with nonlocal.
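
A minimal sketch of the warn-once pattern with `nonlocal`, assuming the flat accessor simply delegates to a callable for the primary table. `make_flat_accessor` and the message wording are hypothetical.

```python
import warnings
from typing import Callable


def make_flat_accessor(
    primary: Callable[[], object], name: str = "df"
) -> Callable[[], object]:
    warned = False  # closed-over flag, flipped on the first call

    def accessor() -> object:
        nonlocal warned
        if not warned:
            warned = True
            warnings.warn(
                f".{name}() without a table name is deprecated; "
                f"use .table(name).{name}() instead.",
                FutureWarning,
                stacklevel=2,  # attribute the warning to the caller's line
            )
        return primary()

    return accessor
```
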
Extract the shared require_all_columns check from DuckDBExecutor and
PolarsSQLExecutor into a single static helper on the abstract base class.

Adds QueryChat$add_tables() (R) and QueryChatBase.add_tables() (Python)
to register multiple tables from a single connection/engine in one call.
Compared to calling add_table() in a loop, this builds the system prompt
exactly once after all tables are staged, avoiding N-1 intermediate
rebuilds. Includes validation, replace support, cleanup of displaced
sources, and tests.

Prompt caching established over prior turns is silently invalidated
when add_table() rebuilds the system prompt mid-conversation.

@cpsievert cpsievert left a comment

Contributor Author

Thanks for the detailed review. I've addressed the near-term items in sections 2 and 3; thoughts on section 1 below.

Sections 3A–3D

All four simplification items landed:

  • 905a8db — replaces StateDictAccessorMixin with StateDictQueryChat, eliminating all 15 type: ignore[attr-defined] comments
  • 53a7788 — softens the flat-accessor error to FutureWarning with primary-table fallback, with session-scoped deduplication so Shiny reactives don't spam
  • f36db45 — extracts _validate_missing_columns to the QueryExecutor base, removing the ~19-line duplication
  • 47d8fae — warns when add_table rebuilds the system prompt after a client already has chat history, covering both the _client_spec and _client_console paths

Section 2 — Bulk table registration

99e011d adds add_tables() for bulk registration in both R and Python, which handles the many-add_table-calls problem. Auto-discovery on construction is a natural next step I'd consider separately.

Section 1 — Coalescing source types into shared DuckDB

I've filed #256 to track loosening the class-identity check so PinSource and DataFrameSource can share a DuckDBExecutor — both already materialize an eager frame, so this is largely a routing change.

For steps 2 and 3 I'm more cautious:

Snapshot mode is hard to recommend as a general solution. Materializing a full remote table into local DuckDB memory at startup works for small tables and silently breaks for production ones, with no reliable way to gate on table size upfront.

Live attach is more interesting but comes with real costs: DuckDB's SQL dialect diverges from Postgres/MySQL enough to introduce compatibility risk (the LLM is already told what flavor to use, and "DuckDB mediating Postgres" isn't the same as talking to Postgres directly); ATTACH scanners only cover Postgres, MySQL, and SQLite, so Snowflake/Databricks users still fall back to snapshots; per-dialect connection-string translation from SQLAlchemy URLs is non-trivial glue; and relaxing enable_external_access is a meaningful security posture change that warrants its own rollout.

I'd keep the register_into contract extensible so steps 2/3 remain additive, but I'd want concrete user demand before pursuing remote-source coalescing.

cpsievert and others added 4 commits June 22, 2026 16:02
…able

Combined multi-table per-table reactive state (this branch) with the new
greeting API and bookmark persistence (main). Resolutions:

- _shiny_module.py / querychat_module.R: replace has_greeted with
  current_greeting; keep per-table table_states; bookmark saves per-table
  SQL/title + greeting; restore restores both + calls set_greeting()
- _shiny.py: pass greeting=self.greeting to mod_ui() + keep _ensure_server_started()
- NEWS.md / CHANGELOG.md: include both improvement entries

Extract _df_for_source and _title_for_table helpers to eliminate
duplicated logic in df() and title(). Combine nested with-statements
in tests to fix SIM117.
@cpsievert

Contributor Author

@gadenbuie thanks for the review, ok to merge?

3 participants