feat: multi-table support and data dictionary#195
Conversation
76e009c to
193c2e1
Compare
193c2e1 to
96c1c1d
Compare
96c1c1d to
4654e57
Compare
7f0b8d7 to
3c54a32
Compare
9d5151e to
89361d8
Compare
- format_tool_result falls back to str(value) so querychat_get_schema output is visible in console/Streamlit/Gradio/Dash UIs - cleanup_failed_staged_source now covers PinSource (DuckDB connection) in addition to DataFrameSource, preventing a resource leak on failed staged add/replace - RuntimeError message in mod_server updated to reference add_table() instead of the removed data_source property
…ndering Multiple DataDicts can now be passed to QueryChat, each acting as a named domain namespace. The system prompt renders each dict as a named XML tag wrapping a YAML body, giving the LLM clear domain boundaries without the verbosity of raw XML or the ambiguity of a flat table list. DataDict gains optional name/description fields; from_yaml() derives name from the file stem when not set explicitly. A UserWarning nudges multi-table apps that have no dict toward providing one.
querychat.js was sending {query, title} but omitting `table` when the
Apply Filter button was clicked. The Shiny handler silently ignored
every click because update.get("table", "") returned "" (falsy).
Adds a Playwright test that reproduces the failure before the fix.
- Update prompt templates for multi-table support - Add DataDict S3 constructors with YAML loading - Add QueryExecutor private R6 classes for multi-table - Add multi-table support to QueryChat, tools, and module - Resolve integration issues
DataDict S3 constructors (data_dict, table_spec, column_spec, etc.) are removed in favour of read_data_dict() returning a plain YAML-parsed list. This avoids an API commitment to a still-evolving spec and removes ~250 lines of constructor/parser scaffolding. The get_schema tool now actually uses data_dicts: it finds the first dict whose tables key covers the requested table and passes that table spec to get_schema_impl, which bypasses live MIN/MAX queries for columns with a DataDict range and bypasses SELECT DISTINCT for columns with a DataDict values list. Description, type, units, and constraints from the dict are surfaced directly in the schema output. Prompt instruction for get_schema also strengthened to always call it before writing SQL rather than treating it as optional.
7bba78a to
f03c0cb
Compare
…ompt - R add_table() now calls check_source_compatibility() before committing, matching Python's validation that prevents mixing incompatible source types - Python tool_query() now passes multi_table to the prompt template so multi-table query guidance renders when multiple tables are registered
…n schema ColumnMeta now carries units and constraints fields. DataDict.get_table_schema copies type, units, and constraints from the ColumnSpec onto the ColumnMeta, and format_schema renders them — matching R's get_schema_impl output.
Replaces the private shinychat:::new_tool_card call with a plain shiny::tags$p(), matching the Python implementation's ChatMessage approach.
TableAccessor no longer reaches into QueryChat's private fields via .__enclos_env__. Instead it takes table_name, data_source, and an optional state list. Config-only accessors (from qc$table()) error on reactive methods with guidance to use the server return value.
The schema popover UI previously re-parsed the LLM-facing text representation of schema info to build its display. This required a fragile regex that broke on types with nested parens (e.g. number(id)). Now the server serializes structured column metadata (ColumnMeta in Python, named lists in R) directly to JSON and attaches it to the sentinel element as data-schema-json. The JS reads JSON and never touches the text. The text parser and its regex are gone. Also fixes XML attribute escaping for data-dict name/description fields, and tightens the RuntimeError message in _shiny_module.py.
The schema tool executes quickly, so the <shiny-tool-request> element flashed briefly before disappearing. Other tools hide the request via <shiny-tool-result>'s useEffect, which runs in the same React render cycle as the result appearing. The schema tool bypasses ToolResultComponent entirely, so it relied on a separate hide_tool_request server action — which could arrive in a different render cycle and cause a visible flash. Fix: intercept the ContentToolRequest dispatch for querychat_get_schema in both Python and R and return empty/null content so the request element never renders in the first place.
- Escape HTML attributes (query, title, table) in Apply/Reset Filter buttons to prevent attribute injection - Pass multi_table flag to tool_update_dashboard() and tool_visualize() so prompt templates render multi-table guidance - Normalize empty strings to NULL/None in chat_update observers so Reset Filter button correctly clears state - Remove @export/@Keywords internal from read_data_dict (unexported internal)
Conflicts resolved:
- NEWS.md / CHANGELOG.md: kept both multi-table entries (branch) and PinSource entry (main)
- DataFrameSource.R: used new_dataframe_connection() helper from main, kept DuckDB cleanup method from branch
- QueryChat.R: added setBookmarkExclude(c("close_btn", "reset_query")) from main
- querychat_module.R: kept restore_ui=FALSE from branch, added setBookmarkExclude("chat_update") from main
- _shiny.py: kept _mark_server_initialized() from branch, added bookmark exclude for reset_query from main
Replace the mixin-based approach (which required ~10 type: ignore annotations for attributes the mixin couldn't own) with a proper subclass of QueryChatBase. Dash and Gradio QueryChat classes now use single inheritance from StateDictQueryChat instead of dual inheritance.
…rnings Instead of raising AttributeError, calling .df()/.sql()/.title() without a table name now warns once and delegates to the primary table. This gives users a migration path rather than an immediate hard break. Also fixes the stacklevel so warnings point at user call sites, removes the spurious warning on .set(), and replaces the _df_warned list trick with nonlocal.
Extract the shared require_all_columns check from DuckDBExecutor and PolarsSQLExecutor into a single static helper on the abstract base class.
Adds QueryChat$add_tables() (R) and QueryChatBase.add_tables() (Python) to register multiple tables from a single connection/engine in one call. Compared to calling add_table() in a loop, this builds the system prompt exactly once after all tables are staged, avoiding N-1 intermediate rebuilds. Includes validation, replace support, cleanup of displaced sources, and tests.
Prompt caching established over prior turns is silently invalidated when add_table() rebuilds the system prompt mid-conversation.
cpsievert
left a comment
There was a problem hiding this comment.
Thanks for the detailed review. I've addressed the near-term items in sections 2 and 3; thoughts on section 1 below.
Sections 3A–3D
All four simplification items landed:
- 905a8db — replaces
StateDictAccessorMixinwithStateDictQueryChat, eliminating all 15type: ignore[attr-defined]comments - 53a7788 — softens the flat-accessor error to
FutureWarningwith primary-table fallback, with session-scoped deduplication so Shiny reactives don't spam - f36db45 — extracts
_validate_missing_columnsto theQueryExecutorbase, removing the ~19-line duplication - 47d8fae — warns when
add_tablerebuilds the system prompt after a client already has chat history, covering both the_client_specand_client_consolepaths
Section 2 — Bulk table registration
99e011d adds add_tables() for bulk registration in both R and Python, which handles the many-add_table-calls problem. Auto-discovery on construction is a natural next step I'd consider separately.
Section 1 — Coalescing source types into shared DuckDB
I've filed #256 to track loosening the class-identity check so PinSource and DataFrameSource can share a DuckDBExecutor — both already materialize an eager frame, so this is largely a routing change.
For steps 2 and 3 I'm more cautious:
Snapshot mode is hard to recommend as a general solution. Materializing a full remote table into local DuckDB memory at startup works for small tables and silently breaks for production ones, with no reliable way to gate on table size upfront.
Live attach is more interesting but comes with real costs: DuckDB's SQL dialect diverges from Postgres/MySQL enough to introduce compatibility risk (the LLM is already told what flavor to use, and "DuckDB mediating Postgres" isn't the same as talking to Postgres directly); ATTACH scanners only cover Postgres, MySQL, and SQLite, so Snowflake/Databricks users still fall back to snapshots; per-dialect connection-string translation from SQLAlchemy URLs is non-trivial glue; and relaxing enable_external_access is a meaningful security posture change that warrants its own rollout.
I'd keep the register_into contract extensible so steps 2/3 remain additive, but I'd want concrete user demand before pursuing remote-source coalescing.
…able Combined multi-table per-table reactive state (this branch) with the new greeting API and bookmark persistence (main). Resolutions: - _shiny_module.py / querychat_module.R: replace has_greeted with current_greeting; keep per-table table_states; bookmark saves per-table SQL/title + greeting; restore restores both + calls set_greeting() - _shiny.py: pass greeting=self.greeting to mod_ui() + keep _ensure_server_started() - NEWS.md / CHANGELOG.md: include both improvement entries
Extract _df_for_source and _title_for_table helpers to eliminate duplicated logic in df() and title(). Combine nested with-statements in tests to fix SIM117.
|
@gadenbuie thanks for the review, ok to merge? |
For users
querychat has been single-table: one data frame, one schema, one conversation. If you had orders and customers and products, your options were to pre-join everything into one wide frame (losing the relational structure the LLM could reason about) or to spin up separate querychat sessions (losing the ability to ask cross-table questions). This PR removes that limitation.
You can now register multiple tables and the LLM will reason across all of them — joins, cross-table filters, aggregations:
If your data lives in a SQLAlchemy engine (Python) or DBI connection (R),
add_tables()bulk-registers all tables — or a named subset — in a single call, and builds the system prompt exactly once rather than once per table:The LLM's SQL queries can reference any registered table. To consume per-table filter state in a Shiny app, call
qc.server()and usetable()on the returned values — each table gets its own reactive sql/df/title:As a side note:
qc_vals.current_table()(R:qc_vals\$current_table()) is also available — a reactive that returns the name of the most recently LLM-queried table, orNone/NULLbefore any query. Useful when you want an output to follow whatever table the LLM just acted on without hard-coding a name.Under the hood, a new
QueryExecutorlayer dispatches queries to the right backend. DataFrame sources (pandas, polars, pyarrow) get a shared DuckDB connection; Polars LazyFrames get a sharedSQLContext; SQLAlchemy/Ibis tables use the existing shared backend directly. All tables in a session must be the same source type.But multi-table introduces a scaling problem. The old design embedded full column statistics in the system prompt at startup — every conversation paid the cost of computing stats for every column of every table, whether the LLM needed them or not. With multiple tables, that cost multiplies. So schema fetching is now on-demand: the system prompt contains only a lightweight table overview, and the LLM calls the new
querychat_get_schematool to get column details for specific tables when it actually needs them.And when you have several tables, column names get ambiguous. What does
statusmean in the orders table vs. the customers table? A newDataDicttype — integrating with the https://data-dict.tidyverse.org/ spec — lets you annotate tables and columns with plain-English descriptions, loaded from a YAML file:The schema tool uses these annotations directly, skipping the live stats query for any column that already has metadata. A fully-documented table incurs zero stats overhead.
DataDictalso supportsrelationships(JOIN hints for multi-table) and aglossary(domain term definitions), both surfaced in the system prompt so the LLM understands your domain language.These three pieces — multi-table, on-demand schema, and data dictionary — aren't independent features. They're designed as layers: multi-table creates the need for smarter schema handling, on-demand fetching solves the cost problem, and
DataDictsolves the precision problem.Breaking changes
qc.data_sourceqc.table("name").data_sourceqc.data_source = new_dfqc.add_table(new_df, "name", replace=True)qc.server(data_source=df)(Shiny)qc.add_table(df, "name")beforeserver()qc.df()/qc.sql()/qc.title()(multi-table)qc_vals.table("name").df()etc. via server returnBookmark keys changed: Python uses flat per-table keys (
querychat_sql_{name},querychat_title_{name}); R uses a nestedquerychat_tablesdict. Old single-table flat keys (querychat_sql) are no longer written.data_descriptionis deprecated in favor ofdata_dict.For reviewers
The code mirrors the same layered story. Each layer depends only on the ones below it, so reviewing bottom-up avoids forward references:
Layer 1 — The data contract (
_datasource.py).DataSourcegained two methods:get_column_metas()(cheap — column names and types viaLIMIT 0) andpopulate_column_stats()(expensive — min/max/categories). This split is the foundation everything else builds on: it's what makes on-demand fetching possible and what letsDataDictselectively skip stats queries.ColumnMetaalso gained adescriptionfield forDataDictannotations.Layer 2 — Multi-table dispatch (
_query_executor.py, new). AQueryExecutorsits between the tools and the data sources. It exposes the sameget_column_metas(table)/populate_column_stats(table)/execute_query()surface but routes to the right backend. Three implementations:DataSourceExecutorfor single-table or shared-backend multi-table (SQLAlchemy, Ibis);DuckDBExecutorfor multi-table DataFrames (registers all tables in one in-memory DuckDB); andPolarsSQLExecutorfor multi-table Polars LazyFrames (sharedSQLContext).check_source_compatibility()enforces that all tables in a session share the same source type and backend.Layer 3 — Column metadata (
_data_dict.py, new). A Pydantic model hierarchy:DataDict→TableSpec→ColumnSpec. The key method,DataDict.get_table_schema(), merges static annotations onto liveColumnMetaobjects and only sends undocumented columns topopulate_column_stats(). This is where the "skip stats for documented columns" optimization lives.Layer 4 — The schema tool (
tools.py,prompts/tool-get-schema.md).querychat_get_schemadispatches to eitherDataDict.get_table_schema()orexecutor.get_schema()depending on whether a dict is present. It returns aGetSchemaResultsubclass that renders a compact status line in Shiny chat rather than dumping the raw schema.Layer 5 — Prompt rewiring (
_system_prompt.py,prompts/prompt.md). The old{{schema}}Mustache variable is gone. Replaced by{{tables_overview}}(lightweight table list),{{relationships}},{{glossary}}, and a{{multi_table}}flag that drives conditional instructions requiringtable=parameters in tool calls.Layer 6 — Orchestration (
_querychat_base.py). The base class now holdsdict[str, DataSource].add_table()uses a staged-rebuild pattern: it builds the new system prompt and executor against a copy of the sources dict, committing only if both succeed, to avoid partial state on failure.add_tables()(Python: SQLAlchemy engine; R: DBI connection) bulk-stages all tables and callsbuild_system_prompt()exactly once, avoiding the N-1 spurious rebuilds that callingadd_table()in a loop would cause.table(name)returns a config-onlyTableAccessor—data_sourceandtable_namework, butdf()/sql()/title()raise with guidance to use the server return value instead.Layer 7 — Shiny reactive state (
_shiny_module.py).ServerValueshas per-table state intables: dict[str, TableState], each with its own reactivesql,title, anddf. It also gained atable(name)method that returns aTableAccessorbacked by the per-sessionTableState— this is the correct way to consume per-table reactive results. Also added:current_table(), which tracks the most recently LLM-touched table name acrossupdate_dashboard,reset_dashboard,chat_update, and bookmark restore. Single-table apps still get top-level.df,.sql,.titleonServerValuesfor backward compatibility; multi-table replaces them with_MultiTableBlockedReactivesentinels that raise with migration guidance. Bookmark keys are per-table (Python: flatquerychat_sql_{name}; R: nestedquerychat_tables).Layer 8 — Public accessors and framework adapters (
_table_accessor.pynew,_dash.py,_gradio.py,_streamlit.py).TableAccessoris a thin proxy that can be constructed in two modes: config-only (fromqc.table(), raises on reactive methods) or state-backed (fromqc_vals.table(), fully reactive).StateDictTableAccessoris the Dash/Gradio variant that always raises on reactive methods with instructions to use the state-dict API instead. Framework adapters gainedtable=parameters ondf(),sql(),title().Test plan
make py-checkpasses (format, types, tests)