Summary
When a schema name is configured, datafaker sets the active schema by issuing a raw SET search_path TO <schema> SQL command on every new connection. This is PostgreSQL-specific syntax that does not work on MS-SQL (or most other databases).
Problem
There are two related code sites in datafaker/utils.py:
1. create_db_engine builds a settings dict keyed on search_path (line ~263):
if schema_name is not None:
settings["search_path"] = schema_name
2. set_db_settings issues a raw SET … TO … command (line ~338):
def set_db_settings(connection, settings):
existing_autocommit = connection.autocommit
connection.autocommit = True
cursor = connection.cursor()
sql = "".join(f"SET {k} TO {v};" for k, v in settings.items())
cursor.execute(sql)
cursor.close()
connection.autocommit = existing_autocommit
SET search_path TO <schema> is understood only by PostgreSQL. MS-SQL has no equivalent session-level command — schemas are addressed via two-part [schema].[table] naming. Running this on an MS-SQL connection raises a ProgrammingError before any real work can begin.
Note: the file_search_path key (used for DuckDB parquet paths) is also passed through set_db_settings, so any fix must preserve that path for DuckDB while only changing how the schema is communicated to the underlying engine.
Proposed steps
-
Pass schema to MetaData and Table objects rather than setting it at the session level. SQLAlchemy's cross-dialect way to scope all queries to a particular schema is to pass schema=schema_name when constructing MetaData:
MetaData(schema=schema_name)
and when reflecting:
md.reflect(engine, schema=schema_name)
This generates fully-qualified [schema].[table] references in SQL for MS-SQL, and schema.table for PostgreSQL, without any session-level SET command.
-
Remove search_path from the settings dict in create_db_engine. The set_db_settings / connection-event approach should be retained only for settings that have no SQLAlchemy equivalent (currently just file_search_path for DuckDB):
# Only use the connect-event hook for non-schema settings (e.g. DuckDB file_search_path)
if parquet_dir is not None:
settings["file_search_path"] = ...
if settings:
@event.listens_for(engine, "connect", insert=True)
def connect(dbapi_connection, _):
set_db_settings(dbapi_connection, settings)
-
Thread schema_name through to MetaData construction and reflection in utils.get_metadata, make.make_tables_file, create.create_db_tables_into, and anywhere else that currently relies on search_path being set at session level.
-
Update set_db_settings docstring to remove the PostgreSQL-specific description now that it only serves DuckDB's file_search_path.
-
Add tests that confirm schema-qualified queries are generated correctly for both a PostgreSQL DSN and a (mocked) MS-SQL DSN, without a SET search_path command being issued.
Acceptance criteria
- Passing
SRC_SCHEMA / DST_SCHEMA with an MS-SQL connection does not raise a ProgrammingError.
- Schema-qualified table names (e.g.
[dbo].[person]) appear in SQL emitted for MS-SQL.
SET search_path is no longer issued on any connection.
- PostgreSQL and DuckDB behaviour is unchanged.
Related
Summary
When a schema name is configured, datafaker sets the active schema by issuing a raw
SET search_path TO <schema>SQL command on every new connection. This is PostgreSQL-specific syntax that does not work on MS-SQL (or most other databases).Problem
There are two related code sites in
datafaker/utils.py:1.
create_db_enginebuilds asettingsdict keyed onsearch_path(line ~263):2.
set_db_settingsissues a rawSET … TO …command (line ~338):SET search_path TO <schema>is understood only by PostgreSQL. MS-SQL has no equivalent session-level command — schemas are addressed via two-part[schema].[table]naming. Running this on an MS-SQL connection raises aProgrammingErrorbefore any real work can begin.Note: the
file_search_pathkey (used for DuckDB parquet paths) is also passed throughset_db_settings, so any fix must preserve that path for DuckDB while only changing how the schema is communicated to the underlying engine.Proposed steps
Pass
schematoMetaDataandTableobjects rather than setting it at the session level. SQLAlchemy's cross-dialect way to scope all queries to a particular schema is to passschema=schema_namewhen constructingMetaData:and when reflecting:
This generates fully-qualified
[schema].[table]references in SQL for MS-SQL, andschema.tablefor PostgreSQL, without any session-levelSETcommand.Remove
search_pathfrom thesettingsdict increate_db_engine. Theset_db_settings/ connection-event approach should be retained only for settings that have no SQLAlchemy equivalent (currently justfile_search_pathfor DuckDB):Thread
schema_namethrough toMetaDataconstruction and reflection inutils.get_metadata,make.make_tables_file,create.create_db_tables_into, and anywhere else that currently relies onsearch_pathbeing set at session level.Update
set_db_settingsdocstring to remove the PostgreSQL-specific description now that it only serves DuckDB'sfile_search_path.Add tests that confirm schema-qualified queries are generated correctly for both a PostgreSQL DSN and a (mocked) MS-SQL DSN, without a
SET search_pathcommand being issued.Acceptance criteria
SRC_SCHEMA/DST_SCHEMAwith an MS-SQL connection does not raise aProgrammingError.[dbo].[person]) appear in SQL emitted for MS-SQL.SET search_pathis no longer issued on any connection.Related