Skip to content

Replace PostgreSQL search_path schema selection with dialect-neutral approach #95

@myyong

Description

@myyong

Summary

When a schema name is configured, datafaker sets the active schema by issuing a raw SET search_path TO <schema> SQL command on every new connection. This is PostgreSQL-specific syntax that does not work on MS-SQL (or most other databases).

Problem

There are two related code sites in datafaker/utils.py:

1. create_db_engine builds a settings dict keyed on search_path (line ~263):

if schema_name is not None:
    settings["search_path"] = schema_name

2. set_db_settings issues a raw SET … TO … command (line ~338):

def set_db_settings(connection, settings):
    existing_autocommit = connection.autocommit
    connection.autocommit = True
    cursor = connection.cursor()
    sql = "".join(f"SET {k} TO {v};" for k, v in settings.items())
    cursor.execute(sql)
    cursor.close()
    connection.autocommit = existing_autocommit

SET search_path TO <schema> is understood only by PostgreSQL. MS-SQL has no equivalent session-level command — schemas are addressed via two-part [schema].[table] naming. Running this on an MS-SQL connection raises a ProgrammingError before any real work can begin.

Note: the file_search_path key (used for DuckDB parquet paths) is also passed through set_db_settings, so any fix must preserve that path for DuckDB while only changing how the schema is communicated to the underlying engine.

Proposed steps

  1. Pass schema to MetaData and Table objects rather than setting it at the session level. SQLAlchemy's cross-dialect way to scope all queries to a particular schema is to pass schema=schema_name when constructing MetaData:

    MetaData(schema=schema_name)

    and when reflecting:

    md.reflect(engine, schema=schema_name)

    This generates fully-qualified [schema].[table] references in SQL for MS-SQL, and schema.table for PostgreSQL, without any session-level SET command.

  2. Remove search_path from the settings dict in create_db_engine. The set_db_settings / connection-event approach should be retained only for settings that have no SQLAlchemy equivalent (currently just file_search_path for DuckDB):

    # Only use the connect-event hook for non-schema settings (e.g. DuckDB file_search_path)
    if parquet_dir is not None:
        settings["file_search_path"] = ...
    
    if settings:
        @event.listens_for(engine, "connect", insert=True)
        def connect(dbapi_connection, _):
            set_db_settings(dbapi_connection, settings)
  3. Thread schema_name through to MetaData construction and reflection in utils.get_metadata, make.make_tables_file, create.create_db_tables_into, and anywhere else that currently relies on search_path being set at session level.

  4. Update set_db_settings docstring to remove the PostgreSQL-specific description now that it only serves DuckDB's file_search_path.

  5. Add tests that confirm schema-qualified queries are generated correctly for both a PostgreSQL DSN and a (mocked) MS-SQL DSN, without a SET search_path command being issued.

Acceptance criteria

  • Passing SRC_SCHEMA / DST_SCHEMA with an MS-SQL connection does not raise a ProgrammingError.
  • Schema-qualified table names (e.g. [dbo].[person]) appear in SQL emitted for MS-SQL.
  • SET search_path is no longer issued on any connection.
  • PostgreSQL and DuckDB behaviour is unchanged.

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions