Skip to content

Handle SERIAL autoincrement columns when targeting MS-SQL #97

@myyong

Description

@myyong

Summary

PostgreSQL uses SERIAL (and its variants BIGSERIAL, SMALLSERIAL) as shorthand for autoincrement integer columns. When datafaker reflects a PostgreSQL source schema, these columns appear as SERIAL in the generated DDL. The existing code already works around this for DuckDB via a @compiles hook, but no equivalent exists for MS-SQL, which uses IDENTITY(1,1) instead.

Problem

In datafaker/create.py there are two dialect-specific DDL hooks registered with SQLAlchemy's @compiles decorator:

@compiles(CreateColumn, "duckdb")
def remove_serial(element: CreateColumn, compiler: Any, **kw: Any) -> str:
    text: str = compiler.visit_create_column(element, **kw)
    return text.replace(" SERIAL ", " INTEGER ")

@compiles(CreateTable, "duckdb")
def remove_on_delete_cascade(element: CreateTable, compiler: Any, **kw: Any) -> str:
    text: str = compiler.visit_create_table(element, **kw)
    return text.replace(" ON DELETE CASCADE", "")

The remove_serial hook replaces SERIAL with INTEGER for DuckDB. MS-SQL does not understand SERIAL either, so create-tables against an MS-SQL destination will raise a syntax error whenever the source schema has autoincrement columns.

Additionally, ON DELETE CASCADE is stripped for DuckDB but not for MS-SQL, where it is supported — so that hook does not need an MS-SQL equivalent. However, MS-SQL has its own DDL quirks that may need similar treatment (see proposed steps below).

How SERIAL arises

When datafaker reflects a PostgreSQL source, SQLAlchemy maps SERIAL columns to INTEGER columns with a Sequence or autoincrement=True. When the ORM YAML is then used to create-tables on a destination, SQLAlchemy re-generates DDL. For a PostgreSQL destination this renders as SERIAL; for MS-SQL the correct rendering is INTEGER IDENTITY(1,1).

In practice, SQLAlchemy already handles this correctly when the column is defined with autoincrement=True and the engine is an MS-SQL engine — it will emit IDENTITY(1,1) automatically. The problem is the @compiles(CreateColumn, "duckdb") hook, which replaces the compiled text at the string level. A naive port of the same approach to "mssql" is unnecessary and could interfere with SQLAlchemy's correct MS-SQL DDL generation.

The real risk is the reverse: if a source schema is reflected with SERIAL stored literally as the column type string in orm.yaml (rather than being resolved to INTEGER + autoincrement), parsing and re-emitting that string on an MS-SQL engine will produce invalid DDL.

Proposed steps

1. Confirm how SERIAL is stored in orm.yaml

Run make-tables against a PostgreSQL source that contains SERIAL columns and inspect the resulting orm.yaml. Determine whether:

  • (a) The type string is INTEGER (autoincrement handled separately), or
  • (b) The type string is literally SERIAL / BIGSERIAL / SMALLSERIAL.

If (a), SQLAlchemy will generate IDENTITY(1,1) for MS-SQL automatically and no further action is needed. If (b), proceed with steps 2–4.

2. Add SERIAL type strings to the type parser

If orm.yaml contains SERIAL, BIGSERIAL, or SMALLSERIAL, add parsers for them in datafaker/serialize_metadata.py that map to the appropriate generic integer types with autoincrement=True:

# SERIAL variants → generic integers; autoincrement is set via column flag
parsy.string("BIGSERIAL").result(sqltypes.BigInteger),   # must precede SERIAL
parsy.string("SMALLSERIAL").result(sqltypes.SmallInteger),
parsy.string("SERIAL").result(sqltypes.Integer),

Note: longer names (BIGSERIAL, SMALLSERIAL) must appear before SERIAL in the parsy.alt() list to avoid the prefix-collision issue described in #96.

3. Add a @compiles hook for MS-SQL if needed

If SQLAlchemy does not automatically strip SERIAL when targeting MS-SQL (i.e. it passes the token through unchanged), add a @compiles(CreateColumn, "mssql") hook in datafaker/create.py modelled on the existing DuckDB one:

@compiles(CreateColumn, "mssql")
def remove_serial_mssql(element: CreateColumn, compiler: Any, **kw: Any) -> str:
    text: str = compiler.visit_create_column(element, **kw)
    return text.replace(" SERIAL ", " INTEGER ")

This is only a safety net; the preferred fix is step 2 (parser-level mapping).

4. Add tests

  • A unit test confirming that SERIAL, BIGSERIAL, and SMALLSERIAL type strings parse to the correct generic integer types.
  • An integration test (or a DDL-compilation test using a mock MS-SQL engine) confirming that create_all() on an MS-SQL engine does not emit SERIAL in the generated DDL.

Acceptance criteria

  • create-tables against an MS-SQL destination succeeds when the orm.yaml was produced from a PostgreSQL source containing SERIAL / BIGSERIAL / SMALLSERIAL columns.
  • The generated MS-SQL DDL uses INTEGER IDENTITY(1,1) (or equivalent) rather than SERIAL.
  • Existing PostgreSQL and DuckDB behaviour is unchanged.

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions