Summary
PostgreSQL uses SERIAL (and its variants BIGSERIAL, SMALLSERIAL) as shorthand for autoincrement integer columns. When datafaker reflects a PostgreSQL source schema, these columns appear as SERIAL in the generated DDL. The existing code already works around this for DuckDB via a @compiles hook, but no equivalent exists for MS-SQL, which uses IDENTITY(1,1) instead.
Problem
In datafaker/create.py there are two dialect-specific DDL hooks registered with SQLAlchemy's @compiles decorator:
@compiles(CreateColumn, "duckdb")
def remove_serial(element: CreateColumn, compiler: Any, **kw: Any) -> str:
text: str = compiler.visit_create_column(element, **kw)
return text.replace(" SERIAL ", " INTEGER ")
@compiles(CreateTable, "duckdb")
def remove_on_delete_cascade(element: CreateTable, compiler: Any, **kw: Any) -> str:
text: str = compiler.visit_create_table(element, **kw)
return text.replace(" ON DELETE CASCADE", "")
The remove_serial hook replaces SERIAL with INTEGER for DuckDB. MS-SQL does not understand SERIAL either, so create-tables against an MS-SQL destination will raise a syntax error whenever the source schema has autoincrement columns.
Additionally, ON DELETE CASCADE is stripped for DuckDB but not for MS-SQL, where it is supported — so that hook does not need an MS-SQL equivalent. However, MS-SQL has its own DDL quirks that may need similar treatment (see proposed steps below).
How SERIAL arises
When datafaker reflects a PostgreSQL source, SQLAlchemy maps SERIAL columns to INTEGER columns with a Sequence or autoincrement=True. When the ORM YAML is then used to create-tables on a destination, SQLAlchemy re-generates DDL. For a PostgreSQL destination this renders as SERIAL; for MS-SQL the correct rendering is INTEGER IDENTITY(1,1).
In practice, SQLAlchemy already handles this correctly when the column is defined with autoincrement=True and the engine is an MS-SQL engine — it will emit IDENTITY(1,1) automatically. The problem is the @compiles(CreateColumn, "duckdb") hook, which replaces the compiled text at the string level. A naive port of the same approach to "mssql" is unnecessary and could interfere with SQLAlchemy's correct MS-SQL DDL generation.
The real risk is the reverse: if a source schema is reflected with SERIAL stored literally as the column type string in orm.yaml (rather than being resolved to INTEGER + autoincrement), parsing and re-emitting that string on an MS-SQL engine will produce invalid DDL.
Proposed steps
1. Confirm how SERIAL is stored in orm.yaml
Run make-tables against a PostgreSQL source that contains SERIAL columns and inspect the resulting orm.yaml. Determine whether:
- (a) The type string is
INTEGER (autoincrement handled separately), or
- (b) The type string is literally
SERIAL / BIGSERIAL / SMALLSERIAL.
If (a), SQLAlchemy will generate IDENTITY(1,1) for MS-SQL automatically and no further action is needed. If (b), proceed with steps 2–4.
2. Add SERIAL type strings to the type parser
If orm.yaml contains SERIAL, BIGSERIAL, or SMALLSERIAL, add parsers for them in datafaker/serialize_metadata.py that map to the appropriate generic integer types with autoincrement=True:
# SERIAL variants → generic integers; autoincrement is set via column flag
parsy.string("BIGSERIAL").result(sqltypes.BigInteger), # must precede SERIAL
parsy.string("SMALLSERIAL").result(sqltypes.SmallInteger),
parsy.string("SERIAL").result(sqltypes.Integer),
Note: longer names (BIGSERIAL, SMALLSERIAL) must appear before SERIAL in the parsy.alt() list to avoid the prefix-collision issue described in #96.
3. Add a @compiles hook for MS-SQL if needed
If SQLAlchemy does not automatically strip SERIAL when targeting MS-SQL (i.e. it passes the token through unchanged), add a @compiles(CreateColumn, "mssql") hook in datafaker/create.py modelled on the existing DuckDB one:
@compiles(CreateColumn, "mssql")
def remove_serial_mssql(element: CreateColumn, compiler: Any, **kw: Any) -> str:
text: str = compiler.visit_create_column(element, **kw)
return text.replace(" SERIAL ", " INTEGER ")
This is only a safety net; the preferred fix is step 2 (parser-level mapping).
4. Add tests
- A unit test confirming that
SERIAL, BIGSERIAL, and SMALLSERIAL type strings parse to the correct generic integer types.
- An integration test (or a DDL-compilation test using a mock MS-SQL engine) confirming that
create_all() on an MS-SQL engine does not emit SERIAL in the generated DDL.
Acceptance criteria
create-tables against an MS-SQL destination succeeds when the orm.yaml was produced from a PostgreSQL source containing SERIAL / BIGSERIAL / SMALLSERIAL columns.
- The generated MS-SQL DDL uses
INTEGER IDENTITY(1,1) (or equivalent) rather than SERIAL.
- Existing PostgreSQL and DuckDB behaviour is unchanged.
Related
Summary
PostgreSQL uses
SERIAL(and its variantsBIGSERIAL,SMALLSERIAL) as shorthand for autoincrement integer columns. When datafaker reflects a PostgreSQL source schema, these columns appear asSERIALin the generated DDL. The existing code already works around this for DuckDB via a@compileshook, but no equivalent exists for MS-SQL, which usesIDENTITY(1,1)instead.Problem
In
datafaker/create.pythere are two dialect-specific DDL hooks registered with SQLAlchemy's@compilesdecorator:The
remove_serialhook replacesSERIALwithINTEGERfor DuckDB. MS-SQL does not understandSERIALeither, socreate-tablesagainst an MS-SQL destination will raise a syntax error whenever the source schema has autoincrement columns.Additionally,
ON DELETE CASCADEis stripped for DuckDB but not for MS-SQL, where it is supported — so that hook does not need an MS-SQL equivalent. However, MS-SQL has its own DDL quirks that may need similar treatment (see proposed steps below).How SERIAL arises
When datafaker reflects a PostgreSQL source, SQLAlchemy maps
SERIALcolumns toINTEGERcolumns with aSequenceorautoincrement=True. When the ORM YAML is then used tocreate-tableson a destination, SQLAlchemy re-generates DDL. For a PostgreSQL destination this renders asSERIAL; for MS-SQL the correct rendering isINTEGER IDENTITY(1,1).In practice, SQLAlchemy already handles this correctly when the column is defined with
autoincrement=Trueand the engine is an MS-SQL engine — it will emitIDENTITY(1,1)automatically. The problem is the@compiles(CreateColumn, "duckdb")hook, which replaces the compiled text at the string level. A naive port of the same approach to"mssql"is unnecessary and could interfere with SQLAlchemy's correct MS-SQL DDL generation.The real risk is the reverse: if a source schema is reflected with
SERIALstored literally as the column type string inorm.yaml(rather than being resolved toINTEGER + autoincrement), parsing and re-emitting that string on an MS-SQL engine will produce invalid DDL.Proposed steps
1. Confirm how SERIAL is stored in orm.yaml
Run
make-tablesagainst a PostgreSQL source that containsSERIALcolumns and inspect the resultingorm.yaml. Determine whether:INTEGER(autoincrement handled separately), orSERIAL/BIGSERIAL/SMALLSERIAL.If (a), SQLAlchemy will generate
IDENTITY(1,1)for MS-SQL automatically and no further action is needed. If (b), proceed with steps 2–4.2. Add SERIAL type strings to the type parser
If
orm.yamlcontainsSERIAL,BIGSERIAL, orSMALLSERIAL, add parsers for them indatafaker/serialize_metadata.pythat map to the appropriate generic integer types withautoincrement=True:Note: longer names (
BIGSERIAL,SMALLSERIAL) must appear beforeSERIALin theparsy.alt()list to avoid the prefix-collision issue described in #96.3. Add a @compiles hook for MS-SQL if needed
If SQLAlchemy does not automatically strip
SERIALwhen targeting MS-SQL (i.e. it passes the token through unchanged), add a@compiles(CreateColumn, "mssql")hook indatafaker/create.pymodelled on the existing DuckDB one:This is only a safety net; the preferred fix is step 2 (parser-level mapping).
4. Add tests
SERIAL,BIGSERIAL, andSMALLSERIALtype strings parse to the correct generic integer types.create_all()on an MS-SQL engine does not emitSERIALin the generated DDL.Acceptance criteria
create-tablesagainst an MS-SQL destination succeeds when theorm.yamlwas produced from a PostgreSQL source containingSERIAL/BIGSERIAL/SMALLSERIALcolumns.INTEGER IDENTITY(1,1)(or equivalent) rather thanSERIAL.Related