Skip to content

Extend type parser to support MS-SQL column types #96

@myyong

Description

@myyong

Summary

datafaker/serialize_metadata.py contains a type parser (SIMPLE_TYPE_PARSER / type_parser) that converts SQL type strings in orm.yaml back into SQLAlchemy type objects. It currently only registers parsers for PostgreSQL-specific dialect types, which means:

  1. Types that appear in a PostgreSQL-sourced orm.yaml but have no MS-SQL equivalent will fail to parse (or produce incorrect types) when datafaker is pointed at an MS-SQL database.
  2. Native MS-SQL type strings produced by SQL Server (e.g. DATETIMEOFFSET, UNIQUEIDENTIFIER, VARBINARY) are not recognised by the parser at all, so make-tables on an MS-SQL source will fail.

Affected code

All PostgreSQL-specific entries in SIMPLE_TYPE_PARSER and type_parser in datafaker/serialize_metadata.py:

Current parser entry Problem for MS-SQL
simple(postgresql.TSVECTOR) No native MS-SQL equivalent; full-text indexing works differently
simple(postgresql.BYTEA) MS-SQL uses VARBINARY(n) / VARBINARY(MAX)
simple(postgresql.CIDR) No native network address type in MS-SQL
numeric_type(postgresql.BIT) MS-SQL BIT is a single boolean flag (0/1), not a multi-bit field
numeric_type(postgresql.REAL) Should map to sqltypes.REAL (generic) or mssql.REAL
time_type(sqltypes.TIMESTAMP, postgresql.types.TIMESTAMP) MS-SQL uses DATETIME2; timezone-aware timestamps use DATETIMEOFFSET
time_type(sqltypes.TIME, postgresql.types.TIME) MS-SQL TIME does not carry timezone info natively
postgresql.ARRAY(base, dimensions=…) in type_parser MS-SQL has no native array type
isinstance(type_, postgresql.DOMAIN) in column_to_dict MS-SQL has no domain type
isinstance(type_, postgresql.ENUM) in column_to_dict MS-SQL has no ENUM; typically a CHECK constraint or lookup table

Additionally, several MS-SQL-native type strings are entirely absent from the parser:

MS-SQL type Notes
UNIQUEIDENTIFIER UUID equivalent
VARBINARY(n) / VARBINARY(MAX) Binary data (replaces BYTEA)
DATETIMEOFFSET Timezone-aware timestamp
DATETIME2 High-precision datetime
NTEXT Deprecated but may appear in legacy schemas
MONEY / SMALLMONEY Monetary types
IMAGE Deprecated binary type

Proposed steps

1. Add MS-SQL dialect types to SIMPLE_TYPE_PARSER

Import sqlalchemy.dialects.mssql and register parsers for native MS-SQL types that are not already covered by the generic sqltypes parsers:

from sqlalchemy.dialects import mssql, oracle, postgresql

SIMPLE_TYPE_PARSER = parsy.alt(
    ...
    # MS-SQL native types
    simple(mssql.UNIQUEIDENTIFIER),
    numeric_type(mssql.VARBINARY),       # covers VARBINARY(n) and VARBINARY(MAX)
    simple(mssql.DATETIMEOFFSET),
    simple(mssql.DATETIME2),
    simple(mssql.MONEY),
    simple(mssql.SMALLMONEY),
    simple(mssql.IMAGE),
    numeric_type(mssql.BIT),             # single-bit boolean
    ...
)

2. Make BYTEA and CIDR gracefully degrade

When column_to_dict encounters postgresql.BYTEA, render it as VARBINARY(MAX) (or the generic sqltypes.LargeBinary) so that an orm.yaml sourced from PostgreSQL can still be used against MS-SQL. Do the same for postgresql.CIDR (no good equivalent — map to VARCHAR(43), the maximum length of an IPv6 CIDR string, and document the limitation).

3. Handle TSVECTOR

postgresql.TSVECTOR has no MS-SQL equivalent. Options:

  • Omit the column from the generated schema when targeting MS-SQL (log a warning).
  • Map it to NVARCHAR(MAX) and store the raw text; datafaker will generate random strings rather than valid tsvectors, but the table will at least be created.

Document whichever approach is chosen.

4. Handle ARRAY

postgresql.ARRAY has no direct MS-SQL equivalent. The type_parser should:

  • Raise a clear error (or warning) when an array type is encountered while targeting MS-SQL, rather than silently producing invalid DDL.
  • Alternatively, flatten to NVARCHAR(MAX) (JSON-serialised array) and document the limitation.

5. Update time_type to use generic / MS-SQL types

Replace postgresql.types.TIMESTAMP and postgresql.types.TIME in the time_type parsers with the generic sqltypes.TIMESTAMP and sqltypes.TIME for the non-timezone case, and mssql.DATETIMEOFFSET for the timezone-aware case:

time_type(sqltypes.TIMESTAMP, mssql.DATETIMEOFFSET),  # or a dialect-dispatched helper
time_type(sqltypes.TIME, sqltypes.TIME),

6. Update column_to_dict DOMAIN / ENUM handling

The existing isinstance(type_, postgresql.DOMAIN) and isinstance(type_, postgresql.ENUM) guards in column_to_dict are already dialect-conditional, but should be extended to handle MS-SQL equivalents (or degrade gracefully when those types are not present).

7. Add tests

  • Unit tests for type_parser.parse(...) covering each new MS-SQL type string.
  • Regression tests confirming that existing PostgreSQL type strings still parse correctly.
  • A test confirming that an orm.yaml file containing PostgreSQL-specific types can be processed against an MS-SQL engine without raising an unhandled exception.

Acceptance criteria

  • poetry run datafaker make-tables against an MS-SQL source produces a valid orm.yaml.
  • An orm.yaml produced from a PostgreSQL source can be used with create-tables against an MS-SQL destination (with documented limitations for types that have no equivalent).
  • All existing PostgreSQL type-parser tests continue to pass.

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions