Summary
datafaker/serialize_metadata.py contains a type parser (SIMPLE_TYPE_PARSER / type_parser) that converts SQL type strings in orm.yaml back into SQLAlchemy type objects. It currently only registers parsers for PostgreSQL-specific dialect types, which means:
- Types that appear in a PostgreSQL-sourced
orm.yaml but have no MS-SQL equivalent will fail to parse (or produce incorrect types) when datafaker is pointed at an MS-SQL database.
- Native MS-SQL type strings produced by SQL Server (e.g.
DATETIMEOFFSET, UNIQUEIDENTIFIER, VARBINARY) are not recognised by the parser at all, so make-tables on an MS-SQL source will fail.
Affected code
All PostgreSQL-specific entries in SIMPLE_TYPE_PARSER and type_parser in datafaker/serialize_metadata.py:
| Current parser entry |
Problem for MS-SQL |
simple(postgresql.TSVECTOR) |
No native MS-SQL equivalent; full-text indexing works differently |
simple(postgresql.BYTEA) |
MS-SQL uses VARBINARY(n) / VARBINARY(MAX) |
simple(postgresql.CIDR) |
No native network address type in MS-SQL |
numeric_type(postgresql.BIT) |
MS-SQL BIT is a single boolean flag (0/1), not a multi-bit field |
numeric_type(postgresql.REAL) |
Should map to sqltypes.REAL (generic) or mssql.REAL |
time_type(sqltypes.TIMESTAMP, postgresql.types.TIMESTAMP) |
MS-SQL uses DATETIME2; timezone-aware timestamps use DATETIMEOFFSET |
time_type(sqltypes.TIME, postgresql.types.TIME) |
MS-SQL TIME does not carry timezone info natively |
postgresql.ARRAY(base, dimensions=…) in type_parser |
MS-SQL has no native array type |
isinstance(type_, postgresql.DOMAIN) in column_to_dict |
MS-SQL has no domain type |
isinstance(type_, postgresql.ENUM) in column_to_dict |
MS-SQL has no ENUM; typically a CHECK constraint or lookup table |
Additionally, several MS-SQL-native type strings are entirely absent from the parser:
| MS-SQL type |
Notes |
UNIQUEIDENTIFIER |
UUID equivalent |
VARBINARY(n) / VARBINARY(MAX) |
Binary data (replaces BYTEA) |
DATETIMEOFFSET |
Timezone-aware timestamp |
DATETIME2 |
High-precision datetime |
NTEXT |
Deprecated but may appear in legacy schemas |
MONEY / SMALLMONEY |
Monetary types |
IMAGE |
Deprecated binary type |
Proposed steps
1. Add MS-SQL dialect types to SIMPLE_TYPE_PARSER
Import sqlalchemy.dialects.mssql and register parsers for native MS-SQL types that are not already covered by the generic sqltypes parsers:
from sqlalchemy.dialects import mssql, oracle, postgresql
SIMPLE_TYPE_PARSER = parsy.alt(
...
# MS-SQL native types
simple(mssql.UNIQUEIDENTIFIER),
numeric_type(mssql.VARBINARY), # covers VARBINARY(n) and VARBINARY(MAX)
simple(mssql.DATETIMEOFFSET),
simple(mssql.DATETIME2),
simple(mssql.MONEY),
simple(mssql.SMALLMONEY),
simple(mssql.IMAGE),
numeric_type(mssql.BIT), # single-bit boolean
...
)
2. Make BYTEA and CIDR gracefully degrade
When column_to_dict encounters postgresql.BYTEA, render it as VARBINARY(MAX) (or the generic sqltypes.LargeBinary) so that an orm.yaml sourced from PostgreSQL can still be used against MS-SQL. Do the same for postgresql.CIDR (no good equivalent — map to VARCHAR(43), the maximum length of an IPv6 CIDR string, and document the limitation).
3. Handle TSVECTOR
postgresql.TSVECTOR has no MS-SQL equivalent. Options:
- Omit the column from the generated schema when targeting MS-SQL (log a warning).
- Map it to
NVARCHAR(MAX) and store the raw text; datafaker will generate random strings rather than valid tsvectors, but the table will at least be created.
Document whichever approach is chosen.
4. Handle ARRAY
postgresql.ARRAY has no direct MS-SQL equivalent. The type_parser should:
- Raise a clear error (or warning) when an array type is encountered while targeting MS-SQL, rather than silently producing invalid DDL.
- Alternatively, flatten to
NVARCHAR(MAX) (JSON-serialised array) and document the limitation.
5. Update time_type to use generic / MS-SQL types
Replace postgresql.types.TIMESTAMP and postgresql.types.TIME in the time_type parsers with the generic sqltypes.TIMESTAMP and sqltypes.TIME for the non-timezone case, and mssql.DATETIMEOFFSET for the timezone-aware case:
time_type(sqltypes.TIMESTAMP, mssql.DATETIMEOFFSET), # or a dialect-dispatched helper
time_type(sqltypes.TIME, sqltypes.TIME),
6. Update column_to_dict DOMAIN / ENUM handling
The existing isinstance(type_, postgresql.DOMAIN) and isinstance(type_, postgresql.ENUM) guards in column_to_dict are already dialect-conditional, but should be extended to handle MS-SQL equivalents (or degrade gracefully when those types are not present).
7. Add tests
- Unit tests for
type_parser.parse(...) covering each new MS-SQL type string.
- Regression tests confirming that existing PostgreSQL type strings still parse correctly.
- A test confirming that an
orm.yaml file containing PostgreSQL-specific types can be processed against an MS-SQL engine without raising an unhandled exception.
Acceptance criteria
poetry run datafaker make-tables against an MS-SQL source produces a valid orm.yaml.
- An
orm.yaml produced from a PostgreSQL source can be used with create-tables against an MS-SQL destination (with documented limitations for types that have no equivalent).
- All existing PostgreSQL type-parser tests continue to pass.
Related
Summary
datafaker/serialize_metadata.pycontains a type parser (SIMPLE_TYPE_PARSER/type_parser) that converts SQL type strings inorm.yamlback into SQLAlchemy type objects. It currently only registers parsers for PostgreSQL-specific dialect types, which means:orm.yamlbut have no MS-SQL equivalent will fail to parse (or produce incorrect types) when datafaker is pointed at an MS-SQL database.DATETIMEOFFSET,UNIQUEIDENTIFIER,VARBINARY) are not recognised by the parser at all, somake-tableson an MS-SQL source will fail.Affected code
All PostgreSQL-specific entries in
SIMPLE_TYPE_PARSERandtype_parserin datafaker/serialize_metadata.py:simple(postgresql.TSVECTOR)simple(postgresql.BYTEA)VARBINARY(n)/VARBINARY(MAX)simple(postgresql.CIDR)numeric_type(postgresql.BIT)BITis a single boolean flag (0/1), not a multi-bit fieldnumeric_type(postgresql.REAL)sqltypes.REAL(generic) ormssql.REALtime_type(sqltypes.TIMESTAMP, postgresql.types.TIMESTAMP)DATETIME2; timezone-aware timestamps useDATETIMEOFFSETtime_type(sqltypes.TIME, postgresql.types.TIME)TIMEdoes not carry timezone info nativelypostgresql.ARRAY(base, dimensions=…)intype_parserisinstance(type_, postgresql.DOMAIN)incolumn_to_dictisinstance(type_, postgresql.ENUM)incolumn_to_dictAdditionally, several MS-SQL-native type strings are entirely absent from the parser:
UNIQUEIDENTIFIERVARBINARY(n)/VARBINARY(MAX)DATETIMEOFFSETDATETIME2NTEXTMONEY/SMALLMONEYIMAGEProposed steps
1. Add MS-SQL dialect types to
SIMPLE_TYPE_PARSERImport
sqlalchemy.dialects.mssqland register parsers for native MS-SQL types that are not already covered by the genericsqltypesparsers:2. Make BYTEA and CIDR gracefully degrade
When
column_to_dictencounterspostgresql.BYTEA, render it asVARBINARY(MAX)(or the genericsqltypes.LargeBinary) so that anorm.yamlsourced from PostgreSQL can still be used against MS-SQL. Do the same forpostgresql.CIDR(no good equivalent — map toVARCHAR(43), the maximum length of an IPv6 CIDR string, and document the limitation).3. Handle TSVECTOR
postgresql.TSVECTORhas no MS-SQL equivalent. Options:NVARCHAR(MAX)and store the raw text; datafaker will generate random strings rather than valid tsvectors, but the table will at least be created.Document whichever approach is chosen.
4. Handle ARRAY
postgresql.ARRAYhas no direct MS-SQL equivalent. Thetype_parsershould:NVARCHAR(MAX)(JSON-serialised array) and document the limitation.5. Update
time_typeto use generic / MS-SQL typesReplace
postgresql.types.TIMESTAMPandpostgresql.types.TIMEin thetime_typeparsers with the genericsqltypes.TIMESTAMPandsqltypes.TIMEfor the non-timezone case, andmssql.DATETIMEOFFSETfor the timezone-aware case:6. Update
column_to_dictDOMAIN / ENUM handlingThe existing
isinstance(type_, postgresql.DOMAIN)andisinstance(type_, postgresql.ENUM)guards incolumn_to_dictare already dialect-conditional, but should be extended to handle MS-SQL equivalents (or degrade gracefully when those types are not present).7. Add tests
type_parser.parse(...)covering each new MS-SQL type string.orm.yamlfile containing PostgreSQL-specific types can be processed against an MS-SQL engine without raising an unhandled exception.Acceptance criteria
poetry run datafaker make-tablesagainst an MS-SQL source produces a validorm.yaml.orm.yamlproduced from a PostgreSQL source can be used withcreate-tablesagainst an MS-SQL destination (with documented limitations for types that have no equivalent).Related