DOC: mssql-python driver API documentation for Bulk Copy #454
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,387 @@ | ||
| # Bulk Copy (BCP) API Reference | ||
|
|
||
| ## Overview | ||
|
|
||
| The `bulkcopy()` method on the `Cursor` object provides high-performance bulk data loading into SQL Server, Azure SQL Database, and Azure SQL Managed Instance. It is designed for ETL workloads and scenarios that require inserting large volumes of data far more efficiently than individual `INSERT` statements. | ||
|
|
||
| Internally, `bulkcopy()` opens a dedicated connection through the native `mssql_py_core` library and streams data in configurable batches, leveraging the same TDS Bulk Insert protocol used by tools like `bcp` and `SqlBulkCopy`. | ||
|
|
||
| > **Availability:** Requires the `mssql_py_core` native extension (bundled with the `mssql-python` wheel). | ||
|
|
||
|
|
||
| --- | ||
| ## Key Benefits | ||
|
Contributor
Key benefits are not really required in this doc unless others feel differently. This is a better candidate for a blog post.
Contributor
Author
It may be good to keep, since users should be aware of the properties and efficiency of BCP. But this is surely a good candidate for a blog post too. |
||
|
|
||
| • High-Performance Batch Inserts | ||
| • Configurable Batch Sizes for Optimal Throughput | ||
| • Flexible Column Mapping (Ordinal or Explicit) | ||
| • Support for Identity Columns, Constraints, and Triggers | ||
| • Transaction Control per Batch | ||
|
|
||
|
|
||
| ## Quick Start | ||
|
|
||
| ```python | ||
| import mssql_python | ||
|
|
||
| conn = mssql_python.connect( | ||
| "SERVER=localhost;DATABASE=TestDB;UID=sa;PWD=YourPassword123;" | ||
|
|
||
| ) | ||
| cursor = conn.cursor() | ||
|
|
||
| # bulkcopy() requires an existing target table, so create one for this example | ||
|
|
||
| cursor.execute(""" | ||
| CREATE TABLE Products ( | ||
| id INT, | ||
| name VARCHAR(100), | ||
| price FLOAT | ||
| ) | ||
| """) | ||
| conn.commit() | ||
|
|
||
| data = [ | ||
| (1, "Widget", 9.99), | ||
| (2, "Gadget", 24.50), | ||
| (3, "Gizmo", 14.75), | ||
| ] | ||
|
|
||
| result = cursor.bulkcopy("Products", data) | ||
| print(result) | ||
| # Example: {'rows_copied': 3, 'batch_count': 1, 'elapsed_time': 0.12} (elapsed_time varies per run) | ||
|
|
||
| cursor.close() | ||
| conn.close() | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## Method Signature | ||
|
|
||
| ```python | ||
| Cursor.bulkcopy( | ||
| table_name: str, | ||
| data: Iterable[Union[Tuple, List]], | ||
| batch_size: int = 0, | ||
| timeout: int = 30, | ||
| column_mappings: Optional[Union[List[str], List[Tuple[int, str]]]] = None, | ||
| keep_identity: bool = False, | ||
| check_constraints: bool = False, | ||
| table_lock: bool = False, | ||
| keep_nulls: bool = False, | ||
| fire_triggers: bool = False, | ||
| use_internal_transaction: bool = False, | ||
| ) -> dict | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## Parameters | ||
|
|
||
| ### Required | ||
|
|
||
| | Parameter | Type | Description | | ||
| |-----------|------|-------------| | ||
| | `table_name` | `str` | Target table name. May include schema (`dbo.MyTable`) or be fully qualified (`[MyDB].[dbo].[MyTable]`). The table must already exist and the caller must have `INSERT` permission. | | ||
| | `data` | `Iterable[Tuple \| List]` | Iterable of rows. Each row is a tuple or list of column values. Column order must match the table's ordinal column order unless `column_mappings` is provided. | | ||
|
|
||
| ### Optional | ||
|
|
||
| | Parameter | Type | Default | Description | | ||
| |-----------|------|---------|-------------| | ||
| | `batch_size` | `int` | `0` | Rows per batch. `0` lets the server choose the optimal size. | | ||
| | `timeout` | `int` | `30` | Operation timeout in seconds. | | ||
|
|
||
| | `column_mappings` | `List[str]` or `List[Tuple[int, str]]` | `None` | Column mapping specification (see [Column Mappings](#column-mappings) below). When `None`, columns are mapped by ordinal position. | | ||
| | `keep_identity` | `bool` | `False` | When `True`, identity values from the source data are preserved instead of being generated by the server. Requires `ALTER` permission on the target table. | | ||
| | `check_constraints` | `bool` | `False` | When `True`, CHECK and FOREIGN KEY constraints are enforced during the insert. | | ||
| | `table_lock` | `bool` | `False` | When `True`, acquires a table-level bulk update (`BU`) lock for the duration of the load instead of row-level locks. Improves throughput when no other sessions need the table concurrently. | | ||
|
|
||
| | `keep_nulls` | `bool` | `False` | When `True`, source `NULL` values are inserted as-is. When `False`, the column's default value is used instead. | | ||
| | `fire_triggers` | `bool` | `False` | When `True`, INSERT triggers defined on the target table are executed. | | ||
| | `use_internal_transaction` | `bool` | `False` | When `True`, each batch is committed in its own transaction, enabling partial-success scenarios. | | ||
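To illustrate what a positive `batch_size` means for a streamed iterable, the sketch below splits rows into batches in plain Python. The actual batching happens inside the native layer; `iter_batches` is a hypothetical helper used only for illustration, and it deliberately does not model `batch_size=0`, where the server chooses the size.

```python
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield lists of at most batch_size rows from any iterable."""
    if batch_size <= 0:
        raise ValueError("this sketch only covers a positive batch_size")
    iterator = iter(rows)
    while True:
        # islice consumes the next batch_size rows without materializing the rest.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 25 generated rows with batch_size=10 produce two full batches and one partial batch.
sizes = [len(batch) for batch in iter_batches(((i, f"User{i}") for i in range(25)), 10)]
print(sizes)
```

Under these assumptions, one would expect `batch_count` in the result to be roughly the row count divided by `batch_size`, rounded up; with `use_internal_transaction=True`, each such batch is committed on its own.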
|
|
||
| --- | ||
|
|
||
| ## Column Mappings | ||
|
|
||
| Column mappings control how columns in the source data line up with columns in the target table. | ||
|
|
||
| ### Ordinal (Default) | ||
|
|
||
| When `column_mappings` is omitted, columns are mapped by position: | ||
|
|
||
| ``` | ||
| source index 0 → first table column | ||
| source index 1 → second table column | ||
| … | ||
| ``` | ||
|
|
||
| ### Simple Format — `List[str]` | ||
|
|
||
| A list of destination column names. Position in the list equals the source index. | ||
|
|
||
| ```python | ||
| column_mappings = ["UserID", "FirstName", "Email"] | ||
| # index 0 → UserID, index 1 → FirstName, index 2 → Email | ||
| ``` | ||
|
|
||
| ### Advanced Format — `List[Tuple[int, str]]` | ||
|
|
||
| Explicit `(source_index, target_column_name)` tuples. Allows skipping or reordering columns. | ||
|
|
||
| ```python | ||
| column_mappings = [ | ||
| (0, "UserID"), # source[0] → UserID | ||
| (1, "FirstName"), # source[1] → FirstName | ||
| (3, "Email"), # source[3] → Email (source[2] skipped) | ||
| ] | ||
| ``` | ||
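To make the two mapping formats concrete, the sketch below normalizes either form into `(source_index, target_column)` pairs and shows which source value would land in which column. It is a pure-Python illustration of the semantics described above, not the driver's implementation; `normalize_mappings` and `project_row` are hypothetical helper names.

```python
from typing import List, Optional, Sequence, Tuple, Union

Mappings = Union[List[str], List[Tuple[int, str]]]


def normalize_mappings(column_mappings: Optional[Mappings]) -> Optional[List[Tuple[int, str]]]:
    """Return explicit (source_index, target_column) pairs, or None for ordinal mapping."""
    if column_mappings is None:
        return None  # ordinal mapping: the table's own column order applies
    pairs = []
    for position, entry in enumerate(column_mappings):
        if isinstance(entry, str):
            # Simple format: the position in the list is the source index.
            pairs.append((position, entry))
        else:
            # Advanced format: the tuple already carries the source index.
            source_index, target = entry
            pairs.append((source_index, target))
    return pairs


def project_row(row: Sequence, pairs: List[Tuple[int, str]]) -> dict:
    """Show which source value would be sent to which target column."""
    return {target: row[source_index] for source_index, target in pairs}


row = (1, "Alice", "SKIP_ME", "alice@example.com")
pairs = normalize_mappings([(0, "UserID"), (1, "FirstName"), (3, "Email")])
print(project_row(row, pairs))
```

Running a row through a helper like this before a large load can be a cheap way to confirm that a skipped or reordered index points at the intended value.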
|
|
||
| --- | ||
|
|
||
| ## Return Value | ||
|
|
||
| A dictionary with operation metrics: | ||
|
|
||
| | Key | Type | Description | | ||
| |-----|------|-------------| | ||
| | `rows_copied` | `int` | Total number of rows successfully inserted. | | ||
| | `batch_count` | `int` | Number of batches processed. | | ||
| | `elapsed_time` | `float` | Wall-clock time for the operation (seconds). | | ||
|
|
||
|
|
||
| ```python | ||
| result = cursor.bulkcopy("Products", data) | ||
| # e.g. {'rows_copied': 3, 'batch_count': 1, 'elapsed_time': 0.12}; elapsed_time varies per run | ||
| ``` | ||
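The metrics can be combined into a throughput figure. The sketch below assumes only the three documented keys; `rows_per_second` is a hypothetical helper, and the guard covers the case where a very small load reports an elapsed time of zero.

```python
def rows_per_second(result: dict) -> float:
    """Compute throughput from a bulkcopy() result; 0.0 if no time was recorded."""
    elapsed = result["elapsed_time"]
    return result["rows_copied"] / elapsed if elapsed > 0 else 0.0


# Illustrative values only, shaped like the documented return dictionary.
sample = {"rows_copied": 3, "batch_count": 1, "elapsed_time": 0.12}
print(f"{rows_per_second(sample):.0f} rows/sec")
```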
|
|
||
| --- | ||
|
|
||
| ## Exceptions | ||
|
|
||
| | Exception | When | | ||
| |-----------|------| | ||
| | `ImportError` | `mssql_py_core` native library is not available. | | ||
| | `TypeError` | `data` is `None`, not iterable, or is a `str`/`bytes`; `batch_size` or `timeout` is not `int`. | | ||
| | `ValueError` | `table_name` is empty; `batch_size` is negative; `timeout` is not positive; `SERVER` parameter missing from connection string. | | ||
|
|
||
| | `RuntimeError` | Connection string is unavailable; Azure AD token acquisition fails. | | ||
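The argument checks in the table above can be mirrored in a pre-flight function, which is handy for validating inputs in unit tests without a database. This is a sketch of the documented conditions only; `check_bulkcopy_args` is a hypothetical helper, and the driver's real checks, ordering, and messages may differ.

```python
from collections.abc import Iterable


def check_bulkcopy_args(table_name, data, batch_size=0, timeout=30):
    """Raise the exception types listed in the table above for bad arguments."""
    if data is None or isinstance(data, (str, bytes)) or not isinstance(data, Iterable):
        raise TypeError("data must be an iterable of tuples or lists")
    for name, value in (("batch_size", batch_size), ("timeout", timeout)):
        # bool is a subclass of int; rejecting it is this sketch's own choice.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(f"{name} must be an int")
    if not table_name:
        raise ValueError("table_name must not be empty")
    if batch_size < 0:
        raise ValueError("batch_size must not be negative")
    if timeout <= 0:
        raise ValueError("timeout must be positive")


check_bulkcopy_args("Users", [(1, "Alice")], batch_size=1000, timeout=120)  # passes silently
```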
|
|
||
| --- | ||
|
|
||
| ## Connection String Requirements | ||
|
|
||
|
|
||
| `bulkcopy()` opens its own internal connection using the connection string from the parent `Connection` object. The following applies: | ||
|
|
||
| | Keyword | Required | Notes | | ||
| |---------|----------|-------| | ||
| | `SERVER` (or synonyms `ADDR`, `ADDRESS`) | **Yes** | Target server. All three keywords are accepted and resolve identically. | | ||
| | `DATABASE` | No | When omitted, the login's default database on the server is used. Use a fully qualified table name (e.g., `[MyDB].[dbo].[MyTable]`) to target a specific database. | | ||
|
|
||
|
|
||
| All other supported connection-string keywords (encryption, authentication, keep-alive, etc.) are forwarded to the internal connection automatically. | ||
|
|
||
|
|
||
| ### Supported Connection-String Keywords | ||
|
|
||
| Keywords forwarded to the internal bulk copy connection: | ||
|
|
||
| | Category | Keywords | | ||
| |----------|----------| | ||
| | **Auth / Credentials** | `UID`, `PWD`, `Trusted_Connection`, `Authentication` | | ||
| | **Server** | `Server`, `Addr`, `Address` | | ||
| | **Database** | `Database` | | ||
| | **Encryption / TLS** | `Encrypt`, `TrustServerCertificate`, `HostNameInCertificate`, `ServerCertificate` | | ||
| | **Kerberos** | `ServerSPN` | | ||
| | **Network** | `MultiSubnetFailover`, `IPAddressPreference`, `KeepAlive`, `KeepAliveInterval` | | ||
| | **Sizing / Limits** | `PacketSize`, `ConnectRetryCount`, `ConnectRetryInterval` | | ||
| | **Intent** | `ApplicationIntent` | | ||
|
|
||
| > Keywords not in this list are silently ignored when constructing the internal bulk copy connection. | ||
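Because unlisted keywords are dropped without warning, it can help to check a connection string against the table above before relying on a setting. The sketch below is a rough illustration: `FORWARDED` and `unforwarded_keywords` are hypothetical names, keyword matching is assumed to be case-insensitive, and the naive split on `;` does not handle brace-quoted values that contain semicolons.

```python
# Lower-cased copy of the keywords listed in the table above.
FORWARDED = {
    "uid", "pwd", "trusted_connection", "authentication",
    "server", "addr", "address", "database",
    "encrypt", "trustservercertificate", "hostnameincertificate", "servercertificate",
    "serverspn",
    "multisubnetfailover", "ipaddresspreference", "keepalive", "keepaliveinterval",
    "packetsize", "connectretrycount", "connectretryinterval",
    "applicationintent",
}


def unforwarded_keywords(conn_str: str) -> list:
    """Return keywords from conn_str that do not appear in the forwarded list."""
    dropped = []
    for part in conn_str.split(";"):
        if "=" not in part:
            continue  # skips empty segments such as the one after a trailing ';'
        key = part.split("=", 1)[0].strip()
        if key.lower() not in FORWARDED:
            dropped.append(key)
    return dropped


print(unforwarded_keywords("SERVER=localhost;DATABASE=TestDB;APP=loader;Encrypt=yes;"))
```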
|
|
||
| --- | ||
|
|
||
| ## Usage Examples | ||
|
|
||
| ### 1. Basic Insert (Ordinal Mapping) | ||
|
|
||
| ```python | ||
| data = [ | ||
| (1, "Alice", 100.5), | ||
| (2, "Bob", 200.75), | ||
| (3, "Charlie", 300.25), | ||
| ] | ||
|
|
||
| result = cursor.bulkcopy("Sales", data) | ||
| assert result["rows_copied"] == 3 | ||
| ``` | ||
|
|
||
| ### 2. Named Column Mapping | ||
|
|
||
| ```python | ||
| # Table: Users (UserID INT, FirstName VARCHAR, LastName VARCHAR, Email VARCHAR) | ||
| data = [ | ||
| (1, "Alice", "Smith", "alice@example.com"), | ||
| (2, "Bob", "Jones", "bob@example.com"), | ||
| ] | ||
|
|
||
| result = cursor.bulkcopy( | ||
| "Users", | ||
| data, | ||
| column_mappings=["UserID", "FirstName", "LastName", "Email"], | ||
| ) | ||
| ``` | ||
|
|
||
| ### 3. Selective Column Mapping (Skip & Reorder) | ||
|
|
||
| ```python | ||
| # Source rows have 4 values; only 3 are inserted (index 2 is skipped) | ||
| data = [ | ||
| (1, "Alice", "SKIP_ME", "alice@example.com"), | ||
| (2, "Bob", "SKIP_ME", "bob@example.com"), | ||
| ] | ||
|
|
||
| result = cursor.bulkcopy( | ||
| "Users", | ||
| data, | ||
| column_mappings=[ | ||
| (0, "UserID"), | ||
| (1, "FirstName"), | ||
| (3, "Email"), # index 2 intentionally skipped | ||
| ], | ||
| ) | ||
| ``` | ||
|
|
||
| ### 4. Preserving Identity Values | ||
|
|
||
| ```python | ||
| # Table: Users (UserID INT IDENTITY PRIMARY KEY, Name VARCHAR) | ||
| data = [ | ||
| (100, "Alice"), | ||
| (200, "Bob"), | ||
| ] | ||
|
|
||
| result = cursor.bulkcopy( | ||
| "Users", | ||
| data, | ||
| keep_identity=True, | ||
| column_mappings=["UserID", "Name"], | ||
| ) | ||
| ``` | ||
|
|
||
| ### 5. High-Throughput Import with Performance Options | ||
|
|
||
| ```python | ||
| def generate_rows(): | ||
| """Memory-efficient generator for large datasets.""" | ||
| for i in range(1_000_000): | ||
| yield (i, f"User{i}", f"user{i}@example.com") | ||
|
|
||
| result = cursor.bulkcopy( | ||
| "Users", | ||
| generate_rows(), | ||
| batch_size=10_000, | ||
| timeout=300, | ||
| table_lock=True, | ||
| use_internal_transaction=True, | ||
| ) | ||
|
|
||
| print(f"{result['rows_copied']} rows in {result['elapsed_time']:.1f}s " | ||
| f"({result['rows_copied'] / result['elapsed_time']:.0f} rows/sec)") | ||
| ``` | ||
|
|
||
| ### 6. Fully Qualified Table Name (Cross-Database) | ||
|
|
||
| ```python | ||
| # Useful when DATABASE is omitted from the connection string | ||
| result = cursor.bulkcopy("[ProductionDB].[dbo].[Users]", data) | ||
| ``` | ||
|
|
||
| ### 7. Connection with SERVER Synonym | ||
|
|
||
| ```python | ||
| # SERVER, ADDR, and ADDRESS resolve identically; ADDR is shown here: | ||
| conn = mssql_python.connect("ADDR=localhost,1433;DATABASE=TestDB;UID=sa;PWD=pass;") | ||
| cursor = conn.cursor() | ||
| result = cursor.bulkcopy("Users", data) | ||
| ``` | ||
|
|
||
| ### 8. Comprehensive Error Handling | ||
|
|
||
| ```python | ||
| try: | ||
| result = cursor.bulkcopy("Users", data, batch_size=1000, timeout=120) | ||
| print(f"Copied {result['rows_copied']} rows") | ||
|
|
||
| except ImportError: | ||
| print("mssql_py_core is not installed — bulkcopy unavailable") | ||
|
|
||
| except TypeError as e: | ||
| print(f"Bad data format: {e}") | ||
|
|
||
| except ValueError as e: | ||
| print(f"Invalid parameter: {e}") | ||
|
|
||
| except RuntimeError as e: | ||
| print(f"Connection / auth error: {e}") | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## Architecture Notes | ||
|
|
||
|
|
||
| ``` | ||
| ┌──────────────────────┐ | ||
| │ cursor.bulkcopy() │ Python layer — validates inputs, acquires AAD tokens | ||
| └──────────┬───────────┘ | ||
| │ connstr_to_pycore_params() | ||
| ▼ | ||
| ┌──────────────────────┐ | ||
| │ mssql_py_core (Rust) │ Opens a SEPARATE connection via TDS Bulk Insert | ||
| │ PyCoreConnection │ Streams rows in batches | ||
| │ PyCoreCursor │ | ||
| └──────────┬───────────┘ | ||
| │ | ||
| ▼ | ||
| ┌──────────────────────┐ | ||
| │ SQL Server / Azure │ | ||
| └──────────────────────┘ | ||
| ``` | ||
|
|
||
| Key design points: | ||
|
|
||
| 1. **Separate connection** — `bulkcopy()` creates its own connection through `mssql_py_core`. This means the bulk operation does not share the calling cursor's transaction or session state. Use fully qualified table names when the `DATABASE` keyword is absent from the connection string. | ||
|
|
||
| 2. **Credential isolation** — Sensitive keys (`password`, `user_name`, `access_token`) are cleared from the parameter dict in a `finally` block after the operation completes, minimising memory exposure. | ||
|
|
||
| 3. **Token refresh** — When Microsoft Entra ID (Azure AD) authentication is in use, a fresh token is acquired immediately before bulk copy to avoid expiration during long-running imports. | ||
|
|
||
| 4. **Keyword translation** — ODBC-style connection-string keywords are mapped to `mssql_py_core`'s snake_case equivalents by `connstr_to_pycore_params()`. Unrecognised keywords are silently dropped. | ||
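The credential-isolation point (item 2 above) follows a common pattern that can be sketched in a few lines. This is not the driver's code: `run_with_scrubbed_params` is a hypothetical helper, the `server` key is an assumed parameter name, and removing the entries with `pop` is one possible reading of "cleared".

```python
SENSITIVE_KEYS = ("password", "user_name", "access_token")


def run_with_scrubbed_params(params: dict, operation):
    """Run operation(params), then remove sensitive entries even if it raises."""
    try:
        return operation(params)
    finally:
        # The finally block runs on both success and failure.
        for key in SENSITIVE_KEYS:
            params.pop(key, None)


params = {"server": "localhost", "user_name": "loader", "password": "example-only"}
run_with_scrubbed_params(params, lambda p: len(p))  # stand-in for the bulk copy call
print(params)
```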
|
|
||
| --- | ||
|
|
||
| ## Performance Tips | ||
|
|
||
| | Tip | Rationale | | ||
| |-----|-----------| | ||
| | Use generators instead of lists for large datasets | Avoids loading the entire dataset into memory. | | ||
| | Set `table_lock=True` for exclusive inserts | Eliminates row-lock overhead. | | ||
| | Leave `batch_size=0` unless tuning | Server-optimal batching is generally the best default. | | ||
| | Leave `check_constraints` and `fire_triggers` at their default `False` | Reduces per-row overhead; validate constraints after the load. | | ||
| | Increase `timeout` proportionally to data volume | Prevents premature timeout for million-row imports. | | ||
| | Use `use_internal_transaction=True` for partial-commit semantics | Each batch commits independently — useful when you want to keep rows already loaded even if a later batch fails. | | ||
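As an illustration of the first tip, a generator can convert CSV records to typed tuples one at a time, so the full file is never held in memory. The column layout (`id`, `name`, `email`) and the helper name `rows_from_csv` are assumptions made for this sketch.

```python
import csv
import io


def rows_from_csv(handle):
    """Yield one typed tuple per CSV record without reading the whole file."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header line
    for user_id, name, email in reader:
        yield (int(user_id), name, email)


# io.StringIO stands in for an open file so the sketch runs on its own.
sample = io.StringIO("id,name,email\n1,Alice,alice@example.com\n2,Bob,bob@example.com\n")
rows = rows_from_csv(sample)
print(next(rows))
```

With a live connection, such a generator could be passed directly as the `data` argument, for example `cursor.bulkcopy("Users", rows_from_csv(f), batch_size=10_000)`.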
|
|
||
| --- | ||
|
|
||
| ## Requirements | ||
|
|
||
|
|
||
| | Requirement | Detail | | ||
| |-------------|--------| | ||
| | Python | 3.10+ | | ||
| | `mssql-python` | 1.3.0+ | | ||
| | `mssql_py_core` | Bundled native extension | | ||
| | SQL Server | 2012 or later, Azure SQL Database, Azure SQL Managed Instance, SQL database in Microsoft Fabric | | ||
| | Permissions | `INSERT` on the target table; `ALTER` on the target table if `keep_identity=True` | | ||
|
|
||
| --- | ||
|
|
||
| ## See Also | ||
|
|
||
| - [Getting Started (README)](README.md) | ||
| - [Roadmap](ROADMAP.md) | ||
| - [Connection Wiki](https://github.com/microsoft/mssql-python/wiki/Connection) | ||
| - [Microsoft Entra ID Support](https://github.com/microsoft/mssql-python/wiki/Microsoft-Entra-ID-support) | ||
| - [SQL Server Bulk Import/Export](https://learn.microsoft.com/sql/relational-databases/import-export/bulk-import-and-export-of-data-sql-server) | ||
| - [DB API 2.0 Specification](https://peps.python.org/pep-0249/) | ||