Skip to content

FIX: Fetch VARCHAR UTF-8 collation columns as SQL_C_WCHAR on Windows to prevent lossy ACP conversion#543

Open
jahnvi480 wants to merge 7 commits intomainfrom
jahnvi/utf8encoding
Open

FIX: Fetch VARCHAR UTF-8 collation columns as SQL_C_WCHAR on Windows to prevent lossy ACP conversion#543
jahnvi480 wants to merge 7 commits intomainfrom
jahnvi/utf8encoding

Conversation

@jahnvi480
Copy link
Copy Markdown
Contributor

@jahnvi480 jahnvi480 commented Apr 27, 2026

Work Item / Issue Reference

AB#44618

GitHub Issue: #531


Summary

This pull request introduces a major improvement to Unicode support for VARCHAR columns with UTF-8 collation on Windows, ensuring lossless round-trip of Unicode data. The main change is a new logic path that detects when UTF-8 decoding is requested and, on Windows, fetches VARCHAR data as SQL_C_WCHAR (UTF-16LE) instead of the default lossy ANSI code page (ACP) conversion. This affects all fetch paths (fetchone, fetchall, fetchmany) and is thoroughly tested with a new test case.

Problem

Since SQL Server 2019, VARCHAR columns can store full Unicode via UTF-8 collations (e.g., Latin1_General_100_CI_AS_SC_UTF8). However, mssql-python on Windows could not correctly read this data back.

When fetching VARCHAR data as SQL_C_CHAR, the Windows ODBC Driver Manager converts the bytes to the system's ANSI code page (typically CP1252) before delivering them to the application. This conversion is lossy:

  • Characters outside CP1252 (Chinese, Japanese, Russian, Emoji, etc.) are irreversibly replaced with ?
  • Characters within CP1252 but outside ASCII (German ü/ß, French é) are returned as CP1252 bytes which then fail UTF-8 decoding, resulting in raw bytes instead of str

The existing setdecoding(SQL_CHAR, encoding="utf-8") API only controls decoding after the ODBC driver has already performed this lossy conversion — the data is already damaged before our code sees it.

This issue does not affect Linux/macOS because the unixODBC Driver Manager converts to the system locale (typically UTF-8), which is lossless for all Unicode characters.

Before the fix (Windows)

Test Case Original VARCHAR Result
ASCII 'Hello World' 'Hello World'
German 'Grüße' b'Gr\xfc\xdfe' ⚠️ raw bytes
Chinese '你好世界' '????' ❌ data loss
Japanese 'こんにちは' '?????' ❌ data loss
Russian 'Привет' '??????' ❌ data loss
Mixed 'Hello 世界' 'Hello ??' ❌ data loss
Emoji '😀😃😄😁' '????????' ❌ data loss
French 'café résumé' b'caf\xe9 r\xe9sum\xe9' ⚠️ raw bytes

After the fix (Windows)

Test Case Original VARCHAR Result
ASCII 'Hello World' 'Hello World'
German 'Grüße' 'Grüße'
Chinese '你好世界' '你好世界'
Japanese 'こんにちは' 'こんにちは'
Russian 'Привет' 'Привет'
Mixed 'Hello 世界' 'Hello 世界'
Emoji '😀😃😄😁' '😀😃😄😁'
French 'café résumé' 'café résumé'

Fix

On Windows, when UTF-8 decoding is configured for SQL_CHAR (the default setting), we now fetch VARCHAR columns as SQL_C_WCHAR (UTF-16LE) instead of SQL_C_CHAR. The ODBC driver converts losslessly to UTF-16LE since UTF-16 can represent every Unicode character, completely bypassing the lossy ACP conversion.

Unicode/encoding improvements:

  • Added ShouldFetchCharAsWChar helper function to detect when VARCHAR columns should be fetched as SQL_C_WCHAR (UTF-16LE) on Windows if UTF-8 decoding is requested, avoiding lossy ACP conversion. (mssql_python/pybind/ddbc_bindings.cpp)
  • Updated all fetch paths (SQLGetData_wrap, SQLBindColums, FetchBatchData, FetchMany_wrap, FetchAll_wrap) to use this logic, binding and decoding VARCHAR as WCHAR where appropriate. (mssql_python/pybind/ddbc_bindings.cpp)

Testing:

  • Added a comprehensive test, test_varchar_utf8_collation_unicode_roundtrip, to verify that VARCHAR columns with UTF-8 collation round-trip full Unicode data (including CJK, Cyrillic, Arabic, Greek, Emoji, and mixed strings) correctly across all three fetch paths (fetchone, fetchall, fetchmany). (tests/test_013_encoding_decoding.py)

How it works

The fix introduces a gate function ShouldFetchCharAsWChar(charEncoding):

  • On Windows: returns true when charEncoding is "utf-8" (the default)
  • On Linux/macOS: always returns false (no change needed)

When active, the following changes apply to each fetch path:

Fetch path Change
SQLGetData_wrap (fetchone) Requests SQL_C_WCHAR from SQLGetData, decodes via PyUnicode_FromWideChar
SQLBindColums (batch bind) Binds VARCHAR columns into wcharBuffers as SQL_C_WCHAR
FetchBatchData (batch process) Routes VARCHAR columns to ProcessWChar instead of ProcessChar
LOB fallback Streams LOB VARCHAR data as SQL_C_WCHAR via FetchLobColumnData

Interaction with setencoding / setdecoding APIs

API call Effect on fix
setdecoding(SQL_CHAR, encoding="utf-8") (default) Activates WCHAR fetch on Windows
setdecoding(SQL_CHAR, encoding="cp1252") Old code path, no change
setdecoding(SQL_CHAR, encoding="latin-1") Old code path, no change
setdecoding(SQL_WCHAR, ...) Unaffected — NVARCHAR was always correct
setencoding(...) Unaffected — controls write path only

Impact on existing users

Scenario Behavior change?
Windows + charEncoding="utf-8" (default) Yes — VARCHAR now returns correct str instead of bytes or ?
Windows + charEncoding="cp1252", "latin-1", etc. No — old code path, unchanged
Linux/macOS (any encoding) No — ShouldFetchCharAsWChar always returns false
NVARCHAR columns (any platform) No — already used SQL_C_WCHAR
Arrow fetch path No — does not pass charEncoding, uses old path
setencoding (write/send path) No — completely separate code path

The only behavioral change is for Windows users with the default UTF-8 decoding: extended Latin characters in VARCHAR columns that previously returned as raw bytes (e.g., b'Gr\xfc\xdfe') now correctly return as str (e.g., 'Grüße'). This is a correctness improvement — the API contract always specified returning str for character columns.

Why Windows-only?

The ODBC driver manager behaves differently per platform:

Linux/macOS Windows
ODBC layer unixODBC Driver Manager Windows ODBC Driver Manager
SQL_C_CHAR output System locale (typically UTF-8) System ANSI code page (typically CP1252)
UTF-8 collation VARCHAR UTF-8 → UTF-8 (lossless) ✅ UTF-8 → CP1252 (lossy) ❌
Fix needed? No Yes — fetch as SQL_C_WCHAR

…to prevent lossy ACP conversion

On Windows, the ODBC Driver Manager converts SQL_C_CHAR data to the system's
ANSI code page (typically CP1252) before delivering it to the application.
This is lossy for characters outside CP1252: CJK/Emoji get replaced with '?'
(irreversible data loss) and extended Latin characters arrive as CP1252 bytes
that fail UTF-8 decoding (returned as raw bytes instead of str).

Fix: When charEncoding is 'utf-8' on Windows, fetch VARCHAR/CHAR/LONGVARCHAR
columns as SQL_C_WCHAR (UTF-16LE) instead of SQL_C_CHAR. The ODBC driver
converts losslessly to UTF-16LE, bypassing the lossy ACP conversion entirely.

Changes in ddbc_bindings.cpp:
- Add ShouldFetchCharAsWChar() helper (Windows-only, UTF-8 only)
- SQLGetData_wrap (fetchone path): fetch VARCHAR as SQL_C_WCHAR when active,
  decode via PyUnicode_FromWideChar, with LOB streaming fallback
- SQLBindColums (batch path): accept charEncoding param, bind VARCHAR into
  wcharBuffers as SQL_C_WCHAR when active
- FetchBatchData: route VARCHAR columns to ProcessWChar dispatcher when active,
  compute correct fetchBufferSize for WCHAR buffers
- FetchMany_wrap/FetchAll_wrap: pass charEncoding to SQLBindColums
- Arrow path: unchanged (uses default empty charEncoding, no WCHAR workaround)

Not affected:
- Linux/macOS (ShouldFetchCharAsWChar always returns false)
- Non-UTF-8 encodings (cp1252, latin-1, gbk, etc. use old SQL_C_CHAR path)
- NVARCHAR columns (already use SQL_C_WCHAR)
- setencoding API (write path, unrelated to fetch)

Test: add test_varchar_utf8_collation_unicode_roundtrip covering ASCII, German,
Chinese, Japanese, Russian, Greek, Arabic, Emoji, French through fetchone,
fetchall, and fetchmany paths.
Copilot AI review requested due to automatic review settings April 27, 2026 10:46
@jahnvi480 jahnvi480 marked this pull request as draft April 27, 2026 10:46
@github-actions github-actions Bot added the pr-size: medium Moderate update size label Apr 27, 2026
Comment thread mssql_python/pybind/ddbc_bindings.cpp Dismissed
@jahnvi480 jahnvi480 linked an issue Apr 27, 2026 that may be closed by this pull request
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 27, 2026

📊 Code Coverage Report

🔥 Diff Coverage

70%


🎯 Overall Coverage

78%


📈 Total Lines Covered: 6849 out of 8698
📁 Project: mssql-python


Diff Coverage

Diff: main...HEAD, staged and unstaged changes

  • mssql_python/pybind/ddbc_bindings.cpp (70.8%): Missing lines 2068,3355-3359,3373-3381,3386-3402,3404-3433,3979-3983,4256,4518,4983-4985,5245-5246,5296-5298,5599-5600

Summary

  • Total: 271 lines
  • Missing: 79 lines
  • Coverage: 70%

mssql_python/pybind/ddbc_bindings.cpp

Lines 2064-2072

  2064                         size_t chunkBytes = DAE_CHUNK_SIZE;
  2065                         while (offset < totalBytes) {
  2066                             size_t len = std::min(chunkBytes, totalBytes - offset);
  2067 
! 2068                             rc = putData((SQLPOINTER)(dataPtr + offset), static_cast<SQLLEN>(len));
  2069                             if (!SQL_SUCCEEDED(rc)) {
  2070                                 LOG("SQLExecute: SQLPutData failed for "
  2071                                     "SQL_C_CHAR chunk - offset=%zu",
  2072                                     offset, totalBytes, len, rc);

Lines 3351-3363

  3351                     if (ShouldFetchCharAsWChar(charEncoding)) {
  3352                         // LCOV_EXCL_START - Windows-only: ShouldFetchCharAsWChar always false on Linux
  3353                         // On Windows with UTF-8, fetch LOB VARCHAR as WCHAR to avoid
  3354                         // lossy ACP conversion
! 3355                         LOG("SQLGetData: Streaming LOB for column %d (SQL_C_WCHAR via "
! 3356                             "UTF-8 workaround) - columnSize=%lu",
! 3357                             i, (unsigned long)columnSize);
! 3358                         row.append(
! 3359                             FetchLobColumnData(hStmt, i, SQL_C_WCHAR, true, false, charEncoding));
  3360                         // LCOV_EXCL_STOP
  3361                     } else {
  3362                         LOG("SQLGetData: Streaming LOB for column %d (SQL_C_CHAR) "
  3363                             "- columnSize=%lu",

Lines 3369-3437

  3369                     // LCOV_EXCL_START - Windows-only: ShouldFetchCharAsWChar always false on Linux
  3370                     // On Windows with UTF-8 decoding: fetch VARCHAR as SQL_C_WCHAR
  3371                     // to bypass the ODBC driver's lossy ACP (e.g. CP1252) conversion.
  3372                     // The ODBC driver converts losslessly to UTF-16LE for SQL_C_WCHAR.
! 3373                     uint64_t wcharBufSize = (columnSize + 1);  // in SQLWCHAR units
! 3374                     std::vector<SQLWCHAR> wdataBuffer(wcharBufSize);
! 3375                     SQLLEN dataLen;
! 3376                     ret = SQLGetData_ptr(hStmt, i, SQL_C_WCHAR, wdataBuffer.data(),
! 3377                                          wcharBufSize * sizeof(SQLWCHAR), &dataLen);
! 3378                     if (SQL_SUCCEEDED(ret)) {
! 3379                         if (dataLen > 0) {
! 3380                             uint64_t numCharsInData = dataLen / sizeof(SQLWCHAR);
! 3381                             if (numCharsInData <= columnSize) {
  3382 #if defined(_WIN32)
  3383                                 PyObject* pyStr = PyUnicode_FromWideChar(
  3384                                     reinterpret_cast<wchar_t*>(wdataBuffer.data()), numCharsInData);
  3385 #else
! 3386                                 PyObject* pyStr = PyUnicode_DecodeUTF16(
! 3387                                     reinterpret_cast<const char*>(wdataBuffer.data()),
! 3388                                     numCharsInData * sizeof(SQLWCHAR), NULL, NULL);
! 3389 #endif
! 3390                                 if (pyStr) {
! 3391                                     row.append(py::reinterpret_steal<py::object>(pyStr));
! 3392                                     LOG("SQLGetData: CHAR column %d fetched as WCHAR (UTF-8 "
! 3393                                         "workaround), %zu bytes -> decoded",
! 3394                                         i, (size_t)dataLen);
! 3395                                 } else {
! 3396                                     PyErr_Clear();
! 3397                                     LOG_ERROR("SQLGetData: Failed to decode WCHAR data for "
! 3398                                               "CHAR column %d",
! 3399                                               i);
! 3400                                     row.append(py::none());
! 3401                                 }
! 3402                             } else {
  3403                                 // Buffer too small, fallback to LOB streaming
! 3404                                 LOG("SQLGetData: CHAR column %d WCHAR data truncated, "
! 3405                                     "using streaming LOB",
! 3406                                     i);
! 3407                                 row.append(FetchLobColumnData(hStmt, i, SQL_C_WCHAR, true, false,
! 3408                                                               charEncoding));
! 3409                             }
! 3410                         } else if (dataLen == SQL_NULL_DATA) {
! 3411                             LOG("SQLGetData: Column %d is NULL (CHAR via WCHAR)", i);
! 3412                             row.append(py::none());
! 3413                         } else if (dataLen == 0) {
! 3414                             row.append(py::str(""));
! 3415                         } else if (dataLen == SQL_NO_TOTAL) {
! 3416                             LOG("SQLGetData: SQL_NO_TOTAL for column %d (CHAR via WCHAR), "
! 3417                                 "falling back to LOB",
! 3418                                 i);
! 3419                             row.append(FetchLobColumnData(hStmt, i, SQL_C_WCHAR, true, false,
! 3420                                                           charEncoding));
! 3421                         } else if (dataLen < 0) {
! 3422                             LOG("SQLGetData: Unexpected negative data length "
! 3423                                 "for column %d (CHAR via WCHAR) - dataLen=%ld",
! 3424                                 i, (long)dataLen);
! 3425                             ThrowStdException("SQLGetData returned an unexpected negative "
! 3426                                               "data length");
! 3427                         }
! 3428                     } else {
! 3429                         LOG("SQLGetData: Error retrieving WCHAR data for CHAR column %d "
! 3430                             "- SQLRETURN=%d, returning NULL",
! 3431                             i, ret);
! 3432                         row.append(py::none());
! 3433                     }
  3434                     // LCOV_EXCL_STOP
  3435                 } else {
  3436                     // Allocate columnSize * 4 + 1 on ALL platforms (no #if guard).
  3437                     //

Lines 3975-3987

  3975                 if (fetchCharAsWChar) {
  3976                     // LCOV_EXCL_START - Windows-only: fetchCharAsWChar always false on Linux
  3977                     // On Windows with UTF-8: bind VARCHAR as SQL_C_WCHAR to
  3978                     // bypass the ODBC driver's lossy ACP conversion.
! 3979                     uint64_t fetchBufferSize = columnSize + 1 /*null-terminator*/;
! 3980                     buffers.wcharBuffers[col - 1].resize(fetchSize * fetchBufferSize);
! 3981                     ret = SQLBindCol_ptr(
! 3982                         hStmt, col, SQL_C_WCHAR, buffers.wcharBuffers[col - 1].data(),
! 3983                         fetchBufferSize * sizeof(SQLWCHAR), buffers.indicators[col - 1].data());
  3984                     // LCOV_EXCL_STOP
  3985                 } else {
  3986                     // Use columnSize * 4 + 1 on Linux/macOS to accommodate UTF-8
  3987                     // expansion. The ODBC driver returns UTF-8 for SQL_C_CHAR where

Lines 4252-4260

  4252             case SQL_LONGVARCHAR:
  4253                 // When fetchCharAsWChar is active, VARCHAR data is in wcharBuffers
  4254                 // (bound as SQL_C_WCHAR) so use the WCHAR processor for decoding.
  4255                 if (fetchCharAsWChar) {
! 4256                     columnProcessors[col] = ColumnProcessors::ProcessWChar;  // LCOV_EXCL_LINE - Windows-only
  4257                 } else {
  4258                     columnProcessors[col] = ColumnProcessors::ProcessChar;
  4259                 }
  4260                 break;

Lines 4514-4522

  4514                 // When UTF-8 WCHAR workaround is active on Windows,
  4515                 // VARCHAR is bound as SQL_C_WCHAR (2 bytes per char).
  4516                 // Account for this in memory estimation.
  4517                 if (ShouldFetchCharAsWChar(charEncoding)) {
! 4518                     rowSize += columnSize * sizeof(SQLWCHAR);  // LCOV_EXCL_LINE - Windows-only
  4519                 } else {
  4520                     rowSize += columnSize;
  4521                 }
  4522                 break;

Lines 4979-4989

  4979                 arrowColumnProducer->ptrValueBuffer = arrowColumnProducer->bitVal.get();
  4980                 break;
  4981             default:
  4982                 std::ostringstream errorString;
! 4983                 errorString << "Unsupported data type for Arrow batch fetch for column - "
! 4984                             << columnName.c_str() << ", Type - " << dataType << ", column ID - "
! 4985                             << (i + 1);
  4986                 LOG(errorString.str().c_str());
  4987                 ThrowStdException(errorString.str());
  4988                 break;
  4989         }

Lines 5241-5250

  5241                                                  buffers.datetimeoffsetBuffers[idxCol].data(),
  5242                                                  sizeof(DateTimeOffset),
  5243                                                  buffers.indicators[idxCol].data());
  5244                             if (!SQL_SUCCEEDED(ret)) {
! 5245                                 LOG("Error fetching SS_TIMESTAMPOFFSET data for column %d",
! 5246                                     idxCol + 1);
  5247                                 return ret;
  5248                             }
  5249                             break;
  5250                         }

Lines 5292-5302

  5292                     nullCounts[idxCol] += 1;
  5293                     continue;
  5294                 } else if (indicator < 0) {
  5295                     // Negative value is unexpected, log column index, SQL type & raise exception
! 5296                     LOG("Unexpected negative data length. Column ID - %d, SQL Type - %d, Data "
! 5297                         "Length - %lld",
! 5298                         idxCol + 1, dataType, (long long)indicator);
  5299                     ThrowStdException("Unexpected negative data length.");
  5300                 }
  5301                 auto dataLen = static_cast<uint64_t>(indicator);

Lines 5595-5604

  5595         arrowSchemaBatchCapsule =
  5596             py::capsule(arrowSchemaBatch.get(), "arrow_schema", [](void* ptr) {
  5597                 auto arrowSchema = static_cast<ArrowSchema*>(ptr);
  5598                 if (arrowSchema->release) {
! 5599                     arrowSchema->release(arrowSchema);
! 5600                 }
  5601                 delete arrowSchema;
  5602             });
  5603     } catch (...) {
  5604         arrowSchemaBatch->release(arrowSchemaBatch.get());


📋 Files Needing Attention

📉 Files with overall lowest coverage (click to expand)
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.h: 67.9%
mssql_python.row.py: 70.5%
mssql_python.pybind.logger_bridge.hpp: 70.8%
mssql_python.pybind.ddbc_bindings.cpp: 73.4%
mssql_python.pybind.connection.connection.cpp: 76.2%
mssql_python.__init__.py: 77.3%
mssql_python.ddbc_bindings.py: 79.6%
mssql_python.pybind.connection.connection_pool.cpp: 79.6%
mssql_python.connection.py: 85.3%

🔗 Quick Links

⚙️ Build Summary 📋 Coverage Details

View Azure DevOps Build

Browse Full Coverage Report

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Improves Unicode correctness on Windows when fetching VARCHAR columns using SQL Server UTF-8 collations by routing UTF-8-decoded SQL_CHAR fetches through SQL_C_WCHAR (UTF-16) to avoid lossy ANSI code page (ACP) conversion in the Windows ODBC Driver Manager.

Changes:

  • Added a Windows-only decision helper (ShouldFetchCharAsWChar) to fetch VARCHAR/CHAR as SQL_C_WCHAR when UTF-8 decoding is requested.
  • Updated row-by-row (SQLGetData_wrap) and bound/batch fetch paths (SQLBindColums/FetchBatchData via fetchmany/fetchall) to apply the WCHAR workaround consistently.
  • Added a regression test covering Unicode round-trip for UTF-8-collated VARCHAR across fetchone, fetchall, and fetchmany.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
mssql_python/pybind/ddbc_bindings.cpp Implements the Windows UTF-8→WCHAR fetch workaround across all main fetch paths.
tests/test_013_encoding_decoding.py Adds a test validating lossless Unicode round-trip for UTF-8-collated VARCHAR.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mssql_python/pybind/ddbc_bindings.cpp
Comment thread mssql_python/pybind/ddbc_bindings.cpp Outdated
Comment thread mssql_python/pybind/ddbc_bindings.cpp
@jahnvi480 jahnvi480 marked this pull request as ready for review May 4, 2026 09:07
@github-actions github-actions Bot added pr-size: large Substantial code update and removed pr-size: medium Moderate update size labels May 4, 2026
Comment thread mssql_python/pybind/ddbc_bindings.cpp Dismissed
Comment thread mssql_python/pybind/ddbc_bindings.cpp Dismissed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-size: large Substantial code update

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Decoding VARCHAR with UTF-8 collations fails

3 participants