Skip to content

FIX: Release GIL around blocking SQLSetConnectAttr calls (#565)#568

Open
saurabh500 wants to merge 6 commits into
mainfrom
dev/saurabh/fix-565-gil-release-conn-attrs
Open

FIX: Release GIL around blocking SQLSetConnectAttr calls (#565)#568
saurabh500 wants to merge 6 commits into
mainfrom
dev/saurabh/fix-565-gil-release-conn-attrs

Conversation

@saurabh500
Copy link
Copy Markdown
Contributor

@saurabh500 saurabh500 commented May 11, 2026

Work Item / Issue Reference

GitHub Issue: #565


Summary

Fixes the indefinite hang of mssql_python.connect() when the target server is reached through an in-process SSH tunnel created with paramiko + sshtunnel (issue #565, originally reported as #491).

Root cause

PR #541 released the GIL around SQLDriverConnect, SQLDisconnect, and SQLEndTran, but a handful of SQLSetConnectAttr call sites in mssql_python/pybind/connection/connection.cpp were left holding the GIL:

  • Connection::setAutocommitSQL_ATTR_AUTOCOMMIT. This is the call that hangs in the issue's repro: it runs by default immediately after every connect().
  • Connection::setAttribute (int / wide-string / bytes paths) — user-facing set_attr().
  • Connection::resetSQL_ATTR_RESET_CONNECTION and SQL_ATTR_TXN_ISOLATION, called by the connection pool on acquire.

When the GIL is held during a call that ends up doing blocking work, paramiko's transport thread cannot run and so cannot forward bytes through the SSH channel — the driver waits forever for a response that never comes. pyodbc does not exhibit this because it releases the GIL around its blocking ODBC calls.

Fix

Wrap each of these call sites in py::gil_scoped_release, matching the pattern PR #541 already established for SQLDriverConnect / SQLEndTran. Local-only attribute reads (SQLGetConnectAttr for SQL_ATTR_CONNECTION_DEAD / SQL_ATTR_AUTOCOMMIT, SQLGetInfo, SQLAllocHandle) are intentionally left alone.

Additional fix: Connection pool mutex/GIL deadlock

The initial GIL-release change in Connection::reset() introduced a secondary deadlock in ConnectionPool::acquire(). The pool's acquire() method called conn->isAlive() and conn->reset() while holding the pool _mutex. With reset() now releasing the GIL internally, this created a classic ABBA lock-ordering inversion:

Thread A Thread B
Holds _mutex, calls reset() Holds GIL, waiting on _mutex
reset() releases GIL, ODBC call Gets nothing (blocked on mutex)
ODBC returns, tries to reacquire GIL — blocked Still blocked on _mutex
Deadlock: holds mutex, needs GIL Deadlock: holds GIL, needs mutex

The fix restructures ConnectionPool::acquire() to pop one candidate connection from the pool under the mutex, then validate it (isAlive() + reset()) outside the mutex. This follows the same pattern already established in the pool for connect() (Phase 2.5) and disconnect() (Phase 3/4), where blocking ODBC calls that release the GIL are performed without holding the pool lock.

Why this is safe:

  1. No data races on _pool: The candidate is popped from the deque atomically under the mutex before any validation. No other thread can access the same connection object.
  2. _current_size stays accurate: Stale connections are decremented during pruning (Phase 1). Failed candidates are decremented individually after validation. New connections increment the count under the mutex before connect() is called.
  3. Pool size limit is respected: The "reserve a new slot" check (_current_size < _max_size) still happens under the mutex, after confirming no viable candidates remain.
  4. Exception safety: If isAlive() or reset() throws, the connection is caught and treated as dead — it goes to the disconnect list and _current_size is decremented.
  5. Single-threaded behavior unchanged: With one thread, the loop simply pops, validates, and returns — identical to the previous while-loop but without holding the mutex during ODBC calls.

Verification

  • Reproduced Reopen Issue #491: SSH tunneling fails using paramiko+sshtunnel #565 locally with an in-process Python TCP forwarder (simulating paramiko/sshtunnel). Before the GIL-release fix, mssql_python.connect() hung indefinitely; after the fix it completes in ~0.3s.
  • The pool deadlock was reproduced by returning connections to the pool and having 2+ threads call connect() concurrently — this hung indefinitely before the pool fix and completes instantly after.
  • Full test suite: 1707 passed, 64 skipped, 0 failures (~2 minutes).

Diff: 2 files changed — connection.cpp (+42/-12 GIL release) and connection_pool.cpp (+39/-26 mutex restructure).

PR #541 released the GIL around SQLDriverConnect / SQLDisconnect /
SQLEndTran, but several SQLSetConnectAttr sites in connection.cpp
were left holding the GIL:

  - Connection::setAutocommit (SQL_ATTR_AUTOCOMMIT)
  - Connection::setAttribute (user-facing set_attr)
  - Connection::reset (SQL_ATTR_RESET_CONNECTION,
    SQL_ATTR_TXN_ISOLATION) — called by the pool on acquire.

setAutocommit is the call that hangs in the issue #565 repro: it
runs by default immediately after connect, and when the network
path goes through another Python thread (e.g. an in-process SSH
tunnel via paramiko + sshtunnel), that thread cannot make progress
while the GIL is held — deadlock.

Wrap each of these calls with py::gil_scoped_release, matching the
pattern PR #541 introduced for the other ODBC entry points.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@saurabh500 saurabh500 force-pushed the dev/saurabh/fix-565-gil-release-conn-attrs branch from 74814ca to 204849c Compare May 11, 2026 17:45
@github-actions github-actions Bot added pr-size: small Minimal code update pr-size: medium Moderate update size labels May 11, 2026
Comment thread tests/test_023_ssh_tunnel_gil_release.py Fixed
@saurabh500 saurabh500 force-pushed the dev/saurabh/fix-565-gil-release-conn-attrs branch from 827eea5 to 92a568b Compare May 11, 2026 18:35
@github-actions github-actions Bot removed the pr-size: small Minimal code update label May 11, 2026
Routes mssql_python.connect()+SELECT through a pure-Python TCP
forwarder running in a subprocess, with a hard external watchdog
(Popen.communicate(timeout=...)).

The forwarder's _pipe loop calls dst.sendall(...) on every chunk;
that bound-method dispatch needs the GIL, which is exactly the
deadlock condition from issue #565: with the GIL held across
SQLSetConnectAttr(SQL_ATTR_AUTOCOMMIT), the forwarder thread
cannot run, the driver waits forever for a reply, and the
subprocess hangs.

Verified that the test fails (in 30s, with the issue-#565 diagnostic
message) against the unfixed binary on main, and passes (~0.2s)
once the GIL release on setAutocommit/setAttribute/reset is in place.

Linux-only and runs in the default functional suite (not stress).
A subprocess + external watchdog is required because once the
worker thread deadlocks while holding the GIL, the entire Python
interpreter is starved — an in-process watchdog thread cannot
make progress either.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@saurabh500 saurabh500 force-pushed the dev/saurabh/fix-565-gil-release-conn-attrs branch from 92a568b to 0f31307 Compare May 11, 2026 18:42
Comment thread tests/test_023_ssh_tunnel_gil_release.py Dismissed
saurabh500 and others added 2 commits May 12, 2026 13:48
The PR's GIL-release fix in Connection::reset() creates a mutex/GIL
lock-ordering deadlock when multiple threads acquire pooled connections
concurrently:
- Thread A holds pool mutex, releases GIL inside reset()
- Thread B gets GIL, blocks on pool mutex
- Thread A's ODBC call returns, blocks reacquiring GIL → deadlock

Fix: Pop one candidate at a time from the pool under the mutex, then
validate (isAlive + reset) outside the mutex. This follows the same
pattern already used for connect() and disconnect() in the pool.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The CI environment does not have mssql_python installed system-wide;
pytest's path manipulation makes it importable in the test process but
the spawned subprocess does not inherit that. Pass sys.path via
PYTHONPATH so the subprocess can locate mssql_python.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 12, 2026

📊 Code Coverage Report

🔥 Diff Coverage

81%


🎯 Overall Coverage

79%


📈 Total Lines Covered: 6885 out of 8666
📁 Project: mssql-python


Diff Coverage

Diff: main...HEAD, staged and unstaged changes

  • mssql_python/pybind/connection/connection.cpp (85.3%): Missing lines 381-385
  • mssql_python/pybind/connection/connection_pool.cpp (76.9%): Missing lines 70,85,90-93

Summary

  • Total: 60 lines
  • Missing: 11 lines
  • Coverage: 81%

mssql_python/pybind/connection/connection.cpp

Lines 377-389

  377             this->strBytesBuffer = std::move(binary_data);
  378             SQLPOINTER ptr = const_cast<char*>(this->strBytesBuffer.c_str());
  379             SQLINTEGER length = static_cast<SQLINTEGER>(this->strBytesBuffer.size());
  380 
! 381             SQLRETURN ret;
! 382             {
! 383                 py::gil_scoped_release release;
! 384                 ret = SQLSetConnectAttr_ptr(_dbcHandle->get(), attribute, ptr, length);
! 385             }
  386             if (!SQL_SUCCEEDED(ret)) {
  387                 LOG("Failed to set binary attribute=%d, ret=%d", attribute, ret);
  388             } else {
  389                 LOG("Set binary attribute=%d successfully (length=%d)", attribute, length);

mssql_python/pybind/connection/connection_pool.cpp

Lines 66-74

  66                     // slot will open up — but we can't wait for it here without
  67                     // adding a condition-variable retry loop.  This is an
  68                     // acceptable trade-off: transient "pool full" errors under
  69                     // heavy contention are rare and callers can retry.
! 70                     throw std::runtime_error("ConnectionPool::acquire: pool size limit reached");
  71                 }
  72                 break;
  73             }
  74             candidate = _pool.front();

Lines 81-97

  81                 valid_conn = candidate;
  82                 break;
  83             }
  84         } catch (const std::exception& ex) {
! 85             LOG("Candidate connection validation failed: %s", ex.what());
  86         }
  87 
  88         // Candidate is dead or reset failed — mark for disconnect and
  89         // decrement the pool size.
! 90         to_disconnect.push_back(candidate);
! 91         {
! 92             std::lock_guard<std::mutex> lock(_mutex);
! 93             if (_current_size > 0) --_current_size;
  94         }
  95     }
  96 
  97     // Phase 3: Connect the new connection outside the mutex.


📋 Files Needing Attention

📉 Files with overall lowest coverage (click to expand)
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.h: 67.9%
mssql_python.row.py: 70.5%
mssql_python.pybind.logger_bridge.hpp: 70.8%
mssql_python.pybind.ddbc_bindings.cpp: 74.6%
mssql_python.pybind.connection.connection.cpp: 76.5%
mssql_python.__init__.py: 77.3%
mssql_python.ddbc_bindings.py: 79.6%
mssql_python.connection.py: 85.3%
mssql_python.logging.py: 85.5%

🔗 Quick Links

⚙️ Build Summary 📋 Coverage Details

View Azure DevOps Build

Browse Full Coverage Report

@saurabh500 saurabh500 marked this pull request as ready for review May 12, 2026 23:49
Copilot AI review requested due to automatic review settings May 12, 2026 23:49
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a GIL-related deadlock where mssql_python.connect() could hang indefinitely when the connection’s network path is mediated by another in-process Python thread (e.g., paramiko/sshtunnel). It does so by releasing the GIL around blocking SQLSetConnectAttr calls, and adjusts connection-pool acquisition to avoid a mutex/GIL lock-order inversion when reset() now releases the GIL.

Changes:

  • Release the Python GIL around potentially blocking SQLSetConnectAttr call sites (autocommit, user set_attr() paths, and pooled reset paths).
  • Restructure ConnectionPool::acquire() to validate candidate connections outside the pool mutex to prevent mutex/GIL deadlocks.
  • Add a Linux-only functional regression test that reproduces the deadlock via a pure-Python in-process TCP forwarder executed in a subprocess with a hard watchdog.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
mssql_python/pybind/connection/connection.cpp Releases GIL around blocking SQLSetConnectAttr calls to prevent tunnel-thread starvation deadlocks.
mssql_python/pybind/connection/connection_pool.cpp Refactors acquire path to avoid holding the pool mutex across ODBC calls that may release/reacquire the GIL.
tests/test_023_ssh_tunnel_gil_release.py Adds a subprocess-based watchdog regression test reproducing the hang scenario.
Comments suppressed due to low confidence (1)

mssql_python/pybind/connection/connection_pool.cpp:62

  • In acquire(), a thread can pop a candidate for validation (outside the mutex) while another thread sees the pool empty with _current_size == _max_size and immediately throws pool size limit reached. If the in-flight candidate later fails validation and decrements _current_size, the second thread has already failed even though capacity would have become available. Consider tracking “in validation” candidates (or temporarily reserving/releasing capacity) so that exhaustion is decided after all popped candidates are conclusively validated, not while a candidate is being checked outside the lock.
    while (true) {
        std::shared_ptr<Connection> candidate;
        {
            std::lock_guard<std::mutex> lock(_mutex);
            if (_pool.empty()) {
                // No more candidates — try to reserve a slot for a new connection.
                if (_current_size < _max_size) {
                    valid_conn = std::make_shared<Connection>(connStr, true);
                    ++_current_size;
                    needs_connect = true;
                } else {
                    throw std::runtime_error("ConnectionPool::acquire: pool size limit reached");
                }
                break;

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 21 to +43
@@ -39,39 +40,51 @@ std::shared_ptr<Connection> ConnectionPool::acquire(const std::wstring& connStr,

size_t pruned = before - _pool.size();
_current_size = (_current_size >= pruned) ? (_current_size - pruned) : 0;
}
env["PYTHONPATH"] = os.pathsep.join(sys.path)

proc = subprocess.Popen(
[sys.executable, __file__],
- Use os.path.abspath(__file__) for subprocess re-exec robustness
- Add comments clarifying _current_size tracks reserved capacity
- Document the pool-full race window as an acceptable trade-off

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-size: medium Moderate update size

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants