Describe the bug
Summary
When plugins=iam is combined with SqlAlchemyPooledConnectionProvider, the SQLAlchemy pool's creator callable is built once, at pool-creation time, and closes over a frozen copy of the connection properties — including the IAM-generated password. The pool itself is long-lived and cached on the provider, but the creator is never regenerated. Any new physical connection the pool opens later (initial fill beyond the first connection, overflow growth, pool recycle, or connection invalidation) is attempted with the original token and fails once that token is older than 15 minutes.
The IamAuthPlugin._token_cache is refreshed correctly by subsequent AwsWrapperConnection.connect() calls, but the cache has no effect on the pool's captured creator, which bypasses the plugin chain entirely.
Root cause
In aws_advanced_python_wrapper/sql_alchemy_connection_provider.py:
    def _create_pool(self, target_func, driver_dialect, database_dialect, host_info, props):
        kwargs = dict() if self._pool_configurator is None else self._pool_configurator(host_info, props)
        prepared_properties = driver_dialect.prepare_connect_info(host_info, props)  # copy of props
        database_dialect.prepare_conn_props(prepared_properties)
        kwargs["creator"] = self._get_connection_func(target_func, prepared_properties)  # <-- one-shot
        return self._create_sql_alchemy_pool(**kwargs)

    def _get_connection_func(self, target_connect_func, props):
        return lambda: target_connect_func(**props)  # closes over frozen props
prepare_connect_info explicitly returns a copy (Properties(original_props.copy()) in pg_driver_dialect.py). The resulting dict, which contains password=<IAM token T1> produced by IamAuthPlugin on the first connect, is captured by the lambda and handed directly to the raw driver (psycopg.connect) every time the pool opens a new physical connection. No plugin in the chain runs, so no token refresh happens on the creator path.
The provider's pool cache (_database_pools, a SlidingExpirationCache) keeps the pool alive across all connect() calls with the same (host_url, user) key, so the stale-creator pool persists for the lifetime of the process in practice.
Why existing mitigations don't help
- iam_expiration: correctly drives the IamAuthPlugin._token_cache TTL, but that cache is only consulted on calls routed through the plugin chain (i.e. AwsWrapperConnection.connect). The pool's creator bypasses it.
- Retry-on-login-error in IamAuthPlugin._connect: does not fire here. psycopg raises a PAM failure whose message begins with connection failed:, which matches an entry in pg_exception_handler._NETWORK_ERROR_MESSAGES; _is_network_error wins before the login-exception branch, so the plugin re-raises as AwsConnectError without regenerating the token. Even if the retry did fire, it would run inside the plugin chain, which the pool's creator doesn't invoke.
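The ordering problem in the second bullet can be modelled in a few lines. This is only a sketch of the check order described above, not the wrapper's code: the two lists and both functions are hypothetical stand-ins, and the prefix match is assumed from the behaviour reported in this issue.

```python
# Hypothetical stand-ins for the wrapper's message lists (not the real contents).
NETWORK_ERROR_PREFIXES = ["connection failed"]
LOGIN_ERROR_FRAGMENTS = ["PAM authentication failed"]


def classify_network_first(message: str) -> str:
    """Model of the reported order: the network check runs before the login check."""
    if any(message.startswith(p) for p in NETWORK_ERROR_PREFIXES):
        return "network"
    if any(f in message for f in LOGIN_ERROR_FRAGMENTS):
        return "login"
    return "other"


def classify_login_first(message: str) -> str:
    """Same checks with the login check first, for comparison."""
    if any(f in message for f in LOGIN_ERROR_FRAGMENTS):
        return "login"
    if any(message.startswith(p) for p in NETWORK_ERROR_PREFIXES):
        return "network"
    return "other"


# Message shape taken from the error reported under "Current Behavior".
msg = (
    'connection failed: connection to server at "<rds-ip>", port 5432 failed: '
    'FATAL: PAM authentication failed for user "<db_user>"'
)
network_first = classify_network_first(msg)
login_first = classify_login_first(msg)
print(network_first, login_first)
```

With the network check first, the PAM failure is never seen as a login error, so a retry keyed on login errors cannot trigger.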
Impact
In any long-lived process using plugins=iam + SqlAlchemyPooledConnectionProvider, authentication failures are inevitable once the pool needs to grow or replace a physical connection more than 15 minutes after the pool was first created. The failure rate is proportional to (new-physical-connection events) / (total connect-requests), which in practice is a small but non-zero percentage and scales with traffic burstiness. It is not addressable by tuning iam_expiration.
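To make the timing concrete, here is a minimal sketch that assumes the 15-minute token age limit quoted in the RDS log and, for simplicity, treats the token as minted at pool creation (a token served from the plugin cache could be older). The function name and the event times are hypothetical.

```python
TOKEN_LIFETIME_S = 15 * 60  # age limit quoted in the RDS iam-db-auth-error log


def frozen_token_still_valid(pool_created_at: float, opened_at: float) -> bool:
    """True while a creator replaying the pool-creation token can still authenticate."""
    return (opened_at - pool_created_at) <= TOKEN_LIFETIME_S


# Hypothetical new-physical-connection events, in seconds after pool creation.
events = [30, 600, 899, 901, 3600]
results = {t: frozen_token_still_valid(0, t) for t in events}
print(results)
```

Every new physical connection after the 900-second mark fails, regardless of how often the plugin's own cache has been refreshed in the meantime.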
Expected Behavior
New physical connections opened by a pool created via SqlAlchemyPooledConnectionProvider with plugins=iam should use a current IAM token rather than a snapshot taken at pool-creation time.
What plugins are used? What other connection properties were set?
iam
Current Behavior
Intermittent failures on database queries:
aws_advanced_python_wrapper.AwsConnectError: [IamAuthPlugin] Error occurred while
opening a connection: connection failed: connection to server at "<rds-ip>",
port 5432 failed: FATAL: PAM authentication failed for user "<db_user>"
These failures coincide with spikes in the IamDbAuthConnectionFailureInvalidToken RDS CloudWatch metric and with RDS iam-db-auth-error log entries reporting:
Failed to authenticate the connection request for user '<db_user>' because the token age is longer than 15 minutes
Errors cluster with traffic bursts (because bursts are what trigger pool overflow → new physical connection → frozen token).
Reproduction Steps
- Unit-level: constructs the provider's creator lambda directly and shows that mutating the originating Properties has no effect on what the creator passes to the driver.
- End-to-end: stubs out boto3 IAM token generation to return a new token each call, stubs out psycopg.connect to capture the password it receives, drives a full AwsWrapperConnection.connect() with plugins="iam" and SqlAlchemyPooledConnectionProvider installed, then forces the pool to open a new physical connection and observes that the captured password is still the original token even though the plugin's token cache has since been refreshed.
"""
Minimal reproducer for the frozen-IAM-token-in-pool bug in
aws-advanced-python-wrapper `SqlAlchemyPooledConnectionProvider`.

Run:
    pip install "aws-advanced-python-wrapper==2.1.0" "SQLAlchemy>=2" "psycopg[binary]>=3"
    python reproducer.py

Expected output (relevant lines):
    -- unit level --
    passwords seen by pool creator: ['TOKEN_1']
    current token in the (refreshed) props: TOKEN_2
    BUG CONFIRMED at unit level: pool creator still using TOKEN_1
    -- end-to-end --
    passwords handed to psycopg.connect: ['TOKEN_1', 'TOKEN_1']
    IAM plugin token cache now holds: TOKEN_2
    BUG CONFIRMED end-to-end: the pool's second physical connection
    was opened with the stale TOKEN_1 even though the plugin cache
    had already been refreshed to TOKEN_2
"""
from __future__ import annotations

import itertools
from unittest.mock import MagicMock, patch


# --------------------------------------------------------------------------- #
# Part 1: Unit-level reproducer.
#
# This part demonstrates the core mechanism without standing up the full
# wrapper: `_get_connection_func` closes over a snapshot of props, so updates
# to the originating props dict (as would happen when the IAM plugin refreshes
# its cached token) do not reach the pool's creator.
# --------------------------------------------------------------------------- #
def unit_reproducer() -> None:
    from aws_advanced_python_wrapper.sql_alchemy_connection_provider import (
        SqlAlchemyPooledConnectionProvider,
    )
    from aws_advanced_python_wrapper.utils.properties import Properties

    passwords_seen: list[str] = []

    def fake_target_connect(**kwargs) -> MagicMock:
        passwords_seen.append(kwargs.get("password"))
        return MagicMock()

    provider = SqlAlchemyPooledConnectionProvider()

    # These are the props the plugin chain would hand to `_create_pool` after
    # `IamAuthPlugin` wrote the initial token into them.
    props = Properties({"user": "u", "password": "TOKEN_1", "host": "h"})

    # `_create_pool` calls `prepare_connect_info` which returns a *copy*; that
    # copy is what the creator closes over. Model that faithfully here.
    frozen_snapshot = Properties(props.copy())
    creator = provider._get_connection_func(fake_target_connect, frozen_snapshot)

    # IAM plugin refreshes its cached token after 10 minutes and updates the
    # props it is given on the next AwsWrapperConnection.connect() call.
    props["password"] = "TOKEN_2"

    # SQLAlchemy pool grows to meet demand: invokes the stored creator to open
    # a brand-new physical connection.
    creator()

    print("-- unit level --")
    print("passwords seen by pool creator:", passwords_seen)
    print("current token in the (refreshed) props:", props["password"])
    assert passwords_seen == ["TOKEN_1"], passwords_seen
    print("BUG CONFIRMED at unit level: pool creator still using TOKEN_1\n")


# --------------------------------------------------------------------------- #
# Part 2: End-to-end reproducer.
#
# Drives a real `AwsWrapperConnection.connect(plugins="iam", ...)` against a
# `SqlAlchemyPooledConnectionProvider`, with `boto3.client("rds").generate_db_auth_token`
# and `psycopg.connect` both patched. The pool is configured `pool_size=1,
# max_overflow=1` so the second logical connect forces the pool to open a
# second *physical* connection via its stored creator lambda.
# --------------------------------------------------------------------------- #
def end_to_end_reproducer() -> None:
    from aws_advanced_python_wrapper import AwsWrapperConnection
    from aws_advanced_python_wrapper.connection_provider import (
        ConnectionProviderManager,
    )
    from aws_advanced_python_wrapper.sql_alchemy_connection_provider import (
        SqlAlchemyPooledConnectionProvider,
    )
    from aws_advanced_python_wrapper.iam_plugin import IamAuthPlugin

    # Deterministically hand out TOKEN_1, TOKEN_2, TOKEN_3, ...
    tokens = itertools.count(1)

    def fake_generate_db_auth_token(*_a, **_kw):
        return f"TOKEN_{next(tokens)}"

    passwords_to_psycopg: list[str] = []

    class FakePsycopgConn:
        closed = False

        def cursor(self, *_a, **_kw):
            cur = MagicMock()
            cur.execute = MagicMock()
            cur.fetchone = MagicMock(return_value=(1,))
            return cur

        def close(self):
            self.closed = True

        def commit(self):
            pass

        def rollback(self):
            pass

    def fake_psycopg_connect(*args, **kwargs):
        # psycopg.connect(dsn, password=..., ...): the wrapper passes password
        # as a kwarg after the plugin chain populates it.
        passwords_to_psycopg.append(kwargs.get("password"))
        return FakePsycopgConn()

    # Ensure we start with a clean token cache and pool cache.
    IamAuthPlugin._token_cache.clear()
    SqlAlchemyPooledConnectionProvider._database_pools.clear()

    provider = SqlAlchemyPooledConnectionProvider(
        pool_configurator=lambda host_info, props: {
            "pool_size": 1,
            "max_overflow": 1,  # allow exactly one overflow connection
        }
    )
    ConnectionProviderManager.set_connection_provider(provider)

    # Patch the IAM boto3 client factory and the raw driver.
    rds_client = MagicMock()
    rds_client.generate_db_auth_token.side_effect = fake_generate_db_auth_token

    with patch("boto3.client", return_value=rds_client), \
            patch("psycopg.connect", side_effect=fake_psycopg_connect):
        import psycopg  # noqa: F401 ensures our patched attribute is hit

        connect_kwargs = dict(
            wrapper_dialect="rds-pg",
            plugins="iam",
            iam_region="us-west-2",
            # Keep plugin cache TTL short so we can force a refresh quickly.
            iam_expiration=1,
            user="cashflow_insights_service_app",
            host="fake-rds.cluster-xyz.us-west-2.rds.amazonaws.com",
            port=5432,
            dbname="cashflow",
        )

        # 1) First connect: creates the pool, plugin generates TOKEN_1,
        #    pool creator lambda closes over a props snapshot with TOKEN_1.
        c1 = AwsWrapperConnection.connect(psycopg.connect, "", **connect_kwargs)
        c1_checkedout = True  # keep it out of the pool, force overflow next

        # 2) Force the plugin's token cache to rotate. The cleanest way: invoke
        #    the token generator directly through the plugin's path by
        #    clearing the cache, so the next plugin-level connect produces
        #    TOKEN_2. (In production this happens naturally every ~10 min.)
        IamAuthPlugin._token_cache.clear()

        # 3) Second connect: the provider reuses the existing pool (cached on
        #    host+user). Because pool_size=1 and c1 is still checked out, the
        #    pool grows via overflow, invoking the frozen creator lambda,
        #    which still references TOKEN_1.
        c2 = AwsWrapperConnection.connect(psycopg.connect, "", **connect_kwargs)

        # Force a cursor use so SQLAlchemy actually materializes the DBAPI conn.
        c2.cursor().execute("SELECT 1")
        if c1_checkedout:
            c1.cursor().execute("SELECT 1")

    # What the plugin cache holds now (the refreshed token).
    current_cache_token = None
    for info in IamAuthPlugin._token_cache.values():
        current_cache_token = getattr(info, "token", None) or getattr(info, "_token", None)

    print("-- end-to-end --")
    print("passwords handed to psycopg.connect:", passwords_to_psycopg)
    print("IAM plugin token cache now holds:", current_cache_token)

    # The bug: both physical connects used the same frozen TOKEN_1.
    assert passwords_to_psycopg.count("TOKEN_1") >= 2, passwords_to_psycopg
    print(
        "BUG CONFIRMED end-to-end: the pool's second physical connection "
        "was opened with the stale TOKEN_1 even though the plugin cache "
        "had already been refreshed"
    )


if __name__ == "__main__":
    unit_reproducer()
    end_to_end_reproducer()
Possible Solution
The pool's creator should route through the plugin chain so each new physical connection consults the fresh IAM token cache. Two reasonable options:
- Rebuild the creator per call. In _get_connection_func, instead of capturing a props snapshot, capture the provider+host+props and rebuild a fresh target_func(**current_props) each invocation by re-running the relevant plugin steps (at minimum, re-entering IamAuthPlugin.connect if configured).
- Delegate to AwsWrapperConnection.connect as the creator. Change the creator produced by _get_connection_func to re-enter the full wrapper stack rather than the raw target_connect_func. This is the lowest-risk change because it reuses the same code path as a non-pooled connect and lets any plugin (not only IAM) observe every new physical connection.
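Both options amount to resolving the password when the creator is called rather than when it is built. The following is a library-free sketch of that difference, not a patch for the wrapper: make_frozen_creator mirrors the current closure, while make_refreshing_creator, fake_token_source and fake_connect are hypothetical names introduced only for illustration. In the wrapper, the callback's role would be played by re-entering the plugin chain.

```python
import itertools
from typing import Any, Callable, Dict, List


def make_frozen_creator(connect: Callable[..., Any], props: Dict[str, Any]) -> Callable[[], Any]:
    """Current behaviour: the password is captured once, when the creator is built."""
    snapshot = dict(props)
    return lambda: connect(**snapshot)


def make_refreshing_creator(
    connect: Callable[..., Any],
    props: Dict[str, Any],
    current_password: Callable[[], str],
) -> Callable[[], Any]:
    """Sketch of the fix: the password is looked up on every creator call."""
    base = {k: v for k, v in props.items() if k != "password"}

    def creator() -> Any:
        return connect(**base, password=current_password())

    return creator


# Hypothetical token source that hands out TOKEN_1, TOKEN_2, ...
_counter = itertools.count(1)


def fake_token_source() -> str:
    return f"TOKEN_{next(_counter)}"


seen: List[str] = []


def fake_connect(**kwargs: Any) -> object:
    # Stand-in for the raw driver; records the password it was given.
    seen.append(kwargs["password"])
    return object()


props = {"user": "u", "host": "h", "password": fake_token_source()}
frozen = make_frozen_creator(fake_connect, props)
fresh = make_refreshing_creator(fake_connect, props, fake_token_source)

frozen()
frozen()  # replays the pool-creation token
fresh()
fresh()   # asks the token source each time
print(seen)
```

The frozen creator sends the same token on every call, while the refreshing creator picks up whatever the token source returns at call time. A real fix would presumably consult the plugin's cache rather than minting a token per call.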
Additional Information/Context
Separately, psycopg PAM failures are misclassified as network errors ("connection failed" prefix match), which prevents the login-exception retry inside IamAuthPlugin._connect from firing. Fixing that would make the IAM plugin more resilient when it is on the connection path, but does not by itself fix this pool-creator bug.
The AWS Advanced Python Wrapper version used
2.1.0
python version used
3.13.7
Operating System and version
Debian 13.4