SqlAlchemyPooledConnectionProvider freezes IAM token in pool creator, causing auth failures after 15 minutes

### Describe the bug

## Summary

When `plugins=iam` is combined with `SqlAlchemyPooledConnectionProvider`, the SQLAlchemy pool's `creator` callable is built once, at pool-creation time, and closes over a **frozen copy** of the connection properties — including the IAM-generated `password`. The pool itself is long-lived and cached on the provider, but the `creator` is never regenerated. Any new physical connection the pool opens later (initial fill beyond the first connection, overflow growth, pool recycle, or connection invalidation) is attempted with the original token and fails once that token is older than 15 minutes.

The `IamAuthPlugin._token_cache` is refreshed correctly by subsequent `AwsWrapperConnection.connect()` calls, but the cache has no effect on the pool's captured creator, which bypasses the plugin chain entirely.

## Root cause

In `aws_advanced_python_wrapper/sql_alchemy_connection_provider.py`:

```python
def _create_pool(self, target_func, driver_dialect, database_dialect, host_info, props):
    kwargs = dict() if self._pool_configurator is None else self._pool_configurator(host_info, props)
    prepared_properties = driver_dialect.prepare_connect_info(host_info, props)  # copy of props
    database_dialect.prepare_conn_props(prepared_properties)
    kwargs["creator"] = self._get_connection_func(target_func, prepared_properties)  # <-- one-shot
    return self._create_sql_alchemy_pool(**kwargs)

def _get_connection_func(self, target_connect_func, props):
    return lambda: target_connect_func(**props)  # closes over frozen props
```

`prepare_connect_info` is explicitly a copy (`Properties(original_props.copy())` in `pg_driver_dialect.py`). The resulting dict — which contains `password=<IAM token T1>` produced by `IamAuthPlugin` on the first connect — is captured by the lambda and handed directly to the raw driver (`psycopg.connect`) for every subsequent pool fill. No plugin in the chain runs, so no token refresh happens on the creator path.

The provider's pool cache (`_database_pools`, a `SlidingExpirationCache`) keeps the pool alive across all `connect()` calls with the same `(host_url, user)` key, so the stale-creator pool persists for the lifetime of the process in practice.

## Why existing mitigations don't help

- **`iam_expiration`**: correctly drives the `IamAuthPlugin._token_cache` TTL, but that cache is only consulted on calls routed through the plugin chain (i.e. `AwsWrapperConnection.connect`). The pool's creator bypasses it.
- **Retry-on-login-error in `IamAuthPlugin._connect`**: does not fire here. psycopg raises a PAM failure whose message begins with `connection failed:`, which matches an entry in `pg_exception_handler._NETWORK_ERROR_MESSAGES`; the `_is_network_error` check runs before the login-exception branch, so the plugin re-raises the error as `AwsConnectError` without regenerating the token. Even if the retry did fire, it runs inside the plugin chain, which the pool's creator does not invoke.
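
The ordering problem in the second bullet can be illustrated with a small, self-contained model. This is only a sketch: `classify`, `is_network_error`, `is_login_error`, and the contents of the two tuples are hypothetical stand-ins, not the wrapper's actual code, and substring matching is an assumption about how the patterns are applied.

```python
# Hypothetical, simplified model of the classification order described above.
# The names and pattern lists are assumptions, NOT the wrapper's real code.
NETWORK_ERROR_MESSAGES = ("connection failed:",)        # assumed subset
LOGIN_ERROR_MESSAGES = ("PAM authentication failed",)   # assumed subset


def is_network_error(message: str) -> bool:
    return any(pattern in message for pattern in NETWORK_ERROR_MESSAGES)


def is_login_error(message: str) -> bool:
    return any(pattern in message for pattern in LOGIN_ERROR_MESSAGES)


def classify(message: str) -> str:
    # The network check runs first, so it wins when both patterns match.
    if is_network_error(message):
        return "network"
    if is_login_error(message):
        return "login"
    return "other"


pam_failure = (
    'connection failed: connection to server at "<rds-ip>", port 5432 failed: '
    'FATAL:  PAM authentication failed for user "<db_user>"'
)

# Both patterns match, but the message is reported as a network error,
# so a login-error retry branch would never be reached.
print(is_login_error(pam_failure), classify(pam_failure))
```

In this model the message satisfies both checks, and the result depends only on which check runs first.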

## Impact

In any long-lived process using `plugins=iam` + `SqlAlchemyPooledConnectionProvider`, authentication failures are inevitable once the pool needs to grow or replace a physical connection more than 15 minutes after the pool was first created. The failure rate is proportional to (new-physical-connection events) / (total connect-requests), which in practice is a small but non-zero percentage and scales with traffic burstiness. It is not addressable by tuning `iam_expiration`.
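
To make the ratio above concrete, here is a rough sketch with made-up numbers. The request count and the creator-call timestamps are purely illustrative assumptions, not measurements; the only value taken from this report is the 15-minute token age limit.

```python
TOKEN_MAX_AGE_S = 15 * 60  # RDS rejects tokens older than 15 minutes


def creator_call_fails(seconds_since_pool_creation: float) -> bool:
    """The frozen token is as old as the pool, so its age equals the pool's age."""
    return seconds_since_pool_creation > TOKEN_MAX_AGE_S


# Hypothetical workload: 10,000 logical connect requests, of which only six
# caused the pool to open a new physical connection (times in seconds since
# the pool was created).
total_requests = 10_000
creator_call_times_s = [0, 5, 600, 1200, 2400, 3600]

failures = sum(creator_call_fails(t) for t in creator_call_times_s)
failure_rate = failures / total_requests

print(failures, f"{failure_rate:.4%}")  # 3 0.0300%
```

Every creator call after the 15-minute mark fails, so the observed error rate tracks how often the pool opens physical connections, not the `iam_expiration` setting.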


### Expected Behavior

New physical connections opened by a pool created via `SqlAlchemyPooledConnectionProvider` with `plugins=iam` should use a current IAM token rather than a snapshot taken at pool-creation time.

### What plugins are used? What other connection properties were set?

iam

### Current Behavior

Intermittent failures on database queries:

```
aws_advanced_python_wrapper.AwsConnectError: [IamAuthPlugin] Error occurred while
opening a connection: connection failed: connection to server at "<rds-ip>",
port 5432 failed: FATAL:  PAM authentication failed for user "<db_user>"
```

These failures coincide with spikes in the `IamDbAuthConnectionFailureInvalidToken` RDS CloudWatch metric and with RDS `iam-db-auth-error` log entries reporting:

> Failed to authenticate the connection request for user `'<db_user>'` because the token age is longer than 15 minutes

Errors cluster with traffic bursts (because bursts are what trigger pool overflow → new physical connection → frozen token).

### Reproduction Steps


The script below reproduces the bug at two levels:

1. **Unit-level**: constructs the provider's creator lambda directly and shows that mutating the originating `Properties` has no effect on what the creator passes to the driver.
2. **End-to-end**: stubs out `boto3` IAM token generation to return a new token on each call, stubs out `psycopg.connect` to capture the `password` it receives, drives a full `AwsWrapperConnection.connect()` with `plugins="iam"` and `SqlAlchemyPooledConnectionProvider` installed, then forces the pool to open a new physical connection and observes that the captured password is still the original token even though the plugin's token cache has since been refreshed.

```python
"""
Minimal reproducer for the frozen-IAM-token-in-pool bug in
aws-advanced-python-wrapper `SqlAlchemyPooledConnectionProvider`.

Run:
    pip install "aws-advanced-python-wrapper==2.1.0" "SQLAlchemy>=2" "psycopg[binary]>=3"
    python reproducer.py

Expected output (relevant lines):

    -- unit level --
    passwords seen by pool creator: ['TOKEN_1']
    current token in the (refreshed) props: TOKEN_2
    BUG CONFIRMED at unit level: pool creator still using TOKEN_1

    -- end-to-end --
    passwords handed to psycopg.connect: ['TOKEN_1', 'TOKEN_1']
    IAM plugin token cache now holds: TOKEN_2
    BUG CONFIRMED end-to-end: the pool's second physical connection
    was opened with the stale TOKEN_1 even though the plugin cache
    had already been refreshed
"""

from __future__ import annotations

import itertools
from unittest.mock import MagicMock, patch

# --------------------------------------------------------------------------- #
# Part 1: Unit-level reproducer.
#
# This part demonstrates the core mechanism without standing up the full
# wrapper: `_get_connection_func` closes over a snapshot of props, so updates
# to the originating props dict (as would happen when the IAM plugin refreshes
# its cached token) do not reach the pool's creator.
# --------------------------------------------------------------------------- #

def unit_reproducer() -> None:
    from aws_advanced_python_wrapper.sql_alchemy_connection_provider import (
        SqlAlchemyPooledConnectionProvider,
    )
    from aws_advanced_python_wrapper.utils.properties import Properties

    passwords_seen: list[str] = []

    def fake_target_connect(**kwargs) -> MagicMock:
        passwords_seen.append(kwargs.get("password"))
        return MagicMock()

    provider = SqlAlchemyPooledConnectionProvider()

    # These are the props the plugin chain would hand to `_create_pool` after
    # `IamAuthPlugin` wrote the initial token into them.
    props = Properties({"user": "u", "password": "TOKEN_1", "host": "h"})

    # `_create_pool` calls `prepare_connect_info` which returns a *copy*; that
    # copy is what the creator closes over. Model that faithfully here.
    frozen_snapshot = Properties(props.copy())
    creator = provider._get_connection_func(fake_target_connect, frozen_snapshot)

    # IAM plugin refreshes its cached token after 10 minutes and updates the
    # props it is given on the next AwsWrapperConnection.connect() call.
    props["password"] = "TOKEN_2"

    # SQLAlchemy pool grows to meet demand: invokes the stored creator to open
    # a brand-new physical connection.
    creator()

    print("-- unit level --")
    print("passwords seen by pool creator:", passwords_seen)
    print("current token in the (refreshed) props:", props["password"])
    assert passwords_seen == ["TOKEN_1"], passwords_seen
    print("BUG CONFIRMED at unit level: pool creator still using TOKEN_1\n")


# --------------------------------------------------------------------------- #
# Part 2: End-to-end reproducer.
#
# Drives a real `AwsWrapperConnection.connect(plugins="iam", ...)` against a
# `SqlAlchemyPooledConnectionProvider`, with `boto3.client("rds").generate_db_auth_token`
# and `psycopg.connect` both patched. The pool is configured `pool_size=1,
# max_overflow=1` so the second logical connect forces the pool to open a
# second *physical* connection via its stored creator lambda.
# --------------------------------------------------------------------------- #

def end_to_end_reproducer() -> None:
    from aws_advanced_python_wrapper import AwsWrapperConnection
    from aws_advanced_python_wrapper.connection_provider import (
        ConnectionProviderManager,
    )
    from aws_advanced_python_wrapper.sql_alchemy_connection_provider import (
        SqlAlchemyPooledConnectionProvider,
    )
    from aws_advanced_python_wrapper.iam_plugin import IamAuthPlugin

    # Deterministically hand out TOKEN_1, TOKEN_2, TOKEN_3, ...
    tokens = itertools.count(1)
    def fake_generate_db_auth_token(*_a, **_kw):
        return f"TOKEN_{next(tokens)}"

    passwords_to_psycopg: list[str] = []

    class FakePsycopgConn:
        closed = False
        def cursor(self, *_a, **_kw):
            cur = MagicMock()
            cur.execute = MagicMock()
            cur.fetchone = MagicMock(return_value=(1,))
            return cur
        def close(self): self.closed = True
        def commit(self): pass
        def rollback(self): pass

    def fake_psycopg_connect(*args, **kwargs):
        # psycopg.connect(dsn, password=..., ...) — the wrapper passes password
        # as a kwarg after the plugin chain populates it.
        passwords_to_psycopg.append(kwargs.get("password"))
        return FakePsycopgConn()

    # Ensure we start with a clean token cache and pool cache.
    IamAuthPlugin._token_cache.clear()
    SqlAlchemyPooledConnectionProvider._database_pools.clear()

    provider = SqlAlchemyPooledConnectionProvider(
        pool_configurator=lambda host_info, props: {
            "pool_size": 1,
            "max_overflow": 1,   # allow exactly one overflow connection
        }
    )
    ConnectionProviderManager.set_connection_provider(provider)

    # Patch the IAM boto3 client factory and the raw driver.
    rds_client = MagicMock()
    rds_client.generate_db_auth_token.side_effect = fake_generate_db_auth_token

    with patch("boto3.client", return_value=rds_client), \
         patch("psycopg.connect", side_effect=fake_psycopg_connect):
        import psycopg  # noqa: F401  ensures our patched attribute is hit

        connect_kwargs = dict(
            wrapper_dialect="rds-pg",
            plugins="iam",
            iam_region="us-west-2",
            # Keep plugin cache TTL short so we can force a refresh quickly.
            iam_expiration=1,
            user="cashflow_insights_service_app",
            host="fake-rds.cluster-xyz.us-west-2.rds.amazonaws.com",
            port=5432,
            dbname="cashflow",
        )

        # 1) First connect: creates the pool, plugin generates TOKEN_1,
        #    pool creator lambda closes over a props snapshot with TOKEN_1.
        c1 = AwsWrapperConnection.connect(psycopg.connect, "", **connect_kwargs)
        c1_checkedout = True  # keep it out of the pool, force overflow next

        # 2) Force the plugin's token cache to rotate. The cleanest way: invoke
        #    the token generator directly through the plugin's path by
        #    clearing the cache, so the next plugin-level connect produces
        #    TOKEN_2. (In production this happens naturally every ~10 min.)
        IamAuthPlugin._token_cache.clear()

        # 3) Second connect: the provider reuses the existing pool (cached on
        #    host+user). Because pool_size=1 and c1 is still checked out, the
        #    pool grows via overflow — invoking the frozen creator lambda,
        #    which still references TOKEN_1.
        c2 = AwsWrapperConnection.connect(psycopg.connect, "", **connect_kwargs)

        # Force a cursor use so SQLAlchemy actually materializes the DBAPI conn.
        c2.cursor().execute("SELECT 1")
        if c1_checkedout:
            c1.cursor().execute("SELECT 1")

    # What the plugin cache holds now (the refreshed token).
    current_cache_token = None
    for info in IamAuthPlugin._token_cache.values():
        current_cache_token = getattr(info, "token", None) or getattr(info, "_token", None)

    print("-- end-to-end --")
    print("passwords handed to psycopg.connect:", passwords_to_psycopg)
    print("IAM plugin token cache now holds:", current_cache_token)

    # The bug: both physical connects used the same frozen TOKEN_1.
    assert passwords_to_psycopg.count("TOKEN_1") >= 2, passwords_to_psycopg
    print(
        "BUG CONFIRMED end-to-end: the pool's second physical connection "
        "was opened with the stale TOKEN_1 even though the plugin cache "
        "had already been refreshed"
    )


if __name__ == "__main__":
    unit_reproducer()
    end_to_end_reproducer()

```

### Possible Solution

The pool's creator should route through the plugin chain so each new physical connection consults the fresh IAM token cache. Two reasonable options:

1. **Rebuild the creator per call.** In `_get_connection_func`, instead of capturing a props snapshot, capture the provider+host+props and rebuild a fresh `target_func(**current_props)` each invocation by re-running the relevant plugin steps (at minimum, re-entering `IamAuthPlugin.connect` if configured).
2. **Delegate to `AwsWrapperConnection.connect` as the creator.** Change the creator produced by `_get_connection_func` to re-enter the full wrapper stack rather than the raw `target_connect_func`. This is the lowest-risk change because it reuses the same code path as a non-pooled connect and lets any plugin (not only IAM) observe every new physical connection.
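
As a rough illustration of the idea shared by both options (resolve the credential when the creator runs, not when it is built), here is a minimal sketch. `make_refreshing_creator` and `token_supplier` are hypothetical names introduced for this example and are not part of the wrapper; in a real fix the supplier would presumably be backed by the IAM plugin's token cache or by re-entering the plugin chain.

```python
from typing import Any, Callable, Dict


def make_refreshing_creator(
    target_connect_func: Callable[..., Any],
    base_props: Dict[str, Any],
    token_supplier: Callable[[], str],  # hypothetical: returns a current token
) -> Callable[[], Any]:
    """Return a zero-argument creator that fetches a token on every call."""
    static_props = dict(base_props)  # non-secret settings can stay frozen

    def creator() -> Any:
        props = dict(static_props)
        # Resolved at call time, so each physical connection gets a fresh value.
        props["password"] = token_supplier()
        return target_connect_func(**props)

    return creator


# Usage with stand-ins for the driver and the token source.
seen = []
tokens = iter(["TOKEN_1", "TOKEN_2"])


def fake_connect(**kwargs):
    seen.append(kwargs["password"])
    return object()


creator = make_refreshing_creator(
    fake_connect,
    {"user": "u", "host": "h", "password": "STALE"},
    lambda: next(tokens),
)
creator()
creator()
print(seen)  # ['TOKEN_1', 'TOKEN_2']
```

Unlike the current lambda, the second call here sees `TOKEN_2` because the password is looked up inside the creator instead of being captured with the rest of the props.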

### Additional Information/Context

Separately, psycopg PAM failures are misclassified as network errors (via the `"connection failed"` prefix match), which prevents the login-exception retry inside `IamAuthPlugin._connect` from firing. Fixing that would make the IAM plugin more resilient when it *is* on the connection path, but would not by itself fix this pool-creator bug.

### The AWS Advanced Python Wrapper version used

2.1.0

### python version used

3.13.7

### Operating System and version

Debian 13.4
