Retry token exchange on transient server errors #113

@zfarrell

The token manager (hotdata/_auth.py) treats any non-200 from the token-exchange endpoint (/v1/auth/jwt) as fatal: _mint raises TokenExchangeError immediately on the first failed response, with no retry.

This means a brief, transient server-side error (e.g. a momentary 500) on the token endpoint fails the caller outright, even though an immediate re-attempt would very likely succeed. We hit this in CI: a single transient 500 during a mint failed one query, while every request around it succeeded.

Ask: retry token exchange on transient failures before giving up.

  • Retry on 5xx responses and transport errors (connection/read errors).
  • Do not retry on 4xx (e.g. 400/401 -- bad/expired credentials are not transient).
  • Use a small, bounded retry budget (e.g. 2-3 attempts) with exponential backoff + jitter between attempts.
  • Applies to both the api_token mint path and the refresh_token path (the refresh path already falls back to a re-mint, so retry should wrap the underlying request).
  • Surface the final error as TokenExchangeError once retries are exhausted, preserving the last status/body.
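
A rough sketch of the behavior asked for above, to make the retry rules concrete. This is not the code in hotdata/_auth.py: exchange_with_retry, its send callable, the delay parameters, and the TokenExchangeError constructor shown here are all hypothetical stand-ins, and OSError stands in for whatever transport errors the real HTTP client raises.

```python
import random
import time
from typing import Callable, Optional, Tuple


class TokenExchangeError(Exception):
    """Stand-in for the real exception; its actual signature is an assumption."""

    def __init__(self, message: str, status: Optional[int] = None,
                 body: Optional[str] = None):
        super().__init__(message)
        self.status = status
        self.body = body


def exchange_with_retry(
    send: Callable[[], Tuple[int, str]],  # hypothetical: performs one POST, returns (status, body)
    max_attempts: int = 3,
    base_delay: float = 0.2,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    last_status: Optional[int] = None
    last_body: Optional[str] = None
    last_exc: Optional[BaseException] = None

    for attempt in range(max_attempts):
        try:
            status, body = send()
        except OSError as exc:
            # ConnectionError and TimeoutError are OSError subclasses;
            # treat them as transient transport failures.
            last_status, last_body, last_exc = None, None, exc
        else:
            if status == 200:
                return body
            if not 500 <= status < 600:
                # 4xx and any other non-5xx response is not transient: fail now.
                raise TokenExchangeError(
                    f"token exchange failed: HTTP {status}", status, body)
            last_status, last_body, last_exc = status, body, None

        if attempt < max_attempts - 1:
            # Exponential backoff with full jitter, capped at max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0.0, cap))

    # Retries exhausted: surface the last status/body (or the transport error).
    raise TokenExchangeError(
        f"token exchange failed after {max_attempts} attempts",
        last_status, last_body) from last_exc
```

Wrapping the single underlying request this way would cover both the api_token mint path and the refresh_token path without changing the existing refresh-to-re-mint fallback. Taking sleep as a parameter is only there so tests can avoid real delays.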

Context on the server-side transient errors that motivated this: hotdata-dev/monopoly#1128.
