Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 14 additions & 0 deletions CLAUDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -96,3 +96,17 @@ Three documented internal boundaries. AI agents must respect them — never cros

- Check the relevant [`architecture/`](architecture/) capability file before adding a new module or extension point.
- Surface ambiguity as a documentation gap rather than improvising.

## Agent skills

### Issue tracker

Issues live in GitHub Issues (`modern-python/httpware`), managed via the `gh` CLI; external PRs are not a triage surface. See `planning/agents/issue-tracker.md`.

### Triage labels

Canonical defaults — `needs-triage`, `needs-info`, `ready-for-agent`, `ready-for-human`, `wontfix` (the last already exists). See `planning/agents/triage-labels.md`.

### Domain docs

Single-context — one `CONTEXT.md` at the repo root + ADRs under `planning/adr/`. See `planning/agents/domain.md`.
2 changes: 2 additions & 0 deletions architecture/resilience.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,8 @@

`Retry` (and `AsyncRetry`) is a retry middleware backed by a Finagle-style `RetryBudget` — a token bucket that caps the proportion of traffic spent on retries so a degraded backend cannot be amplified into a retry storm. `RetryBudget` is a single thread-safe class shared by both worlds: all mutations go through a `threading.Lock`, so state is never torn. "Safe" here means no corruption, not non-blocking — when one budget is shared across a (sync `Client`, `AsyncClient`) pair, a sync thread holding the lock can briefly block the event-loop thread's acquisition. The critical section is intentionally tiny to bound that latency. Backoff between attempts uses full-jitter.

The decision logic — status/method eligibility, streaming-body refusal, exhaustion, Retry-After handling, budget accounting, and the backoff delay — lives once in a stateless private `_RetryPolicy.decide`, the retry analog of how the circuit breaker keeps its transition logic in one shared state object. `Retry` and `AsyncRetry` are thin loop drivers over that policy: they own the attempt loop, the terminal call, and the sleep, and differ only in `await next` vs `next` and `asyncio.sleep` vs `time.sleep`. `decide` returns the delay to sleep before the next attempt, or raises the terminal exception (with its PEP 678 note and event already emitted); because it runs inside the wrapper's `except` block, exception chaining behaves as a direct raise. `_RetryPolicy` holds the immutable config plus the shared `RetryBudget`; per-attempt state stays as wrapper locals, so one instance is safe across the concurrent requests it serves.

## Bulkhead

`Bulkhead` / `AsyncBulkhead` is a concurrency limiter. `AsyncBulkhead` uses `asyncio.Semaphore` with a bounded acquire wait; sync `Bulkhead` uses `threading.Semaphore`. A sync instance cannot share with an async one. Both are sharable across clients (one instance = one shared concurrency pool).
Expand Down
39 changes: 39 additions & 0 deletions planning/agents/domain.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
# Domain Docs

How the engineering skills should consume this repo's domain documentation when exploring the codebase.

**Layout: single-context.** One `CONTEXT.md` at the repo root + ADRs under `planning/adr/`.

ADRs live under `planning/` (internal docs) rather than `docs/` (the user-facing mkdocs site). This repo also keeps per-capability living truth in [`architecture/`](../../architecture/) and per-change design under [`planning/changes/`](../changes/) — read those for established context before writing a new ADR.

## Before exploring, read these

- **`CONTEXT.md`** at the repo root.
- **`planning/adr/`** — read ADRs that touch the area you're about to work in.

If any of these files don't exist, **proceed silently**. Don't flag their absence; don't suggest creating them upfront. The `/domain-modeling` skill (reached via `/grill-with-docs` and `/improve-codebase-architecture`) creates them lazily when terms or decisions actually get resolved.

## File structure

Single-context repo (this repo):

```
/
├── CONTEXT.md
├── planning/adr/
│ ├── 0001-some-decision.md
│ └── 0002-another-decision.md
└── src/
```

## Use the glossary's vocabulary

When your output names a domain concept (in an issue title, a refactor proposal, a hypothesis, a test name), use the term as defined in `CONTEXT.md`. Don't drift to synonyms the glossary explicitly avoids.

If the concept you need isn't in the glossary yet, that's a signal — either you're inventing language the project doesn't use (reconsider) or there's a real gap (note it for `/domain-modeling`).

## Flag ADR conflicts

If your output contradicts an existing ADR, surface it explicitly rather than silently overriding:

> _Contradicts ADR-0007 (some decision) — but worth reopening because…_
34 changes: 34 additions & 0 deletions planning/agents/issue-tracker.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,34 @@
# Issue tracker: GitHub

Issues and PRDs for this repo live as GitHub issues. Use the `gh` CLI for all operations.

## Conventions

- **Create an issue**: `gh issue create --title "..." --body "..."`. Use a heredoc for multi-line bodies.
- **Read an issue**: `gh issue view <number> --comments`, filtering comments by `jq` and also fetching labels.
- **List issues**: `gh issue list --state open --json number,title,body,labels,comments --jq '[.[] | {number, title, body, labels: [.labels[].name], comments: [.comments[].body]}]'` with appropriate `--label` and `--state` filters.
- **Comment on an issue**: `gh issue comment <number> --body "..."`
- **Apply / remove labels**: `gh issue edit <number> --add-label "..."` / `--remove-label "..."`
- **Close**: `gh issue close <number> --comment "..."`

Infer the repo from `git remote -v` — `gh` does this automatically when run inside a clone. This repo's remote is `modern-python/httpware`.

## Pull requests as a triage surface

**PRs as a request surface: no.** _(Set to `yes` if this repo treats external PRs as feature requests; `/triage` reads this flag.)_

When set to `yes`, PRs run through the same labels and states as issues, using the `gh pr` equivalents:

- **Read a PR**: `gh pr view <number> --comments` and `gh pr diff <number>` for the diff.
- **List external PRs for triage**: `gh pr list --state open --json number,title,body,labels,author,authorAssociation,comments` then keep only `authorAssociation` of `CONTRIBUTOR`, `FIRST_TIME_CONTRIBUTOR`, or `NONE` (drop `OWNER`/`MEMBER`/`COLLABORATOR`).
- **Comment / label / close**: `gh pr comment`, `gh pr edit --add-label`/`--remove-label`, `gh pr close`.

GitHub shares one number space across issues and PRs, so a bare `#42` may be either — resolve with `gh pr view 42` and fall back to `gh issue view 42`.

## When a skill says "publish to the issue tracker"

Create a GitHub issue.

## When a skill says "fetch the relevant ticket"

Run `gh issue view <number> --comments`.
17 changes: 17 additions & 0 deletions planning/agents/triage-labels.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
# Triage Labels

The skills speak in terms of five canonical triage roles. This file maps those roles to the actual label strings used in this repo's issue tracker.

| Canonical role | Label in our tracker | Meaning |
| -------------------------- | -------------------- | ---------------------------------------- |
| `needs-triage` | `needs-triage` | Maintainer needs to evaluate this issue |
| `needs-info` | `needs-info` | Waiting on reporter for more information |
| `ready-for-agent` | `ready-for-agent` | Fully specified, ready for an AFK agent |
| `ready-for-human` | `ready-for-human` | Requires human implementation |
| `wontfix` | `wontfix` | Will not be actioned |

`wontfix` already exists in this repo's GitHub labels. The other four are created on first use by `/triage` (`gh label create <name>`).

When a skill mentions a role (e.g. "apply the AFK-ready triage label"), use the corresponding label string from this table.

Edit the right-hand column to match whatever vocabulary you actually use.
173 changes: 173 additions & 0 deletions planning/changes/2026-06-23.01-retry-policy-extraction/design.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,173 @@
---
status: shipped
date: 2026-06-23
slug: retry-policy-extraction
summary: Extract a stateless _RetryPolicy decision module from the duplicated AsyncRetry/Retry __call__ loops.
supersedes: null
superseded_by: null
pr: 76
outcome: Shipped via #76 — decision logic moved into a stateless _RetryPolicy.decide; AsyncRetry/Retry are now thin loop drivers, the ~110-line sync/async duplication is gone, behaviour byte-identical (718 tests, 100% coverage). New seam suite tests/test_retry_policy.py; promoted into architecture/resilience.md. Internal refactor — no release.
---

# Design: Extract a deep `_RetryPolicy` decision module

## Summary

`AsyncRetry.__call__` and `Retry.__call__` hand-copy ~110 lines of retry
*decision* logic — status eligibility, streaming-body refusal, exhaustion,
Retry-After parsing, budget accounting, backoff — differing only in `await
next` vs `next` and `asyncio.sleep` vs `time.sleep`. This change pulls the
decision logic into a stateless private `_RetryPolicy` in the same module, so
both wrappers shrink to a thin loop and the decision lives once. It mirrors
the precedent already in the package: `CircuitBreaker`/`AsyncCircuitBreaker`
share the lock-free `_CircuitBreakerState`.

## Motivation

- `retry.py:100-210` (`AsyncRetry.__call__`) and `retry.py:213-349`
(`Retry.__call__`) are ~110 lines each, byte-identical except the `await`.
Parity is hand-maintained; drift is undetectable. Both carry
`# noqa: C901, PLR0912, PLR0915` to silence the complexity budget.
- The package already proved the fix: `_CircuitBreakerState`
(`circuit_breaker.py:131-310`) is a deep, synchronous, lock-free decision
module that both breaker wrappers drive. Retry never got the same treatment.
- **Depth:** the retry interface (the `Middleware` protocol — one `__call__`)
is small, but the implementation is duplicated rather than deep. Moving the
decision behind `_RetryPolicy.decide` concentrates it: one place to fix a
retry bug (locality), one interface to test directly without `MockTransport`
(leverage).

## Non-goals

- No behaviour change. The retry policy, defaults, events, notes, and raised
exceptions stay byte-identical.
- Not touching `RetryBudget`, `_backoff.full_jitter_delay`, or
`_parse_retry_after` — they stay as-is.
- Not unifying the sync/async wrappers themselves — the `await`/blocking split
is fundamental and stays in the two thin `__call__` shells.
- Not extending the same treatment to `Bulkhead` in this change.

## Design

### 1. `_RetryPolicy` — stateless decision module

A private class in `retry.py`, holding **immutable config + the shared
budget** and nothing per-call mutable. This is the faithful analog of
`_CircuitBreakerState`: there the *circuit* is the shared state; here the
shared state is the already-thread-safe `RetryBudget`, and `_RetryPolicy` is
the decision logic around it. Because it carries no per-call field, it is
trivially safe under the concurrent requests a single frozen middleware
instance serves.

It owns:

- config: `max_attempts`, `base_delay`, `max_delay`, `retry_status_codes`,
`retry_methods`, `respect_retry_after`, `budget`;
- validation: `max_attempts < 1` → `ValueError` (raised when the wrapper
builds the policy in `__init__`, so construction-time behaviour is
unchanged);
- the `_LOGGER` event emissions and PEP-678 note additions (side effects move
here with the decision).

### 2. The seam — one method

```python
def decide(self, *, attempt: int, request: httpx2.Request, exc: BaseException) -> float
```

- **Returns** the `float` delay to sleep for the retry case.
- **Raises** for every terminal case, having already added the note, emitted
the event, and (for the budget case) constructed `RetryBudgetExhaustedError`
with its `__cause__`. `decide` is called *inside* the wrapper's `except`
block, so implicit `__context__` and explicit `raise ... from exc` chaining
behave exactly as today — no manual `__cause__` fiddling.

Classification is folded in (no separate predicate): derive `last_response`
from `isinstance(exc, StatusError)`; apply method-eligibility and status-set
membership; re-raise non-retryable failures unchanged; otherwise walk
streaming-refusal → exhaustion → Retry-After-exceeds-`max_delay` → budget
`try_withdraw` → delay (Retry-After value or `full_jitter_delay`).

Rejected alternative: a `_Sleep | _Stop` sum type. It defers the raise to
*after* the `except` block, losing the active exception context and forcing
manual chain reconstruction — machinery that exists only to paper over that.
Returning-a-delay-or-raising matches `_CircuitBreakerState.admit()`, which
already raises `CircuitOpenError` rather than returning a rejected value.

### 3. The wrappers shrink to a thin driver

```python
_RETRYABLE_EXCEPTIONS = (StatusError, NetworkError, TimeoutError)

async def __call__(self, request: httpx2.Request, next: AsyncNext) -> httpx2.Response:
self.budget.deposit()
for attempt in range(self._policy.max_attempts):
try:
return await next(request)
except _RETRYABLE_EXCEPTIONS as exc:
delay = self._policy.decide(attempt=attempt, request=request, exc=exc)
await self._sleep(delay)
raise AssertionError("unreachable") # pragma: no cover
```

The sync `Retry.__call__` is identical but for `next(request)` and
`self._sleep(delay)`. `_RETRYABLE_EXCEPTIONS` is one module constant
referenced by both — the narrow catch surface stays structural, so anything
not in the tuple (e.g. `httpx2.InvalidURL`, programming errors) propagates
untouched exactly as today. The `# noqa: C901, PLR0912, PLR0915` suppressions
come off `__call__`; `decide` may carry its own.

### 4. Preserved public contract

- `AsyncRetry.__init__` / `Retry.__init__` signatures unchanged (incl.
`_sleep`, `budget`).
- The wrapper keeps `self.budget` (the *same object* the policy holds, so
`r1.budget is r2.budget` identity tests pass) and `self._sleep`.
- The six config attributes (`max_attempts`, `base_delay`, `max_delay`,
`retry_status_codes`, `retry_methods`, `respect_retry_after`) are **dropped**
from the wrapper instances — they live solely on `_RetryPolicy`. They are
read nowhere outside `retry.py` and `docs/resilience.md` documents them only
as constructor parameters, not readable attributes.

## Operations

None — internal refactor, no infra or external changes.

## Out of scope

- `Bulkhead`/`AsyncBulkhead` deduplication.
- Injecting randomness into `full_jitter_delay` (see Testing — only needed if
we want exact-value assertions on the jitter path).

## Testing

- **Parity net:** all existing `MockTransport` suites — `test_retry.py`,
`test_retry_sync.py`, `test_retry_props.py`,
`test_retry_budget_threadsafety.py`, `test_threading_with_shared_budget.py`
— stay green unchanged. Byte-identical behaviour is the bar.
- **New seam tests:** `tests/test_retry_policy.py` drives `decide` directly
(no client, no `MockTransport`) across the decision matrix: retryable →
returns a delay; non-retryable status / non-eligible method → re-raises the
original; streaming-body refusal; exhaustion note on the last attempt;
Retry-After > `max_delay`; budget refusal → `RetryBudgetExhaustedError` with
`__cause__`.
- The jitter path returns a random delay, so assert **bounds**
(`0 ≤ delay ≤ max_delay`) for it; assert exact values only on the
deterministic Retry-After path.
- `just lint` and `just test` both clean.

## Risk

- **Behavioural drift during extraction** (likely × high): a subtle
reordering changes a note string, an event payload, or which exception wins.
*Mitigation:* extract under the existing green suites; they assert notes,
events (via the recording sleeper / caplog), and exception types. Do not
edit the test suites in this change.
- **Exception-chaining regression** (low × medium): moving the raise into
`decide` could drop a `__cause__`/`__context__`. *Mitigation:* `decide` is
called inside the live `except`; an explicit test asserts `__cause__` on the
budget-exhausted path.
- **Concurrency** (low × high): a stray per-call field on `_RetryPolicy` would
make a shared instance unsafe. *Mitigation:* the policy holds only immutable
config + the lock-guarded budget; per-attempt state stays as wrapper locals.
The property/thread-safety suites cover interleaving.
Loading