docs(retries): add guides/retries.md retry/resilience documentation

jhamon · jhamon · commit fdadfdeebbb2 · 2026-05-29T17:48:08.000Z
## Purpose

The retry-resilience plan (DX-0147–DX-0161) added jitter, Retry-After smear,
grpc-retry-pushback-ms support, and AIMD adaptive concurrency. All of that
behavior now deserves its own dedicated guide rather than a paragraph in
error-handling.md.

## Solution

Created `docs/guides/retries.md` covering:
- Defaults at a glance (REST vs gRPC retry counts, retryable codes)
- RetryConfig fields with accurate defaults and migration note for backoff_factor
- Decorrelated jitter formula and Retry-After smear formula with concrete examples
- Server-driven retry hints (REST Retry-After, gRPC grpc-retry-pushback-ms)
- Adaptive concurrency (AIMD): how max_concurrency is a ceiling, not a constant
- Transport differences table (REST vs gRPC retry policy)
- Multi-process / serverless honest-limitations section with orchestrator patterns

Added retries.md to the Guides toctree in docs/index.rst adjacent to error-handling.

Sphinx build passes with -W (warnings as errors). All 4472 unit tests pass.

## Follow-ups

DX-0163 will replace the current Retries section in error-handling.md with a
cross-reference back to this page.
diff --git a/docs/guides/retries.md b/docs/guides/retries.md
@@ -0,0 +1,319 @@
+# Retries and Resilience
+
+The SDK retries failed requests automatically using multiple layers of resilience: per-request
+backoff with decorrelated jitter, server-driven delay hints, and adaptive concurrency for bulk
+operations. This page covers everything you need to know to tune that behavior or to decide when
+to hand retry control back to your orchestrator.
+
+For the exceptions the SDK raises when retries are exhausted, see {doc}`/guides/error-handling`.
+
+---
+
+## Defaults at a Glance
+
+Out of the box — no configuration needed:
+
+| What | Default behavior |
+|------|------------------|
+| Max retries (after initial attempt) | 3 for REST (4 total), 5 for gRPC (6 total) |
+| Retryable HTTP status codes | 408, 429, 500, 502, 503, 504 |
+| Retryable gRPC status codes | UNAVAILABLE, RESOURCE\_EXHAUSTED, ABORTED |
+| Backoff algorithm | Decorrelated jitter — random walk bounded by `backoff_factor` floor and `max_wait` cap |
+| `Retry-After` / `grpc-retry-pushback-ms` | Honored as the floor on the next delay, plus a random smear up to 50% |
+| Adaptive concurrency (bulk paths) | Self-tunes downward on throttling; `max_concurrency` is a ceiling, not a constant |
+
+---
+
+## Configuring Retries
+
+Pass a `RetryConfig` to the `Pinecone` constructor to customize retry behavior for all
+REST requests made by that client:
+
+```python
+from pinecone import Pinecone, RetryConfig
+
+pc = Pinecone(
+    retry_config=RetryConfig(
+        max_retries=5,
+        backoff_factor=0.5,
+        max_wait=60.0,
+        retryable_status_codes=frozenset({429, 500, 503}),
+    )
+)
+```
+
+### `RetryConfig` fields
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `max_retries` | `int` | `3` | Number of retry attempts *after* the initial attempt. Total attempts = `max_retries + 1`. |
+| `backoff_factor` | `float` | `0.25` | Minimum delay floor in seconds (lower bound of decorrelated jitter). See [Jitter strategy](#jitter-strategy) for the full formula. |
+| `max_wait` | `float` | `60.0` | Maximum delay cap in seconds. The jitter algorithm never waits longer than this between retries. |
+| `retryable_status_codes` | `frozenset[int]` | `{408, 429, 500, 502, 503, 504}` | HTTP status codes that trigger a retry. The SDK retries on these codes and raises on all others. |
+
+**`RetryConfig` applies to REST only.** The gRPC transport (Rust-backed) uses its own retry
+policy with 5 retries by default. See [Transport differences](#transport-differences).
+
+### Disabling retries
+
+To disable retries entirely, set `max_retries=0`:
+
+```python
+pc = Pinecone(retry_config=RetryConfig(max_retries=0))
+```
+
+With `max_retries=0`, the SDK makes exactly one attempt and raises immediately on any error.
+
+### Handling rate limits without retrying
+
+By default, 429 responses are retried automatically. To receive `RateLimitError` immediately
+instead (for example, so your orchestrator can handle the retry), exclude 429 from the
+retryable set:
+
+```python
+import time
+
+from pinecone import Pinecone, RetryConfig
+from pinecone.errors import RateLimitError
+
+retry_config = RetryConfig(retryable_status_codes=frozenset({408, 500, 502, 503, 504}))  # no 429
+pc = Pinecone(retry_config=retry_config)
+index = pc.index(host="product-search-abc123.svc.pinecone.io")
+
+try:
+    index.upsert(vectors=[...])
+except RateLimitError as exc:
+    # exc.retry_after is the parsed Retry-After value in seconds, or None
+    wait = exc.retry_after if exc.retry_after is not None else 30.0
+    time.sleep(wait)
+    index.upsert(vectors=[...])
+```
+
+### Migration note: `backoff_factor` semantic change (v8 → v9)
+
+In v8 and earlier, `backoff_factor` was an exponential multiplier. In v9, it became the
+**minimum delay floor in seconds** — the lower bound of the decorrelated jitter window. The
+default also changed from `2.0` to `0.25`. If you pinned `backoff_factor=2.0` in v8, the
+new equivalent that produces a similar mean first-retry delay is `backoff_factor=0.5`; if
+you want to restore the old default behavior (which caused ~4× longer delays than v9), pass
+`backoff_factor=2.0` explicitly. Most users should leave it unset and take the v9 default.
+
+---
+
+## Jitter Strategy
+
+Jitter spreads retries across time so that concurrent clients with the same retry budget
+don't collide on the server at the same moment.
+
+### Decorrelated jitter (backoff path)
+
+When no server hint is present, the SDK uses decorrelated jitter:
+
+```
+delay = uniform(backoff_factor, max(backoff_factor, prev_delay * 3))
+delay = min(delay, max_wait)
+```
+
+Starting from `prev_delay = backoff_factor`, each retry delay is drawn uniformly from
+`[backoff_factor, prev_delay × 3]`, capped at `max_wait`. Because the next window's upper
+bound grows with the previous delay, the sequence performs a random walk that diverges
+naturally without a hard exponential schedule — neighboring clients are unlikely to pick
+the same delay even when they start at the same time.
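+
+The formula above can be sketched in plain Python. This is a minimal illustration, not the
+SDK's internal implementation: `next_delay` and `delay_schedule` are hypothetical names, and
+the default arguments simply mirror the `RetryConfig` defaults listed on this page.
+
+```python
+import random
+
+
+def next_delay(prev_delay: float, backoff_factor: float = 0.25, max_wait: float = 60.0) -> float:
+    """Draw one retry delay from [backoff_factor, prev_delay * 3], capped at max_wait."""
+    upper = max(backoff_factor, prev_delay * 3)
+    return min(random.uniform(backoff_factor, upper), max_wait)
+
+
+def delay_schedule(retries: int = 3, backoff_factor: float = 0.25, max_wait: float = 60.0) -> list[float]:
+    """Hypothetical helper: simulate the delays one client would sleep through."""
+    delays = []
+    prev = backoff_factor  # the walk starts at the floor
+    for _ in range(retries):
+        prev = next_delay(prev, backoff_factor, max_wait)
+        delays.append(prev)
+    return delays
+
+
+# Two simulated clients that start at the same moment.
+print(delay_schedule())
+print(delay_schedule())
+```
+
+Because each draw depends on the previous one, the two printed schedules usually differ
+from the first retry onward.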
+
+**Concrete example with defaults** (`backoff_factor=0.25`, `max_wait=60.0`):
+
+| Attempt | Window (seconds) | Typical delay |
+|---------|-----------------|---------------|
+| 1st retry | [0.25, 0.75] | ~0.5 s |
+| 2nd retry | [0.25, ~1.5] | ~0.9 s |
+| 3rd retry | [0.25, ~2.7] | ~1.5 s |
+
+### `Retry-After` smear (server-hint path)
+
+When a retryable response (most commonly a 429) carries a `Retry-After` header, the SDK uses
+that value as a floor and adds a random smear of up to 50%:
+
+```
+smear = uniform(0.0, retry_after * 0.5)
+delay = retry_after + smear
+```
+
+**Why this matters at scale.** If 1,000 SDK clients all receive `Retry-After: 60` at the
+same moment, naive honor-exactly behavior causes all 1,000 to retry at second 60, a
+thundering herd. With the 50% smear, they spread uniformly across `[60, 90)`, so on average
+about 33 retries arrive in each second of that window instead of 1,000 at once.
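+
+A small simulation makes the effect concrete. This is a sketch under stated assumptions:
+`smeared_delay` and `peak_per_second` are hypothetical helpers written for this page, not
+SDK functions, and the simulation ignores network latency.
+
+```python
+import random
+from collections import Counter
+
+
+def smeared_delay(retry_after: float, smear_fraction: float = 0.5) -> float:
+    """Treat the server hint as a floor and add a uniform smear on top."""
+    smear = random.uniform(0.0, retry_after * smear_fraction)
+    return retry_after + smear
+
+
+def peak_per_second(delays: list[float]) -> int:
+    """Largest number of retries that land in the same one-second bucket."""
+    return max(Counter(int(d) for d in delays).values())
+
+
+# 1,000 simulated clients that all received Retry-After: 60.
+exact = [60.0] * 1000
+smeared = [smeared_delay(60.0) for _ in range(1000)]
+print(peak_per_second(exact))    # every client lands in the same second
+print(peak_per_second(smeared))  # far fewer per second; the exact value varies per run
+```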
+
+---
+
+## Server-Driven Retry Hints
+
+The SDK checks for server-supplied delay hints and uses them as the floor for the next
+retry delay.
+
+### REST: `Retry-After` header
+
+When a retryable response (most commonly a 429) includes a `Retry-After` header, the SDK
+parses it as a non-negative number of seconds. Values that cannot be parsed as a float
+(such as HTTP-date strings) are ignored, and the SDK falls back to decorrelated jitter.
+
+```
+Retry-After: 30
+```
+
+With `Retry-After: 30`, the SDK waits at least 30 seconds and adds a random smear up to
+15 seconds (50% of 30), so the actual delay is in `[30, 45)`.
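+
+The parsing rule described above might look like the following. This is a hedged sketch, not
+the SDK's parser: `parse_retry_after` is a hypothetical name, and rejecting negative or
+non-finite values is an assumption based on the "non-negative number of seconds" wording.
+
+```python
+import math
+from typing import Optional
+
+
+def parse_retry_after(value: Optional[str]) -> Optional[float]:
+    """Return the header value as seconds, or None when it is absent or unusable."""
+    if value is None:
+        return None
+    try:
+        seconds = float(value.strip())
+    except ValueError:
+        return None  # e.g. an HTTP-date string; the caller falls back to jitter
+    if not math.isfinite(seconds) or seconds < 0:
+        return None  # assumption: negative, NaN, and infinite values are ignored
+    return seconds
+
+
+print(parse_retry_after("30"))
+print(parse_retry_after("Wed, 21 Oct 2015 07:28:00 GMT"))
+```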
+
+When retries are exhausted, the SDK raises `RateLimitError` with the parsed
+`Retry-After` value on `exc.retry_after` (or `None` if the header was absent or
+unparseable):
+
+```python
+from pinecone.errors import RateLimitError
+
+try:
+    index.upsert(vectors=[...])
+except RateLimitError as exc:
+    print(exc.retry_after)   # float seconds, or None
+```
+
+### gRPC: `grpc-retry-pushback-ms`
+
+The gRPC transport (Rust-backed) checks for `grpc-retry-pushback-ms` in response
+trailers. This trailer carries the suggested delay in milliseconds. The SDK treats it as a
+floor and applies the same smear of up to 50% on top. If the trailer is absent or invalid,
+the gRPC transport falls back to its own decorrelated jitter backoff (`initial_backoff=100 ms`,
+`max_backoff=1600 ms`).
+
+---
+
+## Adaptive Concurrency for Bulk Operations
+
+When you run bulk upserts or other parallel operations, the SDK observes throttling
+signals and automatically reduces the number of concurrent in-flight requests. When
+throttling subsides, concurrency recovers.
+
+### How it works
+
+Each `Pinecone` client maintains a per-host concurrency limiter. On every retryable
+response (429, 503, or equivalent gRPC code), the limiter halves the effective
+concurrency limit for that host. After a streak of consecutive successful requests, it
+recovers by one slot. The algorithm is AIMD (Additive Increase, Multiplicative Decrease),
+the same control loop used by TCP congestion control.
+
+**You don't configure this directly.** The `max_concurrency` parameter you pass to
+`upsert()` is a *ceiling* — the SDK self-tunes between 1 and that ceiling based on what
+the server can absorb.
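+
+The control loop can be illustrated with a few lines of Python. This is a sketch of the AIMD
+idea only, not the SDK's limiter: the `AimdLimit` class, its method names, the
+`successes_per_increase` streak length, and starting at the ceiling are all assumptions made
+for illustration, and the class is not thread-safe.
+
+```python
+class AimdLimit:
+    """Hypothetical AIMD limit tracker bounded by a caller-supplied ceiling."""
+
+    def __init__(self, ceiling: int, successes_per_increase: int = 10) -> None:
+        if ceiling < 1:
+            raise ValueError("ceiling must be at least 1")
+        self.ceiling = ceiling
+        self.limit = ceiling  # assumption: start at the ceiling
+        self.successes_per_increase = successes_per_increase
+        self._streak = 0
+
+    def on_throttle(self) -> None:
+        """Multiplicative decrease: halve the limit, never dropping below 1."""
+        self.limit = max(1, self.limit // 2)
+        self._streak = 0
+
+    def on_success(self) -> None:
+        """Additive increase: add one slot after a full streak of successes."""
+        self._streak += 1
+        if self._streak >= self.successes_per_increase:
+            self.limit = min(self.ceiling, self.limit + 1)
+            self._streak = 0
+
+
+limiter = AimdLimit(ceiling=8, successes_per_increase=3)
+limiter.on_throttle()
+limiter.on_throttle()
+print(limiter.limit)  # 8 halved twice
+```
+
+Halving on each throttle backs off quickly, while the one-slot recovery probes for capacity
+slowly enough to avoid triggering throttling again right away.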
+
+### Example
+
+```python
+from pinecone import Pinecone
+
+pc = Pinecone()
+index = pc.index(host="product-search-abc123.svc.pinecone.io")
+
+# max_concurrency=8 is the ceiling.
+# If the index throttles during the run, the SDK will automatically
+# reduce effective concurrency (e.g. to 4, then 2) and recover as
+# throttling subsides. No code changes required.
+response = index.upsert(
+    vectors=large_list,
+    batch_size=200,
+    max_concurrency=8,
+)
+print(response.upserted_count)
+```
+
+### Limiter scope
+
+One limiter per index host per `Pinecone` client. If you create two `Pinecone` clients and
+both target the same index, they each maintain an independent limiter — there is no
+cross-client coordination (see [Multi-process and serverless workloads](#multi-process-and-serverless-workloads)).
+
+---
+
+## Transport Differences
+
+The retry plan aims for parity across REST and gRPC. The remaining differences are small:
+
+| Aspect | REST (`Index`, `AsyncIndex`) | gRPC (`GrpcIndex`) |
+|--------|------------------------------|---------------------|
+| Default `max_retries` | 3 (4 total attempts) | 5 (6 total attempts) |
+| Configured via | `RetryConfig` passed to `Pinecone()` | Not `RetryConfig`; `max_retries` can be passed when constructing `GrpcIndex` directly |
+| Retryable codes | `{408, 429, 500, 502, 503, 504}` | UNAVAILABLE, RESOURCE\_EXHAUSTED, ABORTED |
+| Server-hint header | `Retry-After` (seconds, float) | `grpc-retry-pushback-ms` (milliseconds, int) |
+| Jitter algorithm | Decorrelated jitter (Python) | Decorrelated jitter (Rust) |
+| Async support | Yes (`AsyncIndex`) | No — gRPC transport is sync-only |
+| Adaptive concurrency | Yes (REST + gRPC share the same per-host limiter registry) | Yes |
+
+**gRPC retry is not configurable via `RetryConfig`.** If you need to tune gRPC retry
+behavior, construct `GrpcIndex` directly (rather than through `Pinecone.index(grpc=True)`)
+and pass `max_retries` explicitly.
+
+---
+
+## Multi-Process and Serverless Workloads
+
+### What the SDK cannot do
+
+The SDK's retry and adaptive concurrency machinery is per-process. If your workload fans
+out across multiple Lambda invocations, Cloud Run instances, or Kubernetes pods, each
+process runs its own independent retry loop. There is no shared state, no cross-process
+coordination, and no distributed rate-limit awareness.
+
+This means:
+
+- N simultaneously throttled invocations each independently back off and retry. Without
+  coordination, they can collide again at the end of the retry window.
+- The adaptive concurrency limiter starts from scratch for each new process instance (e.g.
+  a fresh Lambda cold start). It cannot inherit a reduced limit that another invocation
+  learned from throttling.
+
+### Recommended pattern for fan-out workloads
+
+Let your orchestrator handle retries at the job level, and keep the SDK's retry window
+narrow:
+
+```python
+from pinecone import Pinecone, RetryConfig
+from pinecone.errors import RateLimitError
+
+# Set max_retries=0 or 1: one attempt (or one fast retry), then raise.
+# Let the SQS visibility timeout / Cloud Tasks retry / Step Functions catch
+# handle the outer retry loop.
+pc = Pinecone(retry_config=RetryConfig(max_retries=1))
+index = pc.index(host="product-search-abc123.svc.pinecone.io")
+
+try:
+    response = index.upsert(vectors=batch, batch_size=100, max_concurrency=4)
+except RateLimitError:
+    # Re-raise so the orchestrator sees a task failure and schedules a retry
+    # after the visibility timeout expires — which is already longer than
+    # Pinecone's Retry-After window in most configurations.
+    raise
+```
+
+### Why `Retry-After` smear still helps
+
+Even without coordination, the SDK's 50% smear on `Retry-After` provides statistical
+relief. If N independent Lambda invocations all receive `Retry-After: 60`, they don't all
+retry at second 60 — the smear spreads them across `[60, 90)`. The larger N is, the more
+this matters.
+
+### Summary: when to trust the SDK vs. the orchestrator
+
+| Scenario | Recommended approach |
+|----------|----------------------|
+| Single-process bulk upsert | Use defaults — SDK handles everything |
+| Long-running worker (persistent process) | Use defaults — adaptive limiter learns and recovers |
+| Lambda / Cloud Functions / Cloud Run (stateless) | `max_retries=1`, catch `RateLimitError`, re-raise for orchestrator retry |
+| Fan-out across many pods (e.g. Kubernetes Job) | Same as stateless — set low `max_retries`, rely on orchestrator |
+| Strict per-invocation SLA (must not block) | `max_retries=0`, `retryable_status_codes=frozenset()` — raise immediately |
+
+---
+
+## See Also
+
+- {doc}`/guides/error-handling` — Exception hierarchy, `RateLimitError.retry_after`, and how to catch specific errors
+- {doc}`/guides/performance` — Bulk upsert patterns, `max_concurrency` tuning, and transport selection
+- {doc}`/guides/sync-vs-async` — When to use the async client and how to manage concurrency with `asyncio`
diff --git a/docs/index.rst b/docs/index.rst
@@ -16,6 +16,7 @@ Pinecone Python SDK
    guides/concepts
    guides/sync-vs-async
    guides/error-handling
+   guides/retries
    guides/pagination
    guides/performance