Commit fdadfde

docs(retries): add guides/retries.md retry/resilience documentation
## Purpose

The retry-resilience plan (DX-0147–DX-0161) added jitter, Retry-After smear, grpc-retry-pushback-ms support, and AIMD adaptive concurrency. All of that behavior now deserves its own dedicated guide rather than a paragraph in error-handling.md.

## Solution

Created `docs/guides/retries.md` covering:

- Defaults at a glance (REST vs gRPC retry counts, retryable codes)
- RetryConfig fields with accurate defaults and migration note for backoff_factor
- Decorrelated jitter formula and Retry-After smear formula with concrete examples
- Server-driven retry hints (REST Retry-After, gRPC grpc-retry-pushback-ms)
- Adaptive concurrency (AIMD): how max_concurrency is a ceiling, not a constant
- Transport differences table (REST vs gRPC retry policy)
- Multi-process / serverless honest-limitations section with orchestrator patterns

Added retries.md to the Guides toctree in docs/index.rst adjacent to error-handling.

Sphinx build passes with -W (warnings as errors). All 4472 unit tests pass.

## Follow-ups

DX-0163 will replace the current Retries section in error-handling.md with a cross-reference back to this page.
1 parent 929ce83 commit fdadfde

2 files changed

Lines changed: 320 additions & 0 deletions

File tree

docs/guides/retries.md

Lines changed: 319 additions & 0 deletions
@@ -0,0 +1,319 @@
# Retries and Resilience

The SDK retries failed requests automatically using multiple layers of resilience: per-request
backoff with decorrelated jitter, server-driven delay hints, and adaptive concurrency for bulk
operations. This page covers everything you need to know to tune that behavior or to decide when
to hand retry control back to your orchestrator.

For the exceptions the SDK raises when retries are exhausted, see {doc}`/guides/error-handling`.

---

## Defaults at a Glance

Out of the box, with no configuration:

| What | Default behavior |
|------|------------------|
| Max retries (after initial attempt) | 3 for REST (4 total), 5 for gRPC (6 total) |
| Retryable HTTP status codes | 408, 429, 500, 502, 503, 504 |
| Retryable gRPC status codes | UNAVAILABLE, RESOURCE\_EXHAUSTED, ABORTED |
| Backoff algorithm | Decorrelated jitter: a random walk bounded by the `backoff_factor` floor and the `max_wait` cap |
| `Retry-After` / `grpc-retry-pushback-ms` | Honored as the floor on the next delay, plus a random smear of up to 50% |
| Adaptive concurrency (bulk paths) | Self-tunes downward on throttling; `max_concurrency` is a ceiling, not a constant |

---

## Configuring Retries

Pass a `RetryConfig` to the `Pinecone` constructor to customize retry behavior for all
REST requests made by that client:

```python
from pinecone import Pinecone, RetryConfig

pc = Pinecone(
    retry_config=RetryConfig(
        max_retries=5,
        backoff_factor=0.5,
        max_wait=60.0,
        retryable_status_codes=frozenset({429, 500, 503}),
    )
)
```

### `RetryConfig` fields

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `max_retries` | `int` | `3` | Number of retry attempts *after* the initial attempt. Total attempts = `max_retries + 1`. |
| `backoff_factor` | `float` | `0.25` | Minimum delay floor in seconds (lower bound of decorrelated jitter). See [Jitter strategy](#jitter-strategy) for the full formula. |
| `max_wait` | `float` | `60.0` | Maximum delay cap in seconds. The jitter algorithm never waits longer than this between retries. |
| `retryable_status_codes` | `frozenset[int]` | `{408, 429, 500, 502, 503, 504}` | HTTP status codes that trigger a retry. The SDK retries on these codes and raises on all others. |

**`RetryConfig` applies to REST only.** The gRPC transport (Rust-backed) uses its own retry
policy, which `RetryConfig` does not affect, with 5 retries by default. See
[Transport differences](#transport-differences).

### Disabling retries

To disable retries entirely, set `max_retries=0`:

```python
pc = Pinecone(retry_config=RetryConfig(max_retries=0))
```

With `max_retries=0`, the SDK makes exactly one attempt and raises immediately on any error.

### Handling rate limits without retrying

By default, 429 responses are retried automatically. To receive `RateLimitError` immediately
instead (for example, so your orchestrator can handle the retry), exclude 429 from the
retryable set:

```python
import time

from pinecone import Pinecone, RetryConfig
from pinecone.errors import RateLimitError

pc = Pinecone(
    retry_config=RetryConfig(
        retryable_status_codes=frozenset({408, 500, 502, 503, 504}),  # no 429
    )
)
index = pc.index(host="product-search-abc123.svc.pinecone.io")

try:
    index.upsert(vectors=[...])
except RateLimitError as exc:
    # exc.retry_after is the parsed Retry-After value in seconds, or None.
    # Compare against None so that a legitimate 0.0 is not replaced by the fallback.
    wait = exc.retry_after if exc.retry_after is not None else 30.0
    time.sleep(wait)
    index.upsert(vectors=[...])
```

### Migration note: `backoff_factor` semantic change (v8 → v9)

In v8 and earlier, `backoff_factor` was an exponential multiplier. In v9, it became the
**minimum delay floor in seconds**, that is, the lower bound of the decorrelated jitter window.
The default also changed from `2.0` to `0.25`.

- If you pinned `backoff_factor=2.0` in v8, the v9 value that produces a similar mean
  first-retry delay is `backoff_factor=0.5`.
- If you want to restore the old default behavior (which caused ~4× longer delays than v9),
  pass `backoff_factor=2.0` explicitly.

Most users should leave the field unset and take the v9 default.

---

## Jitter Strategy

Jitter spreads retries across time so that concurrent clients with the same retry budget
don't collide on the server at the same moment.

### Decorrelated jitter (backoff path)

When no server hint is present, the SDK uses decorrelated jitter:

```
delay = uniform(backoff_factor, max(backoff_factor, prev_delay * 3))
delay = min(delay, max_wait)
```

Starting from `prev_delay = backoff_factor`, each retry delay is drawn uniformly from
`[backoff_factor, prev_delay × 3]`, capped at `max_wait`. Because the next window's upper
bound grows with the previous delay, the sequence performs a random walk whose window widens
on average without following a hard exponential schedule. Neighboring clients are unlikely
to pick the same delay even when they start at the same time.

**Concrete example with defaults** (`backoff_factor=0.25`, `max_wait=60.0`). Each window
assumes the previous retry landed on the typical (midpoint) delay of the row above:

| Attempt | Window (seconds) | Typical delay |
|---------|-----------------|---------------|
| 1st retry | [0.25, 0.75] | ~0.5 s |
| 2nd retry | [0.25, ~1.5] | ~0.9 s |
| 3rd retry | [0.25, ~2.6] | ~1.4 s |

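The formula above can be sketched in a few lines of Python. This is an illustrative sketch only: `decorrelated_delays` and `worst_case_total` are hypothetical helper names, not SDK functions, and the SDK's internal implementation may differ in detail.

```python
import random


def decorrelated_delays(max_retries=3, backoff_factor=0.25, max_wait=60.0, rng=random):
    """Yield one delay per retry using the decorrelated jitter formula.

    Hypothetical helper for illustration; not part of the SDK.
    """
    prev_delay = backoff_factor
    for _ in range(max_retries):
        upper = max(backoff_factor, prev_delay * 3)
        delay = min(rng.uniform(backoff_factor, upper), max_wait)
        yield delay
        prev_delay = delay


def worst_case_total(max_retries=3, backoff_factor=0.25, max_wait=60.0):
    """Upper bound on total sleep time if every draw lands on the top of its window."""
    total, prev_delay = 0.0, backoff_factor
    for _ in range(max_retries):
        prev_delay = min(max(backoff_factor, prev_delay * 3), max_wait)
        total += prev_delay
    return total


if __name__ == "__main__":
    print([round(d, 2) for d in decorrelated_delays()])
    print(worst_case_total())  # 0.75 + 2.25 + 6.75 = 9.75 with the defaults
```

With the documented defaults, the worst-case bound says a request that exhausts all three retries spends at most 9.75 seconds sleeping between attempts, which can help when sizing an outer timeout.
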
### `Retry-After` smear (server-hint path)

When a 429 response carries a `Retry-After` header, the SDK uses that value as a floor
and adds a random smear of up to 50%:

```
smear = uniform(0.0, retry_after * 0.5)
delay = retry_after + smear
```

**Why this matters at scale.** If 1,000 SDK clients all receive `Retry-After: 60` at the
same moment, naive honor-exactly behavior causes all 1,000 to retry at second 60, which is a
thundering herd. With the 50% smear, they spread uniformly across `[60, 90)`, so the retries
arrive at an average of about 33 per second over 30 seconds instead of 1,000 at once.

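To get a feel for the spreading effect, the smear can be simulated. The sketch below is not SDK code: `smeared_delay` and `peak_retries_per_second` are hypothetical names, and the simulation assumes every client receives the same hint at the same instant.

```python
import random
from collections import Counter


def smeared_delay(retry_after, rng=random):
    """Server hint as the floor, plus a uniform smear of up to 50%."""
    smear = rng.uniform(0.0, retry_after * 0.5)
    return retry_after + smear


def peak_retries_per_second(clients=1000, retry_after=60.0, seed=0):
    """Largest number of simulated clients retrying within any one-second bucket."""
    rng = random.Random(seed)
    buckets = Counter(int(smeared_delay(retry_after, rng)) for _ in range(clients))
    return max(buckets.values())


if __name__ == "__main__":
    print(peak_retries_per_second())
```

Without the smear, the peak bucket would contain all 1,000 clients; with it, the busiest second should hold only a few dozen.
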

---

## Server-Driven Retry Hints

The SDK checks for server-supplied delay hints and uses them as the floor for the next
retry delay.

### REST: `Retry-After` header

When a retryable response (most commonly a 429) includes a `Retry-After` header, the SDK
parses it as a non-negative number of seconds. Values that cannot be parsed as a float
(such as HTTP-date strings) are ignored, and the SDK falls back to decorrelated jitter.

```
Retry-After: 30
```

With `Retry-After: 30`, the SDK waits at least 30 seconds and adds a random smear up to
15 seconds (50% of 30), so the actual delay is in `[30, 45)`.

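The parsing rule described above might look roughly like the following. This is a sketch under assumptions: `parse_retry_after` is a hypothetical helper, and treating negative or non-finite values as unusable is an assumption rather than documented SDK behavior.

```python
import math


def parse_retry_after(value):
    """Return the delay in seconds, or None if the header is unusable.

    Hypothetical helper; the SDK's internal parser may differ.
    """
    if value is None:
        return None
    try:
        seconds = float(value.strip())
    except ValueError:
        return None  # e.g. an HTTP-date such as "Wed, 21 Oct 2015 07:28:00 GMT"
    # Assumption: negative, NaN, and infinite values are treated like unparseable ones.
    if not math.isfinite(seconds) or seconds < 0:
        return None
    return seconds
```

A `None` result corresponds to the fallback case, where the delay comes from decorrelated jitter instead of the server hint.
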
When retries are exhausted, the SDK raises `RateLimitError` with the parsed
`Retry-After` value on `exc.retry_after` (or `None` if the header was absent or
unparseable):

```python
from pinecone.errors import RateLimitError

try:
    index.upsert(vectors=[...])
except RateLimitError as exc:
    print(exc.retry_after)  # float seconds, or None
```

### gRPC: `grpc-retry-pushback-ms`

The gRPC transport (Rust-backed) checks for `grpc-retry-pushback-ms` in response
trailers. This trailer carries the suggested delay in milliseconds. The SDK treats it as a
floor and applies the same smear of up to 50% on top. If the trailer is absent or invalid,
the gRPC transport falls back to its own decorrelated jitter backoff
(`initial_backoff=100 ms`, `max_backoff=1600 ms`).

---

## Adaptive Concurrency for Bulk Operations

When you run bulk upserts or other parallel operations, the SDK observes throttling
signals and automatically reduces the number of concurrent in-flight requests. When
throttling subsides, concurrency recovers.

### How it works

Each `Pinecone` client maintains a per-host concurrency limiter. On every retryable
response (429, 503, or equivalent gRPC code), the limiter halves the effective
concurrency limit for that host. After a streak of consecutive successful requests, it
recovers by one slot. The algorithm is AIMD (Additive Increase, Multiplicative Decrease),
the same control loop used by TCP congestion control.

**You don't configure this directly.** The `max_concurrency` parameter you pass to
`upsert()` is a *ceiling*: the SDK self-tunes between 1 and that ceiling based on what
the server can absorb.

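The control loop can be illustrated with a toy limiter. This is a sketch, not the SDK's implementation: the class name `AimdLimiter`, starting at the ceiling, and the streak length of 10 successes are all assumptions made for the example.

```python
class AimdLimiter:
    """Toy AIMD limit tracker; hypothetical, not the SDK's implementation."""

    def __init__(self, ceiling, successes_per_increase=10):
        self.ceiling = ceiling
        self.limit = ceiling  # assumption: start at the ceiling
        self.successes_per_increase = successes_per_increase
        self._streak = 0

    def on_throttle(self):
        # Multiplicative decrease: halve the limit, never below 1.
        self.limit = max(1, self.limit // 2)
        self._streak = 0

    def on_success(self):
        # Additive increase: one slot back after a streak of successes.
        self._streak += 1
        if self._streak >= self.successes_per_increase:
            self.limit = min(self.ceiling, self.limit + 1)
            self._streak = 0


if __name__ == "__main__":
    limiter = AimdLimiter(ceiling=8)
    limiter.on_throttle()
    limiter.on_throttle()
    print(limiter.limit)  # 8 -> 4 -> 2
```

Halving reacts quickly to throttling, while the one-slot recovery probes for capacity slowly, which is why the limit never exceeds the `max_concurrency` ceiling and never drops below 1.
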
### Example

```python
from pinecone import Pinecone

pc = Pinecone()
index = pc.index(host="product-search-abc123.svc.pinecone.io")

# max_concurrency=8 is the ceiling.
# If the index throttles during the run, the SDK will automatically
# reduce effective concurrency (e.g. to 4, then 2) and recover as
# throttling subsides. No code changes required.
response = index.upsert(
    vectors=large_list,
    batch_size=200,
    max_concurrency=8,
)
print(response.upserted_count)
```

### Limiter scope

One limiter per index host per `Pinecone` client. If you create two `Pinecone` clients and
both target the same index, they each maintain an independent limiter; there is no
cross-client coordination (see [Multi-process and serverless workloads](#multi-process-and-serverless-workloads)).

---

## Transport Differences

The retry design aims for parity across REST and gRPC. The remaining differences are small:

| Aspect | REST (`Index`, `AsyncIndex`) | gRPC (`GrpcIndex`) |
|--------|------------------------------|---------------------|
| Default `max_retries` | 3 (4 total attempts) | 5 (6 total attempts) |
| Configured via | `RetryConfig` passed to `Pinecone()` | `max_retries` passed to `GrpcIndex` directly (not via `RetryConfig`) |
| Retryable codes | `{408, 429, 500, 502, 503, 504}` | UNAVAILABLE, RESOURCE\_EXHAUSTED, ABORTED |
| Server-hint header | `Retry-After` (seconds, float) | `grpc-retry-pushback-ms` (milliseconds, int) |
| Jitter algorithm | Decorrelated jitter (Python) | Decorrelated jitter (Rust) |
| Async support | Yes (`AsyncIndex`) | No (gRPC transport is sync-only) |
| Adaptive concurrency | Yes (REST and gRPC share the same per-host limiter registry) | Yes |

**gRPC retry is not configurable via `RetryConfig`.** If you need to tune gRPC retry
behavior, construct `GrpcIndex` directly (rather than through `Pinecone.index(grpc=True)`)
and pass `max_retries` explicitly.

---

## Multi-Process and Serverless Workloads

### What the SDK cannot do

The SDK's retry and adaptive concurrency machinery is per-process. If your workload fans
out across multiple Lambda invocations, Cloud Run instances, or Kubernetes pods, each
process runs its own independent retry loop. There is no shared state, no cross-process
coordination, and no distributed rate-limit awareness.

This means:

- N simultaneously throttled invocations each independently back off and retry. Without
  coordination, they can collide again at the end of the retry window.
- The adaptive concurrency limiter starts from scratch for each new process instance (e.g.
  a fresh Lambda cold start). It cannot inherit a reduced limit that another invocation
  learned from throttling.

### Recommended pattern for fan-out workloads

Let your orchestrator handle retries at the job level, and keep the SDK's retry window
narrow:

```python
from pinecone import Pinecone, RetryConfig
from pinecone.errors import RateLimitError

# Set max_retries=0 or 1: one attempt (or one fast retry), then raise.
# Let the SQS visibility timeout / Cloud Tasks retry / Step Functions catch
# handle the outer retry loop.
pc = Pinecone(retry_config=RetryConfig(max_retries=1))
index = pc.index(host="product-search-abc123.svc.pinecone.io")

try:
    # `batch` is the list of vectors handed to this invocation.
    response = index.upsert(vectors=batch, batch_size=100, max_concurrency=4)
except RateLimitError as exc:
    # Re-raise so the orchestrator sees a task failure and schedules a retry
    # after the visibility timeout expires, which is already longer than
    # Pinecone's Retry-After window in most configurations.
    raise
```

### Why `Retry-After` smear still helps

Even without coordination, the SDK's 50% smear on `Retry-After` provides statistical
relief. If N independent Lambda invocations all receive `Retry-After: 60`, they don't all
retry at second 60; the smear spreads them across `[60, 90)`. The larger N is, the more
this matters.

### Summary: when to trust the SDK vs. the orchestrator

| Scenario | Recommended approach |
|----------|----------------------|
| Single-process bulk upsert | Use defaults: the SDK handles everything |
| Long-running worker (persistent process) | Use defaults: the adaptive limiter learns and recovers |
| Lambda / Cloud Functions / Cloud Run (stateless) | `max_retries=1`, catch `RateLimitError`, re-raise for orchestrator retry |
| Fan-out across many pods (e.g. Kubernetes Job) | Same as stateless: set a low `max_retries` and rely on the orchestrator |
| Strict per-invocation SLA (must not block) | `max_retries=0`, `retryable_status_codes=frozenset()`: raise immediately |

---

## See Also

- {doc}`/guides/error-handling`: Exception hierarchy, `RateLimitError.retry_after`, and how to catch specific errors
- {doc}`/guides/performance`: Bulk upsert patterns, `max_concurrency` tuning, and transport selection
- {doc}`/guides/sync-vs-async`: When to use the async client and how to manage concurrency with `asyncio`

docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -16,6 +16,7 @@ Pinecone Python SDK
 guides/concepts
 guides/sync-vs-async
 guides/error-handling
+guides/retries
 guides/pagination
 guides/performance
