| 1 | +# Retries and Resilience |
| 2 | + |
| 3 | +The SDK retries failed requests automatically using three layers of resilience: per-request |
| 4 | +backoff with decorrelated jitter, server-driven delay hints, and adaptive concurrency for bulk |
| 5 | +operations. This page explains how to tune that behavior and when to hand retry control back |
| 6 | +to your orchestrator. |
| 7 | + |
| 8 | +For the exceptions the SDK raises when retries are exhausted, see {doc}`/guides/error-handling`. |
| 9 | + |
| 10 | +--- |
| 11 | + |
| 12 | +## Defaults at a Glance |
| 13 | + |
| 14 | +Out of the box — no configuration needed: |
| 15 | + |
| 16 | +| What | Default behavior | |
| 17 | +|------|------------------| |
| 18 | +| Max retries (after initial attempt) | 3 for REST (4 total), 5 for gRPC (6 total) | |
| 19 | +| Retryable HTTP status codes | 408, 429, 500, 502, 503, 504 | |
| 20 | +| Retryable gRPC status codes | UNAVAILABLE, RESOURCE\_EXHAUSTED, ABORTED | |
| 21 | +| Backoff algorithm | Decorrelated jitter — random walk bounded by `backoff_factor` floor and `max_wait` cap | |
| 22 | +| `Retry-After` / `grpc-retry-pushback-ms` | Honored as the floor on the next delay, plus a random smear up to 50% | |
| 23 | +| Adaptive concurrency (bulk paths) | Self-tunes downward on throttling; `max_concurrency` is a ceiling, not a constant | |
| 24 | + |
| 25 | +--- |
| 26 | + |
| 27 | +## Configuring Retries |
| 28 | + |
| 29 | +Pass a `RetryConfig` to the `Pinecone` constructor to customize retry behavior for all |
| 30 | +REST requests made by that client: |
| 31 | + |
| 32 | +```python |
| 33 | +from pinecone import Pinecone, RetryConfig |
| 34 | + |
| 35 | +pc = Pinecone( |
| 36 | +    retry_config=RetryConfig( |
| 37 | +        max_retries=5, |
| 38 | +        backoff_factor=0.5, |
| 39 | +        max_wait=60.0, |
| 40 | +        retryable_status_codes=frozenset({429, 500, 503}), |
| 41 | +    ) |
| 42 | +) |
| 43 | +``` |
| 44 | + |
| 45 | +### `RetryConfig` fields |
| 46 | + |
| 47 | +| Field | Type | Default | Description | |
| 48 | +|-------|------|---------|-------------| |
| 49 | +| `max_retries` | `int` | `3` | Number of retry attempts *after* the initial attempt. Total attempts = `max_retries + 1`. | |
| 50 | +| `backoff_factor` | `float` | `0.25` | Minimum delay floor in seconds (lower bound of decorrelated jitter). See [Jitter strategy](#jitter-strategy) for the full formula. | |
| 51 | +| `max_wait` | `float` | `60.0` | Maximum delay cap in seconds. The jitter algorithm never waits longer than this between retries. | |
| 52 | +| `retryable_status_codes` | `frozenset[int]` | `{408, 429, 500, 502, 503, 504}` | HTTP status codes that trigger a retry. The SDK retries on these codes and raises on all others. | |
| 53 | + |
| 54 | +**`RetryConfig` applies to REST only.** The gRPC transport (Rust-backed) uses its own retry |
| 55 | +policy, with 5 retries by default. See [Transport differences](#transport-differences). |
| 56 | + |
| 57 | +### Disabling retries |
| 58 | + |
| 59 | +To disable retries entirely, set `max_retries=0`: |
| 60 | + |
| 61 | +```python |
| 62 | +pc = Pinecone(retry_config=RetryConfig(max_retries=0)) |
| 63 | +``` |
| 64 | + |
| 65 | +With `max_retries=0`, the SDK makes exactly one attempt and raises immediately on any error. |
| 66 | + |
| 67 | +### Handling rate limits without retrying |
| 68 | + |
| 69 | +By default, 429 responses are retried automatically. To receive `RateLimitError` immediately |
| 70 | +instead (for example, so your orchestrator can handle the retry), exclude 429 from the |
| 71 | +retryable set: |
| 72 | + |
| 73 | +```python |
|  | +import time |
|  | + |
| 74 | +from pinecone import Pinecone, RetryConfig |
| 75 | +from pinecone.errors import RateLimitError |
| 76 | + |
| 77 | +pc = Pinecone( |
| 78 | +    retry_config=RetryConfig( |
| 79 | +        retryable_status_codes=frozenset({408, 500, 502, 503, 504}),  # no 429 |
| 80 | +    ) |
| 81 | +) |
|  | +index = pc.index(host="product-search-abc123.svc.pinecone.io") |
| 82 | + |
| 83 | +try: |
| 84 | +    index.upsert(vectors=[...]) |
| 85 | +except RateLimitError as exc: |
| 86 | +    # exc.retry_after is the parsed Retry-After value in seconds, or None |
| 87 | +    wait = exc.retry_after if exc.retry_after is not None else 30.0 |
| 88 | +    time.sleep(wait) |
| 89 | +    index.upsert(vectors=[...]) |
| 90 | +``` |
| 91 | + |
| 92 | +### Migration note: `backoff_factor` semantic change (v8 → v9) |
| 93 | + |
| 94 | +In v8 and earlier, `backoff_factor` was an exponential multiplier. In v9, it became the |
| 95 | +**minimum delay floor in seconds** — the lower bound of the decorrelated jitter window. The |
| 96 | +default also changed from `2.0` to `0.25`. If you pinned `backoff_factor=2.0` in v8, the |
| 97 | +v9 value that produces a similar mean first-retry delay is `backoff_factor=0.5`. Passing |
| 98 | +`backoff_factor=2.0` in v9 does not bring back the old multiplier; it sets a 2-second floor, which |
| 99 | +gives delays on the scale of the old default (~4× longer than v9). Most users should leave it unset. |
| 100 | + |
| 101 | +--- |
| 102 | + |
| 103 | +## Jitter Strategy |
| 104 | + |
| 105 | +Jitter spreads retries across time so that concurrent clients with the same retry budget |
| 106 | +don't collide on the server at the same moment. |
| 107 | + |
| 108 | +### Decorrelated jitter (backoff path) |
| 109 | + |
| 110 | +When no server hint is present, the SDK uses decorrelated jitter: |
| 111 | + |
| 112 | +``` |
| 113 | +delay = uniform(backoff_factor, max(backoff_factor, prev_delay * 3)) |
| 114 | +delay = min(delay, max_wait) |
| 115 | +``` |
| 116 | + |
| 117 | +Starting from `prev_delay = backoff_factor`, each retry delay is drawn uniformly from |
| 118 | +`[backoff_factor, prev_delay × 3]`, capped at `max_wait`. Because the next window's upper |
| 119 | +bound grows with the previous delay, the sequence performs a random walk that diverges |
| 120 | +naturally without a hard exponential schedule — neighboring clients are unlikely to pick |
| 121 | +the same delay even when they start at the same time. |
| 122 | + |
| 123 | +**Concrete example with defaults** (`backoff_factor=0.25`, `max_wait=60.0`), assuming each previous delay landed at the midpoint of its window: |
| 124 | + |
| 125 | +| Attempt | Window (seconds) | Typical delay | |
| 126 | +|---------|-----------------|---------------| |
| 127 | +| 1st retry | [0.25, 0.75] | ~0.5 s | |
| 128 | +| 2nd retry | [0.25, ~1.5] | ~0.9 s | |
| 129 | +| 3rd retry | [0.25, ~2.6] | ~1.4 s | |
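
The backoff formula above can be sketched in a few lines of Python. This is an illustrative model, not the SDK's implementation: `decorrelated_delays` and `worst_case_total_wait` are hypothetical helper names, and feeding the capped delay back in as `prev_delay` is an assumption about how the walk is carried forward.

```python
import random


def decorrelated_delays(retries, backoff_factor=0.25, max_wait=60.0, rng=random):
    """Yield one sleep duration per retry, following the formula above."""
    prev_delay = backoff_factor
    for _ in range(retries):
        upper = max(backoff_factor, prev_delay * 3)
        delay = min(rng.uniform(backoff_factor, upper), max_wait)
        yield delay
        # Assumption: the capped delay seeds the next window.
        prev_delay = delay


def worst_case_total_wait(retries, backoff_factor=0.25, max_wait=60.0):
    """Sum of the largest possible delays (every draw lands on its upper bound)."""
    total, prev_delay = 0.0, backoff_factor
    for _ in range(retries):
        prev_delay = min(max(backoff_factor, prev_delay * 3), max_wait)
        total += prev_delay
    return total


# A seeded generator keeps the sample reproducible between runs.
delays = list(decorrelated_delays(3, rng=random.Random(7)))
print([round(d, 2) for d in delays])
print(worst_case_total_wait(3))
```

Under these assumptions, the worst case for the default REST budget is 0.75 + 2.25 + 6.75 = 9.75 seconds of total sleep across three retries, so `max_wait` only comes into play with larger `max_retries` or `backoff_factor` values.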
| 130 | + |
| 131 | +### `Retry-After` smear (server-hint path) |
| 132 | + |
| 133 | +When a 429 response carries a `Retry-After` header, the SDK uses that value as a floor |
| 134 | +and adds a random smear of up to 50%: |
| 135 | + |
| 136 | +``` |
| 137 | +smear = uniform(0.0, retry_after * 0.5) |
| 138 | +delay = retry_after + smear |
| 139 | +``` |
| 140 | + |
| 141 | +**Why this matters at scale.** If 1,000 SDK clients all receive `Retry-After: 60` at the |
| 142 | +same moment, naive honor-exactly behavior causes all 1,000 to retry at second 60 — a |
| 143 | +thundering herd. With the 50% smear, they spread randomly across `[60, 90)`, reducing |
| 144 | +peak re-collision pressure by roughly 3×. |
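
To see the spreading effect, the smear formula can be simulated for a batch of clients. The sketch below is illustrative only: `smeared_delay` is a hypothetical helper that mirrors the two-line formula above, and grouping retries into one-second buckets is just one convenient way to look at the distribution.

```python
import random
from collections import Counter


def smeared_delay(retry_after, rng):
    """Server hint as the floor, plus a random smear of up to 50%."""
    smear = rng.uniform(0.0, retry_after * 0.5)
    return retry_after + smear


rng = random.Random(0)  # seeded so the run is reproducible
delays = [smeared_delay(60.0, rng) for _ in range(1000)]

# Count how many of the 1,000 clients retry within each one-second bucket.
per_second = Counter(int(d) for d in delays)
peak = max(per_second.values())
print(f"busiest second receives {peak} of {len(delays)} retries")
```

Without the smear, every client would land in the same bucket; with it, the busiest bucket holds only a fraction of the clients.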
| 145 | + |
| 146 | +--- |
| 147 | + |
| 148 | +## Server-Driven Retry Hints |
| 149 | + |
| 150 | +The SDK checks for server-supplied delay hints and uses them as the floor for the next |
| 151 | +retry delay. |
| 152 | + |
| 153 | +### REST: `Retry-After` header |
| 154 | + |
| 155 | +When a retryable response (most commonly a 429) includes a `Retry-After` header, the SDK |
| 156 | +parses it as a non-negative number of seconds. Values that cannot be parsed as a float |
| 157 | +(such as HTTP-date strings) are ignored, and the SDK falls back to decorrelated jitter. |
| 158 | + |
| 159 | +``` |
| 160 | +Retry-After: 30 |
| 161 | +``` |
| 162 | + |
| 163 | +With `Retry-After: 30`, the SDK waits at least 30 seconds and adds a random smear up to |
| 164 | +15 seconds (50% of 30), so the actual delay is in `[30, 45)`. |
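
The parsing rule described above might look roughly like the following. This is a sketch under stated assumptions, not the SDK's code: `parse_retry_after` is a hypothetical name, and rejecting negative or non-finite values is one reading of "non-negative number of seconds".

```python
import math


def parse_retry_after(value):
    """Return the header value as float seconds, or None if it is unusable."""
    if value is None:
        return None
    try:
        seconds = float(value)
    except (TypeError, ValueError):
        # HTTP-date strings and other non-numeric values end up here.
        return None
    # Assumption: negative, NaN, and infinite values are treated as unparseable.
    if not math.isfinite(seconds) or seconds < 0:
        return None
    return seconds


print(parse_retry_after("30"))
print(parse_retry_after("Wed, 21 Oct 2015 07:28:00 GMT"))
```

A `None` result corresponds to the fallback case, where the delay comes from decorrelated jitter instead of the server hint.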
| 165 | + |
| 166 | +When retries are exhausted on a 429, the SDK raises `RateLimitError` with the parsed |
| 167 | +`Retry-After` value on `exc.retry_after` (or `None` if the header was absent or |
| 168 | +unparseable): |
| 169 | + |
| 170 | +```python |
| 171 | +from pinecone.errors import RateLimitError |
| 172 | + |
| 173 | +try: |
| 174 | +    index.upsert(vectors=[...]) |
| 175 | +except RateLimitError as exc: |
| 176 | +    print(exc.retry_after)  # float seconds, or None |
| 177 | +``` |
| 178 | + |
| 179 | +### gRPC: `grpc-retry-pushback-ms` |
| 180 | + |
| 181 | +The gRPC transport (Rust-backed) checks for `grpc-retry-pushback-ms` in response |
| 182 | +trailers. This trailer carries the suggested delay in milliseconds. The SDK treats it as a |
| 183 | +floor and applies the same smear of up to 50%. If the trailer is absent or invalid, the gRPC |
| 184 | +transport falls back to its own decorrelated jitter backoff (`initial_backoff=100 ms`, |
| 185 | +`max_backoff=1600 ms`). |
| 186 | + |
| 187 | +--- |
| 188 | + |
| 189 | +## Adaptive Concurrency for Bulk Operations |
| 190 | + |
| 191 | +When you run bulk upserts or other parallel operations, the SDK observes throttling |
| 192 | +signals and automatically reduces the number of concurrent in-flight requests. When |
| 193 | +throttling subsides, concurrency recovers. |
| 194 | + |
| 195 | +### How it works |
| 196 | + |
| 197 | +Each `Pinecone` client maintains a per-host concurrency limiter. On every throttling |
| 198 | +response (429, 503, or equivalent gRPC code), the limiter halves the effective |
| 199 | +concurrency limit for that host. After a streak of consecutive successful requests, it |
| 200 | +recovers by one slot. The algorithm is AIMD (Additive Increase, Multiplicative Decrease), |
| 201 | +the same control loop used by TCP congestion control. |
| 202 | + |
| 203 | +**You don't configure this directly.** The `max_concurrency` parameter you pass to |
| 204 | +`upsert()` is a *ceiling* — the SDK self-tunes between 1 and that ceiling based on what |
| 205 | +the server can absorb. |
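
The AIMD loop described above can be modeled with a small class. This is a minimal sketch, not the SDK's limiter: `AimdLimiter` is a hypothetical name, the streak length of 10 is an illustrative value (the page does not state the real one), and starting at the ceiling is an assumption.

```python
class AimdLimiter:
    """Toy AIMD concurrency limiter: halve on throttle, add one after a success streak."""

    def __init__(self, ceiling, recovery_streak=10):
        self.ceiling = ceiling                  # plays the role of max_concurrency
        self.limit = ceiling                    # assumption: start at the ceiling
        self.recovery_streak = recovery_streak  # hypothetical streak length
        self._successes = 0

    def on_throttle(self):
        # Multiplicative decrease, never dropping below one in-flight request.
        self.limit = max(1, self.limit // 2)
        self._successes = 0

    def on_success(self):
        # Additive increase: one extra slot per completed success streak.
        self._successes += 1
        if self._successes >= self.recovery_streak:
            self.limit = min(self.ceiling, self.limit + 1)
            self._successes = 0


limiter = AimdLimiter(ceiling=8)
limiter.on_throttle()       # 8 -> 4
limiter.on_throttle()       # 4 -> 2
for _ in range(10):
    limiter.on_success()    # the tenth success adds one slot: 2 -> 3
print(limiter.limit)        # 3
```

Halving reacts quickly to throttling, while the one-slot recovery probes for capacity slowly enough to avoid triggering another round of throttling right away.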
| 206 | + |
| 207 | +### Example |
| 208 | + |
| 209 | +```python |
| 210 | +from pinecone import Pinecone |
| 211 | + |
| 212 | +pc = Pinecone() |
| 213 | +index = pc.index(host="product-search-abc123.svc.pinecone.io") |
| 214 | + |
| 215 | +# max_concurrency=8 is the ceiling. |
| 216 | +# If the index throttles during the run, the SDK will automatically |
| 217 | +# reduce effective concurrency (e.g. to 4, then 2) and recover as |
| 218 | +# throttling subsides. No code changes required. |
| 219 | +response = index.upsert( |
| 220 | +    vectors=large_list, |
| 221 | +    batch_size=200, |
| 222 | +    max_concurrency=8, |
| 223 | +) |
| 224 | +print(response.upserted_count) |
| 225 | +``` |
| 226 | + |
| 227 | +### Limiter scope |
| 228 | + |
| 229 | +One limiter per index host per `Pinecone` client. If you create two `Pinecone` clients and |
| 230 | +both target the same index, they each maintain an independent limiter — there is no |
| 231 | +cross-client coordination (see [Multi-process and serverless workloads](#multi-process-and-serverless-workloads)). |
| 232 | + |
| 233 | +--- |
| 234 | + |
| 235 | +## Transport Differences |
| 236 | + |
| 237 | +The SDK aims for retry parity across REST and gRPC. The remaining differences are small: |
| 238 | + |
| 239 | +| Aspect | REST (`Index`, `AsyncIndex`) | gRPC (`GrpcIndex`) | |
| 240 | +|--------|------------------------------|---------------------| |
| 241 | +| Default `max_retries` | 3 (4 total attempts) | 5 (6 total attempts) | |
| 242 | +| Configured via | `RetryConfig` passed to `Pinecone()` | `max_retries` passed to `GrpcIndex` directly (not `RetryConfig`) | |
| 243 | +| Retryable codes | `{408, 429, 500, 502, 503, 504}` | UNAVAILABLE, RESOURCE\_EXHAUSTED, ABORTED | |
| 244 | +| Server-hint header | `Retry-After` (seconds, float) | `grpc-retry-pushback-ms` (milliseconds, int) | |
| 245 | +| Jitter algorithm | Decorrelated jitter (Python) | Decorrelated jitter (Rust) | |
| 246 | +| Async support | Yes (`AsyncIndex`) | No — gRPC transport is sync-only | |
| 247 | +| Adaptive concurrency | Yes (REST + gRPC share the same per-host limiter registry) | Yes | |
| 248 | + |
| 249 | +**gRPC retry is not configurable via `RetryConfig`.** If you need to tune gRPC retry |
| 250 | +behavior, construct `GrpcIndex` directly (rather than through `Pinecone.index(grpc=True)`) |
| 251 | +and pass `max_retries` explicitly. |
| 252 | + |
| 253 | +--- |
| 254 | + |
| 255 | +## Multi-Process and Serverless Workloads |
| 256 | + |
| 257 | +### What the SDK cannot do |
| 258 | + |
| 259 | +The SDK's retry and adaptive concurrency machinery is per-process. If your workload fans |
| 260 | +out across multiple Lambda invocations, Cloud Run instances, or Kubernetes pods, each |
| 261 | +process runs its own independent retry loop. There is no shared state, no cross-process |
| 262 | +coordination, and no distributed rate-limit awareness. |
| 263 | + |
| 264 | +This means: |
| 265 | + |
| 266 | +- N simultaneously throttled invocations each independently back off and retry. Without |
| 267 | + coordination, they can collide again at the end of the retry window. |
| 268 | +- The adaptive concurrency limiter starts from scratch for each new process instance (e.g. |
| 269 | + a fresh Lambda cold start). It cannot inherit a reduced limit that another invocation |
| 270 | + learned from throttling. |
| 271 | + |
| 272 | +### Recommended pattern for fan-out workloads |
| 273 | + |
| 274 | +Let your orchestrator handle retries at the job level, and keep the SDK's retry window |
| 275 | +narrow: |
| 276 | + |
| 277 | +```python |
| 278 | +from pinecone import Pinecone, RetryConfig |
| 279 | +from pinecone.errors import RateLimitError |
| 280 | + |
| 281 | +# Set max_retries=0 or 1: one attempt (or one fast retry), then raise. |
| 282 | +# Let the SQS visibility timeout / Cloud Tasks retry / Step Functions catch |
| 283 | +# handle the outer retry loop. |
| 284 | +pc = Pinecone(retry_config=RetryConfig(max_retries=1)) |
| 285 | +index = pc.index(host="product-search-abc123.svc.pinecone.io") |
| 286 | + |
| 287 | +try: |
| 288 | +    response = index.upsert(vectors=batch, batch_size=100, max_concurrency=4) |
| 289 | +except RateLimitError as exc: |
| 290 | +    # Re-raise so the orchestrator sees a task failure and schedules a retry |
| 291 | +    # after the visibility timeout expires, which is already longer than |
| 292 | +    # Pinecone's Retry-After window in most configurations. |
| 293 | +    raise |
| 294 | +``` |
| 295 | + |
| 296 | +### Why `Retry-After` smear still helps |
| 297 | + |
| 298 | +Even without coordination, the SDK's 50% smear on `Retry-After` provides statistical |
| 299 | +relief. If N independent Lambda invocations all receive `Retry-After: 60`, they don't all |
| 300 | +retry at second 60 — the smear spreads them across `[60, 90)`. The larger N is, the more |
| 301 | +this matters. |
| 302 | + |
| 303 | +### Summary: when to trust the SDK vs. the orchestrator |
| 304 | + |
| 305 | +| Scenario | Recommended approach | |
| 306 | +|----------|----------------------| |
| 307 | +| Single-process bulk upsert | Use defaults — SDK handles everything | |
| 308 | +| Long-running worker (persistent process) | Use defaults — adaptive limiter learns and recovers | |
| 309 | +| Lambda / Cloud Functions / Cloud Run (stateless) | `max_retries=1`, catch `RateLimitError`, re-raise for orchestrator retry | |
| 310 | +| Fan-out across many pods (e.g. Kubernetes Job) | Same as stateless — set low `max_retries`, rely on orchestrator | |
| 311 | +| Strict per-invocation SLA (must not block) | `max_retries=0`: one attempt, then raise immediately | |
| 312 | + |
| 313 | +--- |
| 314 | + |
| 315 | +## See Also |
| 316 | + |
| 317 | +- {doc}`/guides/error-handling` — Exception hierarchy, `RateLimitError.retry_after`, and how to catch specific errors |
| 318 | +- {doc}`/guides/performance` — Bulk upsert patterns, `max_concurrency` tuning, and transport selection |
| 319 | +- {doc}`/guides/sync-vs-async` — When to use the async client and how to manage concurrency with `asyncio` |