temporalio · carlydf · May 14, 2026
@@ -39,6 +39,9 @@ See [Migration to Unversioned](migration-to-unversioned.md) for how to migrate b
 ### [Ownership](manager-identity.md)
 How the controller gets permission to manage a Worker Deployment, how a human client can take or give back control.
 
+### [Scaling Recommendations](scaling-recommendations.md)
+Practical reactivity and reliability tradeoffs between HPA + prometheus-adapter and KEDA when scaling Temporal workers per worker-deployment-version. Covers steady-state reactivity (~3:15 via the metric path), task-queue unloading, scale-from-zero limits, and when to pick which tool.
+
 ### [WorkerResourceTemplate](worker-resource-templates.md)
 How to attach HPAs, PodDisruptionBudgets, and other Kubernetes resources to each active versioned Deployment. Covers the auto-injection model, RBAC setup, webhook TLS, and examples.
 

@@ -0,0 +1,172 @@
+# Scaling Recommendations
+
+This document describes practical reactivity and reliability tradeoffs when scaling Temporal workers per worker deployment version on Kubernetes, and recommends which tool fits which workload pattern.
+
+The `internal/demo/` example wires the HPA path described here. The KEDA path is mentioned for comparison and as a recommendation for workloads that cannot tolerate the HPA path's limits.
+
+## TL;DR
+
+Choose the scaling approach that matches your workload pattern.
+
+| Workload pattern | Recommendation |
+|------------------|----------------|
+| Continuous traffic (task queue always loaded) | HPA + prometheus-adapter |
+| Idle periods longer than 5 min between work, or scale-from-zero required | KEDA Temporal scaler |
+| Required reactivity < ~60 s from first backlog | KEDA Temporal scaler |
+| Required reactivity ~90 s typical, tolerant of occasional multi-minute stalls | HPA + prometheus-adapter |
+| Thousands of task queues and worker deployment versions | HPA + prometheus-adapter |
+
+## HPA scaling signal
+
+This section describes the signal that HPA + prometheus-adapter uses to adjust the number of workers in a Kubernetes Deployment managed by the Temporal Worker Controller.
+
+The HPA reads two metrics. Prometheus scrapes both, and prometheus-adapter exposes them to the HPA as external metrics.
+
+`temporal_cloud_v1_approximate_backlog_count` (or just "backlog") is a measurement of the number of pending tasks on a particular task queue that are waiting for a poller (a worker) to pull that task and process it. This is a metric provided by [Temporal Cloud's OpenMetrics aggregation service][tc-openmetrics].
+
+`temporal_slot_utilization` (or just "slot util") is emitted directly by Workers (no Temporal Cloud aggregation), scraped at the Prometheus `ServiceMonitor` interval (~10–30 s), and reflects the current state of a particular Worker. This metric rises *before* backlog accumulates. In other words, slots on the Worker saturate first, then queueing starts.
+
+For a continuously loaded task queue, the sequence of delays from "backlog appears" to "HPA scales up" looks like this:
+
+```
+backlog appears at T0
+  └─ Temporal Cloud OpenMetrics emission cadence     + ~60s worst-case  (~1 sample/minute)
+       └─ Prometheus scrape interval                 + ~10s
+            └─ HPA poll interval                     + ~15s
+                 └─ scale-up stabilization window    + taken from HPA configuration
+                      └─ first replica added
+```
+[tc-openmetrics]: https://docs.temporal.io/cloud/metrics/openmetrics
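+
+As a rough worked example, assume the stages in the diagram simply add up, and take the 30 s `stabilizationWindowSeconds` from the HPA template later in this document as the stabilization window. Both are assumptions for illustration, not measurements. The worst-case delay to the first added replica is then approximately:
+
+```math
+T_{\text{worst}} \approx 60\,\text{s} + 10\,\text{s} + 15\,\text{s} + 30\,\text{s} = 115\,\text{s}
+```
+
+The emission cadence dominates the total, so shortening the scrape or HPA poll interval recovers only a few seconds.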
+
+## HPA strengths
+
+Because the HPA path gathers all series for the namespace in a single OpenMetrics scrape (one HTTP request), it scales independently of namespace count. That single request is more efficient than KEDA's Temporal API-based approach and does not run into Temporal API rate limits (see [KEDA limitations](#keda-limitations)).
+
+HPA + prometheus-adapter, configured to watch both slot utilization and backlog, scales up quickly on slot utilization and uses backlog as a backstop that prevents overly reactive replica count adjustment.
+
+## HPA limitations
+
+This section describes two known limitations of HPA + prometheus-adapter.
+
+Temporal Cloud's OpenMetrics endpoint may sometimes return the same embedded timestamps on repeated scrapes, for every series across the account at once: backlog series, action counts, error counts, every queue, every namespace. While this lasts, HPA + prometheus-adapter has no fresh data, which slows how quickly it scales the replica count of a worker deployment version out or in. HPA + prometheus-adapter may therefore not be a good fit if your workload cannot tolerate occasional multi-minute scaling pauses.
+
+> **Note**: This is why `metricsRelistInterval: 5m` is the recommended setting: the discovery window must comfortably exceed the longest expected delay so the metric does not deregister, otherwise re-registration waits up to one more relist cycle after delivery resumes.
+
+HPA cannot scale your Worker Deployment from zero because the signal for scaling does not yet exist. The signal for scaling is the backlog metric for the task queue associated with the workers in the Worker Deployment. This metric will not exist until there is at least one worker polling the task queue.
+
+In addition to the "first worker start" problem, for customers using Temporal Cloud, if there are no polling workers for a task queue for more than 5 minutes, Temporal Cloud will unload the task queue from memory. Unloaded task queues do not emit metrics, and therefore the signal that HPA uses to scale up will not be present.
+
+Submitting a workflow does load the task queue back into memory, but the metric still won't reach the HPA until the next OpenMetrics emission cycle (~1 minute). By the time the HPA reacts, work has already waited a minute or more with no workers to process it.
+
+## KEDA strengths
+
+KEDA's Temporal scaler calls `DescribeTaskQueue(stats=true)` (or `DescribeWorkerDeploymentVersion`), which loads the queue synchronously and returns the backlog directly. This allows KEDA to scale Temporal workers from zero.
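+
+A minimal sketch of such a scaler, assuming a Deployment named `helloworld-worker` and a Temporal Cloud namespace `my-namespace.my-account` (both hypothetical). The trigger fields follow the KEDA Temporal scaler documentation listed in the references. Authentication and per-version scoping are not shown, so check that documentation for the options your KEDA version supports:
+
+```yaml
+apiVersion: keda.sh/v1alpha1
+kind: ScaledObject
+metadata:
+  name: helloworld-worker-scaler        # hypothetical name
+spec:
+  scaleTargetRef:
+    name: helloworld-worker             # hypothetical Deployment
+  pollingInterval: 30                   # seconds between Temporal API polls
+  minReplicaCount: 0                    # allow scale-to-zero
+  maxReplicaCount: 30
+  triggers:
+    - type: temporal
+      metadata:
+        endpoint: my-namespace.my-account.tmprl.cloud:7233   # hypothetical
+        namespace: my-namespace.my-account                   # hypothetical
+        taskQueue: default_helloworld
+        queueTypes: activity
+        targetQueueSize: "1"
+        activationTargetQueueSize: "0"  # any backlog wakes the Deployment from zero
+```
+
+Each `ScaledObject` polls the Temporal API once per `pollingInterval`, which is the cost discussed in the next section.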
+
+## KEDA limitations
+
+KEDA bypasses the metric pipeline but uses Temporal API calls, which are subject to a per-namespace rate limit:
+
+```
+FrontendGlobalWorkerDeploymentReadRPS = 50  # per namespace, evenly distributed across frontend instances
+```
+
+For a namespace with N task queues × M worker deployment versions = K HPAs, each KEDA poll uses ~1 API call per HPA. The resulting polling budget against the 50 RPS limit:
+
+| HPA count | Poll every 30s | Poll every 10s | Poll every 5s |
+|-----------|----------------|----------------|---------------|
+| 50        | 1.7 RPS (3%)   | 5 RPS (10%)    | 10 RPS (20%)  |
+| 250       | 8 RPS (17%)    | 25 RPS (50%)   | 50 RPS (100%) |
+| 1500      | 50 RPS (100%)  | exceeds limit  | exceeds limit |
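+
+The table entries follow from a simple rate calculation, assuming one API call per HPA per poll and polls spread evenly over time. With K HPAs and a polling interval of p seconds:
+
+```math
+\text{RPS} \approx \frac{K}{p}, \qquad \text{e.g. } \frac{250}{10\,\text{s}} = 25\ \text{RPS} = 50\%\ \text{of the 50 RPS limit}
+```
+
+Inverting it gives the shortest sustainable polling interval for a given count: p ≥ K / 50, which is 30 s for 1500 HPAs.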
+
+
+If you are using KEDA with Temporal Cloud and hitting the API rate limit described above, you will need to contact your Temporal Cloud account team to discuss increasing the rate limits.
+
+## Recommended configuration for the HPA + prometheus-adapter path
+
+The demo's configuration, shown here in compact form, is the recommended setup:
+
+**Scrape config** (`internal/demo/k8s/prometheus-stack-values.yaml`):
+```yaml
+- job_name: temporal_cloud
+  scrape_interval: 10s
+  honor_timestamps: true
+  metrics_path: /v1/metrics
+  params:
+    labels:
+      - temporal_worker_deployment_name
+      - temporal_worker_build_id
+```
+
+**prometheus-adapter rule** (`internal/demo/k8s/prometheus-adapter-values.yaml`):
+```yaml
+metricsRelistInterval: 5m   # must accommodate Cloud's ~3-min embedded-timestamp lag
+rules:
+  external:
+    - seriesQuery: 'temporal_cloud_v1_approximate_backlog_count{temporal_worker_build_id!="__unversioned__"}'
+      metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>})'
+      name:
+        as: "temporal_cloud_v1_approximate_backlog_count"
+      resources:
+        namespaced: false
+```
+
+The `seriesQuery` filter excludes `__unversioned__` series. Without it, accounts with many unversioned namespaces produce 5000+ series in the discovery response, which slows or breaks adapter discovery. The filter scopes discovery to versioned workloads.
+
+**HPA template** (`examples/wrt-hpa-backlog.yaml`): two metrics — slot utilization (fast leading signal, scale-up gate) and backlog count (confirming signal, AverageValue target).
+
+```yaml
+apiVersion: autoscaling/v2
+kind: HorizontalPodAutoscaler
+spec:
+  scaleTargetRef: {}
+  minReplicas: 1
+  maxReplicas: 30
+  metrics:
+    - type: External
+      external:
+        metric:
+          name: temporal_slot_utilization
+          selector:
+            matchLabels:
+              worker_type: "ActivityWorker"
+        target:
+          type: Value
+          value: "750m"
+
+    - type: External
+      external:
+        metric:
+          name: temporal_cloud_v1_approximate_backlog_count
+          selector:
+            matchLabels:
+              temporal_task_queue: "default_helloworld"
+              task_type: "Activity"
+        target:
+          type: AverageValue
+          averageValue: "1"
+  behavior:
+    scaleUp:
+      stabilizationWindowSeconds: 30
+      policies:
+        - type: Percent
+          value: 10
+          periodSeconds: 10
+      selectPolicy: Max
+
+    scaleDown:
+      stabilizationWindowSeconds: 120
+      policies:
+        - type: Percent
+          value: 10
+          periodSeconds: 10
+      selectPolicy: Max
+```
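+
+To see how the backlog target turns into a replica count: for an `AverageValue` target, the HPA's standard algorithm divides the metric total by the target and rounds up (ignoring its small tolerance band). A worked example, assuming a backlog of 12 tasks, which is an illustrative number only:
+
+```math
+\text{desiredReplicas} = \left\lceil \frac{\text{backlog}}{\text{averageValue}} \right\rceil = \left\lceil \frac{12}{1} \right\rceil = 12
+```
+
+The result is then clamped to `minReplicas` and `maxReplicas` and rate-limited by the `behavior` policies. When several metrics are configured, the HPA acts on the largest proposal.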
+
+## References
+
+- [Temporal Cloud OpenMetrics](https://docs.temporal.io/cloud/metrics/openmetrics) — endpoint and opt-in labels
+- [prometheus-adapter README](https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/README.md) — `metrics-relist-interval` and discovery window semantics
+- [prometheus-adapter externalmetrics.md](https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/docs/externalmetrics.md) — external rules, `namespaced: false` for cluster-scoped metrics
+- [Prometheus HTTP API: `/api/v1/series`](https://prometheus.io/docs/prometheus/latest/querying/api/#finding-series-by-label-matchers) — series discovery semantics
+- [Prometheus scrape config: `honor_timestamps`](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config) — preserving source timestamps
+- [KEDA Temporal scaler](https://keda.sh/docs/latest/scalers/temporal/) — direct API polling alternative
@@ -61,15 +61,16 @@ spec:
               value: "750m"
 
         # Metric: backlog count — scale up when tasks are queued but not yet picked up.
-        # temporal_approximate_backlog_count is a recording rule that aggregates
-        # temporal_cloud_v1_approximate_backlog_count down to the four labels the HPA needs.
+        # Sourced directly from Temporal Cloud's temporal_cloud_v1_approximate_backlog_count
+        # series; the prometheus-adapter rule wraps it in sum(...) to collapse labels the HPA
+        # doesn't select on (instance/job/region/task_priority/temporal_account).
         # temporal_worker_deployment_name, temporal_worker_build_id, and temporal_namespace
         # are injected automatically by the controller — do not set them here.
         # temporal_task_queue must be set explicitly to scope the metric to your task queue.
         - type: External
           external:
             metric:
-              name: temporal_approximate_backlog_count
+              name: temporal_cloud_v1_approximate_backlog_count
               selector:
                 matchLabels:
                   temporal_task_queue: "default_helloworld"

@@ -268,7 +268,7 @@ You'll also need to [opt-in](https://docs.temporal.io/cloud/metrics/openmetrics/
 
 This requires a **metrics API key** — a separate credential from the namespace API key used for the worker connection.
 
-> **Note:** This demo ships a Prometheus recording rule that renames `temporal_cloud_v1_approximate_backlog_count` to `temporal_approximate_backlog_count` and reduces it to the labels the HPA cares about. In principle the HPA can consume the raw Cloud metric directly (set `namespaced: false` on the prometheus-adapter rule so it doesn't auto-inject a `namespace` label filter), but this demo uses the recording rule as a known-working path.
+> **Picking a scaling tool for your workload:** This demo uses the HPA + prometheus-adapter path. It works well for continuously-loaded task queues and has a typical end-to-end reactivity of ~85 seconds (dominated by Temporal Cloud's ~1/minute OpenMetrics emission cadence). It cannot do scale-from-zero. For sub-60s reactivity or scale-from-zero, use the KEDA Temporal scaler. See [docs/scaling-recommendations.md](../../docs/scaling-recommendations.md) for the full reactivity model, when to pick which, and a caveat about an account-wide OpenMetrics delivery-delay pattern we observed during testing (retrospectively backfilled, but real for live HPA queries).
 
 **Step 1 — Create the Temporal Cloud metrics credentials secret.**
 
@@ -302,11 +302,11 @@ helm upgrade --install prometheus-adapter prometheus-community/prometheus-adapte
 
 ```bash
 kubectl -n monitoring port-forward svc/prometheus-kube-prometheus-prometheus 9092:9090 &
-curl -s 'http://localhost:9092/api/v1/query?query=temporal_approximate_backlog_count' \
+curl -s 'http://localhost:9092/api/v1/query?query=temporal_cloud_v1_approximate_backlog_count' \
   | jq '.data.result'
 ```
 
-You should see results with `temporal_worker_deployment_name` and `temporal_worker_build_id` labels. If the result is empty, wait 15–30s for the recording rule to evaluate.
+You should see results with `temporal_worker_deployment_name` and `temporal_worker_build_id` labels. If the result is empty, verify the Temporal Cloud metrics API key secret is correct and that scrape targets are healthy in the Prometheus UI.
 
 **Step 4 — Apply the combined WRT.**
 ```bash

@@ -29,16 +29,25 @@ rules:
         namespaced: false  # cluster-scoped: HPAs in any k8s namespace can consume this metric
 
     # Phase 2: approximate backlog count per worker version (from Temporal Cloud).
-    # Uses the temporal_approximate_backlog_count recording rule, which reduces the raw
-    # temporal_cloud_v1_approximate_backlog_count (high cardinality, many label dimensions)
-    # down to just the four labels the HPA needs. cluster-scoped so HPAs in any namespace
-    # can consume it.
-    - seriesQuery: 'temporal_approximate_backlog_count{}'
+    # Consumes temporal_cloud_v1_approximate_backlog_count directly. The metricsQuery's
+    # sum(...) collapses labels the HPA's matchLabels don't select on
+    # (instance/job/region/task_priority/temporal_account).
+    #
+    # seriesQuery filter rationale: Temporal Cloud emits this metric for *every* namespace
+    # in your account, including ones not yet opted in to per-version labels — those carry
+    # temporal_worker_build_id="__unversioned__" and can dominate cardinality (5000+ series
+    # per account is typical). The adapter chokes on series-discovery responses that large,
+    # so we filter discovery to versioned series only.
+    #
+    # cluster-scoped so HPAs in any namespace can consume it.
+    - seriesQuery: 'temporal_cloud_v1_approximate_backlog_count{temporal_worker_build_id!="__unversioned__"}'
       metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>})'
       name:
-        as: "temporal_approximate_backlog_count"
+        as: "temporal_cloud_v1_approximate_backlog_count"
       resources:
         namespaced: false  # cluster-scoped: HPAs in any namespace can consume this metric
 
-# Must be greater than the Prometheus scrape interval.
-metricsRelistInterval: 15s
+# Must accommodate Temporal Cloud's embedded-timestamp lag (~3 min) AND have
+# margin for emission cadence. 5m is empirically the smallest tested value
+# that keeps the metric registered through the 3-min timestamp staleness.
+metricsRelistInterval: 5m
@@ -11,7 +11,9 @@
 #   1. ServiceMonitor — scrapes worker pod metrics (slot gauges) from port 9090
 #   2. Temporal Cloud scrape config — scrapes temporal_cloud_v1_approximate_backlog_count
 #      (Phase 2 only; requires a Temporal Cloud metrics API key)
-#   3. Recording rules — slot utilization ratio (Phase 1) and backlog count by version (Phase 2)
+#   3. Recording rule — slot utilization ratio (Phase 1 only). The backlog count
+#      is consumed directly from the raw Cloud series via prometheus-adapter; see
+#      docs/scaling-recommendations.md for the reasoning.
 #   4. prometheus-adapter — see internal/demo/k8s/prometheus-adapter-values.yaml
 
 # ─── 1. ServiceMonitor ──────────────────────────────────────────────────────
@@ -81,18 +83,3 @@ additionalPrometheusRulesMap:
                 1
               )
 
-      - name: temporal_cloud_backlog
-        interval: 10s
-        rules:
-          # Backlog count per worker version. Temporal Cloud emits
-          # temporal_worker_deployment_name and temporal_worker_build_id as separate
-          # labels (opted in via params.labels in the scrape config), so no label
-          # manipulation is needed — only cardinality reduction via sum by.
-          # The prometheus-adapter serves this as a cluster-scoped external metric
-          # (namespaced: false), so HPAs in any namespace can consume it.
-          - record: temporal_approximate_backlog_count
-            expr: |
-              sum by (temporal_worker_deployment_name, temporal_worker_build_id, task_type, temporal_namespace, temporal_task_queue) (
-                temporal_cloud_v1_approximate_backlog_count
-              )
-