3 changes: 3 additions & 0 deletions docs/README.md
@@ -39,6 +39,9 @@ See [Migration to Unversioned](migration-to-unversioned.md) for how to migrate b
### [Ownership](manager-identity.md)
How the controller gets permission to manage a Worker Deployment, how a human client can take or give back control.

### [Scaling Recommendations](scaling-recommendations.md)

Review comment (Contributor): Recommend adding this new doc to the list here.

Practical reactivity and reliability tradeoffs between HPA + prometheus-adapter and KEDA when scaling Temporal workers per worker-deployment-version. Covers steady-state reactivity (~3:15 via the metric path), task-queue unloading, scale-from-zero limits, and when to pick which tool.

### [WorkerResourceTemplate](worker-resource-templates.md)
How to attach HPAs, PodDisruptionBudgets, and other Kubernetes resources to each active versioned Deployment. Covers the auto-injection model, RBAC setup, webhook TLS, and examples.

172 changes: 172 additions & 0 deletions docs/scaling-recommendations.md
@@ -0,0 +1,172 @@
# Scaling Recommendations

This document describes practical reactivity and reliability tradeoffs when scaling Temporal workers per worker deployment version on Kubernetes, and recommends which tool fits which workload pattern.

The `internal/demo/` example wires the HPA path described here. The KEDA path is mentioned for comparison and as a recommendation for workloads that cannot tolerate the HPA path's limits.

## TL;DR

We recommend choosing a scaler approach that aligns with the workload pattern your application exhibits.

| Workload pattern | Recommendation |
|------------------|----------------|
| Continuous traffic (task queue always loaded) | HPA |
| Idle periods >5 min between work OR needs scale-from-zero | KEDA Temporal scaler |
| Required reactivity < ~60 s from first backlog | KEDA Temporal scaler |
| Required reactivity ~90 s typical, tolerant of occasional multi-minute stalls | HPA + prometheus-adapter |
| 1000s of task queues and worker deployment versions | HPA + prometheus-adapter |

## HPA scaling signal

This section describes the signals that HPA + prometheus-adapter uses to adjust the number of worker replicas in a Kubernetes Deployment managed by the Temporal Worker Controller.

The HPA acts on two metrics. Prometheus scrapes both, and prometheus-adapter exposes them to the HPA.

`temporal_cloud_v1_approximate_backlog_count` (or just "backlog") is a measurement of the number of pending tasks on a particular task queue that are waiting for a poller (a worker) to pull that task and process it. This is a metric provided by [Temporal Cloud's OpenMetrics aggregation service][tc-openmetrics].

`temporal_slot_utilization` (or just "slot util") is emitted directly by Workers (no Temporal Cloud aggregation), scraped at the Prometheus `ServiceMonitor` interval (~10–30 s), and reflects the current state of a particular Worker. This metric rises *before* backlog accumulates. In other words, slots on the Worker saturate first, then queueing starts.

For a continuously-loaded task queue, important events from "backlog appears" to "HPA scales up" can be visualized like so:

```
backlog appears at T0
  └─ Temporal Cloud OpenMetrics emission cadence   + ~60s worst-case (~1 sample/minute)
      └─ Prometheus scrape interval                + ~10s
          └─ HPA poll interval                     + ~15s
              └─ scale-up stabilization window     + taken from HPA configuration
                  └─ first replica added
```
[tc-openmetrics]: https://docs.temporal.io/cloud/metrics/openmetrics
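
As a rough worked example, the stages above can be summed into a worst-case delay. The sketch below is illustrative only: the class and field names are hypothetical, the 30 s stabilization window is an assumption taken from the example HPA later in this document, and the total ignores pod startup time and any delay in Temporal Cloud metric delivery.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleUpBudget:
    """Worst-case delay per stage, in seconds, from 'backlog appears' to 'first replica added'."""

    emission_cadence_s: float = 60.0      # Temporal Cloud OpenMetrics, ~1 sample/minute
    scrape_interval_s: float = 10.0       # Prometheus scrape interval
    hpa_poll_interval_s: float = 15.0     # HPA poll interval
    stabilization_window_s: float = 30.0  # assumption: scaleUp.stabilizationWindowSeconds

    def worst_case_s(self) -> float:
        """Sum every stage, assuming each one is missed by a full interval."""
        return (
            self.emission_cadence_s
            + self.scrape_interval_s
            + self.hpa_poll_interval_s
            + self.stabilization_window_s
        )


if __name__ == "__main__":
    print(ScaleUpBudget().worst_case_s())  # 115.0
```

Changing `stabilization_window_s` shows how much of the budget is under your control; the emission cadence is not.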

## HPA strengths

Because HPA relies on a single OpenMetrics scrape that gathers all series for the namespace in one HTTP request, the HPA approach scales independently of namespace count. That single request is more efficient than KEDA's Temporal API-based approach and does not run into Temporal API rate limits (see [KEDA limitations](#keda-limitations) below).

Configuring HPA + prometheus-adapter to watch both slot util and backlog gives fast scale-up from slot util, with backlog as a backstop that prevents overly reactive replica-count changes.

## HPA limitations

Review comment (Member): This is great! Was wondering: should we also mention that our OM endpoint inherently has a 3-minute time lag, which would mean that at time=0 we are seeing a one-minute aggregate of time.now() - 3 minutes?

This section describes two known limitations of HPA + prometheus-adapter: delayed metric delivery and the inability to scale from zero.

Temporal Cloud's OpenMetrics endpoint may sometimes return the same embedded timestamps on repeated scrapes. When this happens, it affects every series across the account simultaneously: backlog series, action counts, error counts, every queue, every namespace. This delay in returning fresh metrics data can slow how quickly HPA + prometheus-adapter scales the replica count for a worker deployment version out or in. HPA + prometheus-adapter may therefore not be a good fit if your workload cannot tolerate occasional multi-minute scaling pauses.

> **Note**: This is why `metricsRelistInterval: 5m` is the recommended setting: the discovery window must comfortably exceed the longest expected delay so the metric does not deregister, otherwise re-registration waits up to one more relist cycle after delivery resumes.
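
One way to sanity-check a candidate relist interval is to compare it against the delays it has to ride out. This is a simplified model, not the adapter's actual logic: the function name is hypothetical, the ~3-minute lag comes from the comment in the adapter configuration, and treating the "longest expected delay" as lag + one emission cycle + one scrape is an assumption.

```python
def relist_interval_covers_delay(
    relist_interval_s: float,
    timestamp_lag_s: float = 180.0,    # ~3-min embedded-timestamp lag
    emission_cadence_s: float = 60.0,  # ~1 OpenMetrics sample per minute
    scrape_interval_s: float = 10.0,   # Prometheus scrape interval
) -> bool:
    """Return True when the discovery window outlasts the modeled delay."""
    longest_expected_delay_s = timestamp_lag_s + emission_cadence_s + scrape_interval_s
    return relist_interval_s > longest_expected_delay_s


if __name__ == "__main__":
    print(relist_interval_covers_delay(5 * 60))  # True: 300s > 250s
    print(relist_interval_covers_delay(15))      # False: 15s is far too short
```

Under this model a 5-minute interval leaves about 50 s of margin, while a 15 s interval cannot span the lag at all.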

HPA cannot scale your Worker Deployment from zero because the signal for scaling does not yet exist. The signal for scaling is the backlog metric for the task queue associated with the workers in the Worker Deployment. This metric will not exist until there is at least one worker polling the task queue.

Review comment (Member): It took me a second to understand what this really meant. At first, I thought it meant that no backlog metric is emitted if you have no workers running at all (which is not true, since this metric is emitted in the unversioned world without workers being present).

I know you have clearly mentioned versions in the preamble here, but can we be extra clear and mention that the backlog count per version is not emitted without a worker being present, since that is what creates a version in Temporal?

In addition to the "first worker start" problem, for customers using Temporal Cloud, if there are no polling workers for a task queue for more than 5 minutes, Temporal Cloud will unload the task queue from memory. Unloaded task queues do not emit metrics, and therefore the signal that HPA uses to scale up will not be present.

Submitting a workflow does load the task queue back into memory, but the metric still won't reach the HPA until the next OpenMetrics emission cycle (~1 minute). By the time the HPA reacts, you've already had ~1+ minute of unprovisioned work.

## KEDA strengths

KEDA's Temporal scaler calls `DescribeTaskQueue(stats=true)` (or `DescribeWorkerDeploymentVersion`), which loads the queue synchronously and returns the backlog directly. This allows KEDA to scale Temporal workers from zero.

## KEDA limitations

KEDA bypasses the metric pipeline but uses Temporal API calls, which are subject to a per-namespace rate limit:

Review comment (Collaborator Author): Other KEDA limitation (for now) #355

KEDA also does not actually work with TWC until #286 is closed. Luckily we have an open community PR #351 that will add support for the KEDA temporal trigger, which I think can be merged soon. I just reviewed it.

```
FrontendGlobalWorkerDeploymentReadRPS = 50 # per namespace, evenly distributed across frontend instances
```

For a namespace with N task queues × M worker-deployment-versions = K HPAs, each KEDA poll uses ~1 API call. The polling budget:

| HPA count | Poll every 30s | Poll every 10s | Poll every 5s |
|-----------|----------------|----------------|---------------|
| 50 | 1.7 RPS (3%) | 5 RPS (10%) | 10 RPS (20%) |
| 250 | 8 RPS (17%) | 25 RPS (50%) | 50 RPS (100%) |
| 1500 | 50 RPS (100%) | exceeds limit | exceeds limit |


If you are using KEDA with Temporal Cloud and hitting the API rate limit described above, you will need to contact your Temporal Cloud account team to discuss increasing the rate limits.
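
The polling-budget table above follows from dividing the number of polled objects by the poll interval and comparing the result to the 50 RPS limit. The sketch below reproduces that arithmetic; the function names are illustrative, and it assumes exactly one API call per object per poll, as stated above.

```python
RPS_LIMIT = 50.0  # FrontendGlobalWorkerDeploymentReadRPS, per namespace


def polling_rps(object_count: int, poll_interval_s: float) -> float:
    """Requests per second when each object is polled once per interval."""
    return object_count / poll_interval_s


def budget_used(object_count: int, poll_interval_s: float, limit_rps: float = RPS_LIMIT) -> float:
    """Fraction of the rate limit consumed; values above 1.0 exceed the limit."""
    return polling_rps(object_count, poll_interval_s) / limit_rps


def max_objects(poll_interval_s: float, limit_rps: float = RPS_LIMIT) -> int:
    """Largest object count that stays within the limit at this poll interval."""
    return int(limit_rps * poll_interval_s)


if __name__ == "__main__":
    for count in (50, 250, 1500):
        for interval in (30, 10, 5):
            rps = polling_rps(count, interval)
            print(f"{count:>5} objects @ {interval:>2}s: {rps:6.1f} RPS ({budget_used(count, interval):.0%})")
```

`max_objects(30)` gives 1500, which matches the last row of the table: at a 30 s poll interval, 1500 polled objects consume the entire budget.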

## Recommended configuration for the HPA + prometheus-adapter path

This demo's configuration represents the recommendation, in compact form:

**Scrape config** (`internal/demo/k8s/prometheus-stack-values.yaml`):
```yaml
- job_name: temporal_cloud
  scrape_interval: 10s
  honor_timestamps: true
  metrics_path: /v1/metrics
  params:
    labels:
      - temporal_worker_deployment_name
      - temporal_worker_build_id
```

**prometheus-adapter rule** (`internal/demo/k8s/prometheus-adapter-values.yaml`):
```yaml
metricsRelistInterval: 5m  # must accommodate Cloud's ~3-min embedded-timestamp lag
rules:
  external:
    - seriesQuery: 'temporal_cloud_v1_approximate_backlog_count{temporal_worker_build_id!="__unversioned__"}'
      metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>})'
      name:
        as: "temporal_cloud_v1_approximate_backlog_count"
      resources:
        namespaced: false
```

The `seriesQuery` filter excludes `__unversioned__` series. Without it, accounts with many unversioned namespaces produce 5000+ series in the discovery response, which slows or breaks adapter discovery. The filter scopes discovery to versioned workloads.
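
To see how many series the filter leaves for discovery, you can send the same matcher to the Prometheus `/api/v1/series` endpoint. The sketch below is a hedged illustration: the helper names are hypothetical, and the base URL assumes the demo's `kubectl port-forward` to `localhost:9092`.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Same matcher as the adapter's seriesQuery.
SERIES_MATCHER = (
    "temporal_cloud_v1_approximate_backlog_count"
    '{temporal_worker_build_id!="__unversioned__"}'
)


def series_discovery_url(prometheus_base_url: str, matcher: str = SERIES_MATCHER) -> str:
    """Build a /api/v1/series URL with the matcher URL-encoded as match[]."""
    query = urlencode({"match[]": matcher})
    return f"{prometheus_base_url.rstrip('/')}/api/v1/series?{query}"


def count_series(prometheus_base_url: str, matcher: str = SERIES_MATCHER) -> int:
    """Count matching series; requires a reachable Prometheus."""
    with urlopen(series_discovery_url(prometheus_base_url, matcher), timeout=10) as resp:
        payload = json.load(resp)
    return len(payload["data"])


if __name__ == "__main__":
    # Assumed address from the demo's port-forward; adjust for your cluster.
    print(series_discovery_url("http://localhost:9092"))
```

Running `count_series` with and without the `temporal_worker_build_id` matcher gives a direct measure of how much the filter shrinks the discovery response.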

**HPA template** (`examples/wrt-hpa-backlog.yaml`): two metrics — slot utilization (fast leading signal, scale-up gate) and backlog count (confirming signal, AverageValue target).

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  scaleTargetRef: {}
  minReplicas: 1
  maxReplicas: 30
  metrics:
    - type: External
      external:
        metric:
          name: temporal_slot_utilization
          selector:
            matchLabels:
              worker_type: "ActivityWorker"
        target:
          type: Value
          value: "750m"

    - type: External
      external:
        metric:
          name: temporal_cloud_v1_approximate_backlog_count
          selector:
            matchLabels:
              temporal_task_queue: "default_helloworld"
              task_type: "Activity"
        target:
          type: AverageValue
          averageValue: "1"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 10
          periodSeconds: 10
      selectPolicy: Max

    scaleDown:
      stabilizationWindowSeconds: 120
      policies:
        - type: Percent
          value: 10
          periodSeconds: 10
      selectPolicy: Max
```
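
To make the interaction between the two metrics concrete, the sketch below approximates how the HPA turns them into a replica count, following the algorithm in the Kubernetes HPA documentation: each metric yields a proposal and the largest one wins. It is a simplification under stated assumptions: the function names are hypothetical, and it ignores the HPA's tolerance band, the stabilization windows, and the `behavior` rate limits.

```python
import math


def replicas_for_value_target(current_replicas: int, metric_value: float, target_value: float) -> int:
    """`Value` target: scale the current replica count by metric / target."""
    return math.ceil(current_replicas * metric_value / target_value)


def replicas_for_average_value_target(metric_total: float, target_average: float) -> int:
    """`AverageValue` target: enough replicas that total / replicas <= target."""
    return math.ceil(metric_total / target_average)


def desired_replicas(
    current_replicas: int,
    slot_utilization: float,
    backlog: float,
    *,
    slot_target: float = 0.75,    # "750m" in the template
    backlog_target: float = 1.0,  # averageValue "1" in the template
    min_replicas: int = 1,
    max_replicas: int = 30,
) -> int:
    """Take the largest per-metric proposal, clamped to [min_replicas, max_replicas]."""
    proposals = (
        replicas_for_value_target(current_replicas, slot_utilization, slot_target),
        replicas_for_average_value_target(backlog, backlog_target),
    )
    return max(min_replicas, min(max_replicas, max(proposals)))


if __name__ == "__main__":
    print(desired_replicas(4, slot_utilization=0.9, backlog=0))   # 5: slots saturating, no backlog yet
    print(desired_replicas(4, slot_utilization=0.9, backlog=12))  # 12: backlog dominates
```

In the first call the slot-utilization metric alone proposes an extra replica before any backlog exists, which is the "leading signal" behaviour described above. In practice the 10%-per-10 s scale-up policy would spread the jump to 12 over several steps.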

## References

- [Temporal Cloud OpenMetrics](https://docs.temporal.io/cloud/metrics/openmetrics) — endpoint and opt-in labels
- [prometheus-adapter README](https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/README.md) — `metrics-relist-interval` and discovery window semantics
- [prometheus-adapter externalmetrics.md](https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/docs/externalmetrics.md) — external rules, `namespaced: false` for cluster-scoped metrics
- [Prometheus HTTP API: `/api/v1/series`](https://prometheus.io/docs/prometheus/latest/querying/api/#finding-series-by-label-matchers) — series discovery semantics
- [Prometheus scrape config: `honor_timestamps`](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config) — preserving source timestamps
- [KEDA Temporal scaler](https://keda.sh/docs/latest/scalers/temporal/) — direct API polling alternative
7 changes: 4 additions & 3 deletions examples/wrt-hpa-backlog.yaml
@@ -61,15 +61,16 @@ spec:
  value: "750m"

  # Metric: backlog count — scale up when tasks are queued but not yet picked up.
- # temporal_approximate_backlog_count is a recording rule that aggregates
- # temporal_cloud_v1_approximate_backlog_count down to the four labels the HPA needs.
+ # Sourced directly from Temporal Cloud's temporal_cloud_v1_approximate_backlog_count
+ # series; the prometheus-adapter rule wraps it in sum(...) to collapse labels the HPA
+ # doesn't select on (instance/job/region/task_priority/temporal_account).
  # temporal_worker_deployment_name, temporal_worker_build_id, and temporal_namespace
  # are injected automatically by the controller — do not set them here.
  # temporal_task_queue must be set explicitly to scope the metric to your task queue.
  - type: External
  external:
  metric:
- name: temporal_approximate_backlog_count
+ name: temporal_cloud_v1_approximate_backlog_count
  selector:
  matchLabels:
  temporal_task_queue: "default_helloworld"
6 changes: 3 additions & 3 deletions internal/demo/README.md
@@ -268,7 +268,7 @@ You'll also need to [opt-in](https://docs.temporal.io/cloud/metrics/openmetrics/

  This requires a **metrics API key** — a separate credential from the namespace API key used for the worker connection.

- > **Note:** This demo ships a Prometheus recording rule that renames `temporal_cloud_v1_approximate_backlog_count` to `temporal_approximate_backlog_count` and reduces it to the labels the HPA cares about. In principle the HPA can consume the raw Cloud metric directly (set `namespaced: false` on the prometheus-adapter rule so it doesn't auto-inject a `namespace` label filter), but this demo uses the recording rule as a known-working path.
+ > **Picking a scaling tool for your workload:** This demo uses the HPA + prometheus-adapter path. It works well for continuously-loaded task queues and has a typical end-to-end reactivity of ~85 seconds (dominated by Temporal Cloud's ~1/minute OpenMetrics emission cadence). It cannot do scale-from-zero. For sub-60s reactivity or scale-from-zero, use the KEDA Temporal scaler. See [docs/scaling-recommendations.md](../../docs/scaling-recommendations.md) for the full reactivity model, when to pick which, and a caveat about an account-wide OpenMetrics delivery-delay pattern we observed during testing (retrospectively backfilled, but real for live HPA queries).

  **Step 1 — Create the Temporal Cloud metrics credentials secret.**

@@ -302,11 +302,11 @@ helm upgrade --install prometheus-adapter prometheus-community/prometheus-adapte

  ```bash
  kubectl -n monitoring port-forward svc/prometheus-kube-prometheus-prometheus 9092:9090 &
- curl -s 'http://localhost:9092/api/v1/query?query=temporal_approximate_backlog_count' \
+ curl -s 'http://localhost:9092/api/v1/query?query=temporal_cloud_v1_approximate_backlog_count' \
  | jq '.data.result'
  ```

- You should see results with `temporal_worker_deployment_name` and `temporal_worker_build_id` labels. If the result is empty, wait 15–30s for the recording rule to evaluate.
+ You should see results with `temporal_worker_deployment_name` and `temporal_worker_build_id` labels. If the result is empty, verify the Temporal Cloud metrics API key secret is correct and that scrape targets are healthy in the Prometheus UI.

  **Step 4 — Apply the combined WRT.**
  ```bash
25 changes: 17 additions & 8 deletions internal/demo/k8s/prometheus-adapter-values.yaml
@@ -29,16 +29,25 @@ rules:
  namespaced: false # cluster-scoped: HPAs in any k8s namespace can consume this metric

  # Phase 2: approximate backlog count per worker version (from Temporal Cloud).
- # Uses the temporal_approximate_backlog_count recording rule, which reduces the raw
- # temporal_cloud_v1_approximate_backlog_count (high cardinality, many label dimensions)
- # down to just the four labels the HPA needs. cluster-scoped so HPAs in any namespace
- # can consume it.
- - seriesQuery: 'temporal_approximate_backlog_count{}'
+ # Consumes temporal_cloud_v1_approximate_backlog_count directly. The metricsQuery's
+ # sum(...) collapses labels the HPA's matchLabels don't select on
+ # (instance/job/region/task_priority/temporal_account).
+ #
+ # seriesQuery filter rationale: Temporal Cloud emits this metric for *every* namespace
+ # in your account, including ones not yet opted in to per-version labels — those carry
+ # temporal_worker_build_id="__unversioned__" and can dominate cardinality (5000+ series
+ # per account is typical). The adapter chokes on series-discovery responses that large,
+ # so we filter discovery to versioned series only.
+ #
+ # cluster-scoped so HPAs in any namespace can consume it.
+ - seriesQuery: 'temporal_cloud_v1_approximate_backlog_count{temporal_worker_build_id!="__unversioned__"}'
  metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>})'
  name:
- as: "temporal_approximate_backlog_count"
+ as: "temporal_cloud_v1_approximate_backlog_count"
  resources:
  namespaced: false # cluster-scoped: HPAs in any namespace can consume this metric

- # Must be greater than the Prometheus scrape interval.
- metricsRelistInterval: 15s
+ # Must accommodate Temporal Cloud's embedded-timestamp lag (~3 min) AND have
+ # margin for emission cadence. 5m is empirically the smallest tested value
+ # that keeps the metric registered through the 3-min timestamp staleness.
+ metricsRelistInterval: 5m
19 changes: 3 additions & 16 deletions internal/demo/k8s/prometheus-stack-values.yaml
@@ -11,7 +11,9 @@
  # 1. ServiceMonitor — scrapes worker pod metrics (slot gauges) from port 9090
  # 2. Temporal Cloud scrape config — scrapes temporal_cloud_v1_approximate_backlog_count
  # (Phase 2 only; requires a Temporal Cloud metrics API key)
- # 3. Recording rules — slot utilization ratio (Phase 1) and backlog count by version (Phase 2)
+ # 3. Recording rule — slot utilization ratio (Phase 1 only). The backlog count
+ # is consumed directly from the raw Cloud series via prometheus-adapter; see
+ # docs/scaling-recommendations.md for the reasoning.
  # 4. prometheus-adapter — see internal/demo/k8s/prometheus-adapter-values.yaml

  # ─── 1. ServiceMonitor ──────────────────────────────────────────────────────
@@ -81,18 +83,3 @@ additionalPrometheusRulesMap:
  1
  )

- - name: temporal_cloud_backlog
- interval: 10s
- rules:
- # Backlog count per worker version. Temporal Cloud emits
- # temporal_worker_deployment_name and temporal_worker_build_id as separate
- # labels (opted in via params.labels in the scrape config), so no label
- # manipulation is needed — only cardinality reduction via sum by.
- # The prometheus-adapter serves this as a cluster-scoped external metric
- # (namespaced: false), so HPAs in any namespace can consume it.
- - record: temporal_approximate_backlog_count
- expr: |
- sum by (temporal_worker_deployment_name, temporal_worker_build_id, task_type, temporal_namespace, temporal_task_queue) (
- temporal_cloud_v1_approximate_backlog_count
- )
-