Add docs recommending autoscaling setup by carlydf · Pull Request #324 · temporalio/temporal-worker-controller

carlydf · 2026-05-14T01:52:36Z

Adds documentation outlining the tradeoffs between two autoscaling solutions:

- HPA+prometheus adapter
- KEDA Temporal Scaler

Documentation focuses on straightforward descriptions of the pros and cons of each solution.

The backlog metric pipeline goes from prometheus-adapter directly to the raw temporal_cloud_v1_approximate_backlog_count series, eliminating the temporal_approximate_backlog_count recording rule.

Adapter rule:

- seriesQuery filters out temporal_worker_build_id="__unversioned__" so discovery doesn't choke on the 5000+ unversioned series in typical accounts.
- metricsQuery sum(...) collapses labels the HPA doesn't select on at query time (instance/job/region/task_priority/temporal_account).
- metricsRelistInterval is bumped to 5m to accommodate the ~3-minute embedded-timestamp lag in Temporal Cloud's OpenMetrics emission.

WRT example, prometheus-stack-values, and demo README are updated to match.

Add docs/scaling-recommendations.md covering the empirically measured reactivity model (steady-state ~3:15 dominated by Cloud aggregation lag), task-queue-unload behavior, scale-from-zero limits, and when to pick KEDA over the metric path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
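
The effect of the adapter rule described in this commit can be sketched in a few lines of Python. This is only an illustration of the filter-then-sum idea, not the adapter's implementation: the `collapse` helper, the `temporal_task_queue` label, and all sample values are assumptions made up for the example. Only the `__unversioned__` build ID and the five collapsed label names come from the commit message.

```python
from collections import defaultdict

UNVERSIONED = "__unversioned__"
# Labels the commit message says are collapsed at query time.
DROPPED_LABELS = {"instance", "job", "region", "task_priority", "temporal_account"}


def collapse(series):
    """Drop unversioned series, then sum values over the dropped labels."""
    totals = defaultdict(float)
    for labels, value in series:
        if labels.get("temporal_worker_build_id") == UNVERSIONED:
            continue  # mirrors the seriesQuery filter
        # Keep only the labels that are not collapsed away.
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in DROPPED_LABELS))
        totals[key] += value  # mirrors the metricsQuery sum(...)
    return dict(totals)


# Hypothetical samples: label values and `temporal_task_queue` are invented.
samples = [
    ({"temporal_task_queue": "orders", "temporal_worker_build_id": "v1",
      "region": "us-east-1", "task_priority": "1"}, 4.0),
    ({"temporal_task_queue": "orders", "temporal_worker_build_id": "v1",
      "region": "us-east-1", "task_priority": "2"}, 6.0),
    ({"temporal_task_queue": "orders", "temporal_worker_build_id": UNVERSIONED,
      "region": "us-east-1", "task_priority": "1"}, 99.0),
]
result = collapse(samples)
```

Here the two versioned series differ only in `task_priority`, so they merge into a single value of 10.0, and the unversioned series never reaches the sum.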

Initial scaling-recommendations.md framed steady-state HPA reactivity as ~3:15, citing a "Temporal Cloud aggregation lag." That was wrong. The actual sample-age distribution on the OpenMetrics endpoint is:

- p50: 30s (matches ~1/min emission cadence, age oscillates 0-60s)
- p95: 50s
- p99: ~tail of occasional gateway-wide stalls

So typical end-to-end reactivity is ~85s (emission + scrape + HPA poll), not ~3:15. The 3-minute figures came from observations made during the occasional periods when the OpenMetrics gateway returns frozen timestamps across every series in the account simultaneously - those stalls are real but not steady-state.

Doc now:

- Replaces the 3:15 figure with empirically-derived ~85s typical.
- Adds a "Gateway-wide stalls" caveat describing the frozen-timestamp behavior observationally (no speculation about cause).
- Keeps the metricsRelistInterval: 5m recommendation, now justified by the need to exceed stall duration rather than the misattributed "aggregation lag."
- Demo README updated to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
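
The reasoning behind keeping `metricsRelistInterval: 5m` can be sketched as a one-line check. This is a deliberately simplified model of the argument in this thread (a series stays discoverable only while its newest sample is younger than the relist window), not a description of prometheus-adapter internals. The `stays_discoverable` helper and the 1-minute comparison interval are assumptions; the 239-second staleness is the figure quoted elsewhere in this PR.

```python
def stays_discoverable(newest_sample_age_s: float, relist_interval_s: float) -> bool:
    """Simplified model: the series is rediscovered only if its newest
    sample falls inside the relist window."""
    return newest_sample_age_s < relist_interval_s


STALL_AGE_S = 239  # staleness reported during the observed stall

with_5m = stays_discoverable(STALL_AGE_S, 5 * 60)  # recommended setting
with_1m = stays_discoverable(STALL_AGE_S, 60)      # hypothetical shorter interval
```

Under this model the 5-minute window rides through the observed stall, while a 1-minute window would let the metric drop out and wait for a later relist.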

Earlier wording implied multiple stall events ("occasional periods") when we have only directly characterized one such event during this investigation. Reword to describe exactly what was seen, note that frequency is not yet known, and that the behavior is open with the Observability team.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Verified directly: across a 3-hour window including one of the observed "stall" events, every gap between consecutive sample timestamps in Prometheus's storage is exactly 60 seconds. So the OpenMetrics endpoint isn't dropping or freezing emissions - it's delivering them late, in bursts after a delay, with their original minute-aligned timestamps.

The retrospective record looks complete (good for dashboards), but live HPA consumers see the delay as real staleness because they query the latest available timestamp at decision time.

Reframe the caveat in the scaling doc and demo README accordingly. Also note we observed two such delay events in ~2 hours of close observation - frequency in normal operation is still open with the Observability team.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
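
The difference between the retrospective record and what a live consumer sees can be illustrated with a small Python sketch. The timeline below is hypothetical (the delivery times and the 330-second burst are invented for the example); only the 60-second sample spacing reflects what this commit reports. The `gaps` and `latest_visible_age` helpers are names chosen for this sketch.

```python
def gaps(timestamps):
    """Differences between consecutive sample timestamps."""
    return [b - a for a, b in zip(timestamps, timestamps[1:])]


def latest_visible_age(now, deliveries):
    """Age of the newest sample that has actually been delivered by `now`.

    `deliveries` is a list of (sample_timestamp, delivered_at) pairs.
    """
    visible = [ts for ts, delivered_at in deliveries if delivered_at <= now]
    return now - max(visible)


# Hypothetical timeline: samples are stamped every 60s, but those stamped
# 120..300 are held back and delivered together in a burst at t=330.
deliveries = [(0, 0), (60, 60), (120, 330), (180, 330), (240, 330), (300, 330)]

stored = sorted(ts for ts, _ in deliveries)
all_gaps = gaps(stored)                               # retrospective view
age_during_delay = latest_visible_age(320, deliveries)  # live view, mid-delay
age_after_burst = latest_visible_age(330, deliveries)   # live view, after burst
```

In storage every gap is 60 seconds, yet a consumer querying at t=320 only sees the sample stamped 60, which is 260 seconds old. Right after the burst the newest sample is 30 seconds old again.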

jaypipes

Thanks @carlydf, I've done a first go-around reviewing this documentation and added (quite a few) suggested changes and removals to "de-Claude" some of it and make it (hopefully) a bit more readable for a general audience.

jaypipes · 2026-05-15T10:02:06Z

 ### [Ownership](manager-identity.md)
 How the controller gets permission to manage a Worker Deployment, how a human client can take or give back control.

+### [Scaling Recommendations](scaling-recommendations.md)


Recommend adding this new doc to the list here.

jaypipes · 2026-05-15T11:09:27Z

+                      └─ first replica added
+```
+
+**Typical end-to-end reactivity is ≈ 85 seconds + your stabilization window.** Empirically, sample age in Prometheus for a single series follows a sawtooth between 0 and 60 seconds (matching the gateway's ~1/min emission cadence). p50 sample age ≈ 30s, p95 ≈ 50s. The 60-second emission cadence is the inherent floor — smaller scrape intervals, tighter `metricsRelistInterval`, or recording rules cannot improve it because they all consume the same upstream cadence.


What is "your stabilization window"? That is not defined anywhere and this reads like Claude has seen the phrase "stabilization window" in other docs that it scraped from the Internet that described auto-scaling algorithms but doesn't actually understand what "stabilization window" means.

Also, "Typical end-to-end reactivity" doesn't make sense here and sounds like a term Claude either made up or has hallucinated-adopted from the term "end-to-end reactivity" from frontend software development patterns.

Sorry about any Claude-isms, the reason this is in Draft mode is because I hadn't done a full pass over it yet. I care deeply about not putting PRs in front of people that I haven't reviewed myself and don't necessarily endorse any of this until it's out of draft mode.

Claude probably came up with this after seeing a bunch of grafana screenshots that I sent it.
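
For readers trying to follow the sample-age numbers in the quoted paragraph, the sawtooth can be sketched in Python. This is an idealized model that assumes samples are emitted exactly every 60 seconds and are visible immediately; it only reproduces the roughly 30-second median and is not fitted to the measured p95. The `sample_age` helper and the one-observation-per-second grid are assumptions for the sketch.

```python
from statistics import median

EMISSION_PERIOD_S = 60  # ~1/min emission cadence from the quoted paragraph


def sample_age(observed_at_s: int) -> int:
    """Age of the newest sample when samples are stamped at t = 0, 60, 120, ..."""
    return observed_at_s % EMISSION_PERIOD_S


# Observe once per second for an hour (hypothetical observation grid).
ages = [sample_age(t) for t in range(0, 3600)]
p50 = median(ages)
```

The ages cycle from 0 up to 59 and reset, and their median comes out at 29.5 seconds, which is consistent with the p50 of about 30 seconds cited above.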

jaypipes · 2026-05-15T11:27:52Z

+### Slot utilization is a much faster leading signal
+
+`temporal_slot_utilization` is emitted directly by worker pods (no Temporal Cloud aggregation), scraped at the ServiceMonitor interval (~10–30 s), and reflects current state. It also rises *before* backlog accumulates — slots saturate first, then queueing starts. So a two-metric HPA with both slot util and backlog gives you fast scale-up via slot util and a backlog-driven backstop.
+
+The demo HPA uses both. For production scaling we recommend keeping both as well.


Recommend removing this. I've pulled some of the content into the proposed new ## HPA strengths and ## HPA scaling signal sections

jaypipes · 2026-05-15T11:35:43Z

+### Caveat: gateway delivery delay
+
+During our investigation we observed periods of several minutes during which Temporal Cloud's OpenMetrics endpoint returned the same embedded timestamps on repeated scrapes for *every* series across the account simultaneously — backlog series, action counts, error counts, every queue, every namespace, all showing identical staleness to the second (e.g. all ~30 visible series reading 239s old at once). The Prometheus scrape continued to succeed (`up{job="temporal_cloud"}` stayed 1, HTTP 200 responses) — the response body simply repeated already-known samples instead of advancing.
+
+Once the delay resolved, the gateway delivered the missing samples with their original minute-aligned timestamps in a burst, so Prometheus's storage ends up with a complete 1/minute series in retrospect. We verified this directly: across a 3-hour window covering one such delay event, every gap between consecutive sample timestamps was exactly 60 seconds, no exceptions.
+
+The retrospective completeness is helpful for dashboards and post-hoc analysis, but it does **not** help an HPA, which queries the *latest available* value at decision time. During a delivery delay, the latest available sample is the one from before the delay started. The HPA sees real staleness even though the underlying record will eventually be filled in.
+
+We have only directly characterized this pattern during one investigation session (seeing it twice in ~2 hours of close observation). Frequency in normal operation is not yet known and is open with Temporal's Observability team. If your workload cannot tolerate occasional multi-minute scaling pauses, prefer KEDA.
+
+This is also why `metricsRelistInterval: 5m` is the recommended setting: the discovery window must comfortably exceed the longest expected delay so the metric does not deregister, otherwise re-registration waits up to one more relist cycle after delivery resumes.


Recommend removing this section. I've moved some of the content and reworded it in the ## HPA limitations section below.

02strich · 2026-05-19T22:49:58Z

+
+### Caveat: gateway delivery delay
+
+During our investigation we observed periods of several minutes during which Temporal Cloud's OpenMetrics endpoint returned the same embedded timestamps on repeated scrapes for *every* series across the account simultaneously — backlog series, action counts, error counts, every queue, every namespace, all showing identical staleness to the second (e.g. all ~30 visible series reading 239s old at once). The Prometheus scrape continued to succeed (`up{job="temporal_cloud"}` stayed 1, HTTP 200 responses) — the response body simply repeated already-known samples instead of advancing.


this reads odd given that the project is from the same people as Temporal Cloud - can this be reworded?

02strich · 2026-05-19T22:50:41Z

+
+The retrospective completeness is helpful for dashboards and post-hoc analysis, but it does **not** help an HPA, which queries the *latest available* value at decision time. During a delivery delay, the latest available sample is the one from before the delay started. The HPA sees real staleness even though the underlying record will eventually be filled in.
+
+We have only directly characterized this pattern during one investigation session (seeing it twice in ~2 hours of close observation). Frequency in normal operation is not yet known and is open with Temporal's Observability team. If your workload cannot tolerate occasional multi-minute scaling pauses, prefer KEDA.


this sounds like sharing internal sausage making - which as a customer I am not sure what to take away from

Right, which is why I recommended in my review that this entire section be removed :)

02strich · 2026-05-19T22:52:06Z

+
+In a two-metric HPA configured with slot utilization, this is mostly fine: the HPA reports `ScalingActive=True` based on slot utilization while backlog is unavailable, and rejoins backlog scaling once it returns. We've confirmed this empirically in this demo cluster — the HPA continued scaling correctly on slot utilization through 1000+ backlog `FailedGetExternalMetric` events.
+
+## Why this demo does not use a backlog recording rule


which demo? "this" is an unclear reference

I recommended removing this entire section
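
The two-metric behavior discussed in this thread follows from how the Kubernetes HPA combines metrics: it computes a desired replica count per metric, takes the largest, and skips a scale-down if any metric could not be fetched. A minimal Python sketch of that rule is below. It omits tolerance, stabilization, and min/max bounds, and the metric names, values, and targets are hypothetical.

```python
import math


def desired_for_metric(current_replicas: int, value: float, target: float) -> int:
    """Per-metric proposal: ceil(currentReplicas * currentValue / targetValue)."""
    return math.ceil(current_replicas * value / target)


def desired_replicas(current_replicas: int, metrics: dict) -> int:
    """`metrics` maps name -> (value, target); value is None if the fetch failed."""
    proposals = [
        desired_for_metric(current_replicas, value, target)
        for value, target in metrics.values()
        if value is not None
    ]
    if not proposals:
        return current_replicas  # nothing usable, so hold steady
    proposal = max(proposals)  # the largest proposal wins
    any_missing = any(value is None for value, _ in metrics.values())
    if any_missing and proposal < current_replicas:
        return current_replicas  # scale-down is skipped when a metric is missing
    return proposal


# Hypothetical numbers: slot utilization in percent, backlog as a task count.
scale_up = desired_replicas(4, {"slot_utilization": (90, 60), "backlog": (None, 100)})
held = desired_replicas(4, {"slot_utilization": (30, 60), "backlog": (None, 100)})
both = desired_replicas(4, {"slot_utilization": (30, 60), "backlog": (150, 100)})
```

With backlog unavailable, high slot utilization still scales 4 replicas up to 6, while low slot utilization alone does not scale down. When both metrics are present, the larger backlog-driven proposal of 6 wins.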

carlydf · 2026-05-28T00:28:05Z

+
+## When KEDA hits its own limits
+
+KEDA bypasses the metric pipeline but uses Temporal API calls, which are subject to a per-namespace rate limit:


Other KEDA limitation (for now) #355

KEDA also does not actually work with TWC until #286 is closed. Luckily we have an open community PR #351 that will add support for the KEDA temporal trigger, which I think can be merged soon. I just reviewed it.

Co-authored-by: Jay Pipes <jaypipes@gmail.com> Co-authored-by: Stefan Richter <stefan@02strich.de>

Removes a bunch of overly verbose Claude-generated stuff that will likely confuse readers. Reworded a few places where Claude was using some odd terminology -- e.g. "typical end-to-end reactivity" -- to use more straightforward verbiage. Added a brief WRT example HPA template that shows the stabilization window that is referred to in multiple sections of the doc. Signed-off-by: Jay Pipes <jay.pipes@temporal.io>

Shivs11

a couple of nits -- looks good to me otherwise

Shivs11 · 2026-06-03T16:00:04Z

+
+> **Note**: This is why `metricsRelistInterval: 5m` is the recommended setting: the discovery window must comfortably exceed the longest expected delay so the metric does not deregister, otherwise re-registration waits up to one more relist cycle after delivery resumes.
+
+HPA cannot scale your Worker Deployment from zero because the signal for scaling does not yet exist. The signal for scaling is the backlog metric for the task queue associated with the workers in the Worker Deployment. This metric will not exist until there is at least one worker polling the task queue.


I took about a second to understand what this really meant. At first, I thought it meant that no backlog metric is emitted if you don't have workers running at all (which is not true, since this metric is emitted for the unversioned world without workers being present).

I know you have clearly mentioned versions in the preamble here, but do you think we can be extra clear and mention that the per-version backlog count is not emitted without a worker being present, since a worker is what creates a version in Temporal?

Shivs11 · 2026-06-03T17:30:48Z

+
+HPA + prometheus adapter configured to look at both slot util and backlog provides fast scale-up via slot util and a backlog-driven backstop to prevent overly reactive replica count adjustment.
+
+## HPA limitations


this is great! I was wondering - should we also mention that our OM endpoint inherently has a 3-minute time lag, which would mean that at time=0 we are seeing a one-minute aggregate from time.now() - 3 minutes?

carlydf requested review from a team and jlegrone as code owners May 14, 2026 01:52

carlydf marked this pull request as draft May 14, 2026 02:03

carlydf and others added 3 commits May 13, 2026 19:44

jaypipes requested changes May 15, 2026

02strich reviewed May 19, 2026

carlydf commented May 28, 2026

carlydf and others added 2 commits May 27, 2026 17:28

Apply suggestions from code review

4f72016

Co-authored-by: Jay Pipes <jaypipes@gmail.com> Co-authored-by: Stefan Richter <stefan@02strich.de>

jaypipes marked this pull request as ready for review June 1, 2026 16:35

jaypipes changed the title ~~Drop backlog recording rule; consume raw temporal_cloud_v1_approximate_backlog_count~~ Add docs recommending autoscaling setup Jun 3, 2026

Shivs11 approved these changes Jun 3, 2026
