Add docs recommending autoscaling setup#324

Open: carlydf wants to merge 6 commits into main from demo-ga-no-recording-rule

Collaborator

@carlydf carlydf commented May 14, 2026

Adds documentation outlining the tradeoffs between two autoscaling solutions:

  1. HPA+prometheus adapter
  2. KEDA Temporal Scaler

Documentation focuses on straightforward descriptions of the pros and cons of each solution.

The backlog metric pipeline goes from prometheus-adapter directly to the
raw temporal_cloud_v1_approximate_backlog_count series, eliminating the
temporal_approximate_backlog_count recording rule. Adapter rule:

- seriesQuery filters out temporal_worker_build_id="__unversioned__" so
  discovery doesn't choke on the 5000+ unversioned series in typical
  accounts.
- metricsQuery sum(...) collapses labels the HPA doesn't select on at
  query time (instance/job/region/task_priority/temporal_account).
- metricsRelistInterval is bumped to 5m to accommodate the ~3-minute
  embedded-timestamp lag in Temporal Cloud's OpenMetrics emission.
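
As a rough illustration of the three settings above, a prometheus-adapter Helm values fragment might look like the sketch below. Only `seriesQuery`, `metricsQuery`, `metricsRelistInterval`, the series name, and the `temporal_worker_build_id` filter come from this description. The exposed metric name `temporal_backlog_count`, the `namespaced: false` setting, and the exact shape of the `sum(...)` are assumptions, so treat the values file in this PR as the source of truth.

```yaml
# Sketch only: Helm values for the prometheus-adapter chart.
# Relist interval of 5m, as described above.
metricsRelistInterval: 5m

rules:
  external:
    # Discover only versioned series; skip the "__unversioned__" build ID.
    - seriesQuery: 'temporal_cloud_v1_approximate_backlog_count{temporal_worker_build_id!="__unversioned__"}'
      resources:
        # Assumption: the metric is not tied to a Kubernetes namespace.
        namespaced: false
      name:
        matches: "^temporal_cloud_v1_approximate_backlog_count$"
        # Hypothetical name exposed through the external metrics API.
        as: "temporal_backlog_count"
      # Sum away labels the HPA selector does not match on
      # (instance, job, region, task_priority, temporal_account).
      metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>})'
```

With a rule shaped like this, the labels in the HPA's metric selector are substituted into `<<.LabelMatchers>>`, and every other label is collapsed by the `sum`.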

WRT example, prometheus-stack-values, and demo README are updated to
match. Add docs/scaling-recommendations.md covering the empirically
measured reactivity model (steady-state ~3:15 dominated by Cloud
aggregation lag), task-queue-unload behavior, scale-from-zero limits,
and when to pick KEDA over the metric path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@carlydf carlydf requested review from a team and jlegrone as code owners May 14, 2026 01:52
@carlydf carlydf marked this pull request as draft May 14, 2026 02:03
carlydf and others added 3 commits May 13, 2026 19:44
Initial scaling-recommendations.md framed steady-state HPA reactivity as
~3:15, citing a "Temporal Cloud aggregation lag." That was wrong. The
actual sample-age distribution on the OpenMetrics endpoint is:

  p50  30s  (matches ~1/min emission cadence, age oscillates 0-60s)
  p95  50s
  p99  ~tail of occasional gateway-wide stalls

So typical end-to-end reactivity is ~85s (emission + scrape + HPA poll),
not ~3:15. The 3-minute figures came from observations made during the
occasional periods when the OpenMetrics gateway returns frozen
timestamps across every series in the account simultaneously - those
stalls are real but not steady-state.

Doc now:
- Replaces the 3:15 figure with empirically-derived ~85s typical.
- Adds a "Gateway-wide stalls" caveat describing the frozen-timestamp
  behavior observationally (no speculation about cause).
- Keeps the metricsRelistInterval: 5m recommendation, now justified by
  the need to exceed stall duration rather than the misattributed
  "aggregation lag."
- Demo README updated to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Earlier wording implied multiple stall events ("occasional periods")
when we have only directly characterized one such event during this
investigation. Reword to describe exactly what was seen, note that
frequency is not yet known, and that the behavior is open with the
Observability team.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verified directly: across a 3-hour window including one of the observed
"stall" events, every gap between consecutive sample timestamps in
Prometheus's storage is exactly 60 seconds. So the OpenMetrics endpoint
isn't dropping or freezing emissions - it's delivering them late, in
bursts after a delay, with their original minute-aligned timestamps.

The retrospective record looks complete (good for dashboards), but live
HPA consumers see the delay as real staleness because they query the
latest available timestamp at decision time. Reframe the caveat in the
scaling doc and demo README accordingly.

Also note we observed two such delay events in ~2 hours of close
observation - frequency in normal operation is still open with the
Observability team.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

@jaypipes jaypipes left a comment

Thanks @carlydf, I've done a first go-around reviewing this documentation and adding (quite a few) suggested changes and removals to "de-Claude" some of it and make it (hopefully) a bit more readable for a general audience.

Comment thread docs/README.md
### [Ownership](manager-identity.md)
How the controller gets permission to manage a Worker Deployment, how a human client can take or give back control.

### [Scaling Recommendations](scaling-recommendations.md)
Contributor

Recommend adding this new doc to the list here.

Comment thread docs/scaling-recommendations.md Outdated
└─ first replica added
```

**Typical end-to-end reactivity is ≈ 85 seconds + your stabilization window.** Empirically, sample age in Prometheus for a single series follows a sawtooth between 0 and 60 seconds (matching the gateway's ~1/min emission cadence). p50 sample age ≈ 30s, p95 ≈ 50s. The 60-second emission cadence is the inherent floor — smaller scrape intervals, tighter `metricsRelistInterval`, or recording rules cannot improve it because they all consume the same upstream cadence.
Contributor

What is "your stabilization window"? That is not defined anywhere and this reads like Claude has seen the phrase "stabilization window" in other docs that it scraped from the Internet that described auto-scaling algorithms but doesn't actually understand what "stabilization window" means.

Also, "Typical end-to-end reactivity" doesn't make sense here. It sounds like a term Claude either made up or borrowed from "end-to-end reactivity" in frontend software development patterns.

Collaborator Author

@carlydf carlydf May 27, 2026

Sorry about any Claude-isms. This is in Draft mode because I hadn't done a full pass over it yet. I care deeply about not putting PRs in front of people that I haven't reviewed myself, and I don't necessarily endorse any of this until it's out of draft mode.

Claude probably came up with this after seeing a bunch of grafana screenshots that I sent it.

Comment thread docs/scaling-recommendations.md Outdated
Comment on lines +44 to +48
### Slot utilization is a much faster leading signal

`temporal_slot_utilization` is emitted directly by worker pods (no Temporal Cloud aggregation), scraped at the ServiceMonitor interval (~10–30 s), and reflects current state. It also rises *before* backlog accumulates — slots saturate first, then queueing starts. So a two-metric HPA with both slot util and backlog gives you fast scale-up via slot util and a backlog-driven backstop.

The demo HPA uses both. For production scaling we recommend keeping both as well.
Contributor

Recommend removing this. I've pulled some of the content into the proposed new ## HPA strengths and ## HPA scaling signal sections

Comment thread docs/scaling-recommendations.md Outdated
Comment on lines +32 to +42
### Caveat: gateway delivery delay

During our investigation we observed periods of several minutes during which Temporal Cloud's OpenMetrics endpoint returned the same embedded timestamps on repeated scrapes for *every* series across the account simultaneously — backlog series, action counts, error counts, every queue, every namespace, all showing identical staleness to the second (e.g. all ~30 visible series reading 239s old at once). The Prometheus scrape continued to succeed (`up{job="temporal_cloud"}` stayed 1, HTTP 200 responses) — the response body simply repeated already-known samples instead of advancing.

Once the delay resolved, the gateway delivered the missing samples with their original minute-aligned timestamps in a burst, so Prometheus's storage ends up with a complete 1/minute series in retrospect. We verified this directly: across a 3-hour window covering one such delay event, every gap between consecutive sample timestamps was exactly 60 seconds, no exceptions.

The retrospective completeness is helpful for dashboards and post-hoc analysis, but it does **not** help an HPA, which queries the *latest available* value at decision time. During a delivery delay, the latest available sample is the one from before the delay started. The HPA sees real staleness even though the underlying record will eventually be filled in.

We have only directly characterized this pattern during one investigation session (seeing it twice in ~2 hours of close observation). Frequency in normal operation is not yet known and is open with Temporal's Observability team. If your workload cannot tolerate occasional multi-minute scaling pauses, prefer KEDA.

This is also why `metricsRelistInterval: 5m` is the recommended setting: the discovery window must comfortably exceed the longest expected delay so the metric does not deregister, otherwise re-registration waits up to one more relist cycle after delivery resumes.
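
One way to observe this staleness from inside Prometheus is to compare the newest sample timestamp with the current time. The rule below is a sketch in the standard Prometheus rule-file format. The group name, alert name, 180-second threshold, and `for` duration are all hypothetical, and it assumes the scrape job keeps the endpoint's embedded timestamps (the default `honor_timestamps: true`).

```yaml
groups:
  - name: temporal-cloud-sample-age              # hypothetical group name
    rules:
      - alert: TemporalCloudBacklogSampleStale   # hypothetical alert name
        # Age, in seconds, of the newest backlog sample across all series.
        expr: time() - max(timestamp(temporal_cloud_v1_approximate_backlog_count)) > 180
        for: 1m
        annotations:
          summary: "Newest Temporal Cloud backlog sample is more than 3 minutes old"
```

Samples older than Prometheus's lookback window (5 minutes by default) drop out of instant queries, so an expression like this only catches delays shorter than that window.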
Contributor

Recommend removing this section. I've moved some of the content and reworded it in the ## HPA limitations section below.

Comment thread docs/scaling-recommendations.md Outdated

### Caveat: gateway delivery delay

During our investigation we observed periods of several minutes during which Temporal Cloud's OpenMetrics endpoint returned the same embedded timestamps on repeated scrapes for *every* series across the account simultaneously — backlog series, action counts, error counts, every queue, every namespace, all showing identical staleness to the second (e.g. all ~30 visible series reading 239s old at once). The Prometheus scrape continued to succeed (`up{job="temporal_cloud"}` stayed 1, HTTP 200 responses) — the response body simply repeated already-known samples instead of advancing.

this reads odd given that the project is from the same people as Temporal Cloud - can this be reworded?

Comment thread docs/scaling-recommendations.md Outdated

The retrospective completeness is helpful for dashboards and post-hoc analysis, but it does **not** help an HPA, which queries the *latest available* value at decision time. During a delivery delay, the latest available sample is the one from before the delay started. The HPA sees real staleness even though the underlying record will eventually be filled in.

We have only directly characterized this pattern during one investigation session (seeing it twice in ~2 hours of close observation). Frequency in normal operation is not yet known and is open with Temporal's Observability team. If your workload cannot tolerate occasional multi-minute scaling pauses, prefer KEDA.

This sounds like sharing internal sausage-making; as a customer, I am not sure what to take away from it.

Contributor

Right, which is why I recommended in my review that this entire section be removed :)

Comment thread docs/scaling-recommendations.md Outdated

In a two-metric HPA configured with slot utilization, this is mostly fine: the HPA reports `ScalingActive=True` based on slot utilization while backlog is unavailable, and rejoins backlog scaling once it returns. We've confirmed this empirically in this demo cluster — the HPA continued scaling correctly on slot utilization through 1000+ backlog `FailedGetExternalMetric` events.

## Why this demo does not use a backlog recording rule

which demo? "this" is an unclear reference

Contributor

I recommended removing this entire section.

Comment thread docs/scaling-recommendations.md Outdated

## When KEDA hits its own limits

KEDA bypasses the metric pipeline but uses Temporal API calls, which are subject to a per-namespace rate limit:
Collaborator Author

Other KEDA limitation (for now) #355

KEDA also does not actually work with TWC until #286 is closed. Luckily we have an open community PR #351 that will add support for the KEDA temporal trigger, which I think can be merged soon. I just reviewed it.

carlydf and others added 2 commits May 27, 2026 17:28
Co-authored-by: Jay Pipes <jaypipes@gmail.com>
Co-authored-by: Stefan Richter <stefan@02strich.de>
Removes a bunch of overly verbose Claude-generated stuff that will
likely confuse readers. Reworded a few places where Claude was using
some odd terminology -- e.g. "typical end-to-end reactivity" -- to use
more straightforward verbiage. Added a brief WRT example HPA template
that shows the stabilization window that is referred to in multiple
sections of the doc.

Signed-off-by: Jay Pipes <jay.pipes@temporal.io>
@jaypipes jaypipes marked this pull request as ready for review June 1, 2026 16:35
@jaypipes jaypipes changed the title Drop backlog recording rule; consume raw temporal_cloud_v1_approximate_backlog_count Add docs recommending autoscaling setup Jun 3, 2026
Member

@Shivs11 Shivs11 left a comment

A couple of nits; looks good to me otherwise.


> **Note**: This is why `metricsRelistInterval: 5m` is the recommended setting: the discovery window must comfortably exceed the longest expected delay so the metric does not deregister, otherwise re-registration waits up to one more relist cycle after delivery resumes.

HPA cannot scale your Worker Deployment from zero because the signal for scaling does not yet exist. The signal for scaling is the backlog metric for the task queue associated with the workers in the Worker Deployment. This metric will not exist until there is at least one worker polling the task queue.
Member

i took about a second to understand what this really meant - at first, I thought this meant that there won't be a backlog metric emitted if you don't have workers running at all (which is not true since you do have this metric being emitted for the unversioned world without workers being present)

I know you have clearly mentioned versions in the preamble here, but do you think we can be extra clear and mention the backlog count per version is not emitted without a worker being present since that is what creates a version in temporal?


HPA + prometheus adapter configured to look at both slot util and backlog provides fast scale-up via slot util and a backlog-driven backstop to prevent overly reactive replica count adjustment.
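
A two-metric HPA along these lines could look like the following sketch. The `autoscaling/v2` fields are standard Kubernetes. The object names, replica bounds, target values, the external metric name `temporal_backlog_count`, the label selector value, and exposing slot utilization as an External metric are all assumptions made for illustration, not the demo's actual template.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-worker                # hypothetical
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-worker              # hypothetical
  minReplicas: 1                      # HPA cannot scale up from zero
  maxReplicas: 10                     # hypothetical
  metrics:
    # Fast signal: worker slot utilization (assumed to be served as an External metric).
    - type: External
      external:
        metric:
          name: temporal_slot_utilization
        target:
          type: Value
          value: "750m"               # hypothetical target of 0.75
    # Slower backstop: versioned backlog count from Temporal Cloud.
    - type: External
      external:
        metric:
          name: temporal_backlog_count            # hypothetical adapter-exposed name
          selector:
            matchLabels:
              temporal_worker_build_id: "example-build-id"   # hypothetical value
        target:
          type: AverageValue
          averageValue: "10"          # hypothetical backlog per replica
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0   # act on scale-up recommendations immediately
    scaleDown:
      stabilizationWindowSeconds: 300 # use the highest recommendation from the last 5 minutes
```

When several metrics are listed, the HPA computes a desired replica count for each and uses the largest. The stabilization windows shown are the Kubernetes defaults, written out explicitly so the values are visible.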

## HPA limitations
Member

this is great! was wondering - should we also mention that our OM endpoint inherently has a 3 minute time lag, which would mean that at time=0, we are seeing a one minute aggregate of time.now() - 3 minutes
