Document known accuracy limitations for ApproximateBacklogCount #4392

Open
dustin-temporal wants to merge 1 commit into main from dustin/backlog-accuracy-limitations
dustin-temporal commented Apr 6, 2026

Summary

  • Adds a "Known accuracy limitations" section to the worker performance page explaining two scenarios where ApproximateBacklogCount can be temporarily inaccurate:
    • Overcount from cancelled Workflows - pending Tasks from cancelled Workflows inflate the count until they expire
    • Reset to zero on partition unload - idle Task Queues (no pollers, no new Tasks for ~5 min) lose their count when the partition is unloaded
  • Consolidates the existing sticky queue caveat into the same section
  • Adds a cross-reference note to the Cloud metrics reference (temporal_cloud_v1_approximate_backlog_count)

Context

These limitations were identified during internal discussion with the matching team (Shahab, Nish). Both behaviors are explainable and expected, but not currently documented - leading to customer confusion.

Test plan

  • Verify the anchor link #backlog-accuracy-limitations works from the Cloud metrics reference page
  • Review rendering of the :::note admonition in the metrics reference

🤖 Generated with Claude Code

┆Attachments: EDU-6164 Document known accuracy limitations for ApproximateBacklogCount

Adds a section explaining two known scenarios where the backlog count
can be temporarily inaccurate: overcount from cancelled workflows and
reset to zero on partition unload after ~5 minutes of inactivity.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
vercel bot commented Apr 6, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| temporal-documentation | Ready | Preview, Comment | Apr 6, 2026 8:12pm |

dustin-temporal requested a review from ShahabT April 6, 2026 20:13
dustin-temporal marked this pull request as ready for review April 6, 2026 20:14
dustin-temporal requested a review from a team as a code owner April 6, 2026 20:14

The approximate number of tasks pending in a task queue. Started Activities are not included in the count as they have been dequeued from the task queue.

:::note Known accuracy limitations
This metric may temporarily overcount due to cancelled Workflow Tasks that haven't yet expired, and may reset to zero if no Workers poll and no Tasks are added for approximately 5 minutes (due to partition unload).

Nits:

  • It's not only canceled/terminated Workflows; it's also expired Tasks, say for expired Workflows (especially in the new matcher).
  • Instead of "partition unload" we can just say "task queue unload", because users should not be exposed to the partition concept.

I'm not sure what is meant by "... tasks that haven't yet expired". The idea is that after Tasks expire or become invalid, they are still counted because they haven't been processed and discarded yet.

These values are approximate and may be temporarily inaccurate in the following scenarios:

- **Overcount from cancelled Workflows**: When a Workflow is cancelled, its pending Tasks may still be counted in the backlog until they expire.
Workers process Tasks in order within a partition, so valid Tasks _can_ be blocked behind invalid (expired) Tasks, but invalid Tasks cannot block other invalid Tasks.

  • Again, the partition concept doesn't have much bearing here, so maybe we can omit it.
  • "valid Tasks can be blocked behind invalid (expired) Tasks" -> this is the other way around: valid Tasks, until dispatched, can block invalid Tasks. Once an invalid Task is at the front of the queue, we remove it quickly, so invalid Tasks do not generally block other invalid or valid Tasks. A valid Task may stay at the front of the queue longer if there are not enough Workers.
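
To make the ordering in this comment concrete, here is a toy Python model. It is only a sketch of the behavior described above, not the matching service's implementation: `Task`, `ToyTaskQueue`, `sweep_front`, and `poll` are hypothetical names, and the model assumes the backlog count is simply the number of Tasks still sitting in the queue.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Task:
    task_id: int
    valid: bool  # False stands for "invalid or expired"


class ToyTaskQueue:
    """Hypothetical FIFO model; not the real matching service."""

    def __init__(self, tasks):
        self._tasks = deque(tasks)

    def approximate_backlog_count(self):
        # Assumption: every Task still in the queue is counted, valid or not.
        return len(self._tasks)

    def valid_count(self):
        return sum(1 for t in self._tasks if t.valid)

    def sweep_front(self):
        # Invalid Tasks are discarded only once they reach the front.
        while self._tasks and not self._tasks[0].valid:
            self._tasks.popleft()

    def poll(self):
        # A Worker poll dispatches the valid Task at the front, if any.
        self.sweep_front()
        task = self._tasks.popleft() if self._tasks else None
        self.sweep_front()
        return task


queue = ToyTaskQueue(
    [Task(1, True), Task(2, False), Task(3, False), Task(4, True)]
)
queue.sweep_front()  # front Task is valid, so nothing can be discarded yet
print(queue.approximate_backlog_count(), queue.valid_count())
dispatched = queue.poll()  # a Worker takes Task 1; Tasks 2 and 3 are then swept
print(dispatched.task_id, queue.approximate_backlog_count())
```

In this model the count stays inflated only while a valid Task waits at the front with no Worker to take it, which matches the point that valid Tasks hold up invalid ones and not the reverse.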

Also: "invalid" (Workflow is closed, or Activity is canceled, or various other things) and "expired" (due to timeout) are technically different internally, but I think it's good to just bundle them together for this purpose. We should be consistent, though: maybe use "invalid" everywhere, or "invalid or expired"?

- **Overcount from cancelled Workflows**: When a Workflow is cancelled, its pending Tasks may still be counted in the backlog until they expire.
Workers process Tasks in order within a partition, so valid Tasks _can_ be blocked behind invalid (expired) Tasks, but invalid Tasks cannot block other invalid Tasks.
The count eventually converges to the correct value as expired Tasks are cleared.
- **Reset to zero on partition unload**: If no Workers are polling a Task Queue and no new Tasks are added for approximately 5 minutes, the Temporal Service unloads the Task Queue partition from memory.

> no Workers are polling a Task Queue and no new Tasks are added

Other task queue activities such as DescribeTaskQueue or UpdateTaskQueueConfig will also keep the task queue loaded.
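
As a rough illustration of the idle-unload behavior under discussion, here is a toy Python model. It is a sketch under stated assumptions: `ToyLoadedQueue`, `record_activity`, and `reported_backlog` are hypothetical names, the 5 minute threshold is the approximate figure quoted in the PR text, and what happens when the task queue is loaded again is deliberately left out because the thread does not describe it.

```python
IDLE_UNLOAD_SECONDS = 5 * 60  # approximate figure from the PR text


class ToyLoadedQueue:
    """Hypothetical model of idle unload; not the real matching service."""

    def __init__(self, backlog, now=0.0):
        self._backlog = backlog
        self._last_activity = now
        self._loaded = True

    def record_activity(self, now):
        # Stands for any activity that keeps the queue loaded: a Worker poll,
        # a new Task, DescribeTaskQueue, or UpdateTaskQueueConfig.
        # Reloading an unloaded queue is out of scope for this sketch.
        if self._loaded:
            self._last_activity = now

    def reported_backlog(self, now):
        # Assumption: once idle past the threshold, the queue is unloaded
        # and the metric reads zero even though Tasks may still be stored.
        if self._loaded and now - self._last_activity >= IDLE_UNLOAD_SECONDS:
            self._loaded = False
        return self._backlog if self._loaded else 0


queue = ToyLoadedQueue(backlog=7, now=0.0)
queue.record_activity(now=200.0)           # e.g. a DescribeTaskQueue call
print(queue.reported_backlog(now=450.0))   # 250 s idle: still loaded
print(queue.reported_backlog(now=500.0))   # 300 s idle: unloaded
```

The point of the sketch is that any of the listed activities, not only polls and new Tasks, pushes back the moment at which the reported count drops to zero.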

These values are approximate and may be temporarily inaccurate in the following scenarios:

- **Overcount from cancelled Workflows**: When a Workflow is cancelled, its pending Tasks may still be counted in the backlog until they expire.

Same here: I think "expire" is the wrong word.

Inaccuracies diminish as the backlog grows.
#### Known accuracy limitations {#backlog-accuracy-limitations}

These values are approximate and may be temporarily inaccurate in the following scenarios:

There's also all the discrepancy due to metadata updates being infrequent, and discrepancy due to database TTLs (Cassandra makes rows disappear and we don't see them). I think "may be temporarily inaccurate" promises too much: it suggests the inaccuracy is temporary and will go away, but that's not true at all.


- **Overcount from cancelled Workflows**: When a Workflow is cancelled, its pending Tasks may still be counted in the backlog until they expire.
Workers process Tasks in order within a partition, so valid Tasks _can_ be blocked behind invalid (expired) Tasks, but invalid Tasks cannot block other invalid Tasks.
The count eventually converges to the correct value as expired Tasks are cleared.

This feels like it's promising too much. Invalid Tasks will eventually be accounted for, but I don't think we can say the count will converge to the correct value, because there are other sources of discrepancy.

3 participants