Document known accuracy limitations for ApproximateBacklogCount #4392
dustin-temporal wants to merge 1 commit into main
Conversation
Adds a section explaining two known scenarios where the backlog count can be temporarily inaccurate: overcount from cancelled workflows and reset to zero on partition unload after ~5 minutes of inactivity.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| The approximate number of tasks pending in a task queue. Started Activities are not included in the count as they have been dequeued from the task queue.
| :::note Known accuracy limitations
| This metric may temporarily overcount due to cancelled Workflow Tasks that haven't yet expired, and may reset to zero if no Workers poll and no Tasks are added for approximately 5 minutes (due to partition unload).
Review comment:
Nits:
- It's not only canceled/terminated workflows; it's also expired tasks, say for expired workflows (especially in the new matcher).
- Instead of "partition unload" we can just say "task queue unload", because users should not be exposed to the partition concept.
Review comment:
I'm not sure what is meant by "... tasks that haven't yet expired". The idea is that when they expire or become invalid, they are still counted because they haven't been processed and discarded yet.
| These values are approximate and may be temporarily inaccurate in the following scenarios:
| - **Overcount from cancelled Workflows**: When a Workflow is cancelled, its pending Tasks may still be counted in the backlog until they expire.
| Workers process Tasks in order within a partition, so valid Tasks _can_ be blocked behind invalid (expired) Tasks, but invalid Tasks cannot block other invalid Tasks.
Review comment:
- Again, the partition concept doesn't have much bearing here, so maybe we can omit it.
- "valid Tasks can be blocked behind invalid (expired) Tasks": this is the other way around. Valid tasks, until dispatched, can block invalid tasks. Once an invalid task is at the front of the queue, we remove it quickly, so invalid tasks do not generally block other invalid or valid tasks. A valid task may stay longer at the front of the queue if there are not enough workers.
Review comment:
Also: "invalid" (workflow is closed, or activity is canceled, or various other things) and "expired" (due to timeout) are technically different internally, but I think it's good to just bundle them together for this purpose. We should be consistent, though: maybe use "invalid" everywhere, or "invalid or expired"?
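The ordering behavior described in this thread can be illustrated with a toy model. This is only a sketch under stated assumptions: `Task`, `ToyBacklog`, and all method names are hypothetical and are not Temporal's matching implementation. It assumes a single FIFO queue, bundles "invalid" and "expired" into one flag, and assumes invalid tasks are discarded as soon as they reach the front.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    valid: bool = True  # False models an "invalid or expired" task


class ToyBacklog:
    """Toy FIFO backlog (hypothetical; not Temporal's matching code)."""

    def __init__(self):
        self._queue = deque()

    def add(self, task):
        self._queue.append(task)

    def approximate_count(self):
        # Every queued task is counted, whether it is still valid or not.
        return len(self._queue)

    def _drop_invalid_at_front(self):
        # Assumption: invalid tasks are discarded once they reach the front.
        while self._queue and not self._queue[0].valid:
            self._queue.popleft()

    def invalidate(self, name):
        # Marking a task invalid does not remove it from the queue.
        for task in self._queue:
            if task.name == name:
                task.valid = False
        self._drop_invalid_at_front()

    def poll(self):
        """Dispatch the valid task at the front, if any, to a worker."""
        self._drop_invalid_at_front()
        task = self._queue.popleft() if self._queue else None
        # Invalid tasks that were waiting behind it are now at the front.
        self._drop_invalid_at_front()
        return task


if __name__ == "__main__":
    backlog = ToyBacklog()
    for name in ("A", "B", "C"):
        backlog.add(Task(name))
    backlog.invalidate("B")
    backlog.invalidate("C")
    print(backlog.approximate_count())  # B and C are still counted behind A
    backlog.poll()                      # a worker takes A
    print(backlog.approximate_count())  # B and C have been discarded
```

In this model the overcount lasts for as long as the valid task `A` waits for a worker, which matches the point above that valid tasks block invalid ones rather than the other way around.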
| - **Overcount from cancelled Workflows**: When a Workflow is cancelled, its pending Tasks may still be counted in the backlog until they expire.
| Workers process Tasks in order within a partition, so valid Tasks _can_ be blocked behind invalid (expired) Tasks, but invalid Tasks cannot block other invalid Tasks.
| The count eventually converges to the correct value as expired Tasks are cleared.
| - **Reset to zero on partition unload**: If no Workers are polling a Task Queue and no new Tasks are added for approximately 5 minutes, the Temporal Service unloads the Task Queue partition from memory.
Review comment:
"no Workers are polling a Task Queue and no new Tasks are added"
Other task queue operations, such as DescribeTaskQueue or UpdateTaskQueueConfig, will also keep the task queue loaded.
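To make the unload condition concrete, here is a minimal sketch of the idle rule as discussed in this thread. The function names, the dictionary of activity timestamps, and the exact 300-second threshold are assumptions for illustration (the PR text only says "approximately 5 minutes"); this is not the server's actual logic, and it does not model what the count looks like after the task queue is loaded again.

```python
# Assumption: the PR text says "approximately 5 minutes"; the exact value is made up here.
IDLE_UNLOAD_SECONDS = 5 * 60


def is_unloaded(last_activity_times, now, idle_seconds=IDLE_UNLOAD_SECONDS):
    """Return True if no recorded activity happened within idle_seconds of now.

    last_activity_times maps an activity kind (poll, add_task,
    DescribeTaskQueue, UpdateTaskQueueConfig, ...) to the time it last occurred.
    """
    return all(now - t >= idle_seconds for t in last_activity_times.values())


def reported_backlog(tracked_count, last_activity_times, now):
    """Hypothetical reported count: zero once the task queue has been unloaded."""
    if is_unloaded(last_activity_times, now):
        return 0
    return tracked_count


if __name__ == "__main__":
    times = {"poll": 0.0, "add_task": 10.0}
    print(reported_backlog(7, times, now=100.0))  # still loaded
    print(reported_backlog(7, times, now=310.0))  # idle long enough to unload
    times["DescribeTaskQueue"] = 250.0            # any activity keeps it loaded
    print(reported_backlog(7, times, now=310.0))
```

The point of the sketch is that polls and new tasks are not the only things that refresh the idle timer, so the documented condition would need to mention other task queue calls as well.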
| These values are approximate and may be temporarily inaccurate in the following scenarios:
| - **Overcount from cancelled Workflows**: When a Workflow is cancelled, its pending Tasks may still be counted in the backlog until they expire.
Review comment:
Same, I think "expire" is the wrong word here.
| Inaccuracies diminish as the backlog grows.
| #### Known accuracy limitations {#backlog-accuracy-limitations}
| These values are approximate and may be temporarily inaccurate in the following scenarios:
Review comment:
There's also all the discrepancy due to metadata updates being infrequent, and discrepancy due to database TTLs (Cassandra makes rows disappear and we don't see them). I think "may be temporarily inaccurate" is too strong: it suggests the inaccuracy is temporary and will go away, but that's not true at all.
| - **Overcount from cancelled Workflows**: When a Workflow is cancelled, its pending Tasks may still be counted in the backlog until they expire.
| Workers process Tasks in order within a partition, so valid Tasks _can_ be blocked behind invalid (expired) Tasks, but invalid Tasks cannot block other invalid Tasks.
| The count eventually converges to the correct value as expired Tasks are cleared.
Review comment:
This feels like it's promising too much. Invalid tasks will eventually be accounted for, but I don't think we can say the count will converge to the correct value; there are other sources of discrepancy.
Summary
[...] ApproximateBacklogCount can be temporarily inaccurate: [...] (temporal_cloud_v1_approximate_backlog_count)
Context
These limitations were identified during internal discussion with the matching team (Shahab, Nish). Both behaviors are explainable and expected, but not currently documented - leading to customer confusion.
Test plan
- [...] #backlog-accuracy-limitations works from the Cloud metrics reference page
- [...] :::note admonition in the metrics reference
🤖 Generated with Claude Code
┆Attachments: EDU-6164 Document known accuracy limitations for ApproximateBacklogCount