[Swarming] Push preprocess tasks to swarming queue#5282
Conversation
4993e5d to
99bf603
Compare
| from clusterfuzz._internal.metrics import logs | ||
|
|
||
| PREPROCESS_TARGET_SIZE_DEFAULT = 10000 | ||
| SWARMING_PREPROCESS_TARGET_SIZE_DEFAULT = 5 |
There was a problem hiding this comment.
The Swarming pool has a hard limit of 25 (LINUX) bots running 1 task each.
- At average 1 hour per fuzzing/swarming task, those 25 bots can finish 4 to 5 tasks every 10 minutes (the interval the cron job runs)
- Because the 2,000(in prod)
preprocesstworkers almost instantly process the preprocess queue, the target size acts as an injection rate more than a buffer.
So, Injecting 5 tasks every 10 minutes matches the expected Swarming rate, preventing an infinitely growing backlog of stale tasks. This is still the default value, the real value is managed trough a feature flag, we will later tweak this feature flag based on metrics & how swarming handled this workload, so that we have a more acqurate value
There was a problem hiding this comment.
I think we should prevent the infinitely-growing queue of tasks using CountTasks and a pretty low limit on the utask_main queue size, such that we can aim for a bit more saturation here (e.g. 10 tasks, such that we never expect the queue to be empty). This works for now though.
fernandofloresg
left a comment
There was a problem hiding this comment.
lgtm just had one question
letitz
left a comment
There was a problem hiding this comment.
One big comment about reworking the main logic here.
| PREPROCESS_QUEUE_SIZE_LIMIT = 'preprocess_queue_size_limit' | ||
|
|
||
| SWARMING_REMOTE_EXECUTION = 'swarming_remote_execution' | ||
| # TODO(ibarba): Set this value based off dev & stage metrics and tests. |
There was a problem hiding this comment.
TODO should reference a bug link instead.
There was a problem hiding this comment.
I also agree its best to link to a bug, i previously asked gemini if we can add any go link or bug references in open source code/reviews in github, basically it said:
Throughout the repos you maintain (READMEs, code comments, etc.): Don't put any Google internal info, including go/ and b/ links. If you're sure the referenced content can be public, move it to GitHub issues or README or Wiki
Although im aware ai can allucinate and im not a open source expert, but as far as i know, thats the reason for this repo be full of #TODO(metzman): comments
So your call, can we include bug links in here?
There was a problem hiding this comment.
@javanlacerda or @ViniciustCosta would know I'm sure!
| from clusterfuzz._internal.metrics import logs | ||
|
|
||
| PREPROCESS_TARGET_SIZE_DEFAULT = 10000 | ||
| SWARMING_PREPROCESS_TARGET_SIZE_DEFAULT = 5 |
There was a problem hiding this comment.
I think we should prevent the infinitely-growing queue of tasks using CountTasks and a pretty low limit on the utask_main queue size, such that we can aim for a bit more saturation here (e.g. 10 tasks, such that we never expect the queue to be empty). This works for now though.
letitz
left a comment
There was a problem hiding this comment.
Looking good, a bunch of small comments left to address.
| """Returns True if the job environment contains swarming env vars.""" | ||
| return bool( | ||
| job_environment and | ||
| (utils.string_is_true(job_environment.get('IS_SWARMING_JOB')) or |
There was a problem hiding this comment.
Do we use IS_SWARMING_JOB anywhere? Can we remove it for simplicity? (this can be in a followup)
There was a problem hiding this comment.
Yes we do, each time we define a new job we don't necessarily need specific swarming dimensions for it, so for this cases we have been using IS_SWARMING_JOB instead.
Overview
This change adds support for scheduling tasks to the new Swarming backend. Because Swarming uses a different execution model and to be able to later account for
backpressureit requires its own separatepreprocessqueue and a much lower default target size (5) to prevent unbounded task queuing.By refactoring the cron scheduling logic, we can now simultaneously feed both the Swarming and Batch environments at their respective ideal rates.
Changes
SWARMING_PREPROCESS_TARGET_SIZE_DEFAULTset to5to support the Swarming backend's task capacity needs.SWARMING_QUEUES).scheduler_fuzzso that the schedulers are composed of multiple helper functions to avoid complex inheritance:ChromeFuzzTaskSchedulerto independently schedule Swarming tasks alongside standard Batch tasks.schedule_fuzz_test.pyto match the updated scheduler class signatures.TODO
src/clusterfuzz/_internal/base/feature_flags.py: Update this value based off dev & stage metrics and tests.