Watchdog: a simplified method to compute time left to kill jobs that are going to run out of time #5129

@aldbr

Currently, the Watchdog seems to compute the "time left" from the CPU work, which is the product of
the CPU time that we get from the underlying batch system, which is accurate in most cases (I guess),
and the CPU power, which might not be very accurate in some cases.
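
To make the concern concrete, here is a minimal sketch of how an error in the CPU power estimate carries over to the CPU work left. The function name `cpu_work_left` and the numbers are purely illustrative assumptions, not DIRAC code or real measurements:

```python
def cpu_work_left(cpu_time_left_s: float, cpu_power: float) -> float:
    """CPU work left = CPU time left (seconds) * CPU power (hypothetical unit)."""
    return cpu_time_left_s * cpu_power


def relative_error(estimated: float, actual: float) -> float:
    """Relative deviation of an estimate from the actual value."""
    return (estimated - actual) / actual


# Illustrative numbers only: the batch system reports 3600 s of CPU time left,
# the real CPU power is 10.0 but it is estimated as 12.0.
actual_work = cpu_work_left(3600, 10.0)
estimated_work = cpu_work_left(3600, 12.0)

# The relative error on the work equals the relative error on the power,
# even though the CPU time itself is exact.
work_error = relative_error(estimated_work, actual_work)
power_error = relative_error(12.0, 10.0)
```

Since the product is linear in the CPU power, any relative error on the power shows up unchanged in the "time left" value, however accurate the batch system's CPU time is.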

Then, based on this "time left" value, the Watchdog seems to apply fairly complex logic to decide whether a job should be killed:

  • First, it performs a check every checkingTime until timeLeft < grossTimeLeftLimit (grossTimeLeftLimit being 18,000), see here.
  • When this happens, timeLeft is computed every pollingTime, and the variable littleTimeLeftCount, initialized to 15, is decremented every pollingTime (apparently it can become negative), see here.
  • When timeLeft < fineTimeLimitLeft (fineTimeLimitLeft being 150 * pollingTime by default) and littleTimeLeftCount == 0 (keeping in mind that it can also be negative), the job is killed.
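
As I read it, the three steps above amount to something like the sketch below. This is only a model of the description in this issue, not the actual Watchdog code: the `WatchdogState` class, the `step` function and the `polling_time` default of 10 are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class WatchdogState:
    polling_time: int = 10               # assumed value, for illustration only
    gross_time_left_limit: int = 18000   # value quoted in this issue
    little_time_left_count: int = 15     # initial value quoted in this issue
    fine_polling: bool = False           # True once timeLeft < grossTimeLeftLimit

    @property
    def fine_time_limit_left(self) -> int:
        # Default quoted in this issue: 150 * pollingTime
        return 150 * self.polling_time


def step(state: WatchdogState, time_left: float) -> str:
    """Run one check and return 'coarse', 'fine' or 'kill'."""
    if not state.fine_polling:
        # Coarse phase: one check every checkingTime
        if time_left < state.gross_time_left_limit:
            state.fine_polling = True
            return "fine"
        return "coarse"

    # Fine phase: one check every pollingTime, the counter always decreases
    state.little_time_left_count -= 1
    if (time_left < state.fine_time_limit_left
            and state.little_time_left_count == 0):
        return "kill"
    return "fine"
```

If the counter comparison really is an equality, then in this model a job whose timeLeft is still above fineTimeLimitLeft when the counter reaches 0 is never killed afterwards, since the counter only keeps decreasing.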

I would like to simplify this logic as follows:

  • We add a TimeLeft.getCPUTimeLeft() method to get the CPU time left in seconds; the existing TimeLeft.getTimeLeft() then becomes getCPUWorkLeft().
    In the Watchdog, we use this new method to get the time left in seconds: I guess it would be more accurate.
  • Once timeLeft < 4000s (or maybe checkingTime * 1.5), we do a regular check every pollingTime.
  • Once timeLeft < 600s (10 minutes), we kill the job.
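
A minimal sketch of what the proposed decision could look like, assuming the CPU time left in seconds is already available (for example from the proposed `TimeLeft.getCPUTimeLeft()`, which does not exist yet). The function `next_action` and the constant names are hypothetical:

```python
from typing import Tuple

FINE_CHECK_THRESHOLD_S = 4000  # proposed above; checkingTime * 1.5 is the alternative
KILL_THRESHOLD_S = 600         # 10 minutes, as proposed above


def next_action(cpu_time_left_s: float,
                checking_time_s: int,
                polling_time_s: int) -> Tuple[str, int]:
    """Return (action, seconds until the next check).

    action is 'kill' or 'check'; the delay is 0 when the job is killed.
    """
    if cpu_time_left_s < KILL_THRESHOLD_S:
        return ("kill", 0)
    if cpu_time_left_s < FINE_CHECK_THRESHOLD_S:
        # Close to the limit: check every pollingTime
        return ("check", polling_time_s)
    # Plenty of time left: keep the coarse checkingTime interval
    return ("check", checking_time_s)
```

With this shape there is no counter and no hidden state: the decision depends only on the current CPU time left and two thresholds.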

There are probably many historical reasons that I do not understand, or use cases that I do not know about, that would explain this complex logic.
Let me know if you have further details or comments about what I propose.
