Watchdog: a simplified method to compute time left to kill jobs that are going to run out of time

Currently, the [Watchdog](https://github.com/DIRACGrid/DIRAC/blob/integration/src/DIRAC/WorkloadManagementSystem/JobWrapper/Watchdog.py) seems to compute the "[time left](https://github.com/DIRACGrid/DIRAC/blob/integration/src/DIRAC/WorkloadManagementSystem/JobWrapper/Watchdog.py#L792)" from the CPU work, which is the product of
the CPU time that we get from the underlying batch system, which is accurate in most cases (I guess),
and the CPU power, which might not be accurate in some cases.
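
To illustrate why an inaccurate CPU power matters, here is a minimal sketch of the product described above. The function name and the numeric values are hypothetical and are not taken from the DIRAC code; the point is only that a relative error in the CPU power carries over unchanged to the CPU work left.

```python
def cpu_work_left(cpu_time_left_s: float, cpu_power: float) -> float:
    """CPU work left, as the product of CPU time left and CPU power."""
    return cpu_time_left_s * cpu_power


# Hypothetical values, for illustration only.
cpu_time_left_s = 3600.0  # reported by the batch system, assumed accurate
true_power = 10.0         # what the node can actually deliver
estimated_power = 12.0    # overestimated by 20%

true_work = cpu_work_left(cpu_time_left_s, true_power)
estimated_work = cpu_work_left(cpu_time_left_s, estimated_power)

# The relative error on the work equals the relative error on the power.
relative_error = (estimated_work - true_work) / true_work
```

Working directly with the CPU time left in seconds would avoid this error source, as long as the batch system value is reliable.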

Then, based on this "time left" value, the watchdog seems to apply fairly complex logic to decide whether a job should be killed:
- First, it performs a check every `checkingTime` until `timeLeft < grossTimeLeftLimit`, with `grossTimeLeftLimit` being 18,000 ([see here](https://github.com/DIRACGrid/DIRAC/blob/integration/src/DIRAC/WorkloadManagementSystem/JobWrapper/Watchdog.py#L811)).
- From then on, `timeLeft` is computed every `pollingTime`, and the variable `littleTimeLeftCount`, initialized to 15, is decremented every `pollingTime` (apparently it can become negative) ([see here](https://github.com/DIRACGrid/DIRAC/blob/integration/src/DIRAC/WorkloadManagementSystem/JobWrapper/Watchdog.py#L217)).
- When `timeLeft < fineTimeLimitLeft` (`fineTimeLimitLeft` being `150 * pollingTime` by default) and `littleTimeLeftCount == 0` (keeping in mind that it can also be negative), the job is killed.
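
The behaviour described in the list above can be sketched as a small state machine. This is my reading of the logic, not a copy of the Watchdog code: the `WatchdogState` class and `step` function are hypothetical, the `POLLING_TIME` value is an assumption, and I assume the counter is decremented before the kill condition is evaluated.

```python
from dataclasses import dataclass

GROSS_TIME_LEFT_LIMIT = 18_000             # value quoted in this issue
POLLING_TIME = 10                          # assumed value, for illustration
FINE_TIME_LIMIT_LEFT = 150 * POLLING_TIME  # default described in this issue


@dataclass
class WatchdogState:
    """Hypothetical container for the two pieces of state described above."""
    little_time_left: bool = False   # True once timeLeft < grossTimeLeftLimit
    little_time_left_count: int = 15


def step(state: WatchdogState, time_left: float) -> bool:
    """Run one check and return True if the job would be killed."""
    if not state.little_time_left:
        # Coarse phase: one check every checkingTime.
        if time_left < GROSS_TIME_LEFT_LIMIT:
            state.little_time_left = True
        return False
    # Fine phase: one check every pollingTime; the counter has no lower bound.
    state.little_time_left_count -= 1
    return (
        time_left < FINE_TIME_LIMIT_LEFT
        and state.little_time_left_count == 0
    )
```

Under these assumptions, if the counter reaches 0 while `time_left` is still above `FINE_TIME_LIMIT_LEFT`, it keeps decreasing and the `== 0` condition is never met again, which is part of why the negative values look suspicious to me.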

I would like to simplify this logic as follows:
- We add a `TimeLeft.getCPUTimeLeft()` method that returns the CPU time left in seconds, and the current `TimeLeft.getTimeLeft()` is renamed to `getCPUWorkLeft()`.
  In the watchdog, we use this new method to get the time left in seconds: I guess it would be more accurate.
- Once `timeLeft` drops below 4000 s (or maybe `checkingTime * 1.5`), we do a regular check every `pollingTime`.
- Once `timeLeft` drops below 600 s (10 minutes), we kill the job.
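
The proposal above could look roughly like the sketch below. It only covers the decision step: the `Action` enum and the `next_action` function are hypothetical names, and the thresholds are the tentative values from the list, so treat it as a starting point for discussion and not as a final design.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical outcomes of one watchdog check."""
    CHECK_EVERY_CHECKING_TIME = "check every checkingTime"
    CHECK_EVERY_POLLING_TIME = "check every pollingTime"
    KILL = "kill the job"


def next_action(
    cpu_time_left_s: float,
    polling_threshold_s: float = 4000.0,  # or maybe checkingTime * 1.5
    kill_threshold_s: float = 600.0,      # 10 minutes
) -> Action:
    """Decide what to do from the CPU time left, in seconds."""
    if cpu_time_left_s < kill_threshold_s:
        return Action.KILL
    if cpu_time_left_s < polling_threshold_s:
        return Action.CHECK_EVERY_POLLING_TIME
    return Action.CHECK_EVERY_CHECKING_TIME
```

For example, `next_action(5000)` keeps the coarse checks, `next_action(3000)` switches to checks every `pollingTime`, and `next_action(500)` asks for the job to be killed. There is no counter to maintain, so there is no state that can drift into negative values.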

There are probably many historical reasons that I do not understand, or use cases that I am not aware of, that would explain this complex logic.
Let me know if you have further details or comments about what I propose.


Watchdog: a simplified method to compute time left to kill jobs that are going to run out of time #5129
