Currently, the Watchdog seems to compute the "time left" from the CPU work, which is the product of
the CPU time reported by the underlying batch system (accurate in most cases, I guess)
and the CPU power, which may be inaccurate in some cases.
Then, based on this "time left" value, the Watchdog seems to apply complex logic to decide whether a job should be killed:
- First, it performs a check every `checkingTime` until `timeLeft < grossTimeLeftLimit` (`grossTimeLeftLimit` being 18,000), see here.
- When this happens, `timeLeft` is then computed every `pollingTime`, and the variable `littleTimeLeftCount`, initialized to 15, is decremented every `pollingTime` (it can be negative, apparently), see here.
- When `timeLeft < fineTimeLimitLeft` (`fineTimeLimitLeft` being `150 * pollingTime` by default) and `littleTimeLeftCount == 0` (keeping in mind that it can also be negative), the job is killed.
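
To make the three steps above easier to discuss, here is a minimal sketch of that logic as I read it from the description. It is not the actual Watchdog code: `WatchdogState` and `step` are hypothetical names, the variable names are adapted from the ones quoted above, and the exact order of the decrement and the test is an assumption.

```python
from dataclasses import dataclass

# Value quoted in the description above.
GROSS_TIME_LEFT_LIMIT = 18_000


@dataclass
class WatchdogState:
    """Hypothetical container for the variables named in the description."""

    polling_time: int                  # seconds between fine-grained checks
    little_time_left: bool = False     # True once the gross limit is crossed
    little_time_left_count: int = 15   # initial value quoted in the description

    @property
    def fine_time_limit_left(self) -> int:
        # Default quoted in the description: 150 * pollingTime.
        return 150 * self.polling_time


def step(state: WatchdogState, time_left: float) -> str:
    """Run one check and return 'coarse', 'fine' or 'kill'.

    'coarse' means: check again after checkingTime.
    'fine' means: check again after pollingTime.
    """
    if not state.little_time_left:
        if time_left < GROSS_TIME_LEFT_LIMIT:
            state.little_time_left = True
            return "fine"
        return "coarse"

    # Assumption: the counter is decremented on every fine-grained check
    # and is never clamped, so it can become negative.
    state.little_time_left_count -= 1
    if time_left < state.fine_time_limit_left and state.little_time_left_count == 0:
        return "kill"
    return "fine"
```

Under this reading, a job whose `timeLeft` drops below the fine limit only after the counter has already passed zero would never satisfy `littleTimeLeftCount == 0`, which is part of what makes the logic hard to follow.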
I would like to simplify this logic as follows:
- We add a `TimeLeft.getCPUTimeLeft()` method to get the CPU time left in seconds, and `TimeLeft.getTimeLeft()` in this case becomes `getCPUWorkLeft()`.
  In the watchdog we use this new method to get the time left in seconds: I guess it would be more accurate.
- Once `timeLeft < 4000s` (or maybe `checkingTime * 1.5`), we do a regular check every `pollingTime`.
- Once `timeLeft < 600s` (10 minutes), we kill the job.
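
As a rough illustration of the proposal, the decision could reduce to two threshold comparisons. This is only a sketch: `next_action` and the constant names are hypothetical, and the CPU time left is passed in as a plain number because `TimeLeft.getCPUTimeLeft()` is the method proposed here and does not exist yet.

```python
# Thresholds taken from the proposal above, in seconds of CPU time left.
CHECK_THRESHOLD_S = 4000   # could instead be checkingTime * 1.5
KILL_THRESHOLD_S = 600     # 10 minutes


def next_action(cpu_time_left_s: float) -> str:
    """Decide what the watchdog does for a given CPU time left (seconds).

    Returns 'kill' to stop the job, 'poll' to re-check every pollingTime,
    or 'check' to keep the coarse check every checkingTime.
    """
    if cpu_time_left_s < KILL_THRESHOLD_S:
        return "kill"
    if cpu_time_left_s < CHECK_THRESHOLD_S:
        return "poll"
    return "check"
```

For example, `next_action(5000)` keeps the coarse check, `next_action(3000)` switches to polling, and `next_action(500)` kills the job. There is no counter left to track, so the outcome depends only on the current value of the time left.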
There are probably many historical reasons that I do not understand, or use cases that I do not know, that would explain this complex logic.
Let me know if you have further details or comments about what I propose.