[WFRunner] Handle resource limits and CPU better (#1532)
* Account for relative CPU factor in case of sampling
Studies have shown that being able to backfill tasks can make a
difference for CPU efficiency, especially for transport.
That is particularly important for high-efficiency jobs such as
high-interaction-rate pp simulations.
* Abort by default if estimated resources exceed limits
* Run anyway, if --optimistic-resources is passed
* Fix: Actually reset the overestimated resources to the limits as
otherwise the runner would silently quit when nothing else can be
done.
* In case of dynamically sampled resources, and if a corresponding task
  has already been run: reset the assigned resources to the limits if
  they exceed the boundaries.
Co-authored-by: Benedikt Volkel <benedikt.volkel@cern.ch>
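A minimal sketch of the abort-vs-optimistic decision described in the bullets above. The names (`ResourceBoundaries`, `check_resources`, the `optimistic` flag) are illustrative assumptions, not the actual WFRunner code:

```python
import sys

class ResourceBoundaries:
    """Hypothetical container for the configured CPU and memory limits."""
    def __init__(self, cpu_limit, mem_limit):
        self.cpu_limit = cpu_limit
        self.mem_limit = mem_limit

def check_resources(name, cpu, mem, boundaries, optimistic=False):
    """Abort if estimates exceed the limits, unless the user opted in
    to optimistic running; in that case clamp estimates to the limits."""
    exceeds = cpu > boundaries.cpu_limit or mem > boundaries.mem_limit
    if exceeds and not optimistic:
        print(f"Resources of task {name} are exceeding the boundaries.")
        print("Pass --optimistic-resources to the runner to attempt the run anyway.")
        sys.exit(1)
    # either all is good, or the user decided to be optimistic:
    # limit the assigned resources to the configured boundaries
    cpu = min(cpu, boundaries.cpu_limit)
    mem = min(mem, boundaries.mem_limit)
    return cpu, mem
```

With `optimistic=True`, an overestimated task is not rejected; its assigned resources are simply reset to the limits, which matches the "Fix" bullet above.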
```diff
  actionlogger.warning(f"Resource estimates of id {len(self.resources)} overestimates limits, CPU limit: {self.resource_boundaries.cpu_limit}, MEM limit: {self.resource_boundaries.mem_limit}; might not run")
  print(f"Resources of task {name} are exceeding the boundaries.\nCPU: {cpu} (estimate) vs. {self.resource_boundaries.cpu_limit} (boundary)\nMEM: {mem} (estimated) vs. {self.resource_boundaries.mem_limit} (boundary).")
- exit(1)
- # or we do dare, let's see what happens...
- actionlogger.info("We will try to run this task anyway with maximum available resources")
+ print("Pass --optimistic-resources to the runner to attempt the run anyway.")
+ exit(1)
+ # if we get here, either all is good or the user decided to be optimistic and we limit the resources, by default to the given CPU and mem limits.
+ resources.limit_resources()

  self.resources.append(resources)
  # do the following to have the same Semaphore object for all corresponding TaskResources so that we do not need a lookup
```
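The comment about sharing one `Semaphore` object could be realized as below. This is a hypothetical sketch (the `TaskResources` fields and task names are assumptions), not the actual WFRunner implementation:

```python
from threading import Semaphore

class TaskResources:
    """Hypothetical per-task resource record carrying a shared semaphore."""
    def __init__(self, name):
        self.name = name
        self.semaphore = None

resources_list = [TaskResources("tpcdigi"), TaskResources("tpcdigi"), TaskResources("its")]

# assign the SAME Semaphore object to all entries with the same task name,
# so later code can acquire it directly without a dictionary lookup
by_name = {}
for res in resources_list:
    if res.name not in by_name:
        by_name[res.name] = Semaphore(1)
    res.semaphore = by_name[res.name]
```

Because all corresponding `TaskResources` hold a reference to the identical object, acquiring the semaphore through any of them serializes the whole group.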