Runtime optimizations for backlogger. Caching to optimize from O(n) to O(1). #23
Conversation
No actionable comments were generated in the recent review. 🎉

📝 Walkthrough

Adds cache-backed pending-job statistics and a mutation/versioning mechanism to avoid recomputation when the queue/backlog are unchanged (environment.py). Adjusts plot filename tokens to include `drop_weight` and relabel `job_age_weight` (plot.py).
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

✅ Passed checks (2 passed)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/plot.py (1)
272-272: ⚠️ Potential issue | 🟡 Minor

Inconsistent prefix format in `plot_cumulative_savings` — missing `drop_weight` and using the old `d` label for `job_age`.

Line 94 was updated to `e{eff}_p{price}_i{idle}_a{job_age}_d{drop}`, but line 272 still uses the old format `e{eff}_p{price}_i{idle}_d{job_age}`, which:

- Labels `job_age_weight` as `d` instead of `a`.
- Omits `drop_weight` entirely.

Proposed fix:

```diff
-    prefix = f"e{env.weights.efficiency_weight}_p{env.weights.price_weight}_i{env.weights.idle_weight}_d{env.weights.job_age_weight}"
+    prefix = f"e{env.weights.efficiency_weight}_p{env.weights.price_weight}_i{env.weights.idle_weight}_a{env.weights.job_age_weight}_d{env.weights.drop_weight}"
```
🤖 Fix all issues with AI agents
In `@src/environment.py`:
- Around lines 215-220: The fast-path cache that returns early when `_cached_queue_count == queue_count` and `_cached_backlog_size == current_backlog_size` is unsafe, because equal counts can hide content changes. Replace this count-only invalidation with a robust versioning or hashing approach: introduce a monotonically increasing mutation/version id (e.g., `_queue_backlog_version`) that is incremented on every queue or backlog mutation (every place that modifies the queue/backlog, such as the enqueue/assign/refill routines), store that version alongside the cached values (e.g., `_cached_queue_backlog_version`), and only skip recalculation when the versions match. Alternatively, compute and compare a lightweight content checksum of the active queue/backlog before skipping. Ensure the fields referenced in observations (`pending_core_hours`, `pending_avg_duration`, `pending_max_nodes`) are recomputed unless the version/hash indicates no mutation.
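As a concrete illustration of the versioning approach the prompt above sketches, here is a minimal, self-contained example. Every name in it (`VersionedPendingStats`, `_bump_version`, `replace_job`, the dict-based stats) is an assumption for illustration, not the actual `environment.py` API:

```python
class VersionedPendingStats:
    """Hypothetical sketch of version-based cache invalidation."""

    def __init__(self):
        self.queue = []
        self.backlog = []            # items: dicts with "duration" and "cores"
        self._version = 0            # bumped on every mutation
        self._cached_version = -1    # version the cached stats were built at
        self._cached_stats = None

    def _bump_version(self):
        self._version += 1

    def enqueue(self, job):
        self.backlog.append(job)
        self._bump_version()

    def replace_job(self, index, job):
        # Sizes stay identical here, so a count-only cache would serve
        # stale stats; the version bump catches the mutation anyway.
        self.backlog[index] = job
        self._bump_version()

    def pending_stats(self):
        # Recompute only if a mutation happened since the last call.
        if self._cached_version == self._version:
            return self._cached_stats
        pending = self.queue + self.backlog
        durations = [j["duration"] for j in pending]
        self._cached_stats = {
            # Units are illustrative only.
            "pending_core_hours": sum(j["duration"] * j["cores"] for j in pending),
            "pending_avg_duration": sum(durations) / len(durations) if durations else 0.0,
        }
        self._cached_version = self._version
        return self._cached_stats
```

The key property: the fast path depends only on whether any mutation occurred, never on whether the mutation happened to leave the sizes unchanged.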
🧹 Nitpick comments (2)
src/environment.py (2)
216-217: Redundant `hasattr` guards — attributes are always initialized in `_reset_timeline_state`.

`_cached_queue_count` and `_cached_backlog_size` are initialized at lines 207-208 in `_reset_timeline_state`, which is called from `__init__` before any `step()` can invoke `_update_pending_job_stats`. The `hasattr` checks add unnecessary overhead on every step.

Proposed simplification (assuming the cache logic itself is fixed):

```diff
-        if (hasattr(self, '_cached_queue_count') and
-                hasattr(self, '_cached_backlog_size') and
-                self._cached_queue_count == queue_count and
+        if (self._cached_queue_count == queue_count and
                 self._cached_backlog_size == current_backlog_size):
```
231-235: List comprehensions over a `deque` are fine here but could use `islice` for very large backlogs.

These iterate the entire backlog deque three separate times to build three arrays. If the backlog grows large, a single-pass extraction would be more efficient:

Single-pass alternative:

```diff
-        backlog_durations = np.array([job[0] for job in self.backlog_queue], dtype=np.int32)
-        backlog_nodes = np.array([job[2] for job in self.backlog_queue], dtype=np.int32)
-        backlog_cores = np.array([job[3] for job in self.backlog_queue], dtype=np.int32)
+        backlog_data = np.array([(job[0], job[2], job[3]) for job in self.backlog_queue], dtype=np.int32)
+        backlog_durations = backlog_data[:, 0]
+        backlog_nodes = backlog_data[:, 1]
+        backlog_cores = backlog_data[:, 2]
```
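One caveat with the single-pass suggestion: `np.array` over an empty deque yields a 1-D array of shape `(0,)`, so `backlog_data[:, 0]` would raise `IndexError` on an empty backlog. A guarded sketch, where the function name and the `(duration, _, nodes, cores)` tuple layout are assumptions inferred from the diff above:

```python
import numpy as np
from collections import deque

def extract_backlog_columns(backlog_queue):
    """Single-pass column extraction with an empty-backlog guard."""
    if not backlog_queue:
        empty = np.empty(0, dtype=np.int32)
        return empty, empty, empty
    data = np.array([(job[0], job[2], job[3]) for job in backlog_queue],
                    dtype=np.int32)
    return data[:, 0], data[:, 1], data[:, 2]

# Usage with the assumed tuple layout:
backlog = deque([(120, "job-a", 4, 64), (30, "job-b", 1, 16)])
durations, nodes, cores = extract_backlog_columns(backlog)
```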
Previously, `_update_pending_job_stats()` was converting the entire backlog deque to a list and then to a numpy array every single step, resulting in O(n) time complexity where n = backlog size. With unbounded backlog growth, this caused severe FPS degradation as training progressed.

Changes:

- Optimize deque-to-array conversion: extract columns directly without intermediate `list()` materialization.
- Add caching layer: skip recalculation when queue and backlog sizes are unchanged (O(1) cache hit on most steps).
- Initialize cache state in `_reset_timeline_state()` so the first call always recalculates and later calls use the cached values.

This reduces `_update_pending_job_stats()` from O(n) to O(1) on stable steps, eliminating the training stalls observed with large backlogs.

Plus: small fix in plot.py — saved plots are now named correctly.
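For readers following along, a minimal sketch of the size-keyed cache the commit message describes; the class and attribute names are illustrative stand-ins, not the actual `environment.py` code, and the count-only fast path is exactly what the actionable comment above flags as unsafe when contents change while sizes stay equal:

```python
import numpy as np
from collections import deque

class TimelineEnvSketch:
    """Illustrative stand-in for the real environment (names assumed)."""

    def __init__(self):
        self.job_queue = deque()
        self.backlog_queue = deque()  # items: (duration, job_id, nodes, cores)
        self._reset_timeline_state()

    def _reset_timeline_state(self):
        # Sentinels guarantee the first _update_pending_job_stats recomputes.
        self._cached_queue_count = -1
        self._cached_backlog_size = -1

    def _update_pending_job_stats(self):
        queue_count = len(self.job_queue)
        backlog_size = len(self.backlog_queue)

        # O(1) fast path: sizes unchanged since the previous step.
        if (self._cached_queue_count == queue_count and
                self._cached_backlog_size == backlog_size):
            return

        # O(n) slow path: build the column array straight from the deque,
        # with no intermediate list() materialization.
        durations = np.fromiter((job[0] for job in self.backlog_queue),
                                dtype=np.int32, count=backlog_size)
        self.pending_avg_duration = float(durations.mean()) if backlog_size else 0.0

        self._cached_queue_count = queue_count
        self._cached_backlog_size = backlog_size
```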
Force-pushed from 0338183 to 7b36496
No description provided.