Runtime optimizations for backlogger. Caching to optimize from O(n) to O(1).#23

Merged
rbx merged 2 commits into FairRootGroup:master from enlorenz:wip/integration
Feb 16, 2026
Conversation

@enlorenz (Contributor)

No description provided.

@enlorenz enlorenz marked this pull request as draft February 16, 2026 09:50

coderabbitai bot commented Feb 16, 2026

No actionable comments were generated in the recent review. 🎉


📝 Walkthrough

Adds cache-backed pending-job statistics and a mutation/versioning mechanism to avoid recomputation when the queue/backlog are unchanged (environment.py). Adjusts plot filename tokens to include drop_weight and renames the job_age token (plot.py). Handles the all-fixed-weights case in weight combination generation (train_iter.py).

Changes

Cohort / File(s) Summary

Job statistics & queue/backlog mutation (src/environment.py):
Adds _queue_backlog_version, _cached_queue_backlog_version, and _mark_queue_backlog_mutation(); tracks queue/backlog mutations in step(); fast-paths _update_pending_job_stats() when versions match; recomputes backlog metrics from backlog_queue and updates self.state when mutated.

Plot filename tokens (src/plot.py):
Changes the saved-filename prefix format: renames the job-age token from d to a and appends a new d token for drop_weight (string construction only).

Fixed-weights combination handling (train_iter.py):
Adds a branch for zero variable weights: if the fixed weights sum to ≈ 1.0, builds a single combination placing the fixed values in their positions and appends it to combinations. Existing multi-variable branches are unchanged.
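The all-fixed-weights branch summarized above might look roughly like the following minimal sketch. The function name build_combinations, the weight_spec encoding (a fixed value per position, or None for a variable weight), and the tolerance are illustrative assumptions, not train_iter.py's actual identifiers:

```python
import math

def build_combinations(weight_spec):
    """Build weight combinations from a spec mapping each position to a
    fixed value, or None for a variable weight (illustrative structure)."""
    combinations = []
    variable_positions = [i for i, v in enumerate(weight_spec) if v is None]
    if not variable_positions:
        # All weights are fixed: emit a single combination, but only if the
        # fixed values form a valid distribution (sum approximately 1.0).
        if math.isclose(sum(weight_spec), 1.0, abs_tol=1e-9):
            combinations.append(list(weight_spec))
        return combinations
    # ... existing multi-variable branches would go here (unchanged) ...
    raise NotImplementedError("variable-weight branches omitted in this sketch")
```

With all weights fixed and summing to 1.0, exactly one combination is produced; if the fixed weights do not sum to 1.0, no combination is emitted.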

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Docstring Coverage (⚠️ Warning): Docstring coverage is 28.57%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

Description check (❓ Inconclusive): No pull request description was provided by the author; the field is empty and therefore cannot be evaluated for relevance or clarity. Resolution: add a pull request description explaining the purpose, scope, and impact of the runtime optimizations and caching changes.

✅ Passed checks (2 passed)

Title check (✅ Passed): The title accurately describes the main optimization changes across multiple files: the caching mechanism in environment.py to avoid O(n) recomputation, the plot.py filename adjustments, and the train_iter.py weight handling improvements.

Merge Conflict Detection (✅ Passed): No merge conflicts detected when merging into master.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/plot.py (1)

272-272: ⚠️ Potential issue | 🟡 Minor

Inconsistent prefix format in plot_cumulative_savings — missing drop_weight and using old d label for job_age.

Line 94 was updated to e{eff}_p{price}_i{idle}_a{job_age}_d{drop}, but line 272 still uses the old format e{eff}_p{price}_i{idle}_d{job_age}, which:

  1. Labels job_age_weight as d instead of a.
  2. Omits drop_weight entirely.
Proposed fix
-        prefix = f"e{env.weights.efficiency_weight}_p{env.weights.price_weight}_i{env.weights.idle_weight}_d{env.weights.job_age_weight}"
+        prefix = f"e{env.weights.efficiency_weight}_p{env.weights.price_weight}_i{env.weights.idle_weight}_a{env.weights.job_age_weight}_d{env.weights.drop_weight}"
🤖 Fix all issues with AI agents

In `src/environment.py`, around lines 215-220: the fast-path cache that returns early when _cached_queue_count == queue_count and _cached_backlog_size == current_backlog_size is unsafe, because equal counts can hide content changes (for example, one job leaving the backlog while another enters it in the same step). Replace the count-only invalidation with a robust versioning or hashing approach:

  1. Introduce a monotonically increasing mutation/version id (e.g., _queue_backlog_version) that is incremented on every queue or backlog mutation, i.e., in every place that modifies the queue/backlog, such as the enqueue/assign/refill routines.
  2. Store that version alongside the _cached_* values (e.g., _cached_queue_backlog_version) and skip recalculation only when the versions match.
  3. Alternatively, compute and compare a lightweight content checksum of the active queue/backlog before skipping.

Ensure the fields referenced in observations (pending_core_hours, pending_avg_duration, pending_max_nodes) are recomputed unless the version/hash indicates no mutation.
🧹 Nitpick comments (2)
src/environment.py (2)

216-217: Redundant hasattr guards — attributes are always initialized in _reset_timeline_state.

_cached_queue_count and _cached_backlog_size are initialized at lines 207–208 in _reset_timeline_state, which is called from __init__ before any step() can invoke _update_pending_job_stats. The hasattr checks add unnecessary overhead on every step.

Proposed simplification (assuming the cache logic itself is fixed)
-        if (hasattr(self, '_cached_queue_count') and 
-            hasattr(self, '_cached_backlog_size') and
-            self._cached_queue_count == queue_count and 
+        if (self._cached_queue_count == queue_count and 
             self._cached_backlog_size == current_backlog_size):

231-235: List comprehensions over the deque are fine here, but they traverse the entire backlog three separate times to build three arrays.

These iterate the entire backlog deque three separate times to build three arrays. If the backlog grows large, a single-pass extraction would be more efficient:

Single-pass alternative
-            backlog_durations = np.array([job[0] for job in self.backlog_queue], dtype=np.int32)
-            backlog_nodes = np.array([job[2] for job in self.backlog_queue], dtype=np.int32)
-            backlog_cores = np.array([job[3] for job in self.backlog_queue], dtype=np.int32)
+            backlog_data = np.array([(job[0], job[2], job[3]) for job in self.backlog_queue], dtype=np.int32)
+            backlog_durations = backlog_data[:, 0]
+            backlog_nodes = backlog_data[:, 1]
+            backlog_cores = backlog_data[:, 2]
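As a quick illustration of the single-pass variant on toy data (the job values here are made up; the (duration, id, nodes, cores) tuple layout just follows the job[0]/job[2]/job[3] indices used above):

```python
from collections import deque
import numpy as np

# Toy backlog: each job is (duration, job_id, nodes, cores).
backlog_queue = deque([(10, "j1", 2, 16), (30, "j2", 4, 32), (20, "j3", 1, 8)])

# One pass over the deque builds a single (n, 3) int32 array...
backlog_data = np.array([(job[0], job[2], job[3]) for job in backlog_queue],
                        dtype=np.int32)
# ...and column slices are views, so no further copies are made.
backlog_durations = backlog_data[:, 0]
backlog_nodes = backlog_data[:, 1]
backlog_cores = backlog_data[:, 2]

print(backlog_durations.tolist())  # → [10, 30, 20]
print(int(backlog_cores.max()))    # → 32
```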

Previously, _update_pending_job_stats() was converting the entire backlog
deque to a list and then to a numpy array every single step, resulting in
O(n) time complexity where n = backlog size. With unbounded backlog growth,
this caused severe FPS degradation as training progressed.

Changes:
- Optimize deque-to-array conversion: extract columns directly without
  intermediate list() materialization
- Add caching layer: skip recalculation when queue and backlog sizes are
  unchanged (O(1) cache hit on most steps)
- Initialize cache state in _reset_timeline_state() to ensure first call
  always recalculates, then uses cached values

This reduces _update_pending_job_stats() from O(n) to O(1) on stable steps,
eliminating the training stalls observed with large backlogs.

Plus: a small fix in plot.py so that saved plots are now named correctly.
@enlorenz enlorenz marked this pull request as ready for review February 16, 2026 10:47
@rbx rbx merged commit 7c03550 into FairRootGroup:master Feb 16, 2026
4 checks passed
