LoadML - change time decay to extend up to 30 days #3401
springfall2008 merged 10 commits into main
Conversation
HAHistory - Add locking around history modifications
Pull request overview
This PR makes two related changes: (1) extends the ML load forecaster's time decay window from 7 to 30 days for better long-term pattern learning, and (2) adds thread-safety to the HAHistory component by introducing locking around shared state modifications.
Changes:
- Increased ML time decay parameter from 7 to 30 days with dynamic adjustment based on available data age
- Added threading.Lock to HAHistory to protect concurrent access to history_data and history_entities dictionaries
- Added comprehensive test coverage for the scenario where history cache is expanded from 7 to 30 days
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| apps/predbat/ha.py | Adds history_lock (threading.Lock) initialization and wraps history_data modifications in prune_history() and update_entity() with lock to prevent race conditions |
| apps/predbat/load_ml_component.py | Changes ml_time_decay_days from 7 to 30 days and adds dynamic calculation using max(ml_time_decay_days, load_data_age_days) to ensure decay extends to cover available data |
| apps/predbat/tests/test_hahistory.py | Adds test_hahistory_expand_from_7_to_30_days() to verify cache correctly expands when requesting more days than currently cached |
Comments suppressed due to low confidence (1)
apps/predbat/ha.py:76
- The add_entity method modifies self.history_entities without holding the history_lock, creating a race condition. This method is called from get_history (line 94) outside of the lock, while history_entities is also accessed inside the lock at line 85, in prune_history() at line 111, and from other methods. The lock should protect all accesses to history_entities to prevent races when multiple threads call get_history concurrently, or when the run method reads history_entities while get_history is updating it. To fix this, add_entity() should also take the lock when modifying self.history_entities.
```python
def add_entity(self, entity_id, days):
    """
    Add an entity to the history list
    """
    if entity_id in self.history_entities:
        if days > self.history_entities[entity_id]:
            self.history_entities[entity_id] = days
    else:
        self.history_entities[entity_id] = days
```
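A minimal sketch of the suggested fix, assuming history_lock is the threading.Lock this PR adds in __init__; taking the lock around the whole read-modify-write makes the window update atomic:

```python
def add_entity(self, entity_id, days):
    """
    Add an entity to the history list (thread-safe sketch)
    """
    with self.history_lock:
        # Only ever grow the cached window for an entity;
        # .get() with a default of 0 also covers the first-seen case.
        if days > self.history_entities.get(entity_id, 0):
            self.history_entities[entity_id] = days
```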
```python
with self.history_lock:
    if self.history_data.get(entity_id, None) and self.history_entities.get(entity_id, 0) >= days:
        result = [self.history_data[entity_id]]

if result is None:
```
The lock is released before calling add_entity and accessing history_entities (lines 94-95 outside the diff). This creates a potential race condition where another thread could modify history_entities between the lock release and these accesses. Consider expanding the lock scope to include the entire cache-check-and-fetch logic, or add locking to the add_entity method to protect all accesses to history_entities.
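One hedged sketch of the wider lock scope this comment suggests. Note that threading.Lock is not re-entrant, so if add_entity also takes the lock (as sketched above), get_history must either use a threading.RLock or inline the bookkeeping as done here; fetch_history is a hypothetical stand-in for the uncached fetch path:

```python
def get_history(self, entity_id, days):
    """Return history for entity_id, serving from cache when possible."""
    result = None
    with self.history_lock:
        # Cache check and bookkeeping are now atomic with respect to
        # other readers/writers of history_data and history_entities.
        if self.history_data.get(entity_id, None) and self.history_entities.get(entity_id, 0) >= days:
            result = [self.history_data[entity_id]]
        elif days > self.history_entities.get(entity_id, 0):
            # Record the larger requested window before fetching (was add_entity)
            self.history_entities[entity_id] = days
    if result is None:
        # The slow fetch stays outside the lock so it does not block other callers;
        # storing the fetched result back into history_data would also need the lock.
        result = self.fetch_history(entity_id, days)  # hypothetical helper
    return result
```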
```diff
  is_initial=is_initial,
  epochs=epochs,
- time_decay_days=self.ml_time_decay_days,
+ time_decay_days=max(self.ml_time_decay_days, self.load_data_age_days),
```
The use of max() here may not match the intended behavior. If load_data_age_days exceeds ml_time_decay_days (e.g., 60 days of data vs 30-day decay setting), this will use 60 as the decay timescale, giving the oldest samples (at day 60) a weight of exp(-60/60) ≈ 0.37. If the intention is to decay over 30 days regardless of data age, samples at day 60 should have weight exp(-60/30) ≈ 0.14, requiring time_decay_days to be capped at 30, not increased to 60. Consider whether min(ml_time_decay_days, load_data_age_days) would better match the intended exponential decay behavior, or clarify if the current max() logic is intentional to preserve weight for very old data.
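A small worked example of the arithmetic in this comment, assuming the trainer weights each sample as exp(-age / time_decay_days), which is the formula implied by the numbers above (the helper name is illustrative):

```python
import math

def sample_weight(age_days, time_decay_days):
    # Exponential time decay: newer samples get weights closer to 1.0
    return math.exp(-age_days / time_decay_days)

# 60 days of data with a 30-day decay setting:
print(sample_weight(60, max(30, 60)))  # max() -> exp(-1) ~= 0.37, old data keeps substantial weight
print(sample_weight(60, min(30, 60)))  # min() -> exp(-2) ~= 0.14, decay capped at the 30-day setting
```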
```diff
@@ -111,16 +111,14 @@ To prevent drift in long-range predictions, the model blends autoregressive pred
 - **Early Stopping**: Training halts if validation error doesn't improve for 5 consecutive epochs
 - **Time-Weighted Samples**: Recent data has higher importance (7-day exponential decay constant)
```
This line still references "7-day exponential decay constant" but the code has been changed to use 30 days (ml_time_decay_days = 30 in load_ml_component.py line 75). The documentation should be updated to reflect the new time decay period or made more generic to match the changes made elsewhere in this file.
```diff
@@ -618,18 +620,3 @@ def _publish_entity(self):
             },
             app="load_ml",
         )
```
The removed is_alive() method had additional health checking logic that verified the component had been updated within the last 10 minutes. The base class implementation (ComponentBase.is_alive()) only checks if self.api_started is True, which is less strict and won't detect if the component becomes stuck or stops updating.
If the enhanced health checking was intentional and important for monitoring the LoadML component's health, this removal could reduce visibility into component health issues. Consider whether the base class implementation provides sufficient health monitoring or if the 10-minute timeout check should be retained.
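If the stricter check is worth retaining, a hedged sketch of what it could look like, assuming a last-update timestamp is recorded whenever the component publishes (last_update_time and the class name are illustrative; only the api_started check in the base class is described in the comment):

```python
from datetime import datetime, timedelta

class LoadML(ComponentBase):  # base class named in the comment above
    def is_alive(self):
        # Keep the base liveness check (api_started), then add a staleness
        # guard: no update in the last 10 minutes counts as not alive.
        if not super().is_alive():
            return False
        return (datetime.now() - self.last_update_time) <= timedelta(minutes=10)
```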