Simplify resilient wrapper: fail-fast → communication error → outage error#381
Open
MaStr wants to merge 5 commits into
Open
Simplify resilient wrapper: fail-fast → communication error → outage error#381MaStr wants to merge 5 commits into
MaStr wants to merge 5 commits into
Conversation
A read failure (e.g. refresh_api_values) starts a retry backoff window in which the inverter is presumed unavailable. Command operations (set_mode_*) were subject to the same backoff skip as reads, but have no cached value or default to fall back on. A set_mode_* call issued during the backoff window was therefore routed into _get_cached_or_default, which raised RuntimeError and crashed the whole process. Commands now degrade gracefully instead of raising: - During backoff the command is discarded (returns None) rather than fired at a presumed-unavailable inverter; batcontrol recalculates and re-issues the mode on the next cycle. - A command attempted outside backoff that fails also returns None instead of raising. Fail-fast before initialization and InverterOutageError after the outage tolerance window are preserved. https://claude.ai/code/session_019TARxrY2uWGTDuxwyugtqD
… _read/_command The old implementation had a single _call_with_resilience method with 7 parameters plus a separate _get_cached_or_default, driven by flags like is_command and cache_attr strings. Replaced with two clear primitives: - _read(key, method, default): caches result, returns cached value during outages; the dict-based cache replaces the CachedValues dataclass. - _command(name, method, *args): discards silently during backoff, returns None on transient failure instead of crashing. The _on_failure helper now returns bool (True = re-raise) instead of mixing returns and raises, making the control flow easier to follow. Line count: 528 -> 278 (wrapper), 651 -> 380 (tests). Behaviour unchanged. https://claude.ai/code/session_019TARxrY2uWGTDuxwyugtqD
Replace the value-caching/backoff design with a much simpler concept driven by the existing run() loop: - The wrapper no longer caches reads or discards writes. After the first successful set_mode_* (initialization), any inverter failure raises InverterCommunicationError. core.run() catches it, skips the control cycle and retries on the next scheduled run - the loop interval is the natural backoff, and no decision is ever made on stale data. - A failure that persists past outage_tolerance_minutes escalates to InverterOutageError, which propagates to __main__ and terminates (unchanged). - Errors before initialization still propagate unchanged (fail-fast on config). - Externally triggered control (api_set_mode/charge_rate/limit, set_discharge_blocked) runs on background threads and tolerates a transient outage via the _tolerate_inverter_outage decorator; termination stays driven by the main run() loop. New InverterError base with InverterCommunicationError; refresh_api_values is best-effort. Removes CachedValues, retry backoff and get_outage_status. The retry_backoff_seconds config option is dropped. resilient_wrapper.py: 528 -> 208 lines. https://claude.ai/code/session_019TARxrY2uWGTDuxwyugtqD
The wrapper re-declared every InverterInterface method (and forwarded a list of attributes) just to delegate each call - pure boilerplate. The optional methods it provided fallbacks for (get_designed_capacity, get_usable_capacity, get_mqtt_inverter_topic, publish_inverter_discovery_messages) live in the backend baseclass and are only ever called internally on the backend, never on the wrapper, so the wrapper never needed them. Replace the whole thing with a __getattr__ proxy: delegate every attribute and method to the wrapped inverter and intercept only failing method calls. The outage classification (fail-fast before init, InverterCommunicationError within tolerance, InverterOutageError past it) is unchanged; shutdown and refresh_api_values stay best-effort. No longer subclasses InverterInterface (the ABC's abstractmethods are satisfied by delegation, not concrete overrides). Existing wrapper tests pass unchanged. resilient_wrapper.py: 208 -> 114 lines (528 at the start of this work). https://claude.ai/code/session_019TARxrY2uWGTDuxwyugtqD
Add the rationale that, without the wrapper, an inverter outage terminates batcontrol and the container restart policy brings it straight back, turning a multi-minute outage into a restart loop. Each cold start re-fetches the price and solar forecasts, so repeated restarts can hit provider rate limits (Awattar/Tibber, Forecast.Solar/SolarPrognose) - leaving batcontrol without fresh data even after the inverter recovers. https://claude.ai/code/session_019TARxrY2uWGTDuxwyugtqD
Contributor
There was a problem hiding this comment.
Pull request overview
Refactors the inverter resilience layer to a simpler three-state failure model (fail-fast pre-init, transient communication error post-init, outage error after tolerance) and updates core control flow, tests, and documentation accordingly.
Changes:
- Replaces cached/backoff-based
ResilientInverterWrapperwith a thin__getattr__proxy that normalizes failures intoInverterCommunicationError/InverterOutageError. - Updates
Batcontrol.run()to skip a control cycle onInverterCommunicationError, and adds a decorator to swallow inverter errors for externally triggered actions. - Updates tests and configuration documentation to match the new failure semantics and removal of retry backoff.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/batcontrol/test_core.py | Adds coverage for skipping cycles on InverterCommunicationError and propagating InverterOutageError, plus external API tolerance. |
| tests/batcontrol/inverter/test_resilient_wrapper.py | Reworks wrapper tests around the new three-state model (no caching/backoff). |
| src/batcontrol/inverter/resilient_wrapper.py | Implements the simplified resilient proxy via __getattr__ + guarded method calls and outage tracking. |
| src/batcontrol/inverter/inverter.py | Removes retry-backoff configuration wiring and updates wrapper construction/logging. |
| src/batcontrol/inverter/exceptions.py | Introduces InverterError base class and InverterCommunicationError; keeps InverterOutageError. |
| src/batcontrol/inverter/init.py | Exports the new exception classes from the package. |
| src/batcontrol/core.py | Catches InverterCommunicationError in run() to skip cycles; adds _tolerate_inverter_outage for background-triggered actions. |
| docs/configuration/inverter-configuration.md | Updates user documentation to describe skip-cycle behavior and removes retry-backoff docs. |
| config/batcontrol_config_dummy.yaml | Removes retry_backoff_seconds and updates the resilient-wrapper comment to reflect skip-cycle behavior. |
Comment on lines
95
to
+97
| logger.error( | ||
| "Inverter communication failed before initialization complete. " | ||
| "This may be a configuration error: %s", error | ||
| ) | ||
| return True # Signal to caller to re-raise | ||
| "Inverter failure before initialization (config error?): %s", error) | ||
| raise error |
Comment on lines
110
to
+114
| logger.warning( | ||
| "Inverter communication failed for '%s' (outage: %.1f min, " | ||
| "tolerance: %.1f min, failures: %d)", | ||
| operation, outage_minutes, | ||
| self._outage_tolerance_seconds / 60, | ||
| self._consecutive_failures | ||
| ) | ||
|
|
||
| return False # Don't re-raise, try cache/default | ||
|
|
||
| def _handle_success(self, mark_initialized: bool = False) -> None: | ||
| """ | ||
| Handle a successful inverter communication. | ||
|
|
||
| Args: | ||
| mark_initialized: If True, mark initialization as complete. | ||
| This should only be True for set_mode_* calls. | ||
| """ | ||
| if self._first_failure_time is not None: | ||
| outage_duration = time.time() - self._first_failure_time | ||
| logger.info( | ||
| "Inverter connection restored after %.1f minutes outage", | ||
| outage_duration / 60 | ||
| ) | ||
|
|
||
| if mark_initialized and not self._initialization_complete: | ||
| logger.info("Inverter initialization complete (first set_mode succeeded)") | ||
| self._initialization_complete = True | ||
|
|
||
| self._first_failure_time = None | ||
| self._last_failure_time = None | ||
| self._consecutive_failures = 0 | ||
|
|
||
| def _call_with_resilience( | ||
| self, | ||
| method: Callable, | ||
| operation_name: str, | ||
| cache_attr: Optional[str] = None, | ||
| default_value: Any = None, | ||
| method_args: tuple = (), | ||
| mark_initialized: bool = False | ||
| ) -> Any: | ||
| """ | ||
| Call an inverter method with resilience handling. | ||
|
|
||
| Args: | ||
| method: The inverter method to call | ||
| operation_name: Name for logging | ||
| cache_attr: Attribute name in CachedValues to use for fallback | ||
| default_value: Value to return if no cache available | ||
| method_args: Arguments to pass to the method | ||
| mark_initialized: If True, mark initialization complete on success | ||
| (should only be True for set_mode_* calls) | ||
|
|
||
| Returns: | ||
| The method result, cached value, or default value | ||
| """ | ||
| # Check if we're in backoff period - skip actual call to avoid | ||
| # hammering an unavailable inverter | ||
| if self._is_in_backoff_period(): | ||
| return self._get_cached_or_default( | ||
| operation_name, cache_attr, default_value, is_backoff=True | ||
| ) | ||
|
|
||
| try: | ||
| result = method(*method_args) | ||
| self._handle_success(mark_initialized=mark_initialized) | ||
|
|
||
| # Update cache if applicable | ||
| if cache_attr is not None: | ||
| setattr(self._cache, cache_attr, result) | ||
| self._cache.last_update_time = time.time() | ||
|
|
||
| return result | ||
|
|
||
| except Exception as e: # pylint: disable=broad-exception-caught | ||
| should_reraise = self._handle_failure(operation_name, e) | ||
|
|
||
| if should_reraise: | ||
| raise | ||
|
|
||
| return self._get_cached_or_default( | ||
| operation_name, cache_attr, default_value, is_backoff=False | ||
| ) | ||
|
|
||
| def _get_cached_or_default( | ||
| self, | ||
| operation_name: str, | ||
| cache_attr: Optional[str], | ||
| default_value: Any, | ||
| is_backoff: bool | ||
| ) -> Any: | ||
| """ | ||
| Get cached value or default when inverter is unavailable. | ||
|
|
||
| Args: | ||
| operation_name: Name for logging | ||
| cache_attr: Attribute name in CachedValues | ||
| default_value: Fallback value if no cache | ||
| is_backoff: True if skipping due to backoff period | ||
|
|
||
| Returns: | ||
| Cached value, default value, or raises if neither available | ||
| """ | ||
| reason = "in backoff period" if is_backoff else "after failure" | ||
|
|
||
| # Try to use cached value | ||
| if cache_attr is not None: | ||
| cached_value = getattr(self._cache, cache_attr, None) | ||
| if cached_value is not None: | ||
| cache_age = time.time() - self._cache.last_update_time | ||
| time_until_retry = 0 | ||
| if self._last_failure_time: | ||
| time_until_retry = max( | ||
| 0, | ||
| self._retry_backoff_seconds - (time.time() - self._last_failure_time) | ||
| ) | ||
| logger.debug( | ||
| "Using cached %s value: %s (%s, age: %.1f min, retry in: %.0fs)", | ||
| cache_attr, cached_value, reason, cache_age / 60, time_until_retry | ||
| ) | ||
| return cached_value | ||
|
|
||
| # No cache available | ||
| if default_value is not None: | ||
| logger.warning( | ||
| "No cached value for %s (%s), using default: %s", | ||
| operation_name, reason, default_value | ||
| ) | ||
| return default_value | ||
|
|
||
| # Re-raise if no fallback available | ||
| raise RuntimeError( | ||
| f"No cached value or default available for {operation_name} ({reason})" | ||
| ) | ||
|
|
||
| # ========================================================================= | ||
| # InverterInterface Implementation - Read Operations (with caching) | ||
| # ========================================================================= | ||
|
|
||
| def get_SOC(self) -> float: | ||
| """Get state of charge with resilience handling.""" | ||
| return self._call_with_resilience( | ||
| self._inverter.get_SOC, | ||
| "get_SOC", | ||
| cache_attr="soc", | ||
| default_value=50.0 # Safe middle value if no cache | ||
| ) | ||
|
|
||
| def get_stored_energy(self) -> float: | ||
| """Get stored energy with resilience handling.""" | ||
| return self._call_with_resilience( | ||
| self._inverter.get_stored_energy, | ||
| "get_stored_energy", | ||
| cache_attr="stored_energy" | ||
| ) | ||
|
|
||
| def get_stored_usable_energy(self) -> float: | ||
| """Get stored usable energy with resilience handling.""" | ||
| return self._call_with_resilience( | ||
| self._inverter.get_stored_usable_energy, | ||
| "get_stored_usable_energy", | ||
| cache_attr="stored_usable_energy" | ||
| ) | ||
|
|
||
| def get_capacity(self) -> float: | ||
| """Get capacity with resilience handling.""" | ||
| return self._call_with_resilience( | ||
| self._inverter.get_capacity, | ||
| "get_capacity", | ||
| cache_attr="capacity" | ||
| ) | ||
|
|
||
| def get_free_capacity(self) -> float: | ||
| """Get free capacity with resilience handling.""" | ||
| return self._call_with_resilience( | ||
| self._inverter.get_free_capacity, | ||
| "get_free_capacity", | ||
| cache_attr="free_capacity" | ||
| ) | ||
|
|
||
| def get_max_capacity(self) -> float: | ||
| """Get max capacity with resilience handling.""" | ||
| return self._call_with_resilience( | ||
| self._inverter.get_max_capacity, | ||
| "get_max_capacity", | ||
| cache_attr="max_capacity" | ||
| ) | ||
|
|
||
| # ========================================================================= | ||
| # InverterInterface Implementation - Write Operations (no caching) | ||
| # ========================================================================= | ||
|
|
||
| def set_mode_force_charge(self, chargerate: float): | ||
| """Set force charge mode with resilience handling.""" | ||
| return self._call_with_resilience( | ||
| self._inverter.set_mode_force_charge, | ||
| "set_mode_force_charge", | ||
| None, None, | ||
| method_args=(chargerate,), | ||
| mark_initialized=True | ||
| ) | ||
|
|
||
| def set_mode_avoid_discharge(self): | ||
| """Set avoid discharge mode with resilience handling.""" | ||
| return self._call_with_resilience( | ||
| self._inverter.set_mode_avoid_discharge, | ||
| "set_mode_avoid_discharge", | ||
| mark_initialized=True | ||
| ) | ||
|
|
||
| def set_mode_allow_discharge(self): | ||
| """Set allow discharge mode with resilience handling.""" | ||
| return self._call_with_resilience( | ||
| self._inverter.set_mode_allow_discharge, | ||
| "set_mode_allow_discharge", | ||
| mark_initialized=True | ||
| ) | ||
|
|
||
| def set_mode_limit_battery_charge(self, limit_charge_rate: int): | ||
| """Set limit battery charge mode with resilience handling.""" | ||
| return self._call_with_resilience( | ||
| self._inverter.set_mode_limit_battery_charge, | ||
| "set_mode_limit_battery_charge", | ||
| None, None, | ||
| method_args=(limit_charge_rate,), | ||
| mark_initialized=True | ||
| ) | ||
|
|
||
| # ========================================================================= | ||
| # InverterInterface Implementation - Other Methods | ||
| # ========================================================================= | ||
|
|
||
| def activate_mqtt(self, api_mqtt_api: object): | ||
| """Activate MQTT - delegate to wrapped inverter.""" | ||
| self._inverter.activate_mqtt(api_mqtt_api) | ||
| # Update forwarded mqtt_api attribute | ||
| if hasattr(self._inverter, 'mqtt_api'): | ||
| self.mqtt_api = self._inverter.mqtt_api | ||
|
|
||
| def refresh_api_values(self): | ||
| """Refresh API values with resilience handling.""" | ||
| try: | ||
| self._inverter.refresh_api_values() | ||
| self._handle_success(mark_initialized=False) | ||
| except Exception as e: # pylint: disable=broad-exception-caught | ||
| self._handle_failure("refresh_api_values", e) | ||
| # This is non-critical, just log and continue | ||
| logger.debug("Skipping API value refresh during outage") | ||
|
|
||
| def shutdown(self): | ||
| """Shutdown - delegate to wrapped inverter.""" | ||
| try: | ||
| self._inverter.shutdown() | ||
| except Exception as e: # pylint: disable=broad-exception-caught | ||
| logger.warning("Error during inverter shutdown: %s", e) | ||
|
|
||
| # ========================================================================= | ||
| # Additional Methods (forward to wrapped inverter) | ||
| # ========================================================================= | ||
|
|
||
| def get_designed_capacity(self) -> float: | ||
| """Get designed capacity with resilience handling.""" | ||
| if hasattr(self._inverter, 'get_designed_capacity'): | ||
| return self._call_with_resilience( | ||
| self._inverter.get_designed_capacity, | ||
| "get_designed_capacity", | ||
| cache_attr="designed_capacity" | ||
| ) | ||
| return self.get_capacity() | ||
|
|
||
| def get_usable_capacity(self) -> float: | ||
| """Get usable capacity with resilience handling.""" | ||
| if hasattr(self._inverter, 'get_usable_capacity'): | ||
| return self._call_with_resilience( | ||
| self._inverter.get_usable_capacity, | ||
| "get_usable_capacity", | ||
| cache_attr="usable_capacity" | ||
| ) | ||
| # Fallback calculation | ||
| return self.get_max_capacity() | ||
|
|
||
| def get_mqtt_inverter_topic(self) -> str: | ||
| """Get MQTT topic - delegate to wrapped inverter.""" | ||
| if hasattr(self._inverter, 'get_mqtt_inverter_topic'): | ||
| return self._inverter.get_mqtt_inverter_topic() | ||
| return f'inverters/{getattr(self, "inverter_num", 0)}/' | ||
|
|
||
| def publish_inverter_discovery_messages(self): | ||
| """Publish discovery messages - delegate to wrapped inverter.""" | ||
| if hasattr(self._inverter, 'publish_inverter_discovery_messages'): | ||
| try: | ||
| self._inverter.publish_inverter_discovery_messages() | ||
| except Exception as e: # pylint: disable=broad-exception-caught | ||
| logger.warning( | ||
| "Failed to publish discovery messages: %s", e | ||
| ) | ||
|
|
||
| # ========================================================================= | ||
| # Status and Diagnostics | ||
| # ========================================================================= | ||
|
|
||
| def get_outage_status(self) -> dict: | ||
| """ | ||
| Get current outage status for diagnostics. | ||
|
|
||
| Returns: | ||
| dict with outage information | ||
| """ | ||
| outage_duration = 0 | ||
| if self._first_failure_time is not None: | ||
| outage_duration = time.time() - self._first_failure_time | ||
|
|
||
| time_until_retry = 0 | ||
| if self._last_failure_time is not None: | ||
| time_until_retry = max( | ||
| 0, | ||
| self._retry_backoff_seconds - (time.time() - self._last_failure_time) | ||
| ) | ||
|
|
||
| return { | ||
| "is_connected": self._first_failure_time is None, | ||
| "initialization_complete": self._initialization_complete, | ||
| "outage_duration_seconds": outage_duration, | ||
| "outage_duration_minutes": outage_duration / 60, | ||
| "outage_tolerance_seconds": self._outage_tolerance_seconds, | ||
| "consecutive_failures": self._consecutive_failures, | ||
| "cache_valid": self._cache.is_valid(), | ||
| "cache_age_seconds": time.time() - self._cache.last_update_time, | ||
| "in_backoff_period": self._is_in_backoff_period(), | ||
| "retry_backoff_seconds": self._retry_backoff_seconds, | ||
| "time_until_retry_seconds": time_until_retry | ||
| } | ||
|
|
||
| @property | ||
| def wrapped_inverter(self) -> InverterInterface: | ||
| """Access to the wrapped inverter for advanced use cases.""" | ||
| return self._inverter | ||
|
|
||
| def __getattr__(self, name): | ||
| """Forward unknown attributes to the wrapped inverter.""" | ||
| # This is called when an attribute is not found on the wrapper | ||
| # Forward to the wrapped inverter | ||
| return getattr(self._inverter, name) | ||
| "tolerance: %.1f min). Skipping this control cycle.", | ||
| name, outage / 60, self._outage_tolerance / 60) | ||
| raise InverterCommunicationError(name) from error |
Comment on lines
+66
to
+70
| These run on background threads in response to MQTT/evcc events. If the | ||
| inverter is briefly unavailable the request is dropped and logged; the | ||
| next scheduled run() reconciles the inverter state. Termination on a | ||
| permanent outage is driven by the main run() loop, not background threads. | ||
| """ |
| self.allow_discharging() | ||
|
|
||
| def run(self): | ||
| """ One control cycle. Aborts cleanly on a transient inverter outage. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Refactor
ResilientInverterWrapperto use a simpler, three-state failure model instead of caching values during outages. The wrapper now:set_mode_*): propagate errors unchanged (fail-fast for config errors)InverterCommunicationErroron any failure, allowing the caller to skip the cycle and retry on the next scheduled runInverterOutageErrorto terminate batcontrolThis eliminates the complexity of caching, backoff periods, and default values while maintaining the same resilience guarantees.
Key Changes
CachedValuesdataclass or cache fallback during outages. Failures are signaled immediately via exceptions instead._initialization_complete,_first_failure_time,_last_failure_time,_consecutive_failureswith just_initializedand_outage_start.InverterErrorbase class andInverterCommunicationErrorfor transient failures;InverterOutageErrorremains for permanent outages.ResilientInverterWrappernow uses__getattr__to delegate all attributes and methods to the wrapped inverter, only intercepting method calls to guard them.refresh_api_values()andshutdown()never raise or start outage tracking—they log and continue.core.run(): CatchesInverterCommunicationErrorto skip the control cycle;InverterOutageErrorpropagates to terminate.@_tolerate_inverter_outagedecorator for externally triggered actions (MQTT/evcc) to swallow transient failures.Implementation Details
__getattr__and wraps them with_guard(), which classifies failures based on initialization state and outage duration.set_mode_*) mark initialization complete on first success.get_SOC()) no longer cache values; failures immediately raiseInverterCommunicationError.https://claude.ai/code/session_019TARxrY2uWGTDuxwyugtqD