Skip to content

Simplify resilient wrapper: fail-fast → communication error → outage error#381

Open
MaStr wants to merge 5 commits into
mainfrom
claude/exciting-ptolemy-qxaxbn
Open

Simplify resilient wrapper: fail-fast → communication error → outage error#381
MaStr wants to merge 5 commits into
mainfrom
claude/exciting-ptolemy-qxaxbn

Conversation

@MaStr

@MaStr MaStr commented Jun 13, 2026

Copy link
Copy Markdown
Owner

Summary

Refactor ResilientInverterWrapper to use a simpler, three-state failure model instead of caching values during outages. The wrapper now:

  1. Before initialization (first successful set_mode_*): propagate errors unchanged (fail-fast for config errors)
  2. After initialization: raise InverterCommunicationError on any failure, allowing the caller to skip the cycle and retry on the next scheduled run
  3. After tolerance window expires: escalate to InverterOutageError to terminate batcontrol

This eliminates the complexity of caching, backoff periods, and default values while maintaining the same resilience guarantees.

Key Changes

  • Removed caching logic: No more CachedValues dataclass or cache fallback during outages. Failures are signaled immediately via exceptions instead.
  • Removed retry backoff: The main loop's natural interval (15 minutes) serves as the backoff; no need for explicit backoff tracking.
  • Simplified state tracking: Replaced _initialization_complete, _first_failure_time, _last_failure_time, _consecutive_failures with just _initialized and _outage_start.
  • New exception hierarchy: Added InverterError base class and InverterCommunicationError for transient failures; InverterOutageError remains for permanent outages.
  • Thin proxy design: ResilientInverterWrapper now uses __getattr__ to delegate all attributes and methods to the wrapped inverter, only intercepting method calls to guard them.
  • Best-effort methods: refresh_api_values() and shutdown() never raise or start outage tracking—they log and continue.
  • Updated core.run(): Catches InverterCommunicationError to skip the control cycle; InverterOutageError propagates to terminate.
  • Background thread safety: Added @_tolerate_inverter_outage decorator for externally triggered actions (MQTT/evcc) to swallow transient failures.

Implementation Details

  • The wrapper intercepts method calls via __getattr__ and wraps them with _guard(), which classifies failures based on initialization state and outage duration.
  • Command methods (set_mode_*) mark initialization complete on first success.
  • Read methods (e.g., get_SOC()) no longer cache values; failures immediately raise InverterCommunicationError.
  • Tests reduced from 582 to 306 lines, focusing on the three-state model rather than caching behavior.
  • Documentation updated to reflect the new behavior and exception semantics.

https://claude.ai/code/session_019TARxrY2uWGTDuxwyugtqD

claude added 5 commits June 13, 2026 10:29
A read failure (e.g. refresh_api_values) starts a retry backoff window in
which the inverter is presumed unavailable. Command operations (set_mode_*)
were subject to the same backoff skip as reads, but have no cached value or
default to fall back on. A set_mode_* call issued during the backoff window
was therefore routed into _get_cached_or_default, which raised RuntimeError
and crashed the whole process.

Commands now degrade gracefully instead of raising:
- During backoff the command is discarded (returns None) rather than fired
  at a presumed-unavailable inverter; batcontrol recalculates and re-issues
  the mode on the next cycle.
- A command attempted outside backoff that fails also returns None instead
  of raising.

Fail-fast before initialization and InverterOutageError after the outage
tolerance window are preserved.

https://claude.ai/code/session_019TARxrY2uWGTDuxwyugtqD
… _read/_command

The old implementation had a single _call_with_resilience method with 7
parameters plus a separate _get_cached_or_default, driven by flags like
is_command and cache_attr strings. Replaced with two clear primitives:

- _read(key, method, default): caches result, returns cached value during
  outages; the dict-based cache replaces the CachedValues dataclass.
- _command(name, method, *args): discards silently during backoff, returns
  None on transient failure instead of crashing.

The _on_failure helper now returns bool (True = re-raise) instead of mixing
returns and raises, making the control flow easier to follow.

Line count: 528 -> 278 (wrapper), 651 -> 380 (tests). Behaviour unchanged.

https://claude.ai/code/session_019TARxrY2uWGTDuxwyugtqD
Replace the value-caching/backoff design with a much simpler concept driven
by the existing run() loop:

- The wrapper no longer caches reads or discards writes. After the first
  successful set_mode_* (initialization), any inverter failure raises
  InverterCommunicationError. core.run() catches it, skips the control cycle
  and retries on the next scheduled run - the loop interval is the natural
  backoff, and no decision is ever made on stale data.
- A failure that persists past outage_tolerance_minutes escalates to
  InverterOutageError, which propagates to __main__ and terminates (unchanged).
- Errors before initialization still propagate unchanged (fail-fast on config).
- Externally triggered control (api_set_mode/charge_rate/limit,
  set_discharge_blocked) runs on background threads and tolerates a transient
  outage via the _tolerate_inverter_outage decorator; termination stays driven
  by the main run() loop.

New InverterError base with InverterCommunicationError; refresh_api_values is
best-effort. Removes CachedValues, retry backoff and get_outage_status. The
retry_backoff_seconds config option is dropped.

resilient_wrapper.py: 528 -> 208 lines.

https://claude.ai/code/session_019TARxrY2uWGTDuxwyugtqD
The wrapper re-declared every InverterInterface method (and forwarded a list
of attributes) just to delegate each call - pure boilerplate. The optional
methods it provided fallbacks for (get_designed_capacity, get_usable_capacity,
get_mqtt_inverter_topic, publish_inverter_discovery_messages) live in the
backend baseclass and are only ever called internally on the backend, never on
the wrapper, so the wrapper never needed them.

Replace the whole thing with a __getattr__ proxy: delegate every attribute and
method to the wrapped inverter and intercept only failing method calls. The
outage classification (fail-fast before init, InverterCommunicationError within
tolerance, InverterOutageError past it) is unchanged; shutdown and
refresh_api_values stay best-effort.

No longer subclasses InverterInterface (the ABC's abstractmethods are satisfied
by delegation, not concrete overrides). Existing wrapper tests pass unchanged.

resilient_wrapper.py: 208 -> 114 lines (528 at the start of this work).

https://claude.ai/code/session_019TARxrY2uWGTDuxwyugtqD
Add the rationale that, without the wrapper, an inverter outage terminates
batcontrol and the container restart policy brings it straight back, turning a
multi-minute outage into a restart loop. Each cold start re-fetches the price
and solar forecasts, so repeated restarts can hit provider rate limits
(Awattar/Tibber, Forecast.Solar/SolarPrognose) - leaving batcontrol without
fresh data even after the inverter recovers.

https://claude.ai/code/session_019TARxrY2uWGTDuxwyugtqD
Copilot AI review requested due to automatic review settings June 13, 2026 19:17

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors the inverter resilience layer to a simpler three-state failure model (fail-fast pre-init, transient communication error post-init, outage error after tolerance) and updates core control flow, tests, and documentation accordingly.

Changes:

  • Replaces cached/backoff-based ResilientInverterWrapper with a thin __getattr__ proxy that normalizes failures into InverterCommunicationError / InverterOutageError.
  • Updates Batcontrol.run() to skip a control cycle on InverterCommunicationError, and adds a decorator to swallow inverter errors for externally triggered actions.
  • Updates tests and configuration documentation to match the new failure semantics and removal of retry backoff.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
tests/batcontrol/test_core.py Adds coverage for skipping cycles on InverterCommunicationError and propagating InverterOutageError, plus external API tolerance.
tests/batcontrol/inverter/test_resilient_wrapper.py Reworks wrapper tests around the new three-state model (no caching/backoff).
src/batcontrol/inverter/resilient_wrapper.py Implements the simplified resilient proxy via __getattr__ + guarded method calls and outage tracking.
src/batcontrol/inverter/inverter.py Removes retry-backoff configuration wiring and updates wrapper construction/logging.
src/batcontrol/inverter/exceptions.py Introduces InverterError base class and InverterCommunicationError; keeps InverterOutageError.
src/batcontrol/inverter/init.py Exports the new exception classes from the package.
src/batcontrol/core.py Catches InverterCommunicationError in run() to skip cycles; adds _tolerate_inverter_outage for background-triggered actions.
docs/configuration/inverter-configuration.md Updates user documentation to describe skip-cycle behavior and removes retry-backoff docs.
config/batcontrol_config_dummy.yaml Removes retry_backoff_seconds and updates the resilient-wrapper comment to reflect skip-cycle behavior.

Comment on lines 95 to +97
logger.error(
"Inverter communication failed before initialization complete. "
"This may be a configuration error: %s", error
)
return True # Signal to caller to re-raise
"Inverter failure before initialization (config error?): %s", error)
raise error
Comment on lines 110 to +114
logger.warning(
"Inverter communication failed for '%s' (outage: %.1f min, "
"tolerance: %.1f min, failures: %d)",
operation, outage_minutes,
self._outage_tolerance_seconds / 60,
self._consecutive_failures
)

return False # Don't re-raise, try cache/default

def _handle_success(self, mark_initialized: bool = False) -> None:
"""
Handle a successful inverter communication.

Args:
mark_initialized: If True, mark initialization as complete.
This should only be True for set_mode_* calls.
"""
if self._first_failure_time is not None:
outage_duration = time.time() - self._first_failure_time
logger.info(
"Inverter connection restored after %.1f minutes outage",
outage_duration / 60
)

if mark_initialized and not self._initialization_complete:
logger.info("Inverter initialization complete (first set_mode succeeded)")
self._initialization_complete = True

self._first_failure_time = None
self._last_failure_time = None
self._consecutive_failures = 0

def _call_with_resilience(
self,
method: Callable,
operation_name: str,
cache_attr: Optional[str] = None,
default_value: Any = None,
method_args: tuple = (),
mark_initialized: bool = False
) -> Any:
"""
Call an inverter method with resilience handling.

Args:
method: The inverter method to call
operation_name: Name for logging
cache_attr: Attribute name in CachedValues to use for fallback
default_value: Value to return if no cache available
method_args: Arguments to pass to the method
mark_initialized: If True, mark initialization complete on success
(should only be True for set_mode_* calls)

Returns:
The method result, cached value, or default value
"""
# Check if we're in backoff period - skip actual call to avoid
# hammering an unavailable inverter
if self._is_in_backoff_period():
return self._get_cached_or_default(
operation_name, cache_attr, default_value, is_backoff=True
)

try:
result = method(*method_args)
self._handle_success(mark_initialized=mark_initialized)

# Update cache if applicable
if cache_attr is not None:
setattr(self._cache, cache_attr, result)
self._cache.last_update_time = time.time()

return result

except Exception as e: # pylint: disable=broad-exception-caught
should_reraise = self._handle_failure(operation_name, e)

if should_reraise:
raise

return self._get_cached_or_default(
operation_name, cache_attr, default_value, is_backoff=False
)

def _get_cached_or_default(
self,
operation_name: str,
cache_attr: Optional[str],
default_value: Any,
is_backoff: bool
) -> Any:
"""
Get cached value or default when inverter is unavailable.

Args:
operation_name: Name for logging
cache_attr: Attribute name in CachedValues
default_value: Fallback value if no cache
is_backoff: True if skipping due to backoff period

Returns:
Cached value, default value, or raises if neither available
"""
reason = "in backoff period" if is_backoff else "after failure"

# Try to use cached value
if cache_attr is not None:
cached_value = getattr(self._cache, cache_attr, None)
if cached_value is not None:
cache_age = time.time() - self._cache.last_update_time
time_until_retry = 0
if self._last_failure_time:
time_until_retry = max(
0,
self._retry_backoff_seconds - (time.time() - self._last_failure_time)
)
logger.debug(
"Using cached %s value: %s (%s, age: %.1f min, retry in: %.0fs)",
cache_attr, cached_value, reason, cache_age / 60, time_until_retry
)
return cached_value

# No cache available
if default_value is not None:
logger.warning(
"No cached value for %s (%s), using default: %s",
operation_name, reason, default_value
)
return default_value

# Re-raise if no fallback available
raise RuntimeError(
f"No cached value or default available for {operation_name} ({reason})"
)

# =========================================================================
# InverterInterface Implementation - Read Operations (with caching)
# =========================================================================

def get_SOC(self) -> float:
"""Get state of charge with resilience handling."""
return self._call_with_resilience(
self._inverter.get_SOC,
"get_SOC",
cache_attr="soc",
default_value=50.0 # Safe middle value if no cache
)

def get_stored_energy(self) -> float:
"""Get stored energy with resilience handling."""
return self._call_with_resilience(
self._inverter.get_stored_energy,
"get_stored_energy",
cache_attr="stored_energy"
)

def get_stored_usable_energy(self) -> float:
"""Get stored usable energy with resilience handling."""
return self._call_with_resilience(
self._inverter.get_stored_usable_energy,
"get_stored_usable_energy",
cache_attr="stored_usable_energy"
)

def get_capacity(self) -> float:
"""Get capacity with resilience handling."""
return self._call_with_resilience(
self._inverter.get_capacity,
"get_capacity",
cache_attr="capacity"
)

def get_free_capacity(self) -> float:
"""Get free capacity with resilience handling."""
return self._call_with_resilience(
self._inverter.get_free_capacity,
"get_free_capacity",
cache_attr="free_capacity"
)

def get_max_capacity(self) -> float:
"""Get max capacity with resilience handling."""
return self._call_with_resilience(
self._inverter.get_max_capacity,
"get_max_capacity",
cache_attr="max_capacity"
)

# =========================================================================
# InverterInterface Implementation - Write Operations (no caching)
# =========================================================================

def set_mode_force_charge(self, chargerate: float):
"""Set force charge mode with resilience handling."""
return self._call_with_resilience(
self._inverter.set_mode_force_charge,
"set_mode_force_charge",
None, None,
method_args=(chargerate,),
mark_initialized=True
)

def set_mode_avoid_discharge(self):
"""Set avoid discharge mode with resilience handling."""
return self._call_with_resilience(
self._inverter.set_mode_avoid_discharge,
"set_mode_avoid_discharge",
mark_initialized=True
)

def set_mode_allow_discharge(self):
"""Set allow discharge mode with resilience handling."""
return self._call_with_resilience(
self._inverter.set_mode_allow_discharge,
"set_mode_allow_discharge",
mark_initialized=True
)

def set_mode_limit_battery_charge(self, limit_charge_rate: int):
"""Set limit battery charge mode with resilience handling."""
return self._call_with_resilience(
self._inverter.set_mode_limit_battery_charge,
"set_mode_limit_battery_charge",
None, None,
method_args=(limit_charge_rate,),
mark_initialized=True
)

# =========================================================================
# InverterInterface Implementation - Other Methods
# =========================================================================

def activate_mqtt(self, api_mqtt_api: object):
"""Activate MQTT - delegate to wrapped inverter."""
self._inverter.activate_mqtt(api_mqtt_api)
# Update forwarded mqtt_api attribute
if hasattr(self._inverter, 'mqtt_api'):
self.mqtt_api = self._inverter.mqtt_api

def refresh_api_values(self):
"""Refresh API values with resilience handling."""
try:
self._inverter.refresh_api_values()
self._handle_success(mark_initialized=False)
except Exception as e: # pylint: disable=broad-exception-caught
self._handle_failure("refresh_api_values", e)
# This is non-critical, just log and continue
logger.debug("Skipping API value refresh during outage")

def shutdown(self):
"""Shutdown - delegate to wrapped inverter."""
try:
self._inverter.shutdown()
except Exception as e: # pylint: disable=broad-exception-caught
logger.warning("Error during inverter shutdown: %s", e)

# =========================================================================
# Additional Methods (forward to wrapped inverter)
# =========================================================================

def get_designed_capacity(self) -> float:
"""Get designed capacity with resilience handling."""
if hasattr(self._inverter, 'get_designed_capacity'):
return self._call_with_resilience(
self._inverter.get_designed_capacity,
"get_designed_capacity",
cache_attr="designed_capacity"
)
return self.get_capacity()

def get_usable_capacity(self) -> float:
"""Get usable capacity with resilience handling."""
if hasattr(self._inverter, 'get_usable_capacity'):
return self._call_with_resilience(
self._inverter.get_usable_capacity,
"get_usable_capacity",
cache_attr="usable_capacity"
)
# Fallback calculation
return self.get_max_capacity()

def get_mqtt_inverter_topic(self) -> str:
"""Get MQTT topic - delegate to wrapped inverter."""
if hasattr(self._inverter, 'get_mqtt_inverter_topic'):
return self._inverter.get_mqtt_inverter_topic()
return f'inverters/{getattr(self, "inverter_num", 0)}/'

def publish_inverter_discovery_messages(self):
"""Publish discovery messages - delegate to wrapped inverter."""
if hasattr(self._inverter, 'publish_inverter_discovery_messages'):
try:
self._inverter.publish_inverter_discovery_messages()
except Exception as e: # pylint: disable=broad-exception-caught
logger.warning(
"Failed to publish discovery messages: %s", e
)

# =========================================================================
# Status and Diagnostics
# =========================================================================

def get_outage_status(self) -> dict:
"""
Get current outage status for diagnostics.

Returns:
dict with outage information
"""
outage_duration = 0
if self._first_failure_time is not None:
outage_duration = time.time() - self._first_failure_time

time_until_retry = 0
if self._last_failure_time is not None:
time_until_retry = max(
0,
self._retry_backoff_seconds - (time.time() - self._last_failure_time)
)

return {
"is_connected": self._first_failure_time is None,
"initialization_complete": self._initialization_complete,
"outage_duration_seconds": outage_duration,
"outage_duration_minutes": outage_duration / 60,
"outage_tolerance_seconds": self._outage_tolerance_seconds,
"consecutive_failures": self._consecutive_failures,
"cache_valid": self._cache.is_valid(),
"cache_age_seconds": time.time() - self._cache.last_update_time,
"in_backoff_period": self._is_in_backoff_period(),
"retry_backoff_seconds": self._retry_backoff_seconds,
"time_until_retry_seconds": time_until_retry
}

@property
def wrapped_inverter(self) -> InverterInterface:
"""Access to the wrapped inverter for advanced use cases."""
return self._inverter

def __getattr__(self, name):
"""Forward unknown attributes to the wrapped inverter."""
# This is called when an attribute is not found on the wrapper
# Forward to the wrapped inverter
return getattr(self._inverter, name)
"tolerance: %.1f min). Skipping this control cycle.",
name, outage / 60, self._outage_tolerance / 60)
raise InverterCommunicationError(name) from error
Comment thread src/batcontrol/core.py
Comment on lines +66 to +70
These run on background threads in response to MQTT/evcc events. If the
inverter is briefly unavailable the request is dropped and logged; the
next scheduled run() reconciles the inverter state. Termination on a
permanent outage is driven by the main run() loop, not background threads.
"""
Comment thread src/batcontrol/core.py
self.allow_discharging()

def run(self):
""" One control cycle. Aborts cleanly on a transient inverter outage.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants