Skip to content

Latest commit

 

History

History
337 lines (258 loc) · 13 KB

File metadata and controls

337 lines (258 loc) · 13 KB

🐕 Watchdog Timers

System Monitoring and Recovery Mechanisms for Reliable Embedded Systems
Learn to implement watchdog timers for system health monitoring, fault detection, and automatic recovery


📋 Table of Contents


🎯 Overview

Watchdog timers are essential safety mechanisms that monitor system health and automatically reset the system if it becomes unresponsive or enters an error state. They are critical for reliable embedded systems, especially in safety-critical applications.


🚀 Quick Reference: Key Facts

  • System Monitoring: Continuous health checking and fault detection
  • Timeout Period: Maximum time between watchdog feeds
  • Recovery Mechanisms: Automatic system reset and recovery
  • Fault Detection: Identifying system failures and errors
  • Hardware vs Software: Independent hardware timers vs. software-based monitoring
  • Windowed Mode: Time window restrictions for feeding to catch "too-fast" faults
  • Recovery Levels: Different recovery strategies based on fault severity

🔍 Visual Understanding

Watchdog Timer Operation Flow

System Running → Health Check → Feed Watchdog → Continue Operation
     ↓              ↓              ↓              ↓
  Normal      Monitor Tasks    Reset Timer    System OK
Operation     Check Memory    Extend Life    Continue

     ↓              ↓              ↓              ↓
  Fault        Health Check     No Feed      Timeout
Detected        Fails          Watchdog      Reset

Watchdog Types Comparison

Hardware Watchdog:    CPU Clock → Independent Timer → Reset Circuit
                            ↓              ↓              ↓
                      Independent    Always Running    Reliable Reset
                      of CPU        Even if CPU      Mechanism
                      State         Crashes

Software Watchdog:    System Task → Monitor Other → Trigger Recovery
                            ↓              ↓              ↓
                      Part of OS    Can Fail with    Software-Based
                      Software      System           Recovery

Recovery Strategy Hierarchy

Fault Detected → Assess Severity → Choose Recovery → Execute Recovery
     ↓              ↓              ↓              ↓
  System Error   Minor Fault    Restart Task    Continue
  Major Fault    Major Fault    Restart App     Operation
  Critical Fault Critical Fault System Reset    Full Recovery

🧠 Conceptual Foundation

The Watchdog as a Safety Net

Watchdog timers represent a fundamental safety principle in embedded systems: fail-safe operation. Instead of allowing a system to fail silently or become unresponsive, watchdog timers provide automatic detection and recovery mechanisms. This philosophy enables:

  • System Reliability: Automatic recovery from transient faults
  • Fault Tolerance: Continued operation despite software or hardware issues
  • Safety Assurance: Guaranteed system response to critical failures
  • Maintenance Efficiency: Clear identification of system problems

Why Watchdog Timers Matter

Watchdog timers are critical because embedded systems operate in unpredictable environments where failures are inevitable. Proper watchdog implementation enables:

  • Automatic Recovery: System self-healing without manual intervention
  • Fault Detection: Early identification of system problems
  • System Integrity: Prevention of system lockup or hang conditions
  • Safety Compliance: Meeting safety standards for critical applications

The Watchdog Design Challenge

Designing watchdog systems involves balancing several competing concerns:

  • Timeout vs. Performance: Shorter timeouts provide faster recovery but require more frequent feeding
  • Simplicity vs. Robustness: Simple watchdogs vs. comprehensive health monitoring
  • Hardware vs. Software: Independent hardware vs. software-based monitoring
  • Recovery vs. Continuity: Immediate reset vs. graceful degradation

🎯 Core Concepts

Concept: Hardware Watchdog Configuration and Operation

Why it matters: Hardware watchdogs provide the most reliable system monitoring because they operate independently of the main CPU. They can detect and recover from system failures even when the CPU is completely unresponsive.

Minimal example

// Basic hardware watchdog configuration
typedef struct {
    uint32_t timeout_ms;          // Timeout period in milliseconds
    uint32_t prescaler;           // Clock prescaler value
    bool window_mode;             // Enable windowed mode
} hw_watchdog_config_t;

// Initialize hardware watchdog
void init_hardware_watchdog(hw_watchdog_config_t *config) {
    // Enable watchdog clock (LSI = 40kHz)
    RCC->CSR |= RCC_CSR_LSION;
    
    // Wait for LSI to be ready
    while (!(RCC->CSR & RCC_CSR_LSIRDY));
    
    // Configure prescaler and reload value
    IWDG->PR = config->prescaler;
    IWDG->RLR = (config->timeout_ms * 40) / (1000 * (config->prescaler + 1));
    
    // Enable and start watchdog
    IWDG->KR = 0xCCCC;  // Enable
    IWDG->KR = 0xAAAA;  // Start
}

Try it: Configure a hardware watchdog with a 1-second timeout and test system recovery.

Takeaways

  • Hardware watchdogs use independent clock sources (LSI)
  • Prescaler and reload values determine timeout period
  • Watchdog must be fed before timeout to prevent reset
  • Independent operation ensures reliability even during CPU failure

Concept: System Health Monitoring and Watchdog Feeding Strategy

Why it matters: Effective watchdog operation requires intelligent health monitoring. Simply feeding the watchdog on a timer doesn't guarantee system health - the feeding should only occur when critical system functions are verified to be working correctly.

Minimal example

// System health monitoring structure
typedef struct {
    bool tasks_running;           // Critical tasks are alive
    bool memory_ok;              // Memory integrity check passed
    bool communication_ok;        // Communication systems working
    bool sensors_responding;      // Sensor data is valid
} system_health_t;

// Feed watchdog only if system is healthy
void feed_watchdog_if_healthy(void) {
    system_health_t health = check_system_health();
    
    if (health.tasks_running && 
        health.memory_ok && 
        health.communication_ok && 
        health.sensors_responding) {
        
        // System is healthy, feed watchdog
        IWDG->KR = 0xAAAA;
    } else {
        // System has issues, let watchdog reset
        // Log health status for debugging
        log_system_health(&health);
    }
}

Try it: Implement a health monitoring system that checks multiple system components before feeding the watchdog.

Takeaways

  • Health checks should verify critical system functions
  • Only feed watchdog when system is truly healthy
  • Log health status for debugging and analysis
  • Let watchdog reset if system is unhealthy

Concept: Recovery Strategy Selection and Implementation

Why it matters: Different types of system failures require different recovery approaches. Implementing multiple recovery levels enables graceful degradation and prevents unnecessary system resets for minor issues.

Minimal example

// Recovery strategy levels
typedef enum {
    RECOVERY_NONE,           // No recovery needed
    RECOVERY_RESTART_TASK,   // Restart failed task
    RECOVERY_RESTART_APP,    // Restart application
    RECOVERY_SYSTEM_RESET    // Full system reset
} recovery_level_t;

// Determine recovery strategy based on fault type
recovery_level_t determine_recovery_strategy(system_health_t *health) {
    if (!health->tasks_running) {
        return RECOVERY_RESTART_TASK;
    } else if (!health->memory_ok) {
        return RECOVERY_RESTART_APP;
    } else if (!health->communication_ok && !health->sensors_responding) {
        return RECOVERY_SYSTEM_RESET;
    }
    
    return RECOVERY_NONE;
}

// Execute recovery strategy
void execute_recovery(recovery_level_t strategy) {
    switch (strategy) {
        case RECOVERY_RESTART_TASK:
            restart_failed_tasks();
            break;
        case RECOVERY_RESTART_APP:
            restart_application();
            break;
        case RECOVERY_SYSTEM_RESET:
            system_reset();
            break;
        default:
            break;
    }
}

Try it: Implement a multi-level recovery system that handles different types of faults appropriately.

Takeaways

  • Different faults require different recovery strategies
  • Implement graceful degradation before system reset
  • Log recovery actions for system analysis
  • Test recovery mechanisms thoroughly

🧪 Guided Labs

Lab 1: Hardware Watchdog Configuration and Testing

Objective: Configure a hardware watchdog and test system recovery behavior.

Steps:

  1. Configure hardware watchdog with appropriate timeout
  2. Implement basic health monitoring
  3. Test system recovery by intentionally causing faults
  4. Measure recovery time and verify system behavior

Expected Outcome: Understanding of hardware watchdog operation and configuration.

Lab 2: System Health Monitoring Implementation

Objective: Implement comprehensive system health monitoring for watchdog feeding.

Steps:

  1. Define critical system health metrics
  2. Implement health checking functions
  3. Integrate health monitoring with watchdog feeding
  4. Test health monitoring accuracy and reliability

Expected Outcome: Practical experience with system health monitoring and watchdog integration.

Lab 3: Multi-Level Recovery Strategy Implementation

Objective: Implement different recovery strategies based on fault severity.

Steps:

  1. Define fault categories and recovery levels
  2. Implement recovery strategy selection logic
  3. Create recovery execution functions
  4. Test recovery behavior under various fault conditions

Expected Outcome: Understanding of recovery strategy design and implementation.


Check Yourself

Basic Understanding

  • What is the difference between hardware and software watchdogs?
  • How do you determine the optimal watchdog timeout for your application?
  • What are the main components of a watchdog system?

Practical Application

  • How would you implement system health monitoring for watchdog feeding?
  • What considerations are important when choosing recovery strategies?
  • How do you test watchdog functionality during development?

Advanced Concepts

  • How do you handle watchdog timers during debugging and development?
  • What are the trade-offs between different recovery strategies?
  • How do you implement watchdogs for safety-critical applications?

🔗 Cross-links


🎯 Practical Considerations

System-Level Design Decisions

  • Watchdog Type: Choose between hardware and software watchdogs based on reliability requirements
  • Timeout Strategy: Balance recovery speed with system overhead
  • Health Monitoring: Define critical system functions that must be monitored

Development and Debugging

  • Development Mode: Implement longer timeouts or disable watchdogs during development
  • Health Logging: Comprehensive logging of system health and recovery actions
  • Testing Strategy: Test watchdog behavior under various fault conditions

Safety and Reliability

  • Safety Standards: Ensure watchdog implementation meets applicable safety requirements
  • Fault Injection: Test system behavior under controlled fault conditions
  • Recovery Validation: Verify system state after recovery actions

📚 Additional Resources

Books

  • "Making Embedded Systems" by Elecia White
  • "Programming Embedded Systems" by Michael Barr
  • "Real-Time Systems" by Jane W. S. Liu

Online Resources


Next Topic: Interrupts and ExceptionsPower Management