The reboot-limit workflow

The reboot-limit workflow tests system stability by rebooting a system repeatedly and collecting boot performance metrics on each cycle. This helps identify boot-related regressions, hardware issues, and kernel stability problems that only manifest after many boot cycles.

Purpose

The primary goals of the reboot-limit workflow are:

Stability Testing: Verify that a system can reboot reliably over many cycles
Performance Analysis: Track boot time trends and identify performance regressions
Hardware Validation: Detect hardware issues that only appear after thermal cycling
Kernel Testing: Validate that kernel changes do not introduce boot-related regressions

Features

Multiple reboot methods (Ansible, systemctl reboot, systemctl kexec)
Boot performance tracking via systemd-analyze
Configurable crash injection for resilience testing
Loop testing for continuous operation until failure
A/B testing support for comparing baseline vs development kernels
Comprehensive visualization and analysis tools

Configuration

Enable the reboot-limit workflow in your kdevops configuration:

make menuconfig
# Navigate to: Workflows → reboot-limit
# Enable the workflow and configure options

Key Configuration Options

Reboot test type: Choose between Ansible module, systemctl reboot, or systemctl kexec
Boot count: Number of reboots per test run (default: 100)
Loop testing: Enable continuous testing until failure or steady state
Data collection: Enable systemd-analyze data collection for performance tracking
Crash injection: Force crashes at intervals to test recovery mechanisms

Running the Tests

Basic Commands

# Initial setup (creates /data/reboot-limit directory)
make reboot-limit

# Run baseline test (performs configured number of reboots)
make reboot-limit-baseline

# Run development test (for A/B testing)
make reboot-limit-dev-baseline

# Reset boot counters
make reboot-limit-baseline-reset

Loop Testing

For continuous testing until failure or steady state:

# Run baseline in a loop
make reboot-limit-baseline-loop

# Run with kernel-of-the-day updates
make reboot-limit-baseline-kotd

Data Collection

The workflow collects two types of data for each host:

Boot count (reboot-count.txt): Current boot number
Boot timing (systemctl-analyze.txt): systemd-analyze output for each boot
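
As an illustration of how the boot timing data can be consumed, here is a minimal sketch that parses one `Startup finished in ...` line as printed by systemd-analyze. The function names `to_seconds` and `parse_startup_line` are hypothetical and not part of kdevops, and the sketch assumes only the `min`, `s`, and `ms` units appear in the output. The total is recomputed from the components, so it can differ from the printed total by a rounding step.

```python
import re

# One "<duration> (<component>)" group, e.g. "2.512s (kernel)" or
# "1min 2.345s (userspace)".
_COMPONENT = re.compile(r"((?:\d+(?:\.\d+)?(?:min|ms|s)\s*)+)\((\w+)\)")
_UNIT_SECONDS = {"min": 60.0, "s": 1.0, "ms": 0.001}


def to_seconds(duration):
    """Convert a duration such as '1min 2.345s' or '812ms' to seconds."""
    total = 0.0
    for value, unit in re.findall(r"(\d+(?:\.\d+)?)(min|ms|s)", duration):
        total += float(value) * _UNIT_SECONDS[unit]
    return total


def parse_startup_line(line):
    """Return per-component boot times in seconds, plus their sum as 'total'."""
    components = {name: to_seconds(dur) for dur, name in _COMPONENT.findall(line)}
    components["total"] = sum(components.values())
    return components
```

Calling `parse_startup_line` on each line of the collected file yields one dictionary per boot, keyed by component name (for example `kernel`, `initrd`, `userspace`).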

Data is stored in:

On nodes: /data/reboot-limit/<hostname>/
Locally: workflows/demos/reboot-limit/results/<hostname>/
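
The per-host layout above lends itself to a small helper that gathers the current boot count for every host. This is a sketch only: `collect_boot_counts` is a hypothetical name, and it assumes each reboot-count.txt holds a single integer, which the source does not state explicitly.

```python
from pathlib import Path


def collect_boot_counts(results_dir):
    """Map each hostname directory under results_dir to its recorded boot count."""
    counts = {}
    for count_file in sorted(Path(results_dir).glob("*/reboot-count.txt")):
        text = count_file.read_text().strip()
        # Assumption: the file contains one non-negative integer.
        if text.isdigit():
            counts[count_file.parent.name] = int(text)
    return counts
```

For example, `collect_boot_counts("workflows/demos/reboot-limit/results")` would return a dictionary keyed by hostname for the local results directory.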

Analyzing Results

After running tests, analyze the results to understand boot performance:

# Generate summary statistics
make reboot-limit-results

# Generate visualization graphs
make reboot-limit-graph

Visualization Output

The analysis generates graphs showing:

Boot Component Times: Stacked area chart showing kernel, initrd, and userspace times
Total Boot Time Analysis: Line graph with statistical indicators (mean, median, standard deviation)

The visualization helps identify:

Boot time trends over multiple reboots
Performance spikes or anomalies
Component-specific slowdowns (kernel vs userspace)
Statistical variance in boot times

Understanding the Results

The analysis provides:

Summary Statistics: Min, max, mean, median, standard deviation
Component Breakdown: Average times for kernel, initrd, and userspace
Visual Trends: Graphs showing performance over time
Anomaly Detection: Spikes visible in the timeline
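
To make the summary statistics and component breakdown concrete, the following sketch computes them with Python's standard `statistics` module. It is not the analysis code shipped with the workflow; `summarize` and `component_averages` are hypothetical names, and the inputs are assumed to be boot times in seconds.

```python
import statistics


def summarize(boot_times):
    """Return min, max, mean, median, and sample standard deviation."""
    if not boot_times:
        raise ValueError("need at least one boot time")
    return {
        "min": min(boot_times),
        "max": max(boot_times),
        "mean": statistics.fmean(boot_times),
        "median": statistics.median(boot_times),
        # The sample standard deviation needs at least two data points.
        "stdev": statistics.stdev(boot_times) if len(boot_times) > 1 else 0.0,
    }


def component_averages(boots):
    """Average each component over a list of per-boot dictionaries."""
    names = sorted({name for boot in boots for name in boot})
    return {
        name: statistics.fmean([boot[name] for boot in boots if name in boot])
        for name in names
    }
```

`summarize` takes a flat list of total boot times, while `component_averages` takes one dictionary per boot, such as `{"kernel": 2.0, "userspace": 10.0}`.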

Advanced Usage

Crash Injection Testing

Test system resilience by forcing crashes at intervals:

make menuconfig
# Enable: Force a crash after certain period of reboots
# Set: After how many reboots should we force a crash

This helps validate:

Crash recovery mechanisms
Filesystem consistency after unexpected shutdowns
Hardware error recovery
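
One plausible reading of "force a crash after a certain period of reboots" is a check against the boot counter on every cycle. The sketch below illustrates that reading only; `should_force_crash` and its parameters are hypothetical, and the actual logic in the Ansible role may differ.

```python
def should_force_crash(boot_count, crash_interval):
    """Return True when this boot falls on a crash-injection boundary.

    Assumption: a crash is injected every `crash_interval` reboots, and an
    interval of zero or less disables injection.
    """
    if crash_interval <= 0:
        return False
    return boot_count > 0 and boot_count % crash_interval == 0
```

With an interval of 5, this would select boots 5, 10, 15, and so on.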

A/B Testing

Compare baseline and development kernels:

Configure for A/B testing with separate baseline/dev nodes
Run tests on both node groups
Compare results using the visualization tools
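
Beyond the graphs, a numeric comparison can make the baseline versus development difference explicit. This is a minimal sketch assuming two lists of total boot times in seconds, one per node group; `compare_ab` is a hypothetical helper, not part of the workflow.

```python
import statistics


def compare_ab(baseline_times, dev_times):
    """Compare mean boot time of the dev group against the baseline group."""
    base_mean = statistics.fmean(baseline_times)
    dev_mean = statistics.fmean(dev_times)
    delta = dev_mean - base_mean
    return {
        "baseline_mean": base_mean,
        "dev_mean": dev_mean,
        "delta_seconds": delta,
        # A positive percentage means the development kernel boots slower.
        "delta_percent": delta / base_mean * 100.0,
    }
```

A difference in means alone does not show significance; it should be read alongside the variance of each group.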

Integration with CI/CD

The reboot-limit workflow can be driven from continuous integration systems:

Exit codes indicate test success/failure
Results are machine-parseable for automation
Loop testing can run until steady state is achieved
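
As a sketch of how a CI job might act on those exit codes, the wrapper below runs a list of make targets in order and stops at the first failure. It assumes that `make` returns a non-zero exit code when a target fails; `run_make_target` and `ci_gate` are hypothetical names, and the `make_cmd` parameter exists only so the command can be swapped out.

```python
import subprocess
import sys


def run_make_target(target, make_cmd=("make",)):
    """Run one make target and return its exit code."""
    completed = subprocess.run([*make_cmd, target], check=False)
    return completed.returncode


def ci_gate(targets, make_cmd=("make",)):
    """Run targets in order; return the first non-zero exit code, else 0."""
    for target in targets:
        code = run_make_target(target, make_cmd)
        if code != 0:
            print(f"{target} failed with exit code {code}", file=sys.stderr)
            return code
    return 0
```

A CI step could then end with `sys.exit(ci_gate(["reboot-limit-baseline", "reboot-limit-results"]))` so that the job status mirrors the test outcome.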

Troubleshooting

Common Issues

DHCP Timeout: Reboots fail because the node does not regain network connectivity in time
- Check DHCP server configuration
- Increase Ansible timeout values
Systemd-analyze Failures: Timing data is not collected
- Ensure systemd-analyze is available on target systems
- Check that boot has completed before collection
Storage Issues: Data collection fails
- Verify that the /data partition has sufficient space
- Check file permissions on target nodes
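
For the storage case, a quick free-space check can be scripted with the standard library. This is a sketch meant to run on the node itself; `has_free_space` is a hypothetical helper and the threshold is an arbitrary value chosen by the caller, not one defined by the workflow.

```python
import shutil


def has_free_space(path, min_free_bytes):
    """Return True if the filesystem holding `path` has enough free bytes."""
    return shutil.disk_usage(path).free >= min_free_bytes
```

For example, `has_free_space("/data", 1024 ** 3)` checks for at least 1 GiB free on the filesystem that holds /data.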

Debug Options

Enable verbose output for troubleshooting:

# Ansible verbose mode
ANSIBLE_VERBOSITY=3 make reboot-limit-baseline

# Check individual node status
ansible baseline:dev:service -m shell -a "cat /data/reboot-limit/*/reboot-count.txt"

Interpreting Performance Data

Normal Patterns

Consistent boot times with minor variations (±10%)
Occasional spikes (1-2% of boots) due to system maintenance
Gradual improvement over first few boots (cache warming)

Warning Signs

Increasing trend in boot times
Large standard deviation (>20% of mean)
Component-specific degradation
Frequent outliers or spikes
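
The warning signs above can be approximated programmatically. The sketch below applies the thresholds mentioned in this section (standard deviation above 20% of the mean, boots more than 10% away from the median, outliers on more than 2% of boots) and adds a least-squares slope for the trend. The function names and the `min_slope` default are assumptions for illustration, not values used by the workflow.

```python
import statistics


def least_squares_slope(values):
    """Slope of the best-fit line through (boot index, boot time) points."""
    n = len(values)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = statistics.fmean(values)
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def warning_signs(boot_times, min_slope=0.01):
    """Return a list of warning labels for a series of boot times in seconds."""
    if len(boot_times) < 2:
        return []
    mean = statistics.fmean(boot_times)
    median = statistics.median(boot_times)
    signs = []
    if statistics.stdev(boot_times) > 0.20 * mean:
        signs.append("high-variance")
    outliers = [t for t in boot_times if abs(t - median) > 0.10 * median]
    if len(outliers) / len(boot_times) > 0.02:
        signs.append("frequent-outliers")
    # min_slope is in seconds per boot; the default is an arbitrary assumption.
    if least_squares_slope(boot_times) > min_slope:
        signs.append("increasing-trend")
    return signs
```

An empty result suggests the series matches the normal patterns described earlier; any label is a prompt to inspect the graphs, not a verdict.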

Best Practices

Baseline First: Always establish a baseline before testing changes
Multiple Runs: Use loop testing to gather enough samples for statistical significance
Monitor Resources: Check disk space before long runs
Save Results: Archive results for historical comparison
Regular Analysis: Review trends periodically

Workflow Architecture

The reboot-limit workflow consists of:

Ansible Playbook (playbooks/reboot-limit.yml): Orchestrates the test
Ansible Role (playbooks/roles/reboot-limit/): Implements test logic
Analysis Tools (scripts/workflows/demos/reboot-limit/): Process results
Configuration (workflows/demos/reboot-limit/Kconfig): User options

Contributing

Ideas for enhancing the reboot-limit workflow include:

Additional metrics collection (e.g., dmesg timing, service startup)
More reboot methods (e.g., IPMI, hardware reset)
Enhanced visualization options
Integration with other workflows

See the contributing guide for more details.
