The reboot-limit workflow is designed to test system stability by performing continuous reboots and collecting boot performance metrics. This helps identify boot-related regressions, hardware issues, or kernel stability problems that only manifest after multiple boot cycles.
The primary goals of the reboot-limit workflow are:
- Stability Testing: Verify that a system can reboot reliably over many cycles
- Performance Analysis: Track boot time trends and identify performance regressions
- Hardware Validation: Detect hardware issues that only appear after thermal cycling
- Kernel Testing: Validate that kernel changes do not introduce boot-related regressions
Key features:
- Multiple reboot methods (Ansible, systemctl reboot, systemctl kexec)
- Boot performance tracking via systemd-analyze
- Configurable crash injection for resilience testing
- Loop testing for continuous operation until failure
- A/B testing support for comparing baseline vs development kernels
- Comprehensive visualization and analysis tools
Enable the reboot-limit workflow in your kdevops configuration:
make menuconfig
# Navigate to: Workflows → reboot-limit
# Enable the workflow and configure options
Configuration options:
- Reboot test type: Choose between the Ansible reboot module, systemctl reboot, or systemctl kexec
- Boot count: Number of reboots per test run (default: 100)
- Loop testing: Enable continuous testing until failure or steady state
- Data collection: Enable systemd-analyze data collection for performance tracking
- Crash injection: Force crashes at intervals to test recovery mechanisms
# Initial setup (creates /data/reboot-limit directory)
make reboot-limit
# Run baseline test (performs configured number of reboots)
make reboot-limit-baseline
# Run development test (for A/B testing)
make reboot-limit-dev-baseline
# Reset boot counters
make reboot-limit-baseline-reset
For continuous testing until failure or steady state:
# Run baseline in a loop
make reboot-limit-baseline-loop
# Run with kernel-of-the-day updates
make reboot-limit-baseline-kotd
The workflow collects two types of data for each host:
- Boot count (`reboot-count.txt`): Current boot number
- Boot timing (`systemctl-analyze.txt`): systemd-analyze output for each boot
Data is stored in:
- On nodes: `/data/reboot-limit/<hostname>/`
- Locally: `workflows/demos/reboot-limit/results/<hostname>/`
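To make the collected timing data concrete, here is a minimal Python sketch that parses the summary line printed by systemd-analyze ("Startup finished in ... = ...") into per-component seconds. It assumes `systemctl-analyze.txt` holds one such summary line per boot; that layout, and the helper names `to_seconds`, `parse_startup_line`, and `parse_log`, are illustrative assumptions and are not taken from the workflow's own tools.

```python
import re

# Seconds per unit for the time spans systemd prints (e.g. "1min 2.345s", "789ms").
_UNIT_SECONDS = {"h": 3600.0, "min": 60.0, "s": 1.0, "ms": 0.001, "us": 1e-6}
_TOKEN = re.compile(r"([\d.]+)(h|min|ms|us|s)\b")
# One or more time tokens followed by a component name in parentheses.
_COMPONENT = re.compile(r"((?:[\d.]+(?:h|min|ms|us|s)\s*)+)\((\w+)\)")


def to_seconds(span):
    """Convert a span such as '1min 2.5s' to seconds."""
    return sum(float(value) * _UNIT_SECONDS[unit]
               for value, unit in _TOKEN.findall(span))


def parse_startup_line(line):
    """Return {'kernel': ..., 'userspace': ..., 'total': ...} or None."""
    if "Startup finished in" not in line:
        return None
    times = {name: to_seconds(span) for span, name in _COMPONENT.findall(line)}
    parts = line.rsplit("=", 1)
    # Prefer the total printed by systemd; fall back to summing components.
    times["total"] = to_seconds(parts[1]) if len(parts) == 2 else sum(times.values())
    return times


def parse_log(text):
    """Parse every summary line in a log (assumed: one summary line per boot)."""
    boots = (parse_startup_line(line) for line in text.splitlines())
    return [boot for boot in boots if boot is not None]
```

Lines that are not startup summaries, such as the "graphical.target reached after ..." line, are skipped, so the raw command output can be fed to `parse_log` unchanged.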
After running tests, analyze the results to understand boot performance:
# Generate summary statistics
make reboot-limit-results
# Generate visualization graphs
make reboot-limit-graph
The analysis generates graphs showing:
- Boot Component Times: Stacked area chart showing kernel, initrd, and userspace times
- Total Boot Time Analysis: Line graph with statistical indicators (mean, median, standard deviation)
The visualization helps identify:
- Boot time trends over multiple reboots
- Performance spikes or anomalies
- Component-specific slowdowns (kernel vs userspace)
- Statistical variance in boot times
The analysis provides:
- Summary Statistics: Min, max, mean, median, standard deviation
- Component Breakdown: Average times for kernel, initrd, and userspace
- Visual Trends: Graphs showing performance over time
- Anomaly Detection: Spikes visible in the timeline
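The summary statistics listed above can be reproduced by hand from a list of total boot times. The sketch below uses Python's standard `statistics` module; the function name `summarize` and the choice of the sample standard deviation are assumptions, so the numbers may differ slightly from what `make reboot-limit-results` reports.

```python
import statistics


def summarize(totals):
    """Summarize total boot times (seconds) for one host."""
    return {
        "count": len(totals),
        "min": min(totals),
        "max": max(totals),
        "mean": statistics.fmean(totals),
        "median": statistics.median(totals),
        # Sample standard deviation needs at least two data points.
        "stdev": statistics.stdev(totals) if len(totals) > 1 else 0.0,
    }
```

For example, `summarize([9.0, 10.0, 11.0])` gives a mean and median of 10.0 and a standard deviation of 1.0.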
Test system resilience by forcing crashes at intervals:
make menuconfig
# Enable: Force a crash after certain period of reboots
# Set: After how many reboots should we force a crash
This helps validate:
- Crash recovery mechanisms
- Filesystem consistency after unexpected shutdowns
- Hardware error recovery
Compare baseline and development kernels:
- Configure for A/B testing with separate baseline/dev nodes
- Run tests on both node groups
- Compare results using the visualization tools
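Beyond the graphs, a numeric comparison can make an A/B result easier to act on. The following is a minimal sketch, assuming the total boot times for each node group have already been extracted into Python lists; `compare_runs` is a hypothetical helper, not part of the workflow.

```python
import statistics


def compare_runs(baseline, dev):
    """Compare mean total boot time (seconds) of dev against baseline."""
    base_mean = statistics.fmean(baseline)
    dev_mean = statistics.fmean(dev)
    delta = dev_mean - base_mean
    return {
        "baseline_mean": base_mean,
        "dev_mean": dev_mean,
        "delta_seconds": delta,
        # A positive percentage means the development kernel boots slower.
        "delta_percent": 100.0 * delta / base_mean,
    }
```

A difference smaller than the run-to-run variation of the baseline is probably noise, so it is worth checking the standard deviation of each group before drawing conclusions.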
The reboot-limit workflow can be integrated into continuous integration (CI) pipelines:
- Exit codes indicate test success/failure
- Results are machine-parseable for automation
- Loop testing can run until steady state is achieved
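Because exit codes signal success or failure, a CI job can chain the make targets and stop at the first failing step. The sketch below shows one way to do that from Python. The `STEPS` list reuses targets named in this document, while the wrapper itself (`run_step`, `run_pipeline`, `main`) is a hypothetical illustration and not something kdevops ships.

```python
import subprocess
import sys

# Targets taken from this document; adjust to the CI job at hand.
STEPS = [
    ["make", "reboot-limit-baseline"],
    ["make", "reboot-limit-results"],
]


def run_step(command):
    """Run one command and return its exit code without raising."""
    return subprocess.run(command, check=False).returncode


def run_pipeline(steps):
    """Run steps in order; return the first non-zero exit code, else 0."""
    for command in steps:
        code = run_step(command)
        if code != 0:
            return code
    return 0


def main():
    """Entry point for a CI job: propagate the pipeline's exit code."""
    sys.exit(run_pipeline(STEPS))
```

Calling `main()` from the CI job propagates the failing step's exit code, which most CI systems treat as a failed build.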
Common issues:
- DHCP Timeout: If reboots fail due to network issues
  - Check DHCP server configuration
  - Increase Ansible timeout values
- Systemd-analyze Failures: If timing data isn't collected
  - Ensure systemd-analyze is available on target systems
  - Check that boot has completed before collection
- Storage Issues: If data collection fails
  - Verify the `/data` partition has sufficient space
  - Check file permissions on target nodes
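For the storage check, free space can be queried with Python's `shutil.disk_usage`. This is a minimal sketch meant to be run on a target node; the helper names and the 1 GiB threshold in the usage note are arbitrary assumptions, not values required by the workflow.

```python
import shutil


def free_bytes(path):
    """Return the free space, in bytes, of the filesystem holding path."""
    return shutil.disk_usage(path).free


def has_room(path, required_bytes):
    """True if the filesystem holding path has at least required_bytes free."""
    return free_bytes(path) >= required_bytes
```

For example, `has_room("/data", 1024 ** 3)` reports whether at least 1 GiB is free on the filesystem that holds `/data`.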
Enable verbose output for troubleshooting:
# Ansible verbose mode
ANSIBLE_VERBOSITY=3 make reboot-limit-baseline
# Check individual node status
ansible baseline:dev:service -m shell -a "cat /data/reboot-limit/*/reboot-count.txt"
Normal patterns:
- Consistent boot times with minor variations (±10%)
- Occasional spikes (1-2% of boots) due to system maintenance
- Gradual improvement over first few boots (cache warming)
Warning signs:
- Increasing trend in boot times
- Large standard deviation (>20% of mean)
- Component-specific degradation
- Frequent outliers or spikes
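These rules of thumb can be turned into a rough automated check. The sketch below flags a large standard deviation (above 20% of the mean), frequent spikes (more than 2% of boots outside ±10% of the median), and an increasing trend. The thresholds mirror the figures quoted above, but the trend rule (the least-squares fit drifts by more than 10% of the mean over the run) and all names are assumptions made for illustration.

```python
import statistics


def slope(values):
    """Least-squares slope of values against their boot index."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = statistics.fmean(values)
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den if den else 0.0


def warning_signs(totals, cv_limit=0.20, spike_band=0.10, spike_share=0.02):
    """Return a list of warning labels for a series of total boot times."""
    mean = statistics.fmean(totals)
    median = statistics.median(totals)
    stdev = statistics.stdev(totals) if len(totals) > 1 else 0.0
    # A spike is a boot whose total deviates from the median by more than the band.
    spikes = [v for v in totals if abs(v - median) > spike_band * median]
    signs = []
    if mean and stdev / mean > cv_limit:
        signs.append("large standard deviation")
    if len(spikes) / len(totals) > spike_share:
        signs.append("frequent spikes")
    # Fitted drift from the first to the last boot, compared to the mean.
    if slope(totals) * (len(totals) - 1) > spike_band * mean:
        signs.append("increasing trend")
    return signs
```

A steady series with a single outlier in 100 boots returns an empty list, while a series that climbs steadily from 10 s to about 15 s is flagged with "increasing trend".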
Best practices:
- Baseline First: Always establish a baseline before testing changes
- Multiple Runs: Use loop testing for statistical significance
- Monitor Resources: Check disk space before long runs
- Save Results: Archive results for historical comparison
- Regular Analysis: Review trends periodically
The reboot-limit workflow consists of:
- Ansible Playbook (`playbooks/reboot-limit.yml`): Orchestrates the test
- Ansible Role (`playbooks/roles/reboot-limit/`): Implements test logic
- Analysis Tools (`scripts/workflows/demos/reboot-limit/`): Process results
- Configuration (`workflows/demos/reboot-limit/Kconfig`): User options
Possible enhancements to the reboot-limit workflow include:
- Additional metrics collection (e.g., dmesg timing, service startup)
- More reboot methods (e.g., IPMI, hardware reset)
- Enhanced visualization options
- Integration with other workflows
See the contributing guide for more details.
