fix: deflake //rs/tests/message_routing:global_reboot_test_head_nns#10575
fix: deflake //rs/tests/message_routing:global_reboot_test_head_nns#10575basvandijk wants to merge 1 commit into
Conversation
The test's own assertions (setup, test, assert_no_metrics_errors) pass; the
flake came solely from the post-test `assert_no_unallowed_log_patterns` check
matching the pattern `panicked` once in the IC node journald logs:
thread 'main' panicked at rs/replica/src/setup.rs:278:75:
"Crypto state directory /var/lib/ic/crypto has permissions 0o40755,
allowing general access"
Root cause is a reboot/shutdown race triggered by the test's `n.vm().reboot()`
step: the orchestrator restarts the replica at the same instant the node is
shutting down, while the dedicated `/var/lib/ic/crypto` volume
(store-shared--crypto) is being unmounted. The path then resolves to the bare
mount point with default `0o755` perms, so the replica's
`CryptoConfig::check_dir_has_required_permissions` rejects the world-accessible
bits and the `.unwrap()` in `setup_crypto_provider` panics. The node then
completes the reboot and recovers, so this panic is benign.
Exempt this benign panic from the unallowed-log-pattern check via
`add_unallowed_log_pattern_except("panicked", "rs/replica/src/setup.rs")`,
mirroring the existing precedent in
rs/tests/consensus/replica_determinism_test.rs.
Created following the steps in .claude/skills/fix-flaky-tests/SKILL.md.
There was a problem hiding this comment.
Pull request overview
This PR deflakes the //rs/tests/message_routing:global_reboot_test_head_nns system test by preventing a known benign replica panic during node reboot teardown from failing the post-test unallowed-log-pattern check.
Changes:
- Adds an
add_unallowed_log_pattern_except("panicked", "rs/replica/src/setup.rs")exemption to ignore the specific reboot-related panic location. - Documents the reboot/shutdown race and why the resulting panic is benign in an inline comment near the test group configuration.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // When this test reboots a node, the orchestrator can restart the replica at the | ||
| // same instant the node is shutting down, while the dedicated `/var/lib/ic/crypto` |
There was a problem hiding this comment.
This sounds generic enough to concern any system test: why can't this happen in any system test when the nodes are shutting down at the end of the test?
There was a problem hiding this comment.
Yeah good point. We could add it by default. But that could hide panics in that module which we shouldn't ignore. Should we move that panic to a dedicated module so we can only ignore that specific panic?
There was a problem hiding this comment.
Could we make the system test driver kill the orchestrator before it triggers VM shut-down? This should avoid any such issues by design.
Summary
Deflakes
//rs/tests/message_routing:global_reboot_test_head_nns.The test's own assertions are not at fault. In the analyzed flaky runs the
setup,testandassert_no_metrics_errorstasks all passed (canisters madeprogress before/after the reboot, no metric errors). The failure came solely
from the post-test
assert_no_unallowed_log_patternscheck, which matchedthe pattern
panickedonce in the IC node journald logs:Root cause — a reboot/shutdown race
Step 4 of the test reboots every node via
n.vm().reboot(). On the panickingnode the journald timeline shows:
the very same moment the orchestrator itself is being shut down.
store-shared--crypto(dm-3) isbeing unmounted from
/var/lib/ic/crypto. The SELinux audit line at theinstant of the panic shows
getattr … path="/var/lib/ic/crypto" dev="dm-1",i.e. the path now resolves to the bare mount-point directory on the root
fs (
dm-1), not the crypto volume.0o755perms (worldr-x), whereas the cryptovolume is normally
0o700. The freshly-started replica'ssetup_crypto_providercallsCryptoConfig::check_dir_has_required_permissions(rs/config/src/crypto.rs),which rejects any world-accessible bits, and the
.unwrap()atrs/replica/src/setup.rs:278 panics.
0o700, the replicastarts normally and the canisters resume — so the test's assertions pass.
The panic is therefore benign (a transient replica that dies during the
reboot teardown and recovers). The only reason the test fails is the log-pattern
check matching
panicked.Both flaky runs analyzed (2026-06-24 and 2026-06-25) had this identical cause,
and in both the retry attempt passed.
Fix
Exempt this benign panic from the unallowed-log-pattern check on the
SystemTestGroup:The exclusion phrase must occur in the same journald
MESSAGEline aspanicked; here that line isthread 'main' panicked at rs/replica/src/setup.rs:278:75:, sors/replica/src/setup.rsis a validexclusion. This mirrors the existing precedent in
rs/tests/consensus/replica_determinism_test.rs.Verification
cargo clippyonmessage-routing-system-tests: clean.bazel build //rs/tests/message_routing:global_reboot_test_head_nns: OK.bazel test --runs_per_test=3 --jobs=3 //rs/tests/message_routing:global_reboot_test_head_nns: 3/3 PASSED (avg 158s).This PR was created following the steps in
.claude/skills/fix-flaky-tests/SKILL.md.