fix: deflake //rs/tests/message_routing:global_reboot_test_head_nns by basvandijk · Pull Request #10575 · dfinity/ic

basvandijk · 2026-06-26T10:14:11Z

Summary

Deflakes //rs/tests/message_routing:global_reboot_test_head_nns.

The test's own assertions are not at fault. In the analyzed flaky runs the
setup, test and assert_no_metrics_errors tasks all passed (canisters made
progress before/after the reboot, no metric errors). The failure came solely
from the post-test assert_no_unallowed_log_patterns check, which matched
the pattern panicked once in the IC node journald logs:

thread 'main' panicked at rs/replica/src/setup.rs:278:75:
called `Result::unwrap()` on an `Err` value:
  "Crypto state directory /var/lib/ic/crypto has permissions 0o40755, allowing general access"

Root cause — a reboot/shutdown race

Step 4 of the test reboots every node via n.vm().reboot(). On the panicking
node the journald timeline shows:

The node starts shutting down for the reboot and the running replica exits.
The orchestrator's supervision loop immediately restarts the replica at
the very same moment the orchestrator itself is being shut down.
Concurrently the dedicated crypto volume store-shared--crypto (dm-3) is
being unmounted from /var/lib/ic/crypto. The SELinux audit line at the
instant of the panic shows getattr … path="/var/lib/ic/crypto" dev="dm-1",
i.e. the path now resolves to the bare mount-point directory on the root
fs (dm-1), not the crypto volume.
That mount-point has default 0o755 perms (world r-x), whereas the crypto
volume is normally 0o700. The freshly-started replica's
setup_crypto_provider calls
CryptoConfig::check_dir_has_required_permissions (rs/config/src/crypto.rs),
which rejects any world-accessible bits, and the .unwrap() at
rs/replica/src/setup.rs:278 panics.
The node then completes the reboot, remounts crypto with 0o700, the replica
starts normally and the canisters resume — so the test's assertions pass.

The panic is therefore benign (a transient replica that dies during the
reboot teardown and recovers). The only reason the test fails is the log-pattern
check matching panicked.

Both flaky runs analyzed (2026-06-24 and 2026-06-25) had this identical cause,
and in both the retry attempt passed.

Fix

Exempt this benign panic from the unallowed-log-pattern check on the
SystemTestGroup:

.add_unallowed_log_pattern_except("panicked", "rs/replica/src/setup.rs")

The exclusion phrase must occur in the same journald MESSAGE line as
panicked; here that line is thread 'main' panicked at rs/replica/src/setup.rs:278:75:, so rs/replica/src/setup.rs is a valid
exclusion. This mirrors the existing precedent in
rs/tests/consensus/replica_determinism_test.rs.

Verification

cargo clippy on message-routing-system-tests: clean.
bazel build //rs/tests/message_routing:global_reboot_test_head_nns: OK.
bazel test --runs_per_test=3 --jobs=3 //rs/tests/message_routing:global_reboot_test_head_nns: 3/3 PASSED (avg 158s).

This PR was created following the steps in .claude/skills/fix-flaky-tests/SKILL.md.

The test's own assertions (setup, test, assert_no_metrics_errors) pass; the flake came solely from the post-test `assert_no_unallowed_log_patterns` check matching the pattern `panicked` once in the IC node journald logs: thread 'main' panicked at rs/replica/src/setup.rs:278:75: "Crypto state directory /var/lib/ic/crypto has permissions 0o40755, allowing general access" Root cause is a reboot/shutdown race triggered by the test's `n.vm().reboot()` step: the orchestrator restarts the replica at the same instant the node is shutting down, while the dedicated `/var/lib/ic/crypto` volume (store-shared--crypto) is being unmounted. The path then resolves to the bare mount point with default `0o755` perms, so the replica's `CryptoConfig::check_dir_has_required_permissions` rejects the world-accessible bits and the `.unwrap()` in `setup_crypto_provider` panics. The node then completes the reboot and recovers, so this panic is benign. Exempt this benign panic from the unallowed-log-pattern check via `add_unallowed_log_pattern_except("panicked", "rs/replica/src/setup.rs")`, mirroring the existing precedent in rs/tests/consensus/replica_determinism_test.rs. Created following the steps in .claude/skills/fix-flaky-tests/SKILL.md.

Copilot

Pull request overview

This PR deflakes the //rs/tests/message_routing:global_reboot_test_head_nns system test by preventing a known benign replica panic during node reboot teardown from failing the post-test unallowed-log-pattern check.

Changes:

Adds an add_unallowed_log_pattern_except("panicked", "rs/replica/src/setup.rs") exemption to ignore the specific reboot-related panic location.
Documents the reboot/shutdown race and why the resulting panic is benign in an inline comment near the test group configuration.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

mraszyk · 2026-06-26T11:11:48Z

+        // When this test reboots a node, the orchestrator can restart the replica at the
+        // same instant the node is shutting down, while the dedicated `/var/lib/ic/crypto`


This sounds generic enough to concern any system test: why can't this happen in any system test when the nodes are shutting down at the end of the test?

Yeah good point. We could add it by default. But that could hide panics in that module which we shouldn't ignore. Should we move that panic to a dedicated module so we can only ignore that specific panic?

Could we make the system test driver kill the orchestrator before it triggers VM shut-down? This should avoid any such issues by design.

github-actions Bot added the fix label Jun 26, 2026

basvandijk requested a review from Copilot June 26, 2026 10:14

Copilot started reviewing on behalf of basvandijk June 26, 2026 10:14 View session

Copilot AI reviewed Jun 26, 2026

View reviewed changes

basvandijk marked this pull request as ready for review June 26, 2026 10:17

basvandijk requested a review from a team as a code owner June 26, 2026 10:17

github-actions Bot added the @core-protocol label Jun 26, 2026

mraszyk reviewed Jun 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: deflake //rs/tests/message_routing:global_reboot_test_head_nns#10575

fix: deflake //rs/tests/message_routing:global_reboot_test_head_nns#10575
basvandijk wants to merge 1 commit into
masterfrom
ai/deflake-global_reboot_test-2026-06-26

basvandijk commented Jun 26, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

mraszyk Jun 26, 2026

Uh oh!

basvandijk Jun 26, 2026

Uh oh!

mraszyk Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		// When this test reboots a node, the orchestrator can restart the replica at the
		// same instant the node is shutting down, while the dedicated `/var/lib/ic/crypto`

Uh oh!

Conversation

basvandijk commented Jun 26, 2026

Summary

Root cause — a reboot/shutdown race

Fix

Verification

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

mraszyk Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

basvandijk Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

mraszyk Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants