Skip to content

fix: deflake //rs/tests/message_routing:global_reboot_test_head_nns#10575

Open
basvandijk wants to merge 1 commit into
masterfrom
ai/deflake-global_reboot_test-2026-06-26
Open

fix: deflake //rs/tests/message_routing:global_reboot_test_head_nns#10575
basvandijk wants to merge 1 commit into
masterfrom
ai/deflake-global_reboot_test-2026-06-26

Conversation

@basvandijk

Copy link
Copy Markdown
Collaborator

Summary

Deflakes //rs/tests/message_routing:global_reboot_test_head_nns.

The test's own assertions are not at fault. In the analyzed flaky runs the
setup, test and assert_no_metrics_errors tasks all passed (canisters made
progress before/after the reboot, no metric errors). The failure came solely
from the post-test assert_no_unallowed_log_patterns check, which matched
the pattern panicked once in the IC node journald logs:

thread 'main' panicked at rs/replica/src/setup.rs:278:75:
called `Result::unwrap()` on an `Err` value:
  "Crypto state directory /var/lib/ic/crypto has permissions 0o40755, allowing general access"

Root cause — a reboot/shutdown race

Step 4 of the test reboots every node via n.vm().reboot(). On the panicking
node the journald timeline shows:

  1. The node starts shutting down for the reboot and the running replica exits.
  2. The orchestrator's supervision loop immediately restarts the replica at
    the very same moment the orchestrator itself is being shut down.
  3. Concurrently the dedicated crypto volume store-shared--crypto (dm-3) is
    being unmounted from /var/lib/ic/crypto. The SELinux audit line at the
    instant of the panic shows getattr … path="/var/lib/ic/crypto" dev="dm-1",
    i.e. the path now resolves to the bare mount-point directory on the root
    fs (dm-1)
    , not the crypto volume.
  4. That mount-point has default 0o755 perms (world r-x), whereas the crypto
    volume is normally 0o700. The freshly-started replica's
    setup_crypto_provider calls
    CryptoConfig::check_dir_has_required_permissions (rs/config/src/crypto.rs),
    which rejects any world-accessible bits, and the .unwrap() at
    rs/replica/src/setup.rs:278 panics.
  5. The node then completes the reboot, remounts crypto with 0o700, the replica
    starts normally and the canisters resume — so the test's assertions pass.

The panic is therefore benign (a transient replica that dies during the
reboot teardown and recovers). The only reason the test fails is the log-pattern
check matching panicked.

Both flaky runs analyzed (2026-06-24 and 2026-06-25) had this identical cause,
and in both the retry attempt passed.

Fix

Exempt this benign panic from the unallowed-log-pattern check on the
SystemTestGroup:

.add_unallowed_log_pattern_except("panicked", "rs/replica/src/setup.rs")

The exclusion phrase must occur in the same journald MESSAGE line as
panicked; here that line is thread 'main' panicked at rs/replica/src/setup.rs:278:75:, so rs/replica/src/setup.rs is a valid
exclusion. This mirrors the existing precedent in
rs/tests/consensus/replica_determinism_test.rs.

Verification

  • cargo clippy on message-routing-system-tests: clean.
  • bazel build //rs/tests/message_routing:global_reboot_test_head_nns: OK.
  • bazel test --runs_per_test=3 --jobs=3 //rs/tests/message_routing:global_reboot_test_head_nns: 3/3 PASSED (avg 158s).

This PR was created following the steps in .claude/skills/fix-flaky-tests/SKILL.md.

The test's own assertions (setup, test, assert_no_metrics_errors) pass; the
flake came solely from the post-test `assert_no_unallowed_log_patterns` check
matching the pattern `panicked` once in the IC node journald logs:

    thread 'main' panicked at rs/replica/src/setup.rs:278:75:
    "Crypto state directory /var/lib/ic/crypto has permissions 0o40755,
     allowing general access"

Root cause is a reboot/shutdown race triggered by the test's `n.vm().reboot()`
step: the orchestrator restarts the replica at the same instant the node is
shutting down, while the dedicated `/var/lib/ic/crypto` volume
(store-shared--crypto) is being unmounted. The path then resolves to the bare
mount point with default `0o755` perms, so the replica's
`CryptoConfig::check_dir_has_required_permissions` rejects the world-accessible
bits and the `.unwrap()` in `setup_crypto_provider` panics. The node then
completes the reboot and recovers, so this panic is benign.

Exempt this benign panic from the unallowed-log-pattern check via
`add_unallowed_log_pattern_except("panicked", "rs/replica/src/setup.rs")`,
mirroring the existing precedent in
rs/tests/consensus/replica_determinism_test.rs.

Created following the steps in .claude/skills/fix-flaky-tests/SKILL.md.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR deflakes the //rs/tests/message_routing:global_reboot_test_head_nns system test by preventing a known benign replica panic during node reboot teardown from failing the post-test unallowed-log-pattern check.

Changes:

  • Adds an add_unallowed_log_pattern_except("panicked", "rs/replica/src/setup.rs") exemption to ignore the specific reboot-related panic location.
  • Documents the reboot/shutdown race and why the resulting panic is benign in an inline comment near the test group configuration.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@basvandijk basvandijk marked this pull request as ready for review June 26, 2026 10:17
@basvandijk basvandijk requested a review from a team as a code owner June 26, 2026 10:17
Comment on lines +60 to +61
// When this test reboots a node, the orchestrator can restart the replica at the
// same instant the node is shutting down, while the dedicated `/var/lib/ic/crypto`

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds generic enough to concern any system test: why can't this happen in any system test when the nodes are shutting down at the end of the test?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah good point. We could add it by default. But that could hide panics in that module which we shouldn't ignore. Should we move that panic to a dedicated module so we can only ignore that specific panic?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we make the system test driver kill the orchestrator before it triggers VM shut-down? This should avoid any such issues by design.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants