Skip to content

CNTR-1-3: fix gNOI cache staleness after cold reboot and supervisor failover#5509

Open
pjacakArista wants to merge 3 commits into
openconfig:mainfrom
pjacakArista:arista-CNTR-1-3-fix-gnoi-cache-failover
Open

CNTR-1-3: fix gNOI cache staleness after cold reboot and supervisor failover#5509
pjacakArista wants to merge 3 commits into
openconfig:mainfrom
pjacakArista:arista-CNTR-1-3-fix-gnoi-cache-failover

Conversation

@pjacakArista
Copy link
Copy Markdown

This PR depends on PR #5507 which fixes the shared containerztest infrastructure and CNTR-2. This PR adds the cold-reboot and supervisor failover fixes needed for CNTR-1.8 and all of CNTR-3 for Arista's EOS.

Root causes fixed:

  • After a cold reboot (CNTR-1.8) or supervisor switchover (CNTR-3), Ondatra's gNOI connection cache holds the pre-reboot client. Any gNOI RPC through that cached client returns Unavailable or hangs on a half-open TCP connection, causing every post-reboot container operation to fail. Fixed by calling BindingDUT().DialGNOI() after each reboot/switchover to get fresh clients, bypassing the stale cache entirely.
  • CNTR-1.8 (TestContainerPersistenceAfterColdReboot) additionally failed because the containerz config was not saved to startup-config before the reboot, so the containerz gNOI service was not re-enabled when the device came back up. Fixed by calling SaveConfig (write memory via gNMI CLI) before triggering the reboot.

New helpers in containerztest:

  • SaveConfig - saves running-config to startup-config via gNMI CLI, ensuring containerz service config survives a reboot.
  • ClientWithoutConfig - returns a containerz client from caller-supplied fresh gNOI clients without re-pushing config. Used after reboot/switchover when config is already saved to startup-config.
  • WaitForReboot - polls DialGNMI with a short per-call timeout to detect the reboot down/up cycle and returns fresh gNOI clients via BindingDUT().DialGNOI() once the device recovers.

Deviations: Added three timeout deviations so Arista hardware, that needs longer Docker and Octa settle times after a switchover, can be tuned without hardcoding: containerz_post_switchover_verify_timeout_s, containerz_switchover_ready_timeout_s, containerz_switchover_retry_timeout_s.

With those changes CNTR-1.8 ( TestContainerPersistenceAfterColdReboot ) and all of CNTR-3 passes on Arista's EOS. Combined with PR #5507, all of CNTR-1, CNTR-2, and CNTR-3 pass on Arista's EOS.

The three CNTR tests share containerztest.Client(), which fails on Arista because the binding rejects dut.Config().New().WithAristaText(...).Append(t) (PushConfig is unimplemented). As a result the containerz gNOI service is never configured and the management namespace firewall blocks container port 60061. CNTR-2 additionally lacks the containerz_oc_unsupported deviation, so the gNOI service enablement block is never entered. A credential conflict in binding.DialGRPCWithPort (per-RPC credentials combined with a caller-supplied insecure transport) prevents CNTR-2 from reaching the container even when the port is open.

Replace the Append call with helpers.GnmiCLIConfig for reliable CLI-origin gNMI config push. Add gnmiSaveWithRetry and waitForGNOI polling to handle Octa restarts triggered by enabling the containerz gNOI service. Insert a direct ip6tables ACCEPT rule into EOS's ns-mgmt namespace as a reliable fallback for port 60061; the system control-plane ACL mechanism does not create kernel firewall rules on all EOS versions. Re-dial gNOI after config push to bypass Ondatra's stale connection cache. Fix Setup() teardown to call RemoveContainer so stopped containers do not contaminate subsequent tests via Docker's restart policy. Improve WaitForRunning with per-call timeouts and stream error retry instead of a fatal on the first error.

Fix DialGRPCWithPort to strip per-RPC credentials and default to insecure transport when the device config specifies none; this resolves the gRPC errCredentialsConflict when the caller supplies its own transport credentials.

In cntr_test.go, pass TLS skip-verify credentials in dialContainer for Arista (cntrsrv serves a self-signed certificate). Guard AddNH/AddNHG with the new GribiDecapInDefaultNiUnsupported deviation for platforms that cannot FIB-program a Decap NH in the DEFAULT network instance. Add AwaitSwitchoverReady to internal/components for reuse across CNTR tests. Add containerz_oc_unsupported and gribi_decap_in_default_ni_unsupported to CNTR-2 metadata. Add the GribiDecapInDefaultNiUnsupported deviation accessor and proto field 428.

With those changes CNTR-1.1-1.7 ( TestDeployAndStart, TestRetrieveLogs, TestListContainers, TestStopContainer, TestVolumes, TestUpgrade, TestPlugins ) and a whole CNTR-2 passes on Arista's EOS.
Three issues were flagged in review of the initial CNTR infrastructure fixes:
- the `Setup()` cleanup closure captured the caller's test context, which may already be canceled when the cleanup runs (e.g. on test timeout), causing `Stop` and `RemoveContainer` to fail immediately and leave the container running on the DUT
- the gNMI Set call in `gnmiSaveWithRetry` used `context.Background()` with no timeout, which can hang indefinitely if gNMI is unresponsive during an Octa restart
- the `dialContainer` function in `cntr_test.go` used `dut.Vendor()` directly to select TLS credentials, which bypasses the deviation abstraction that featureprofiles tests are expected to use

Fixed all 3:
- give the `Setup` cleanup closure a fresh `context.WithTimeout` of one minute so `Stop` and `RemoveContainer` always execute regardless of the caller's context state.
- add a 10-second per-call timeout around the gNMI Set in `gnmiSaveWithRetry` so the retry loop can make progress even when the server is temporarily unresponsive.
- replace the `dut.Vendor()` switch in `dialContainer` with a new `ContainerzTLSInsecureSkipVerify` deviation and set it true for Arista in the CNTR-2 metadata.
After a cold reboot (CNTR-1.8) or supervisor switchover (CNTR-3), Ondatra's gNOI connection cache holds the pre-reboot client. Any gNOI RPC through that cached client returns Unavailable or hangs on a half-open TCP connection, causing every post-reboot container operation to fail.
The CNTR-1 cold-reboot subtest also failed because the containerz config was not saved to startup-config before the reboot, so the containerz service was not enabled when the device came back up.

Add `SaveConfig` (write memory via gNMI CLI) and `ClientWithoutConfig` (returns a containerz client from caller-supplied fresh gNOI clients without re-pushing config) to containerztest.
Add `WaitForReboot`, which polls `DialGNMI` with a short per-call timeout to detect the reboot down/up cycle and returns fresh gNOI clients via `BindingDUT().DialGNOI()` once the device recovers.
In CNTR-1.8 (TestContainerPersistenceAfterColdReboot), call `SaveConfig` before the reboot and `ClientWithoutConfig` with the fresh clients after.
In CNTR-3, use `BindingDUT().DialGNOI()` after each switchover to bypass the stale cache for all subsequent gNOI calls.
Add three timeout deviations so Arista hardware, which needs longer Docker and Octa settle times after a switchover, can be tuned without hardcoding.

With those changes CNTR-1.8 ( TestContainerPersistenceAfterColdReboot ) and all of CNTR-3 passes on Arista's EOS. Combined with PR openconfig#5507, all of CNTR-1, CNTR-2, and CNTR-3 pass on Arista's EOS.
@pjacakArista pjacakArista requested a review from a team as a code owner May 26, 2026 13:27
@OpenConfigBot
Copy link
Copy Markdown

Pull Request Functional Test Report for #5509 / 64f23e3

Virtual Devices

Device Test Test Documentation Job Raw Log
Arista cEOS status
status
status
CNTR-2: Container network connectivity tests
CNTR-1: Basic container lifecycle via gnoi.Containerz.
CNTR-3: Container Supervisor Failover
Cisco 8000E status
status
status
CNTR-2: Container network connectivity tests
CNTR-1: Basic container lifecycle via gnoi.Containerz.
CNTR-3: Container Supervisor Failover
Cisco XRd status
status
status
CNTR-2: Container network connectivity tests
CNTR-1: Basic container lifecycle via gnoi.Containerz.
CNTR-3: Container Supervisor Failover
Juniper ncPTX status
status
status
CNTR-2: Container network connectivity tests
CNTR-1: Basic container lifecycle via gnoi.Containerz.
CNTR-3: Container Supervisor Failover
Nokia SR Linux status
status
status
CNTR-2: Container network connectivity tests
CNTR-1: Basic container lifecycle via gnoi.Containerz.
CNTR-3: Container Supervisor Failover
Openconfig Lemming status
status
status
CNTR-2: Container network connectivity tests
CNTR-1: Basic container lifecycle via gnoi.Containerz.
CNTR-3: Container Supervisor Failover

Hardware Devices

Device Test Test Documentation Raw Log
Arista 7808 status
status
status
CNTR-2: Container network connectivity tests
CNTR-1: Basic container lifecycle via gnoi.Containerz.
CNTR-3: Container Supervisor Failover
Cisco 8808 status
status
status
CNTR-2: Container network connectivity tests
CNTR-1: Basic container lifecycle via gnoi.Containerz.
CNTR-3: Container Supervisor Failover
Juniper PTX10008 status
status
status
CNTR-2: Container network connectivity tests
CNTR-1: Basic container lifecycle via gnoi.Containerz.
CNTR-3: Container Supervisor Failover
Nokia 7250 IXR-10e status
status
status
CNTR-2: Container network connectivity tests
CNTR-1: Basic container lifecycle via gnoi.Containerz.
CNTR-3: Container Supervisor Failover

Help

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Warning

Gemini encountered an error creating the summary. You can try again by commenting /gemini summary.

@pjacakArista
Copy link
Copy Markdown
Author

/gemini summary

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Warning

Gemini encountered an error creating the summary. You can try again by commenting /gemini summary.

@pjacakArista
Copy link
Copy Markdown
Author

The main root cause is a stale connection cache in Ondatra - filed as openconfig/ondatra#145.
If that gets fixed, the WaitForReboot/ClientWithoutConfig plumbing in this PR can be simplified or removed.

@pjacakArista
Copy link
Copy Markdown
Author

/gemini summary

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

This pull request addresses issues with gNOI connection staleness and configuration persistence in container lifecycle tests following cold reboots and supervisor failovers. By ensuring fresh gNOI clients are dialed after major state changes and explicitly saving configurations to startup-config, the changes improve the reliability of container persistence tests on Arista EOS. Additionally, the PR introduces flexible timeout deviations and improved cleanup logic to prevent cross-test contamination.

Highlights

  • gNOI Connection Cache Fix: Implemented a mechanism to bypass Ondatra's stale gNOI connection cache after cold reboots and supervisor failovers by calling DialGNOI to retrieve fresh clients.
  • Configuration Persistence: Added a SaveConfig helper to save running-config to startup-config via gNMI CLI, ensuring containerz service configurations survive device reboots.
  • Test Stability Improvements: Introduced configurable timeout deviations for Arista hardware to handle longer Docker and Octa settle times after switchovers, and added a cleanup retry mechanism to prevent test contamination.
  • New Test Helpers: Added ClientWithoutConfig and WaitForReboot helpers to facilitate robust container management in post-reboot/failover scenarios.
Activity

@pjacakArista
Copy link
Copy Markdown
Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces several new deviations and updates container lifecycle, supervisor failover, and container connectivity tests to improve robustness and support vendor-specific behaviors (particularly for Arista). Key changes include adding retry logic for configuration saving, implementing robust polling for reboots and switchovers, and updating DialGRPCWithPort to strip per-RPC credentials for generic container services. The review feedback highlights opportunities to improve test portability by abstracting vendor-specific checks (dut.Vendor()) into the deviations package, and recommends using contexts with timeouts for gRPC Dial operations to prevent potential hangs.

Comment on lines +942 to +948
if dut.Vendor() == ondatra.ARISTA {
// Skip config push -- config was already saved during Setup. Re-pushing
// races EOS warmup after a fast reboot ("system not yet initialized").
cli = containerztest.ClientWithoutConfig(t, dut, postRebootClients)
} else {
cli = containerztest.Client(t, dut)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Avoid using dut.Vendor() directly in tests for vendor-specific logic. According to the repository's general rules, vendor-specific behavior should be abstracted using the deviations package to maintain test portability across different platforms. Consider defining a new deviation (e.g., ContainerzSkipConfigPushOnCleanup) and using it here.

Suggested change
if dut.Vendor() == ondatra.ARISTA {
// Skip config push -- config was already saved during Setup. Re-pushing
// races EOS warmup after a fast reboot ("system not yet initialized").
cli = containerztest.ClientWithoutConfig(t, dut, postRebootClients)
} else {
cli = containerztest.Client(t, dut)
}
if deviations.ContainerzSkipConfigPushOnCleanup(dut) {
// Skip config push -- config was already saved during Setup. Re-pushing
// races EOS warmup after a fast reboot ("system not yet initialized").
cli = containerztest.ClientWithoutConfig(t, dut, postRebootClients)
} else {
cli = containerztest.Client(t, dut)
}
References
  1. Avoid using "dut.Vendor()" for vendor-specific logic or configurations in tests. Instead, use the "deviations" package to maintain test abstraction and portability across different vendors.

Comment on lines +602 to +606
freshClients, err := dut.RawAPIs().BindingDUT().DialGNOI(context.Background())
if err != nil {
t.Logf("gNOI re-dial after switchover failed (non-fatal): %v", err)
}
return freshClients
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using context.Background() without a timeout for gRPC Dial operations can cause the test to hang indefinitely if the connection is half-open or the device is unresponsive. It is highly recommended to use a context with a reasonable timeout.

                    dialCtx, dialCancel := context.WithTimeout(context.Background(), 30*time.Second)
                    freshClients, err := dut.RawAPIs().BindingDUT().DialGNOI(dialCtx)
                    dialCancel()
                    if err != nil {
                        t.Logf("gNOI re-dial after switchover failed (non-fatal): %v", err)
                    }
                    return freshClients

Comment on lines +675 to +687
if dut.Vendor() == ondatra.ARISTA {
// Fresh dial on every attempt: after a prior switchover the Ondatra gNOI
// cache may point to the old active (now standby). A fresh dial always
// reaches the current active.
if freshClients, err := dut.RawAPIs().BindingDUT().DialGNOI(context.Background()); err == nil {
sysClient = freshClients.System()
} else {
t.Logf("gNOI fresh dial failed, falling back to cached client: %v", err)
sysClient = dut.RawAPIs().GNOI(t).System()
}
} else {
sysClient = dut.RawAPIs().GNOI(t).System()
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Avoid using dut.Vendor() directly in tests for vendor-specific logic. According to the repository's general rules, vendor-specific behavior should be abstracted using the deviations package to maintain test portability across different platforms. Consider defining a new deviation (e.g., ContainerzFreshDialOnSwitchover) and using it here. Additionally, this ensures we use a context with a timeout for the gRPC Dial operation to prevent indefinite hangs.

        if deviations.ContainerzFreshDialOnSwitchover(dut) {
            // Fresh dial on every attempt: after a prior switchover the Ondatra gNOI
            // cache may point to the old active (now standby). A fresh dial always
            // reaches the current active.
            dialCtx, dialCancel := context.WithTimeout(context.Background(), 30*time.Second)
            freshClients, err := dut.RawAPIs().BindingDUT().DialGNOI(dialCtx)
            dialCancel()
            if err == nil {
                sysClient = freshClients.System()
            } else {
                t.Logf("gNOI fresh dial failed, falling back to cached client: %v", err)
                sysClient = dut.RawAPIs().GNOI(t).System()
            }
        } else {
            sysClient = dut.RawAPIs().GNOI(t).System()
        }
References
  1. Avoid using "dut.Vendor()" for vendor-specific logic or configurations in tests. Instead, use the "deviations" package to maintain test abstraction and portability across different vendors.

Comment on lines +103 to +107
freshClients, err := dut.RawAPIs().BindingDUT().DialGNOI(context.Background())
if err != nil {
t.Logf("gNOI re-dial in Client() failed (non-fatal): %v", err)
freshClients = dut.RawAPIs().GNOI(t)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using context.Background() without a timeout for gRPC Dial operations can cause the test to hang indefinitely. It is highly recommended to use a context with a reasonable timeout.

    dialCtx, dialCancel := context.WithTimeout(context.Background(), 30*time.Second)
    freshClients, err := dut.RawAPIs().BindingDUT().DialGNOI(dialCtx)
    dialCancel()
    if err != nil {
        t.Logf("gNOI re-dial in Client() failed (non-fatal): %v", err)
        freshClients = dut.RawAPIs().GNOI(t)
    }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants