CNTR-1-3: fix gNOI cache staleness after cold reboot and supervisor failover#5509
CNTR-1-3: fix gNOI cache staleness after cold reboot and supervisor failover#5509pjacakArista wants to merge 3 commits into
Conversation
The three CNTR tests share containerztest.Client(), which fails on Arista because the binding rejects dut.Config().New().WithAristaText(...).Append(t) (PushConfig is unimplemented). As a result the containerz gNOI service is never configured and the management namespace firewall blocks container port 60061. CNTR-2 additionally lacks the containerz_oc_unsupported deviation, so the gNOI service enablement block is never entered. A credential conflict in binding.DialGRPCWithPort (per-RPC credentials combined with a caller-supplied insecure transport) prevents CNTR-2 from reaching the container even when the port is open. Replace the Append call with helpers.GnmiCLIConfig for reliable CLI-origin gNMI config push. Add gnmiSaveWithRetry and waitForGNOI polling to handle Octa restarts triggered by enabling the containerz gNOI service. Insert a direct ip6tables ACCEPT rule into EOS's ns-mgmt namespace as a reliable fallback for port 60061; the system control-plane ACL mechanism does not create kernel firewall rules on all EOS versions. Re-dial gNOI after config push to bypass Ondatra's stale connection cache. Fix Setup() teardown to call RemoveContainer so stopped containers do not contaminate subsequent tests via Docker's restart policy. Improve WaitForRunning with per-call timeouts and stream error retry instead of a fatal on the first error. Fix DialGRPCWithPort to strip per-RPC credentials and default to insecure transport when the device config specifies none; this resolves the gRPC errCredentialsConflict when the caller supplies its own transport credentials. In cntr_test.go, pass TLS skip-verify credentials in dialContainer for Arista (cntrsrv serves a self-signed certificate). Guard AddNH/AddNHG with the new GribiDecapInDefaultNiUnsupported deviation for platforms that cannot FIB-program a Decap NH in the DEFAULT network instance. Add AwaitSwitchoverReady to internal/components for reuse across CNTR tests. Add containerz_oc_unsupported and gribi_decap_in_default_ni_unsupported to CNTR-2 metadata. Add the GribiDecapInDefaultNiUnsupported deviation accessor and proto field 428. With those changes CNTR-1.1-1.7 ( TestDeployAndStart, TestRetrieveLogs, TestListContainers, TestStopContainer, TestVolumes, TestUpgrade, TestPlugins ) and a whole CNTR-2 passes on Arista's EOS.
Three issues were flagged in review of the initial CNTR infrastructure fixes: - the `Setup()` cleanup closure captured the caller's test context, which may already be canceled when the cleanup runs (e.g. on test timeout), causing `Stop` and `RemoveContainer` to fail immediately and leave the container running on the DUT - the gNMI Set call in `gnmiSaveWithRetry` used `context.Background()` with no timeout, which can hang indefinitely if gNMI is unresponsive during an Octa restart - the `dialContainer` function in `cntr_test.go` used `dut.Vendor()` directly to select TLS credentials, which bypasses the deviation abstraction that featureprofiles tests are expected to use Fixed all 3: - give the `Setup` cleanup closure a fresh `context.WithTimeout` of one minute so `Stop` and `RemoveContainer` always execute regardless of the caller's context state. - add a 10-second per-call timeout around the gNMI Set in `gnmiSaveWithRetry` so the retry loop can make progress even when the server is temporarily unresponsive. - replace the `dut.Vendor()` switch in `dialContainer` with a new `ContainerzTLSInsecureSkipVerify` deviation and set it true for Arista in the CNTR-2 metadata.
After a cold reboot (CNTR-1.8) or supervisor switchover (CNTR-3), Ondatra's gNOI connection cache holds the pre-reboot client. Any gNOI RPC through that cached client returns Unavailable or hangs on a half-open TCP connection, causing every post-reboot container operation to fail. The CNTR-1 cold-reboot subtest also failed because the containerz config was not saved to startup-config before the reboot, so the containerz service was not enabled when the device came back up. Add `SaveConfig` (write memory via gNMI CLI) and `ClientWithoutConfig` (returns a containerz client from caller-supplied fresh gNOI clients without re-pushing config) to containerztest. Add `WaitForReboot`, which polls `DialGNMI` with a short per-call timeout to detect the reboot down/up cycle and returns fresh gNOI clients via `BindingDUT().DialGNOI()` once the device recovers. In CNTR-1.8 (TestContainerPersistenceAfterColdReboot), call `SaveConfig` before the reboot and `ClientWithoutConfig` with the fresh clients after. In CNTR-3, use `BindingDUT().DialGNOI()` after each switchover to bypass the stale cache for all subsequent gNOI calls. Add three timeout deviations so Arista hardware, which needs longer Docker and Octa settle times after a switchover, can be tuned without hardcoding. With those changes CNTR-1.8 ( TestContainerPersistenceAfterColdReboot ) and all of CNTR-3 passes on Arista's EOS. Combined with PR openconfig#5507, all of CNTR-1, CNTR-2, and CNTR-3 pass on Arista's EOS.
|
Warning Gemini encountered an error creating the summary. You can try again by commenting |
|
/gemini summary |
|
Warning Gemini encountered an error creating the summary. You can try again by commenting |
|
The main root cause is a stale connection cache in Ondatra - filed as openconfig/ondatra#145. |
|
/gemini summary |
Summary of ChangesThis pull request addresses issues with gNOI connection staleness and configuration persistence in container lifecycle tests following cold reboots and supervisor failovers. By ensuring fresh gNOI clients are dialed after major state changes and explicitly saving configurations to startup-config, the changes improve the reliability of container persistence tests on Arista EOS. Additionally, the PR introduces flexible timeout deviations and improved cleanup logic to prevent cross-test contamination. Highlights
Activity
|
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces several new deviations and updates container lifecycle, supervisor failover, and container connectivity tests to improve robustness and support vendor-specific behaviors (particularly for Arista). Key changes include adding retry logic for configuration saving, implementing robust polling for reboots and switchovers, and updating DialGRPCWithPort to strip per-RPC credentials for generic container services. The review feedback highlights opportunities to improve test portability by abstracting vendor-specific checks (dut.Vendor()) into the deviations package, and recommends using contexts with timeouts for gRPC Dial operations to prevent potential hangs.
| if dut.Vendor() == ondatra.ARISTA { | ||
| // Skip config push -- config was already saved during Setup. Re-pushing | ||
| // races EOS warmup after a fast reboot ("system not yet initialized"). | ||
| cli = containerztest.ClientWithoutConfig(t, dut, postRebootClients) | ||
| } else { | ||
| cli = containerztest.Client(t, dut) | ||
| } |
There was a problem hiding this comment.
Avoid using dut.Vendor() directly in tests for vendor-specific logic. According to the repository's general rules, vendor-specific behavior should be abstracted using the deviations package to maintain test portability across different platforms. Consider defining a new deviation (e.g., ContainerzSkipConfigPushOnCleanup) and using it here.
| if dut.Vendor() == ondatra.ARISTA { | |
| // Skip config push -- config was already saved during Setup. Re-pushing | |
| // races EOS warmup after a fast reboot ("system not yet initialized"). | |
| cli = containerztest.ClientWithoutConfig(t, dut, postRebootClients) | |
| } else { | |
| cli = containerztest.Client(t, dut) | |
| } | |
| if deviations.ContainerzSkipConfigPushOnCleanup(dut) { | |
| // Skip config push -- config was already saved during Setup. Re-pushing | |
| // races EOS warmup after a fast reboot ("system not yet initialized"). | |
| cli = containerztest.ClientWithoutConfig(t, dut, postRebootClients) | |
| } else { | |
| cli = containerztest.Client(t, dut) | |
| } |
References
- Avoid using "dut.Vendor()" for vendor-specific logic or configurations in tests. Instead, use the "deviations" package to maintain test abstraction and portability across different vendors.
| freshClients, err := dut.RawAPIs().BindingDUT().DialGNOI(context.Background()) | ||
| if err != nil { | ||
| t.Logf("gNOI re-dial after switchover failed (non-fatal): %v", err) | ||
| } | ||
| return freshClients |
There was a problem hiding this comment.
Using context.Background() without a timeout for gRPC Dial operations can cause the test to hang indefinitely if the connection is half-open or the device is unresponsive. It is highly recommended to use a context with a reasonable timeout.
dialCtx, dialCancel := context.WithTimeout(context.Background(), 30*time.Second)
freshClients, err := dut.RawAPIs().BindingDUT().DialGNOI(dialCtx)
dialCancel()
if err != nil {
t.Logf("gNOI re-dial after switchover failed (non-fatal): %v", err)
}
return freshClients| if dut.Vendor() == ondatra.ARISTA { | ||
| // Fresh dial on every attempt: after a prior switchover the Ondatra gNOI | ||
| // cache may point to the old active (now standby). A fresh dial always | ||
| // reaches the current active. | ||
| if freshClients, err := dut.RawAPIs().BindingDUT().DialGNOI(context.Background()); err == nil { | ||
| sysClient = freshClients.System() | ||
| } else { | ||
| t.Logf("gNOI fresh dial failed, falling back to cached client: %v", err) | ||
| sysClient = dut.RawAPIs().GNOI(t).System() | ||
| } | ||
| } else { | ||
| sysClient = dut.RawAPIs().GNOI(t).System() | ||
| } |
There was a problem hiding this comment.
Avoid using dut.Vendor() directly in tests for vendor-specific logic. According to the repository's general rules, vendor-specific behavior should be abstracted using the deviations package to maintain test portability across different platforms. Consider defining a new deviation (e.g., ContainerzFreshDialOnSwitchover) and using it here. Additionally, this ensures we use a context with a timeout for the gRPC Dial operation to prevent indefinite hangs.
if deviations.ContainerzFreshDialOnSwitchover(dut) {
// Fresh dial on every attempt: after a prior switchover the Ondatra gNOI
// cache may point to the old active (now standby). A fresh dial always
// reaches the current active.
dialCtx, dialCancel := context.WithTimeout(context.Background(), 30*time.Second)
freshClients, err := dut.RawAPIs().BindingDUT().DialGNOI(dialCtx)
dialCancel()
if err == nil {
sysClient = freshClients.System()
} else {
t.Logf("gNOI fresh dial failed, falling back to cached client: %v", err)
sysClient = dut.RawAPIs().GNOI(t).System()
}
} else {
sysClient = dut.RawAPIs().GNOI(t).System()
}References
- Avoid using "dut.Vendor()" for vendor-specific logic or configurations in tests. Instead, use the "deviations" package to maintain test abstraction and portability across different vendors.
| freshClients, err := dut.RawAPIs().BindingDUT().DialGNOI(context.Background()) | ||
| if err != nil { | ||
| t.Logf("gNOI re-dial in Client() failed (non-fatal): %v", err) | ||
| freshClients = dut.RawAPIs().GNOI(t) | ||
| } |
There was a problem hiding this comment.
Using context.Background() without a timeout for gRPC Dial operations can cause the test to hang indefinitely. It is highly recommended to use a context with a reasonable timeout.
dialCtx, dialCancel := context.WithTimeout(context.Background(), 30*time.Second)
freshClients, err := dut.RawAPIs().BindingDUT().DialGNOI(dialCtx)
dialCancel()
if err != nil {
t.Logf("gNOI re-dial in Client() failed (non-fatal): %v", err)
freshClients = dut.RawAPIs().GNOI(t)
}
This PR depends on PR #5507 which fixes the shared
containerztestinfrastructure and CNTR-2. This PR adds the cold-reboot and supervisor failover fixes needed for CNTR-1.8 and all of CNTR-3 for Arista's EOS.Root causes fixed:
Unavailableor hangs on a half-open TCP connection, causing every post-reboot container operation to fail. Fixed by callingBindingDUT().DialGNOI()after each reboot/switchover to get fresh clients, bypassing the stale cache entirely.TestContainerPersistenceAfterColdReboot) additionally failed because thecontainerzconfig was not saved to startup-config before the reboot, so the containerz gNOI service was not re-enabled when the device came back up. Fixed by callingSaveConfig(write memory via gNMI CLI) before triggering the reboot.New helpers in
containerztest:SaveConfig- saves running-config to startup-config via gNMI CLI, ensuring containerz service config survives a reboot.ClientWithoutConfig- returns a containerz client from caller-supplied fresh gNOI clients without re-pushing config. Used after reboot/switchover when config is already saved to startup-config.WaitForReboot- pollsDialGNMIwith a short per-call timeout to detect the reboot down/up cycle and returns fresh gNOI clients viaBindingDUT().DialGNOI()once the device recovers.Deviations: Added three timeout deviations so Arista hardware, that needs longer Docker and Octa settle times after a switchover, can be tuned without hardcoding:
containerz_post_switchover_verify_timeout_s,containerz_switchover_ready_timeout_s,containerz_switchover_retry_timeout_s.With those changes CNTR-1.8 ( TestContainerPersistenceAfterColdReboot ) and all of CNTR-3 passes on Arista's EOS. Combined with PR #5507, all of CNTR-1, CNTR-2, and CNTR-3 pass on Arista's EOS.