fix(e2e): improve infra setup reliability with retries and tolerant GC #8488
Open
r2k1 wants to merge 1 commit into
Conversation
Contributor
Pull request overview
This PR aims to improve E2E infrastructure setup reliability by adding retry/timeout patterns around Azure ARM operations and making cluster garbage-collection more tolerant so transient failures don’t cascade across all tests sharing a cluster.
Changes:
- Add retry with per-call timeouts for Bastion subnet GET and for AKS subnet + route table readiness checks.
- Make stale Kubernetes node GC tolerant to NotFound and non-fatal per-node delete errors.
- Adjust VMSS GC bookkeeping so VMSS slated for deletion don’t shield their nodes from cleanup.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| e2e/cluster.go | Adds ARM GET retries/timeouts for Bastion subnet, changes VMSS “keep” set semantics, and makes stale node GC more tolerant. |
| e2e/aks_model.go | Adds retries/timeouts around AKS subnet/route table discovery and retries Azure Firewall creation. |
		return false, nil
	})
	if err != nil {
		return nil, fmt.Errorf("get subnet %q in vnet %q rg %q: %w", bastionSubnetName, vnet.name, nodeRG, subnetGetErr)
Comment on lines +726 to 729
	// Build a set of VMSS names that should be kept — exclude VMSS that are
	// being deleted so their stale K8s nodes can be cleaned up in the same pass.
	existingVMSS := map[string]struct{}{}
	pager := config.Azure.VMSS.NewListPager(rg, nil)
Comment on lines +753 to +763
			continue
		}
		toolkit.Logf(ctx, "deleted vmss %q (age: %v)", *vmss.ID, time.Since(*vmss.Properties.TimeCreated))
		// Don't add to existingVMSS — nodes from this VMSS should be cleaned up
Comment on lines +319 to +338
	// Retry the AKS subnet GET and route table lookup to tolerate:
	// - transient ARM timeouts on the subnet GET
	// - eventual consistency: kubenet route table may still be propagating after cluster create/reuse
	var aksRTName string
	err = wait.PollUntilContextTimeout(ctx, 5*time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
		callCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
		defer cancel()
		aksSubnetResp, subnetErr := config.Azure.Subnet.Get(callCtx, rg, vnet.name, "aks-subnet", nil)
		if subnetErr != nil {
			toolkit.Logf(ctx, "transient error getting AKS subnet (retrying): %v", subnetErr)
			return false, nil
		}
		rtName, rtErr := ensureFirewallRouteTable(ctx, clusterModel, vnet.name, aksSubnetResp.Subnet)
		if rtErr != nil {
			toolkit.Logf(ctx, "route table not ready (retrying): %v", rtErr)
			return false, nil
		}
		aksRTName = rtName
		return true, nil
	})
Comment on lines +406 to +419
	var fwResp armnetwork.AzureFirewallsClientCreateOrUpdateResponse
	err = wait.PollUntilContextTimeout(ctx, 10*time.Second, 10*time.Minute, true, func(ctx context.Context) (bool, error) {
		fwPoller, fwErr := config.Azure.AzureFirewall.BeginCreateOrUpdate(ctx, rg, firewallName, *firewall, nil)
		if fwErr != nil {
			toolkit.Logf(ctx, "transient error starting firewall creation (retrying): %v", fwErr)
			return false, nil
		}
		fwResp, fwErr = fwPoller.PollUntilDone(ctx, nil)
		if fwErr != nil {
			toolkit.Logf(ctx, "firewall creation did not complete (retrying): %v", fwErr)
			return false, nil
		}
		return true, nil
	})
Force-pushed 8c798e4 to d79fabe
Force-pushed d79fabe to a8baa38
Address infrastructure failures that cause cascading test failures
(464 failures across 46 builds over 3 weeks).
1. Make stale node GC tolerant (cluster.go):
- Ignore NotFound errors on node deletion (already gone)
- Log other delete errors as warnings instead of failing
- Only return error if zero nodes could be deleted
Evidence: 79 failures from 'failed to delete 3 stale nodes'
2. Fix existingVMSS map (cluster.go):
- Only add VMSS to existingVMSS if they are being kept
- VMSS queued for deletion are excluded, so their stale K8s
nodes can be cleaned up in the same pass
- If VMSS deletion fails, keep in map to avoid orphaned deletes
3. Retry Bastion subnet GET (cluster.go):
- Poll with backoff for up to 30s on transient ARM errors
- 404 still handled normally (create subnet)
Evidence: 179 failures from 'get subnet AzureBastionSubnet:
context deadline exceeded'
4. Retry AKS subnet + route table lookup (aks_model.go):
- Poll with backoff for up to 2 minutes
- Handles both transient GET failures and kubenet route table
propagation delays after cluster create/reuse
Evidence: 39 failures from 'AKS subnet has no route table'
5. Retry Firewall creation (aks_model.go):
- Poll with backoff for up to 10 minutes
- BeginCreateOrUpdate is idempotent, safe to retry
Evidence: 90 failures from 'failed to create Firewall:
context deadline exceeded'
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Force-pushed a8baa38 to e19fe06
timmy-wright
approved these changes
May 11, 2026
Contributor
timmy-wright
left a comment
LGTM - it'd be good to know which copilot comments you've responded to. Currently they are all unresolved, but it looks like you took some of them into account.
Summary
Infrastructure setup failures cause cascading test failures: one transient Azure error fails ALL tests sharing that cluster. This PR adds retries and tolerant cleanup, and handles cluster lifecycle states properly.
Evidence: 464 test failures across 46 builds over 3 weeks.
Changes
1. Tolerant stale node GC (cluster.go)
Previously any single failed K8s node delete killed cluster prep for all tests. The actual error was nodes "..." not found: the node was already gone.
Now: NotFound ignored, other errors logged as warnings, only fails if zero progress.
Evidence: 79 failures from "failed to delete 3 stale nodes, first error: ... not found"
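To make the pattern concrete, here is a minimal sketch of a tolerant delete loop, assuming a client-go clientset; deleteStaleNodes and staleNames are illustrative names (not the repo's helpers), and fmt stands in for the repo's logger.

```go
// Sketch only: names are illustrative, not the repo's actual helpers.
package e2e

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteStaleNodes deletes the given nodes, treating NotFound as success
// (the node is already gone) and logging other per-node failures.
// It only returns an error if no node at all could be deleted.
func deleteStaleNodes(ctx context.Context, cs kubernetes.Interface, staleNames []string) error {
	deleted := 0
	var firstErr error
	for _, name := range staleNames {
		err := cs.CoreV1().Nodes().Delete(ctx, name, metav1.DeleteOptions{})
		switch {
		case err == nil, apierrors.IsNotFound(err):
			deleted++
		default:
			if firstErr == nil {
				firstErr = err
			}
			fmt.Printf("warning: failed to delete stale node %q: %v\n", name, err)
		}
	}
	if deleted == 0 && firstErr != nil {
		return fmt.Errorf("failed to delete any of %d stale nodes, first error: %w", len(staleNames), firstErr)
	}
	return nil
}
```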
2. Fix existingVMSS map (cluster.go)
VMSS queued for deletion were kept in the "existing" map, protecting their orphaned K8s nodes from cleanup. Now only kept VMSS go in the map.
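A small sketch of the intended bookkeeping; vmssInfo and buildKeepSet are illustrative types, not the repo's code, which iterates an Azure VMSS pager instead.

```go
// Sketch only: vmssInfo and buildKeepSet are illustrative, not the repo's types.
package e2e

type vmssInfo struct {
	Name  string
	Stale bool // e.g. past the GC age cutoff
}

// buildKeepSet returns the VMSS names whose Kubernetes nodes must NOT be
// garbage collected. deleteVMSS is whatever starts the Azure-side delete.
func buildKeepSet(all []vmssInfo, deleteVMSS func(name string) error) map[string]struct{} {
	keep := map[string]struct{}{}
	for _, v := range all {
		if !v.Stale {
			keep[v.Name] = struct{}{}
			continue
		}
		if err := deleteVMSS(v.Name); err != nil {
			// Deletion failed: keep the VMSS so its nodes aren't orphan-deleted.
			keep[v.Name] = struct{}{}
			continue
		}
		// Deleted successfully: intentionally not added, so the VMSS's stale
		// Kubernetes nodes are cleaned up in the same pass.
	}
	return keep
}
```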
3. Retry Bastion subnet GET with per-call timeout (cluster.go)
A single unresponsive Subnet.Get consumed the entire 20-minute cluster prep budget. Logs confirmed: "preparing cluster done (1200.0s)".
Now: each GET has a 30-second timeout, retried every 5s for up to 2 minutes.
Evidence: 179 failures across 3 builds from "get subnet AzureBastionSubnet: context deadline exceeded"
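The per-call-timeout shape, sketched against the armnetwork SubnetsClient. The function name, the 404 handling, and the unversioned SDK import path are assumptions for illustration; the repo's actual wrapper differs.

```go
// Sketch only, assuming an armnetwork.SubnetsClient; getBastionSubnet is an
// illustrative name and the SDK import path/version should follow go.mod.
package e2e

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"time"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore"
	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/network/armnetwork"
	"k8s.io/apimachinery/pkg/util/wait"
)

// getBastionSubnet retries the subnet GET with a 30s per-call timeout for up
// to 2 minutes. A 404 ends polling immediately (the caller creates the
// subnet); other errors are treated as transient and retried.
func getBastionSubnet(ctx context.Context, subnets *armnetwork.SubnetsClient, rg, vnetName, subnetName string) (*armnetwork.Subnet, error) {
	var subnet *armnetwork.Subnet
	err := wait.PollUntilContextTimeout(ctx, 5*time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
		// Per-call timeout so one hung ARM request can't consume the whole prep budget.
		callCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
		defer cancel()
		resp, getErr := subnets.Get(callCtx, rg, vnetName, subnetName, nil)
		if getErr != nil {
			var respErr *azcore.ResponseError
			if errors.As(getErr, &respErr) && respErr.StatusCode == http.StatusNotFound {
				return true, nil // definitive: subnet does not exist, stop polling
			}
			return false, nil // transient: retry
		}
		subnet = &resp.Subnet
		return true, nil
	})
	if err != nil {
		return nil, fmt.Errorf("get subnet %q in vnet %q: %w", subnetName, vnetName, err)
	}
	return subnet, nil // nil means "not found": caller creates the subnet
}
```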
4. Retry AKS subnet + route table lookup (aks_model.go)
Same per-call timeout pattern. Also handles kubenet route table propagation delays after cluster create/reuse.
Evidence: 39 failures across 6 builds from "AKS subnet has no route table associated"
5. Retry Firewall creation (aks_model.go)
BeginCreateOrUpdate + PollUntilDone retried on failure. Idempotent, so retrying resumes or restarts.
Evidence: 90 failures across 2 builds from "failed to create Firewall: context deadline exceeded"
6. Handle Deleting/Canceled cluster states (cluster.go)
waitUntilClusterReady treated Deleting as an unknown state and failed immediately. A cluster stuck in Deleting from a previous run killed all tests.
Now: waits for Deleting/Canceled/Canceling to finish. If cluster gets deleted (404), returns nil so a new cluster is created.
Evidence: 74 failures across 2 builds from "cluster ... is in state Deleting, and wont retry"