
fix(e2e): improve infra setup reliability with retries and tolerant GC #8488

Open: r2k1 wants to merge 1 commit into main from e2e-infra-reliability

Conversation

r2k1 (Contributor) commented on May 10, 2026

Summary

Infrastructure setup failures cause cascading test failures: one transient Azure error fails ALL tests sharing that cluster. This PR adds retries and tolerant cleanup, and handles cluster lifecycle states properly.

Evidence: 464 test failures across 46 builds over 3 weeks.

Changes

1. Tolerant stale node GC (cluster.go)

Previously any single failed K8s node delete killed cluster prep for all tests. The actual error was nodes "..." not found — the node was already gone.

Now: NotFound is ignored, other delete errors are logged as warnings, and the GC fails only if it makes no progress.

Evidence: 79 failures from "failed to delete 3 stale nodes, first error: ... not found"
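
A minimal sketch of this tolerant-GC pattern, assuming a client-go clientset and a precomputed list of stale node names; the function name and signature are illustrative, not the repo's actual helper:

package e2e

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteStaleNodes deletes the given nodes, treating NotFound as success and
// other per-node failures as warnings; it errors out only when nothing at all
// could be deleted.
func deleteStaleNodes(ctx context.Context, client kubernetes.Interface, staleNodes []string) error {
	deleted := 0
	var firstErr error
	for _, name := range staleNodes {
		err := client.CoreV1().Nodes().Delete(ctx, name, metav1.DeleteOptions{})
		switch {
		case err == nil, apierrors.IsNotFound(err):
			// Already gone counts as progress for GC purposes.
			deleted++
		default:
			// The real code logs these as warnings and keeps going.
			if firstErr == nil {
				firstErr = err
			}
		}
	}
	if deleted == 0 && firstErr != nil {
		return fmt.Errorf("failed to delete any of %d stale nodes, first error: %w", len(staleNodes), firstErr)
	}
	return nil
}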

2. Fix existingVMSS map (cluster.go)

VMSS queued for deletion were kept in the "existing" map, protecting their orphaned K8s nodes from cleanup. Now only VMSS that are kept go into the map, so nodes backed by a deleted VMSS can be cleaned up in the same pass.

3. Retry Bastion subnet GET with per-call timeout (cluster.go)

A single unresponsive Subnet.Get consumed the entire 20-minute cluster prep budget. Logs confirmed: "preparing cluster done (1200.0s)".

Now: each GET has a 30-second timeout, retried every 5s for up to 2 minutes.

Evidence: 179 failures across 3 builds from "get subnet AzureBastionSubnet: context deadline exceeded"
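
A minimal sketch of the per-call-timeout retry, written against a plain armnetwork.SubnetsClient rather than the repo's config wrapper (the module version in the import path is assumed); the NotFound branch that creates the Bastion subnet is left out here:

package e2e

import (
	"context"
	"time"

	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/network/armnetwork/v4"
	"k8s.io/apimachinery/pkg/util/wait"
)

// getSubnetWithRetry bounds each ARM GET with its own timeout so a single hung
// call cannot consume the whole cluster-prep budget, and retries on a fixed
// interval until the overall deadline.
func getSubnetWithRetry(ctx context.Context, subnets *armnetwork.SubnetsClient, rg, vnetName, subnetName string) (*armnetwork.Subnet, error) {
	var subnet *armnetwork.Subnet
	err := wait.PollUntilContextTimeout(ctx, 5*time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
		// Each GET gets its own 30-second budget.
		callCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
		defer cancel()
		resp, getErr := subnets.Get(callCtx, rg, vnetName, subnetName, nil)
		if getErr != nil {
			// Treat the failure as transient; the next tick retries.
			return false, nil
		}
		subnet = &resp.Subnet
		return true, nil
	})
	return subnet, err
}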

4. Retry AKS subnet + route table lookup (aks_model.go)

Same per-call timeout pattern. Also handles kubenet route table propagation delays after cluster create/reuse.

Evidence: 39 failures across 6 builds from "AKS subnet has no route table associated"

5. Retry Firewall creation (aks_model.go)

BeginCreateOrUpdate + PollUntilDone are retried on failure. The operation is idempotent, so a retry either resumes the in-progress deployment or restarts it.

Evidence: 90 failures across 2 builds from "failed to create Firewall: context deadline exceeded"

6. Handle Deleting/Canceled cluster states (cluster.go)

waitUntilClusterReady treated Deleting as an unknown state and failed immediately. A cluster stuck in Deleting from a previous run killed all tests.

Now: waits for Deleting/Canceled/Canceling to finish. If the cluster ends up deleted (404), returns nil so a new cluster is created.

Evidence: 74 failures across 2 builds from "cluster ... is in state Deleting, and wont retry"
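
A minimal sketch of waiting out these lifecycle states, assuming an armcontainerservice.ManagedClustersClient (import path version assumed); the intervals are illustrative, and creating the replacement cluster is left to the caller:

package e2e

import (
	"context"
	"errors"
	"net/http"
	"time"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore"
	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/containerservice/armcontainerservice/v5"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitOutClusterDeletion polls the managed cluster until it is no longer in a
// Deleting/Canceling/Canceled state. A 404 means the deletion finished, so the
// caller can go on to create a fresh cluster.
func waitOutClusterDeletion(ctx context.Context, clusters *armcontainerservice.ManagedClustersClient, rg, name string) error {
	return wait.PollUntilContextTimeout(ctx, 15*time.Second, 20*time.Minute, true, func(ctx context.Context) (bool, error) {
		resp, err := clusters.Get(ctx, rg, name, nil)
		if err != nil {
			var respErr *azcore.ResponseError
			if errors.As(err, &respErr) && respErr.StatusCode == http.StatusNotFound {
				// Cluster is gone: done waiting, a new one gets created upstream.
				return true, nil
			}
			// Transient GET failure: retry on the next tick.
			return false, nil
		}
		state := ""
		if resp.Properties != nil && resp.Properties.ProvisioningState != nil {
			state = *resp.Properties.ProvisioningState
		}
		switch state {
		case "Deleting", "Canceling", "Canceled":
			// Still winding down from a previous run: keep waiting.
			return false, nil
		}
		return true, nil
	})
}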


Copilot AI left a comment


Pull request overview

This PR aims to improve E2E infrastructure setup reliability by adding retry/timeout patterns around Azure ARM operations and making cluster garbage-collection more tolerant so transient failures don’t cascade across all tests sharing a cluster.

Changes:

  • Add retry with per-call timeouts for Bastion subnet GET and for AKS subnet + route table readiness checks.
  • Make stale Kubernetes node GC tolerant to NotFound and non-fatal per-node delete errors.
  • Adjust VMSS GC bookkeeping so VMSS slated for deletion don’t shield their nodes from cleanup.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

Reviewed files:
  • e2e/cluster.go: Adds ARM GET retries/timeouts for Bastion subnet, changes VMSS “keep” set semantics, and makes stale node GC more tolerant.
  • e2e/aks_model.go: Adds retries/timeouts around AKS subnet/route table discovery and retries Azure Firewall creation.

Comment thread: e2e/cluster.go (outdated)

	return false, nil
})
if err != nil {
	return nil, fmt.Errorf("get subnet %q in vnet %q rg %q: %w", bastionSubnetName, vnet.name, nodeRG, subnetGetErr)
Comment thread: e2e/cluster.go, lines +726 to 729
// Build a set of VMSS names that should be kept — exclude VMSS that are
// being deleted so their stale K8s nodes can be cleaned up in the same pass.
existingVMSS := map[string]struct{}{}
pager := config.Azure.VMSS.NewListPager(rg, nil)
Comment thread: e2e/cluster.go (outdated), lines +753 to +763

		continue
	}
	toolkit.Logf(ctx, "deleted vmss %q (age: %v)", *vmss.ID, time.Since(*vmss.Properties.TimeCreated))
	// Don't add to existingVMSS — nodes from this VMSS should be cleaned up
Comment thread: e2e/aks_model.go, lines +319 to +338

// Retry the AKS subnet GET and route table lookup to tolerate:
// - transient ARM timeouts on the subnet GET
// - eventual consistency: kubenet route table may still be propagating after cluster create/reuse
var aksRTName string
err = wait.PollUntilContextTimeout(ctx, 5*time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
	callCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()
	aksSubnetResp, subnetErr := config.Azure.Subnet.Get(callCtx, rg, vnet.name, "aks-subnet", nil)
	if subnetErr != nil {
		toolkit.Logf(ctx, "transient error getting AKS subnet (retrying): %v", subnetErr)
		return false, nil
	}
	rtName, rtErr := ensureFirewallRouteTable(ctx, clusterModel, vnet.name, aksSubnetResp.Subnet)
	if rtErr != nil {
		toolkit.Logf(ctx, "route table not ready (retrying): %v", rtErr)
		return false, nil
	}
	aksRTName = rtName
	return true, nil
})
Comment thread: e2e/aks_model.go, lines +406 to +419

var fwResp armnetwork.AzureFirewallsClientCreateOrUpdateResponse
err = wait.PollUntilContextTimeout(ctx, 10*time.Second, 10*time.Minute, true, func(ctx context.Context) (bool, error) {
	fwPoller, fwErr := config.Azure.AzureFirewall.BeginCreateOrUpdate(ctx, rg, firewallName, *firewall, nil)
	if fwErr != nil {
		toolkit.Logf(ctx, "transient error starting firewall creation (retrying): %v", fwErr)
		return false, nil
	}
	fwResp, fwErr = fwPoller.PollUntilDone(ctx, nil)
	if fwErr != nil {
		toolkit.Logf(ctx, "firewall creation did not complete (retrying): %v", fwErr)
		return false, nil
	}
	return true, nil
})
r2k1 force-pushed the e2e-infra-reliability branch from 8c798e4 to d79fabe on May 10, 2026 at 23:34
Copilot AI review requested due to automatic review settings on May 10, 2026 at 23:36
r2k1 force-pushed the e2e-infra-reliability branch from d79fabe to a8baa38 on May 10, 2026 at 23:36

Address infrastructure failures that cause cascading test failures
(464 failures across 46 builds over 3 weeks).

1. Make stale node GC tolerant (cluster.go):
   - Ignore NotFound errors on node deletion (already gone)
   - Log other delete errors as warnings instead of failing
   - Only return error if zero nodes could be deleted
   Evidence: 79 failures from 'failed to delete 3 stale nodes'

2. Fix existingVMSS map (cluster.go):
   - Only add VMSS to existingVMSS if they are being kept
   - VMSS queued for deletion are excluded, so their stale K8s
     nodes can be cleaned up in the same pass
   - If VMSS deletion fails, keep in map to avoid orphaned deletes

3. Retry Bastion subnet GET (cluster.go):
   - Poll with backoff for up to 30s on transient ARM errors
   - 404 still handled normally (create subnet)
   Evidence: 179 failures from 'get subnet AzureBastionSubnet:
   context deadline exceeded'

4. Retry AKS subnet + route table lookup (aks_model.go):
   - Poll with backoff for up to 2 minutes
   - Handles both transient GET failures and kubenet route table
     propagation delays after cluster create/reuse
   Evidence: 39 failures from 'AKS subnet has no route table'

5. Retry Firewall creation (aks_model.go):
   - Poll with backoff for up to 10 minutes
   - BeginCreateOrUpdate is idempotent, safe to retry
   Evidence: 90 failures from 'failed to create Firewall:
   context deadline exceeded'

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
r2k1 force-pushed the e2e-infra-reliability branch from a8baa38 to e19fe06 on May 11, 2026 at 01:20

timmy-wright (Contributor) left a comment

LGTM - it'd be good to know which copilot comments you've responded to. Currently they are all unresolved, but it looks like you took some of them into account.

