
fix(e2e): improve infra setup reliability with retries and tolerant GC #8488

Open: r2k1 wants to merge 1 commit into main from e2e-infra-reliability

Conversation

r2k1 (Contributor) commented on May 10, 2026

Summary

Infrastructure setup failures cause cascading test failures: one transient Azure error fails ALL tests sharing that cluster. This PR adds retries and tolerant cleanup, and handles cluster lifecycle states properly.

Evidence: 464 test failures across 46 builds over 3 weeks.

Changes

1. Tolerant stale node GC (cluster.go)

Previously any single failed K8s node delete killed cluster prep for all tests. The actual error was nodes "..." not found — the node was already gone.

Now: NotFound is ignored, other delete errors are logged as warnings, and the GC fails only if it makes no progress.

Evidence: 79 failures from "failed to delete 3 stale nodes, first error: ... not found"
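
A minimal sketch of this tolerant-GC pattern, assuming a client-go clientset and a precomputed list of stale node names; the function name and signature are illustrative, not the repo's actual helper:

package e2e

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteStaleNodes deletes the given nodes, treating NotFound as success and
// other per-node failures as warnings; it errors out only when nothing at all
// could be deleted.
func deleteStaleNodes(ctx context.Context, client kubernetes.Interface, staleNodes []string) error {
	deleted := 0
	var firstErr error
	for _, name := range staleNodes {
		err := client.CoreV1().Nodes().Delete(ctx, name, metav1.DeleteOptions{})
		switch {
		case err == nil, apierrors.IsNotFound(err):
			// Already gone counts as progress for GC purposes.
			deleted++
		default:
			// The real code logs these as warnings and keeps going.
			if firstErr == nil {
				firstErr = err
			}
		}
	}
	if deleted == 0 && firstErr != nil {
		return fmt.Errorf("failed to delete any of %d stale nodes, first error: %w", len(staleNodes), firstErr)
	}
	return nil
}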

2. Fix existingVMSS map (cluster.go)

VMSS queued for deletion were kept in the "existing" map, protecting their orphaned K8s nodes from cleanup. Now only VMSS that are kept go into the map, so nodes backed by a deleted VMSS can be cleaned up in the same pass.

3. Retry Bastion subnet GET with per-call timeout (cluster.go)

A single unresponsive Subnet.Get consumed the entire 20-minute cluster prep budget. Logs confirmed: "preparing cluster done (1200.0s)".

Now: each GET has a 30-second timeout, retried every 5s for up to 2 minutes.

Evidence: 179 failures across 3 builds from "get subnet AzureBastionSubnet: context deadline exceeded"
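
A minimal sketch of the per-call-timeout retry, written against a plain armnetwork.SubnetsClient rather than the repo's config wrapper (the module version in the import path is assumed); the NotFound branch that creates the Bastion subnet is left out here:

package e2e

import (
	"context"
	"time"

	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/network/armnetwork/v4"
	"k8s.io/apimachinery/pkg/util/wait"
)

// getSubnetWithRetry bounds each ARM GET with its own timeout so a single hung
// call cannot consume the whole cluster-prep budget, and retries on a fixed
// interval until the overall deadline.
func getSubnetWithRetry(ctx context.Context, subnets *armnetwork.SubnetsClient, rg, vnetName, subnetName string) (*armnetwork.Subnet, error) {
	var subnet *armnetwork.Subnet
	err := wait.PollUntilContextTimeout(ctx, 5*time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
		// Each GET gets its own 30-second budget.
		callCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
		defer cancel()
		resp, getErr := subnets.Get(callCtx, rg, vnetName, subnetName, nil)
		if getErr != nil {
			// Treat the failure as transient; the next tick retries.
			return false, nil
		}
		subnet = &resp.Subnet
		return true, nil
	})
	return subnet, err
}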

4. Retry AKS subnet + route table lookup (aks_model.go)

Same per-call timeout pattern. Also handles kubenet route table propagation delays after cluster create/reuse.

Evidence: 39 failures across 6 builds from "AKS subnet has no route table associated"

5. Retry Firewall creation (aks_model.go)

BeginCreateOrUpdate + PollUntilDone are retried on failure. The operation is idempotent, so a retry either resumes the in-progress deployment or restarts it.

Evidence: 90 failures across 2 builds from "failed to create Firewall: context deadline exceeded"

6. Handle Deleting/Canceled cluster states (cluster.go)

waitUntilClusterReady treated Deleting as an unknown state and failed immediately. A cluster stuck in Deleting from a previous run killed all tests.

Now: waits for Deleting/Canceled/Canceling to finish. If the cluster ends up deleted (404), returns nil so a new cluster is created.

Evidence: 74 failures across 2 builds from "cluster ... is in state Deleting, and wont retry"
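
A minimal sketch of waiting out these lifecycle states, assuming an armcontainerservice.ManagedClustersClient (import path version assumed); the intervals are illustrative, and creating the replacement cluster is left to the caller:

package e2e

import (
	"context"
	"errors"
	"net/http"
	"time"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore"
	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/containerservice/armcontainerservice/v5"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitOutClusterDeletion polls the managed cluster until it is no longer in a
// Deleting/Canceling/Canceled state. A 404 means the deletion finished, so the
// caller can go on to create a fresh cluster.
func waitOutClusterDeletion(ctx context.Context, clusters *armcontainerservice.ManagedClustersClient, rg, name string) error {
	return wait.PollUntilContextTimeout(ctx, 15*time.Second, 20*time.Minute, true, func(ctx context.Context) (bool, error) {
		resp, err := clusters.Get(ctx, rg, name, nil)
		if err != nil {
			var respErr *azcore.ResponseError
			if errors.As(err, &respErr) && respErr.StatusCode == http.StatusNotFound {
				// Cluster is gone: done waiting, a new one gets created upstream.
				return true, nil
			}
			// Transient GET failure: retry on the next tick.
			return false, nil
		}
		state := ""
		if resp.Properties != nil && resp.Properties.ProvisioningState != nil {
			state = *resp.Properties.ProvisioningState
		}
		switch state {
		case "Deleting", "Canceling", "Canceled":
			// Still winding down from a previous run: keep waiting.
			return false, nil
		}
		return true, nil
	})
}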


Copilot AI left a comment


Pull request overview

This PR aims to improve E2E infrastructure setup reliability by adding retry/timeout patterns around Azure ARM operations and making cluster garbage-collection more tolerant so transient failures don’t cascade across all tests sharing a cluster.

Changes:

  • Add retry with per-call timeouts for Bastion subnet GET and for AKS subnet + route table readiness checks.
  • Make stale Kubernetes node GC tolerant to NotFound and non-fatal per-node delete errors.
  • Adjust VMSS GC bookkeeping so VMSS slated for deletion don’t shield their nodes from cleanup.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

Reviewed files:
  • e2e/cluster.go: Adds ARM GET retries/timeouts for Bastion subnet, changes VMSS “keep” set semantics, and makes stale node GC more tolerant.
  • e2e/aks_model.go: Adds retries/timeouts around AKS subnet/route table discovery and retries Azure Firewall creation.

Comment thread: e2e/cluster.go (outdated)

	return false, nil
})
if err != nil {
	return nil, fmt.Errorf("get subnet %q in vnet %q rg %q: %w", bastionSubnetName, vnet.name, nodeRG, subnetGetErr)
Comment thread: e2e/cluster.go, lines +726 to 729
// Build a set of VMSS names that should be kept — exclude VMSS that are
// being deleted so their stale K8s nodes can be cleaned up in the same pass.
existingVMSS := map[string]struct{}{}
pager := config.Azure.VMSS.NewListPager(rg, nil)
Comment thread: e2e/cluster.go (outdated), lines +753 to +763

		continue
	}
	toolkit.Logf(ctx, "deleted vmss %q (age: %v)", *vmss.ID, time.Since(*vmss.Properties.TimeCreated))
	// Don't add to existingVMSS — nodes from this VMSS should be cleaned up
Comment thread: e2e/aks_model.go, lines +319 to +338

// Retry the AKS subnet GET and route table lookup to tolerate:
// - transient ARM timeouts on the subnet GET
// - eventual consistency: kubenet route table may still be propagating after cluster create/reuse
var aksRTName string
err = wait.PollUntilContextTimeout(ctx, 5*time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
	callCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()
	aksSubnetResp, subnetErr := config.Azure.Subnet.Get(callCtx, rg, vnet.name, "aks-subnet", nil)
	if subnetErr != nil {
		toolkit.Logf(ctx, "transient error getting AKS subnet (retrying): %v", subnetErr)
		return false, nil
	}
	rtName, rtErr := ensureFirewallRouteTable(ctx, clusterModel, vnet.name, aksSubnetResp.Subnet)
	if rtErr != nil {
		toolkit.Logf(ctx, "route table not ready (retrying): %v", rtErr)
		return false, nil
	}
	aksRTName = rtName
	return true, nil
})
Comment thread: e2e/aks_model.go, lines +406 to +419

var fwResp armnetwork.AzureFirewallsClientCreateOrUpdateResponse
err = wait.PollUntilContextTimeout(ctx, 10*time.Second, 10*time.Minute, true, func(ctx context.Context) (bool, error) {
	fwPoller, fwErr := config.Azure.AzureFirewall.BeginCreateOrUpdate(ctx, rg, firewallName, *firewall, nil)
	if fwErr != nil {
		toolkit.Logf(ctx, "transient error starting firewall creation (retrying): %v", fwErr)
		return false, nil
	}
	fwResp, fwErr = fwPoller.PollUntilDone(ctx, nil)
	if fwErr != nil {
		toolkit.Logf(ctx, "firewall creation did not complete (retrying): %v", fwErr)
		return false, nil
	}
	return true, nil
})
r2k1 force-pushed the e2e-infra-reliability branch from 8c798e4 to d79fabe on May 10, 2026 at 23:34
Copilot AI review requested due to automatic review settings on May 10, 2026 at 23:36
r2k1 force-pushed the e2e-infra-reliability branch from d79fabe to a8baa38 on May 10, 2026 at 23:36

Address infrastructure failures that cause cascading test failures
(464 failures across 46 builds over 3 weeks).

1. Make stale node GC tolerant (cluster.go):
   - Ignore NotFound errors on node deletion (already gone)
   - Log other delete errors as warnings instead of failing
   - Only return error if zero nodes could be deleted
   Evidence: 79 failures from 'failed to delete 3 stale nodes'

2. Fix existingVMSS map (cluster.go):
   - Only add VMSS to existingVMSS if they are being kept
   - VMSS queued for deletion are excluded, so their stale K8s
     nodes can be cleaned up in the same pass
   - If VMSS deletion fails, keep in map to avoid orphaned deletes

3. Retry Bastion subnet GET (cluster.go):
   - Poll with backoff for up to 30s on transient ARM errors
   - 404 still handled normally (create subnet)
   Evidence: 179 failures from 'get subnet AzureBastionSubnet:
   context deadline exceeded'

4. Retry AKS subnet + route table lookup (aks_model.go):
   - Poll with backoff for up to 2 minutes
   - Handles both transient GET failures and kubenet route table
     propagation delays after cluster create/reuse
   Evidence: 39 failures from 'AKS subnet has no route table'

5. Retry Firewall creation (aks_model.go):
   - Poll with backoff for up to 10 minutes
   - BeginCreateOrUpdate is idempotent, safe to retry
   Evidence: 90 failures from 'failed to create Firewall:
   context deadline exceeded'

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
r2k1 force-pushed the e2e-infra-reliability branch from a8baa38 to e19fe06 on May 11, 2026 at 01:20

timmy-wright (Contributor) left a comment

LGTM - it'd be good to know which copilot comments you've responded to. Currently they are all unresolved, but it looks like you took some of them into account.

