Background
When a Karpenter-provisioned AKS node fails to bootstrap, the root cause is almost always in the VM's Custom Script Extension (CSE) output. CSE exits with a numeric error code defined in AgentBaker's cse_helpers.sh. These codes are precise and actionable — for example:
| Code | Constant | Meaning |
| --- | --- | --- |
| 211 | ERR_ORAS_PULL_NETWORK_TIMEOUT | ACR token exchange timed out |
| 212 | ERR_ORAS_PULL_UNAUTHORIZED | ACR pull unauthorized |
| 231 | ERR_IMDS_FETCH_FAILED | IMDS metadata fetch failed |
| 232 | ERR_ORAS_BINARY_NOT_FOUND | oras binary absent from image |
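To make the mapping concrete, here is a minimal sketch of a lookup that turns a numeric exit code into its constant and meaning. It covers only the four codes listed above; the names CseError, KNOWN_CSE_ERRORS, and describe_exit_code are hypothetical, and the authoritative list remains AgentBaker's cse_helpers.sh.

```python
from typing import NamedTuple, Optional


class CseError(NamedTuple):
    constant: str
    meaning: str


# Only the four codes quoted in this issue. The full set is defined in
# AgentBaker's cse_helpers.sh and is deliberately not reproduced here.
KNOWN_CSE_ERRORS = {
    211: CseError("ERR_ORAS_PULL_NETWORK_TIMEOUT", "ACR token exchange timed out"),
    212: CseError("ERR_ORAS_PULL_UNAUTHORIZED", "ACR pull unauthorized"),
    231: CseError("ERR_IMDS_FETCH_FAILED", "IMDS metadata fetch failed"),
    232: CseError("ERR_ORAS_BINARY_NOT_FOUND", "oras binary absent from image"),
}


def describe_exit_code(code: int) -> str:
    """Return 'code (CONSTANT): meaning', or a fallback for unlisted codes."""
    err: Optional[CseError] = KNOWN_CSE_ERRORS.get(code)
    if err is None:
        return f"{code} (unknown CSE exit code)"
    return f"{code} ({err.constant}): {err.meaning}"
```

A real implementation would more likely generate this table from cse_helpers.sh than maintain it by hand, so that new codes are picked up automatically.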
Currently, when a node fails to join the cluster, the operator sees something like:
NodeClaim/some-node: NotReady — waiting for node to be ready
...with no indication of the CSE exit code or error message. The operator must SSH into the node (often impossible on a failed node), inspect Azure portal VM extension logs, or dig through raw Azure API responses to find the real error.
This was directly observed in the wild: AzureLinux V3 image 202601.27.0 shipped without the oras binary, causing nodes to fail CSE with exit code 211 (ERR_ORAS_PULL_NETWORK_TIMEOUT) — a misleading code because the real cause was a missing binary. See Azure/AgentBaker#7907 and Azure/AgentBaker#7908 for context.
Problem Statement
Karpenter-provider-azure does not expose VM extension (CSE) failure details in:
- NodeClaim conditions
- NodeClaim events
- Karpenter controller logs (at a useful level)
This makes post-failure diagnosis significantly harder than it needs to be, especially for end users who lack access to the Azure portal or raw ARM APIs.
Proposed Enhancement
Surface CSE failure information in NodeClaim status so it is visible via kubectl and in the Karpenter controller logs.
Option A — NodeClaim Condition (preferred)
When a VM's CSE extension enters a failed state, set a condition on the NodeClaim:
status:
  conditions:
    - type: CSEProvisioned
      status: "False"
      reason: CSEFailed
      message: "CSE exited with code 211 (ERR_ORAS_PULL_NETWORK_TIMEOUT): failed to retrieve refresh token for oras login. See VM extension logs for details."
      lastTransitionTime: "2026-01-27T14:32:00Z"
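As an illustration of how such a condition could be assembled, the sketch below builds the same structure from an exit code, an optional constant name, and a detail string. The helper name build_cse_condition and its parameters are hypothetical; the field values mirror the example condition above.

```python
from datetime import datetime, timezone
from typing import Optional


def build_cse_condition(exit_code: int, constant: Optional[str], detail: str,
                        now: Optional[datetime] = None) -> dict:
    """Build a CSEProvisioned=False condition shaped like the example above."""
    now = now or datetime.now(timezone.utc)
    # Include the constant name only when the exit code maps to a known one.
    name = f" ({constant})" if constant else ""
    return {
        "type": "CSEProvisioned",
        "status": "False",
        "reason": "CSEFailed",
        "message": (f"CSE exited with code {exit_code}{name}: {detail}. "
                    "See VM extension logs for details."),
        # RFC 3339 timestamp in UTC, matching the format in the example.
        "lastTransitionTime": now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```

Keeping the numeric code at a fixed position in the message makes it easy to match from alerts or scripts without parsing the free-form detail text.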
Option B — NodeClaim Event
Emit a Kubernetes event on the NodeClaim when CSE fails:
Warning CSEFailed karpenter-controller CSE exited with code 211 on node karpenter-abc123: ERR_ORAS_PULL_NETWORK_TIMEOUT
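A rough sketch of the corresponding event payload follows, using field names from the core/v1 Event type. The helper name build_cse_event is hypothetical, the karpenter.sh/v1 apiVersion for NodeClaim is an assumption, and object metadata (name, namespace) is omitted for brevity.

```python
def build_cse_event(nodeclaim_name: str, node_name: str,
                    exit_code: int, constant: str) -> dict:
    """Build a Warning event body that references the affected NodeClaim."""
    return {
        "apiVersion": "v1",
        "kind": "Event",
        "type": "Warning",
        "reason": "CSEFailed",
        "message": f"CSE exited with code {exit_code} on node {node_name}: {constant}",
        "involvedObject": {
            # Assumption: NodeClaim is served under karpenter.sh/v1.
            "apiVersion": "karpenter.sh/v1",
            "kind": "NodeClaim",
            "name": nodeclaim_name,
        },
        "source": {"component": "karpenter-controller"},
    }
```

Events expire after a retention window, which is one reason to pair them with a condition when persistent status is needed.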
Option C — Both
Emit an event for immediate visibility and set the condition for persistent status.
Implementation Sketch
- After VM provisioning, poll the VM extension status via the Azure Compute API:
  GET /subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.Compute/virtualMachines/{vm}/extensions/CustomScript
- Check properties.provisioningState and properties.instanceView.statuses.
- If the extension is in a failed state, extract:
  - The message (contains the CSE stderr output including the exit code)
  - The numeric exit code (parseable from the message or from the extension status code)
- Update the NodeClaim condition and/or emit an event with the extracted details.
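The check-and-extract steps above could look roughly like the sketch below, which operates on the already-decoded JSON body of the extension GET response. Several things here are assumptions to verify: the function name extract_cse_failure is hypothetical, the two message patterns are guesses at how the exit code appears in CSE output, and the instance view is typically only included in the response when requested (for example with $expand=instanceView).

```python
import re
from typing import Optional, Tuple

# Assumed message shapes; confirm against real CSE failure output.
_EXIT_CODE_PATTERNS = (
    re.compile(r'"ExitCode"\s*:\s*"?(\d+)"?'),
    re.compile(r"exit status[=\s]+(\d+)"),
)


def extract_cse_failure(extension: dict) -> Optional[Tuple[Optional[int], str]]:
    """Return (exit_code, message) if the extension failed, else None.

    exit_code is None when the failure message matches no known pattern.
    """
    props = extension.get("properties") or {}
    statuses = (props.get("instanceView") or {}).get("statuses") or []
    # A status counts as failed if its level is Error or its code says so.
    failed_status = next(
        (s for s in statuses
         if s.get("level") == "Error" or "failed" in s.get("code", "").lower()),
        None,
    )
    if props.get("provisioningState") != "Failed" and failed_status is None:
        return None
    message = (failed_status or {}).get("message", "")
    for pattern in _EXIT_CODE_PATTERNS:
        match = pattern.search(message)
        if match:
            return int(match.group(1)), message
    return None, message
```

Returning the raw message even when no exit code can be parsed keeps the failure visible on the NodeClaim rather than silently dropping it.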
Why This Matters
- MTTR reduction: Operators can run kubectl describe nodeclaim <name> and immediately see the CSE exit code, cutting diagnosis from hours to minutes.
- User experience: End users without Azure portal access are currently blind to CSE failures.
- On-call ergonomics: Alerts can be written against specific CSE exit codes (e.g., alert on 232 = image missing oras binary) rather than just "node not ready".
Acceptance Criteria
- message includes at minimum: the numeric exit code and a human-readable summary.
- [...] ERR_ORAS_BINARY_NOT_FOUND), it is included in the message.
Related
- [...] ERR_ORAS_BINARY_NOT_FOUND=232)