enhancement: surface CSE exit code and error details in NodeClaim conditions #1441

@Lyqed

Background

When a Karpenter-provisioned AKS node fails to bootstrap, the root cause is almost always in the VM's Custom Script Extension (CSE) output. CSE exits with a numeric error code defined in AgentBaker's cse_helpers.sh. These codes are precise and actionable — for example:

| Code | Constant | Meaning |
| --- | --- | --- |
| 211 | ERR_ORAS_PULL_NETWORK_TIMEOUT | ACR token exchange timed out |
| 212 | ERR_ORAS_PULL_UNAUTHORIZED | ACR pull unauthorized |
| 231 | ERR_IMDS_FETCH_FAILED | IMDS metadata fetch failed |
| 232 | ERR_ORAS_BINARY_NOT_FOUND | oras binary absent from image |
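
A lookup like the table above could be carried in the provider as a small map. This is a minimal Go sketch only: it holds just the four codes quoted in this issue, the names `cseExitCodes` and `describeExitCode` are hypothetical, and the authoritative list remains AgentBaker's cse_helpers.sh.

```go
package main

import "fmt"

// cseError pairs an AgentBaker constant name with its meaning.
type cseError struct {
	Constant string
	Meaning  string
}

// cseExitCodes is a hypothetical lookup holding only the four codes
// quoted in this issue; it is not a complete copy of cse_helpers.sh.
var cseExitCodes = map[int]cseError{
	211: {"ERR_ORAS_PULL_NETWORK_TIMEOUT", "ACR token exchange timed out"},
	212: {"ERR_ORAS_PULL_UNAUTHORIZED", "ACR pull unauthorized"},
	231: {"ERR_IMDS_FETCH_FAILED", "IMDS metadata fetch failed"},
	232: {"ERR_ORAS_BINARY_NOT_FOUND", "oras binary absent from image"},
}

// describeExitCode renders a code as "<code> (<constant>): <meaning>",
// falling back to the bare number when the code is not in the map.
func describeExitCode(code int) string {
	if e, ok := cseExitCodes[code]; ok {
		return fmt.Sprintf("%d (%s): %s", code, e.Constant, e.Meaning)
	}
	return fmt.Sprintf("%d (unknown CSE exit code)", code)
}
```

Unknown codes are passed through rather than dropped, so a code added to AgentBaker later would still surface as a number.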

Currently, when a node fails to join the cluster, the operator sees something like:

NodeClaim/some-node: NotReady — waiting for node to be ready

...with no indication of the CSE exit code or error message. The operator must SSH into the node (often impossible on a failed node), inspect Azure portal VM extension logs, or dig through raw Azure API responses to find the real error.

This was directly observed in the wild: AzureLinux V3 image 202601.27.0 shipped without the oras binary, causing nodes to fail CSE with exit code 211 (ERR_ORAS_PULL_NETWORK_TIMEOUT) — a misleading code because the real cause was a missing binary. See Azure/AgentBaker#7907 and Azure/AgentBaker#7908 for context.

Problem Statement

Karpenter-provider-azure does not expose VM extension (CSE) failure details in:

  • NodeClaim conditions
  • NodeClaim events
  • Karpenter controller logs (at a useful level)

This makes post-failure diagnosis significantly harder than it needs to be, especially for end users who lack access to the Azure portal or raw ARM APIs.

Proposed Enhancement

Surface CSE failure information in NodeClaim status so it is visible via kubectl and in the Karpenter controller logs.

Option A — NodeClaim Condition (preferred)

When a VM's CSE extension enters a failed state, set a condition on the NodeClaim:

status:
  conditions:
    - type: CSEProvisioned
      status: "False"
      reason: CSEFailed
      message: "CSE exited with code 211 (ERR_ORAS_PULL_NETWORK_TIMEOUT): failed to retrieve refresh token for oras login. See VM extension logs for details."
      lastTransitionTime: "2026-01-27T14:32:00Z"

Option B — NodeClaim Event

Emit a Kubernetes event on the NodeClaim when CSE fails:

Warning  CSEFailed  karpenter-controller  CSE exited with code 211 on node karpenter-abc123: ERR_ORAS_PULL_NETWORK_TIMEOUT

Option C — Both

Emit an event for immediate visibility and set the condition for persistent status.

Implementation Sketch

  1. After VM provisioning, poll the VM extension status via the Azure Compute API:
    • GET /subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.Compute/virtualMachines/{vm}/extensions/CustomScript
    • Check properties.provisioningState and properties.instanceView.statuses (the instance view is only returned when the request includes $expand=instanceView)
  2. If the extension is in a failed state, extract:
    • message (contains the CSE stderr output including the exit code)
    • The numeric exit code (parseable from the message or from the extension status code)
  3. Update the NodeClaim condition and/or emit an event with the extracted details.
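
Step 2 could look roughly like the Go sketch below, which decodes an extension response and pulls out the exit code. Several things here are assumptions: the response is expected to include the instance view, the struct models only the fields named above, and the regular expression guesses at message wordings such as "exit status=211" or "exited with code 211". The real CSE message format should be confirmed before relying on this. `parseCSEFailure` and `cseFailure` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"regexp"
	"strconv"
	"strings"
)

// vmExtension models only the response fields mentioned in the steps above.
type vmExtension struct {
	Properties struct {
		ProvisioningState string `json:"provisioningState"`
		InstanceView      struct {
			Statuses []struct {
				Code    string `json:"code"`
				Level   string `json:"level"`
				Message string `json:"message"`
			} `json:"statuses"`
		} `json:"instanceView"`
	} `json:"properties"`
}

// cseFailure carries what could be extracted; ExitCode is -1 when no
// code was found in any status message.
type cseFailure struct {
	ExitCode int
	Message  string
}

// ASSUMPTION: the exit code appears after "exit status", "exit code",
// "ExitCode", or "exited with code". Adjust to the real CSE output.
var exitCodeRe = regexp.MustCompile(`(?i)exit(?:ed with)?\s*(?:status|code)\W*(\d+)`)

// parseCSEFailure reports failed=false for any provisioning state other
// than "Failed"; otherwise it returns the first message with a parseable
// exit code, or the last non-empty message if none parses.
func parseCSEFailure(body []byte) (f cseFailure, failed bool, err error) {
	var ext vmExtension
	if err = json.Unmarshal(body, &ext); err != nil {
		return f, false, err
	}
	if !strings.EqualFold(ext.Properties.ProvisioningState, "Failed") {
		return f, false, nil
	}
	f.ExitCode = -1
	for _, s := range ext.Properties.InstanceView.Statuses {
		if s.Message == "" {
			continue
		}
		f.Message = s.Message
		if m := exitCodeRe.FindStringSubmatch(s.Message); m != nil {
			if n, convErr := strconv.Atoi(m[1]); convErr == nil {
				f.ExitCode = n
			}
			break
		}
	}
	return f, true, nil
}
```

Keeping the raw message alongside the parsed code means the condition can still carry useful text when the code cannot be extracted.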

Why This Matters

  • MTTR reduction: Operators can run kubectl describe nodeclaim <name> and immediately see the CSE exit code, cutting diagnosis from hours to minutes.
  • User experience: End users without Azure portal access are currently blind to CSE failures.
  • On-call ergonomics: Alerts can be written against specific CSE exit codes (e.g., alert on 232 = image missing oras binary) rather than just "node not ready".

Acceptance Criteria

  • When a VM's CSE fails, the NodeClaim condition reflects the failure within the next reconcile loop.
  • The condition message includes at minimum: the numeric exit code and a human-readable summary.
  • If the CSE output contains the AgentBaker error constant name (e.g., ERR_ORAS_BINARY_NOT_FOUND), it is included in the message.
  • The enhancement does not introduce additional Azure API calls in the hot path for healthy nodes (poll only on failure detection).
