enhancement: surface CSE exit code and error details in NodeClaim conditions #1441

## Background

When a Karpenter-provisioned AKS node fails to bootstrap, the root cause is almost always in the VM's Custom Script Extension (CSE) output. CSE exits with a numeric error code defined in [AgentBaker's cse_helpers.sh](https://github.com/Azure/AgentBaker/blob/main/parts/linux/cloud-init/artifacts/cse_helpers.sh). These codes are precise and actionable — for example:

| Code | Constant | Meaning |
|---|---|---|
| 211 | `ERR_ORAS_PULL_NETWORK_TIMEOUT` | ACR token exchange timed out |
| 212 | `ERR_ORAS_PULL_UNAUTHORIZED` | ACR pull unauthorized |
| 231 | `ERR_IMDS_FETCH_FAILED` | IMDS metadata fetch failed |
| 232 | `ERR_ORAS_BINARY_NOT_FOUND` | `oras` binary absent from image |
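
To show how these codes could be turned into readable text on the Karpenter side, here is a minimal Go sketch of a lookup table. It mirrors only the four rows above and has not been checked against `cse_helpers.sh`. The names `cseError`, `knownCSEErrors`, and `describeCSEExitCode` are hypothetical and are not existing Karpenter or AgentBaker identifiers.

```go
package main

import "fmt"

// cseError describes one AgentBaker CSE exit code (hypothetical type).
type cseError struct {
	Constant string
	Meaning  string
}

// knownCSEErrors mirrors only the four rows of the table in this issue.
// A real implementation would need to stay in sync with cse_helpers.sh.
var knownCSEErrors = map[int]cseError{
	211: {"ERR_ORAS_PULL_NETWORK_TIMEOUT", "ACR token exchange timed out"},
	212: {"ERR_ORAS_PULL_UNAUTHORIZED", "ACR pull unauthorized"},
	231: {"ERR_IMDS_FETCH_FAILED", "IMDS metadata fetch failed"},
	232: {"ERR_ORAS_BINARY_NOT_FOUND", "oras binary absent from image"},
}

// describeCSEExitCode renders an exit code as "<code> (<CONSTANT>): <meaning>",
// falling back to a generic string for codes that are not in the map.
func describeCSEExitCode(code int) string {
	if e, ok := knownCSEErrors[code]; ok {
		return fmt.Sprintf("%d (%s): %s", code, e.Constant, e.Meaning)
	}
	return fmt.Sprintf("%d (unknown CSE exit code)", code)
}

func main() {
	fmt.Println(describeCSEExitCode(232))
	fmt.Println(describeCSEExitCode(999))
}
```

Falling back to the bare number for unknown codes keeps the output useful when AgentBaker adds codes that the map does not yet know about.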

Currently, when a node fails to join the cluster, the operator sees something like:

```
NodeClaim/some-node: NotReady — waiting for node to be ready
```

...with no indication of the CSE exit code or error message. The operator must SSH into the node (often impossible on a failed node), inspect Azure portal VM extension logs, or dig through raw Azure API responses to find the real error.

This was observed in practice: AzureLinux V3 image **202601.27.0** shipped without the `oras` binary, and nodes failed CSE with exit code **211** (`ERR_ORAS_PULL_NETWORK_TIMEOUT`). That code was misleading, because the real cause was the missing binary rather than a network timeout; Azure/AgentBaker#7908 adds `ERR_ORAS_BINARY_NOT_FOUND` (232) for this case. See Azure/AgentBaker#7907 and Azure/AgentBaker#7908 for context.

## Problem Statement

Karpenter-provider-azure does not expose VM extension (CSE) failure details in:
- `NodeClaim` conditions
- `NodeClaim` events
- Karpenter controller logs (at a level of detail useful for diagnosis)

This makes post-failure diagnosis significantly harder than it needs to be, especially for end users who lack access to the Azure portal or raw ARM APIs.

## Proposed Enhancement

Surface CSE failure information in NodeClaim status so it is visible via `kubectl` and the Karpenter controller.

### Option A — NodeClaim Condition (preferred)

When a VM's CSE extension enters a failed state, set a condition on the NodeClaim:

```yaml
status:
  conditions:
    - type: CSEProvisioned
      status: "False"
      reason: CSEFailed
      message: "CSE exited with code 211 (ERR_ORAS_PULL_NETWORK_TIMEOUT): failed to retrieve refresh token for oras login. See VM extension logs for details."
      lastTransitionTime: "2026-01-27T14:32:00Z"
```

### Option B — NodeClaim Event

Emit a Kubernetes event on the NodeClaim when CSE fails:

```
Warning  CSEFailed  karpenter-controller  CSE exited with code 211 on node karpenter-abc123: ERR_ORAS_PULL_NETWORK_TIMEOUT
```

### Option C — Both

Emit an event for immediate visibility and set the condition for persistent status.

## Implementation Sketch

1. After VM provisioning, poll the VM extension status via the Azure Compute API:
   - `GET /subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.Compute/virtualMachines/{vm}/extensions/CustomScript`
   - Check `properties.provisioningState` and `properties.instanceView.statuses` (the instance view is only included when the request sets `$expand=instanceView`)
2. If the extension is in a failed state, extract:
   - `message` (contains the CSE stderr output including the exit code)
   - The numeric exit code (parseable from the message or from the extension status code)
3. Update the NodeClaim condition and/or emit an event with the extracted details.
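
Steps 1 and 2 could be sketched in Go roughly as follows. The struct fields follow the `properties.provisioningState` and `properties.instanceView.statuses` paths named above. The `exit status=<n>` pattern is an assumption about how the extension message is worded and would need to be checked against real CSE output. `parseCSEFailure` and `cseResult` are hypothetical names, and the JSON in `main` is a hand-written sample rather than a captured API response.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"strconv"
)

// extensionStatus is one entry of properties.instanceView.statuses.
type extensionStatus struct {
	Code    string `json:"code"`
	Level   string `json:"level"`
	Message string `json:"message"`
}

// extension holds only the fields of the GET response used here.
type extension struct {
	Properties struct {
		ProvisioningState string `json:"provisioningState"`
		InstanceView      struct {
			Statuses []extensionStatus `json:"statuses"`
		} `json:"instanceView"`
	} `json:"properties"`
}

// cseResult is the outcome of inspecting one extension response.
type cseResult struct {
	Failed   bool
	ExitCode int // -1 when the extension failed but no code could be parsed
	Message  string
}

// exitStatusRe encodes the ASSUMED message format "exit status=<n>".
var exitStatusRe = regexp.MustCompile(`exit status=(\d+)`)

// parseCSEFailure decodes the extension JSON and, if the extension is in the
// Failed state, extracts the first exit code found in a status message.
func parseCSEFailure(body []byte) (cseResult, error) {
	var ext extension
	if err := json.Unmarshal(body, &ext); err != nil {
		return cseResult{}, fmt.Errorf("decoding extension response: %w", err)
	}
	if ext.Properties.ProvisioningState != "Failed" {
		return cseResult{}, nil
	}
	for _, s := range ext.Properties.InstanceView.Statuses {
		m := exitStatusRe.FindStringSubmatch(s.Message)
		if m == nil {
			continue
		}
		if n, err := strconv.Atoi(m[1]); err == nil {
			return cseResult{Failed: true, ExitCode: n, Message: s.Message}, nil
		}
	}
	return cseResult{Failed: true, ExitCode: -1}, nil
}

func main() {
	// Hand-written sample payload, for illustration only.
	sample := []byte(`{"properties":{"provisioningState":"Failed",
		"instanceView":{"statuses":[{"level":"Error",
		"message":"command terminated with exit status=211"}]}}}`)
	res, err := parseCSEFailure(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("failed=%t exitCode=%d\n", res.Failed, res.ExitCode)
}
```

Returning `-1` for a failed extension without a parseable code lets the caller still set the condition with a generic message instead of silently dropping the failure.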

## Why This Matters

- **MTTR reduction**: Operators can run `kubectl describe nodeclaim <name>` and immediately see the CSE exit code, cutting diagnosis from hours to minutes.
- **User experience**: End users without Azure portal access are currently blind to CSE failures.
- **On-call ergonomics**: Alerts can be written against specific CSE exit codes (e.g., alert on 232, meaning the image is missing the `oras` binary) rather than just "node not ready".

## Acceptance Criteria

- [ ] When a VM's CSE fails, the NodeClaim condition reflects the failure within the next reconcile loop.
- [ ] The condition `message` includes at minimum: the numeric exit code and a human-readable summary.
- [ ] If the CSE output contains the AgentBaker error constant name (e.g., `ERR_ORAS_BINARY_NOT_FOUND`), it is included in the message.
- [ ] The enhancement does not introduce additional Azure API calls in the hot path for healthy nodes (poll only on failure detection).

## Related

- Azure/AgentBaker#7907 — bug: oras binary missing from AzureLinux V3 image 202601.27.0
- Azure/AgentBaker#7908 — fix: add pre-flight check for oras binary before ACR login (adds `ERR_ORAS_BINARY_NOT_FOUND=232`)
