fix: add pre-flight check for oras binary before ACR login #7908

Open
Lyqed wants to merge 1 commit into Azure:main from Lyqed:fix/check-oras-binary-before-acr-login

Lyqed commented Feb 19, 2026

Summary

Closes #7907

AzureLinux V3 image 202601.27.0 shipped without the oras binary. When oras_login_with_kubelet_identity runs on such a node, every call to oras fails silently, and the error eventually propagates as exit code 211 (ERR_ORAS_PULL_NETWORK_TIMEOUT). That code is misleading: it sends operators chasing network/IMDS issues when the real problem is that the binary is simply absent.

This PR adds a pre-flight guard at the top of oras_login_with_kubelet_identity that:

  1. Checks command -v oras before doing any ACR work.
  2. On failure: logs $PATH, probes the three canonical install locations (/usr/local/bin/oras, /usr/bin/oras, /opt/bin/oras), dumps /etc/os-release, and queries rpm (AzureLinux/Mariner) or dpkg (Ubuntu) for installed oras packages.
  3. Returns the new, unambiguous ERR_ORAS_BINARY_NOT_FOUND=232 error code so the failure is instantly understandable from CSE logs.
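
The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the function name check_oras_binary and the log wording are hypothetical, and the actual guard sits inline at the top of oras_login_with_kubelet_identity. Only the error code value, the three probed paths, /etc/os-release, and the rpm/dpkg split are taken from the description.

```shell
#!/usr/bin/env bash
# Sketch of the pre-flight guard. check_oras_binary is a hypothetical name.
ERR_ORAS_BINARY_NOT_FOUND=232

check_oras_binary() {
    local oras_path
    local candidate

    # Step 1: succeed early if oras resolves through PATH.
    if oras_path="$(command -v oras 2>/dev/null)"; then
        echo "oras found at ${oras_path}"
        return 0
    fi

    # Step 2: emit diagnostics so the CSE log explains the failure.
    echo "ERROR: oras binary not found in PATH"
    echo "PATH=${PATH}"
    for candidate in /usr/local/bin/oras /usr/bin/oras /opt/bin/oras; do
        if [ -e "${candidate}" ]; then
            echo "  ${candidate}: present"
        else
            echo "  ${candidate}: missing"
        fi
    done
    if [ -r /etc/os-release ]; then
        cat /etc/os-release
    fi
    if command -v rpm >/dev/null 2>&1; then
        # AzureLinux/Mariner
        rpm -qa | grep -i oras || echo "no oras rpm package installed"
    elif command -v dpkg >/dev/null 2>&1; then
        # Ubuntu
        dpkg -l | grep -i oras || echo "no oras deb package installed"
    fi

    # Step 3: return the dedicated error code.
    return "${ERR_ORAS_BINARY_NOT_FOUND}"
}
```

A caller would typically invoke it as `check_oras_binary || return $?` so that 232 propagates unchanged instead of being remapped to a later, less specific code.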

Changes

| File | Change |
| --- | --- |
| parts/linux/cloud-init/artifacts/cse_helpers.sh | Add ERR_ORAS_BINARY_NOT_FOUND=232; add pre-flight check in oras_login_with_kubelet_identity |
| pkg/agent/testdata/**/CustomData | Regenerated by make generate (snapshot test data embeds the script) |

Test plan

  • make generate — shellcheck passes on cse_helpers.sh; Go snapshot tests regenerated
  • make test — all unit tests pass
  • E2E: provision an AzureLinux V3 node with a mock/patched image that lacks oras; confirm CSE exits with code 232 and the diagnostic block appears in logs
  • E2E: provision a normal AzureLinux V3 node with oras present; confirm no regression (pre-flight check passes and login succeeds)
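
Before running the E2E cases, the missing-binary condition might be approximated locally by scrubbing PATH rather than building a patched image. The helper below is a hypothetical sketch (path_without_oras is not part of this PR or of AgentBaker): it rebuilds PATH without any directory that contains an executable oras. It removes the whole directory, so other tools in that directory disappear as well.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print PATH with every directory that holds an
# executable "oras" removed, to simulate an image that lacks the binary.
path_without_oras() {
    local old_ifs="$IFS"
    local dir
    local result=""

    IFS=:
    for dir in $PATH; do
        # Skip any directory that would let "command -v oras" succeed.
        if [ -x "${dir}/oras" ]; then
            continue
        fi
        result="${result:+${result}:}${dir}"
    done
    IFS="$old_ifs"

    printf '%s\n' "$result"
}
```

Running the function under test in a subshell, for example `( PATH="$(path_without_oras)"; oras_login_with_kubelet_identity ...; echo "exit=$?" )`, keeps the scrubbed PATH from leaking into the surrounding session. This does not replace the E2E cases, since it cannot reproduce how the image itself was built.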

Notes for reviewers

  • The pre-flight check is placed before the client_id/tenant_id guard and before any oras call, so it catches the missing-binary case regardless of ACR anonymity or identity configuration.
  • Cross-distro: uses command -v (POSIX), rpm (AzureLinux/Mariner), dpkg (Ubuntu) — no bashisms in the diagnostic path; local variable oras_path declared with local per shell script guidelines.
  • Error code 232 follows 231 (ERR_IMDS_FETCH_FAILED) in the numeric sequence and does not collide with any existing code.
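
The non-collision claim can be checked mechanically. The sketch below assumes error codes are defined one per line as ERR_NAME=NUMBER at the start of the line, optionally followed by a comment; that layout is an assumption about cse_helpers.sh, and find_duplicate_error_codes is a hypothetical helper, not an existing script in the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list numeric error codes that are assigned to more
# than one ERR_* name in the given file. Prints nothing when all are unique.
find_duplicate_error_codes() {
    awk -F'=' '
        /^ERR_[A-Z0-9_]+=[0-9]+/ {
            code = $2 + 0                      # drops any trailing comment
            names[code] = names[code] " " $1   # remember every name per code
            count[code]++
        }
        END {
            for (code in count)
                if (count[code] > 1) print code ":" names[code]
        }
    ' "$1"
}
```

For example, `find_duplicate_error_codes parts/linux/cloud-init/artifacts/cse_helpers.sh` would print one line per reused value. Any existing intentional aliases would also be reported, so the output needs a quick manual read.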

🤖 Generated with Claude Code

Without this check, a missing `oras` binary causes CSE to exit with
ERR_ORAS_PULL_NETWORK_TIMEOUT (211) — a misleading code that points
engineers at networking rather than the actual problem.

What:
- Add ERR_ORAS_BINARY_NOT_FOUND=232 error code to cse_helpers.sh
- Add pre-flight check at the start of oras_login_with_kubelet_identity
  that verifies `oras` is present in PATH before doing any ACR work

Why:
- AzureLinux V3 image 202601.27.0 shipped without the oras binary,
  causing all Karpenter-provisioned nodes to fail CSE with exit 211.
  The new check emits clear diagnostic output (PATH, known install
  paths, OS info, rpm/dpkg package list) and returns the unambiguous
  ERR_ORAS_BINARY_NOT_FOUND code so operators know immediately what
  went wrong.

How:
- command -v oras checked first; on failure, probe
  /usr/local/bin/oras, /usr/bin/oras, /opt/bin/oras and log OS info
  and installed packages via rpm (AzureLinux/Mariner) or dpkg (Ubuntu)
  before returning ERR_ORAS_BINARY_NOT_FOUND

Fixes Azure#7907
Development

Successfully merging this pull request may close these issues.

bug: oras binary missing from AzureLinux V3 image 202601.27.0 — CSE fails with exit code 211
