feat(anc): wire check-hotfix into node wrapper behind ENABLE_PROVISIONING_HOTFIX #8715
Read-channel pivot: the hotfix-pointer read moves from Option 2 (kube-system anc-hotfix-version ConfigMap via apiserver + bootstrap token) to Option 4 (LPS endpoint, IMDS-attested), validated by e2e showing the node can reach LPS pre-kubelet. The fetch/auth rewrite lives in #8696 (2.1b). This wrapper wiring is channel-agnostic: the check-hotfix -> download-hotfix call sequence, the ENABLE_PROVISIONING_HOTFIX gate (relaxed by 2.1d via the enable_provisioning_hotfix contract field), and the fail-open semantics are all unchanged. Only comments/wording were updated to name the new read channel.
Add a default-off ANC_HOTFIX_ENABLED-gated call to the 2.1b check-hotfix subcommand in aks-node-controller-wrapper.sh, placed before the existing download-hotfix block since check-hotfix refreshes the hotfix pointer that block consumes. The call is fail-open and wrapped defensively so it can never block provisioning. When the flag is unset/non-true the wrapper behaves exactly as before (6-month VHD backward compat). Parameterize HOTFIX_JSON to match the existing path-var pattern and enable shellspec coverage of the download-hotfix branch. Add shellspec tests for flag off, flag on ordering, fail-open, and non-true value handling. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Clarify that the check-hotfix non-zero (fail-open) case also models a node whose VHD-baked binary predates 2.1b, where check-hotfix is an unknown subcommand. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Match the design's EnableProvisioningHotfix aks-rp region toggle and AgentBaker's contract->env naming convention (EnableIMDSRestriction -> ENABLE_IMDS_RESTRICTION), so the toggle -> absvc -> ANC opt-in chain stays traceable. No behavior change; still default-off and fail-open. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The hotfix pointer read channel moved from the kube-system ConfigMap (apiserver + bootstrap token) to the LPS endpoint (IMDS-attested); the fetch/auth rewrite lives in 2.1b. The wrapper's check-hotfix -> download-hotfix call contract, the ENABLE_PROVISIONING_HOTFIX gate, and the fail-open semantics are unchanged - only the explanatory comment is updated to name the new read channel accurately. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Changes cached containers or packages on windows VHDs
Please get a Windows SIG member to approve. The following diff file shows any additions or deletions from what will be cached on Windows VHDs, organised by VHD type.
diff --git a/vhd_files/2022-containerd-gen2.txt b/vhd_files/2022-containerd-gen2.txt
index db10c9e..c51a47f 100644
--- a/vhd_files/2022-containerd-gen2.txt
+++ b/vhd_files/2022-containerd-gen2.txt
@@ -122,0 +123 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.34.6-windows-hp
+mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.2-windows-hp
@@ -124 +124,0 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.3-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.4-windows-hp
@@ -129,0 +130 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/secrets-store/driver:v1.5.4
+mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.33.11-windows-hpc-1
@@ -131 +131,0 @@ mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.33.13-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.33.14-windows-hpc-1
@@ -133 +133,2 @@ mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.34.10-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.34.11-windows-hpc-1
+mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.34.8-windows-hpc-1
+mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.35.3-windows-hpc-1
@@ -135 +135,0 @@ mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.35.5-windows-hpc
-mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.35.6-windows-hpc-1
@@ -137 +136,0 @@ mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.36.1-windows-hpc
-mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.36.2-windows-hpc-1
diff --git a/vhd_files/2022-containerd.txt b/vhd_files/2022-containerd.txt
index 94de353..7312c49 100644
--- a/vhd_files/2022-containerd.txt
+++ b/vhd_files/2022-containerd.txt
@@ -122,0 +123 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.34.6-windows-hp
+mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.2-windows-hp
@@ -124 +124,0 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.3-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.4-windows-hp
@@ -129,0 +130 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/secrets-store/driver:v1.5.4
+mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.33.11-windows-hpc-1
@@ -131 +131,0 @@ mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.33.13-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.33.14-windows-hpc-1
@@ -133 +133,2 @@ mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.34.10-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.34.11-windows-hpc-1
+mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.34.8-windows-hpc-1
+mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.35.3-windows-hpc-1
@@ -135 +135,0 @@ mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.35.5-windows-hpc
-mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.35.6-windows-hpc-1
@@ -137 +136,0 @@ mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.36.1-windows-hpc
-mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.36.2-windows-hpc-1
diff --git a/vhd_files/2025-gen2.txt b/vhd_files/2025-gen2.txt
index d0ea692..36e3641 100644
--- a/vhd_files/2025-gen2.txt
+++ b/vhd_files/2025-gen2.txt
@@ -52,0 +53 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.34.6-windows-hp
+mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.2-windows-hp
@@ -54 +54,0 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.3-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.4-windows-hp
@@ -59,0 +60 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/secrets-store/driver:v1.5.4
+mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.33.11-windows-hpc-1
@@ -61 +61,0 @@ mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.33.13-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.33.14-windows-hpc-1
@@ -63 +63,2 @@ mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.34.10-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.34.11-windows-hpc-1
+mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.34.8-windows-hpc-1
+mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.35.3-windows-hpc-1
@@ -65 +65,0 @@ mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.35.5-windows-hpc
-mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.35.6-windows-hpc-1
@@ -67 +66,0 @@ mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.36.1-windows-hpc
-mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.36.2-windows-hpc-1
diff --git a/vhd_files/2025.txt b/vhd_files/2025.txt
index ab44d8b..b8873d5 100644
--- a/vhd_files/2025.txt
+++ b/vhd_files/2025.txt
@@ -52,0 +53 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.34.6-windows-hp
+mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.2-windows-hp
@@ -54 +54,0 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.3-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.4-windows-hp
@@ -59,0 +60 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/secrets-store/driver:v1.5.4
+mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.33.11-windows-hpc-1
@@ -61 +61,0 @@ mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.33.13-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.33.14-windows-hpc-1
@@ -63 +63,2 @@ mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.34.10-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.34.11-windows-hpc-1
+mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.34.8-windows-hpc-1
+mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.35.3-windows-hpc-1
@@ -65 +65,0 @@ mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.35.5-windows-hpc
-mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.35.6-windows-hpc-1
@@ -67 +66,0 @@ mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.36.1-windows-hpc
-mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.36.2-windows-hpc-1
2.1c - Wire check-hotfix into the node wrapper (shell only)
POC / M1 draft. Shell-only wiring for the Provisioning-Hotfix flow. No Go changes.
Why this exists (the scale-up gap it closes)
There are two writers of the hotfix pointer file
(/opt/azure/containers/aks-node-controller-hotfix.json):
(A) A write_files entry generated from hotfix/anc-hotfix-version.json into nodecustomdata.yml (hotfix/anc_hotfix_generate.py). This lands in the VMSS custom data, i.e. the VMSS model.
(B) check-hotfix (this wiring), which pulls the pointer live at boot and refreshes the same file before download-hotfix gates/installs.
With (A) alone, the pointer is only as fresh as the VMSS model. When an existing nodepool
scales up (autoscale or manual), the new instance boots from the nodepool's frozen model,
so it gets whatever hotfix pointer was baked in at the last model PUT - NOT the current
one. A hotfix published after that point silently misses exactly the newest nodes until a
control-plane reconcile refreshes every affected model. That stale-model-on-scale-up gap
is the core problem this design (and the PoC) targets.
(B) closes it: check-hotfix runs in the wrapper's ExecStart and reads the pointer live
from the LPS endpoint (IMDS-attested, reachable pre-kubelet), decoupled from the VMSS
model. A scale-up node booting from a months-old model still converges to the current
hotfix state. The read channel is deliberately pre-kubelet / IMDS-attested (not
apiserver/ConfigMap) precisely so a brand-new scale-up node can fetch it before it has a
kubeconfig. (A) remains useful as the cold-start/offline default baked into the model;
(B) is the authoritative live override.
Enablement (where this sits in the rollout chain)
This env gate is the on-node terminal of the design's region-staged opt-in:
EnableProvisioningHotfix aks-rp toggle (AKS Toggles-as-code, per region) -> absvc
respects toggle -> ANC respects toggle. This PR implements only the last hop ("ANC
respects toggle"). The env var name mirrors the toggle/contract name to match the
existing contract->env convention (e.g. EnableIMDSRestriction -> ENABLE_IMDS_RESTRICTION),
keeping the chain traceable. Wiring absvc to render this var from a contract field is a
separate follow-up PR; the aks-rp toggle + toggle YAML live in the aks-rp repo. Until
those land, the var renders unset everywhere, so this change is inert (default-off).
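As an illustration of the contract->env naming convention mentioned above, the mapping can be expressed mechanically. This is only a sketch: contract_to_env is a hypothetical helper, and the real names may well be maintained by hand rather than generated.

```shell
#!/bin/sh
# Hypothetical helper: convert a CamelCase contract/toggle name into the
# UPPER_SNAKE_CASE env var name, keeping acronyms such as IMDS together.
contract_to_env() {
    printf '%s\n' "$1" |
        sed -E 's/([a-z0-9])([A-Z])/\1_\2/g; s/([A-Z]+)([A-Z][a-z])/\1_\2/g' |
        tr '[:lower:]' '[:upper:]'
}

contract_to_env EnableIMDSRestriction      # prints ENABLE_IMDS_RESTRICTION
contract_to_env EnableProvisioningHotfix   # prints ENABLE_PROVISIONING_HOTFIX
```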
Note: 2.1d (#8717) relaxes this env gate, moving the on/off decision into the Go binary
via the enable_provisioning_hotfix contract field (single source of truth). This PR
intentionally ADDS the gate; #8717 relaxes it, so each PR stays reviewable on its own.
What this does
Adds one call to the check-hotfix subcommand (added in 2.1b) inside
aks-node-controller-wrapper.sh, gated behind a new env flag
ENABLE_PROVISIONING_HOTFIX that is OFF by default.
check-hotfix reads the hotfix pointer from the LPS endpoint (IMDS-attested) and refreshes
$HOTFIX_JSON, which the existing download-hotfix block consumes - so it must run
first. The call is fail-open (the command always exits 0) and additionally wrapped
defensively so it can never block provisioning.
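A minimal sketch of the call order described above, assuming the existing download-hotfix block is gated on the pointer file being present. The function names, the log helper, and the BIN_PATH default are hypothetical; the real wrapper may be structured differently.

```shell
#!/bin/sh
# Illustrative only; not the actual aks-node-controller-wrapper.sh.
# The BIN_PATH default below is a placeholder, not the production path.
BIN_PATH="${BIN_PATH:-./aks-node-controller}"
HOTFIX_JSON="${HOTFIX_JSON:-/opt/azure/containers/aks-node-controller-hotfix.json}"

log() { echo "wrapper: $*"; }

refresh_hotfix_pointer() {
    # Default-off: only the literal string "true" enables the call.
    if [ "${ENABLE_PROVISIONING_HOTFIX:-}" = "true" ]; then
        # Fail-open: a non-zero exit is logged and otherwise ignored.
        if "$BIN_PATH" check-hotfix; then
            log "check-hotfix completed"
        else
            log "check-hotfix failed; continuing (fail-open)"
        fi
    fi
    return 0
}

maybe_download_hotfix() {
    # Assumed shape of the pre-existing step: act only if a pointer file exists.
    if [ -f "$HOTFIX_JSON" ]; then
        "$BIN_PATH" download-hotfix || log "download-hotfix failed; continuing"
    fi
    return 0
}

# Intended order in the wrapper: refresh first, then consume.
#   refresh_hotfix_pointer
#   maybe_download_hotfix
```

Running the refresh first means download-hotfix sees the pointer that check-hotfix just wrote rather than the one baked into the VMSS model.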
Default-off / fail-open guarantee
When ENABLE_PROVISIONING_HOTFIX is unset, empty, or any value other than the literal
string true, the wrapper behaves EXACTLY as it does today. This preserves the
6-month VHD backward-compatibility window: older VHDs running newer CSE, and newer
VHDs running older CSE, are unaffected unless the flag is explicitly turned on.
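The strictness of the gate can be sketched as follows. The function name hotfix_flag_enabled is hypothetical; the point is only that a POSIX string comparison against true treats every other spelling as off.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only for the exact string "true".
hotfix_flag_enabled() {
    [ "${1:-}" = "true" ]
}

# Try a few values that might plausibly be rendered into the env var.
for value in "" "true" "True" "TRUE" "1" "yes" " true"; do
    if hotfix_flag_enabled "$value"; then
        echo "'$value' -> check-hotfix runs"
    else
        echo "'$value' -> wrapper unchanged"
    fi
done
```

Only the second value enables the call; case variants, numeric truthy values, and padded strings all leave the wrapper on its existing path.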
Known-safe: old VHD + flag on
If ENABLE_PROVISIONING_HOTFIX=true ever reaches a node whose VHD-baked ANC binary predates
2.1b, "$BIN_PATH" check-hotfix is an unknown subcommand and exits non-zero. The
if ... else log "...continuing (fail-open)" fi wrapper swallows that error, so
provisioning still proceeds unchanged. This path is covered by shellspec case 4 below
(check-hotfix exits non-zero -> wrapper still provisions), which models the missing
subcommand. This matters for the 6-month VHD support window.
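The old-VHD case can be simulated with a stand-in for a pre-2.1b binary. Everything named here is an assumption for illustration: old_anc, simulate_boot, the provision subcommand name, and the exact error text and exit code of the real binary.

```shell
#!/bin/sh
# Stand-in for an ANC binary that predates 2.1b: it knows nothing about
# check-hotfix and rejects it as an unknown subcommand.
old_anc() {
    case "$1" in
        provision|download-hotfix) return 0 ;;
        *) echo "unknown command: $1" >&2; return 1 ;;
    esac
}

log() { echo "wrapper: $*"; }

simulate_boot() {
    BIN_PATH=old_anc
    ENABLE_PROVISIONING_HOTFIX=true
    if [ "${ENABLE_PROVISIONING_HOTFIX:-}" = "true" ]; then
        if "$BIN_PATH" check-hotfix 2>/dev/null; then
            log "hotfix pointer refreshed"
        else
            # The non-zero exit is swallowed here instead of aborting the boot.
            log "check-hotfix failed; continuing (fail-open)"
        fi
    fi
    "$BIN_PATH" provision && log "provisioning proceeded"
}

simulate_boot
```

Because the failure is handled in the else branch, the provisioning step still runs even though the subcommand does not exist.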
Before / after flow
Flag off (default - unchanged):
Flag on (ENABLE_PROVISIONING_HOTFIX=true):
Notes
check-hotfix takes no flags/args; it reads the AKSNodeConfig from its default on-node path internally to find the LPS endpoint (IMDS-attested) it reads from, so the wrapper passes nothing.
HOTFIX_JSON is parameterized as ${HOTFIX_JSON:-<default>} to match the existing BIN_PATH/CONFIG_PATH/NBC_CMD_PATH pattern and to allow shellspec to exercise the download-hotfix branch. Production default path is unchanged.
check-hotfix writes to defaultHotfixVersionPath (/opt/azure/containers/aks-node-controller-hotfix.json, hotfix.go) and download-hotfix
reads the same constant. The wrapper's HOTFIX_JSON default is byte-identical, and the
Go hotfixVersionPath override exists only for tests (no env/production override and
check-hotfix takes no path flag), so the two never diverge on a node.
POSIX sh constructs only ([ ], =, ${VAR:-}); passes shellcheck generic + POSIX (SC3010/SC3014)
and the wrapper shellspec suite (8 examples, 0 failures).
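The HOTFIX_JSON parameterization noted above follows the usual ${VAR:-default} pattern. A small sketch, where resolve_hotfix_json is a hypothetical helper and /tmp/spec/hotfix.json is an arbitrary scratch path a test might choose:

```shell
#!/bin/sh
# Hypothetical helper showing how the default is resolved.
resolve_hotfix_json() {
    echo "${HOTFIX_JSON:-/opt/azure/containers/aks-node-controller-hotfix.json}"
}

# Production: nothing is set, so the default path applies.
unset HOTFIX_JSON
resolve_hotfix_json   # prints /opt/azure/containers/aks-node-controller-hotfix.json

# Test harness: point the wrapper at a scratch file instead.
HOTFIX_JSON="/tmp/spec/hotfix.json"
resolve_hotfix_json   # prints /tmp/spec/hotfix.json
```

Note that the :- form also falls back to the default when the variable is set but empty, so an empty value cannot redirect the wrapper to an empty path.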
Tests
New shellspec cases in aks_node_controller_wrapper_spec.sh:
Stack
Base is set to the 2.1b branch so the diff shows only the wrapper + shellspec changes.
Will retarget to main as the stack merges down.
This unblocks the on-node e2e PoC tests (fail-open and multi-base) since check-hotfix
is otherwise never invoked at boot.