Skip to content

refactor: write ANC hotfix.json via boothook and rename target_version to scripts_version#8767

Open
Devinwong wants to merge 25 commits into
mainfrom
devinwong/anc-hotfix-cse-version-gating
Open

refactor: write ANC hotfix.json via boothook and rename target_version to scripts_version#8767
Devinwong wants to merge 25 commits into
mainfrom
devinwong/anc-hotfix-cse-version-gating

Conversation

@Devinwong

@Devinwong Devinwong commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

Summary

Write the ANC hotfix config (aks-node-controller-hotfix.json) directly from the boothook
instead of via a write_files entry in nodecustomdata.yml. This eliminates the double-parse
of nodecustomdata.yml in download-hotfix, and gates an optional script hotfix payload by
scripts_version.

Files / source of truth

  • Source of truth (committed, edited by operators): hotfix/anc-hotfix-version.json
    • version = ANC binary hotfix version to download/install
    • scripts_version = ANC/VHD version base that should receive the script hotfix write_files
  • Embedded config read by baker.go: parts/hotfix/aks-node-controller-hotfix.json
    • Generated from hotfix/anc-hotfix-version.json by hotfix/anc_hotfix_generate.py.
    • Embedded into the ABSVC binary via parts.Templates and shipped in custom data as
      /opt/azure/containers/aks-node-controller-hotfix.json.
  • Embedded script hotfix payload (optional): parts/hotfix/anc-scripts-hotfix-files.yml
    • Generated by hotfix/hotfix_generate.py; shipped as
      /opt/azure/containers/anc-scripts-hotfix-files.yml.

Both generated files are only committed on official/* hotfix branches. On normal builds they
are absent, which baker.go treats as "no active hotfix" (ships nothing).

Flow

flowchart TD
    A["ABSVC / baker.go\nreads parts/hotfix/aks-node-controller-hotfix.json\n(embedded at build time)"]
    B["Boothook\nwrite nodecustomdata.yml\nwrite aks-node-controller-hotfix.json\nwrite anc-scripts-hotfix-files.yml (if active)"]
    C["download-hotfix\nread aks-node-controller-hotfix.json"]
    D{"ANC binary\nhotfix applies?"}
    E["Download & stage\nhotfix binary"]
    F["Skip binary download"]
    S{"scripts_version\nmatches running version?"}
    T["apply script hotfix\nwrite_files"]
    U["skip script hotfix"]
    G["provision\nread nodecustomdata.yml\napply write_files\nrun NBC cmd"]

    A --> B --> C --> D
    D -- Yes --> E --> S
    D -- No --> F --> S
    S -- Yes --> T --> G
    S -- No --> U --> G
Loading

Note: the script hotfix write_files are applied inside download-hotfix (via
applyScriptHotfix), which the wrapper runs before provision. provision then
materializes the main nodecustomdata.yml and runs the NBC command.

Changes

File Change
pkg/agent/baker.go Add loadHotfixConfigJSON() and getLinuxNodeScriptsHotfixFilesPayload(); ship the gzipped hotfix config + optional script payload via the shared gzipFileBlockFmt boothook block / Flatcar Ignition entries. Absent embed files are treated as "no active hotfix".
hotfix/anc-hotfix-version.json Source-of-truth config ({version, scripts_version}), edited by operators on official/*.
parts/hotfix/aks-node-controller-hotfix.json Generated embed (not committed on main); produced by hotfix/anc_hotfix_generate.py.
parts/hotfix/anc-scripts-hotfix-files.yml Generated script-hotfix write_files payload (not committed on main); produced by hotfix/hotfix_generate.py.
hotfix/anc_hotfix_generate.py / hotfix/hotfix_generate.py Emit standalone payload files (no marker injection into nodecustomdata.yml); delete the generated file when no hotfix is active.
aks-node-controller/hotfix.go download-hotfix reads the config, conditionally downloads the binary, and applies the script hotfix write_files (gated by scripts_version). target_version → scripts_version.
aks-node-controller/nodecustomdata.go applyWriteFiles no-ops on a missing (optional) payload; applyNodeCustomData stays strict (errors on missing required nodecustomdata.yml).
.github/workflows/*hotfix-generate.yml Regenerate + commit/update/delete the generated embed files on hotfix PRs.
Tests / snapshots Updated renames; regenerated test data.

Ensure download-hotfix can materialize hotfix metadata before wrapper pre-checks, and gate script hotfix write_files by cse_version in scriptless --nbc-cmd provisioning.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 23, 2026 22:33

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refines the scriptless --nbc-cmd hotfix flow for aks-node-controller (ANC) by ensuring the hotfix config file can be materialized early from nodecustomdata.yml, adding cse_version-based gating for applying auto-generated script hotfix write_files, and updating tests around the new behavior.

Changes:

  • Always run aks-node-controller download-hotfix as a pre-check in the wrapper before selecting the binary to execute.
  • Extend ANC hotfix config parsing to support optional cse_version, and skip applying auto-generated script hotfix blocks when it doesn’t match.
  • Materialize only the hotfix config write_file from nodecustomdata.yml before evaluating hotfix applicability; add/update unit tests.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
parts/linux/cloud-init/artifacts/aks-node-controller-wrapper.sh Always runs download-hotfix pre-check before selecting hotfix vs VHD-baked ANC binary.
hotfix/anc_hotfix_generate.py Extends injected hotfix JSON payload to optionally include cse_version, with format validation.
aks-node-controller/nodecustomdata.go Adds script-hotfix block stripping gated by cse_version, plus filtered application for specific write_files.
aks-node-controller/nodecustomdata_test.go Adds a unit test for cse_version mismatch behavior when applying nodecustomdata write_files.
aks-node-controller/hotfix.go Materializes hotfix config from nodecustomdata before reading; adds cse_version parsing and gating helper.
aks-node-controller/hotfix_test.go Updates/extends tests for forward-compatible parsing, config materialization, and cse_version gating helper.

Comment thread aks-node-controller/nodecustomdata_test.go Outdated
Comment thread parts/linux/cloud-init/artifacts/aks-node-controller-wrapper.sh Outdated
Devin Wong and others added 2 commits June 23, 2026 15:52
Use target_version naming for script hotfix gating and ANC hotfix config generation to reflect generic version targeting.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Allow target_version in YYYYMM.DD form to match any patch for that base while keeping exact YYYYMM.DD.PATCH support.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 23, 2026 23:03

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Comment thread aks-node-controller/hotfix.go Outdated
Document base and exact target_version matching semantics in ANC hotfix gating logic.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Devinwong Devinwong changed the title Add cse_version gating for script hotfix flow Add target_version-gated script hotfix flow Jun 23, 2026
Add shellspec assertions that wrapper invokes download-hotfix before provision when config or nbc-cmd exists, and update hotfix config error wording.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 23, 2026 23:15
@Devinwong Devinwong changed the title Add target_version-gated script hotfix flow fix: gate script hotfix by target version Jun 23, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comment thread aks-node-controller/nodecustomdata_test.go Outdated
Comment thread aks-node-controller/nodecustomdata.go Outdated
Devin Wong and others added 2 commits June 23, 2026 16:33
Move script hotfix write_files application into download-hotfix (gated by target_version) while keeping ANC binary hotfix install behavior. This makes script hotfix file materialization part of the download-hotfix phase.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Only trim a preceding newline when removing script hotfix block if it is an actual blank separator, and extend mismatch test to validate post-hotfix write_files still apply.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 23, 2026 23:48

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comment thread aks-node-controller/nodecustomdata_test.go Outdated
Comment thread hotfix/anc_hotfix_generate.py Outdated
Devin Wong and others added 2 commits June 23, 2026 17:04
Rename inner error variable in download-hotfix script hotfix apply step to avoid shadowing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 24, 2026 00:09

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.

@Devinwong Devinwong changed the title fix: gate script hotfix by target version fix: gate script hotfix by target version and materialize aks-node-controller-hotfix.json before hotfix pre-check Jun 24, 2026
@Devinwong Devinwong marked this pull request as ready for review June 24, 2026 00:17

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Comment thread pkg/agent/baker.go Outdated
Comment thread aks-node-controller/nodecustomdata.go
Only invoke 'aks-node-controller download-hotfix' when the hotfix config
JSON exists on disk, instead of spawning the ANC binary on every boot to
have it no-op. Encapsulates the hotfix decision in the wrapper and saves
boot time on nodes without an active hotfix.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 24, 2026 22:48

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Comment thread pkg/agent/baker.go Outdated
Comment thread aks-node-controller/nodecustomdata.go
Devin Wong and others added 2 commits June 24, 2026 16:01
Remove the committed placeholder parts/hotfix/aks-node-controller-hotfix.json ({})
and parts/hotfix/anc-scripts-hotfix-files.yml (empty write_files). They only exist
on official/* branches when a hotfix is active.

- baker.go: getLinuxNodeScriptsHotfixFilesPayload and loadHotfixConfigJSON now treat
  fs.ErrNotExist as 'no active hotfix' (return empty), instead of a hard error.
- hack/hotfix_generate.py: reset_payload removes the file (was: write placeholder);
  write_payload makes the parts/hotfix dir before writing.
- hotfix/anc_hotfix_generate.py: removes the file when no hotfix (was: write {});
  makes the dir before writing.
- Both GH workflows: commit step now handles create/update/delete via the contents
  API. Fixes a stale bug where anc-hotfix-generate diffed nodecustomdata.yml instead
  of the JSON config file.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-locate the script-hotfix payload generator with the other hotfix tooling and
data (anc-hotfix-version.json, anc_hotfix_generate.py) under hotfix/. Updates the
hotfix-generate GH Action invocation and in-file usage/comment references.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 24, 2026 23:14

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

hotfix/hotfix_generate.py:327

  • reset_payload() deletes OUTPUT so there is no committed payload file. The log line "payload reset to placeholder" is misleading because no placeholder is written; the file is removed. This can confuse hotfix action logs when diagnosing whether a payload exists.

Comment thread aks-node-controller/nodecustomdata.go
Comment thread aks-node-controller/hotfix.go
Comment thread .github/workflows/anc-hotfix-generate.yml
Devin Wong and others added 2 commits June 24, 2026 16:23
Addresses awesomenix's review comment to generalize the per-file boothook block
format. Previously hotfix.json shipped plain-base64 via a separate hotfixConfigBlockFmt;
now it is gzipped like the other payloads and reuses the single gzipFileBlockFmt.

- baker.go: encode hotfix config with getBase64EncodedGzippedCustomScriptFromStr;
  use gzipFileBlockFmt for it; delete hotfixConfigBlockFmt.
- flatcar: add compression:gzip to flatcarHotfixConfigEntry to match the others.
- On-node result is unchanged: the boothook gunzips it back to plain JSON on disk.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Addresses awesomenix's comment about over-massaging the hotfix config file:
- Existence is handled idiomatically via fs.ErrNotExist (no separate stat needed
  for embed.FS).
- Drop the redundant '{}'/len==0 check (subsumed by the empty-fields guard; empty
  data already errors at json.Unmarshal).
- Drop per-field and final TrimSpace; the generator writes compact JSON and ANC
  tolerates whitespace when parsing, so return the raw bytes.
- Keep one cheap semantic guard: empty version+scripts_version means no active hotfix.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 24, 2026 23:25

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

aks-node-controller/nodecustomdata.go:53

  • applyWriteFiles is now a general helper for optional #cloud-config write_files payloads (e.g., script hotfix payload), but its error messages still say "nodecustomdata". This will be misleading when failures come from anc-scripts-hotfix-files.yml (or other future payloads). Consider updating the error strings to refer to a generic write_files payload/document.
func applyWriteFiles(path string) error {
	data, err := os.ReadFile(path)
	if err != nil {
		if os.IsNotExist(err) {
			return nil
		}
		return fmt.Errorf("read nodecustomdata %s: %w", path, err)
	}

	var customData nodeCustomData
	if err := yaml.Unmarshal(data, &customData); err != nil {
		return fmt.Errorf("unmarshal nodecustomdata %s: %w", path, err)
	}

Comment thread aks-node-controller/nodecustomdata.go
Comment thread pkg/agent/baker.go
Comment thread pkg/agent/baker.go
Devin Wong and others added 2 commits June 24, 2026 16:44
Addresses awesomenix's 'TrimSpace everywhere?' comment. The hotfix config JSON is
generated by our own tooling (compact JSON) and the version is our build ldflag, so
the defensive re-trims were noise.

- readHotfixConfig: drop field-level TrimSpace on Version/ScriptsVersion; empty-file
  guard uses len(data) instead of TrimSpace.
- shouldApplyScriptsVersion / shouldUpgradeToHotfix: compare the values directly.
- Keep the /etc/os-release parsing trims (that file is the OS's, genuine text parsing).
- Drop the now-invalid 'trimmed exact match' test case.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tion

Addresses Copilot review comments:
- applyNodeCustomData (required NBCCmd path, exit 240) now errors on a missing
  nodecustomdata.yml instead of inheriting applyWriteFiles' optional no-op, so a
  missing required file fails loudly instead of silently skipping provisioning.
  applyWriteFiles keeps no-op-on-absence for optional payloads (script hotfix).
- getLinuxNodeScriptsHotfixFilesPayload detects write_files entries with a
  line-anchored helper (line starts with '- path:') instead of strings.Contains,
  so a stray '- path:' inside a comment or file content can't be a false positive.
- Drop stale 'placeholder' wording now that no placeholder file is committed.
- Add TestApplyNodeCustomDataMissingFileErrors.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 24, 2026 23:50

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (2)

hotfix/hotfix_generate.py:327

  • When no matching write_files blocks are found, reset_payload() deletes the payload file, but the final log says "payload reset to placeholder". This message is misleading and makes it harder to reason about whether a payload file should exist/has been committed.
    aks-node-controller/nodecustomdata.go:54
  • applyWriteFiles is now used for optional non-nodecustomdata payloads (e.g., the script hotfix), but the read error still says "read nodecustomdata" which can be confusing when debugging failures on anc-scripts-hotfix-files.yml.
	data, err := os.ReadFile(path)
	if err != nil {
		if os.IsNotExist(err) {
			return nil
		}
		return fmt.Errorf("read nodecustomdata %s: %w", path, err)
	}

Comment thread aks-node-controller/hotfix.go
Comment thread aks-node-controller/hotfix.go
Devin Wong and others added 2 commits June 24, 2026 20:22
Addresses Copilot review: readHotfixConfig now treats whitespace-only files as an
empty/no-op config (a partial/truncated write no longer errors download-hotfix) and
trims version/scripts_version once at this boundary. Downstream comparisons
(shouldUpgradeToHotfix, shouldApplyScriptsVersion) stay trim-free since they consume
the already-normalized values -- keeping TrimSpace at the single boundary instead of
scattering it. Added whitespace-only and trimmed-version test cases.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…yNodeCustomData

The applyNodeCustomData strictness change (missing required file now errors) broke
TestApp_Run/TestApp_Provision NBCCmd cases that never created a nodecustomdata.yml and
previously relied on the silent no-op. NewTestApp now writes a minimal empty-write_files
nodecustomdata.yml and sets App.nodeCustomDataPath; the one raw-&App{} test does the same.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 25, 2026 03:47

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

hotfix/hotfix_generate.py:327

  • The log message says the payload was “reset to placeholder”, but reset_payload() actually deletes the file (absence = no active hotfix). This can mislead operators/debugging when no matching blocks are found.

if err := a.downloadHotfixBinary(ctx, hotfixCfg); err != nil {
return err
}
return a.applyScriptHotfix(hotfixCfg)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommend applying scripts hotfix before binary download, greater chance of success for scripts not to fail

return fmt.Errorf("read hotfix version from %s: %w", hotfixPath, err)
return fmt.Errorf("read hotfix config from %s: %w", hotfixPath, err)
}
if !exists {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need this exists flag here?

Isnt Version == "" && Scripts_Version == "" mean the same or just fall through and everything should work out since we check for emptyness in individual functions anyways. Or simple

if hotfixCfg.Version != "" {
   if err := a.downloadHotfixBinary(ctx, hotfixCfg.Version); err != nil {
		return err
	}
}

if hotfixCfg.Scripts_Version != "" {
   if err := a.applyScriptHotfix(ctx, hotfixCfg.Scripts_version); err != nil {
		return err
	}
}

return nil

// nodecustomdata.yml is required input for the NBCCmd provisioning path, so a missing
// file must be a hard error. applyWriteFiles itself treats absence as a no-op for
// optional payloads (e.g. the script hotfix), so guard existence here before delegating.
if _, err := os.Stat(path); err != nil {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already perform the check inside applyWriteFiles this is not needed here

@aks-node-assistant

Copy link
Copy Markdown
Contributor

Failed gate run

Run: https://msazure.visualstudio.com/CloudNativeCompute/_build/results?buildId=169423717
Failed job/stage/task: e2e / Run AgentBaker E2E / Run AgentBaker E2E

Detective summary

Test_Ubuntu2204_ContainerdHasCurrentVersion/scriptless_nbc reached node validation, then validators.go:1010 reported localdns-exporter@...service unexpectedly entered a failed state. Timeline and build status both corroborate the E2E task exit code 1; this signature is already recurring across unrelated gate runs.

Likely cause and signature

Likely recurring LocalDNS exporter/test-infra flake rather than a deterministic PR regression. Signature: localdns-exporter-systemd-failed-state. Confidence: medium-high. Strongest alternative: this PR's hotfix/customdata changes contributed to adjacent bootstrap failures, but the LocalDNS failed-unit signature predates this PR and is already tracked.

Recommended owner/action

Node Lifecycle E2E/LocalDNS owners should continue investigating the failed-unit scenario bundle. PR author action is not clearly indicated from this evidence.

Evidence

@aks-node-assistant

Copy link
Copy Markdown
Contributor

Failed gate run

Run: https://msazure.visualstudio.com/CloudNativeCompute/_build/results?buildId=169447199
Failed job/stage/task: e2e / Run AgentBaker E2E / Run AgentBaker E2E

Detective summary

Test_Ubuntu2204_HTTPSProxy_PrivateDNS/default failed during VMSS CSE provisioning with CustomScript ExitCode=50 and KubeletStartTime=n/a. Timeline and build status confirm the E2E task failed; the Build VHD exit-code-2 warning retried successfully and was not the failed pair.

Likely cause and signature

Known recurring CSE/provisioning flake. Signature: ubuntu2204-httpsproxy-privatedns-cse-exit50-outbound-kubelet-na. Confidence: medium-high. Strongest alternative: PR #8767 customdata/hotfix refactor caused CSE failure, but the exact kubelet-n/a plus mandb/oldlocal pattern is already tracked under repair Bug #38559852 across unrelated PRs.

Recommended owner/action

Node Lifecycle repair owner should continue Bug #38559852; PR author should investigate only if subsequent runs show a new ANC/customdata-specific signature.

Evidence

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants