refactor: write ANC hotfix.json via boothook and rename target_version to scripts_version#8767
refactor: write ANC hotfix.json via boothook and rename target_version to scripts_version#8767Devinwong wants to merge 25 commits into
Conversation
Ensure download-hotfix can materialize hotfix metadata before wrapper pre-checks, and gate script hotfix write_files by cse_version in scriptless --nbc-cmd provisioning. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR refines the scriptless --nbc-cmd hotfix flow for aks-node-controller (ANC) by ensuring the hotfix config file can be materialized early from nodecustomdata.yml, adding cse_version-based gating for applying auto-generated script hotfix write_files, and updating tests around the new behavior.
Changes:
- Always run
aks-node-controller download-hotfixas a pre-check in the wrapper before selecting the binary to execute. - Extend ANC hotfix config parsing to support optional
cse_version, and skip applying auto-generated script hotfix blocks when it doesn’t match. - Materialize only the hotfix config
write_filefromnodecustomdata.ymlbefore evaluating hotfix applicability; add/update unit tests.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| parts/linux/cloud-init/artifacts/aks-node-controller-wrapper.sh | Always runs download-hotfix pre-check before selecting hotfix vs VHD-baked ANC binary. |
| hotfix/anc_hotfix_generate.py | Extends injected hotfix JSON payload to optionally include cse_version, with format validation. |
| aks-node-controller/nodecustomdata.go | Adds script-hotfix block stripping gated by cse_version, plus filtered application for specific write_files. |
| aks-node-controller/nodecustomdata_test.go | Adds a unit test for cse_version mismatch behavior when applying nodecustomdata write_files. |
| aks-node-controller/hotfix.go | Materializes hotfix config from nodecustomdata before reading; adds cse_version parsing and gating helper. |
| aks-node-controller/hotfix_test.go | Updates/extends tests for forward-compatible parsing, config materialization, and cse_version gating helper. |
Use target_version naming for script hotfix gating and ANC hotfix config generation to reflect generic version targeting. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Allow target_version in YYYYMM.DD form to match any patch for that base while keeping exact YYYYMM.DD.PATCH support. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Document base and exact target_version matching semantics in ANC hotfix gating logic. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add shellspec assertions that wrapper invokes download-hotfix before provision when config or nbc-cmd exists, and update hotfix config error wording. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move script hotfix write_files application into download-hotfix (gated by target_version) while keeping ANC binary hotfix install behavior. This makes script hotfix file materialization part of the download-hotfix phase. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Only trim a preceding newline when removing script hotfix block if it is an actual blank separator, and extend mismatch test to validate post-hotfix write_files still apply. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Rename inner error variable in download-hotfix script hotfix apply step to avoid shadowing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Only invoke 'aks-node-controller download-hotfix' when the hotfix config JSON exists on disk, instead of spawning the ANC binary on every boot to have it no-op. Encapsulates the hotfix decision in the wrapper and saves boot time on nodes without an active hotfix. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove the committed placeholder parts/hotfix/aks-node-controller-hotfix.json ({})
and parts/hotfix/anc-scripts-hotfix-files.yml (empty write_files). They only exist
on official/* branches when a hotfix is active.
- baker.go: getLinuxNodeScriptsHotfixFilesPayload and loadHotfixConfigJSON now treat
fs.ErrNotExist as 'no active hotfix' (return empty), instead of a hard error.
- hack/hotfix_generate.py: reset_payload removes the file (was: write placeholder);
write_payload makes the parts/hotfix dir before writing.
- hotfix/anc_hotfix_generate.py: removes the file when no hotfix (was: write {});
makes the dir before writing.
- Both GH workflows: commit step now handles create/update/delete via the contents
API. Fixes a stale bug where anc-hotfix-generate diffed nodecustomdata.yml instead
of the JSON config file.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-locate the script-hotfix payload generator with the other hotfix tooling and data (anc-hotfix-version.json, anc_hotfix_generate.py) under hotfix/. Updates the hotfix-generate GH Action invocation and in-file usage/comment references. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
hotfix/hotfix_generate.py:327
- reset_payload() deletes OUTPUT so there is no committed payload file. The log line "payload reset to placeholder" is misleading because no placeholder is written; the file is removed. This can confuse hotfix action logs when diagnosing whether a payload exists.
Addresses awesomenix's review comment to generalize the per-file boothook block format. Previously hotfix.json shipped plain-base64 via a separate hotfixConfigBlockFmt; now it is gzipped like the other payloads and reuses the single gzipFileBlockFmt. - baker.go: encode hotfix config with getBase64EncodedGzippedCustomScriptFromStr; use gzipFileBlockFmt for it; delete hotfixConfigBlockFmt. - flatcar: add compression:gzip to flatcarHotfixConfigEntry to match the others. - On-node result is unchanged: the boothook gunzips it back to plain JSON on disk. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Addresses awesomenix's comment about over-massaging the hotfix config file:
- Existence is handled idiomatically via fs.ErrNotExist (no separate stat needed
for embed.FS).
- Drop the redundant '{}'/len==0 check (subsumed by the empty-fields guard; empty
data already errors at json.Unmarshal).
- Drop per-field and final TrimSpace; the generator writes compact JSON and ANC
tolerates whitespace when parsing, so return the raw bytes.
- Keep one cheap semantic guard: empty version+scripts_version means no active hotfix.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
aks-node-controller/nodecustomdata.go:53
- applyWriteFiles is now a general helper for optional #cloud-config write_files payloads (e.g., script hotfix payload), but its error messages still say "nodecustomdata". This will be misleading when failures come from anc-scripts-hotfix-files.yml (or other future payloads). Consider updating the error strings to refer to a generic write_files payload/document.
func applyWriteFiles(path string) error {
data, err := os.ReadFile(path)
if err != nil {
if os.IsNotExist(err) {
return nil
}
return fmt.Errorf("read nodecustomdata %s: %w", path, err)
}
var customData nodeCustomData
if err := yaml.Unmarshal(data, &customData); err != nil {
return fmt.Errorf("unmarshal nodecustomdata %s: %w", path, err)
}
Addresses awesomenix's 'TrimSpace everywhere?' comment. The hotfix config JSON is generated by our own tooling (compact JSON) and the version is our build ldflag, so the defensive re-trims were noise. - readHotfixConfig: drop field-level TrimSpace on Version/ScriptsVersion; empty-file guard uses len(data) instead of TrimSpace. - shouldApplyScriptsVersion / shouldUpgradeToHotfix: compare the values directly. - Keep the /etc/os-release parsing trims (that file is the OS's, genuine text parsing). - Drop the now-invalid 'trimmed exact match' test case. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tion Addresses Copilot review comments: - applyNodeCustomData (required NBCCmd path, exit 240) now errors on a missing nodecustomdata.yml instead of inheriting applyWriteFiles' optional no-op, so a missing required file fails loudly instead of silently skipping provisioning. applyWriteFiles keeps no-op-on-absence for optional payloads (script hotfix). - getLinuxNodeScriptsHotfixFilesPayload detects write_files entries with a line-anchored helper (line starts with '- path:') instead of strings.Contains, so a stray '- path:' inside a comment or file content can't be a false positive. - Drop stale 'placeholder' wording now that no placeholder file is committed. - Add TestApplyNodeCustomDataMissingFileErrors. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (2)
hotfix/hotfix_generate.py:327
- When no matching write_files blocks are found, reset_payload() deletes the payload file, but the final log says "payload reset to placeholder". This message is misleading and makes it harder to reason about whether a payload file should exist/has been committed.
aks-node-controller/nodecustomdata.go:54 - applyWriteFiles is now used for optional non-nodecustomdata payloads (e.g., the script hotfix), but the read error still says "read nodecustomdata" which can be confusing when debugging failures on anc-scripts-hotfix-files.yml.
data, err := os.ReadFile(path)
if err != nil {
if os.IsNotExist(err) {
return nil
}
return fmt.Errorf("read nodecustomdata %s: %w", path, err)
}
Addresses Copilot review: readHotfixConfig now treats whitespace-only files as an empty/no-op config (a partial/truncated write no longer errors download-hotfix) and trims version/scripts_version once at this boundary. Downstream comparisons (shouldUpgradeToHotfix, shouldApplyScriptsVersion) stay trim-free since they consume the already-normalized values -- keeping TrimSpace at the single boundary instead of scattering it. Added whitespace-only and trimmed-version test cases. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…yNodeCustomData
The applyNodeCustomData strictness change (missing required file now errors) broke
TestApp_Run/TestApp_Provision NBCCmd cases that never created a nodecustomdata.yml and
previously relied on the silent no-op. NewTestApp now writes a minimal empty-write_files
nodecustomdata.yml and sets App.nodeCustomDataPath; the one raw-&App{} test does the same.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (1)
hotfix/hotfix_generate.py:327
- The log message says the payload was “reset to placeholder”, but reset_payload() actually deletes the file (absence = no active hotfix). This can mislead operators/debugging when no matching blocks are found.
| if err := a.downloadHotfixBinary(ctx, hotfixCfg); err != nil { | ||
| return err | ||
| } | ||
| return a.applyScriptHotfix(hotfixCfg) |
There was a problem hiding this comment.
Recommend applying scripts hotfix before binary download, greater chance of success for scripts not to fail
| return fmt.Errorf("read hotfix version from %s: %w", hotfixPath, err) | ||
| return fmt.Errorf("read hotfix config from %s: %w", hotfixPath, err) | ||
| } | ||
| if !exists { |
There was a problem hiding this comment.
Why do you need this exists flag here?
Isnt Version == "" && Scripts_Version == "" mean the same or just fall through and everything should work out since we check for emptyness in individual functions anyways. Or simple
if hotfixCfg.Version != "" {
if err := a.downloadHotfixBinary(ctx, hotfixCfg.Version); err != nil {
return err
}
}
if hotfixCfg.Scripts_Version != "" {
if err := a.applyScriptHotfix(ctx, hotfixCfg.Scripts_version); err != nil {
return err
}
}
return nil
| // nodecustomdata.yml is required input for the NBCCmd provisioning path, so a missing | ||
| // file must be a hard error. applyWriteFiles itself treats absence as a no-op for | ||
| // optional payloads (e.g. the script hotfix), so guard existence here before delegating. | ||
| if _, err := os.Stat(path); err != nil { |
There was a problem hiding this comment.
we already perform the check inside applyWriteFiles this is not needed here
Failed gate runRun: https://msazure.visualstudio.com/CloudNativeCompute/_build/results?buildId=169423717 Detective summary
Likely cause and signatureLikely recurring LocalDNS exporter/test-infra flake rather than a deterministic PR regression. Signature: Recommended owner/actionNode Lifecycle E2E/LocalDNS owners should continue investigating the failed-unit scenario bundle. PR author action is not clearly indicated from this evidence. Evidence
|
Failed gate runRun: https://msazure.visualstudio.com/CloudNativeCompute/_build/results?buildId=169447199 Detective summary
Likely cause and signatureKnown recurring CSE/provisioning flake. Signature: Recommended owner/actionNode Lifecycle repair owner should continue Bug #38559852; PR author should investigate only if subsequent runs show a new ANC/customdata-specific signature. Evidence
|
Summary
Write the ANC hotfix config (
aks-node-controller-hotfix.json) directly from the boothookinstead of via a
write_filesentry innodecustomdata.yml. This eliminates the double-parseof
nodecustomdata.ymlindownload-hotfix, and gates an optional script hotfix payload byscripts_version.Files / source of truth
hotfix/anc-hotfix-version.jsonversion= ANC binary hotfix version to download/installscripts_version= ANC/VHD version base that should receive the script hotfixwrite_filesbaker.go:parts/hotfix/aks-node-controller-hotfix.jsonhotfix/anc-hotfix-version.jsonbyhotfix/anc_hotfix_generate.py.parts.Templatesand shipped in custom data as/opt/azure/containers/aks-node-controller-hotfix.json.parts/hotfix/anc-scripts-hotfix-files.ymlhotfix/hotfix_generate.py; shipped as/opt/azure/containers/anc-scripts-hotfix-files.yml.Both generated files are only committed on
official/*hotfix branches. On normal builds theyare absent, which
baker.gotreats as "no active hotfix" (ships nothing).Flow
flowchart TD A["ABSVC / baker.go\nreads parts/hotfix/aks-node-controller-hotfix.json\n(embedded at build time)"] B["Boothook\nwrite nodecustomdata.yml\nwrite aks-node-controller-hotfix.json\nwrite anc-scripts-hotfix-files.yml (if active)"] C["download-hotfix\nread aks-node-controller-hotfix.json"] D{"ANC binary\nhotfix applies?"} E["Download & stage\nhotfix binary"] F["Skip binary download"] S{"scripts_version\nmatches running version?"} T["apply script hotfix\nwrite_files"] U["skip script hotfix"] G["provision\nread nodecustomdata.yml\napply write_files\nrun NBC cmd"] A --> B --> C --> D D -- Yes --> E --> S D -- No --> F --> S S -- Yes --> T --> G S -- No --> U --> GNote: the script hotfix
write_filesare applied insidedownload-hotfix(viaapplyScriptHotfix), which the wrapper runs beforeprovision.provisionthenmaterializes the main
nodecustomdata.ymland runs the NBC command.Changes
pkg/agent/baker.goloadHotfixConfigJSON()andgetLinuxNodeScriptsHotfixFilesPayload(); ship the gzipped hotfix config + optional script payload via the sharedgzipFileBlockFmtboothook block / Flatcar Ignition entries. Absent embed files are treated as "no active hotfix".hotfix/anc-hotfix-version.json{version, scripts_version}), edited by operators onofficial/*.parts/hotfix/aks-node-controller-hotfix.jsonmain); produced byhotfix/anc_hotfix_generate.py.parts/hotfix/anc-scripts-hotfix-files.ymlwrite_filespayload (not committed onmain); produced byhotfix/hotfix_generate.py.hotfix/anc_hotfix_generate.py/hotfix/hotfix_generate.pynodecustomdata.yml); delete the generated file when no hotfix is active.aks-node-controller/hotfix.godownload-hotfixreads the config, conditionally downloads the binary, and applies the script hotfixwrite_files(gated byscripts_version).target_version→scripts_version.aks-node-controller/nodecustomdata.goapplyWriteFilesno-ops on a missing (optional) payload;applyNodeCustomDatastays strict (errors on missing requirednodecustomdata.yml)..github/workflows/*hotfix-generate.yml