Skip to content

[FA] Gate start task DONE on DaemonSet rollout readiness#3028

Open
coignetp wants to merge 5 commits into
mainfrom
paul/fa-done-state
Open

[FA] Gate start task DONE on DaemonSet rollout readiness#3028
coignetp wants to merge 5 commits into
mainfrom
paul/fa-done-state

Conversation

@coignetp
Copy link
Copy Markdown
Contributor

@coignetp coignetp commented May 19, 2026

Summary

  • Gates the start task DONE state on actual DaemonSet rollout completion (all pods updated and ready), preventing a premature DONE before the rollout finishes.
  • Switches setTaskState to proto.Clone so all existing fields are preserved without manual enumeration.

Test plan

  • go test ./pkg/fleet/... passes
  • New TestRunPendingOperationWorker_Start* tests cover rollout gating (nil agent, partial update, complete)
  • TestSetTaskState_PreservesAllFields pins proto.Clone behaviour
  • TestIsDaemonSetRolloutComplete covers nil / zero-desired / partial / complete cases

- Gate start task completion on DaemonSet rollout (all pods updated and
  ready) so the DONE state reflects actual rollout progress rather than
  just task dispatch.
- Report rollout progress in Task.Error.Message while RUNNING so the RC
  backend can surface it to the user.
- Populate RunningVersion in PackageState RC state; use proto.Clone in
  setTaskState so all fields are preserved without manual enumeration.
- Revert INVALID_STATE for promote precondition failures back to ERROR
  (INVALID_STATE is reserved for RC config state mismatches).
- Bump datadog-agent/pkg/proto to v0.77.0.
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 41.78%. Comparing base (20ecb9e) to head (a2a84af).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3028      +/-   ##
==========================================
+ Coverage   41.50%   41.78%   +0.27%     
==========================================
  Files         335      336       +1     
  Lines       28714    29101     +387     
==========================================
+ Hits        11919    12161     +242     
- Misses      16001    16128     +127     
- Partials      794      812      +18     
Flag Coverage Δ
unittests 41.78% <100.00%> (+0.27%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
pkg/fleet/daemon_worker.go 77.48% <100.00%> (+1.25%) ⬆️

... and 5 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 20ecb9e...a2a84af. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@datadog-datadog-prod-us1-2
Copy link
Copy Markdown

datadog-datadog-prod-us1-2 Bot commented May 19, 2026

Pipelines

Fix all issues with BitsAI

⚠️ Warnings

🚦 5 Pipeline jobs failed

DataDog/datadog-operator | build   View in Datadog   GitLab

🔧 Fix in code (Fix with Cursor). Compilation error due to undefined: v2alpha1.AnnotationPendingPreExperimentHash in pkg/fleet/daemon_operations.go

DataDog/datadog-operator | build_operator_image_arm64   View in Datadog   GitLab

🔧 Fix in code (Fix with Cursor). Undefined: v2alpha1.AnnotationPendingPreExperimentHash in pkg/fleet/daemon_operations.go and pkg/fleet/daemon_worker.go leads to build failure.

DataDog/datadog-operator | build_operator_image_fips_amd64   View in Datadog   GitLab

🔧 Fix in code (Fix with Cursor). Compilation error: undefined: v2alpha1.AnnotationPendingPreExperimentHash in multiple files at pkg/fleet/daemon_operations.go and pkg/fleet/daemon_worker.go.

View all 5 failed jobs.

Useful? React with 👍 / 👎

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 435014c | Docs | Datadog PR Page | Give us feedback!

coignetp added 3 commits May 19, 2026 15:02
Remove proto.Clone refactor and comment update — these are independent
of the DONE state gating feature and belong in a separate PR.
@coignetp coignetp marked this pull request as ready for review May 19, 2026 13:23
@coignetp coignetp requested a review from a team May 19, 2026 13:23
@coignetp coignetp requested a review from a team as a code owner May 19, 2026 13:23
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a2a84af182

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

case pendingIntentStart:
if phase == v2alpha1.ExperimentPhaseRunning {
return true, nil
return isDaemonSetRolloutComplete(snapshot.agent), nil
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Verify the rollout status matches the new spec

For a start task on an already-healthy DatadogAgent, this can still mark the task DONE before the experiment DaemonSet has rolled out. The Fleet start patch writes the new spec and signal together, then processStartSignal sets status.experiment.phase=running; in the same reconcile, addDDAIStatusToDDAStatus copies the current DDAI status, which is still the pre-experiment DaemonSet status until the DDAI controller updates and rolls the DaemonSet. If that old status was healthy (UpToDate == Desired && Ready == Desired), this line returns true, clears the pending annotations, and updates RC as if the rollout completed. The gate needs to ensure the status corresponds to the experiment spec/hash/generation, not just that the last reported counts were healthy.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants