[FA] Gate start task DONE on DaemonSet rollout readiness#3028
Conversation
- Gate start task completion on DaemonSet rollout (all pods updated and ready) so the DONE state reflects actual rollout progress rather than just task dispatch. - Report rollout progress in Task.Error.Message while RUNNING so the RC backend can surface it to the user. - Populate RunningVersion in PackageState RC state; use proto.Clone in setTaskState so all fields are preserved without manual enumeration. - Revert INVALID_STATE for promote precondition failures back to ERROR (INVALID_STATE is reserved for RC config state mismatches). - Bump datadog-agent/pkg/proto to v0.77.0.
e32ec91 to
0be6bc5
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3028 +/- ##
==========================================
+ Coverage 41.50% 41.78% +0.27%
==========================================
Files 335 336 +1
Lines 28714 29101 +387
==========================================
+ Hits 11919 12161 +242
- Misses 16001 16128 +127
- Partials 794 812 +18
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 5 files with indirect coverage changes Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
|
Remove proto.Clone refactor and comment update — these are independent of the DONE state gating feature and belong in a separate PR.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a2a84af182
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| case pendingIntentStart: | ||
| if phase == v2alpha1.ExperimentPhaseRunning { | ||
| return true, nil | ||
| return isDaemonSetRolloutComplete(snapshot.agent), nil |
There was a problem hiding this comment.
Verify the rollout status matches the new spec
For a start task on an already-healthy DatadogAgent, this can still mark the task DONE before the experiment DaemonSet has rolled out. The Fleet start patch writes the new spec and signal together, then processStartSignal sets status.experiment.phase=running; in the same reconcile, addDDAIStatusToDDAStatus copies the current DDAI status, which is still the pre-experiment DaemonSet status until the DDAI controller updates and rolls the DaemonSet. If that old status was healthy (UpToDate == Desired && Ready == Desired), this line returns true, clears the pending annotations, and updates RC as if the rollout completed. The gate needs to ensure the status corresponds to the experiment spec/hash/generation, not just that the last reported counts were healthy.
Useful? React with 👍 / 👎.
Summary
DONEstate on actual DaemonSet rollout completion (all pods updated and ready), preventing a prematureDONEbefore the rollout finishes.setTaskStatetoproto.Cloneso all existing fields are preserved without manual enumeration.Test plan
go test ./pkg/fleet/...passesTestRunPendingOperationWorker_Start*tests cover rollout gating (nil agent, partial update, complete)TestSetTaskState_PreservesAllFieldspins proto.Clone behaviourTestIsDaemonSetRolloutCompletecovers nil / zero-desired / partial / complete cases