FA rollout experiment fixes#3035
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #3035 +/- ##
==========================================
+ Coverage 42.24% 42.30% +0.06%
==========================================
Files 337 337
Lines 28952 29008 +56
==========================================
+ Hits 12230 12273 +43
- Misses 15917 15925 +8
- Partials 805 810 +5
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
This comment has been minimized.
This comment has been minimized.
86065ef to
5aa1b51
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5aa1b51cb3
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if rev != nil && instance.Status.Experiment.StartedAt != nil { | ||
| // status.experiment.startedAt is the timeout anchor. rev.CreationTimestamp |
There was a problem hiding this comment.
Preserve timeout behavior when StartedAt is missing
Add a backward-compatible timeout anchor for running experiments that predate this field. After an operator upgrade, existing status.experiment.phase=running entries won't have startedAt, so this guard skips timeout checks forever and those experiments never auto-rollback. Previously they still timed out via ControllerRevision.CreationTimestamp; without a fallback, in-flight experiments can remain running indefinitely until manually stopped.
Useful? React with 👍 / 👎.
What does this PR do?
Addressing several bugs/issues:
creationTimestampwhich is used as proxy for experiment start time, doesn't change. Once these revisions are old enough controller will timeout and rollback.Easiest to review Commit by commit
b6ef9e5 Decouples experiment timing from
ControllerRevisionmetadata by adding explicitStartedAtfield toExperimentStatus.rev.CreationTimestampmight be stale if new spec has is same as earlier experiment. This could lead to immediate timeout if spec matched pre-existing controller rev spec.d202cf7 rehydrate installer state from DatadogAgent on daemon startup to report state with correct task ID instead of empty one forcing FA to send start requests.
6cf24ad report TaskState_ERROR for the original start task on timeout.
8509c37 replaces two annotations (
*experiment-promoted,*experiment-rollback) on controller rev with a single annotation to avoid accidentally accumulating two mutually exclusive annotations and to simplify logic. Bug reproducible by flipping same config back and forth and intentionally failing experiment to force rollback.5aa1b51 similar to 6cf24ad, report
aborterror for the experiment task when DDA is changed during experiment.Motivation
What inspired you to submit this pull request?
Additional Notes
Anything else we should know when reviewing?
Minimum Agent Versions
Are there minimum versions of the Datadog Agent and/or Cluster Agent required?
Describe your test plan
Write there any instructions and details you may have to test your PR.
Checklist
bug,enhancement,refactoring,documentation,tooling, and/ordependenciesqa/skip-qalabel