FA rollout experiment fixes #3035

Open: levan-m wants to merge 5 commits into main from levan-m/experiment-fixes

Collaborator

@levan-m levan-m commented May 21, 2026

What does this PR do?

Addresses several bugs/issues:

  • When the same config is flipped back and forth across rollouts, the two ControllerRevisions are reused under the same name/hash and only get a revision bump. If one of those config updates is rolled back, both marker annotations accumulate on the same revision:
        operator.datadoghq.com/experiment-promoted: "true"
        operator.datadoghq.com/experiment-rollback: "true"
  • Again when flipping the same config, the revision numbers of the two ControllerRevisions increase, while their creationTimestamp, which is used as a proxy for the experiment start time, doesn't change. Once these revisions are old enough, the controller times out and rolls back.
  • If an experiment times out, subsequent experiments fail unless a different config change forces creation of a new ControllerRevision.
  • The operator doesn't send task abort or timeout errors to FA (regardless of how FA processes them; they are currently ignored).

Easiest to review commit by commit:

b6ef9e5 Decouples experiment timing from ControllerRevision metadata by adding an explicit StartedAt field to ExperimentStatus. rev.CreationTimestamp may be stale if the new spec is the same as that of an earlier experiment, which could lead to an immediate timeout when the spec matches a pre-existing ControllerRevision spec.

d202cf7 Rehydrates installer state from the DatadogAgent on daemon startup, so that state is reported with the correct task ID instead of an empty one (an empty task ID forces FA to send start requests).
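
A rough sketch of what rehydration on startup could look like, under stated assumptions. The types `experimentRecord` and `installerState` and the function `rehydrate` are hypothetical and do not mirror the real daemon; in particular, where the task ID is actually persisted on the DatadogAgent is not described here.

```go
package main

// experimentRecord is a hypothetical view of what the daemon reads back from
// the DatadogAgent object on startup.
type experimentRecord struct {
	TaskID string
	Phase  string
}

// installerState is a hypothetical in-memory state that the daemon reports to FA.
type installerState struct {
	TaskID            string
	ExperimentRunning bool
}

// rehydrate rebuilds the in-memory state from the persisted record so that the
// first report after a restart carries the original task ID rather than an
// empty one.
func rehydrate(rec *experimentRecord) installerState {
	if rec == nil {
		// Nothing persisted: start from an empty state.
		return installerState{}
	}
	return installerState{
		TaskID:            rec.TaskID,
		ExperimentRunning: rec.Phase == "running",
	}
}
```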

6cf24ad Reports TaskState_ERROR for the original start task on timeout.

8509c37 Replaces the two annotations (*experiment-promoted, *experiment-rollback) on the ControllerRevision with a single annotation, to avoid accidentally accumulating two mutually exclusive annotations and to simplify the logic. The bug is reproducible by flipping the same config back and forth and intentionally failing an experiment to force a rollback.
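
To illustrate why a single annotation cannot accumulate conflicting markers, here is a small sketch. The annotation key and the value strings below are hypothetical placeholders (the actual key introduced by 8509c37 is not shown in this description), and `markOutcome` is an illustrative helper, not the operator's function.

```go
package main

// outcomeAnnotation is a hypothetical key used only for this sketch.
const outcomeAnnotation = "example.com/experiment-outcome"

// Hypothetical values for the single annotation.
const (
	outcomePromoted   = "promoted"
	outcomeRolledBack = "rollback"
)

// markOutcome records the experiment outcome under one key. Writing a new
// outcome overwrites the previous value, so "promoted" and "rollback" can
// never be present at the same time on a reused revision.
func markOutcome(annotations map[string]string, outcome string) map[string]string {
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[outcomeAnnotation] = outcome
	return annotations
}
```

With two boolean annotations, each code path has to remember to delete the other key; with one key, the overwrite makes the states mutually exclusive by construction.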

5aa1b51 Similar to 6cf24ad: reports an abort error for the experiment task when the DDA is changed during an experiment.
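
The error reporting in 6cf24ad and 5aa1b51 can be pictured roughly as below. This is a sketch with hypothetical types: only the name TaskState_ERROR comes from this PR, while `TaskState`, `TaskReport`, `failTask`, and the error strings are assumptions made for illustration.

```go
package main

import "fmt"

// TaskState is a hypothetical stand-in for the FA task state enum.
type TaskState int

const (
	TaskStateRunning TaskState = iota
	TaskStateError             // stands in for TaskState_ERROR
)

// TaskReport is a hypothetical shape for what gets reported back to FA.
type TaskReport struct {
	TaskID string
	State  TaskState
	Error  string
}

// failTask builds an error report against the task that originally started
// the experiment, so a timeout or an abort is attributed to that start task
// rather than silently dropped.
func failTask(startTaskID, reason string) TaskReport {
	return TaskReport{
		TaskID: startTaskID,
		State:  TaskStateError,
		Error:  fmt.Sprintf("experiment %s", reason),
	}
}
```

In this sketch both paths share one helper: the timeout path would call `failTask(id, "timed out")` and the DDA-changed path `failTask(id, "aborted")`.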

Motivation

What inspired you to submit this pull request?

Additional Notes

Anything else we should know when reviewing?

Minimum Agent Versions

Are there minimum versions of the Datadog Agent and/or Cluster Agent required?

  • Agent: vX.Y.Z
  • Cluster Agent: vX.Y.Z

Describe your test plan

Write here any instructions and details you may have to test your PR.

Checklist

  • PR has at least one valid label: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
  • PR has a milestone or the qa/skip-qa label
  • All commits are signed (see: signing commits)

@levan-m levan-m added this to the v1.27.0 milestone May 21, 2026
@levan-m levan-m added the bug and enhancement labels May 21, 2026

codecov-commenter commented May 21, 2026

Codecov Report

❌ Patch coverage is 79.01235% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 42.30%. Comparing base (03e4b46) to head (5aa1b51).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/fleet/daemon.go | 65.38% | 6 Missing and 3 partials ⚠️ |
| internal/controller/datadogagent/experiment.go | 76.92% | 4 Missing and 2 partials ⚠️ |
| pkg/fleet/daemon_worker.go | 92.85% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files


@@            Coverage Diff             @@
##             main    #3035      +/-   ##
==========================================
+ Coverage   42.24%   42.30%   +0.06%     
==========================================
  Files         337      337              
  Lines       28952    29008      +56     
==========================================
+ Hits        12230    12273      +43     
- Misses      15917    15925       +8     
- Partials      805      810       +5     
| Flag | Coverage Δ |
|---|---|
| unittests | 42.30% <79.01%> (+0.06%) ⬆️ |

| Files with missing lines | Coverage Δ |
|---|---|
| internal/controller/datadogagent/revision.go | 78.46% <100.00%> (ø) |
| pkg/fleet/daemon_worker.go | 78.18% <92.85%> (+1.95%) ⬆️ |
| internal/controller/datadogagent/experiment.go | 77.52% <76.92%> (-0.08%) ⬇️ |
| pkg/fleet/daemon.go | 69.34% <65.38%> (-0.60%) ⬇️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@levan-m levan-m force-pushed the levan-m/experiment-fixes branch from 86065ef to 5aa1b51 Compare May 21, 2026 21:07
@levan-m levan-m marked this pull request as ready for review May 22, 2026 01:19
@levan-m levan-m requested a review from a team May 22, 2026 01:19
@levan-m levan-m requested review from a team as code owners May 22, 2026 01:19

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5aa1b51cb3

Comment on lines +474 to +475
if rev != nil && instance.Status.Experiment.StartedAt != nil {
// status.experiment.startedAt is the timeout anchor. rev.CreationTimestamp

P1: Preserve timeout behavior when StartedAt is missing

Add a backward-compatible timeout anchor for running experiments that predate this field. After an operator upgrade, existing status.experiment.phase=running entries won't have startedAt, so this guard skips timeout checks forever and those experiments never auto-rollback. Previously they still timed out via ControllerRevision.CreationTimestamp; without a fallback, in-flight experiments can remain running indefinitely until manually stopped.
