Skip to content

ensure the machine state snapshot capturing on build pods too#5052

Merged
openshift-merge-bot[bot] merged 1 commit intoopenshift:mainfrom
droslean:machines-2
Mar 31, 2026
Merged

ensure the machine state snapshot capturing on build pods too#5052
openshift-merge-bot[bot] merged 1 commit intoopenshift:mainfrom
droslean:machines-2

Conversation

@droslean
Copy link
Copy Markdown
Member

No description provided.

Signed-off-by: Nikolaos Moraitis <nmoraiti@redhat.com>
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 26, 2026

Walkthrough

Added a new method to MetricsAgent that asynchronously polls for a Kubernetes Pod and captures a machine snapshot when a ci-workload label is detected. The method is integrated into the build pod handling flow in the source step processor.

Changes

Cohort / File(s) Summary
Metrics snapshot polling for build pods
pkg/metrics/metrics.go, pkg/steps/source.go
Added StoreMachinesSnapshotForBuildPod() method with goroutine-based polling logic (5s intervals, 60-minute timeout) that waits for a Pod's ci-workload label before capturing metrics. Integrated the new method into handleBuild() to trigger snapshot capture for build pods.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot requested review from deepsm007 and hector-vido March 26, 2026 14:28
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 26, 2026
@droslean
Copy link
Copy Markdown
Member Author

/test e2e

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/steps/source.go`:
- Line 601: The call to
client.MetricsAgent().StoreMachinesSnapshotForBuildPod(...) is being started
inside the retry loop and thus spawns a new 60-minute watcher on every retry;
move this call so the snapshot watcher is started only once per build attempt
set—either relocate the invocation outside the retry loop that contains the pod
start/retry logic or protect it with a sync.Once (or a boolean on the build
attempt struct) so client.MetricsAgent().StoreMachinesSnapshotForBuildPod(ctx,
ns, fmt.Sprintf("%s-build", name), podClient) is invoked a single time using the
established podClient/ns/name for that attempt.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0d50279f-43d2-4bd7-865b-e5a3129f78dc

📥 Commits

Reviewing files that changed from the base of the PR and between 5357e13 and 9273294.

📒 Files selected for processing (2)
  • pkg/metrics/metrics.go
  • pkg/steps/source.go

}

client.MetricsAgent().AddNodeWorkload(ctx, ns, fmt.Sprintf("%s-build", name), name, podClient)
client.MetricsAgent().StoreMachinesSnapshotForBuildPod(ctx, ns, fmt.Sprintf("%s-build", name), podClient)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Start the build-pod snapshot watcher only once per build attempt set.

Line 601 is inside the retry loop, so each retry starts another 60-minute polling goroutine for the same pod. That can inflate API calls and duplicate machine snapshot events.

Suggested fix
 func handleBuild(ctx context.Context, client BuildClient, podClient kubernetes.PodClient, build buildapi.Build, needsMultiArchWorkaround func() bool) error {
 	const attempts = 5
 	ns, name := build.Namespace, build.Name
 	var errs []error
+	snapshotWatcherStarted := false
 	if err := wait.ExponentialBackoff(wait.Backoff{Duration: time.Minute, Factor: 1.5, Steps: attempts}, func() (bool, error) {
 		var attempt buildapi.Build
@@
 		client.MetricsAgent().AddNodeWorkload(ctx, ns, fmt.Sprintf("%s-build", name), name, podClient)
-		client.MetricsAgent().StoreMachinesSnapshotForBuildPod(ctx, ns, fmt.Sprintf("%s-build", name), podClient)
+		if !snapshotWatcherStarted {
+			client.MetricsAgent().StoreMachinesSnapshotForBuildPod(ctx, ns, fmt.Sprintf("%s-build", name), podClient)
+			snapshotWatcherStarted = true
+		}
 		if err := waitForBuildOrTimeout(ctx, client, podClient, ns, name); err != nil {
 			errs = append(errs, err)
 			return false, handleFailedBuild(ctx, client, ns, name, err)
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/steps/source.go` at line 601, The call to
client.MetricsAgent().StoreMachinesSnapshotForBuildPod(...) is being started
inside the retry loop and thus spawns a new 60-minute watcher on every retry;
move this call so the snapshot watcher is started only once per build attempt
set—either relocate the invocation outside the retry loop that contains the pod
start/retry logic or protect it with a sync.Once (or a boolean on the build
attempt struct) so client.MetricsAgent().StoreMachinesSnapshotForBuildPod(ctx,
ns, fmt.Sprintf("%s-build", name), podClient) is invoked a single time using the
established podClient/ns/name for that attempt.

@openshift-ci-robot
Copy link
Copy Markdown
Contributor

Tests from second stage were triggered manually. Pipeline can be controlled only manually, until HEAD changes. Use command to trigger second stage.

@droslean
Copy link
Copy Markdown
Member Author

/test e2e

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 31, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci bot commented Mar 31, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: droslean, liangxia

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci bot commented Mar 31, 2026

@droslean: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/breaking-changes 9273294 link false /test breaking-changes

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@droslean
Copy link
Copy Markdown
Member Author

/test images

@openshift-ci-robot
Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e

@openshift-merge-bot openshift-merge-bot bot merged commit 621c8b7 into openshift:main Mar 31, 2026
14 of 15 checks passed
@droslean droslean deleted the machines-2 branch April 1, 2026 01:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants