Skip to content

fix: implement OnEventStatusHandler.GetOperationStatus for DataLoad and DataMigrate#5945

Open
adityaupasani2 wants to merge 4 commits into
fluid-cloudnative:masterfrom
adityaupasani2:fix/on-event-status-handler
Open

fix: implement OnEventStatusHandler.GetOperationStatus for DataLoad and DataMigrate#5945
adityaupasani2 wants to merge 4 commits into
fluid-cloudnative:masterfrom
adityaupasani2:fix/on-event-status-handler

Conversation

@adityaupasani2
Copy link
Copy Markdown

What this PR does

Fixes #5944

OnEventStatusHandler.GetOperationStatus was an unimplemented stub returning nil, nil in both dataload and datamigrate packages. Since OnEvent is a valid selectable policy, this caused silent status loss at runtime — the operation status was overwritten with nil with no error surfaced to the user.

Root cause

In pkg/ddc/base/operation.go, the result of GetOperationStatus is used directly:

  • err is nil so no error is logged
  • opStatusToUpdate is nil
  • reflect.DeepEqual(opStatus, nil) is false, so UpdateOperationApiStatus(nil) is called
  • Status gets silently wiped

Fix

Implement OnEventStatusHandler.GetOperationStatus in both packages following the same pattern as OnceStatusHandler — look up the triggered job, check its finished condition, and return the correct phase, conditions and duration.

Tests

Replace the stub tests (returns nil result and nil error) with real coverage:

  • job complete → PhaseComplete
  • job failed → PhaseFailed
  • job still running → unchanged status returned

…nd DataMigrate

OnEventStatusHandler.GetOperationStatus was a stub returning nil, nil
in both dataload and datamigrate packages. Since OnEvent is a valid
selectable policy, this caused silent status loss at runtime - the
operation status was overwritten with nil with no error surfaced.

Implement the handler following the same pattern as OnceStatusHandler:
look up the triggered job, check its finished condition, and return
the correct phase, conditions and duration.

Also replace the stub tests with real coverage for complete, failed
and still-running job scenarios.

Fixes fluid-cloudnative#5944

Signed-off-by: Aditya Upasani <adityaupasani29@gmail.com>
@fluid-e2e-bot
Copy link
Copy Markdown

fluid-e2e-bot Bot commented Jun 2, 2026

Hi @adityaupasani2. Thanks for your PR.

I'm waiting for a fluid-cloudnative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements the GetOperationStatus method for OnEventStatusHandler in both the dataload and datamigrate controllers, replacing the previous TODO stubs. It also adds comprehensive unit tests to verify the status handling logic under different job conditions. The review feedback highlights a missing return statement in the dataload status handler when a missing job's helm release is successfully deleted, which could lead to logging a nil error and incorrect execution flow. Additionally, it is recommended to use ctx.Log instead of o.Log in the datamigrate status handler for consistency and proper context propagation.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +197 to +207
if err != nil {
if utils.IgnoreNotFound(err) == nil {
ctx.Log.Info("Related Job missing, will delete helm chart and retry", "namespace", ctx.Namespace, "jobName", jobName)
if err = helm.DeleteReleaseIfExists(releaseName, ctx.Namespace); err != nil {
ctx.Log.Error(err, "can't delete dataload release", "namespace", ctx.Namespace, "releaseName", releaseName)
return
}
}
ctx.Log.Error(err, "can't get dataload job", "namespace", ctx.Namespace, "jobName", jobName)
return
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There is a missing return statement when the related job is not found and the helm release is successfully deleted. If helm.DeleteReleaseIfExists succeeds, err is set to nil. Without an early return, the execution flows to ctx.Log.Error(err, ...) which will log a nil error as an error, and then return result, nil instead of returning early.

Suggested change
if err != nil {
if utils.IgnoreNotFound(err) == nil {
ctx.Log.Info("Related Job missing, will delete helm chart and retry", "namespace", ctx.Namespace, "jobName", jobName)
if err = helm.DeleteReleaseIfExists(releaseName, ctx.Namespace); err != nil {
ctx.Log.Error(err, "can't delete dataload release", "namespace", ctx.Namespace, "releaseName", releaseName)
return
}
}
ctx.Log.Error(err, "can't get dataload job", "namespace", ctx.Namespace, "jobName", jobName)
return
}
if err != nil {
if utils.IgnoreNotFound(err) == nil {
ctx.Log.Info("Related Job missing, will delete helm chart and retry", "namespace", ctx.Namespace, "jobName", jobName)
if err = helm.DeleteReleaseIfExists(releaseName, ctx.Namespace); err != nil {
ctx.Log.Error(err, "can't delete dataload release", "namespace", ctx.Namespace, "releaseName", releaseName)
return
}
return
}
ctx.Log.Error(err, "can't get dataload job", "namespace", ctx.Namespace, "jobName", jobName)
return
}

Comment on lines +225 to +228
if err = helm.DeleteReleaseIfExists(releaseName, ctx.Namespace); err != nil {
o.Log.Error(err, "can't delete DataMigrate release", "namespace", ctx.Namespace, "releaseName", releaseName)
return
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For consistency and proper context propagation, use ctx.Log instead of o.Log to log the error when deleting the release fails.

Suggested change
if err = helm.DeleteReleaseIfExists(releaseName, ctx.Namespace); err != nil {
o.Log.Error(err, "can't delete DataMigrate release", "namespace", ctx.Namespace, "releaseName", releaseName)
return
}
if err = helm.DeleteReleaseIfExists(releaseName, ctx.Namespace); err != nil {
ctx.Log.Error(err, "can't delete DataMigrate release", "namespace", ctx.Namespace, "releaseName", releaseName)
return
}

@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 63.35%. Comparing base (488376b) to head (86c0727).
⚠️ Report is 8 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5945   +/-   ##
=======================================
  Coverage   63.35%   63.35%           
=======================================
  Files         481      481           
  Lines       33070    33070           
=======================================
  Hits        20953    20953           
  Misses      10442    10442           
  Partials     1675     1675           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Collaborator

@cheyang cheyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Superseding this summary — consolidating all feedback into the inline comments on this PR. Please disregard the earlier /lgtm; this is not an approval.

…andler

Add explicit return after successfully deleting the helm release when
the job is not found, consistent with the DataMigrate implementation.

Addresses review comment on fluid-cloudnative#5945

Signed-off-by: Aditya Upasani <adityaupasani29@gmail.com>
@fluid-e2e-bot fluid-e2e-bot Bot removed the lgtm label Jun 3, 2026
@fluid-e2e-bot
Copy link
Copy Markdown

fluid-e2e-bot Bot commented Jun 3, 2026

New changes are detected. LGTM label has been removed.

@fluid-e2e-bot
Copy link
Copy Markdown

fluid-e2e-bot Bot commented Jun 3, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from cheyang by writing /assign @cheyang in a comment. For more information see:The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@adityaupasani2
Copy link
Copy Markdown
Author

@cheyang Addressed the review nit added the missing return after helm release deletion in dataload/status_handler.go to match the DataMigrate pattern.

…rate

Use ctx.Log for consistency with other handlers and proper context
propagation when logging helm release deletion errors.

Addresses Gemini bot review comment on fluid-cloudnative#5945

Signed-off-by: Aditya Upasani <adityaupasani29@gmail.com>
@adityaupasani2
Copy link
Copy Markdown
Author

Addressed Gemini bot's review comment changed o.Log to ctx.Log in the DataMigrate OnEventStatusHandler for consistency and proper context propagation.

/assign @cheyang

@@ -188,6 +188,54 @@ func (c *CronStatusHandler) GetOperationStatus(ctx cruntime.ReconcileRequestCont
}

func (o *OnEventStatusHandler) GetOperationStatus(ctx cruntime.ReconcileRequestContext, opStatus *datav1alpha1.OperationStatus) (result *datav1alpha1.OperationStatus, err error) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is almost a line-for-line copy of OnceStatusHandler.GetOperationStatus above (lines 54-111), and the same duplication shows up in the datamigrate package. Could the shared body be pulled into a small helper (e.g. one that takes the client plus releaseName/jobName and returns the updated OperationStatus) so the two handlers do not drift over time? Not blocking, but it would make future status-handling fixes a single edit instead of four.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged . I'll keep this fix focused and open a separate issue + PR for the refactor to avoid expanding scope here. Happy to do it as a follow-up once this merges.

result = opStatus.DeepCopy()
object := o.dataMigrate
releaseName := utils.GetDataMigrateReleaseName(object.GetName())
jobName := utils.GetDataMigrateJobName(releaseName)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetJob looks the job up with object.GetNamespace(), but the missing-job branch logs and deletes the release using ctx.Namespace. This matches the existing Once handler so I won't block on it, but it would be more robust to use a single namespace source throughout.

ctx.Log.Error(err, "can't delete dataload release", "namespace", ctx.Namespace, "releaseName", releaseName)
return
}
return
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call adding the explicit return here after DeleteReleaseIfExists succeeds. The OnceStatusHandler version falls through and ends up logging a nil error, so this branch is genuinely cleaner and matches the Cron handler. Worth keeping.

ctx.Log.V(1).Info("DataLoad chart already existed, check its running status")
job, err := kubeclient.GetJob(o.Client, jobName, ctx.Namespace)
if err != nil {
if utils.IgnoreNotFound(err) == nil {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This job-missing branch (the helm.DeleteReleaseIfExists call plus early return) isn't exercised by any test. Since it's a real control-flow path, a case where the job is NotFound would be worth adding.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The job-missing path is hard to cover in unit tests because DeleteReleaseIfExists shells out to the ddc-helm binary which isn't present in the test environment the same gap exists in OnceStatusHandler. I attempted to add a NotFound test but it fails with exec: "ddc-helm": executable file not found. Happy to add it if there's a recommended way to mock helm calls in this codebase, otherwise I'll leave it consistent with the existing OnceStatusHandler pattern.

}
isJobSucceed := finishedJobCondition.Type == batchv1.JobComplete

if result.NodeAffinity == nil && isJobSucceed {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The success-path tests use jobs without the affinity-inject annotation, so GenerateNodeAffinity short-circuits to (nil, nil) and this label-generation block (plus its wrapped-error path) never actually runs. A test with the annotation present would cover it.

}

isJobSucceed := finishedJobCondition.Type == batchv1.JobComplete
if o.dataMigrate.Spec.Parallelism == 1 && result.NodeAffinity == nil && isJobSucceed {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only Parallelism == 1 is tested, so the branch that skips NodeAffinity for parallel migrations isn't verified. A Parallelism > 1 case would lock this guard in.

fakeClient := fake.NewFakeClientWithScheme(testScheme, dm)
handler := &OnEventStatusHandler{Client: fakeClient, dataMigrate: dm}
ctx := cruntime.ReconcileRequestContext{Log: fake.NullLogger()}
var (
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While you're touching this file: the header reads Copyright 2026 whereas its siblings use 2023. A mismatched license header is a likely cause of the failing Project Check, so it's worth aligning and ruling out before the next CI run.

- Fix copyright year 2026 -> 2023 in datamigrate status_handler_test.go
  (was causing Project Check / lint failure)
- Run gofmt on implementation files to fix formatting
- Add NodeAffinity generation test with affinity annotation for DataLoad
- Add Parallelism > 1 test to verify NodeAffinity is skipped for
  parallel DataMigrate operations
- Use object.GetNamespace() consistently instead of ctx.Namespace
  in DataMigrate OnEventStatusHandler for correctness

Addresses review comments by cheyang on fluid-cloudnative#5945

Signed-off-by: Aditya Upasani <adityaupasani29@gmail.com>
@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented Jun 4, 2026

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] OnEventStatusHandler.GetOperationStatus is unimplemented in DataLoad and DataMigrate, causing silent status loss

2 participants