Skip to content

[Fix-18338] check task if it's waiting for TaskGroup slot when pause/kill#18344

Open
eye-gu wants to merge 4 commits into
apache:devfrom
eye-gu:fix-18338-pause-task-wait-wotk-group
Open

[Fix-18338] check task if it's waiting for TaskGroup slot when pause/kill#18344
eye-gu wants to merge 4 commits into
apache:devfrom
eye-gu:fix-18338-pause-task-wait-wotk-group

Conversation

@eye-gu

@eye-gu eye-gu commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Was this PR generated or assisted by AI?

Add unit tests.

Purpose of the pull request

fix #18338

Brief change log

Verify this pull request

  • Added TaskSubmittedStateActionTest
  • Added TaskGroupCoordinatorTest#isTaskWaitingForTaskGroupSlot

Pull Request Notice

Pull Request Notice

If your pull request contains incompatible change, you should also add it to docs/docs/en/guide/upgrade/incompatible.md

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes issue #18338 by ensuring tasks that are waiting for a TaskGroup slot can be paused/killed directly (instead of entering a retry loop), and adds unit coverage for the new behavior.

Changes:

  • Add ITaskGroupCoordinator#isTaskWaitingForTaskGroupSlot and implement it in TaskGroupCoordinator.
  • Update TaskSubmittedStateAction pause/kill handling to release TaskGroup queue entries and publish paused/killed events immediately when the task is waiting for a TaskGroup slot.
  • Add unit tests for both the new coordinator query and the submitted-state pause/kill behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/engine/ITaskGroupCoordinator.java Adds API to detect whether a task is currently waiting for a TaskGroup slot.
dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/engine/TaskGroupCoordinator.java Implements the runtime “waiting for slot” check used by pause/kill handling.
dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/engine/task/statemachine/TaskSubmittedStateAction.java Pauses/kills tasks immediately when they’re waiting for a TaskGroup slot (avoids retry loop).
dolphinscheduler-master/src/test/java/org/apache/dolphinscheduler/server/master/engine/TaskGroupCoordinatorTest.java Adds unit test coverage for isTaskWaitingForTaskGroupSlot.
dolphinscheduler-master/src/test/java/org/apache/dolphinscheduler/server/master/engine/task/statemachine/TaskSubmittedStateActionTest.java Adds unit tests validating pause/kill behavior for tasks waiting on TaskGroup slots.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@SbloodyS SbloodyS added this to the 3.5.0 milestone Jun 12, 2026
@SbloodyS SbloodyS added improvement make more easy to user or prompt friendly bug Something isn't working and removed improvement make more easy to user or prompt friendly labels Jun 12, 2026
@sonarqubecloud

Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
0.0% Coverage on New Code (required ≥ 60%)

See analysis details on SonarQube Cloud

Comment on lines +147 to +154
if (taskGroupCoordinator.isTaskWaitingForTaskGroupSlot(taskExecution.getTaskInstance())) {
// Release the TaskGroupQueue first to prevent TaskGroupCoordinator
// from concurrently acquiring the slot and racing with this pause
taskGroupCoordinator.releaseTaskGroupSlot(taskExecution.getTaskInstance());
log.info("Task: {} is waiting for TaskGroup slot, pause it directly", taskExecution.getName());
taskExecution.getWorkflowEventBus().publish(TaskPausedLifecycleEvent.of(taskExecution));
return;
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may result in a state error due to concurrency.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, there is a concurrency issue with the TaskGroupCoordinator's periodic task. However, the leaked slot will be fixed by amendTaskGroupUseSize eventually, and after the task enters PAUSE state, the dispatch event triggered by TGC's RPC will be safely ignored (just a warn log). So in practice there's no real impact. Is this acceptable? I haven't come up with a clean way to fully fix the concurrency yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend bug Something isn't working test

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] [master] Infinite Retry Loop When Pausing a Workflow with Pending TaskGroup Tasks

4 participants