[AMORO-3988][ams] Automatic restart the optimizers if it is down unexpected#4180

Open
lintingbin wants to merge 4 commits into apache:master from lintingbin:feature/auto-restart-optimizer

Conversation

@lintingbin
Contributor

Search before asking

  • I have searched in the issues and found no similar issues.

What type of PR is this?

  • Improvement
  • Bug Fix
  • Feature
  • Refactoring

What does this PR do?

Add an automatic restart mechanism for optimizers that die unexpectedly.

Problem

When an optimizer process crashes, the OptimizerKeeper detects the heartbeat timeout and removes the optimizer record from the optimizer table. However, the corresponding resource record in the resource table is not cleaned up, leaving an "orphaned" resource with no active optimizer. Currently there is no mechanism to detect this situation and restart the optimizer.

Solution

Extend the existing OptimizerGroupKeeper to periodically detect orphaned resources and automatically restart the optimizer:

  1. Orphaned resource detection: In OptimizerGroupKeeper.processTask(), cross-check the resource table against the optimizer table to find resources without any active optimizer instances.
  2. Automatic restart: For each orphaned resource, call container.requestResource(resource) to restart the optimizer process.
  3. Retry tracking: Track the number of restart attempts per orphaned resource. If a restart exceeds the configurable maximum retries, clean up the orphaned resource from the database instead of retrying indefinitely.
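The detection-and-retry logic in the three steps above can be sketched roughly as follows. This is a minimal illustration; class and method names (`OrphanDetector`, `findOrphans`, `shouldRestart`) are hypothetical, not the PR's actual API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the orphan-detection loop: cross-check resource ids against the
// set of resource ids that still have an active optimizer, then bound the
// number of restart attempts per orphaned resource.
class OrphanDetector {
    static final int MAX_RETRIES = 5; // mirrors optimizer.auto-restart-max-retries

    // restart attempts per orphaned resource id
    final Map<String, Integer> restartAttempts = new HashMap<>();

    /** Resources whose id has no matching active optimizer instance. */
    List<String> findOrphans(List<String> resourceIds, Set<String> activeOptimizerResourceIds) {
        List<String> orphans = new ArrayList<>();
        for (String id : resourceIds) {
            if (!activeOptimizerResourceIds.contains(id)) {
                orphans.add(id);
            }
        }
        return orphans;
    }

    /** True if the orphan should be restarted; false once retries are exhausted
     *  and the resource should be cleaned up from the database instead. */
    boolean shouldRestart(String resourceId) {
        int attempts = restartAttempts.merge(resourceId, 1, Integer::sum);
        return attempts <= MAX_RETRIES;
    }
}
```

In the real keeper, each orphan that passes `shouldRestart` would be handed to `container.requestResource(resource)`, as described in step 2.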

Configuration

Two new configuration options are added (disabled by default):

| Config Key | Default | Description |
| --- | --- | --- |
| `optimizer.auto-restart-enabled` | `false` | Whether to enable automatic restart of unexpectedly down optimizers |
| `optimizer.auto-restart-max-retries` | `5` | Maximum restart attempts per orphaned resource before cleanup |
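For example, enabling the feature might look like the following fragment. The exact placement in the AMS configuration file is an assumption here; consult the `docs/configuration/ams-config.md` entries added by this PR for the authoritative location:

```yaml
# Hypothetical configuration fragment (placement not confirmed by this PR body)
optimizer.auto-restart-enabled: true       # default: false
optimizer.auto-restart-max-retries: 5      # default: 5
```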

Bug fix

Also fixes ResourceMapper.selectResourcesByGroup result mapping, where property names did not match Resource class fields (group → groupName, container → containerName, totalMemory → memoryMb).
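For context on why the mismatched names matter: name-based row mapping (as MyBatis-style mappers perform) silently drops a column whose configured property name has no matching field, leaving that field at its default value. A minimal reflective illustration, with hypothetical names:

```java
import java.lang.reflect.Field;
import java.util.Map;

// Target of the mapping; field names match the Resource class fields
// mentioned in the fix (groupName, containerName, memoryMb).
class Resource {
    String groupName;
    String containerName;
    long memoryMb;
}

// Illustrative name-based row mapper: a row key that does not match any
// Resource field (e.g. "group" instead of "groupName") is silently ignored,
// which is exactly the bug class this PR fixes in the real mapper XML.
class RowMapper {
    static Resource map(Map<String, Object> row) {
        Resource r = new Resource();
        for (Map.Entry<String, Object> e : row.entrySet()) {
            try {
                Field f = Resource.class.getDeclaredField(e.getKey());
                f.setAccessible(true);
                f.set(r, e.getValue());
            } catch (NoSuchFieldException | IllegalAccessException ignored) {
                // mismatched property name: the value is dropped without error
            }
        }
        return r;
    }
}
```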

Checklist

  • I have added corresponding tests for my changes.
  • This PR only changes one thing, and it is clear from the title.
  • I have documented my changes in the relevant documentation.

Closes #3988

…ectedly

Add auto-restart mechanism for optimizers that die unexpectedly.
When an optimizer process crashes, the resource record remains in the DB
but the optimizer record is removed by heartbeat timeout. This leaves
'orphaned' resources with no active optimizer.

Changes:
- Add orphaned resource detection in OptimizerGroupKeeper: periodically
  cross-check resource table against optimizer table to find resources
  without active optimizers
- Restart orphaned optimizer via container.requestResource()
- Track retry count per orphaned resource; clean up resource from DB
  after exceeding configurable max retries
- Add configuration options:
  - optimizer.auto-restart-enabled (default: false)
  - optimizer.auto-restart-max-retries (default: 5)
- Fix ResourceMapper.selectResourcesByGroup result mapping: property
  names did not match Resource class fields (group->groupName,
  container->containerName, totalMemory->memoryMb)
- Add 3 test cases for auto-restart scenarios
github-actions bot added the module:ams-server (Ams server module) label on Apr 14, 2026
Fix several issues in the auto-restart mechanism:

1. Prevent double provisioning: restartOrphanedOptimizers now returns
   the total thread count of orphaned resources being restarted, which
   is subtracted from requiredCores in tryKeeping to avoid duplicate
   resource allocation.

2. Add grace period for orphaned resource detection: new config
   optimizer.auto-restart-grace-period (default 5min) prevents
   misidentifying resources whose optimizer is still starting up
   (e.g. Flink/Kubernetes) as orphaned.

3. Persist updated properties after restart: add updateResource method
   to ResourceMapper/ResourceManager/DefaultOptimizerManager so that
   new job-id and other properties from doScaleOut are persisted to DB
   after restarting an orphaned resource.

4. Use timestamp-based tracking: replace simple retry counter with
   OrphanedResourceState that tracks both firstDetectedTime and
   restartAttempts. After a successful restart, the grace period timer
   is reset to allow the optimizer time to register.

5. Use InternalResourceContainer interface instead of casting to
   AbstractOptimizerContainer, with instanceof check for safety.

6. Improve test stability: replace Thread.sleep with polling-based
   waitUntil helper to reduce flakiness in CI environments.
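Item 6's polling helper might look roughly like the sketch below. The name `waitUntil` comes from the commit message; the poll interval and error handling here are assumptions:

```java
import java.util.function.BooleanSupplier;

// Polling-based replacement for Thread.sleep in tests: re-check the condition
// on a short interval instead of sleeping once for a fixed duration, failing
// only if the deadline passes. This reduces flakiness in slow CI environments.
final class TestWait {
    static void waitUntil(BooleanSupplier condition, long timeoutMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() > deadline) {
                throw new AssertionError("condition not met within " + timeoutMs + " ms");
            }
            try {
                Thread.sleep(20); // short poll instead of one long fixed sleep
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new AssertionError("interrupted while waiting", e);
            }
        }
    }
}
```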
- Fix grace period to use lastRestartTime instead of firstDetectedTime so
  retry attempts are rate-limited from the most recent restart, not first detection
- Fix rawRequiredCores vs requiredCores: orphaned cores must not be counted as
  satisfied capacity when resetting minParallelism after max keeping attempts
- Fix non-InternalResourceContainer: log warn only once, suppress further cycles
  via grace period, and do not consume restartAttempts (requires manual cleanup)
- Fix TOCTOU: add defensive deleteOptimizer before deleteResource on max retries
- Fix ResourceMapper: remove start_time from selectResourcesByGroup (no mapping),
  and add start_time = CURRENT_TIMESTAMP to updateResource so restarts are visible
- Fix test race in testNoRestartWhenOptimizerIsActive: authenticate before
  createResource to close the orphan detection window
- Replace ConcurrentHashMap with HashMap in orphanedResourceStates (single-threaded)
- Add docs/configuration/ams-config.md entries for three new config keys
- Add comments explaining TOCTOU, grace period semantics, and test rationale
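The lastRestartTime-based grace period described in the first fix above can be sketched as follows (field and method names are illustrative, modeled on the `OrphanedResourceState` mentioned in the earlier commit):

```java
// Timestamp-based orphan state: the grace period is measured from the most
// recent restart rather than first detection, so a freshly restarted
// optimizer gets the full grace window to register before being re-flagged.
final class OrphanedResourceState {
    long lastRestartTime;  // reset on every restart attempt
    int restartAttempts;

    OrphanedResourceState(long now) {
        this.lastRestartTime = now; // first detection starts the grace timer
    }

    /** While true, the resource is not treated as orphaned yet. */
    boolean inGracePeriod(long now, long gracePeriodMs) {
        return now - lastRestartTime < gracePeriodMs;
    }

    /** Record a restart and restart the grace timer from now. */
    void recordRestart(long now) {
        restartAttempts++;
        lastRestartTime = now;
    }
}
```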

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
github-actions bot added the type:docs (Improvements or additions to documentation) label on Apr 15, 2026

Labels

module:ams-server (Ams server module), module:common, type:docs (Improvements or additions to documentation)

Development

Successfully merging this pull request may close these issues.

[Improvement]: Automatic restart the optimizers if it is down unexpected

1 participant