[AMORO-3988][ams] Automatic restart the optimizers if it is down unexpected by lintingbin · Pull Request #4180 · apache/amoro

lintingbin · 2026-04-14T10:21:01Z

Search before asking

I have searched in the issues and found no similar issues.

What type of PR is this?

Improvement
Bug Fix
Feature
Refactoring

What does this PR do?

Add an automatic restart mechanism for optimizers that die unexpectedly.

Problem

When an optimizer process crashes, the OptimizerKeeper detects the heartbeat timeout and removes the optimizer record from the optimizer table. However, the corresponding resource record in the resource table is not cleaned up, leaving an "orphaned" resource with no active optimizer. Currently there is no mechanism to detect this situation and restart the optimizer.

Solution

Extend the existing OptimizerGroupKeeper to periodically detect orphaned resources and automatically restart the optimizer:

Orphaned resource detection: In OptimizerGroupKeeper.processTask(), cross-check the resource table against the optimizer table to find resources without any active optimizer instances.
Automatic restart: For each orphaned resource, call container.requestResource(resource) to restart the optimizer process.
Retry tracking: Track the number of restart attempts per orphaned resource. If a restart exceeds the configurable maximum retries, clean up the orphaned resource from the database instead of retrying indefinitely.
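
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `OrphanRestartSketch`, `findOrphans`, `plan`, and `Action` are hypothetical names, resources are reduced to string ids, and the actual restart via container.requestResource(resource) and the database cleanup are left to the caller.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; names are illustrative and not Amoro classes. */
final class OrphanRestartSketch {
  enum Action { RESTART, CLEAN_UP }

  private final int maxRetries;
  // Restart attempts recorded per orphaned resource id.
  private final Map<String, Integer> attempts = new HashMap<>();

  OrphanRestartSketch(int maxRetries) {
    this.maxRetries = maxRetries;
  }

  /** Returns the resource ids that no registered optimizer refers to. */
  static List<String> findOrphans(List<String> resourceIds, Set<String> optimizerResourceIds) {
    List<String> orphans = new ArrayList<>();
    for (String id : resourceIds) {
      if (!optimizerResourceIds.contains(id)) {
        orphans.add(id);
      }
    }
    return orphans;
  }

  /** Decides, for one keeper cycle, whether each orphan is restarted or cleaned up. */
  Map<String, Action> plan(List<String> resourceIds, Set<String> optimizerResourceIds) {
    List<String> orphans = findOrphans(resourceIds, optimizerResourceIds);
    // Forget counters of resources that are no longer orphaned.
    attempts.keySet().retainAll(orphans);
    Map<String, Action> actions = new LinkedHashMap<>();
    for (String id : orphans) {
      int used = attempts.getOrDefault(id, 0);
      if (used >= maxRetries) {
        // Retry budget exhausted: stop retrying and let the caller delete the record.
        attempts.remove(id);
        actions.put(id, Action.CLEAN_UP);
      } else {
        attempts.put(id, used + 1);
        actions.put(id, Action.RESTART);
      }
    }
    return actions;
  }
}
```

With `maxRetries` set to 2, a resource that stays orphaned is planned for restart on two consecutive cycles and for cleanup on the third; a resource whose optimizer registers again drops out of the tracking map.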

Configuration

Two new configuration options are added. The feature is disabled by default:

| Config Key | Default | Description |
| --- | --- | --- |
| `optimizer.auto-restart-enabled` | `false` | Whether to enable automatic restart of unexpectedly down optimizers |
| `optimizer.auto-restart-max-retries` | `5` | Maximum restart attempts per orphaned resource before cleanup |
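
To make the defaults concrete, here is a small sketch that reads the two keys from a flat `java.util.Properties` object. `AutoRestartOptions` is a hypothetical holder class, and the flat key lookup is an assumption made for illustration; it does not show how AMS actually loads its configuration.

```java
import java.util.Properties;

/** Hypothetical holder for the two options; not how AMS loads its configuration. */
final class AutoRestartOptions {
  final boolean enabled;
  final int maxRetries;

  AutoRestartOptions(Properties props) {
    // Fall back to the defaults listed in the table when a key is absent.
    this.enabled =
        Boolean.parseBoolean(props.getProperty("optimizer.auto-restart-enabled", "false"));
    this.maxRetries =
        Integer.parseInt(props.getProperty("optimizer.auto-restart-max-retries", "5"));
  }
}
```

An empty `Properties` object therefore yields a disabled feature with a retry budget of 5.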

Bug fix

Also fixes ResourceMapper.selectResourcesByGroup result mapping where property names did not match Resource class fields (group→groupName, container→containerName, totalMemory→memoryMb).
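
The kind of mismatch described here can be caught with a simple reflective check. The sketch below is only an illustration under assumptions: `ResourceFields` is a stand-in that carries just the three field names mentioned above (it is not the real Resource class), and `MappingCheck.unmatched` is a hypothetical helper, not part of the mapper framework.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative stand-in with only the three fields named in the fix. */
final class ResourceFields {
  String groupName;
  String containerName;
  int memoryMb;
}

/** Hypothetical helper that validates mapping property names against a class. */
final class MappingCheck {
  /** Returns the property names that have no same-named declared field on the target. */
  static List<String> unmatched(Class<?> target, List<String> properties) {
    List<String> missing = new ArrayList<>();
    for (String property : properties) {
      try {
        target.getDeclaredField(property);
      } catch (NoSuchFieldException e) {
        missing.add(property);
      }
    }
    return missing;
  }
}
```

Checking the old names (`group`, `container`, `totalMemory`) against the stand-in reports all three as unmatched, while the corrected names report none.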

Checklist

I have added corresponding tests for my changes.
This PR only changes one thing, and it is clear from the title.
I have documented my changes in the relevant documentation.

Closes #3988

…ectedly

Add auto-restart mechanism for optimizers that die unexpectedly. When an optimizer process crashes, the resource record remains in the DB but the optimizer record is removed by heartbeat timeout. This leaves 'orphaned' resources with no active optimizer.

Changes:
- Add orphaned resource detection in OptimizerGroupKeeper: periodically cross-check resource table against optimizer table to find resources without active optimizers
- Restart orphaned optimizer via container.requestResource()
- Track retry count per orphaned resource; clean up resource from DB after exceeding configurable max retries
- Add configuration options:
  - optimizer.auto-restart-enabled (default: false)
  - optimizer.auto-restart-max-retries (default: 5)
- Fix ResourceMapper.selectResourcesByGroup result mapping: property names did not match Resource class fields (group->groupName, container->containerName, totalMemory->memoryMb)
- Add 3 test cases for auto-restart scenarios

Fix several issues in the auto-restart mechanism:

1. Prevent double provisioning: restartOrphanedOptimizers now returns the total thread count of orphaned resources being restarted, which is subtracted from requiredCores in tryKeeping to avoid duplicate resource allocation.
2. Add grace period for orphaned resource detection: new config optimizer.auto-restart-grace-period (default 5min) prevents misidentifying resources whose optimizer is still starting up (e.g. Flink/Kubernetes) as orphaned.
3. Persist updated properties after restart: add updateResource method to ResourceMapper/ResourceManager/DefaultOptimizerManager so that new job-id and other properties from doScaleOut are persisted to DB after restarting an orphaned resource.
4. Use timestamp-based tracking: replace simple retry counter with OrphanedResourceState that tracks both firstDetectedTime and restartAttempts. After a successful restart, the grace period timer is reset to allow the optimizer time to register.
5. Use InternalResourceContainer interface instead of casting to AbstractOptimizerContainer, with instanceof check for safety.
6. Improve test stability: replace Thread.sleep with polling-based waitUntil helper to reduce flakiness in CI environments.
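
The double-provisioning fix in item 1 comes down to simple arithmetic, sketched below. The formulas are an assumption about how the quantities relate, and `KeepingMath`, `rawRequiredCores`, and `coresToRequest` are hypothetical names rather than the methods in this PR.

```java
/** Hypothetical arithmetic sketch; the real tryKeeping logic may differ. */
final class KeepingMath {
  /** Cores missing relative to the group's minimum parallelism, before any adjustment. */
  static int rawRequiredCores(int minParallelism, int activeThreads) {
    return Math.max(0, minParallelism - activeThreads);
  }

  /** Cores still to request once threads of orphaned resources being restarted are discounted. */
  static int coresToRequest(int rawRequiredCores, int restartingThreads) {
    return Math.max(0, rawRequiredCores - restartingThreads);
  }
}
```

For example, with a minimum parallelism of 8, 2 active threads, and an orphaned resource with 4 threads being restarted, the raw shortfall is 6 but only 2 additional cores would be requested.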

- Fix grace period to use lastRestartTime instead of firstDetectedTime so retry attempts are rate-limited from the most recent restart, not first detection
- Fix rawRequiredCores vs requiredCores: orphaned cores must not be counted as satisfied capacity when resetting minParallelism after max keeping attempts
- Fix non-InternalResourceContainer: log warn only once, suppress further cycles via grace period, and do not consume restartAttempts (requires manual cleanup)
- Fix TOCTOU: add defensive deleteOptimizer before deleteResource on max retries
- Fix ResourceMapper: remove start_time from selectResourcesByGroup (no mapping), and add start_time = CURRENT_TIMESTAMP to updateResource so restarts are visible
- Fix test race in testNoRestartWhenOptimizerIsActive: authenticate before createResource to close the orphan detection window
- Replace ConcurrentHashMap with HashMap in orphanedResourceStates (single-threaded)
- Add docs/configuration/ams-config.md entries for three new config keys
- Add comments explaining TOCTOU, grace period semantics, and test rationale

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
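
The grace period semantics from the first bullet can be sketched as a small state object. This is an illustration only: `OrphanedStateSketch` and its methods are hypothetical, and falling back to firstDetectedTime before any restart has happened is an assumption, not something the commit message states.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical per-resource state; not the OrphanedResourceState class from this PR. */
final class OrphanedStateSketch {
  final Instant firstDetectedTime;
  Instant lastRestartTime; // null until the first restart attempt
  int restartAttempts;

  OrphanedStateSketch(Instant firstDetectedTime) {
    this.firstDetectedTime = firstDetectedTime;
  }

  /**
   * True once the grace period has elapsed since the most recent restart,
   * or since first detection when no restart has been attempted yet (assumed fallback).
   */
  boolean graceElapsed(Instant now, Duration gracePeriod) {
    Instant reference = lastRestartTime != null ? lastRestartTime : firstDetectedTime;
    return !now.isBefore(reference.plus(gracePeriod));
  }

  /** Records a restart attempt and restarts the grace period timer from now. */
  void recordRestart(Instant now) {
    lastRestartTime = now;
    restartAttempts++;
  }
}
```

Measuring from the last restart means each retry gets a full grace period to register, so a slow-starting optimizer is not restarted again on the very next keeper cycle.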

github-actions bot added the module:ams-server Ams server module label Apr 14, 2026

github-actions bot added the module:common label Apr 14, 2026

github-actions bot added the type:docs Improvements or additions to documentation label Apr 15, 2026

[AMORO-3988] Apply spotless formatting

859c8a9

lintingbin wants to merge 4 commits into apache:master from lintingbin:feature/auto-restart-optimizer
