[develop][E2E Test] Add E2E test for Slurm 25.11 expedited requeue mode feature #7211
Open
hehe7318 wants to merge 13 commits into aws:develop
Extend test_fast_capacity_failover to validate the new --requeue=expedite option introduced in Slurm 25.11.2. This feature allows batch jobs to automatically requeue on node failure with highest priority.
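As a rough illustration of what such a submission could look like when driven from Python, here is a minimal sketch. The helper `build_sbatch_command` is hypothetical and not part of the test suite; only `--requeue=expedite` comes from this PR, and the remaining flags (`-N`, `--prefer`, `-o`, `--wrap`) are standard sbatch options.

```python
import shlex


def build_sbatch_command(wrap, nodes=1, expedite=False, prefer=None, output=None):
    """Return an sbatch argv list (hypothetical helper for illustration only)."""
    args = ["sbatch", "-N", str(nodes)]
    if expedite:
        # Option under test in this PR: requeue on node failure with highest priority.
        args.append("--requeue=expedite")
    if prefer:
        # Soft feature preference, e.g. to target a specific compute resource.
        args.append(f"--prefer={prefer}")
    if output:
        args.extend(["-o", output])
    args.extend(["--wrap", wrap])
    return args


if __name__ == "__main__":
    # Print a shell-quoted command line for an expedited-requeue job.
    print(shlex.join(build_sbatch_command("sleep 30", expedite=True)))
```

Returning an argv list rather than a single string keeps quoting of the `--wrap` payload out of the caller's hands.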
…eue jobs are treated as highest priority.
- Change job commands from simple 'sleep 30' to output hostname and timestamps, making it easier to verify job execution in output files
- Add --prefer option to job2 targeting the same compute resource as job1
- Increase job2 node request from 1 to 2 nodes to prevent it from immediately running on another CR before job1 requeues
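The idea of printing hostname and timestamps so the output files can be checked afterwards might be sketched as follows. The line format (`START <host> <epoch>` / `END <host> <epoch>`), the `JOB_BODY` string, and both helper functions are assumptions made for illustration; the actual commands used in the test may differ.

```python
import re

# Hypothetical job body: record the host and epoch seconds around the sleep.
JOB_BODY = (
    'echo "START $(hostname) $(date +%s)"; '
    "sleep 30; "
    'echo "END $(hostname) $(date +%s)"'
)

# Matches the two marker lines produced by JOB_BODY.
LINE_RE = re.compile(r"^(START|END) (\S+) (\d+)$")


def parse_job_output(text):
    """Return {'START': (host, ts), 'END': (host, ts)} from a job output file.

    If a marker appears more than once, the last occurrence wins.
    """
    events = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            events[match.group(1)] = (match.group(2), int(match.group(3)))
    return events


def started_before(first_output, second_output):
    """True if the first job's START timestamp is earlier than the second's."""
    first_start = parse_job_output(first_output)["START"][1]
    second_start = parse_job_output(second_output)["START"][1]
    return first_start < second_start
```

A comparison like `started_before(expedited_output, normal_output)` is one way to express the ordering check, given output in the assumed format.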
…er and use recoverable ICE simulation
Move _test_expedited_requeue_on_ice out of test_fast_capacity_failover into a standalone test_expedited_requeue with its own cluster config (multi-instance-type CR using create_fleet). Replace the unrecoverable overrides.py-based ICE simulation with create_fleet_overrides.json: write invalid 'ICE-' prefixed InstanceTypes to trigger ICE, then change them back to real ones (t3.medium, c5.xlarge) to recover. This allows verifying that after ICE recovery, the expedited requeue job starts before a normal job submitted earlier.
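The trigger-then-recover flow described above could be sketched roughly like this. The JSON nesting (queue name, then compute resource name, then `LaunchTemplateConfigs` / `Overrides`), the queue and compute resource names, and both helper functions are assumptions for illustration; the real file may need additional fields that this sketch omits.

```python
import json

ICE_PREFIX = "ICE-"  # invalid prefix used to provoke an insufficient-capacity error


def build_fleet_overrides(queue, compute_resource, instance_types, simulate_ice):
    """Build an overrides dict; the nesting shown here is an assumed layout."""
    prefix = ICE_PREFIX if simulate_ice else ""
    return {
        queue: {
            compute_resource: {
                "LaunchTemplateConfigs": [
                    {"Overrides": [{"InstanceType": prefix + t} for t in instance_types]}
                ]
            }
        }
    }


def write_overrides(path, overrides):
    """Write the overrides dict to the given path as JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(overrides, handle, indent=2)


if __name__ == "__main__":
    real_types = ["t3.medium", "c5.xlarge"]
    # Step 1: invalid instance types, so launches fail as if capacity were unavailable.
    broken = build_fleet_overrides("queue1", "ice-cr", real_types, simulate_ice=True)
    # Step 2: real instance types again, so the compute resource can recover.
    recovered = build_fleet_overrides("queue1", "ice-cr", real_types, simulate_ice=False)
    print(json.dumps(broken, indent=2))
    print(json.dumps(recovered, indent=2))
```

Because recovery is just a second write with `simulate_ice=False`, the same cluster can be used to observe job ordering after capacity returns.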
…avoid MissingParameter error
Description of changes
Add test_expedited_requeue in test_slurm.py to validate that jobs submitted with --requeue=expedite are treated as highest priority after ICE recovery. ICE is simulated with create_fleet_overrides.json (instead of the permanent overrides.py approach used by test_fast_capacity_failover).
Tests
Checklist
… develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
Please review the guidelines for contributing and Pull Request Instructions.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.