[develop][E2E Test] Add E2E test for Slurm 25.11 expedited requeue mode feature #7211

Open
hehe7318 wants to merge 13 commits into aws:develop from hehe7318:wip/add-test-for-expedited-requeue-mode

hehe7318 (Contributor) commented Jan 27, 2026

Description of changes

Add test_expedited_requeue in test_slurm.py to validate that jobs submitted with --requeue=expedite are scheduled with the highest priority once the cluster recovers from an insufficient capacity error (ICE).

  • The test simulates a recoverable ICE through create_fleet_overrides.json, instead of the permanent overrides.py approach used by test_fast_capacity_failover.
  • Simulate ICE: write create_fleet_overrides.json with invalid InstanceTypes → create_fleet returns no instances → InsufficientInstanceCapacity → nodes go down.
  • Recover: change InstanceTypes back to real ones (t3.medium, c5.large) → the next launch succeeds.
  • Verify that the expedited requeue job starts before the normal job.
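
The simulate-and-recover steps above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the nested layout (queue, then compute resource, then LaunchTemplateConfigs/Overrides) is an assumption about how create_fleet_overrides.json is structured, the names queue1 and cr1 are hypothetical, and the file is written to a temporary directory rather than to its real location on the head node. The 'ICE-' prefix follows the commit message in this PR.

```python
import json
import tempfile
from pathlib import Path


def build_overrides(queue, compute_resource, instance_types, simulate_ice):
    """Build an overrides dict (assumed layout: queue -> compute resource -> create_fleet params).

    With simulate_ice=True every instance type gets an 'ICE-' prefix, which makes it
    invalid so that the fleet launch returns no instances.
    """
    types = [f"ICE-{t}" if simulate_ice else t for t in instance_types]
    return {
        queue: {
            compute_resource: {
                "LaunchTemplateConfigs": [
                    {"Overrides": [{"InstanceType": t} for t in types]}
                ]
            }
        }
    }


def write_overrides(path, overrides):
    """Serialize the overrides dict to the given path as JSON."""
    Path(path).write_text(json.dumps(overrides, indent=2))


# Hypothetical queue / compute resource names, for illustration only.
real_types = ["t3.medium", "c5.large"]
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "create_fleet_overrides.json"

    # Step 1: simulate ICE with invalid instance types.
    write_overrides(target, build_overrides("queue1", "cr1", real_types, simulate_ice=True))
    ice_written = json.loads(target.read_text())

    # Step 2: recover by restoring the real instance types.
    write_overrides(target, build_overrides("queue1", "cr1", real_types, simulate_ice=False))
    recovered_written = json.loads(target.read_text())
```

Keeping the builder separate from the writer means the same function produces both the broken and the recovered file, so the two states differ only in the prefix.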

Tests

  • Running

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Extend test_fast_capacity_failover to validate the new --requeue=expedite
option introduced in Slurm 25.11.2. This feature allows batch jobs to
automatically requeue on node failure with highest priority.
hehe7318 requested review from a team as code owners January 27, 2026 14:56
hehe7318 added the 3.x label Jan 27, 2026
hehe7318 and others added 9 commits January 28, 2026 14:07
- Change job commands from simple 'sleep 30' to output hostname and
  timestamps, making it easier to verify job execution in output files
- Add --prefer option to job2 targeting the same compute resource as job1
- Increase job2 node request from 1 to 2 nodes to prevent it from
  immediately running on another CR before job1 requeues
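
As a rough illustration of the two submissions described in this commit message, the sketch below builds sbatch argument lists. It assumes that job1 is the expedited requeue job and job2 the normal one, that the compute resource is exposed as a node feature named cr1 (hypothetical), and that the job script is a simple hostname/timestamp command; none of this is taken from the actual test code.

```python
import shlex

# Hypothetical job body: print hostname and timestamps around a short sleep.
JOB_SCRIPT = "hostname; date; sleep 30; date"


def sbatch_command(wrap, nodes=1, expedited_requeue=False, prefer=None):
    """Return an sbatch argv list for a wrapped shell command."""
    cmd = ["sbatch", "-N", str(nodes)]
    if expedited_requeue:
        # Option under test in this PR.
        cmd.append("--requeue=expedite")
    if prefer:
        # Soft preference for nodes carrying the given feature.
        cmd += ["--prefer", prefer]
    cmd += ["--wrap", wrap]
    return cmd


# job1: assumed to be the expedited requeue job on a single node.
job1 = sbatch_command(JOB_SCRIPT, nodes=1, expedited_requeue=True)
# job2: normal job, 2 nodes, preferring the same (hypothetical) compute resource.
job2 = sbatch_command(JOB_SCRIPT, nodes=2, prefer="cr1")

print(shlex.join(job1))
print(shlex.join(job2))
```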
…er and use recoverable ICE simulation

Move _test_expedited_requeue_on_ice out of test_fast_capacity_failover into a standalone
test_expedited_requeue with its own cluster config (multi-instance-type CR using create_fleet).

Replace the unrecoverable overrides.py-based ICE simulation with create_fleet_overrides.json:
write invalid 'ICE-' prefixed InstanceTypes to trigger ICE, then change them back to real ones
(t3.medium, c5.xlarge) to recover. This allows verifying that after ICE recovery, the expedited
requeue job starts before a normal job submitted earlier.
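
One way to express the final ordering check is to compare the StartTime fields reported by scontrol show job for the two jobs. The sketch below is not the PR's implementation: the helper names are hypothetical and the sample outputs are shortened, made-up strings used only to exercise the parsing.

```python
import re
from datetime import datetime

# Matches the StartTime field only (not SubmitTime or EligibleTime).
_START_RE = re.compile(r"\bStartTime=(\S+)")


def parse_start_time(scontrol_output):
    """Extract StartTime from 'scontrol show job' style output as a datetime."""
    match = _START_RE.search(scontrol_output)
    if match is None or match.group(1) in ("Unknown", "N/A"):
        raise ValueError("job has no usable StartTime")
    return datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")


def started_before(first_output, second_output):
    """Return True if the first job started strictly before the second."""
    return parse_start_time(first_output) < parse_start_time(second_output)


# Made-up sample outputs for illustration.
expedited_job = (
    "JobId=1 JobName=wrap\n"
    "   SubmitTime=2026-01-27T15:00:00 EligibleTime=2026-01-27T15:00:00\n"
    "   StartTime=2026-01-27T15:10:05 EndTime=2026-01-27T15:10:35\n"
)
normal_job = (
    "JobId=2 JobName=wrap\n"
    "   SubmitTime=2026-01-27T15:01:00 EligibleTime=2026-01-27T15:01:00\n"
    "   StartTime=2026-01-27T15:10:40 EndTime=2026-01-27T15:11:10\n"
)

expedited_first = started_before(expedited_job, normal_job)
```

Raising on a missing or Unknown StartTime keeps a job that never ran from silently passing the comparison.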
hehe7318 added the skip-changelog-update label (Disables the check that enforces changelog updates in PRs) Feb 13, 2026
Labels

3.x
skip-changelog-update: Disables the check that enforces changelog updates in PRs
