Commit 825fbbb

fix(dds-submit-slurm): resolve SLURM job submission failures
Fixed a critical bug in the SLURM plugin that caused job submissions to fail with "No partition specified" error when using lightweight mode. The issue was due to executable code appearing before #SBATCH directives in the generated job script, violating SLURM's parsing requirements.

* Removed lightweight validation code from the job script template.
* Eliminated blank lines between #SBATCH directive placeholders.
* Validation logic is now handled within worker nodes.

This ensures SLURM correctly parses all #SBATCH directives, including partition specifications and resource requirements.

1 parent f040ca3 commit 825fbbb

4 files changed

Lines changed: 32 additions & 75 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -14,6 +14,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### Fixed

 - **DDSWorker.sh**: Fixed inverted logic bug that caused worker package deployment to fail when pre-compiled binaries were present. The script now correctly handles both full packages (with binaries) and lightweight packages (without binaries).
+- **dds-submit-slurm**: Fixed critical bug in lightweight mode where #SBATCH directives were ignored by SLURM scheduler, causing job submission failures with "No partition specified" error. The issue was caused by executable code appearing before #SBATCH directives in the generated job script, violating SLURM's parsing requirements. Fixed by removing the lightweight validation code from the job script template and eliminating blank lines between #SBATCH directive placeholders.

 ## [3.15.0] - 2025-10-08

ReleaseNotes.md

Lines changed: 31 additions & 0 deletions
@@ -14,6 +14,17 @@

 ### 🐛 Bug Fixes

+#### Critical SLURM Plugin Fix for Lightweight Mode
+
+- **Job Submission Failure**: Fixed critical bug in SLURM plugin that caused job submissions to fail with "No partition specified or system default partition" error when using lightweight mode
+- **Root Cause**: The job script template incorrectly placed executable validation code before #SBATCH directives, violating SLURM's parsing requirements. SLURM stops processing #SBATCH options when it encounters the first executable line, causing all subsequent directives (including `--partition`) to be ignored
+- **Template Bug**: The placeholder `#DDS_LIGHTWEIGHT_VALIDATION` appeared in both a comment line and the code section. The `boost::replace_all()` function replaced both occurrences, breaking the comment syntax and injecting executable code before #SBATCH directives
+- **Resolution**:
+  - Removed lightweight validation code from the job script template entirely
+  - Eliminated blank lines between #SBATCH directive placeholders
+  - Validation logic moved to worker nodes where it's actually needed (DDSWorker.sh)
+- **Impact**: SLURM now correctly parses all #SBATCH directives including partition specifications, resource requirements, and job options
+
 #### Critical Worker Package Deployment Fix

 - **DDSWorker.sh Logic Error**: Fixed inverted logic bug that caused worker package deployment to fail when pre-compiled binaries were present
@@ -24,6 +35,26 @@

 ### 🚀 For Users

+#### If You Use SLURM with Lightweight Mode
+
+If you experienced SLURM job submission failures with errors like:
+
+```text
+Batch job submission failed: No partition specified or system default partition
+```
+
+This was caused by a critical bug in the SLURM plugin template that placed executable code before #SBATCH directives. SLURM stopped parsing directives when it encountered this code, ignoring your partition specifications and other settings.
+
+**The fix requires rebuilding DDS:**
+
+```bash
+cd /path/to/DDS/build
+make
+make install
+```
+
+After rebuilding, your SLURM submissions with lightweight mode will work correctly, and all #SBATCH directives (including partition, CPU requirements, etc.) will be properly recognized.
+
 #### If You Use Tools API

 Before this fix, you had to explicitly set the lightweight flag:
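
The parsing rule behind this failure can be illustrated with a small shell function. The sketch below is a simplified model of how `sbatch` scans a script, not SLURM's implementation: it skips blank lines and ordinary comments, collects `#SBATCH` lines, and stops at the first executable line. The function name `honored_sbatch_directives` and the partition name `debug` are hypothetical.

```bash
#!/usr/bin/env bash
# Simplified model (an assumption, not SLURM code): print the #SBATCH lines
# that appear before the first executable line of a script read from stdin.
honored_sbatch_directives() {
    awk '
        /^#SBATCH[ \t]/ { print; next }  # directive seen before any command
        /^[ \t]*$/      { next }         # blank line: skipped
        /^#/            { next }         # shebang or ordinary comment: skipped
                        { exit }         # first executable line: scanning stops
    '
}

# A script shaped like the broken output: a command precedes --partition.
honored_sbatch_directives <<'EOF'
#!/usr/bin/env bash
#SBATCH --nodes=2
echo "Lightweight mode detected. Validating prerequisites..."
#SBATCH --partition=debug
EOF
```

Running the block prints only `#SBATCH --nodes=2`; the `--partition` line after the `echo` is never reached. Redirecting a generated job script into the function gives a quick check of which directives sit ahead of the first command.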

plugins/dds-submit-slurm/src/job.slurm.in

Lines changed: 0 additions & 30 deletions
@@ -3,46 +3,16 @@
 #
 # DDS SLURM Job Execution Workflow:
 #
-# PHASE 1: Resource Allocation & Job Scheduling
-# - SLURM scheduler allocates nodes based on #SBATCH directives
-# - Job script is dispatched to a single allocation node (sbatch host)
-#
-# PHASE 2: Sbatch Host Execution (Single Node)
-# - Script executes sequentially on the sbatch host
-# - Lightweight package validation (#DDS_LIGHTWEIGHT_VALIDATION) runs once
-# * Validates DDS_COMMANDER_BIN_LOCATION environment variable
-# * Validates DDS_COMMANDER_LIBS_LOCATION environment variable
-# * Provides immediate failure feedback if prerequisites are missing
-# - Signal handlers are configured to manage process lifecycle
-#
-# PHASE 3: Parallel Task Distribution (Multiple Nodes)
-# - srun command distributes DDS worker tasks across all allocated nodes
-# - Each allocated node receives and executes DDSWorker.sh scout script
-# - Per-node validation occurs within DDSWorker.sh on each srun host:
-# * Secondary validation of environment variables
-# * DDS agent initialization and registration
-# * Worker-specific error handling and logging
-#
-# PHASE 4: Job Lifecycle Management
-# - Parent process waits for all srun tasks to complete
-# - Signal propagation ensures graceful shutdown of distributed tasks
-# - Job artifacts and logs are collected per-node basis
-#

 #SBATCH --nodes=%DDS_NMININSTANCES%%DDS_NINSTANCES%
 #SBATCH --no-kill

 #SBATCH --job-name=%DDS_SUBMISSION_TAG%
 #SBATCH --chdir=%DDS_JOB_ROOT_WRK_DIR%
-
 #DDS_AGENT_CPU_REQUIREMENT
-
 #DDS_INLINE_CONFIG
-
 #DDS_USER_OPTIONS

-#DDS_LIGHTWEIGHT_VALIDATION
-
 # ignore signals
 # continue waiting for child processes by any means
 trap -- '' SIGINT SIGTERM
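
The removed header comment in this hunk named the placeholder verbatim, `(#DDS_LIGHTWEIGHT_VALIDATION)`, so a global string replace matched it twice. The sketch below reproduces that effect on a hypothetical three-line stand-in for the template. It uses `awk`'s `gsub` in place of `boost::replace_all`, and the one-line replacement text is a shortened stand-in for the removed validation snippet; the partition name `debug` is also an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for job.slurm.in: the placeholder appears once inside
# a header comment and once on its own line after the directives.
template='# - Lightweight package validation (#DDS_LIGHTWEIGHT_VALIDATION) runs once
#SBATCH --partition=debug
#DDS_LIGHTWEIGHT_VALIDATION'

# Replace every occurrence of the placeholder. The replacement begins with a
# newline, like the raw string literal that was removed from main.cpp.
render() {
    printf '%s\n' "$template" |
        awk -v rep='\necho Validating prerequisites' \
            '{ gsub(/#DDS_LIGHTWEIGHT_VALIDATION/, rep); print }'
}

# Line number of the first line that is neither blank nor a comment.
first_executable_line() {
    render | awk '!/^[ \t]*$/ && !/^#/ { print NR; exit }'
}

# Line number of the first #SBATCH directive.
first_sbatch_line() {
    render | awk '/^#SBATCH/ { print NR; exit }'
}

render
echo "first executable line: $(first_executable_line)"
echo "first #SBATCH line:    $(first_sbatch_line)"
```

In the rendered text the first executable line is line 2 and the first `#SBATCH` directive is line 3: the replacement split the header comment and left a command ahead of the directives, which is the ordering that made SLURM ignore them.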

plugins/dds-submit-slurm/src/main.cpp

Lines changed: 0 additions & 45 deletions
@@ -203,51 +203,6 @@ int main(int argc, char* argv[])
     // Replace %DDS_SUBMISSION_TAG%
     boost::replace_all(sSrcScript, "%DDS_SUBMISSION_TAG%", _submit.m_submissionTag);

-    // #DDS_LIGHTWEIGHT_VALIDATION
-    if (isLightweightMode)
-    {
-        string lightweightValidation = R"DELIMITER(
-# Early validation for lightweight mode
-echo "Lightweight mode detected. Validating prerequisites..."
-
-# Check DDS_COMMANDER_BIN_LOCATION
-if [[ -z "${DDS_COMMANDER_BIN_LOCATION}" ]]; then
-    echo "ERROR: DDS_COMMANDER_BIN_LOCATION environment variable is not set"
-    echo "Please set it to point to DDS binaries directory (e.g., /opt/dds/bin)"
-    exit 1
-fi
-
-if [[ ! -d "${DDS_COMMANDER_BIN_LOCATION}" ]]; then
-    echo "ERROR: DDS_COMMANDER_BIN_LOCATION points to non-existent directory: ${DDS_COMMANDER_BIN_LOCATION}"
-    exit 1
-fi
-
-if [[ ! -x "${DDS_COMMANDER_BIN_LOCATION}/dds-agent" ]]; then
-    echo "ERROR: Cannot find dds-agent executable in ${DDS_COMMANDER_BIN_LOCATION}"
-    exit 1
-fi
-
-# Check DDS_COMMANDER_LIBS_LOCATION
-if [[ -z "${DDS_COMMANDER_LIBS_LOCATION}" ]]; then
-    echo "ERROR: DDS_COMMANDER_LIBS_LOCATION environment variable is not set"
-    echo "Please set it to point to DDS libraries directory (e.g., /opt/dds/lib)"
-    exit 1
-fi
-
-if [[ ! -d "${DDS_COMMANDER_LIBS_LOCATION}" ]]; then
-    echo "ERROR: DDS_COMMANDER_LIBS_LOCATION points to non-existent directory: ${DDS_COMMANDER_LIBS_LOCATION}"
-    exit 1
-fi
-
-echo "Lightweight mode prerequisites validated successfully"
-)DELIMITER";
-        boost::replace_all(sSrcScript, "#DDS_LIGHTWEIGHT_VALIDATION", lightweightValidation);
-    }
-    else
-    {
-        boost::replace_all(sSrcScript, "#DDS_LIGHTWEIGHT_VALIDATION", "");
-    }
-
     // Replace %DDS_JOB_ROOT_WRK_DIR%
     string sSandboxDir(smart_path(CUserDefaults::instance().getWrkPkgDir(submissionId)));
     fs::path pathJobWrkDir(sSandboxDir);
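
The release notes describe the checks removed here as belonging on the worker nodes. As an illustration only, and not the actual contents of DDSWorker.sh, the same checks could be written as a shell function that returns a status instead of calling `exit`, so the caller decides how to fail. The function name `validate_lightweight_prereqs` is hypothetical; the variable names and messages come from the removed code.

```bash
#!/usr/bin/env bash
# Sketch under stated assumptions: mirrors the checks removed from main.cpp.
# Returns 0 when both locations are usable, 1 otherwise.
validate_lightweight_prereqs() {
    local var dir
    for var in DDS_COMMANDER_BIN_LOCATION DDS_COMMANDER_LIBS_LOCATION; do
        dir="${!var:-}"   # indirect expansion: value of the variable named by $var
        if [[ -z "$dir" ]]; then
            echo "ERROR: $var environment variable is not set" >&2
            return 1
        fi
        if [[ ! -d "$dir" ]]; then
            echo "ERROR: $var points to non-existent directory: $dir" >&2
            return 1
        fi
    done
    # The agent binary must exist and be executable in the binaries directory.
    if [[ ! -x "${DDS_COMMANDER_BIN_LOCATION}/dds-agent" ]]; then
        echo "ERROR: Cannot find dds-agent executable in ${DDS_COMMANDER_BIN_LOCATION}" >&2
        return 1
    fi
    echo "Lightweight mode prerequisites validated successfully"
}
```

A caller could run `validate_lightweight_prereqs || exit 1` after the script's setup. Because this runs inside a script that is already executing on a node, it cannot interfere with `#SBATCH` parsing.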
