Commit f149609
Add OpenFE SLURM scripts and documentation for RBFE transformations
- Introduced `quickrun.sh`, `quickrun.sbatch`, and `restart.sh` scripts for submitting and managing OpenFE RBFE transformations on Sherlock.
- Updated documentation in `AGENTS.md`, `CLAUDE.md`, and `scripts/openfe/README.md` to include detailed usage instructions and requirements for the new scripts.
- Enhanced `scripts/openfe/README.md` with prerequisites, output structure, and preemption handling information to assist users in effectively utilizing the scripts.
1 parent 729da70 commit f149609

5 files changed

Lines changed: 137 additions & 4 deletions

.cursor/rules/shell-scripts.mdc

Lines changed: 3 additions & 1 deletion
@@ -13,5 +13,7 @@ alwaysApply: false
 - `scripts/gromacs/mdrun/` — production run scripts and MDP files
 - `scripts/gromacs/analysis/` — gmx analysis commands
 - `scripts/amber/` — AMBER/AmberTools workflows
-- `scripts/openfe/` — OpenFE RBFE workflows
+- `scripts/openfe/` — OpenFE RBFE workflows (quickrun.sh, quickrun.sbatch, restart.sh)
 - Sherlock (SLURM) variants go in a `sherlock/` subdirectory with `.sbatch` extension.
+- OpenFE scripts require >= 1.10.0 for `--resume` checkpoint support.
+- `quickrun.sbatch` starts CUDA MPS for Sherlock's `Exclusive_Process` GPU mode — do not remove the MPS setup block.

AGENTS.md

Lines changed: 15 additions & 0 deletions
@@ -47,6 +47,21 @@ Source is under `src/mdpp/` using the src-layout convention:
 
 Shell scripts (analysis wrappers, runtime helpers, build scripts, etc.) live in the top-level `scripts/` directory (not packaged).
 
+### OpenFE Scripts (`scripts/openfe/`)
+
+SLURM submission scripts for running OpenFE RBFE transformations on Sherlock.
+**Requires OpenFE >= 1.10.0** for `--resume` checkpoint support.
+
+| Script | Purpose |
+|---|---|
+| `quickrun.sh` | Submit all `transformations/*.json` as SLURM array jobs (`-r N` for repeats) |
+| `quickrun.sbatch` | Batch script: starts CUDA MPS, runs `openfe quickrun --resume` via Apptainer |
+| `restart.sh` | Resubmit only failed/incomplete replicas (skips queued jobs) |
+
+- CUDA MPS is required for Sherlock's `Exclusive_Process` GPU mode (openmmtools needs multiple CUDA contexts).
+- `--resume` enables checkpoint-based resumption after preemption on `owners` partition.
+- Output goes to `results/<name>/replica_<id>/`.
+
 Tests live in `tests/analysis/`, `tests/plots/`, and `tests/chem/`, mirroring the source tree.
 
 ## Mandatory Conventions

CLAUDE.md

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@ scripts/ # shell scripts (NOT packaged, copy to MD working directori
 │ ├── postprocessing/ # gmx_postprocessing.sh
 │ ├── runtime/ # check_status.sh, restart.sh, extend.sh, export.sh
 │ └── visualization/ # pymol_movie.pml
-└── openfe/ # quickrun.sh, quickrun.sbatch
+└── openfe/ # quickrun.sh, quickrun.sbatch, restart.sh
 
 tests/ # mirrors src/ layout (tests/analysis/, tests/plots/, tests/chem/)
 notebooks/ # Jupyter notebooks for interactive analysis

docs/guide/scripts.md

Lines changed: 36 additions & 2 deletions
@@ -83,5 +83,39 @@ SLURM batch scripts are in `sherlock/` subdirectories within each category:
 
 ### OpenFE
 
-- `scripts/openfe/quickrun.sh` -- Batch submission wrapper for OpenFE transformations
-- `scripts/openfe/quickrun.sbatch` -- Apptainer-based OpenFE execution on Sherlock
+**Requires OpenFE >= 1.10.0** for checkpoint-based resumption (`--resume`).
+
+| Script | Description |
+|---|---|
+| `quickrun.sh` | Submit all `transformations/*.json` as SLURM array jobs |
+| `quickrun.sbatch` | SLURM batch script: starts CUDA MPS, runs `openfe quickrun --resume` via Apptainer |
+| `restart.sh` | Resubmit only failed/incomplete replicas (skips queued jobs) |
+
+#### Quick start
+
+```bash
+# Copy scripts to your working directory (alongside transformations/)
+cp scripts/openfe/{quickrun.sh,quickrun.sbatch,restart.sh} /path/to/workdir/
+cd /path/to/workdir/
+
+# Submit with 3 independent repeats per transformation
+./quickrun.sh -r 3
+
+# After failures or preemption, resubmit only incomplete replicas
+./restart.sh
+```
+
+#### Output structure
+
+```
+results/<transformation_name>/replica_0/
+results/<transformation_name>/replica_1/
+results/<transformation_name>/replica_2/
+```
+
+#### Preemption handling
+
+Jobs on the `owners` partition are automatically requeued when preempted.
+`quickrun.sbatch` uses `--resume` so requeued jobs continue from the last
+checkpoint instead of starting over. CUDA MPS is started automatically to
+work around Sherlock's `Exclusive_Process` GPU mode.

scripts/openfe/README.md

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
+# OpenFE SLURM Scripts
+
+Scripts for running OpenFE RBFE/RSFE transformations on Sherlock via Apptainer.
+
+**Requires OpenFE >= 1.10.0** (for `--resume` checkpoint support).
+
+## Prerequisites
+
+```
+transformations/*.json # generated by rbfe.ipynb or openfe plan-rbfe-network
+```
+
+Copy `quickrun.sh`, `quickrun.sbatch`, and `restart.sh` to your working directory
+(same level as `transformations/`).
+
+## Usage
+
+### Submit all transformations
+
+```bash
+# 1 repeat per transformation (default)
+./quickrun.sh
+
+# 3 independent repeats per transformation
+./quickrun.sh -r 3
+```
+
+Each transformation is submitted as a SLURM array job. With `-r 3`, each JSON
+gets `sbatch --array=0-2`, producing independent replicas.
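
Review note: the per-transformation submission described here could look roughly like the sketch below. It is a minimal illustration, not the committed `quickrun.sh`. The helper names `array_spec` and `submit_all`, the `DRY_RUN` switch, and passing the JSON path as the batch script's first argument are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; the committed quickrun.sh may differ.

# Map a repeat count N to a zero-based SLURM array range "0-(N-1)".
array_spec() {
    local repeats=$1
    echo "0-$((repeats - 1))"
}

# Submit one array job per transformation JSON.
# With DRY_RUN=1 the sbatch command is printed instead of executed.
submit_all() {
    local repeats=${1:-1} json
    for json in transformations/*.json; do
        [ -e "$json" ] || continue   # the glob matched nothing
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "sbatch --array=$(array_spec "$repeats") quickrun.sbatch $json"
        else
            sbatch --array="$(array_spec "$repeats")" quickrun.sbatch "$json"
        fi
    done
}
```

Setting `DRY_RUN=1` and calling `submit_all 3` prints one `sbatch --array=0-2 ...` line per JSON, which is a cheap way to check the range before anything reaches the queue.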
+
+### Restart failed transformations
+
+```bash
+# Auto-detect repeat count from existing results
+./restart.sh
+
+# Or specify explicitly
+./restart.sh -r 3
+```
+
+`restart.sh` checks each replica for a non-empty result JSON. It skips replicas
+that are already complete or still queued/running in SLURM, and resubmits only
+the failed ones.
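
Review note: the completeness test described here can be sketched as below, assuming the result file sits at `results/<name>/replica_<id>/<name>.json` as shown in the README's output structure. The function names are hypothetical, and the check for jobs still queued or running in SLURM is deliberately left out of the sketch.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; the committed restart.sh may differ.

# Succeed if the replica directory holds a non-empty result JSON.
# Assumed layout: results/<name>/replica_<id>/<name>.json
replica_complete() {
    local name=$1 replica=$2
    [ -s "results/${name}/replica_${replica}/${name}.json" ]
}

# Print the ids (0..repeats-1) of replicas that still lack a result.
incomplete_replicas() {
    local name=$1 repeats=$2 i=0
    while [ "$i" -lt "$repeats" ]; do
        replica_complete "$name" "$i" || echo "$i"
        i=$((i + 1))
    done
}
```

Using `[ -s ... ]` rather than `[ -e ... ]` treats an empty JSON left by an interrupted write as incomplete, which matches the "non-empty result JSON" wording.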
+
+### Preemption handling
+
+Jobs on the `owners` partition are automatically requeued when preempted.
+`quickrun.sbatch` uses `openfe quickrun --resume`, which detects existing
+checkpoint data in the `-d` directory and continues from the last checkpoint
+instead of starting over. Stale result JSONs are removed before each run so
+`-o` does not conflict.
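
Review note: a minimal sketch of the cleanup-then-resume step, assuming the result JSON and the `-d` work directory both live under `results/<name>/replica_<id>/`. The function name `prepare_replica` is hypothetical, the Apptainer wrapper is omitted, and the command is echoed rather than run. `--resume` is taken from this README and requires OpenFE >= 1.10.0.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; the committed quickrun.sbatch may differ.

# Remove a leftover result JSON and print the quickrun command for one replica.
# Checkpoint data in the replica directory is left untouched so --resume can use it.
prepare_replica() {
    local json=$1 replica=$2
    local name outdir
    name=$(basename "$json" .json)
    outdir="results/${name}/replica_${replica}"
    mkdir -p "$outdir"
    rm -f "${outdir}/${name}.json"    # a stale result would clash with -o
    echo "openfe quickrun $json -o ${outdir}/${name}.json -d $outdir --resume"
}
```

Inside an array job the replica id would typically come from `$SLURM_ARRAY_TASK_ID`, for example `prepare_replica transformations/foo.json "$SLURM_ARRAY_TASK_ID"`.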
+
+## Output structure
+
+```
+results/
+<transformation_name>/
+replica_0/
+<transformation_name>.json # final result
+shared_*/ # checkpoint + simulation data
+replica_1/
+...
+logs/
+openfe_<jobid>.out
+openfe_<jobid>.err
+```
+
+## Scripts
+
+| Script | Purpose |
+|--------|---------|
+| `quickrun.sh` | Submit all `transformations/*.json` as SLURM array jobs |
+| `quickrun.sbatch` | SLURM batch script: starts MPS, runs `openfe quickrun --resume` |
+| `restart.sh` | Resubmit only failed/incomplete replicas (skips queued jobs) |
+
+## Notes
+
+- **CUDA MPS** is started automatically in `quickrun.sbatch` to work around
+Sherlock's `Exclusive_Process` GPU mode, which otherwise prevents openmmtools'
+ContextCache from holding multiple CUDA contexts.
+- The `--resume` flag (OpenFE >= 1.10.0) enables checkpoint-based resumption.
+Without it (OpenFE < 1.10.0), each run starts from scratch.
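
Review note: for readers unfamiliar with the MPS setup the first note refers to, a block of this kind usually looks something like the sketch below. It is an assumption about the general shape, not a copy of the block in `quickrun.sbatch`. The function names, the directory names, and the `MPS_DRY_RUN` switch are hypothetical; the environment variables and the `nvidia-cuda-mps-control` commands are the standard NVIDIA ones.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; the MPS block in the committed quickrun.sbatch may differ.

# Point the MPS daemon at per-job pipe and log directories, then start it.
# With MPS_DRY_RUN=1 the directories are prepared but the daemon is not launched.
start_mps() {
    local base=$1
    export CUDA_MPS_PIPE_DIRECTORY="${base}/mps-pipe"
    export CUDA_MPS_LOG_DIRECTORY="${base}/mps-log"
    mkdir -p "$CUDA_MPS_PIPE_DIRECTORY" "$CUDA_MPS_LOG_DIRECTORY"
    if [ "${MPS_DRY_RUN:-0}" != 1 ]; then
        nvidia-cuda-mps-control -d
    fi
}

# Ask the daemon to shut down; intended for the end of the job or an EXIT trap.
stop_mps() {
    echo quit | nvidia-cuda-mps-control
}
```

Keeping the pipe and log directories per job (for example under a job-specific scratch path) avoids collisions when several jobs land on the same node.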
