| 1 | +# OpenFE SLURM Scripts |
| 2 | + |
| 3 | +Scripts for running OpenFE RBFE/RSFE transformations on Sherlock via Apptainer. |
| 4 | + |
| 5 | +**Requires OpenFE >= 1.10.0** (for `--resume` checkpoint support). |
| 6 | + |
| 7 | +## Prerequisites |
| 8 | + |
| 9 | +``` |
| 10 | +transformations/*.json # generated by rbfe.ipynb or openfe plan-rbfe-network |
| 11 | +``` |
| 12 | + |
| 13 | +Copy `quickrun.sh`, `quickrun.sbatch`, and `restart.sh` to your working directory |
| 14 | +(same level as `transformations/`). |
| 15 | + |
| 16 | +## Usage |
| 17 | + |
| 18 | +### Submit all transformations |
| 19 | + |
| 20 | +```bash |
| 21 | +# 1 repeat per transformation (default) |
| 22 | +./quickrun.sh |
| 23 | + |
| 24 | +# 3 independent repeats per transformation |
| 25 | +./quickrun.sh -r 3 |
| 26 | +``` |
| 27 | + |
| 28 | +Each transformation is submitted as a SLURM array job. With `-r 3`, each JSON |
| 29 | +is submitted with `sbatch --array=0-2`, producing three independent replicas |
| 29 | +(`replica_0` through `replica_2`). |
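
The submission step described above can be sketched roughly as follows. This is an illustrative outline, not the actual `quickrun.sh`: the function names `array_spec` and `submit_all`, the `SBATCH` override variable, and passing the JSON path as a positional argument to `quickrun.sbatch` are all assumptions. By default it only prints the `sbatch` commands (dry run).

```bash
# Hypothetical sketch of the submission loop; not the actual quickrun.sh.

# Build the SLURM --array range for N repeats: 3 -> "0-2", 1 -> "0-0".
array_spec() {
    local repeats=$1
    echo "0-$((repeats - 1))"
}

# Emit one sbatch command per transformation JSON.
# Dry run by default; set SBATCH=sbatch to actually submit (assumed convention).
submit_all() {
    local repeats=$1 dir=${2:-transformations} runner=${SBATCH:-echo sbatch}
    local json
    for json in "$dir"/*.json; do
        [ -e "$json" ] || continue   # glob matched nothing
        # Passing the JSON as a positional argument is an assumption.
        $runner --array="$(array_spec "$repeats")" quickrun.sbatch "$json"
    done
}
```

Calling `submit_all 3` in a directory containing `transformations/` would print one `sbatch --array=0-2 ...` line per JSON file.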
| 30 | + |
| 31 | +### Restart failed transformations |
| 32 | + |
| 33 | +```bash |
| 34 | +# Auto-detect repeat count from existing results |
| 35 | +./restart.sh |
| 36 | + |
| 37 | +# Or specify explicitly |
| 38 | +./restart.sh -r 3 |
| 39 | +``` |
| 40 | + |
| 41 | +`restart.sh` checks each replica for a non-empty result JSON. It skips replicas |
| 42 | +that are already complete or still queued/running in SLURM, and resubmits only |
| 43 | +the failed ones. |
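
A minimal sketch of the completeness check, assuming the `results/<transformation_name>/replica_<n>/<transformation_name>.json` layout from the output structure in this README. The helper names `replica_complete` and `failed_replicas` are hypothetical, and the check against the SLURM queue that `restart.sh` also performs is left out here.

```bash
# Hypothetical helpers; not the actual restart.sh.

# A replica counts as complete when its result JSON exists and is non-empty.
replica_complete() {
    local name=$1 idx=$2 root=${3:-results}
    [ -s "$root/$name/replica_$idx/$name.json" ]
}

# Print the replica indices (0..repeats-1) that have no usable result yet.
failed_replicas() {
    local name=$1 repeats=$2 root=${3:-results} idx
    for ((idx = 0; idx < repeats; idx++)); do
        replica_complete "$name" "$idx" "$root" || echo "$idx"
    done
}
```

The `-s` test is true only for files that exist and have a size greater than zero, so an empty JSON left behind by a crashed run is treated as a failure.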
| 44 | + |
| 45 | +### Preemption handling |
| 46 | + |
| 47 | +Jobs on the `owners` partition are automatically requeued when preempted. |
| 48 | +`quickrun.sbatch` uses `openfe quickrun --resume`, which detects existing |
| 49 | +checkpoint data in the `-d` directory and continues from the last checkpoint |
| 50 | +instead of starting over. Stale result JSONs are removed before each run so |
| 51 | +`-o` does not conflict. |
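
The per-replica step might look roughly like the sketch below. It is not the actual `quickrun.sbatch`: the function name `run_replica` and the `OPENFE` override variable are assumptions, and the command is only printed unless `OPENFE=openfe` is set. The `-d`, `-o`, and `--resume` options are the ones named in this README.

```bash
# Hypothetical sketch of the per-replica step; not the actual quickrun.sbatch.
run_replica() {
    local json=$1 idx=$2 root=${3:-results} runner=${OPENFE:-echo openfe}
    local name outdir
    name=$(basename "$json" .json)
    outdir="$root/$name/replica_$idx"
    mkdir -p "$outdir"
    # Remove a stale result file so -o does not conflict;
    # shared_*/ checkpoint data is left in place for --resume.
    rm -f "$outdir/$name.json"
    $runner quickrun "$json" -d "$outdir" -o "$outdir/$name.json" --resume
}
```

Inside an array job the replica index would typically come from `SLURM_ARRAY_TASK_ID`, for example `run_replica "$json" "${SLURM_ARRAY_TASK_ID:-0}"`.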
| 52 | + |
| 53 | +## Output structure |
| 54 | + |
| 55 | +``` |
| 56 | +results/ |
| 57 | + <transformation_name>/ |
| 58 | + replica_0/ |
| 59 | + <transformation_name>.json # final result |
| 60 | + shared_*/ # checkpoint + simulation data |
| 61 | + replica_1/ |
| 62 | + ... |
| 63 | +logs/ |
| 64 | + openfe_<jobid>.out |
| 65 | + openfe_<jobid>.err |
| 66 | +``` |
| 67 | + |
| 68 | +## Scripts |
| 69 | + |
| 70 | +| Script | Purpose | |
| 71 | +|--------|---------| |
| 72 | +| `quickrun.sh` | Submit all `transformations/*.json` as SLURM array jobs | |
| 73 | +| `quickrun.sbatch` | SLURM batch script: starts MPS, runs `openfe quickrun --resume` | |
| 74 | +| `restart.sh` | Resubmit only failed/incomplete replicas (skips queued jobs) | |
| 75 | + |
| 76 | +## Notes |
| 77 | + |
| 78 | +- **CUDA MPS** is started automatically in `quickrun.sbatch` to work around |
| 79 | + Sherlock's `Exclusive_Process` GPU mode, which otherwise prevents openmmtools' |
| 80 | + ContextCache from holding multiple CUDA contexts. |
| 81 | +- The `--resume` flag (OpenFE >= 1.10.0) enables checkpoint-based resumption. |
| 82 | + Without it (OpenFE < 1.10.0), each run starts from scratch. |
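
For reference, starting and stopping a per-job MPS daemon can be sketched as below. This is an assumption-laden outline rather than the code in `quickrun.sbatch`: the function names and the directory layout under `$TMPDIR` are hypothetical, while `nvidia-cuda-mps-control` and the two `CUDA_MPS_*` environment variables are the standard NVIDIA MPS interface.

```bash
# Hypothetical sketch; function names and directory layout are assumptions.
start_mps() {
    local base=${1:-${TMPDIR:-/tmp}/mps_${SLURM_JOB_ID:-$$}}
    # Per-job pipe and log directories keep concurrent jobs from colliding.
    export CUDA_MPS_PIPE_DIRECTORY="$base/pipe"
    export CUDA_MPS_LOG_DIRECTORY="$base/log"
    mkdir -p "$CUDA_MPS_PIPE_DIRECTORY" "$CUDA_MPS_LOG_DIRECTORY"
    if command -v nvidia-cuda-mps-control >/dev/null 2>&1; then
        nvidia-cuda-mps-control -d    # start the MPS control daemon
    else
        echo "nvidia-cuda-mps-control not found; skipping MPS" >&2
    fi
}

stop_mps() {
    if command -v nvidia-cuda-mps-control >/dev/null 2>&1; then
        echo quit | nvidia-cuda-mps-control    # ask the daemon to shut down
    fi
}
```

A batch script would call `start_mps` before the simulation and `stop_mps` afterwards, for instance from an `EXIT` trap.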