Merged

Changes from 28 commits · 60 commits
42effdb
add scripts to process raw dataset
ghar1821 Jan 15, 2026
83ceda2
editing config to set apptainer cache dir
ghar1821 Jan 15, 2026
f3019a9
editing pre-run scripts and trying to fix R methods not running.
ghar1821 Jan 16, 2026
f664c48
add h5py to setup
ghar1821 Jan 16, 2026
338aa23
reverting changes to setup
ghar1821 Jan 16, 2026
f41f6f4
separate submit scripts
ghar1821 Jan 16, 2026
66c348e
finally the first setting that works!!!!
ghar1821 Jan 16, 2026
fa1ade7
update config and settings for control methods
ghar1821 Jan 16, 2026
5902468
adjusted resources for metrics and methods
ghar1821 Jan 16, 2026
7497ea4
update cytovi to use A30 gpu
ghar1821 Jan 16, 2026
cf3d35b
add numba cache dir export to allow jit caching
ghar1821 Jan 16, 2026
35e3fdb
update cytovi implementation
ghar1821 Jan 16, 2026
bad078d
force recompute for all cytonorm
ghar1821 Jan 16, 2026
bdcbf46
add temp dir resolution for hpc
ghar1821 Jan 16, 2026
7bada43
remove transpose from harmonypy
ghar1821 Jan 16, 2026
6793f3d
adding support for hpc
ghar1821 Jan 16, 2026
aa4b07a
update temp dir again
ghar1821 Jan 16, 2026
44def10
latest config file that works reasonably well with hpc
ghar1821 Jan 16, 2026
fc4df26
add some job submit scripts for SLURM
ghar1821 Jan 16, 2026
bcc7ddb
update tmp_path for cytonorm
ghar1821 Jan 16, 2026
9dae0b6
redirect numba cache dir away from /tmp and to its own folder.
ghar1821 Jan 16, 2026
379b3dd
update batch adjust non control samples naming
ghar1821 Jan 17, 2026
d062157
fix bug in perfect integration subsetting
ghar1821 Jan 17, 2026
7627554
fix bug where we can't replace the batch column if it is not integer
ghar1821 Jan 17, 2026
5ff088b
fix bug where the donor loc are somewhat mismatched..
ghar1821 Jan 17, 2026
2864b9c
update ratio inconsistent peak where corrected data return only zero
ghar1821 Jan 17, 2026
ddc57cf
Update script.py
ghar1821 Jan 17, 2026
2312cb2
update scripts
ghar1821 Jan 17, 2026
8acaafc
Merge branch 'main' into setup_run_hpc
ghar1821 Feb 2, 2026
4e807f1
remove average batch r2 global
ghar1821 Feb 2, 2026
03cc959
add seed setting for cytovi
ghar1821 Feb 2, 2026
ca9ac0a
remove env for viash temp files
ghar1821 Feb 3, 2026
fa20205
update lisi to allow anndata write
ghar1821 Feb 3, 2026
04a4280
update cycombine
ghar1821 Feb 3, 2026
f337c99
more updates to cycombine
ghar1821 Feb 3, 2026
236fec8
minor change of script type
ghar1821 Feb 3, 2026
0706945
update cytonorm
ghar1821 Feb 3, 2026
f75d01c
fixed gaussnorm
ghar1821 Feb 3, 2026
11cab3e
fixed limma
LuLeom Feb 3, 2026
b1592c4
Fixed harmonypy and combat
LuLeom Feb 3, 2026
f4bff8d
Fixed rPCA
LuLeom Feb 3, 2026
facc520
update batchadjust and add copy to subset
ghar1821 Feb 3, 2026
f27dad7
remove cytovi and some obsolete metrics
ghar1821 Feb 4, 2026
4eaeaac
renamed shuffle control methods
ghar1821 Feb 4, 2026
0e11692
missed label change
ghar1821 Feb 4, 2026
1e2d508
reorganising scripts for hpc
ghar1821 Feb 4, 2026
06a9069
update changelog
ghar1821 Feb 4, 2026
f444132
update changelog again
ghar1821 Feb 4, 2026
ebb18a0
update changelog
ghar1821 Feb 4, 2026
e3d8951
update changelog
ghar1821 Feb 4, 2026
f2d073e
update description.
ghar1821 Feb 4, 2026
3e3af60
manually adding some dependencies for flowCore and flowStats
ghar1821 Feb 4, 2026
10c2c60
update ratio inconsistent peaks
ghar1821 Feb 5, 2026
999ce87
update inconsistent peaks
ghar1821 Feb 6, 2026
558cd5c
add print statements to subset functions
ghar1821 Feb 9, 2026
de15e28
add print statements when writing files out
ghar1821 Feb 9, 2026
6210259
add utility scripts for pulling intermediate files
ghar1821 Feb 11, 2026
9680aa3
update methods and metrics labels
ghar1821 Feb 11, 2026
e8e1339
fix bug where subsetting was not done on ilisi and fsom mapping metrics
ghar1821 Mar 11, 2026
83fe60b
Update CHANGELOG.md
ghar1821 Apr 8, 2026
32 changes: 32 additions & 0 deletions CHANGELOG.md
@@ -67,6 +67,14 @@

* Added new metric `ratio_inconsistent_peaks` (PR #114).

* Added processing scripts for the Lille dataset and removed the ones for the CLL dataset (PR #118).

* Added config and run scripts for running the benchmark on the WEHI HPC (PR #119).
Refer to the pull request on GitHub to see what needs to be set up.
TL;DR: set up the compute environment and the caching directory, then run the warmup job
to pull and create the apptainer images (one job = one method + one metric).
After that, run the main benchmarking job with the config file for the HPC system.

## MAJOR CHANGES

* Updated file schema (PR #18):
@@ -103,6 +111,14 @@

* Update CytoVI (PR #114).

* Update CytoVI to normalise using a minmax scaler fitted on batch 1 post correction (PR #119).

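The minmax normalisation described above can be sketched as follows. This is an illustrative pure-Python version under assumed names, not the actual CytoVI component code:

```python
def fit_minmax(reference):
    """Fit min-max parameters on a reference batch
    (here: batch 1 after correction)."""
    lo, hi = min(reference), max(reference)
    return lo, hi

def apply_minmax(values, lo, hi):
    """Scale values with parameters fitted on the reference batch.
    A constant reference (hi == lo) falls back to a span of 1 so
    the division never hits zero."""
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in values]
```

Fitting once on batch 1 and applying the same `lo`/`hi` to every batch keeps all batches on a common scale, instead of each batch being rescaled independently.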
* Update batchadjust and cytonorm to use the HPC temp dir if the environment variable is set, otherwise
defaulting to what viash sets. This prevents collisions between the temp files of concurrently running jobs (PR #119).

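The fallback logic above amounts to a one-line environment lookup; a minimal sketch, assuming the component receives the viash-provided temp dir as an argument (names illustrative):

```python
import os

def resolve_temp_dir(viash_temp_dir):
    """Prefer the HPC scratch dir exported by the SLURM job
    (HPC_VIASH_META_TEMP_DIR); otherwise fall back to the
    temp dir viash itself provides."""
    return os.environ.get("HPC_VIASH_META_TEMP_DIR", viash_temp_dir)
```
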
* Update ratio inconsistent peaks to handle edge cases where a method returns only zero values
for a marker/cell type/donor combination, making the sd zero and causing a division by zero (PR #119).

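The edge case above reduces to guarding a zero standard deviation before dividing. A pure-Python sketch — the actual metric formula and the value returned for the degenerate case are assumptions here:

```python
import math

def safe_ratio(values, eps=1e-12):
    """Compute (max - mean) / sd, returning 0.0 when the slice is
    constant (e.g. a method outputs only zeros for one
    marker/cell type/donor combination), so a zero sd never
    reaches the division."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if sd < eps:
        return 0.0  # assumption: treat a constant slice as a zero ratio
    return (max(values) - mean) / sd
```
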
## MINOR CHANGES

* Enabled unit tests (PR #2).
@@ -134,6 +150,11 @@

* Removed EMD max from calculation (PR #113).

* Tune the resource requirements for each method (PR #119).
  * Low time, mem, cpu for control methods.
  * Mid time, mem, cpu for most methods, except those below.
  * High (or very high) time, mem, cpu for computationally expensive ones like rPCA.
Contributor comment: computationally expensive maybe(?)
## BUG FIXES

@@ -170,3 +191,14 @@
* Fix bug in EMD vertical where sample combination was malformed (PR #113)

* Fix lisi inconsistent naming (PR #117) for issue #116.

* Fix bug in perfect integration where, if batch is str (not int), only control samples are returned (PR #119).

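The fix pattern for the bug above can be sketched like this (helper name and data layout are illustrative): compare batch values after casting both sides to str, so a str-typed batch column cannot make every comparison fail and leave only the control samples behind.

```python
def non_control_mask(batches, control_batch):
    """True for entries whose batch differs from the control batch.
    Casting both sides to str makes the comparison robust to the
    batch column holding "1" (str) while the control id is 1 (int)."""
    return [str(b) != str(control_batch) for b in batches]
```
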
* Fix bug in batchadjust needing "Batch_" in the sample names for non-control samples (PR #119).

* Fix bug in cytonorm to mid where recompute was set to FALSE. It is now set to TRUE (PR #119).

* Remove transpose in harmonypy as new updates to harmonypy no longer need the transpose (PR #119).

* Fix bug in get_obs_var_for_integrated to handle cases where the batch column in obs is str
and thus can't be directly overridden (the new values given by get_donor_batch_map are int) (PR #119).
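
One way to make such an override robust, sketched under the assumption that the mapping returns int batch ids (function and variable names are illustrative, not the actual helper):

```python
def cast_batch_values(existing_column, donor_batch_map, donors):
    """Cast mapped batch ids to the dtype of the existing obs column,
    so an int-valued map can overwrite a str-typed column without a
    dtype mismatch."""
    target = type(existing_column[0])
    return [target(donor_batch_map[d]) for d in donors]
```
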
22 changes: 22 additions & 0 deletions scripts/create_resources/process_raw_datasets_hpc.sh
@@ -0,0 +1,22 @@
#!/bin/bash

# Script to launch the process-raw-datasets workflow on SLURM via Seqera Tower.
# Leave input_states pointing at the S3 bucket, as the raw dataset files are stored there.

cat > /tmp/params.yaml << 'HERE'
input_states: s3://openproblems-data/resources/task_cyto_batch_integration/datasets_raw/**/state.yaml
rename_keys: 'input:output_dataset'
output_state: '$id/state.yaml'
settings: '{"output_unintegrated": "$id/unintegrated.h5ad", "output_censored_split1": "$id/censored_split1.h5ad", "output_censored_split2": "$id/censored_split2.h5ad"}'
publish_dir: /vast/scratch/users/putri.g/cytobenchmark/benchmark_out_hpc/datasets/
HERE

tw launch https://github.com/openproblems-bio/task_cyto_batch_integration.git \
--revision build/main \
--pull-latest \
--main-script target/nextflow/workflows/process_datasets/main.nf \
--workspace 80689470953249 \
--params-file /tmp/params.yaml \
--entry-name auto \
--config scripts/labels_tw_wehi.config \
--labels task_cyto_batch_integration,process_datasets
168 changes: 168 additions & 0 deletions scripts/labels_tw_wehi.config
@@ -0,0 +1,168 @@
def exitStrat(task, max_attempts = 3) {
println "Determining exit strategy for task (attempt '${task.attempt}', exit status '${task.exitStatus}')"

// if the component has failed max_attempts times, ignore the error so the workflow can continue.
// it's important 'ignore' is returned once max_attempts is reached, even if maxRetries is set
// to the same value, otherwise the workflow will stop
if (task.attempt >= max_attempts) {
return 'ignore'
}

return 'retry'
}

// Let the Nextflow head job manage the Apptainer containers
apptainer {
enabled = true
pullTimeout = '48h'
ociAutoPull = false
cacheDir = '/vast/scratch/users/putri.g/nextflow/apptainer_cache'
envWhitelist = 'APPTAINER_CACHEDIR,APPTAINER_TMPDIR,SINGULARITY_CACHEDIR,SINGULARITY_TMPDIR,TMPDIR,NXF_HOME,NXF_TEMP,NXF_APPTAINER_CACHEDIR,PYTHONPATH,NUMBA_CACHE_DIR,NUMBA_DISABLE_JIT,HPC_VIASH_META_TEMP_DIR'
}

env {
NXF_APPTAINER_CACHEDIR = '/vast/scratch/users/putri.g/nextflow/apptainer_cache'
APPTAINER_CACHEDIR = '/vast/scratch/users/putri.g/nextflow/apptainer_cache'
APPTAINER_TMPDIR = '/vast/scratch/users/putri.g/nextflow/apptainer_tmp'
SINGULARITY_CACHEDIR = '/vast/scratch/users/putri.g/nextflow/apptainer_cache'
SINGULARITY_TMPDIR = '/vast/scratch/users/putri.g/nextflow/apptainer_tmp'
NXF_HOME = '/vast/scratch/users/putri.g/nextflow/nxf_home'
PYTHONPATH = '/root/.local/lib/python3.12/site-packages'
// Add Numba environment variables to fix caching issues in containers
NUMBA_DISABLE_JIT = '0'
}

process {
beforeScript = '''
# Create base directories (shared across tasks)
mkdir -p "$APPTAINER_CACHEDIR" "$NXF_HOME" "$HOME"

# Create task-specific temp directories
export TMPDIR="/vast/scratch/users/putri.g/nextflow/apptainer_tmp/${NXF_TASK_INDEX:-$$}"
export APPTAINER_TMPDIR="${TMPDIR}"
export SINGULARITY_TMPDIR="${TMPDIR}"
export NXF_TEMP="/vast/scratch/users/putri.g/nextflow/nxf_tmp/${NXF_TASK_INDEX:-$$}"
export HPC_VIASH_META_TEMP_DIR="${NXF_TEMP}"
export NUMBA_CACHE_DIR="/vast/scratch/users/putri.g/nextflow/numba_cache/${NXF_TASK_INDEX:-$$}"

mkdir -p "$TMPDIR" "$NXF_TEMP" "$NUMBA_CACHE_DIR"

echo "============================="
echo "Task-specific directories:"
echo "============================="
echo " TMPDIR: $TMPDIR"
echo " APPTAINER_TMPDIR: $APPTAINER_TMPDIR"
echo " SINGULARITY_TMPDIR: $SINGULARITY_TMPDIR"
echo " NXF_TEMP: $NXF_TEMP"
echo " HPC_VIASH_META_TEMP_DIR: $HPC_VIASH_META_TEMP_DIR"
echo " NUMBA_CACHE_DIR: $NUMBA_CACHE_DIR"
echo "============================="
echo "Shared directories:"
echo "============================="
echo " APPTAINER_CACHEDIR: $APPTAINER_CACHEDIR"
echo " NXF_APPTAINER_CACHEDIR: $NXF_APPTAINER_CACHEDIR"
echo " NXF_HOME: $NXF_HOME"
'''.stripIndent()
}


process {
executor = 'slurm'

// Default resources for all processes
cpus = 4
memory = { get_memory( 10.GB * task.attempt ) }
time = '48.h'
disk = 50.GB
queue = 'regular'

// Retry failed tasks; after the final attempt exitStrat ignores the error
// so the rest of the workflow can continue
errorStrategy = { exitStrat(task) }
maxRetries = 3
maxMemory = null

// Resource labels
withLabel: lowcpu { cpus = 5 }
withLabel: midcpu { cpus = 15 }
withLabel: highcpu { cpus = 30 }
withLabel: lowmem { memory = { get_memory( 10.GB * task.attempt ) } }
withLabel: midmem { memory = { get_memory( 30.GB * task.attempt ) } }
withLabel: highmem { memory = { get_memory( 80.GB * task.attempt ) } }
withLabel: veryhighmem { memory = { get_memory( 150.GB * task.attempt ) } }
withLabel: lowtime { time = 2.h }
withLabel: midtime { time = 8.h }
withLabel: hightime { time = 12.h }
withLabel: veryhightime { time = 24.h }
withLabel: lowsharedmem {
containerOptions = { workflow.containerEngine != 'singularity' ? "--shm-size ${String.format("%.0f",task.memory.mega * 0.05)}" : ""}
}
withLabel: midsharedmem {
containerOptions = { workflow.containerEngine != 'singularity' ? "--shm-size ${String.format("%.0f",task.memory.mega * 0.1)}" : ""}
}
withLabel: highsharedmem {
containerOptions = { workflow.containerEngine != 'singularity' ? "--shm-size ${String.format("%.0f",task.memory.mega * 0.25)}" : ""}
}
withLabel: gpu {
cpus = 16
clusterOptions = '--gres=gpu:A30:1'
queue = "gpuq"
containerOptions = { workflow.containerEngine == "singularity" ? '--nv':
( workflow.containerEngine == "docker" ? '--gpus all': null ) }
}
withLabel: midgpu {
cpus = 32
clusterOptions = '--gres=gpu:A30:4'
queue = "gpuq"
containerOptions = { workflow.containerEngine == "singularity" ? '--nv':
( workflow.containerEngine == "docker" ? '--gpus all': null ) }
}
withLabel: highgpu {
cpus = 64
clusterOptions = '--gres=gpu:A30:8'
queue = "gpuq"
containerOptions = { workflow.containerEngine == "singularity" ? '--nv':
( workflow.containerEngine == "docker" ? '--gpus all': null ) }
}
withLabel: biggpu {
cpus = 16
clusterOptions = '--gres=gpu:A100:1'
queue = "gpuq"
containerOptions = { workflow.containerEngine == "singularity" ? '--nv':
( workflow.containerEngine == "docker" ? '--gpus all': null ) }
}

// make sure publishstates gets enough disk space and memory
withName:'.*publishStatesProc' {
memory = '16GB'
disk = '100GB'
}
}

def get_memory(to_compare) {
if (!process.containsKey("maxMemory") || !process.maxMemory) {
return to_compare
}

try {
if (process.containsKey("maxRetries") && process.maxRetries && task.attempt == (process.maxRetries as int)) {
return process.maxMemory
}
else if (to_compare.compareTo(process.maxMemory as nextflow.util.MemoryUnit) == 1) {
return process.maxMemory as nextflow.util.MemoryUnit
}
else {
return to_compare
}
} catch (all) {
println "Error processing memory resources. Please check that process.maxMemory '${process.maxMemory}' and process.maxRetries '${process.maxRetries}' are valid!"
System.exit(1)
}
}

// set tracing file
trace {
enabled = true
overwrite = true
file = "${params.publish_dir}/trace.txt"
}
19 changes: 19 additions & 0 deletions scripts/nextflow_prerun_script.sh
@@ -0,0 +1,19 @@
# Paste this as the pre-run script in Tower when setting up a workflow run on the WEHI HPC.
# Load the module first so the variables below don't get overwritten.
module load nextflow/25.04.2

# Tower pre-run script
export SHARED_SCRATCH="/vast/scratch/users/putri.g/nextflow"

export NXF_APPTAINER_CACHEDIR="$SHARED_SCRATCH/apptainer_cache"
export APPTAINER_CACHEDIR="$SHARED_SCRATCH/apptainer_cache"
export APPTAINER_TMPDIR="$SHARED_SCRATCH/apptainer_tmp"
export APPTAINER_LIBRARYDIR="$SHARED_SCRATCH/apptainer_library"
export SINGULARITY_CACHEDIR="$SHARED_SCRATCH/apptainer_cache"
export SINGULARITY_TMPDIR="$SHARED_SCRATCH/apptainer_tmp"
export TMPDIR="$SHARED_SCRATCH/apptainer_tmp"
export NXF_HOME="$SHARED_SCRATCH/nxf_home"
export NXF_TEMP="$SHARED_SCRATCH/nxf_tmp"
export HOME="$SHARED_SCRATCH/home"

mkdir -p "$APPTAINER_CACHEDIR" "$APPTAINER_TMPDIR" "$APPTAINER_LIBRARYDIR" "$NXF_HOME" "$NXF_TEMP" "$HOME"
31 changes: 31 additions & 0 deletions scripts/run_benchmark/run_full_hpc.sh
@@ -0,0 +1,31 @@
#!/bin/bash

# exit immediately if any command fails
set -e

# get the root of the repository
REPO_ROOT=$(git rev-parse --show-toplevel)

# ensure that the commands below run from the root of the repository
cd "$REPO_ROOT"

# generate a unique id
RUN_ID="run_$(date +%Y-%m-%d_%H-%M-%S)"
publish_dir="/vast/scratch/users/putri.g/cytobenchmark/benchmark_out_hpc/results/${RUN_ID}"

# write the parameters to file
cat > /tmp/params.yaml << HERE
input_states: /vast/scratch/users/putri.g/cytobenchmark/benchmark_out_hpc/datasets/**/state.yaml
rename_keys: 'input_censored_split1:output_censored_split1;input_censored_split2:output_censored_split2;input_unintegrated:output_unintegrated'
output_state: "state.yaml"
publish_dir: "$publish_dir"
HERE

tw launch https://github.com/openproblems-bio/task_cyto_batch_integration.git \
--revision build/setup_run_hpc \
--pull-latest \
--main-script target/nextflow/workflows/run_benchmark/main.nf \
--workspace 80689470953249 \
--params-file /tmp/params.yaml \
--entry-name auto \
--config scripts/labels_tw_wehi.config \
--labels task_cyto_batch_integration,full
34 changes: 34 additions & 0 deletions scripts/run_benchmark/run_subset_hpc.sh
@@ -0,0 +1,34 @@
#!/bin/bash

# run only a subset of methods/metrics on the HPC

# exit immediately if any command fails
set -e

# get the root of the repository
REPO_ROOT=$(git rev-parse --show-toplevel)

# ensure that the commands below run from the root of the repository
cd "$REPO_ROOT"

# generate a unique id
RUN_ID="run_$(date +%Y-%m-%d_%H-%M-%S)"
publish_dir="/vast/scratch/users/putri.g/cytobenchmark/benchmark_out_hpc/results/${RUN_ID}"

# write the parameters to file
cat > /tmp/params.yaml << HERE
input_states: /vast/scratch/users/putri.g/cytobenchmark/benchmark_out_hpc/datasets/**/state.yaml
rename_keys: 'input_censored_split1:output_censored_split1;input_censored_split2:output_censored_split2;input_unintegrated:output_unintegrated'
output_state: "state.yaml"
settings: '{"metrics_include": ["lisi"], "methods_include": ["combat"]}'
publish_dir: "$publish_dir"
HERE

tw launch https://github.com/openproblems-bio/task_cyto_batch_integration.git \
--revision build/setup_run_hpc \
--pull-latest \
--main-script target/nextflow/workflows/run_benchmark/main.nf \
--workspace 80689470953249 \
--params-file /tmp/params.yaml \
--entry-name auto \
--config scripts/labels_tw_wehi.config \
--labels task_cyto_batch_integration,combat,test