- **train.py**: This file contains the model definition, data loading, training loop, and other essential components for training a GNN.
- **run_4.sh**: This is an example shell script for Perlmutter, demonstrating how to run a Plexus-parallelized GNN on 4 GPUs. It includes placeholders that should be replaced with appropriate values for specific experiments, such as the dataset path, output directory, etc. The script can be adapted to run on different numbers of GPUs and with different datasets.
For example, the script can be launched using:
```bash
sbatch run_4.sh 1 1 4 0
```
This would execute the training with a 3D parallelism configuration of (X, Y, Z) = (1, 1, 4) for trial number 0. The trial number is often used to differentiate output files from multiple runs.
- **get_rank.sh**: This shell script is used to set the ranks for the GPUs involved in the distributed training process. It also limits the core dump file size to 0.
- **parse_results.py**: This Python script contains the `process_log_file` function, which can be used to parse the timing results from the output log file generated by a training run.
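As a rough illustration, a parser of this kind might look like the following. Note that the log-line format and the return value here are assumptions for the sake of example, not the actual implementation in `parse_results.py`:

```python
import re

def process_log_file(path):
    """Collect per-epoch timings from an output log.

    Assumes (hypothetically) lines of the form 'epoch 3 time: 12.34';
    the real log format used by Plexus may differ.
    """
    times = []
    pattern = re.compile(r"epoch\s+\d+\s+time:\s+([0-9.]+)")
    with open(path) as f:
        for line in f:
            match = pattern.search(line)
            if match:
                times.append(float(match.group(1)))
    return times
```

Collecting the timings into a list makes it easy to drop warm-up epochs and average the rest when comparing runs.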
- `export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"`: This can be set if there are warnings about memory fragmentation, which can cause GPU OOM issues.
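The same allocator setting can also be applied from Python instead of the shell; a small sketch mirroring the `export` above:

```python
import os

# Must be set before `import torch`, because the CUDA caching allocator
# reads PYTORCH_CUDA_ALLOC_CONF when it initializes.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
```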
- `NNZ`: number of nonzeros in the graph's adjacency matrix
- `D_list`: list of feature sizes at each layer, excluding the number of classes (e.g., 3 GCN layers with a hidden dimension of 128 and an input feature size of 100 gives [100, 128, 128])
- `coef`: coefficients that the three terms of the model are multiplied by to get times in ms (the default coefficients don't produce meaningful times, but they do give an ordering of the configs)
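To make the role of `coef` concrete, here is a hypothetical sketch of how a three-term cost model could be evaluated. The three terms below are placeholders (the model's actual terms are not described here); only the weighted-sum structure and the use of `NNZ`, `D_list`, and `coef` come from the description above:

```python
def model_time_ms(nnz, d_list, coef):
    """Weighted sum of three cost terms.

    The terms are hypothetical stand-ins, not the model's actual terms;
    they only illustrate how coef weights each term.
    """
    term_compute = sum(nnz * d for d in d_list)  # e.g., SpMM-like work per layer
    term_volume = sum(d_list)                    # e.g., a communication-volume proxy
    term_latency = len(d_list)                   # e.g., a per-layer latency count
    return coef[0] * term_compute + coef[1] * term_volume + coef[2] * term_latency
```

Because only the relative ordering matters with the default coefficients, the estimates are best used to rank candidate configurations rather than to predict absolute runtimes.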