(Just keeping notes for myself. Maybe someone else will find these useful as well.)
Running the script on alihlt
Version with virtualenv
Prerequisites - to be installed by an admin:
- python3-virtualenv
- graphviz
- ROOT prerequisites
Running
1. Install ROOT 6. Add to ~/.bashrc:
export PATH=/opt/rocm/bin:$PATH
export ALIBUILD_WORK_DIR="$HOME/alice/sw"
eval "`alienv shell-helper`"
Reload shell.
2. Add PYTHONPATH=/home/${LOGNAME}/.virtualenvs/tpcwithdnn/lib/python3.6/site-packages/:$PYTHONPATH to load.sh:89 and comment out the LD_LIBRARY_PATH line.
3. Copy input data from aliceml and change paths in database*.yml (/home/mkabus/data/...).
4. Run:
alienv enter ROOT/latest
source load.sh
pip uninstall tf-nightly-gpu
pip install tensorflow-rocm
5. In utilities_dnn.py:58 replace pool_type with 1 (this forces AveragePooling3D; MaxPooling3D causes the error "3D pooling doesn't support workspace index mask mode").
6. Change run_parallel to true in database*.yml.
7. In dnn_optimiser.py:58 set the devices explicitly. For 6 devices: self.strategy = MirroredStrategy(devices=["/gpu:0", "/gpu:1", "/gpu:2", "/gpu:3", "/gpu:4", "/gpu:5"])
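The hard-coded device list for MirroredStrategy can also be generated for any number of GPUs. This is only a minimal sketch, assuming the devices are numbered consecutively from 0. The helper name gpu_device_names is hypothetical and not part of the repository.

```python
def gpu_device_names(n_gpus):
    """Return device strings "/gpu:0" ... "/gpu:<n_gpus - 1>".

    Hypothetical helper, assuming GPUs are numbered consecutively from 0.
    """
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    return [f"/gpu:{i}" for i in range(n_gpus)]


if __name__ == "__main__":
    # The 6-device case from the notes.
    print(gpu_device_names(6))
    # Assumed usage in dnn_optimiser.py (needs TensorFlow, so left as a comment):
    # self.strategy = MirroredStrategy(devices=gpu_device_names(6))
```

With this, changing the number of GPUs means changing one argument rather than editing the list by hand.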
Debugging
Comments on a TensorFlow issue
ROCm guide on HIP debugging
ROCm guide on system-level debugging
Profiling
ROCm guide