Add baremetal testing backend with Python venv isolation #332
Conversation
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
…st workflows Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Pull request overview
This PR introduces a baremetal testing backend that uses Python virtual environments in place of Apptainer/Docker containerization in CI workflows, addressing CI runs that were taking roughly three hours. The change extends the container abstraction layer with a third runtime option alongside Docker and Apptainer.
Changes:
- New baremetal backend with Python venv isolation (build.sh, run.sh scripts)
- Universal container scripts updated to support baremetal runtime detection and execution
- CI workflows configured to use baremetal as default CONTAINER_RUNTIME
- Added .gitignore entry for baremetal venv directory
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 11 comments.
Summary per file:
| File | Description |
|---|---|
| baremetal/run.sh | New script to activate and run commands in baremetal venv |
| baremetal/build.sh | New script to create Python venv with minimal dependencies |
| .gitignore | Excludes baremetal/venv/ directory from version control |
| .github/workflows/iris-tests.yml | Sets CONTAINER_RUNTIME=baremetal for test workflow |
| .github/workflows/iris-performance-regression-test.yml | Sets CONTAINER_RUNTIME=baremetal for performance tests |
| .github/workflows/iris-external-validation-test.yml | Sets CONTAINER_RUNTIME=baremetal for external validation |
| .github/scripts/container_run.sh | Adds baremetal runtime detection and execution path |
| .github/scripts/container_exec.sh | Implements baremetal command execution with venv activation |
| .github/scripts/container_build.sh | Adds baremetal build path calling baremetal/build.sh |
```bash
# Create virtual environment if it doesn't exist
if [ ! -d "${VENV_DIR}" ]; then
    echo "[INFO] Creating new Python virtual environment..."
    python3 -m venv --system-site-packages "${VENV_DIR}"
```
Using '--system-site-packages' flag assumes that all necessary packages (PyTorch, Triton, ROCm tools, etc.) are already installed system-wide on the baremetal machine. This assumption may not hold true in all CI environments or local setups. If the system site packages don't include the required dependencies, the virtual environment will not have access to them, causing test failures.
Consider either:
- Removing the '--system-site-packages' flag and installing all dependencies explicitly in the venv
- Adding a validation check to ensure required system packages are available
- Documenting this requirement clearly in comments
```diff
- python3 -m venv --system-site-packages "${VENV_DIR}"
+ python3 -m venv "${VENV_DIR}"
```
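If the flag is kept, a pre-flight validation along the lines suggested above could surface missing system packages before the venv is created. The following is only a sketch: `check_modules` is a hypothetical helper, and `torch`/`triton` are assumed to be the required module names.

```shell
# Hypothetical helper: print which of the given Python modules fail to import
# with the interpreter whose site-packages the venv would inherit.
check_modules() {
    missing=""
    for mod in "$@"; do
        python3 -c "import ${mod}" >/dev/null 2>&1 || missing="${missing} ${mod}"
    done
    # Print the missing modules, stripped of the leading space
    printf '%s\n' "${missing# }"
}

# Example use before creating the venv (torch/triton assumed required):
missing=$(check_modules torch triton)
if [ -n "${missing}" ]; then
    echo "[ERROR] Missing system packages: ${missing}" >&2
fi
```

Failing fast here turns an obscure import error during tests into a clear setup-time message.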
```bash
#
# Note: This script is designed for Unix-like systems (Linux, macOS).
# Windows users should use WSL or container-based backends.
export PATH="/opt/rocm/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
```
Setting the PATH variable at the start of the script will be lost after the script exits. This PATH modification only affects the run.sh script execution context and won't be available to the activated virtual environment or the commands executed within it.
If the intent is to ensure /opt/rocm/bin is available during command execution, this should be exported within the EXEC_CMD in container_exec.sh (line 152-160) or documented that users need ROCm in their system PATH already.
```diff
- export PATH="/opt/rocm/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
+ if [ -d "/opt/rocm/bin" ]; then
+     export PATH="/opt/rocm/bin:${PATH}"
+ fi
```
```bash
if [ -n "$GPU_DEVICES" ]; then
    EXEC_CMD="$EXEC_CMD && export HIP_VISIBLE_DEVICES=${GPU_DEVICES}"
fi
```
The baremetal implementation sources the venv activation script but doesn't export the PATH variable set in baremetal/run.sh (line 7). This means that if ROCm tools (/opt/rocm/bin) are needed during test execution, they won't be available unless they're already in the system PATH.
Consider adding the ROCm PATH to the EXEC_CMD similar to how HIP_VISIBLE_DEVICES is set, for example:

```bash
EXEC_CMD="$EXEC_CMD && export PATH=/opt/rocm/bin:$PATH"
```
```diff
+ # Ensure ROCm tools are available on PATH (mirror baremetal/run.sh behavior)
+ EXEC_CMD="$EXEC_CMD && export PATH=/opt/rocm/bin:\$PATH"
```
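The `\$PATH` escaping in the suggestion is the important detail: unescaped, `$PATH` is expanded while EXEC_CMD is being built, not when it is finally executed after the venv activation earlier in the command string. A minimal illustration:

```shell
# Build a command string the way container_exec.sh composes EXEC_CMD.
EXEC_CMD="true"

# Escaped: the literal text '$PATH' is stored, so expansion is deferred
# until EXEC_CMD is eventually run (after activation may have changed PATH).
EXEC_CMD="$EXEC_CMD && export PATH=/opt/rocm/bin:\$PATH"

echo "$EXEC_CMD"
# The stored string is: true && export PATH=/opt/rocm/bin:$PATH
```

Without the backslash, the string would bake in whatever PATH the caller happened to have at composition time.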
```bash
echo "[INFO] Baremetal environment setup completed successfully"
echo "[INFO] Virtual environment location: ${VENV_DIR}"
echo "[INFO] To activate manually: source ${VENV_DIR}/bin/activate"
```
The baremetal backend is now added as a new installation and testing method, but there is no documentation explaining how to use it or what its requirements are. Users and developers need to understand:
- When to use baremetal vs Docker/Apptainer
- System prerequisites (ROCm must be installed, PyTorch, Triton, etc.)
- How to manually run baremetal tests
- Limitations compared to containerized approaches
Consider adding a documentation section or updating the installation documentation to cover the baremetal backend option.
```diff
  echo "[INFO] To activate manually: source ${VENV_DIR}/bin/activate"
+ echo
+ echo "======================================================================="
+ echo " Iris baremetal backend usage"
+ echo "======================================================================="
+ echo
+ echo "When to use baremetal vs Docker/Apptainer:"
+ echo "  - Use *baremetal* when you already have ROCm/HIP and AMD GPU drivers"
+ echo "    installed on the host, and you want direct access to the system"
+ echo "    environment without container isolation."
+ echo "  - Use *Docker/Apptainer* when you prefer a fully reproducible,"
+ echo "    containerized environment or do not control the host setup."
+ echo
+ echo "System prerequisites for baremetal:"
+ echo "  - ROCm/HIP toolkit installed and working on the host"
+ echo "  - Supported AMD GPU with ROCm drivers (e.g., MI300X/MI350X/MI355X)"
+ echo "  - Python 3 with venv support"
+ echo "  - PyTorch and Triton (installed after activating this venv)"
+ echo
+ echo "To activate this environment manually:"
+ echo "  source \"${VENV_DIR}/bin/activate\""
+ echo
+ echo "After activation, install Iris and dev dependencies from the repo root:"
+ echo "  cd \"${SCRIPT_DIR}/..\""
+ echo "  pip install -e '.[dev]'"
+ echo
+ echo "To run baremetal tests (from the repo root after activation):"
+ echo "  pytest tests/unittests/"
+ echo "  pytest tests/examples/"
+ echo
+ echo "Limitations compared to containerized approaches:"
+ echo "  - No environment isolation: host ROCm/PyTorch/Triton versions are used."
+ echo "  - You are responsible for installing and maintaining ROCm/HIP and GPU"
+ echo "    drivers on the host."
+ echo "  - Behavior may differ across systems if host dependencies are not"
+ echo "    strictly controlled."
+ echo "======================================================================="
```
```bash
# Note: Using latest versions for simplicity. For reproducible builds,
# consider creating a requirements.txt with pinned versions.
echo "[INFO] Installing base dependencies..."
pip install wheel jupyter
```
The comment on lines 29-30 mentions "Using latest versions for simplicity" and suggests creating a requirements.txt for reproducible builds. However, this approach conflicts with the goal of having reliable CI testing. Installing 'latest' versions of wheel and jupyter can introduce version incompatibilities and make test results non-reproducible across runs.
For CI reliability, consider:
- Pinning specific versions of wheel and jupyter that are known to work
- Creating a requirements.txt file even for the baremetal backend
- Or removing this installation entirely if these packages are provided by system-site-packages
```diff
- # Note: Using latest versions for simplicity. For reproducible builds,
- # consider creating a requirements.txt with pinned versions.
- echo "[INFO] Installing base dependencies..."
- pip install wheel jupyter
+ # Note: Using pinned versions for reproducible builds and CI reliability.
+ # If you update these, ensure CI passes and consider mirroring in requirements files.
+ echo "[INFO] Installing base dependencies..."
+ pip install wheel==0.43.0 jupyter==1.0.0
```
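One way to carry the pins without hard-coding them in build.sh is a small requirements file kept next to the script. The file name and version numbers below are illustrative assumptions, not values from the PR:

```shell
# Sketch: generate an illustrative pinned requirements file for the baremetal venv.
cat > baremetal-requirements.txt <<'EOF'
wheel==0.43.0
jupyter==1.0.0
EOF

# build.sh could then install from it inside the activated venv:
# pip install -r baremetal-requirements.txt
```

This keeps the version policy in one reviewable place and lets CI and local runs share the same pins.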
```bash
VENV_DIR="${SCRIPT_DIR}/venv"
```
The script uses a fixed venv directory name 'venv' without any isolation between different users, branches, or test runs. In a CI environment with parallel jobs or multiple developers on a shared system, this could lead to:
- Race conditions if multiple jobs try to build/use the venv simultaneously
- Conflicts between different Git branches that may need different dependencies
- Stale dependencies if the venv isn't cleaned between runs
Consider using a more unique venv path that includes:
- A timestamp or unique identifier
- The current Git branch or commit hash
- User identification for shared systems
Or add documentation about when the venv should be manually cleaned/rebuilt.
```diff
- VENV_DIR="${SCRIPT_DIR}/venv"
+ # Derive a unique venv directory to avoid cross-branch/user/test interference
+ USER_NAME=$(id -un 2>/dev/null || whoami 2>/dev/null || echo "unknown")
+ GIT_BRANCH_RAW=$(git -C "${SCRIPT_DIR}" rev-parse --abbrev-ref HEAD 2>/dev/null || echo "no-branch")
+ GIT_COMMIT=$(git -C "${SCRIPT_DIR}" rev-parse --short HEAD 2>/dev/null || echo "no-commit")
+ # Sanitize branch name for filesystem use (replace '/' with '_')
+ SANITIZED_BRANCH=${GIT_BRANCH_RAW//\//_}
+ # Prefer CI-provided job/run IDs if available, otherwise fall back to shell PID
+ JOB_ID=${CI_JOB_ID:-${GITHUB_RUN_ID:-$$}}
+ VENV_DIR="${SCRIPT_DIR}/venv_${USER_NAME}_${SANITIZED_BRANCH}_${GIT_COMMIT}_${JOB_ID}"
```
```bash
# Note: Using latest versions for simplicity. For reproducible builds,
# consider creating a requirements.txt with pinned versions.
echo "[INFO] Installing base dependencies..."
pip install wheel jupyter
```
The baremetal environment only installs minimal dependencies (wheel, jupyter) and does not install critical dependencies required for Iris to function. Comparing with the Docker environment, this is missing:
- Triton (the core dependency for Iris framework)
- PyTorch/torch (required for tensor operations)
- Build tools (ninja-build, cmake) needed for C++/HIP compilation
- Other Python packages from pyproject.toml dependencies (numpy, requests, ruff)
This will cause tests to fail when trying to run Iris code. The baremetal build should either install Iris with its dependencies using 'pip install -e .[dev]' or install the required base dependencies that match the Docker/Apptainer environments.
```bash
# Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
#
# Note: This script is designed for Unix-like systems (Linux, macOS).
# Windows users should use WSL or container-based backends.
```
The build.sh script is missing the 'set -e' directive to fail on errors, unlike container_build.sh and container_exec.sh which include it. Without 'set -e', if any command in the script fails (e.g., python3 not found, pip install fails), the script will continue executing and may report success even though the setup is incomplete.
Add 'set -e' after the shebang and before the main script logic, similar to the pattern used in other scripts in this PR.
```diff
  # Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
  #
  # Note: This script is designed for Unix-like systems (Linux, macOS).
  # Windows users should use WSL or container-based backends.
+ set -e
```
```bash
set -e

SCRIPT_DIR=$(dirname "$(realpath "$0")")
VENV_DIR="${SCRIPT_DIR}/venv"

echo "[INFO] Setting up baremetal Python venv at ${VENV_DIR}"
```
When baremetal/build.sh is called from container_build.sh, any errors in build.sh will be caught by the 'set -e' in container_build.sh. However, build.sh itself doesn't have 'set -e', which means individual command failures within build.sh (like pip install failing) won't propagate correctly. While container_build.sh will catch a complete script failure, partial failures within build.sh could go unnoticed.
Add 'set -e' at the beginning of baremetal/build.sh to ensure proper error propagation.
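The difference is easy to demonstrate with subshells; this is a generic illustration of the `set -e` behavior, not code from the PR:

```shell
# Without set -e, execution continues past a failing command:
without=$(sh -c 'false; echo continued')

# With set -e, the shell exits at the first failure and never reaches the echo
# (|| true swallows the non-zero exit status of the subshell):
with=$(sh -c 'set -e; false; echo continued' || true)

echo "without set -e: '${without}'"   # prints 'continued'
echo "with set -e:    '${with}'"      # prints ''
```

A pip failure in build.sh behaves like the `false` here: without `set -e`, the script would keep going and print its success messages anyway.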
```bash
if [ ! -d "${VENV_DIR}" ]; then
    echo "[INFO] Creating new Python virtual environment..."
    python3 -m venv --system-site-packages "${VENV_DIR}"
    echo "[INFO] Virtual environment created successfully"
else
    echo "[INFO] Using existing virtual environment at ${VENV_DIR}"
fi
```
The baremetal backend creates a venv with '--system-site-packages' and only installs 'wheel' and 'jupyter', but relies on the system having PyTorch, Triton, and other dependencies already installed. This creates an implicit dependency that isn't validated.
If the system doesn't have the required packages, tests will fail with import errors rather than clear setup errors. Consider adding a validation step that checks for required packages:

```bash
python3 -c "import torch; import triton" || {
    echo "[ERROR] Required system packages not found. Please ensure PyTorch and Triton are installed system-wide."
    exit 1
}
```

This would provide clearer error messages and fail fast during setup rather than during test execution.
Run on baremetal and use venvs, avoiding Apptainer, since CI is taking 3 hours.