Skip to content

Add baremetal testing backend with Python venv isolation#332

Closed
Copilot wants to merge 14 commits intomainfrom
copilot/add-baremetal-testing-backend
Closed

Add baremetal testing backend with Python venv isolation#332
Copilot wants to merge 14 commits intomainfrom
copilot/add-baremetal-testing-backend

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Jan 25, 2026

Run on baremetal and use venvs and avoid Apptainer since CI is talking 3 hours

Copilot AI and others added 5 commits January 25, 2026 01:43
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Copilot AI changed the title [WIP] Add baremetal testing backend using python venv Add baremetal testing backend with Python venv isolation Jan 25, 2026
Copilot AI requested a review from mawad-amd January 25, 2026 01:48
Comment thread .github/scripts/container_build.sh Outdated
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Comment thread baremetal/README.md Outdated
…st workflows

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Comment thread .github/scripts/container_build.sh
Copilot AI and others added 2 commits January 25, 2026 02:51
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Comment thread baremetal/run.sh Outdated
Comment thread baremetal/run.sh Outdated
Comment thread baremetal/run.sh Outdated
@mawad-amd mawad-amd marked this pull request as ready for review January 25, 2026 03:44
@mawad-amd mawad-amd requested a review from neoblizz as a code owner January 25, 2026 03:44
Copilot AI review requested due to automatic review settings January 25, 2026 03:44
@mawad-amd mawad-amd requested a review from BKP as a code owner January 25, 2026 03:44
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a baremetal testing backend using Python virtual environments to replace Apptainer/Docker containerization in CI workflows, addressing 3-hour CI execution times. The change modifies the container abstraction layer to support a third runtime option alongside Docker and Apptainer.

Changes:

  • New baremetal backend with Python venv isolation (build.sh, run.sh scripts)
  • Universal container scripts updated to support baremetal runtime detection and execution
  • CI workflows configured to use baremetal as default CONTAINER_RUNTIME
  • Added .gitignore entry for baremetal venv directory

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
baremetal/run.sh New script to activate and run commands in baremetal venv
baremetal/build.sh New script to create Python venv with minimal dependencies
.gitignore Excludes baremetal/venv/ directory from version control
.github/workflows/iris-tests.yml Sets CONTAINER_RUNTIME=baremetal for test workflow
.github/workflows/iris-performance-regression-test.yml Sets CONTAINER_RUNTIME=baremetal for performance tests
.github/workflows/iris-external-validation-test.yml Sets CONTAINER_RUNTIME=baremetal for external validation
.github/scripts/container_run.sh Adds baremetal runtime detection and execution path
.github/scripts/container_exec.sh Implements baremetal command execution with venv activation
.github/scripts/container_build.sh Adds baremetal build path calling baremetal/build.sh

Comment thread baremetal/build.sh
# Create virtual environment if it doesn't exist
if [ ! -d "${VENV_DIR}" ]; then
echo "[INFO] Creating new Python virtual environment..."
python3 -m venv --system-site-packages "${VENV_DIR}"
Copy link

Copilot AI Jan 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using '--system-site-packages' flag assumes that all necessary packages (PyTorch, Triton, ROCm tools, etc.) are already installed system-wide on the baremetal machine. This assumption may not hold true in all CI environments or local setups. If the system site packages don't include the required dependencies, the virtual environment will not have access to them, causing test failures.

Consider either:

  1. Removing the '--system-site-packages' flag and installing all dependencies explicitly in the venv
  2. Adding a validation check to ensure required system packages are available
  3. Documenting this requirement clearly in comments
Suggested change
python3 -m venv --system-site-packages "${VENV_DIR}"
python3 -m venv "${VENV_DIR}"

Copilot uses AI. Check for mistakes.
Comment thread baremetal/run.sh
#
# Note: This script is designed for Unix-like systems (Linux, macOS).
# Windows users should use WSL or container-based backends.
export PATH="/opt/rocm/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
Copy link

Copilot AI Jan 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting the PATH variable at the start of the script will be lost after the script exits. This PATH modification only affects the run.sh script execution context and won't be available to the activated virtual environment or the commands executed within it.

If the intent is to ensure /opt/rocm/bin is available during command execution, this should be exported within the EXEC_CMD in container_exec.sh (line 152-160) or documented that users need ROCm in their system PATH already.

Suggested change
export PATH="/opt/rocm/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
if [ -d "/opt/rocm/bin" ]; then
export PATH="/opt/rocm/bin:${PATH}"
fi

Copilot uses AI. Check for mistakes.
if [ -n "$GPU_DEVICES" ]; then
EXEC_CMD="$EXEC_CMD && export HIP_VISIBLE_DEVICES=${GPU_DEVICES}"
fi

Copy link

Copilot AI Jan 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The baremetal implementation sources the venv activation script but doesn't export the PATH variable set in baremetal/run.sh (line 7). This means that if ROCm tools (/opt/rocm/bin) are needed during test execution, they won't be available unless they're already in the system PATH.

Consider adding the ROCm PATH to the EXEC_CMD similar to how HIP_VISIBLE_DEVICES is set, for example:
EXEC_CMD="$EXEC_CMD && export PATH=/opt/rocm/bin:$PATH"

Suggested change
# Ensure ROCm tools are available on PATH (mirror baremetal/run.sh behavior)
EXEC_CMD="$EXEC_CMD && export PATH=/opt/rocm/bin:\$PATH"

Copilot uses AI. Check for mistakes.
Comment thread baremetal/build.sh

echo "[INFO] Baremetal environment setup completed successfully"
echo "[INFO] Virtual environment location: ${VENV_DIR}"
echo "[INFO] To activate manually: source ${VENV_DIR}/bin/activate"
Copy link

Copilot AI Jan 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The baremetal backend is now added as a new installation and testing method, but there is no documentation explaining how to use it or what its requirements are. Users and developers need to understand:

  1. When to use baremetal vs Docker/Apptainer
  2. System prerequisites (ROCm must be installed, PyTorch, Triton, etc.)
  3. How to manually run baremetal tests
  4. Limitations compared to containerized approaches

Consider adding a documentation section or updating the installation documentation to cover the baremetal backend option.

Suggested change
echo "[INFO] To activate manually: source ${VENV_DIR}/bin/activate"
echo
echo "======================================================================="
echo " Iris baremetal backend usage"
echo "======================================================================="
echo
echo "When to use baremetal vs Docker/Apptainer:"
echo " - Use *baremetal* when you already have ROCm/HIP and AMD GPU drivers"
echo " installed on the host, and you want direct access to the system"
echo " environment without container isolation."
echo " - Use *Docker/Apptainer* when you prefer a fully reproducible,"
echo " containerized environment or do not control the host setup."
echo
echo "System prerequisites for baremetal:"
echo " - ROCm/HIP toolkit installed and working on the host"
echo " - Supported AMD GPU with ROCm drivers (e.g., MI300X/MI350X/MI355X)"
echo " - Python 3 with venv support"
echo " - PyTorch and Triton (installed after activating this venv)"
echo
echo "To activate this environment manually:"
echo " source \"${VENV_DIR}/bin/activate\""
echo
echo "After activation, install Iris and dev dependencies from the repo root:"
echo " cd \"${SCRIPT_DIR}/..\""
echo " pip install -e '.[dev]'"
echo
echo "To run baremetal tests (from the repo root after activation):"
echo " pytest tests/unittests/"
echo " pytest tests/examples/"
echo
echo "Limitations compared to containerized approaches:"
echo " - No environment isolation: host ROCm/PyTorch/Triton versions are used."
echo " - You are responsible for installing and maintaining ROCm/HIP and GPU"
echo " drivers on the host."
echo " - Behavior may differ across systems if host dependencies are not"
echo " strictly controlled."
echo "======================================================================="

Copilot uses AI. Check for mistakes.
Comment thread baremetal/build.sh
Comment on lines +29 to +32
# Note: Using latest versions for simplicity. For reproducible builds,
# consider creating a requirements.txt with pinned versions.
echo "[INFO] Installing base dependencies..."
pip install wheel jupyter
Copy link

Copilot AI Jan 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment on lines 29-30 mentions "Using latest versions for simplicity" and suggests creating a requirements.txt for reproducible builds. However, this approach conflicts with the goal of having reliable CI testing. Installing 'latest' versions of wheel and jupyter can introduce version incompatibilities and make test results non-reproducible across runs.

For CI reliability, consider:

  1. Pinning specific versions of wheel and jupyter that are known to work
  2. Creating a requirements.txt file even for the baremetal backend
  3. Or removing this installation entirely if these packages are provided by system-site-packages
Suggested change
# Note: Using latest versions for simplicity. For reproducible builds,
# consider creating a requirements.txt with pinned versions.
echo "[INFO] Installing base dependencies..."
pip install wheel jupyter
# Note: Using pinned versions for reproducible builds and CI reliability.
# If you update these, ensure CI passes and consider mirroring in requirements files.
echo "[INFO] Installing base dependencies..."
pip install wheel==0.43.0 jupyter==1.0.0

Copilot uses AI. Check for mistakes.
Comment thread baremetal/build.sh
Comment on lines +8 to +9
VENV_DIR="${SCRIPT_DIR}/venv"

Copy link

Copilot AI Jan 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The script uses a fixed venv directory name 'venv' without any isolation between different users, branches, or test runs. In a CI environment with parallel jobs or multiple developers on a shared system, this could lead to:

  1. Race conditions if multiple jobs try to build/use the venv simultaneously
  2. Conflicts between different Git branches that may need different dependencies
  3. Stale dependencies if the venv isn't cleaned between runs

Consider using a more unique venv path that includes:

  • A timestamp or unique identifier
  • The current Git branch or commit hash
  • User identification for shared systems

Or add documentation about when the venv should be manually cleaned/rebuilt.

Suggested change
VENV_DIR="${SCRIPT_DIR}/venv"
# Derive a unique venv directory to avoid cross-branch/user/test interference
USER_NAME=$(id -un 2>/dev/null || whoami 2>/dev/null || echo "unknown")
GIT_BRANCH_RAW=$(git -C "${SCRIPT_DIR}" rev-parse --abbrev-ref HEAD 2>/dev/null || echo "no-branch")
GIT_COMMIT=$(git -C "${SCRIPT_DIR}" rev-parse --short HEAD 2>/dev/null || echo "no-commit")
# Sanitize branch name for filesystem use (replace '/' with '_')
SANITIZED_BRANCH=${GIT_BRANCH_RAW//\//_}
# Prefer CI-provided job/run IDs if available, otherwise fall back to shell PID
JOB_ID=${CI_JOB_ID:-${GITHUB_RUN_ID:-$$}}
VENV_DIR="${SCRIPT_DIR}/venv_${USER_NAME}_${SANITIZED_BRANCH}_${GIT_COMMIT}_${JOB_ID}"

Copilot uses AI. Check for mistakes.
Comment thread baremetal/build.sh
# Note: Using latest versions for simplicity. For reproducible builds,
# consider creating a requirements.txt with pinned versions.
echo "[INFO] Installing base dependencies..."
pip install wheel jupyter
Copy link

Copilot AI Jan 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The baremetal environment only installs minimal dependencies (wheel, jupyter) and does not install critical dependencies required for Iris to function. Comparing with the Docker environment, this is missing:

  • Triton (the core dependency for Iris framework)
  • PyTorch/torch (required for tensor operations)
  • Build tools (ninja-build, cmake) needed for C++/HIP compilation
  • Other Python packages from pyproject.toml dependencies (numpy, requests, ruff)

This will cause tests to fail when trying to run Iris code. The baremetal build should either install Iris with its dependencies using 'pip install -e .[dev]' or install the required base dependencies that match the Docker/Apptainer environments.

Copilot uses AI. Check for mistakes.
Comment thread baremetal/run.sh
Comment on lines +3 to +6
# Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
#
# Note: This script is designed for Unix-like systems (Linux, macOS).
# Windows users should use WSL or container-based backends.
Copy link

Copilot AI Jan 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The build.sh script is missing the 'set -e' directive to fail on errors, unlike container_build.sh and container_exec.sh which include it. Without 'set -e', if any command in the script fails (e.g., python3 not found, pip install fails), the script will continue executing and may report success even though the setup is incomplete.

Add 'set -e' after the shebang and before the main script logic, similar to the pattern used in other scripts in this PR.

Suggested change
# Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
#
# Note: This script is designed for Unix-like systems (Linux, macOS).
# Windows users should use WSL or container-based backends.
Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
#
# Note: This script is designed for Unix-like systems (Linux, macOS).
# Windows users should use WSL or container-based backends.
set -e

Copilot uses AI. Check for mistakes.
Comment thread baremetal/build.sh
Comment on lines +5 to +10
set -e

SCRIPT_DIR=$(dirname "$(realpath "$0")")
VENV_DIR="${SCRIPT_DIR}/venv"

echo "[INFO] Setting up baremetal Python venv at ${VENV_DIR}"
Copy link

Copilot AI Jan 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When baremetal/build.sh is called from container_build.sh, any errors in build.sh will be caught by the 'set -e' in container_build.sh. However, build.sh itself doesn't have 'set -e', which means individual command failures within build.sh (like pip install failing) won't propagate correctly. While container_build.sh will catch a complete script failure, partial failures within build.sh could go unnoticed.

Add 'set -e' at the beginning of baremetal/build.sh to ensure proper error propagation.

Copilot uses AI. Check for mistakes.
Comment thread baremetal/build.sh
Comment on lines +13 to +19
if [ ! -d "${VENV_DIR}" ]; then
echo "[INFO] Creating new Python virtual environment..."
python3 -m venv --system-site-packages "${VENV_DIR}"
echo "[INFO] Virtual environment created successfully"
else
echo "[INFO] Using existing virtual environment at ${VENV_DIR}"
fi
Copy link

Copilot AI Jan 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The baremetal backend creates a venv with '--system-site-packages' and only installs 'wheel' and 'jupyter', but relies on the system having PyTorch, Triton, and other dependencies already installed. This creates an implicit dependency that isn't validated.

If the system doesn't have the required packages, tests will fail with import errors rather than clear setup errors. Consider adding a validation step that checks for required packages:

python3 -c "import torch; import triton" || {
    echo "[ERROR] Required system packages not found. Please ensure PyTorch and Triton are installed system-wide."
    exit 1
}

This would provide clearer error messages and fail fast during setup rather than during test execution.

Copilot uses AI. Check for mistakes.
@mawad-amd mawad-amd closed this Jan 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants