2 changes: 2 additions & 0 deletions .gitignore
@@ -11,3 +11,5 @@

# Emacs
*~

gpu/install_gpu_driver.sh.d
79 changes: 77 additions & 2 deletions gpu/README.md
@@ -28,8 +28,8 @@ CUDA | Full Version | Driver | cuDNN | NCCL | Tested Dataproc Image Ver
-----| ------------ | --------- | --------- | -------| ---------------------------
11.8 | 11.8.0 | 525.147.05| 9.5.1.17 | 2.21.5 | 2.0, 2.1 (Debian/Ubuntu/Rocky); 2.2 (Ubuntu 22.04)
12.0 | 12.0.1 | 525.147.05| 8.8.1.3 | 2.16.5 | 2.0, 2.1 (Debian/Ubuntu/Rocky); 2.2 (Rocky 9, Ubuntu 22.04)
- 12.4 | 12.4.1 | 550.135 | 9.1.0.70 | 2.23.4 | 2.1 (Ubuntu 20.04, Rocky 8); Dataproc 2.2+
- 12.6 | 12.6.3 | 550.142 | 9.6.0.74 | 2.23.4 | 2.1 (Ubuntu 20.04, Rocky 8); Dataproc 2.2+
+ 12.4 | 12.4.1 | 590.48.01| 9.1.0.70 | 2.23.4 | 2.1 (Ubuntu 20.04, Rocky 8); Dataproc 2.2+
+ 12.6 | 12.6.3 | 590.48.01| 9.6.0.74 | 2.23.4 | 2.1 (Ubuntu 20.04, Rocky 8); Dataproc 2.2+

**Supported Operating Systems:**

@@ -189,6 +189,7 @@ This script accepts the following metadata parameters:
Determines preference for OS-provided vs. NVIDIA-direct drivers.
The script often prioritizes `.run` files or source builds for reliability.
* `cudnn-version`: (Optional) Specify cuDNN version (e.g., `8.9.7.29`).
* `cudnn-install-source`: (Optional) `tarball`|`package`. Default: `package`, except on `2.0-rocky8` and `2.1-rocky8`, where it defaults to `tarball` to avoid intermittent CDN download failures. Determines whether cuDNN is installed via the OS package manager or extracted from the standalone NVIDIA tarball cached in GCS.
* `nccl-version`: (Optional) Specify NCCL version.
* `include-pytorch`: (Optional) `yes`|`no`. Default: `no`.
If `yes`, installs PyTorch, TensorFlow, RAPIDS, and PySpark in a Conda
@@ -289,6 +290,80 @@ handles metric creation and reporting.
older versions of the `report_gpu_metrics.py` service. The current script
and agent versions aim to mitigate this. If encountered, check agent logs.

## Development and Testing

For instructions on how to manually test changes to this initialization action, including iterative development on a live cluster, please see the [TESTING.md](./TESTING.md) guide.

If you are modifying this initialization action, you can use the provided test infrastructure to validate your changes locally before deploying them to production.

### Local Integration Testing (Bazel / Podman)

Before pushing any changes to GitHub, you **must** run the integration tests locally to validate your modifications against the full test matrix (`test_gpu.py`). These tests use `absl.testing.parameterized` and the `integration_tests.dataproc_test_case` framework to spin up ephemeral Dataproc clusters and validate GPU functionality (SINGLE, STANDARD, KERBEROS, MIG, etc.).

We provide a Podman wrapper that executes the Bazel test suite locally in a container that closely mirrors the remote CI sandbox environment.

1. **Credentials:** Ensure you have your Google Cloud Application Default Credentials (ADC) saved locally, typically at `~/.config/gcloud/application_default_credentials.json`, and copy it to `initialization-actions/key.json`.
2. **Environment:** You must have a configured `env.json` in the `gpu/` directory.
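
Both inputs can be checked before launching a long test run. The following is a minimal sketch, assuming it is invoked from the `initialization-actions` root; `check_test_inputs` is a hypothetical helper name and is not part of the repository.

```bash
#!/usr/bin/env bash
# Preflight sketch: confirm key.json and gpu/env.json exist and are non-empty.
# check_test_inputs is a hypothetical helper, not a script shipped in the repo.
check_test_inputs() {
  local repo_root="${1:-.}"
  local missing=0
  local f
  for f in "${repo_root}/key.json" "${repo_root}/gpu/env.json"; do
    if [[ ! -s "${f}" ]]; then
      echo "missing or empty: ${f}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Usage, from the initialization-actions root:
#   check_test_inputs . && echo "inputs present"
```

The check only confirms that the files are present; it does not validate the credentials or the contents of `env.json`.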

To run the full suite in the Podman container (Unfiltered):

> ⚠️ **WARNING: HIGH RESOURCE CONSUMPTION**
> An unfiltered run executes the entire test matrix (currently ~12 shards). Because the script is configured to run up to 10 jobs in parallel, it can provision up to 10 Dataproc clusters concurrently. This requires substantial GCP quota (roughly 900 vCPUs and 30 GPUs at once with `n1-standard-32` profiles) and takes 60-90 minutes.

```bash
cd initialization-actions
# Test a specific Dataproc image version against the full suite
./gpu/run-bazel-tests-with-podman.sh "2.2-ubuntu22"
```
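
The quota figures in the warning above can be sanity-checked with simple arithmetic. This sketch assumes each test cluster has one master and two workers, all `n1-standard-32` with one GPU per node; those per-cluster numbers are assumptions for illustration, and single-node scenarios would use less, so treat the result as an upper bound.

```bash
#!/usr/bin/env bash
# Assumptions (not stated in this README): 3 nodes per cluster
# (1 master + 2 workers), 32 vCPUs and 1 GPU per node.
parallel_jobs=10
nodes_per_cluster=3
vcpus_per_node=32
gpus_per_node=1

total_vcpus=$(( parallel_jobs * nodes_per_cluster * vcpus_per_node ))
total_gpus=$(( parallel_jobs * nodes_per_cluster * gpus_per_node ))
echo "peak vCPUs: ${total_vcpus}, peak GPUs: ${total_gpus}"
# prints "peak vCPUs: 960, peak GPUs: 30"
```

Compare the result against the regional CPU and GPU quotas of the test project before starting an unfiltered run.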

To run a specific test filter to iterate quickly on a failure (Recommended):

```bash
cd initialization-actions

# Filter by a specific test function
./gpu/run-bazel-tests-with-podman.sh "2.2-ubuntu22" "--test_filter=test_gpu_allocation"

# Filter by another specific test function
./gpu/run-bazel-tests-with-podman.sh "2.2-ubuntu22" "--test_filter=test_install_gpu_cuda_nvidia_with_spark_job"

# Filter by the entire class
./gpu/run-bazel-tests-with-podman.sh "2.2-ubuntu22" "--test_filter=NvidiaGpuDriverTestCase"
```
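
When several filters need to be run one after another, it can help to print the exact commands first. The sketch below is a dry run only: `print_filtered_runs` is a hypothetical helper that echoes one wrapper invocation per filter and executes nothing.

```bash
#!/usr/bin/env bash
# Dry-run sketch: print (do not execute) one wrapper invocation per filter.
# print_filtered_runs is a hypothetical helper, not part of the repository.
print_filtered_runs() {
  local image_version="$1"; shift
  local filter
  for filter in "$@"; do
    echo ./gpu/run-bazel-tests-with-podman.sh "${image_version}" "--test_filter=${filter}"
  done
}

print_filtered_runs "2.2-ubuntu22" \
  test_gpu_allocation \
  test_install_gpu_cuda_nvidia_with_spark_job
```

Removing the `echo` would run the filters sequentially, each one provisioning its own clusters.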

### Manual Verification Scripts

If you have already provisioned a Dataproc cluster (e.g., `my-cluster`) and want to verify its GPU configuration without running the full Bazel test suite, you can use the standalone verification scripts.

```bash
# Verify using the local Python script
python3 gpu/verify_external_cluster.py \
--cluster=my-cluster \
--region=us-east4 \
--zone=us-east4-b \
--project=my-project \
--tests smi agent spark torch tf numa

# Or using the bash equivalent
export CLUSTER_NAME=my-cluster PROJECT_ID=my-project REGION=us-east4 ZONE=us-east4-b
./gpu/verify_external_gpu_cluster.sh
```
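
The bash variant reads its target from the environment, so a missing variable is an easy mistake to make. The following sketch checks the four variables exported in the example above before the script is called; `require_env` is a hypothetical helper, and whether the verification script performs its own validation is not stated here.

```bash
#!/usr/bin/env bash
# require_env is a hypothetical helper: it returns non-zero and names every
# listed variable that is unset or empty.
require_env() {
  local name missing=0
  for name in "$@"; do
    if [[ -z "${!name:-}" ]]; then
      echo "required variable not set: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

export CLUSTER_NAME=my-cluster PROJECT_ID=my-project REGION=us-east4 ZONE=us-east4-b
require_env CLUSTER_NAME PROJECT_ID REGION ZONE && echo "environment ok"
```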

### Advanced Spark / ML Validation

For comprehensive validation of Spark RAPIDS, PyTorch, and TensorFlow on a running cluster, an external testing script is available in the associated `cloud-dataproc/gcloud` repository.

```bash
# Configure the gcloud test environment
cd ../cloud-dataproc/gcloud
source lib/env.sh # Populates environment variables from env.json

# Execute the comprehensive Spark GPU test suite against the configured cluster
./t/spark-gpu-test.sh
```

This script will remotely execute SSH commands to validate NUMA configurations, run PyTorch/TensorFlow isolated in their Conda environments, verify NVCC/cuDNN, and submit `SparkPi` and `JavaIndexToStringExample` Spark jobs configured to use the RAPIDS accelerator plugin.

## Important notes

* This initialization script will install NVIDIA GPU drivers in all nodes in
172 changes: 172 additions & 0 deletions gpu/TESTING.md
@@ -0,0 +1,172 @@
# Testing the GPU Initialization Script

This document describes the recommended workflow for developing and testing the `install_gpu_driver.sh` script: iterate quickly over SSH, bypassing the slow integration runner, and then run the full integration suite once the change is complete.

## Fast Iterative Development (SSH/Manual)

This initialization action is designed to be **idempotent**, meaning it can be run multiple times on the same node without breaking the environment. It achieves this by writing "completion sentinels" to `/opt/install-dpgce/complete/` after successfully finishing each phase (e.g., `build-dependencies`, `nccl`, `cuda`).
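
The sentinel pattern described above can be sketched as follows. This is an illustration of the general technique, not the code used in `install_gpu_driver.sh`: `run_phase` and `SENTINEL_DIR` are hypothetical names, and the demo writes to a temporary directory instead of `/opt/install-dpgce/complete/` so that it can run unprivileged.

```bash
#!/usr/bin/env bash
# Sketch of a completion-sentinel guard. run_phase and SENTINEL_DIR are
# illustrative names; the real script's internals may differ.
SENTINEL_DIR="${SENTINEL_DIR:-$(mktemp -d)}"

run_phase() {
  local phase="$1"; shift
  local sentinel="${SENTINEL_DIR}/${phase}"
  if [[ -f "${sentinel}" ]]; then
    echo "skip ${phase}: sentinel present"
    return 0
  fi
  # Run the phase; mark it complete only if the command succeeded.
  "$@" || return $?
  mkdir -p "${SENTINEL_DIR}"
  touch "${sentinel}"
  echo "done ${phase}"
}

run_phase nccl true   # first call runs the phase and writes the sentinel
run_phase nccl true   # second call finds the sentinel and skips
```

Because the sentinel is written only after success, a failed phase is retried on the next run, while completed phases are skipped.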

To facilitate rapid iteration, we use the tooling provided in the companion `cloud-dataproc/gcloud` repository. This repo contains the test infrastructure, environment configuration (`env.json`), and lifecycle management scripts (`recreate-dpgce`, `ssh-m`, `scp-m`) necessary to provision and interact with test clusters efficiently.

When making structural or execution logic changes, you want to avoid destroying and recreating the entire Dataproc cluster during each test cycle. Instead, follow this incremental workflow:

### 1. Provision a "Bare" GPU Cluster
First, configure your target OS and versions in `cloud-dataproc/gcloud/env.json`. Then, use the `--no-init-action` flag on the recreation script to provision a cluster with GPUs attached, but *without* running any initialization actions during boot.

```bash
cd cloud-dataproc/gcloud
./bin/recreate-dpgce --gpu --no-init-action
```

### 2. Compile and Stage the Script
The `install_gpu_driver.sh` script is built from fragments. First, concatenate the fragments, then use the `scp-m` command to transfer the result to the cluster's master (`-m`) node. `scp-m` stages the file in the GCS temp bucket and pulls it down to `/tmp/install_gpu_driver.sh` over SSH.

```bash
cd initialization-actions
cat gpu/install_gpu_driver.sh.d/*.sh > gpu/install_gpu_driver.sh
cd ../cloud-dataproc/gcloud
./bin/scp-m ../../initialization-actions/gpu/install_gpu_driver.sh
```
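
The `cat ... *.sh` step depends on glob ordering, which is sorted by name. The sketch below demonstrates this in a throwaway directory with made-up fragment names (the real names in `install_gpu_driver.sh.d/` may differ) and adds a `bash -n` parse check, which is a suggested extra step rather than part of the documented workflow.

```bash
#!/usr/bin/env bash
# Demo in a temp dir with hypothetical fragment names. Globs expand in
# sorted order, so zero-padded prefixes keep 02 ahead of 10.
workdir="$(mktemp -d)"
mkdir "${workdir}/frags.d"
printf '#!/bin/bash\n'   > "${workdir}/frags.d/00-header.sh"
printf 'echo "second"\n' > "${workdir}/frags.d/02-body.sh"
printf 'echo "last"\n'   > "${workdir}/frags.d/10-footer.sh"

cat "${workdir}"/frags.d/*.sh > "${workdir}/combined.sh"

# bash -n parses without executing, catching a fragment that was cut
# mid-construct before the script is staged to the cluster.
bash -n "${workdir}/combined.sh" && echo "syntax ok"
```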

### 3. Execute and Monitor (Incremental Testing)
Execute the script manually over SSH as root. Piping the output through `tee` captures the logs the same way Dataproc normally records initialization-script output.

**Crucially, when re-running the script to test a specific fix, you must purge the relevant completion sentinels** (and partial build directories like `nccl`) so the script doesn't skip the phase you are trying to test.

* To run the *entire* script from scratch: `sudo rm -rf /opt/install-dpgce/complete`
* To re-test only the NCCL build: `sudo rm -f /opt/install-dpgce/complete/nccl && sudo rm -rf /opt/install-dpgce/nccl`

```bash
cd cloud-dataproc/gcloud
./bin/ssh-m 'sudo rm -rf /opt/install-dpgce/complete' # Example: clear everything
cd ../../initialization-actions
./gpu/install-in-screen.sh
```
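
Clearing a single phase always involves the same two removals, so it can be wrapped in a small function. This is a sketch: `reset_phase` is a hypothetical helper, and the demonstration runs against a temporary directory. On a cluster node the base directory would be `/opt/install-dpgce` and the commands would need `sudo`.

```bash
#!/usr/bin/env bash
# reset_phase is a hypothetical helper. The base dir is a parameter so the
# function can be tried locally before being used on a node.
reset_phase() {
  local base_dir="$1" phase="$2"
  rm -f  -- "${base_dir:?}/complete/${phase:?}"   # completion sentinel
  rm -rf -- "${base_dir:?}/${phase:?}"            # partial build directory, if any
}

# Local demonstration against a throwaway directory
demo_base="$(mktemp -d)"
mkdir -p "${demo_base}/complete" "${demo_base}/nccl"
touch "${demo_base}/complete/nccl" "${demo_base}/complete/cuda"
reset_phase "${demo_base}" nccl
ls "${demo_base}/complete"   # only the cuda sentinel remains
```

The `:?` expansions make the function abort if either argument is empty, which guards against an accidental `rm -rf` on the base directory itself.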

If your SSH connection drops, simply run `./gpu/install-in-screen.sh` again to instantly re-attach to the running session without losing context or interrupting the installation.

### 4. Verify with the Test Suite
Once the installation script completes without errors, run the external testing suite to ensure all Conda environments (PyTorch, TensorFlow, RAPIDS) and Spark services correctly bind to the GPU.

```bash
cd cloud-dataproc/gcloud
bash t/spark-gpu-test.sh
```

## Fast Iterative Development (SSH/Manual)

This initialization action is designed to be **idempotent**, meaning it can be run multiple times on the same node without breaking the environment. It achieves this by writing "completion sentinels" to `/opt/install-dpgce/complete/` after successfully finishing each phase (e.g., `build-dependencies`, `nccl`, `cuda`).

To facilitate rapid iteration, we use the tooling provided in the companion `cloud-dataproc/gcloud` repository. This repo contains the test infrastructure, environment configuration (`env.json`), and lifecycle management scripts (`recreate-dpgce`, `ssh-m`, `scp-m`) necessary to provision and interact with test clusters efficiently.

When making structural or execution logic changes, you want to avoid destroying and recreating the entire Dataproc cluster during each test cycle. Instead, follow this incremental workflow:

### 1. Provision a "Bare" GPU Cluster
First, configure your target OS and versions in `cloud-dataproc/gcloud/env.json`. Then, use the `--no-init-action` flag on the recreation script to provision a cluster with GPUs attached, but *without* running any initialization actions during boot.

```bash
cd ../cloud-dataproc/gcloud
# Edit env.json to set IMAGE_VERSION, REGION, ZONE, ACCELERATOR_TYPE, etc.
./bin/recreate-dpgce --gpu --no-init-action
```
*Note: `recreate-dpgce` will delete and recreate the cluster if it already exists.*

### 2. Compile, Stage, and Execute in Screen
The `install-in-screen.sh` script automates compiling the fragments, staging the script to the -m node, and running it within a detached `screen` session.

```bash
cd ../initialization-actions/gpu
./install-in-screen.sh
```

This command will:
* Concatenate scripts from `install_gpu_driver.sh.d/` into `install_gpu_driver.sh`.
* Use `../cloud-dataproc/gcloud/bin/scp-m` to upload the script to `/tmp/install_gpu_driver.sh` on the -m node.
* SSH to the -m node and start the script in a `screen` session named `gpu_install`. If the session already exists, it reattaches.

**Monitoring:**
* Logs are streamed to `/tmp/install_gpu_driver.log` on the -m node. You can tail this file via a separate SSH session:
```bash
cd ../cloud-dataproc/gcloud
./bin/ssh-m "tail -f /tmp/install_gpu_driver.log"
```
* Re-run `./install-in-screen.sh` to reattach to the screen session.

### 3. Incremental Testing & Clearing Sentinels
To re-run specific parts of the script after making fixes, you MUST clear the completion sentinels for those parts on the -m node.

* To run the *entire* script from scratch:
```bash
cd ../cloud-dataproc/gcloud
./bin/ssh-m 'sudo rm -rf /opt/install-dpgce/complete'
```
* To re-test only the NCCL build:
```bash
cd ../cloud-dataproc/gcloud
./bin/ssh-m 'sudo rm -f /opt/install-dpgce/complete/nccl && sudo rm -rf /opt/install-dpgce/nccl'
```
Then run `install-in-screen.sh` again from the `initialization-actions/gpu` directory.

### 4. Verify with the Test Suite
Once the installation script completes without errors in the screen session, run the external testing suite from the `cloud-dataproc/gcloud` repository to ensure all Conda environments (PyTorch, TensorFlow, RAPIDS) and Spark services correctly bind to the GPU.

```bash
cd ../cloud-dataproc/gcloud
bash t/spark-gpu-test.sh
```

## Continuous Integration Testing (Bazel/Podman)

Once the manual tests pass, you **must** verify the script behaves correctly within the isolated Python `absl` test harness (`test_gpu.py`) before pushing your changes to GitHub. This validates the full matrix of installation scenarios (SINGLE, STANDARD, KERBEROS, MIG, etc.).

We use a Podman wrapper to execute the Bazel test suite locally in a container that closely mirrors the remote CI environment.

1. **Credentials:** Ensure your Google Cloud Application Default Credentials (ADC) are saved locally (typically `~/.config/gcloud/application_default_credentials.json`). Copy them to the root of the repository:
```bash
cp ~/.config/gcloud/application_default_credentials.json ./key.json
```

2. **Execute Full Suite (Unfiltered):** To execute the entire parameterized test matrix, run the wrapper script without a test filter.

> ⚠️ **WARNING: HIGH RESOURCE CONSUMPTION**
> An unfiltered run executes all ~12 active parameterized shards. Because the script runs with `--jobs=10`, this will concurrently provision up to 10 separate Dataproc clusters. This requires massive GCP quota (roughly ~900 vCPUs and ~30 GPUs simultaneously if using `n1-standard-32` profiles) and will take approximately 60 to 90 minutes to complete. Do not run this unless you are finalizing a major PR.

```bash
cd initialization-actions
./gpu/run-bazel-tests-with-podman.sh "2.2-ubuntu22"
```

3. **Execute Specific Tests (Recommended for Iteration):** When iterating on a specific feature or failure, always pass Bazel arguments to filter the test execution. This saves significant time and quota. You can filter by test function name or class.

*Filter by a specific test function:*
```bash
cd initialization-actions
./gpu/run-bazel-tests-with-podman.sh "2.2-ubuntu22" "--test_filter=test_gpu_allocation"
```

*Filter by a specific test function that executes spark jobs:*
```bash
cd initialization-actions
./gpu/run-bazel-tests-with-podman.sh "2.2-ubuntu22" "--test_filter=test_install_gpu_cuda_nvidia_with_spark_job"
```

*Filter by test class (runs all tests in the class):*
```bash
cd initialization-actions
./gpu/run-bazel-tests-with-podman.sh "2.2-ubuntu22" "--test_filter=NvidiaGpuDriverTestCase"
```

## Compiling the AST Splitter Tool (`split.go`)

If you need to re-split `install_gpu_driver.sh` into its `.d/` fragments (e.g. if the main script was modified instead of the fragments), we use a Go-based AST parsing tool (`split.go`) to accurately chunk the bash script.

To compile the tool locally:

```bash
cd initialization-actions/gpu
go mod init split
go get mvdan.cc/sh/v3/syntax
go build -o split_ast split.go
```

Once compiled, executing `./split_ast install_gpu_driver.sh` will parse the script and populate the `install_gpu_driver.sh.d/` directory with the chunked components.
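
After re-splitting, it may be worth confirming that the fragments still reassemble into the script they came from. The sketch below assumes the splitter is intended to be lossless, so that concatenating the fragments in glob order reproduces the original byte for byte; that property is an assumption, and `verify_roundtrip` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# verify_roundtrip is a hypothetical helper. It succeeds only if the
# concatenated fragments are byte-identical to the original file.
verify_roundtrip() {
  local original="$1" fragment_dir="$2"
  cat "${fragment_dir}"/*.sh | cmp -s - "${original}"
}

# Usage, from initialization-actions/gpu:
#   verify_roundtrip install_gpu_driver.sh install_gpu_driver.sh.d \
#     && echo "round trip ok"
```

If the tool normalizes whitespace or formatting while splitting, a byte comparison will report a mismatch even when the scripts are equivalent, so a failure here calls for a manual `diff` rather than an immediate conclusion that the split is wrong.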