Commit 6d00e01
Docs(gpu): Update README to reflect current script capabilities

This commit comprehensively updates gpu/README.md to align with the
current features, metadata, and behavior of install_gpu_driver.sh.

Key updates to the README include:
- Default Versions & Configurations:
  - Clarified that the script's internal version matrix is based on
    NVIDIA's guidance (e.g., Deep Learning Frameworks Support Matrix).
  - Updated example default CUDA versions for different Dataproc
    image series (2.0, 2.1, 2.2+).
  - Expanded the table of "Example Tested Configurations" with more
    recent and relevant CUDA/Driver/cuDNN/NCCL versions and their
    tested Dataproc image compatibility.
  - Updated the list of "Supported Operating Systems."
- Usage Examples:
  - Revised `gcloud` examples for clarity, using current best practices
    (regionalized bucket paths, common GPU types).
  - Added a new example demonstrating the use of `cuda-url` and
    `gpu-driver-url` with HTTP/HTTPS URLs.
  - Updated the MIG (Multi-Instance GPU) example to correctly show
    `install_gpu_driver.sh` for base drivers and `mig.sh` (via
    `dataproc:startup.script.uri`) for MIG-specific setup.
- Custom Image Creation:
  - Clarified the use of `invocation-type=custom-images` metadata,
    emphasizing it's set by image building tools (like
    `generate_custom_image.py`) and not by end-users creating
    clusters from scratch.
  - Provided a simplified example for `generate_custom_image.py`.
- Feature Documentation:
  - Updated "GPU Scheduling in YARN" to reflect current configurations,
    including the RAPIDS Spark plugin.
  - Revised the "cuDNN" section for clarity on version selection and
    installation methods.
  - Significantly expanded "Loading Built Kernel Module & Secure Boot"
    with details on MOK key management via GCP Secret Manager, the role
    of `GoogleCloudDataproc/custom-images/examples/secure-boot/create-key-pair.sh`,
    and the `--no-shielded-secure-boot` workaround.
- Metadata Parameters:
  - Ensured the list is comprehensive and descriptions are accurate for
    all current parameters, including:
    - `cuda-url` and `gpu-driver-url` (clarifying they expect HTTP/HTTPS
      URLs for `curl` fetching).
    - `include-pytorch` and `gpu-conda-env`.
    - `container-runtime`.
    - Full set of Secure Boot signing parameters.
  - Corrected default values where necessary (e.g., `install-gpu-agent`
    is now `true` by default).
- Verification, Reporting, and Troubleshooting:
  - Updated verification commands.
  - Clarified that the "Report Metrics" section now refers to the
    automated agent (based on ml-on-gcp code) and that
    `create_gpu_metrics.py` is no longer used by this init action.
  - Revised troubleshooting tips to be more relevant to current issues.
- Important Notes:
  - Added a detailed "Performance & Caching" subsection explaining:
    - The GCS caching mechanism (`dataproc-temp-bucket`).
    - Potentially long first-run times (up to 150 mins on small nodes)
      if compiling from source.
    - The recommendation and benefits of "pre-warming" the cache on a
      larger instance (reducing init action time to ~12-20 mins in some
      cases).
    - The security benefit of a reduced attack surface when using cached
      artifacts (as build tools may not be needed).
  - Updated notes on SSHD hardening and APT source management.
  - Confirmed primary support for Dataproc 2.0+ images.
- Formatting and Style:
  - Maintained the overall structure and line-wrapping style (aiming
    for ~80 columns) of the provided baseline README (md5sum
    2daece9a7841cc4f5a0997fecf68cbd7) where feasible, while ensuring
    clarity and readability of the new and updated content.

1 parent 40552a7 commit 6d00e01
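
For orientation, the kind of `gcloud` invocation the new `cuda-url` /
`gpu-driver-url` usage example refers to might look roughly like the
sketch below. It is not copied from the README: the cluster name, region,
image version, GPU type, and both `https://example.com/...` URLs are
hypothetical placeholders, and the script only prints the command (a dry
run) instead of creating a cluster.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Every value here is an illustrative placeholder, not a README default.
REGION="us-central1"
CLUSTER_NAME="gpu-demo"
# Regionalized public bucket path for the init action.
INIT_ACTION="gs://goog-dataproc-initialization-actions-${REGION}/gpu/install_gpu_driver.sh"
# Hypothetical HTTP/HTTPS locations; the script is described as fetching
# these with curl, so they must be plain HTTP(S) URLs.
CUDA_URL="https://example.com/cuda_installer.run"
DRIVER_URL="https://example.com/nvidia_driver.run"

# Assemble the cluster-creation command and print it without running it.
build_create_cmd() {
  local -a cmd=(
    gcloud dataproc clusters create "${CLUSTER_NAME}"
    --region "${REGION}"
    --image-version 2.2-debian12
    --worker-accelerator "type=nvidia-tesla-t4,count=1"
    --initialization-actions "${INIT_ACTION}"
    --metadata "cuda-url=${CUDA_URL},gpu-driver-url=${DRIVER_URL}"
  )
  printf '%s\n' "${cmd[*]}"
}

build_create_cmd
```

Printing the command first makes it easy to review the metadata string
before committing to a potentially long first run; gpu/README.md remains
the authoritative source for supported values and defaults.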
1 file changed
Lines changed: 249 additions & 262 deletions