Commit 6d00e01

Docs(gpu): Update README to reflect current script capabilities
This commit comprehensively updates gpu/README.md to align with the current features, metadata, and behavior of install_gpu_driver.sh.

Key updates to the README include:

- Default Versions & Configurations:
  - Clarified that the script's internal version matrix is based on NVIDIA's guidance (e.g., Deep Learning Frameworks Support Matrix).
  - Updated example default CUDA versions for different Dataproc image series (2.0, 2.1, 2.2+).
  - Expanded the table of "Example Tested Configurations" with more recent and relevant CUDA/Driver/cuDNN/NCCL versions and their tested Dataproc image compatibility.
  - Updated the list of "Supported Operating Systems."

- Usage Examples:
  - Revised `gcloud` examples for clarity, using current best practices (regionalized bucket paths, common GPU types).
  - Added a new example demonstrating the use of `cuda-url` and `gpu-driver-url` with HTTP/HTTPS URLs.
  - Updated the MIG (Multi-Instance GPU) example to correctly show `install_gpu_driver.sh` for base drivers and `mig.sh` (via `dataproc:startup.script.uri`) for MIG-specific setup.

- Custom Image Creation:
  - Clarified the use of `invocation-type=custom-images` metadata, emphasizing it's set by image building tools (like `generate_custom_image.py`) and not by end-users creating clusters from scratch.
  - Provided a simplified example for `generate_custom_image.py`.

- Feature Documentation:
  - Updated "GPU Scheduling in YARN" to reflect current configurations, including the RAPIDS Spark plugin.
  - Revised the "cuDNN" section for clarity on version selection and installation methods.
  - Significantly expanded "Loading Built Kernel Module & Secure Boot" with details on MOK key management via GCP Secret Manager, the role of `GoogleCloudDataproc/custom-images/examples/secure-boot/create-key-pair.sh`, and the `--no-shielded-secure-boot` workaround.

- Metadata Parameters:
  - Ensured the list is comprehensive and descriptions are accurate for all current parameters, including:
    - `cuda-url` and `gpu-driver-url` (clarifying they expect HTTP/HTTPS URLs for `curl` fetching).
    - `include-pytorch` and `gpu-conda-env`.
    - `container-runtime`.
    - Full set of Secure Boot signing parameters.
  - Corrected default values where necessary (e.g., `install-gpu-agent` is now `true` by default).

- Verification, Reporting, and Troubleshooting:
  - Updated verification commands.
  - Clarified that the "Report Metrics" section now refers to the automated agent (based on ml-on-gcp code) and that `create_gpu_metrics.py` is no longer used by this init action.
  - Revised troubleshooting tips to be more relevant to current issues.

- Important Notes:
  - Added detailed "Performance & Caching" subsection explaining:
    - The GCS caching mechanism (`dataproc-temp-bucket`).
    - Potential long first-run times (up to 150 mins on small nodes) if compiling from source.
    - The recommendation and benefits of "pre-warming" the cache on a larger instance (reducing init action time to ~12-20 mins in some cases).
    - The security benefit of reduced attack surface when using cached artifacts (as build tools may not be needed).
  - Updated notes on SSHD hardening and APT source management.
  - Confirmed primary support for Dataproc 2.0+ images.

- Formatting and Style:
  - Maintained the overall structure and line-wrapping style (aiming for ~80 columns) of the provided baseline README (md5sum 2daece9a7841cc4f5a0997fecf68cbd7) where feasible, while ensuring clarity and readability of the new and updated content.
1 parent 40552a7 commit 6d00e01

1 file changed

Lines changed: 249 additions & 262 deletions
