Commit f6c93fd

Refactor: Improve Proxy Handling and Secure Boot in GPU Install Script
This commit significantly enhances the robustness and configurability of the GPU driver installation script, particularly for environments with HTTP/HTTPS proxies and those using Secure Boot.

**Key Changes:**

* **Enhanced Proxy Configuration (`set_proxy`):**
    * Added support for `https-proxy` and `proxy-uri` metadata, providing more flexibility in proxy setups.
    * Improved `NO_PROXY` handling with sensible defaults (including Google APIs) and user-configurable additions.
    * Integrated support for custom proxy CA certificates via `http-proxy-pem-uri`, including installation into system, Java, and Conda trust stores.
    * Connections to the proxy now use HTTPS when a custom CA is provided.
    * Added proxy connection and reachability tests to fail fast on misconfiguration.
    * Ensures `curl`, `apt`, `dnf`, `gpg`, and Java all respect the proxy settings.
* **Robust GPG Key Import (`import_gpg_keys`):**
    * Introduced a new function to reliably import GPG keys from URLs or keyservers, fully respecting the configured proxy and custom CA settings.
    * This replaces direct `curl | gpg --import` calls, making key fetching more resilient in restricted network environments.
* **Secure Boot Signing Refinements:**
    * The `configure_dkms_certs` function now always fetches keys from Secret Manager if `private_secret_name` is set, ensuring `modulus_md5sum` is available for GCS cache paths.
    * Kernel module signing is now more clearly integrated into the build process.
    * Improved checks to ensure modules are actually signed and loadable after installation when Secure Boot is active.
* **Resilient Driver Installation:**
    * The script now checks if the `nvidia` module can be loaded at the beginning of `install_nvidia_gpu_driver` and will re-attempt installation if it fails.
    * `curl` calls for downloading drivers and other artifacts now use retry flags and honor proxy settings.
* **Conda Environment for PyTorch:**
    * Adjusted package list for Conda environment, removing TensorFlow to streamline.
    * Added specific workarounds for Debian 10, using `conda` instead of `mamba` and disabling SSL verification.
* **Documentation Updates (`gpu/README.md`):**
    * Added details on the new proxy metadata: `https-proxy`, `proxy-uri`, `no-proxy`.
    * Created a new section "Enhanced Proxy Support" explaining the features.
    * Updated `http-proxy-pem-uri` description.
    * Added proxy considerations to the "Troubleshooting" section.

These changes aim to make the GPU initialization action more reliable across a wider range of network environments and improve the Secure Boot workflow.
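
The message states that proxy connections switch to HTTPS when a custom CA is provided and that reachability tests run before any downloads. A minimal sketch of how that could look is below. It is not the script's actual code: `proxy_url` and `proxy_probe_cmd` are hypothetical helper names, the timeout and retry values are arbitrary, and the probe command is only printed, never executed. The `curl` flags used (`--proxy`, `--proxy-cacert`, `--retry`, `--connect-timeout`, `--head`, `--fail`) are standard.

```shell
# Sketch only: helper names are hypothetical and do not come from the script.

# Pick the proxy URL scheme: HTTPS when a CA bundle path is given, else HTTP.
proxy_url() {
  local host_port="$1" ca_file="$2"
  if [ -n "${ca_file}" ]; then
    printf 'https://%s\n' "${host_port}"
  else
    printf 'http://%s\n' "${host_port}"
  fi
}

# Print (do not run) a curl command that probes a target through the proxy.
# Timeout and retry values are illustrative assumptions.
proxy_probe_cmd() {
  local host_port="$1" ca_file="$2" target="$3"
  local url cmd
  url="$(proxy_url "${host_port}" "${ca_file}")"
  cmd="curl --silent --show-error --fail --head --connect-timeout 10 --retry 3 --proxy ${url}"
  if [ -n "${ca_file}" ]; then
    # Trust the custom CA when verifying the proxy's own TLS certificate.
    cmd="${cmd} --proxy-cacert ${ca_file}"
  fi
  printf '%s %s\n' "${cmd}" "${target}"
}
```

For example, `proxy_probe_cmd your-proxy.com:3128 /tmp/ca.pem https://www.nvidia.com` prints a command that can be run by hand on a cluster node to check the proxy path before the init action is retried.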
1 parent 9c2983d commit f6c93fd

2 files changed

Lines changed: 633 additions & 266 deletions

gpu/README.md

Lines changed: 25 additions & 10 deletions
@@ -15,6 +15,7 @@ Specifying a supported value for the `cuda-version` metadata variable
 will select compatible values for Driver, cuDNN, and NCCL from the script's
 internal matrix. Default CUDA versions are typically:
 
+* Dataproc 1.5: `11.6.2`
 * Dataproc 2.0: `12.1.1`
 * Dataproc 2.1: `12.4.1`
 * Dataproc 2.2 & 2.3: `12.6.3`
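
The defaults listed in this hunk amount to a small lookup keyed on the Dataproc image version. A minimal sketch of such a lookup follows. `default_cuda_version` is a hypothetical name, and the script's internal matrix may be organised differently; the version values are the ones in the README lines above.

```shell
# Sketch only: hypothetical helper, not taken from the install script.
# Maps a Dataproc image version (e.g. "2.2" or "2.2.10-debian12") to the
# default CUDA version listed in the README.
default_cuda_version() {
  case "$1" in
    1.5*)      echo "11.6.2" ;;
    2.0*)      echo "12.1.1" ;;
    2.1*)      echo "12.4.1" ;;
    2.2*|2.3*) echo "12.6.3" ;;
    *)
      echo "unknown Dataproc image version: $1" >&2
      return 1
      ;;
  esac
}
```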
@@ -191,20 +192,19 @@ This script accepts the following metadata parameters:
 * `cudnn-version`: (Optional) Specify cuDNN version (e.g., `8.9.7.29`).
 * `nccl-version`: (Optional) Specify NCCL version.
 * `include-pytorch`: (Optional) `yes`|`no`. Default: `no`.
-If `yes`, installs PyTorch, TensorFlow, RAPIDS, and PySpark in a Conda
+If `yes`, installs PyTorch, Numba, RAPIDS, and PySpark in a Conda
 environment.
 * `gpu-conda-env`: (Optional) Name for the PyTorch Conda environment.
 Default: `dpgce`.
 * `container-runtime`: (Optional) E.g., `docker`, `containerd`, `crio`.
 For NVIDIA Container Toolkit configuration. Auto-detected if not specified.
-* `http-proxy`: (Optional) URL of an HTTP proxy for downloads.
+* `http-proxy`: (Optional) Proxy address and port for HTTP requests (e.g., `your-proxy.com:3128`).
+* `https-proxy`: (Optional) Proxy address and port for HTTPS requests (e.g., `your-proxy.com:3128`). Defaults to `http-proxy` if not set.
+* `proxy-uri`: (Optional) A single proxy URI for both HTTP and HTTPS. Overridden by `http-proxy` or `https-proxy` if they are set.
+* `no-proxy`: (Optional) Comma or space-separated list of hosts/domains to bypass the proxy. Defaults include localhost, metadata server, and Google APIs. User-provided values are appended to the defaults.
 * `http-proxy-pem-uri`: (Optional) A `gs://` path to the
-PEM-encoded certificate file used by the proxy specified in
-`http-proxy`. This is needed if the proxy uses TLS and its
-certificate is not already trusted by the cluster's default trust
-store (e.g., if it's a self-signed certificate or signed by an
-internal CA). The script will install this certificate into the
-system and Java trust stores.
+PEM-encoded CA certificate file for the proxy specified in
+`http-proxy`/`https-proxy`. Required if the proxy uses TLS with a certificate not in the default system trust store. This certificate will be added to the system, Java, and Conda trust stores, and proxy connections will use HTTPS.
 * `invocation-type`: (For Custom Images) Set to `custom-images` by image
 building tools. Not typically set by end-users creating clusters.
 * **Secure Boot Signing Parameters:** Used if Secure Boot is enabled and
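
The precedence among `http-proxy`, `https-proxy`, and `proxy-uri`, and the append-to-defaults behaviour of `no-proxy`, as described in the hunk above, can be sketched as follows. This is an illustration under stated assumptions, not the script's code. The function names are hypothetical. The default bypass list is a guess based on the README wording ("localhost, metadata server, and Google APIs"). The order in which `https-proxy` falls back (first to `http-proxy`, then to `proxy-uri`) is one reading of the text.

```shell
# Sketch only: function names and the default list are assumptions.

# Resolve the effective proxies from the three metadata values.
# Assumed precedence: a specific key wins over proxy-uri, and https-proxy
# falls back to http-proxy before falling back to proxy-uri.
resolve_proxies() {
  local http_meta="$1" https_meta="$2" uri_meta="$3"
  local http_eff="${http_meta:-${uri_meta}}"
  local https_eff="${https_meta:-${http_meta:-${uri_meta}}}"
  printf 'http=%s https=%s\n' "${http_eff}" "${https_eff}"
}

# Append user-supplied no-proxy entries (comma or space separated) to the
# defaults and emit a single comma-separated list.
merge_no_proxy() {
  local defaults="localhost,metadata.google.internal,.google.com,.googleapis.com"
  local user_list
  # Turn spaces into commas and squeeze repeated separators.
  user_list="$(printf '%s' "$1" | tr -s ' ,' ',,')"
  user_list="${user_list#,}"   # drop a leading comma, if any
  user_list="${user_list%,}"   # drop a trailing comma, if any
  if [ -n "${user_list}" ]; then
    printf '%s,%s\n' "${defaults}" "${user_list}"
  else
    printf '%s\n' "${defaults}"
  fi
}
```

With only `proxy-uri` set, both effective proxies resolve to that value; setting `https-proxy` as well changes only the HTTPS side.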
@@ -217,6 +217,20 @@ This script accepts the following metadata parameters:
 modulus_md5sum=<md5sum-of-your-mok-key-modulus>
 ```
 
+### Enhanced Proxy Support
+
+This script includes robust support for environments requiring an HTTP/HTTPS proxy:
+
+* **Configuration:** Use the `http-proxy`, `https-proxy`, or `proxy-uri` metadata to specify your proxy server (host:port).
+* **Custom CA Certificates:** If your proxy uses a custom CA (e.g., self-signed), provide the CA certificate in PEM format via the `http-proxy-pem-uri` metadata (as a `gs://` path). The script will:
+* Install the CA into the system trust store (`update-ca-certificates` or `update-ca-trust`).
+* Add the CA to the Java cacerts trust store.
+* Configure Conda to use the system trust store.
+* Switch proxy communications to use HTTPS.
+* **Tool Configuration:** The script automatically configures `curl`, `apt`, `dnf`, `gpg`, and Java to use the specified proxy settings and custom CA if provided.
+* **Bypass:** The `no-proxy` metadata allows specifying hosts to bypass the proxy. Defaults include `localhost`, the metadata server, `.google.com`, and `.googleapis.com` to ensure essential services function correctly.
+* **Verification:** The script performs connection tests to the proxy and attempts to reach external sites (google.com, nvidia.com) through the proxy to validate the configuration before proceeding with downloads.
+
 ### Loading Built Kernel Module & Secure Boot
 
 When the script needs to build NVIDIA kernel modules from source (e.g., using
@@ -280,6 +294,7 @@ handles metric creation and reporting.
 * **Installation Failures:** Examine the initialization action log on the
 affected node, typically `/var/log/dataproc-initialization-script-0.log`
 (or a similar name if multiple init actions are used).
+* **Network/Proxy Issues:** If using a proxy, double-check the `http-proxy`, `https-proxy`, `proxy-uri`, `no-proxy`, and `http-proxy-pem-uri` metadata settings. Ensure the proxy allows access to NVIDIA domains, GitHub, and package repositories. Check the init action log for curl errors or proxy test failures.
 * **GPU Agent Issues:** If the agent was installed (`install-gpu-agent=true`),
 check its service logs using `sudo journalctl -u gpu-utilization-agent.service`.
 * **Driver Load or Secure Boot Problems:** Review `dmesg` output and
@@ -298,7 +313,7 @@ handles metric creation and reporting.
 * The script extensively caches downloaded artifacts (drivers, CUDA `.run`
 files) and compiled components (kernel modules, NCCL, Conda environments)
 to a GCS bucket. This bucket is typically specified by the
-`dataproc-temp-bucket` cluster property or metadata.
+`dataproc-temp-bucket` cluster property or metadata. Downloads and cache operations are proxy-aware.
 * **First Run / Cache Warming:** Initial runs on new configurations (OS,
 kernel, or driver version combinations) that require source compilation
 (e.g., for NCCL or kernel modules when no pre-compiled version is
@@ -324,4 +339,4 @@ handles metric creation and reporting.
 Debian-based systems, including handling of archived backports repositories
 to ensure dependencies can be met.
 * Tested primarily with Dataproc 2.0+ images. Support for older Dataproc
-1.5 images is limited.
+1.5 images is limited.
