Commit 29a0df9
[gpu] strict driver and cuda version assignment
Roll forward GoogleCloudDataproc#1275
gpu/install_gpu_driver.sh
* updated supported versions
* moved all code into functions, which are called at the end of
the installer
* install cuda and driver exclusively from run files
* extract cuda and driver version from urls if supplied
* support supplying cuda version as x.y.z instead of just x.y
* build nccl from source
* poll dpkg lock status for up to 60 seconds
* cache build artifacts from kernel driver and nccl
* use consistent arguments to curl
* create is_complete and mark_complete functions to allow re-running
* Tested more CUDA minor versions
* Printing warnings when combination provided is known to fail
* only install build dependencies on build cache miss
* added optional pytorch install option
* renamed metadata attribute cert_modulus_md5sum to modulus_md5sum
* verified that proprietary kernel drivers work with older dataproc images
* clear dkms key immediately after use
* cache .run files to GCS to reduce fetches from origin
* Install nvidia container toolkit and select container runtime
* tested installer on clusters without GPUs attached
* fixed a problem with ops agent not installing; using venv
* Older CapacityScheduler does not permit use of gpu resources;
switch to FairScheduler on 2.0 and below
* caching result of nvidia-smi in spark.executor.resource.gpu.discoveryScript
* setting some reasonable defaults in /etc/spark/conf.dist/spark-defaults.conf
* Installing gcc-12 on ubuntu22 to fix kernel driver FTBFS
* Hold all NVIDIA-related packages from upgrading unintentionally
* skipping proxy setup if http-proxy metadata not set
* added function to check secure-boot and os version compatibility
* harden sshd config
* install spark rapids acceleration libraries
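
Two of the changes listed above (the is_complete / mark_complete pair and accepting a CUDA version as x.y.z or extracting it from a URL) can be sketched roughly as follows. This is an illustration only, not the code from install_gpu_driver.sh: the marker directory STATE_DIR, the helper names version_from_url and major_minor, the step name, and the example URL are all assumptions made for this sketch.

```shell
#!/usr/bin/env bash

# Hypothetical marker directory; the location used by the real installer
# is not stated in the commit message.
STATE_DIR="${STATE_DIR:-/tmp/gpu-installer-state}"

# Succeeds if the named step was already marked complete on a previous run.
is_complete() {
  [[ -f "${STATE_DIR}/$1.done" ]]
}

# Records that the named step finished, so a re-run can skip it.
mark_complete() {
  mkdir -p "${STATE_DIR}"
  touch "${STATE_DIR}/$1.done"
}

# Prints the first x.y.z version string found in a URL.
# Returns non-zero if the URL contains no such string.
version_from_url() {
  local url="$1"
  local re='([0-9]+\.[0-9]+\.[0-9]+)'
  if [[ "${url}" =~ ${re} ]]; then
    echo "${BASH_REMATCH[1]}"
  else
    return 1
  fi
}

# Reduces x.y.z (or x.y) to x.y.
major_minor() {
  local major minor rest
  IFS=. read -r major minor rest <<< "$1"
  echo "${major}.${minor}"
}

# Usage: guard an expensive step so that re-running the script skips it.
if ! is_complete example-step; then
  echo "running example-step (placeholder for real work)"
  mark_complete example-step
fi
```

Marker files keep the state on disk, so a second run of the script after a partial failure skips the steps that already succeeded.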
gpu/manual-test-runner.sh
* order commands correctly
gpu/run-bazel-tests.sh
* do not retry flaky tests
gpu/test_gpu.py
* clearer test skipping logic
* added instructions on how to test pyspark
* remove skip of rocky9 tests
1 parent 292d67f commit 29a0df9
6 files changed
Lines changed: 1527 additions & 511 deletions
File tree
- gpu
- integration_tests