
Commit 37d11df

Author: Sebastien Loisel (committed)
Add CUDA extension, fix docs, configure CI for MPICH_jll
- Add LinearAlgebraMPICUDAExt.jl with cu()/cpu() conversions and CuDSSFactorizationMPI for distributed sparse direct solves via NCCL
- Add codecov.yml to exclude GPU extensions from coverage
- Document cuDSS MGMN bug (status=5 on narrow-bandwidth matrices)
- Fix CUDA setup docs: requires CUDA, NCCL, CUDSS_jll
- Fix MPI.Init() placement in docs (after using statements)
- Configure CI to use MPICH_jll to avoid MUMPS hang
- Add commit message policy to CLAUDE.md
1 parent d135d93 · commit 37d11df

20 files changed

Lines changed: 1385 additions & 222 deletions

.github/workflows/CI.yml

Lines changed: 9 additions & 0 deletions
@@ -33,6 +33,15 @@ jobs:
 
       - uses: julia-actions/cache@v2
 
+      - name: Configure MPI
+        run: |
+          julia --project=. -e '
+            using Pkg
+            Pkg.add("MPIPreferences")
+            using MPIPreferences
+            MPIPreferences.use_jll_binary("MPICH_jll")
+          '
+
       - name: Build package
         uses: julia-actions/julia-buildpkg@v1
 
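The new CI step above pins MPI.jl to MPICH_jll before the package is built. Not part of this commit, but a minimal sketch of how to confirm the active binding locally in the project environment:

```julia
# Inspect the MPI binding recorded by MPIPreferences (LocalPreferences.toml).
using MPIPreferences
@show MPIPreferences.binary        # expected to print "MPICH_jll" after the step above

# The underlying MPI library also identifies itself at runtime.
using MPI
MPI.Init()
println(MPI.Get_library_version())
MPI.Finalize()
```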

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@ Plots/
 
 # Julia
 Manifest.toml
+LocalPreferences.toml
 
 # Documentation build
 docs/build/

CLAUDE.md

Lines changed: 53 additions & 8 deletions
@@ -2,6 +2,10 @@
 
 This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
 
+## Commit Messages
+
+Do NOT add "Co-Authored-By" lines or any other self-attribution to commit messages. Do NOT advertise Claude or Anthropic in commits. Keep commit messages focused on describing the changes only.
+
 ## Build and Test Commands
 
 ```bash
@@ -29,15 +33,26 @@ mpiexec -n 4 julia --project=. test/test_factorization.jl
 julia --project=. -e 'using Pkg; Pkg.precompile()'
 ```
 
+## MPI Configuration
+
+By default, MPI.jl uses MPItrampoline_jll. On some Linux clusters, this causes MUMPS to hang during the solve phase. If you experience hangs with multi-rank MUMPS tests, switch to MPICH_jll:
+
+```julia
+using MPIPreferences
+MPIPreferences.use_jll_binary("MPICH_jll")
+```
+
+This creates/updates `LocalPreferences.toml` (which is gitignored). Restart Julia after changing MPI preferences.
+
 ## GPU Support
 
-GPU acceleration is supported via Metal.jl (macOS) as a package extension.
+GPU acceleration is supported via Metal.jl (macOS) or CUDA.jl (Linux/Windows) as package extensions.
 
 ### Type Parameters
 
-- `VectorMPI{T,AV}` where `AV` is `Vector{T}` (CPU) or `MtlVector{T}` (GPU)
-- `MatrixMPI{T,AM}` where `AM` is `Matrix{T}` (CPU) or `MtlMatrix{T}` (GPU)
-- `SparseMatrixMPI{T,Ti,AV}` where `AV` is `Vector{T}` (CPU) or `MtlVector{T}` (GPU) for the `nzval` array
+- `VectorMPI{T,AV}` where `AV` is `Vector{T}` (CPU), `MtlVector{T}` (Metal), or `CuVector{T}` (CUDA)
+- `MatrixMPI{T,AM}` where `AM` is `Matrix{T}` (CPU), `MtlMatrix{T}` (Metal), or `CuMatrix{T}` (CUDA)
+- `SparseMatrixMPI{T,Ti,AV}` where `AV` is `Vector{T}` (CPU), `MtlVector{T}`, or `CuVector{T}` for the `nzval` array
 - Type aliases: `VectorMPI_CPU{T}`, `MatrixMPI_CPU{T}`, `SparseMatrixMPI_CPU{T,Ti}` for CPU-backed types
 
 ### Creating Zero Arrays
@@ -55,15 +70,20 @@ A = zeros(MatrixMPI_CPU{Float64}, 50, 30)
 S = zeros(SparseMatrixMPI{Float64,Int,Vector{Float64}}, 100, 100)
 S = zeros(SparseMatrixMPI_CPU{Float64,Int}, 100, 100)
 
-# GPU zero arrays (requires Metal.jl loaded)
+# GPU zero arrays (requires Metal.jl or CUDA.jl loaded)
 using Metal
 v_gpu = zeros(VectorMPI{Float32,MtlVector{Float32}}, 100)
 A_gpu = zeros(MatrixMPI{Float32,MtlMatrix{Float32}}, 50, 30)
+
+# Or with CUDA
+using CUDA
+v_gpu = zeros(VectorMPI{Float64,CuVector{Float64}}, 100)
+A_gpu = zeros(MatrixMPI{Float64,CuMatrix{Float64}}, 50, 30)
 ```
 
 ### CPU Staging
 
-MPI communication always uses CPU buffers (no Metal-aware MPI exists). GPU data is staged through CPU:
+MPI communication always uses CPU buffers (no GPU-aware MPI). GPU data is staged through CPU:
 
 1. GPU vector data copied to CPU staging buffer
 2. MPI communication on CPU buffers
@@ -84,7 +104,32 @@ Sparse matrices remain on CPU (Julia's `SparseMatrixCSC` doesn't support GPU arr
 ### Extension Files
 
 - `ext/LinearAlgebraMPIMetalExt.jl` - Metal extension with `mtl()` and `cpu()` functions
-- Loaded automatically when `using Metal` before `using LinearAlgebraMPI`
+- `ext/LinearAlgebraMPICUDAExt.jl` - CUDA extension with `cu()` and `cpu()` functions, plus cuDSS multi-GPU solver
+- Loaded automatically when `using Metal` or `using CUDA` before `using LinearAlgebraMPI`
+
+### CUDA-Specific: cuDSS Multi-GPU Solver
+
+The CUDA extension includes `CuDSSFactorizationMPI` for distributed sparse direct solves using NVIDIA's cuDSS library with NCCL inter-GPU communication:
+
+```julia
+using CUDA, MPI
+MPI.Init()
+using LinearAlgebraMPI
+
+# Each MPI rank should use a different GPU
+CUDA.device!(MPI.Comm_rank(MPI.COMM_WORLD) % length(CUDA.devices()))
+
+# Create factorization (LDLT for symmetric, LU for general)
+F = cudss_ldlt(A) # or cudss_lu(A)
+x = F \ b
+finalize!(F) # Required: clean up cuDSS resources
+```
+
+**Important cuDSS notes:**
+- Requires cuDSS 0.4+ with MGMN (Multi-GPU Multi-Node) support
+- NCCL communicator is bootstrapped automatically from MPI
+- `finalize!(F)` must be called to avoid MPI desync during cleanup
+- Known issue: tridiagonal matrices with 3+ rows per rank may fail (cuDSS bug reported to NVIDIA)
 
 ### Writing Unified CPU/GPU Functions
 
@@ -105,7 +150,7 @@ end
 
 2. `_to_target_backend(v::Vector, ::Type{AV})` - Convert CPU index vector to target type:
    - `Type{Vector{T}}`: returns `v` directly (no copy)
-   - `Type{MtlVector{T}}`: returns GPU copy
+   - `Type{MtlVector{T}}` or `Type{CuVector{T}}`: returns GPU copy
 
 **Pattern for result construction (unified):**
 ```julia
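The "Writing Unified CPU/GPU Functions" hunk above refers to a `_to_target_backend` helper. A rough sketch of that dispatch pattern, for orientation only (the package's actual implementation may differ):

```julia
# Illustrative sketch: dispatch on the target array type so one code path serves
# both CPU and GPU backends.
_to_target_backend(v::Vector, ::Type{<:Vector}) = v                               # CPU target: reuse, no copy
_to_target_backend(v::Vector, ::Type{AV}) where {AV<:AbstractVector} = AV(v)      # GPU target: device copy

rowptr = collect(1:5)
rowptr_cpu = _to_target_backend(rowptr, Vector{Int})   # same array, no copy
# With CUDA loaded, _to_target_backend(rowptr, CuVector{Int}) would return a device copy.
```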

Project.toml

Lines changed: 34 additions & 14 deletions
@@ -1,39 +1,59 @@
+authors = ["S. Loisel"]
 name = "LinearAlgebraMPI"
 uuid = "5bdd2be4-ae34-42ef-8b36-f4c85d48f377"
 version = "0.1.9"
-authors = ["S. Loisel"]
+
+[compat]
+Adapt = "4"
+Blake3Hash = "0.3"
+CUDA = "5"
+CUDSS_jll = "0.7"
+KernelAbstractions = "0.9"
+MPI = "0.20"
+MPIPreferences = "0.1.11"
+MUMPS = "1.5"
+Metal = "1.9.1"
+NCCL = "0.1"
+PrecompileTools = "1"
+StaticArrays = "1"
+julia = "1.10"
 
 [deps]
 Adapt = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
 Blake3Hash = "8f478455-a32d-4928-b0e4-72b19a7d5574"
 KernelAbstractions = "63c18a36-062a-441e-b654-da1e3ab1ce7c"
 LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
 MPI = "da04e1cc-30fd-572f-bb4f-1f8673147195"
+MPIPreferences = "3da0fdf6-3ccc-4f1b-acd9-58baa6c99267"
 MUMPS = "55d2b088-9f4e-11e9-26c0-150b02ea6a46"
-Metal = "dde4c033-4e86-420c-a63e-0dd931031962"
 PrecompileTools = "aea7be01-6a6a-4083-8856-8a6e6704d82a"
 SparseArrays = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"
 StaticArrays = "90137ffa-7385-5640-81b9-e52037218182"
 
 [extensions]
+LinearAlgebraMPICUDAExt = ["CUDA", "NCCL", "CUDSS_jll"]
 LinearAlgebraMPIMetalExt = "Metal"
 
-[compat]
-Adapt = "4"
-Blake3Hash = "0.3"
-KernelAbstractions = "0.9"
-MPI = "0.20"
-MUMPS = "1.5"
-Metal = "1.9.1"
-PrecompileTools = "1"
-StaticArrays = "1"
-julia = "1.10"
-
 [extras]
 BenchmarkTools = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
+CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
+CUDSS_jll = "4889d778-9329-5762-9fec-0578a5d30366"
 Metal = "dde4c033-4e86-420c-a63e-0dd931031962"
+NCCL = "3fe64909-d7a1-4096-9b7d-7a0f12cf0f6b"
 Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
 Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
 
+[preferences.MPIPreferences]
+__clear__ = ["libmpi", "abi", "mpiexec", "cclibs", "preloads_env_switch"]
+_format = "1.0"
+binary = "MPItrampoline_jll"
+preloads = []
+
 [targets]
-test = ["Random", "Test"]
+test = ["CUDA", "CUDSS_jll", "NCCL", "Random", "Test"]
+
+[weakdeps]
+CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
+CUDSS_jll = "4889d778-9329-5762-9fec-0578a5d30366"
+Metal = "dde4c033-4e86-420c-a63e-0dd931031962"
+NCCL = "3fe64909-d7a1-4096-9b7d-7a0f12cf0f6b"
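For context, the `[weakdeps]`/`[extensions]` entries above mean the CUDA extension only loads once all three weak dependencies are present in the session. A schematic skeleton of such an extension module (assumed layout, not the actual file contents):

```julia
# ext/LinearAlgebraMPICUDAExt.jl -- schematic layout only.
module LinearAlgebraMPICUDAExt

using LinearAlgebraMPI            # parent package
using CUDA, NCCL, CUDSS_jll       # weak dependencies listed under [extensions]

# Methods such as cu(::VectorMPI) and the cuDSS wrappers would be defined here;
# Julia loads this module automatically when CUDA, NCCL, and CUDSS_jll are all imported.

end # module
```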

README.md

Lines changed: 57 additions & 21 deletions
@@ -20,7 +20,8 @@ Distributed sparse matrix and vector operations using MPI for Julia. This packag
 - **Matrix addition/subtraction** (`A + B`, `A - B`)
 - **Vector operations**: norms, reductions, arithmetic with automatic partition alignment
 - Support for both `Float64` and `ComplexF64` element types
-- **GPU acceleration** via Metal.jl (macOS) with automatic CPU staging for MPI
+- **GPU acceleration** via Metal.jl (macOS) or CUDA.jl (Linux/Windows) with automatic CPU staging for MPI
+- **Multi-GPU sparse direct solver** via cuDSS with NCCL communication (CUDA only)
 
 ## Installation
 
@@ -66,11 +67,11 @@ F = ldlt(A_sym_dist) # LDLT factorization
 x_sol = solve(F, y) # Solve A_sym * x_sol = y
 ```
 
-## GPU Support (Metal)
+## GPU Support
 
-LinearAlgebraMPI supports GPU acceleration on macOS via Metal.jl. GPU support is optional - Metal.jl is loaded as a weak dependency.
+LinearAlgebraMPI supports GPU acceleration via Metal.jl (macOS) or CUDA.jl (Linux/Windows). GPU support is optional - extensions are loaded as weak dependencies.
 
-### Converting between CPU and GPU
+### Metal (macOS)
 
 ```julia
 using Metal # Load Metal BEFORE MPI for GPU detection
@@ -93,34 +94,69 @@ z_gpu = x_gpu + x_gpu # Vector addition on GPU
 y_cpu = cpu(y_gpu)
 ```
 
-### Creating GPU vectors directly
+### CUDA (Linux/Windows)
 
 ```julia
-using Metal
+using CUDA # Load CUDA BEFORE MPI
+using MPI
+MPI.Init()
+using LinearAlgebraMPI
+
+# Convert to GPU
+x_cpu = VectorMPI(rand(1000))
+x_gpu = cu(x_cpu) # Returns VectorMPI with CuVector storage
+
+# GPU operations work transparently
+y_gpu = A * x_gpu
+z_gpu = x_gpu + x_gpu
+
+# Convert back to CPU
+y_cpu = cpu(y_gpu)
+```
+
+### cuDSS Multi-GPU Solver (CUDA only)
+
+For multi-GPU distributed sparse direct solves, use `CuDSSFactorizationMPI`:
+
+```julia
+using CUDA, MPI
+MPI.Init()
+using LinearAlgebraMPI
+
+# Each rank uses one GPU
+CUDA.device!(MPI.Comm_rank(MPI.COMM_WORLD) % length(CUDA.devices()))
 
-# Create GPU vector from local data
-local_data = MtlVector(Float32.(rand(100)))
-v_gpu = VectorMPI_local(local_data)
+# Create distributed sparse matrix
+A = SparseMatrixMPI{Float64}(make_spd_matrix(1000))
+b = VectorMPI(rand(1000))
+
+# Multi-GPU factorization using cuDSS + NCCL
+F = cudss_ldlt(A) # or cudss_lu(A)
+x = F \ b
+finalize!(F) # Clean up cuDSS resources
 ```
 
+**Requirements**: cuDSS 0.4+ with MGMN (Multi-GPU Multi-Node) support, NCCL for inter-GPU communication.
+
 ### How it works
 
-- **Vectors**: `VectorMPI{T,AV}` where `AV` is `Vector{T}` (CPU) or `MtlVector{T}` (GPU)
+- **Vectors**: `VectorMPI{T,AV}` where `AV` is `Vector{T}` (CPU), `MtlVector{T}` (Metal), or `CuVector{T}` (CUDA)
 - **Sparse matrices**: `SparseMatrixMPI{T,Ti,AV}` where `AV` determines storage for nonzero values
-- **Dense matrices**: `MatrixMPI{T,AM}` where `AM` is `Matrix{T}` (CPU) or `MtlMatrix{T}` (GPU)
-- **MPI communication**: Always uses CPU buffers (no Metal-aware MPI exists)
-- **Element type**: Metal requires `Float32` (no `Float64` support)
+- **Dense matrices**: `MatrixMPI{T,AM}` where `AM` is `Matrix{T}`, `MtlMatrix{T}`, or `CuMatrix{T}`
+- **MPI communication**: Always uses CPU buffers (staged automatically)
+- **Element types**: Metal requires `Float32`; CUDA supports `Float32` and `Float64`
 
 ### Supported GPU operations
 
-| Operation | GPU Support |
-|-----------|-------------|
-| `v + w`, `v - w` | Native GPU |
-| `α * v` (scalar) | Native GPU |
-| `A * x` (sparse) | CPU staging |
-| `A * x` (dense) | CPU staging |
-| `transpose(A) * x` | CPU staging |
-| Broadcasting (`abs.(v)`) | Native GPU |
+| Operation | Metal | CUDA |
+|-----------|-------|------|
+| `v + w`, `v - w` | Native | Native |
+| `α * v` (scalar) | Native | Native |
+| `A * x` (sparse) | CPU staging | CPU staging |
+| `A * x` (dense) | CPU staging | CPU staging |
+| `transpose(A) * x` | CPU staging | CPU staging |
+| Broadcasting (`abs.(v)`) | Native | Native |
+| `cudss_lu(A)`, `cudss_ldlt(A)` | N/A | Multi-GPU native |
 
 ## Running with MPI
 
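Several rows of the README table above are marked "CPU staging". A minimal sketch of what that pattern means in practice (assumed for illustration; not the package's internal code):

```julia
using MPI
MPI.Init()

# Stage a device vector through a host buffer for an MPI reduction, then copy back.
function staged_allreduce_sum!(v_dev)
    v_host = Array(v_dev)                      # device -> host copy (no GPU-aware MPI needed)
    MPI.Allreduce!(v_host, +, MPI.COMM_WORLD)  # MPI operates on the CPU buffer
    copyto!(v_dev, v_host)                     # host -> device copy
    return v_dev
end

staged_allreduce_sum!(rand(4))  # also works with a plain Vector on a single rank
```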

codecov.yml

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+# Codecov configuration
+# https://docs.codecov.com/docs/codecov-yaml
+
+coverage:
+  status:
+    project:
+      default:
+        target: auto
+        threshold: 1%
+    patch:
+      default:
+        target: auto
+
+ignore:
+  - "ext/**/*" # GPU extensions (Metal, CUDA) - no GPU runners on CI
