perf(leiden): replace per-row GrB_Col_extract with CSR walk#406

Open
gkorland wants to merge 3 commits into GraphBLAS:v1.3.x from gkorland:perf/leiden-graphblas-csr

Conversation

@gkorland gkorland commented May 4, 2026

Summary

Addresses Dr. Davis's review feedback that the initial Leiden implementation "doesn't exploit much of GraphBLAS" and is therefore slow. The dominant cost was per-node GraphBLAS overhead inside the Phase 1 / Phase 2 hot loops; this PR removes it without changing algorithm semantics.

Changes

  • CSR materialization per level. Once per outer aggregation level, materialize A_cur into portable CSR arrays (Ap, Aj, Ax) via a single GrB_Matrix_extractTuples_FP64 + counting-sort scatter. The Phase 1 / Phase 2 inner loops now walk Aj[Ap[i]..Ap[i+1]) directly instead of calling GrB_Col_extract + GrB_Vector_nvals + GrB_Vector_extractTuples_FP64 for every node × every inner iteration × both phases (the dominant cost).
  • k_arr from CSR. Compute degree directly from CSR row sums; remove the per-level GrB_Matrix_reduce_Monoid + per-element GrB_Vector_extractElement_FP64 loop.
  • GrB_Matrix_build_FP64 for S_mat. Single-call construction instead of n_cur GrB_Matrix_setElement_FP64 calls + implicit GrB_Matrix_wait. Existing scratch buffers (nbrs_j, dirty_list, nbrs_v) are reused for the build inputs.
  • GrB_Vector_build_INT64 for output. Single-call construction instead of n GrB_Vector_setElement_INT64 calls (also fixed in the empty-graph fallback).
  • CSR/tuple buffers grow on demand and are released through LG_FREE_WORK on every error path.

Algorithm semantics (node iteration order, tie-breaking with > vs >=, self-loop skipping, modularity formula, Leiden parent-community refinement constraint) are preserved.
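For reference, the modularity being preserved is the standard Newman definition (with a resolution parameter $\gamma$, as used by Leiden; the exact form in the source is not reproduced here, so this is the textbook version):

```math
Q = \frac{1}{2m}\sum_{ij}\left(A_{ij} - \gamma\,\frac{k_i k_j}{2m}\right)\delta(c_i, c_j)
```

where $m$ is the total edge weight, $k_i$ the weighted degree, and $\delta(c_i, c_j)$ is 1 when nodes $i$ and $j$ share a community.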

Testing

test_leiden still passes:

====== karate.mtx ======
  Modularity Q = 0.418803  (K_ref <= 34)
[ OK ]
SUCCESS: All unit tests have passed.

Q ≈ 0.42 on Zachary's karate club matches published Louvain/Leiden benchmarks for that graph.

Memory / Performance Impact

  • Per-level work in the local-move and refinement phases drops from O(n_cur · iters · GrB-call-overhead) to O(nnz · iters) with no GraphBLAS overhead in the hot loop.
  • Additional memory: 6 arrays of size max(nnz) (Ap, Aj, Ax, I_tup, J_tup, X_tup) plus an n-sized cursor and an iota index array. Buffers grow on demand and are reused across levels.
  • No new GraphBLAS feature dependencies (no GxB unpack/pack); uses only standard GrB_* API.

Related Issues

Follow-up to #401 (community: implement Leiden community detection).

Address Dr. Davis's review feedback that the initial Leiden
implementation 'doesn't exploit much of GraphBLAS' and is therefore
slow.  The dominant cost was per-node GraphBLAS overhead inside the
Phase 1 / Phase 2 hot loops.

Optimizations (algorithm semantics, tie-breaking, self-loop handling
and modularity formula are all unchanged):

* Once per outer aggregation level, materialize A_cur into portable
  CSR arrays (Ap, Aj, Ax) via a single GrB_Matrix_extractTuples_FP64
  + counting-sort scatter.  The Phase 1 / Phase 2 inner loops now walk
  Aj[Ap[i]..Ap[i+1]) directly instead of calling GrB_Col_extract +
  GrB_Vector_nvals + GrB_Vector_extractTuples_FP64 for every node on
  every iteration of both phases.

* Compute k_arr[i] from CSR row sums; remove the per-level
  GrB_Matrix_reduce_Monoid + per-element GrB_Vector_extractElement_FP64
  loop.

* Build S_mat via a single GrB_Matrix_build_FP64 instead of n_cur
  GrB_Matrix_setElement_FP64 calls + an implicit GrB_Matrix_wait.
  Existing scratch buffers (nbrs_j, dirty_list, nbrs_v) are reused
  for the build inputs.

* Build the output GrB_Vector via GrB_Vector_build_INT64 instead of
  n GrB_Vector_setElement_INT64 calls (also fixed in the empty-graph
  fallback).

* CSR/tuple buffers grow on demand and are released through
  LG_FREE_WORK on every error path.

Asymptotic per-level cost in the local-move and refinement phases
drops from O(n_cur * iters * GrB-call-overhead) to O(nnz * iters)
with no GraphBLAS overhead in the hot loop.

test_leiden still passes (Q = 0.4188 on karate, well above the 0.37
acceptance threshold).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Collaborator

@GomezGab GomezGab left a comment


From what I can tell, the algorithm doesn't exploit GraphBLAS for speed/parallelism on the most intensive parts. I don't yet understand Leiden enough to give advice as to how to effectively exploit GraphBLAS in these places.

Comment threads on experimental/algorithm/LAGraph_Leiden.c (three now marked outdated)
Address review feedback from @GomezGab on PR GraphBLAS#406:

* Replace GrB_Matrix_extractTuples_FP64 + counting-sort scatter with
  GxB_unload_Matrix_into_Container, giving direct pointer access to the
  matrix's internal CSR arrays (Ap/Aj/Ax) with no copy.  Force
  sparse + row-major + non-iso + 64-bit indices via GrB_Matrix_set_INT32
  hints + GrB_wait(GrB_MATERIALIZE), then unload through the container's
  p/i/x vectors with GxB_Vector_unload.  Reload the container before the
  Phase-3 mxm so A_cur is once again a usable GraphBLAS matrix; null our
  raw-array pointers after reload to avoid double-free.

* Compute degrees via GrB_Matrix_reduce_Monoid with GrB_PLUS_FP64 accum
  on a pre-zero-filled k_vec (handles isolated rows), then GxB_Vector_unload
  to obtain k_arr.  k_vec is recreated at size n_cur each outer level.

* Use GxB_Matrix_build_Scalar with a shared GrB_Scalar (=1.0) for the
  S_mat indicator matrix; reuse iota for row indices and reinterpret-cast
  c_ref (int64_t*) directly as GrB_Index* (values are non-negative
  community labels), removing the per-level value array and the temporary
  c_ref copy.

* Use GxB_Vector_load with move semantics to hand the o_comm buffer
  directly to the output vector (with GxB_FULL sparsity), avoiding the
  final GrB_Vector_build_INT64 copy.

* Duplicate G->A as GrB_FP64 once at start (after the m == 0 early
  return) so A_cur is always owned, FP64-typed, and may be unloaded.
  Required because G->A may be any type (BOOL on pattern-only Matrix
  Market inputs, INT*, FP32, ...); typecasting once removes the need
  for per-edge type handling in the inner loops.

test_leiden on karate.mtx: Q = 0.418803, identical to baseline.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@gkorland gkorland requested a review from GomezGab May 5, 2026 06:17
The previous commit unconditionally used GxB_Container, GxB_Vector_load,
GxB_Vector_unload, and GxB_load_Matrix_from_Container, which are only
available in SuiteSparse:GraphBLAS v10.0.0 and newer.  CI tests against
both v9.0.0 and v10.2.0, so the v9 build failed with implicit-declaration
errors.

Wrap the Container-based fast path in
'#if LAGR_LEIDEN_USE_CONTAINER' (defined to 1 when
GxB_IMPLEMENTATION >= GxB_VERSION(10,0,0), else 0) and provide a v9
fallback that materializes CSR via GrB_Matrix_extractTuples_FP64 +
counting-sort scatter (the same approach as commit 9a1daf0).

Both code paths share: FP64 typecast of G->A into A_agg, GrB_Matrix
reduce for degrees, GxB_Matrix_build_Scalar for the indicator matrix
S_mat (build_Scalar exists in v9), and the rest of the algorithm.

The macro can be overridden with -DLAGR_LEIDEN_USE_CONTAINER=0/1 to
exercise either path on the same SuiteSparse install (used to verify
the fallback compiles on the v10 dev machine).

test_leiden on karate.mtx: Q = 0.418803 (unchanged from baseline).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>