perf(leiden): replace per-row GrB_Col_extract with CSR walk#406
Open
gkorland wants to merge 3 commits intoGraphBLAS:v1.3.xfrom
Open
perf(leiden): replace per-row GrB_Col_extract with CSR walk#406gkorland wants to merge 3 commits intoGraphBLAS:v1.3.xfrom
gkorland wants to merge 3 commits intoGraphBLAS:v1.3.xfrom
Conversation
Address Dr. Davis's review feedback that the initial Leiden implementation 'doesn't exploit much of GraphBLAS' and is therefore slow. The dominant cost was per-node GraphBLAS overhead inside the Phase 1 / Phase 2 hot loops. Optimizations (algorithm semantics, tie-breaking, self-loop handling and modularity formula are all unchanged): * Once per outer aggregation level, materialize A_cur into portable CSR arrays (Ap, Aj, Ax) via a single GrB_Matrix_extractTuples_FP64 + counting-sort scatter. The Phase 1 / Phase 2 inner loops now walk Aj[Ap[i]..Ap[i+1]) directly instead of calling GrB_Col_extract + GrB_Vector_nvals + GrB_Vector_extractTuples_FP64 for every node on every iteration of both phases. * Compute k_arr[i] from CSR row sums; remove the per-level GrB_Matrix_reduce_Monoid + per-element GrB_Vector_extractElement_FP64 loop. * Build S_mat via a single GrB_Matrix_build_FP64 instead of n_cur GrB_Matrix_setElement_FP64 calls + an implicit GrB_Matrix_wait. Existing scratch buffers (nbrs_j, dirty_list, nbrs_v) are reused for the build inputs. * Build the output GrB_Vector via GrB_Vector_build_INT64 instead of n GrB_Vector_setElement_INT64 calls (also fixed in the empty-graph fallback). * CSR/tuple buffers grow on demand and are released through LG_FREE_WORK on every error path. Asymptotic per-level cost in the local-move and refinement phases drops from O(n_cur * iters * GrB-call-overhead) to O(nnz * iters) with no GraphBLAS overhead in the hot loop. test_leiden still passes (Q = 0.4188 on karate, well above the 0.37 acceptance threshold). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
GomezGab
requested changes
May 4, 2026
Collaborator
GomezGab
left a comment
There was a problem hiding this comment.
From what I can tell, the algorithm doesn't exploit GraphBLAS for speed/parallelism on the most intensive parts. I don't yet understand Leiden enough to give advice as to how to effectively exploit GraphBLAS in these places.
Address review feedback from @GomezGab on PR GraphBLAS#406: * Replace GrB_Matrix_extractTuples_FP64 + counting-sort scatter with GxB_unload_Matrix_into_Container, giving direct pointer access to the matrix's internal CSR arrays (Ap/Aj/Ax) with no copy. Force sparse + row-major + non-iso + 64-bit indices via GrB_Matrix_set_INT32 hints + GrB_wait(GrB_MATERIALIZE), then unload through the container's p/i/x vectors with GxB_Vector_unload. Reload the container before the Phase-3 mxm so A_cur is once again a usable GraphBLAS matrix; null our raw-array pointers after reload to avoid double-free. * Compute degrees via GrB_Matrix_reduce_Monoid with GrB_PLUS_FP64 accum on a pre-zero-filled k_vec (handles isolated rows), then GxB_Vector_unload to obtain k_arr. k_vec is recreated at size n_cur each outer level. * Use GxB_Matrix_build_Scalar with a shared GrB_Scalar (=1.0) for the S_mat indicator matrix; reuse iota for row indices and reinterpret-cast c_ref (int64_t*) directly as GrB_Index* (values are non-negative community labels), removing the per-level value array and the temporary c_ref copy. * Use GxB_Vector_load with move semantics to hand the o_comm buffer directly to the output vector (with GxB_FULL sparsity), avoiding the final GrB_Vector_build_INT64 copy. * Duplicate G->A as GrB_FP64 once at start (after the m == 0 early return) so A_cur is always owned, FP64-typed, and may be unloaded. Required because G->A may be any type (BOOL on pattern-only Matrix Market inputs, INT*, FP32, ...); typecasting once removes the need for per-edge type handling in the inner loops. test_leiden on karate.mtx: Q = 0.418803, identical to baseline. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The previous commit unconditionally used GxB_Container, GxB_Vector_load, GxB_Vector_unload, and GxB_load_Matrix_from_Container, which are only available in SuiteSparse:GraphBLAS v10.0.0 and newer. CI tests against both v9.0.0 and v10.2.0, so the v9 build failed with implicit-declaration errors. Wrap the Container-based fast path in '#if LAGR_LEIDEN_USE_CONTAINER' (defined to 1 when GxB_IMPLEMENTATION >= GxB_VERSION(10,0,0), else 0) and provide a v9 fallback that materializes CSR via GrB_Matrix_extractTuples_FP64 + counting-sort scatter (the same approach as commit 9a1daf0). Both code paths share: FP64 typecast of G->A into A_agg, GrB_Matrix reduce for degrees, GxB_Matrix_build_Scalar for the indicator matrix S_mat (build_Scalar exists in v9), and the rest of the algorithm. The macro can be overridden with -DLAGR_LEIDEN_USE_CONTAINER=0/1 to exercise either path on the same SuiteSparse install (used to verify the fallback compiles on the v10 dev machine). test_leiden on karate.mtx: Q = 0.418803 (unchanged from baseline). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Addresses Dr. Davis's review feedback that the initial Leiden implementation "doesn't exploit much of GraphBLAS" and is therefore slow. The dominant cost was per-node GraphBLAS overhead inside the Phase 1 / Phase 2 hot loops; this PR removes it without changing algorithm semantics.
Changes
A_curinto portable CSR arrays (Ap,Aj,Ax) via a singleGrB_Matrix_extractTuples_FP64+ counting-sort scatter. The Phase 1 / Phase 2 inner loops now walkAj[Ap[i]..Ap[i+1])directly instead of callingGrB_Col_extract+GrB_Vector_nvals+GrB_Vector_extractTuples_FP64for every node × every inner iteration × both phases (the dominant cost).GrB_Matrix_reduce_Monoid+ per-elementGrB_Vector_extractElement_FP64loop.GrB_Matrix_build_FP64for S_mat. Single-call construction instead ofn_curGrB_Matrix_setElement_FP64calls + implicitGrB_Matrix_wait. Existing scratch buffers (nbrs_j,dirty_list,nbrs_v) are reused for the build inputs.GrB_Vector_build_INT64for output. Single-call construction instead ofnGrB_Vector_setElement_INT64calls (also fixed in the empty-graph fallback).LG_FREE_WORKon every error path.Algorithm semantics (node iteration order, tie-breaking with
>vs>=, self-loop skipping, modularity formula, Leiden parent-community refinement constraint) are preserved.Testing
test_leidenstill passes:Q ≈ 0.42 on Zachary's karate club matches published Louvain/Leiden benchmarks for that graph.
Memory / Performance Impact
O(n_cur · iters · GrB-call-overhead)toO(nnz · iters)with no GraphBLAS overhead in the hot loop.max(nnz)(Ap,Aj,Ax,I_tup,J_tup,X_tup) plus ann-sized cursor and aniotaindex array. Buffers grow on demand and are reused across levels.GrB_*API.Related Issues
Follow-up to #401 (community: implement Leiden community detection).