perf(leiden): replace per-row GrB_Col_extract with CSR walk#406

Open
gkorland wants to merge 3 commits into GraphBLAS:v1.3.x from gkorland:perf/leiden-graphblas-csr

Conversation

@gkorland gkorland commented May 4, 2026

Summary

Addresses Dr. Davis's review feedback that the initial Leiden implementation "doesn't exploit much of GraphBLAS" and is therefore slow. The dominant cost was per-node GraphBLAS overhead inside the Phase 1 / Phase 2 hot loops; this PR removes it without changing algorithm semantics.

Changes

  • CSR materialization per level. Once per outer aggregation level, materialize A_cur into portable CSR arrays (Ap, Aj, Ax) via a single GrB_Matrix_extractTuples_FP64 + counting-sort scatter. The Phase 1 / Phase 2 inner loops now walk Aj[Ap[i]..Ap[i+1]) directly instead of calling GrB_Col_extract + GrB_Vector_nvals + GrB_Vector_extractTuples_FP64 for every node × every inner iteration × both phases (the dominant cost).
  • k_arr from CSR. Compute degree directly from CSR row sums; remove the per-level GrB_Matrix_reduce_Monoid + per-element GrB_Vector_extractElement_FP64 loop.
  • GrB_Matrix_build_FP64 for S_mat. Single-call construction instead of n_cur GrB_Matrix_setElement_FP64 calls + implicit GrB_Matrix_wait. Existing scratch buffers (nbrs_j, dirty_list, nbrs_v) are reused for the build inputs.
  • GrB_Vector_build_INT64 for output. Single-call construction instead of n GrB_Vector_setElement_INT64 calls (also fixed in the empty-graph fallback).
  • CSR/tuple buffers grow on demand and are released through LG_FREE_WORK on every error path.

Algorithm semantics (node iteration order, tie-breaking with > vs >=, self-loop skipping, modularity formula, Leiden parent-community refinement constraint) are preserved.
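For reference, the modularity being preserved is the standard Newman definition (with a resolution parameter $\gamma$, as used by Leiden; the exact form in the source is not reproduced here, so this is the textbook version):

```math
Q = \frac{1}{2m}\sum_{ij}\left(A_{ij} - \gamma\,\frac{k_i k_j}{2m}\right)\delta(c_i, c_j)
```

where $m$ is the total edge weight, $k_i$ the weighted degree, and $\delta(c_i, c_j)$ is 1 when nodes $i$ and $j$ share a community.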

Testing

test_leiden still passes:

====== karate.mtx ======
  Modularity Q = 0.418803  (K_ref <= 34)
[ OK ]
SUCCESS: All unit tests have passed.

Q ≈ 0.42 on Zachary's karate club matches published Louvain/Leiden benchmarks for that graph.

Memory / Performance Impact

  • Per-level work in the local-move and refinement phases drops from O(n_cur · iters · GrB-call-overhead) to O(nnz · iters) with no GraphBLAS overhead in the hot loop.
  • Additional memory: 6 arrays of size max(nnz) (Ap, Aj, Ax, I_tup, J_tup, X_tup) plus an n-sized cursor and an iota index array. Buffers grow on demand and are reused across levels.
  • No new GraphBLAS feature dependencies (no GxB unpack/pack); uses only standard GrB_* API.

Related Issues

Follow-up to #401 (community: implement Leiden community detection).

Address Dr. Davis's review feedback that the initial Leiden
implementation 'doesn't exploit much of GraphBLAS' and is therefore
slow.  The dominant cost was per-node GraphBLAS overhead inside the
Phase 1 / Phase 2 hot loops.

Optimizations (algorithm semantics, tie-breaking, self-loop handling
and modularity formula are all unchanged):

* Once per outer aggregation level, materialize A_cur into portable
  CSR arrays (Ap, Aj, Ax) via a single GrB_Matrix_extractTuples_FP64
  + counting-sort scatter.  The Phase 1 / Phase 2 inner loops now walk
  Aj[Ap[i]..Ap[i+1]) directly instead of calling GrB_Col_extract +
  GrB_Vector_nvals + GrB_Vector_extractTuples_FP64 for every node on
  every iteration of both phases.

* Compute k_arr[i] from CSR row sums; remove the per-level
  GrB_Matrix_reduce_Monoid + per-element GrB_Vector_extractElement_FP64
  loop.

* Build S_mat via a single GrB_Matrix_build_FP64 instead of n_cur
  GrB_Matrix_setElement_FP64 calls + an implicit GrB_Matrix_wait.
  Existing scratch buffers (nbrs_j, dirty_list, nbrs_v) are reused
  for the build inputs.

* Build the output GrB_Vector via GrB_Vector_build_INT64 instead of
  n GrB_Vector_setElement_INT64 calls (also fixed in the empty-graph
  fallback).

* CSR/tuple buffers grow on demand and are released through
  LG_FREE_WORK on every error path.

Asymptotic per-level cost in the local-move and refinement phases
drops from O(n_cur * iters * GrB-call-overhead) to O(nnz * iters)
with no GraphBLAS overhead in the hot loop.

test_leiden still passes (Q = 0.4188 on karate, well above the 0.37
acceptance threshold).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Collaborator

@GomezGab GomezGab left a comment


From what I can tell, the algorithm doesn't exploit GraphBLAS for speed/parallelism on the most intensive parts. I don't yet understand Leiden enough to give advice as to how to effectively exploit GraphBLAS in these places.

Comment threads on experimental/algorithm/LAGraph_Leiden.c (three now marked outdated)
Address review feedback from @GomezGab on PR GraphBLAS#406:

* Replace GrB_Matrix_extractTuples_FP64 + counting-sort scatter with
  GxB_unload_Matrix_into_Container, giving direct pointer access to the
  matrix's internal CSR arrays (Ap/Aj/Ax) with no copy.  Force
  sparse + row-major + non-iso + 64-bit indices via GrB_Matrix_set_INT32
  hints + GrB_wait(GrB_MATERIALIZE), then unload through the container's
  p/i/x vectors with GxB_Vector_unload.  Reload the container before the
  Phase-3 mxm so A_cur is once again a usable GraphBLAS matrix; null our
  raw-array pointers after reload to avoid double-free.

* Compute degrees via GrB_Matrix_reduce_Monoid with GrB_PLUS_FP64 accum
  on a pre-zero-filled k_vec (handles isolated rows), then GxB_Vector_unload
  to obtain k_arr.  k_vec is recreated at size n_cur each outer level.

* Use GxB_Matrix_build_Scalar with a shared GrB_Scalar (=1.0) for the
  S_mat indicator matrix; reuse iota for row indices and reinterpret-cast
  c_ref (int64_t*) directly as GrB_Index* (values are non-negative
  community labels), removing the per-level value array and the temporary
  c_ref copy.

* Use GxB_Vector_load with move semantics to hand the o_comm buffer
  directly to the output vector (with GxB_FULL sparsity), avoiding the
  final GrB_Vector_build_INT64 copy.

* Duplicate G->A as GrB_FP64 once at start (after the m == 0 early
  return) so A_cur is always owned, FP64-typed, and may be unloaded.
  Required because G->A may be any type (BOOL on pattern-only Matrix
  Market inputs, INT*, FP32, ...); typecasting once removes the need
  for per-edge type handling in the inner loops.

test_leiden on karate.mtx: Q = 0.418803, identical to baseline.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@gkorland gkorland requested a review from GomezGab May 5, 2026 06:17
The previous commit unconditionally used GxB_Container, GxB_Vector_load,
GxB_Vector_unload, and GxB_load_Matrix_from_Container, which are only
available in SuiteSparse:GraphBLAS v10.0.0 and newer.  CI tests against
both v9.0.0 and v10.2.0, so the v9 build failed with implicit-declaration
errors.

Wrap the Container-based fast path in
'#if LAGR_LEIDEN_USE_CONTAINER' (defined to 1 when
GxB_IMPLEMENTATION >= GxB_VERSION(10,0,0), else 0) and provide a v9
fallback that materializes CSR via GrB_Matrix_extractTuples_FP64 +
counting-sort scatter (the same approach as commit 9a1daf0).

Both code paths share: FP64 typecast of G->A into A_agg, GrB_Matrix
reduce for degrees, GxB_Matrix_build_Scalar for the indicator matrix
S_mat (build_Scalar exists in v9), and the rest of the algorithm.

The macro can be overridden with -DLAGR_LEIDEN_USE_CONTAINER=0/1 to
exercise either path on the same SuiteSparse install (used to verify
the fallback compiles on the v10 dev machine).

test_leiden on karate.mtx: Q = 0.418803 (unchanged from baseline).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>