Add output of do_gap_tc_gpu (triangle counting, CPU and GPU) for the GAP-twitter matrix

DrTimothyAldenDavis · DrTimothyAldenDavis · commit 475343730ff5 · 2025-10-17T12:59:43.000-04:00
diff --git a/o_do_gap_tc_gpu b/o_do_gap_tc_gpu
@@ -0,0 +1,176 @@
+mit d-14-6-2 $ ./do_gap_tc_gpu grb
+ 
+======================================================================
+GAP benchmarks using LAGraph+GraphBLAS: Triangle Counting
+======================================================================
+d-14-6-2
+OMP_PLACES=cores
+OMP_PROC_BIND=spread
+Matrix input file format:  grb
+GAP matrices located in:   ../../../GAP
+GB_cuda_get_device_count: 2, cudaError_t: 0
+GB_cuda_init: ngpus: 2
+
+Device: 0: memory: 34072559616 SMs: 80 compute: 7.0
+
+Device: 1: memory: 34072559616 SMs: 80 compute: 7.0
+CUDA_VISIBLE_DEVICES = 0,1
+getting cuda visible devices
+Found device_id 0
+Found device_id 1
+devices.size is 2
+cuda warmup 0
+cuda warmup 0 OK
+cuda warmup 1
+cuda warmup 1 OK
+JIT init, device 0
+library: SuiteSparse:GraphBLAS v10.2.0 [FIXME, 2025]
+# of trials: 5
+threads to test:  80
+matrix: ../../../GAP/GAP-twitter/GAP-twitter.grb
+[.grb]
+Reading binary file: ../../../GAP/GAP-twitter/GAP-twitter.grb
+ [ GrB_set 
+   0.0133 sec ]
+ [ GrB_set 
+   0.000835 sec ]
+ [ GrB_set 
+   0.664 sec ]
+ [ GrB_set 
+   0.0717 sec ]
+A converted to 32-bit
+ [ GrB_assign (C iso assign) (pending: 0) Method 05e: (C empty)<M,struct> = scalar 
+   0.226 sec ]
+ [ GrB_select (iso select) (select sparse on cuda) 
+blockdim1: 512 chunksize1: 4096
+blockdim2: 256 chunksize2: 1024
+(jit: cuda load) 
+select sparse phase1: 1.6 sec (gpu: Map, with cumsum)
+select sparse phase2: 0.00121746 sec (cpu: ChunkSum of Map)
+select sparse phase3: 1.25434 sec (gpu: create Ci,Cx,Ck)
+select sparse phase4: 0.0143758 sec (gpu: Ck_Delta, with cumsum)
+select sparse phase5: 0.00237009 sec (cpu: ChunkSum for Ck_Delta
+select sparse phase6: 0.0311198 sec (gpu: Cp,Ch)
+(hyper to sparse) 
+   2.98 sec ]
+ [ GrB_transpose (iso transpose) (80-thread atomic bucket transpose) 
+   8.62 sec ]
+ [ GrB_eWiseMult (iso wait:B 0 zombies, 0 pending, jumbled) (wait: unjumble only) emult:(S<.>=S.*S) (iso emult) 
+   1.46 sec ]
+ [ GrB_Matrix_nvals 
+   1.16e-06 sec ]
+ [ GrB_Matrix_nvals 
+   6.1e-08 sec ]
+ [ GrB_Matrix_nvals 
+   3.4e-08 sec ]
+ [ GrB_Matrix_nvals 
+   3.19e-08 sec ]
+ [ GrB_eWiseMult emult:(S<.>=S.*S) (iso emult) 
+   0.396 sec ]
+ [ GrB_Matrix_nvals 
+   4.31e-07 sec ]
+forcing G-> to be symmetric (via A = A+A')
+ [ GrB_eWiseAdd add:(S<.>=S+S) (iso add) 
+   0.849 sec ]
+read time: 58.1269
+ [ GrB_assign (C iso assign) (pending: 0) Method 21: (C full) = scalar 
+   2.94e-05 sec ]
+ [ GrB_mxv C=A'*B, dot_product (dot2) (nthreads: 80 naslice 2560 nbslice 1) (dot B = S'*F) (jit: cpu load) 
+   0.0268 sec ]
+ [ GrB_Matrix_nvals 
+   2.66e-07 sec ]
+
+warmup method: Sandia_ULT: sum ((U*L') .* U)   sort: none
+
+ [ GrB_select (iso select) 
+   0.257 sec ]
+ [ GrB_select (iso select) 
+   0.367 sec ]
+ [ GrB_mxm C<M>=A'*B, masked_dot_product (dot3) (S{S} = S'*S)  work:2.34829e+10 GPUs:0 nthreads 80 ntasks 2560 (jit: compile and load) (jit compile:)
+sh -c "/usr/bin/gcc -DGB_JIT_RUNTIME=1  -Wundef  -Wno-strict-aliasing  -std=c11 -lm -Wno-pragmas  -fexcess-precision=fast  -fcx-limited-range  -fno-math-errno  -fwrapv  -O3 -DNDEBUG -fPIC  -fopenmp -I'/home/gridsan/tdavis/.SuiteSparse/GrB10.2.0/src' -I'/home/gridsan/tdavis/.SuiteSparse/GrB10.2.0/src/template' -I'/home/gridsan/tdavis/.SuiteSparse/GrB10.2.0/src/include'  -o '/home/gridsan/tdavis/.SuiteSparse/GrB10.2.0/c/99/GB_jit__AxB_dot3__fff4611800280055.o' -c '/home/gridsan/tdavis/.SuiteSparse/GrB10.2.0/c/99/GB_jit__AxB_dot3__fff4611800280055.c'   2>&1   ; /usr/bin/gcc  -Wundef  -Wno-strict-aliasing  -std=c11 -lm -Wno-pragmas  -fexcess-precision=fast  -fcx-limited-range  -fno-math-errno  -fwrapv  -O3 -DNDEBUG -fPIC  -fopenmp  -shared  -o '/home/gridsan/tdavis/.SuiteSparse/GrB10.2.0/lib/99/libGB_jit__AxB_dot3__fff4611800280055.so' '/home/gridsan/tdavis/.SuiteSparse/GrB10.2.0/c/99/GB_jit__AxB_dot3__fff4611800280055.o'  -lm -ldl -lgomp -lpthread   2>&1  "
+ 
+   127 sec ]
+ [ GrB_reduce  work:1.20251e+09 gpus:0 (jit: compile and load) (jit compile:)
+sh -c "/usr/bin/gcc -DGB_JIT_RUNTIME=1  -Wundef  -Wno-strict-aliasing  -std=c11 -lm -Wno-pragmas  -fexcess-precision=fast  -fcx-limited-range  -fno-math-errno  -fwrapv  -O3 -DNDEBUG -fPIC  -fopenmp -I'/home/gridsan/tdavis/.SuiteSparse/GrB10.2.0/src' -I'/home/gridsan/tdavis/.SuiteSparse/GrB10.2.0/src/template' -I'/home/gridsan/tdavis/.SuiteSparse/GrB10.2.0/src/include'  -o '/home/gridsan/tdavis/.SuiteSparse/GrB10.2.0/c/6b/GB_jit__reduce__1488d.o' -c '/home/gridsan/tdavis/.SuiteSparse/GrB10.2.0/c/6b/GB_jit__reduce__1488d.c'   2>&1   ; /usr/bin/gcc  -Wundef  -Wno-strict-aliasing  -std=c11 -lm -Wno-pragmas  -fexcess-precision=fast  -fcx-limited-range  -fno-math-errno  -fwrapv  -O3 -DNDEBUG -fPIC  -fopenmp  -shared  -o '/home/gridsan/tdavis/.SuiteSparse/GrB10.2.0/lib/6b/libGB_jit__reduce__1488d.so' '/home/gridsan/tdavis/.SuiteSparse/GrB10.2.0/c/6b/GB_jit__reduce__1488d.o'  -lm -ldl -lgomp -lpthread   2>&1  "
+
+   0.449 sec ]
+Sandia_ULT (dot) time: 127.379
+# of triangles: 34824916864
+Sandia_ULT: sum ((U*L') .* U)   sort: none
+nthreads:  80 time:   128.008497 rate:  18.79 (Sandia_ULT, one trial)
+
+ [ GrB_select (iso select) (select sparse on cuda) 
+blockdim1: 512 chunksize1: 4096
+blockdim2: 256 chunksize2: 1024
+(jit: cuda load) 
+select sparse phase1: 2.54589 sec (gpu: Map, with cumsum)
+select sparse phase2: 0.00104322 sec (cpu: ChunkSum of Map)
+select sparse phase3: 1.03406 sec (gpu: create Ci,Cx,Ck)
+select sparse phase4: 0.0183033 sec (gpu: Ck_Delta, with cumsum)
+select sparse phase5: 0.00183061 sec (cpu: ChunkSum for Ck_Delta
+select sparse phase6: 0.0835983 sec (gpu: Cp,Ch)
+(hyper to sparse) 
+   3.72 sec ]
+ [ GrB_select (iso select) (select sparse on cuda) 
+blockdim1: 512 chunksize1: 4096
+blockdim2: 256 chunksize2: 1024
+(jit: cuda load) 
+select sparse phase1: 4.44203 sec (gpu: Map, with cumsum)
+select sparse phase2: 0.00125335 sec (cpu: ChunkSum of Map)
+select sparse phase3: 1.23286 sec (gpu: create Ci,Cx,Ck)
+select sparse phase4: 0.0124803 sec (gpu: Ck_Delta, with cumsum)
+select sparse phase5: 0.00181425 sec (cpu: ChunkSum for Ck_Delta
+select sparse phase6: 0.0726959 sec (gpu: Cp,Ch)
+(hyper to sparse) 
+   5.79 sec ]
+ [ GrB_mxm C<M>=A'*B, masked_dot_product (dot3) (S{S} = S'*S)  work:2.34829e+10 GPUs:2 (GPU dot3) dot3 using cuda device 0
+(GPU C created and copied from M) (jit: cuda load) 
+mnz: 1202513046
+number_of_blocks_1: 10240
+threads_per_block: 32
+
+zombies: 0
+bucket 1: 87486104
+bucket 2: 359533770
+bucket 3: 755493172
+mnz: 1202513046 in buckets : 1202513046
+
+   62.7 sec ]
+ [ GrB_reduce  work:1.20251e+09 gpus:2 has_cheeseburger 1
+(cuda reduce launch 320 threads in 14680 blocks)(jit: cuda load) 
+   0.0184 sec ]
+Sandia_ULT (dot) time: 62.6785
+# of triangles: 34824916864 (GPU)
+Sandia_ULT: sum ((U*L') .* U)   sort: none
+nthreads:  80 time:    72.199374 rate:  33.31 (Sandia_ULT, one trial)
+
+Method: GPU: 0 Sandia_ULT: sum ((U*L') .* U)   sort: none
+
+ [ GrB_select (iso select) 
+   0.762 sec ]
+ [ GrB_select (iso select) 
+   0.464 sec ]
+ [ GrB_mxm C<M>=A'*B, masked_dot_product (dot3) (S{S} = S'*S)  work:2.34829e+10 GPUs:0 nthreads 80 ntasks 2560 
+   122 sec ]
+ [ GrB_reduce  work:1.20251e+09 gpus:0 
+   0.141 sec ]
+Sandia_ULT (dot) time: 122.234
+trial  0:   123.460617 sec rate  19.48  # triangles: 3.48249e+10
+
+ [ GrB_select (iso select) 
+   0.245 sec ]
+ [ GrB_select (iso select) 
+   0.24 sec ]
+ [ GrB_mxm C<M>=A'*B, masked_dot_product (dot3) (S{S} = S'*S)  work:2.34829e+10 GPUs:0 nthreads 80 ntasks 2560 
+   122 sec ]
+ [ GrB_reduce  work:1.20251e+09 gpus:0 
+   0.14 sec ]
+Sandia_ULT (dot) time: 122.484
+trial  1:   122.969385 sec rate  19.56  # triangles: 3.48249e+10
+
+ [ GrB_select (iso select) 
+   0.247 sec ]
+ [ GrB_select (iso select) 
+   0.237 sec ]
+ [ GrB_mxm C<M>=A'*B, masked_dot_product (dot3) (S{S} = S'*S)  work:2.34829e+10 GPUs:0 nthreads 80 ntasks 2560 ^C^C
+mit d-14-6-2 $ ^C
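
The log above names the method as `Sandia_ULT: sum ((U*L') .* U)`, where U and L are the strictly upper and strictly lower triangular parts of the symmetric adjacency matrix. As a rough illustration of what that expression computes, here is a minimal pure-Python sketch. It is not LAGraph's or GraphBLAS's implementation: the function name `sandia_ult_triangle_count` and the set-based row representation are assumptions made for this example, and it is only meant for tiny graphs.

```python
def sandia_ult_triangle_count(n, edges):
    """Count triangles as sum((U*L') .* U) on an undirected graph.

    Hypothetical helper for illustration only; n is the number of
    nodes and edges is an iterable of (a, b) pairs.
    """
    # Row i of U: neighbours j of i with j > i (strictly upper triangle).
    upper = [set() for _ in range(n)]
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops; they never form triangles
        i, j = min(a, b), max(a, b)
        upper[i].add(j)

    # Row j of L: neighbours i of j with i < j (L is the transpose of U).
    lower = [set() for _ in range(n)]
    for i in range(n):
        for j in upper[i]:
            lower[j].add(i)

    total = 0
    for i in range(n):
        for j in upper[i]:  # the mask ".* U" keeps only entries present in U
            # (U*L')(i,j) = sum_k U(i,k) * L(j,k): a dot product of
            # row i of U with row j of L, i.e. the number of k with i < k < j
            # adjacent to both i and j.
            total += len(upper[i] & lower[j])
    return total


if __name__ == "__main__":
    k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    print(sandia_ult_triangle_count(4, k4))
```

Each triangle {i, k, j} with i < k < j is counted exactly once, at the masked entry (i, j), so no final division is needed. The masked dot-product form is also why the log reports the `GrB_mxm` step as `C<M>=A'*B, masked_dot_product (dot3)`.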