Commit 4753437

output for do_gap_tc_gpu for twitter
1 parent 68460ae commit 4753437

1 file changed

Lines changed: 176 additions & 0 deletions

o_do_gap_tc_gpu

@@ -0,0 +1,176 @@
mit d-14-6-2 $ ./do_gap_tc_gpu grb

======================================================================
GAP benchmarks using LAGraph+GraphBLAS: Triangle Counting
======================================================================
d-14-6-2
OMP_PLACES=cores
OMP_PROC_BIND=spread
Matrix input file format: grb
GAP matrices located in: ../../../GAP
GB_cuda_get_device_count: 2, cudaError_t: 0
GB_cuda_init: ngpus: 2

Device: 0: memory: 34072559616 SMs: 80 compute: 7.0

Device: 1: memory: 34072559616 SMs: 80 compute: 7.0
CUDA_VISIBLE_DEVICES = 0,1
getting cuda visible devices
Found device_id 0
Found device_id 1
devices.size is 2
cuda warmup 0
cuda warmup 0 OK
cuda warmup 1
cuda warmup 1 OK
JIT init, device 0
library: SuiteSparse:GraphBLAS v10.2.0 [FIXME, 2025]
# of trials: 5
threads to test: 80
matrix: ../../../GAP/GAP-twitter/GAP-twitter.grb
[.grb]
Reading binary file: ../../../GAP/GAP-twitter/GAP-twitter.grb
[ GrB_set
0.0133 sec ]
[ GrB_set
0.000835 sec ]
[ GrB_set
0.664 sec ]
[ GrB_set
0.0717 sec ]
A converted to 32-bit
[ GrB_assign (C iso assign) (pending: 0) Method 05e: (C empty)<M,struct> = scalar
0.226 sec ]
[ GrB_select (iso select) (select sparse on cuda)
blockdim1: 512 chunksize1: 4096
blockdim2: 256 chunksize2: 1024
(jit: cuda load)
select sparse phase1: 1.6 sec (gpu: Map, with cumsum)
select sparse phase2: 0.00121746 sec (cpu: ChunkSum of Map)
select sparse phase3: 1.25434 sec (gpu: create Ci,Cx,Ck)
select sparse phase4: 0.0143758 sec (gpu: Ck_Delta, with cumsum)
select sparse phase5: 0.00237009 sec (cpu: ChunkSum for Ck_Delta
select sparse phase6: 0.0311198 sec (gpu: Cp,Ch)
(hyper to sparse)
2.98 sec ]
[ GrB_transpose (iso transpose) (80-thread atomic bucket transpose)
8.62 sec ]
[ GrB_eWiseMult (iso wait:B 0 zombies, 0 pending, jumbled) (wait: unjumble only) emult:(S<.>=S.*S) (iso emult)
1.46 sec ]
[ GrB_Matrix_nvals
1.16e-06 sec ]
[ GrB_Matrix_nvals
6.1e-08 sec ]
[ GrB_Matrix_nvals
3.4e-08 sec ]
[ GrB_Matrix_nvals
3.19e-08 sec ]
[ GrB_eWiseMult emult:(S<.>=S.*S) (iso emult)
0.396 sec ]
[ GrB_Matrix_nvals
4.31e-07 sec ]
forcing G-> to be symmetric (via A = A+A')
[ GrB_eWiseAdd add:(S<.>=S+S) (iso add)
0.849 sec ]
read time: 58.1269
[ GrB_assign (C iso assign) (pending: 0) Method 21: (C full) = scalar
2.94e-05 sec ]
[ GrB_mxv C=A'*B, dot_product (dot2) (nthreads: 80 naslice 2560 nbslice 1) (dot B = S'*F) (jit: cpu load)
0.0268 sec ]
[ GrB_Matrix_nvals
2.66e-07 sec ]

warmup method: Sandia_ULT: sum ((U*L') .* U) sort: none

[ GrB_select (iso select)
0.257 sec ]
[ GrB_select (iso select)
0.367 sec ]
[ GrB_mxm C<M>=A'*B, masked_dot_product (dot3) (S{S} = S'*S) work:2.34829e+10 GPUs:0 nthreads 80 ntasks 2560 (jit: compile and load) (jit compile:)
sh -c "/usr/bin/gcc -DGB_JIT_RUNTIME=1 -Wundef -Wno-strict-aliasing -std=c11 -lm -Wno-pragmas -fexcess-precision=fast -fcx-limited-range -fno-math-errno -fwrapv -O3 -DNDEBUG -fPIC -fopenmp -I'/home/gridsan/tdavis/.SuiteSparse/GrB10.2.0/src' -I'/home/gridsan/tdavis/.SuiteSparse/GrB10.2.0/src/template' -I'/home/gridsan/tdavis/.SuiteSparse/GrB10.2.0/src/include' -o '/home/gridsan/tdavis/.SuiteSparse/GrB10.2.0/c/99/GB_jit__AxB_dot3__fff4611800280055.o' -c '/home/gridsan/tdavis/.SuiteSparse/GrB10.2.0/c/99/GB_jit__AxB_dot3__fff4611800280055.c' 2>&1 ; /usr/bin/gcc -Wundef -Wno-strict-aliasing -std=c11 -lm -Wno-pragmas -fexcess-precision=fast -fcx-limited-range -fno-math-errno -fwrapv -O3 -DNDEBUG -fPIC -fopenmp -shared -o '/home/gridsan/tdavis/.SuiteSparse/GrB10.2.0/lib/99/libGB_jit__AxB_dot3__fff4611800280055.so' '/home/gridsan/tdavis/.SuiteSparse/GrB10.2.0/c/99/GB_jit__AxB_dot3__fff4611800280055.o' -lm -ldl -lgomp -lpthread 2>&1 "

127 sec ]
[ GrB_reduce work:1.20251e+09 gpus:0 (jit: compile and load) (jit compile:)
sh -c "/usr/bin/gcc -DGB_JIT_RUNTIME=1 -Wundef -Wno-strict-aliasing -std=c11 -lm -Wno-pragmas -fexcess-precision=fast -fcx-limited-range -fno-math-errno -fwrapv -O3 -DNDEBUG -fPIC -fopenmp -I'/home/gridsan/tdavis/.SuiteSparse/GrB10.2.0/src' -I'/home/gridsan/tdavis/.SuiteSparse/GrB10.2.0/src/template' -I'/home/gridsan/tdavis/.SuiteSparse/GrB10.2.0/src/include' -o '/home/gridsan/tdavis/.SuiteSparse/GrB10.2.0/c/6b/GB_jit__reduce__1488d.o' -c '/home/gridsan/tdavis/.SuiteSparse/GrB10.2.0/c/6b/GB_jit__reduce__1488d.c' 2>&1 ; /usr/bin/gcc -Wundef -Wno-strict-aliasing -std=c11 -lm -Wno-pragmas -fexcess-precision=fast -fcx-limited-range -fno-math-errno -fwrapv -O3 -DNDEBUG -fPIC -fopenmp -shared -o '/home/gridsan/tdavis/.SuiteSparse/GrB10.2.0/lib/6b/libGB_jit__reduce__1488d.so' '/home/gridsan/tdavis/.SuiteSparse/GrB10.2.0/c/6b/GB_jit__reduce__1488d.o' -lm -ldl -lgomp -lpthread 2>&1 "

0.449 sec ]
Sandia_ULT (dot) time: 127.379
# of triangles: 34824916864
Sandia_ULT: sum ((U*L') .* U) sort: none
nthreads: 80 time: 128.008497 rate: 18.79 (Sandia_ULT, one trial)

[ GrB_select (iso select) (select sparse on cuda)
blockdim1: 512 chunksize1: 4096
blockdim2: 256 chunksize2: 1024
(jit: cuda load)
select sparse phase1: 2.54589 sec (gpu: Map, with cumsum)
select sparse phase2: 0.00104322 sec (cpu: ChunkSum of Map)
select sparse phase3: 1.03406 sec (gpu: create Ci,Cx,Ck)
select sparse phase4: 0.0183033 sec (gpu: Ck_Delta, with cumsum)
select sparse phase5: 0.00183061 sec (cpu: ChunkSum for Ck_Delta
select sparse phase6: 0.0835983 sec (gpu: Cp,Ch)
(hyper to sparse)
3.72 sec ]
[ GrB_select (iso select) (select sparse on cuda)
blockdim1: 512 chunksize1: 4096
blockdim2: 256 chunksize2: 1024
(jit: cuda load)
select sparse phase1: 4.44203 sec (gpu: Map, with cumsum)
select sparse phase2: 0.00125335 sec (cpu: ChunkSum of Map)
select sparse phase3: 1.23286 sec (gpu: create Ci,Cx,Ck)
select sparse phase4: 0.0124803 sec (gpu: Ck_Delta, with cumsum)
select sparse phase5: 0.00181425 sec (cpu: ChunkSum for Ck_Delta
select sparse phase6: 0.0726959 sec (gpu: Cp,Ch)
(hyper to sparse)
5.79 sec ]
[ GrB_mxm C<M>=A'*B, masked_dot_product (dot3) (S{S} = S'*S) work:2.34829e+10 GPUs:2 (GPU dot3) dot3 using cuda device 0
(GPU C created and copied from M) (jit: cuda load)
mnz: 1202513046
number_of_blocks_1: 10240
threads_per_block: 32

zombies: 0
bucket 1: 87486104
bucket 2: 359533770
bucket 3: 755493172
mnz: 1202513046 in buckets : 1202513046

62.7 sec ]
[ GrB_reduce work:1.20251e+09 gpus:2 has_cheeseburger 1
(cuda reduce launch 320 threads in 14680 blocks)(jit: cuda load)
0.0184 sec ]
Sandia_ULT (dot) time: 62.6785
# of triangles: 34824916864 (GPU)
Sandia_ULT: sum ((U*L') .* U) sort: none
nthreads: 80 time: 72.199374 rate: 33.31 (Sandia_ULT, one trial)

Method: GPU: 0 Sandia_ULT: sum ((U*L') .* U) sort: none

[ GrB_select (iso select)
0.762 sec ]
[ GrB_select (iso select)
0.464 sec ]
[ GrB_mxm C<M>=A'*B, masked_dot_product (dot3) (S{S} = S'*S) work:2.34829e+10 GPUs:0 nthreads 80 ntasks 2560
122 sec ]
[ GrB_reduce work:1.20251e+09 gpus:0
0.141 sec ]
Sandia_ULT (dot) time: 122.234
trial 0: 123.460617 sec rate 19.48 # triangles: 3.48249e+10

[ GrB_select (iso select)
0.245 sec ]
[ GrB_select (iso select)
0.24 sec ]
[ GrB_mxm C<M>=A'*B, masked_dot_product (dot3) (S{S} = S'*S) work:2.34829e+10 GPUs:0 nthreads 80 ntasks 2560
122 sec ]
[ GrB_reduce work:1.20251e+09 gpus:0
0.14 sec ]
Sandia_ULT (dot) time: 122.484
trial 1: 122.969385 sec rate 19.56 # triangles: 3.48249e+10

[ GrB_select (iso select)
0.247 sec ]
[ GrB_select (iso select)
0.237 sec ]
[ GrB_mxm C<M>=A'*B, masked_dot_product (dot3) (S{S} = S'*S) work:2.34829e+10 GPUs:0 nthreads 80 ntasks 2560 ^C^C
mit d-14-6-2 $ ^C