DistributedCC Experiments
- Octopus Main: 1 c5n.18xlarge
  - 72 vCPUs (Intel(R) Xeon(R) Platinum 8124M @ 3.00 GHz)
  - 192 GiB of RAM
  - Network Bandwidth: 100 Gbits/s
- Octopus Workers: 22 c5.4xlarge
  - 16 vCPUs (Intel(R) Xeon(R) Platinum 8124M @ 3.00 GHz)
  - 32 GiB of RAM
  - Network Bandwidth: up to 10 Gbits/s
Used 80 Octopus Workers for a total of 1280 worker threads
- Repeat the stream multiple times so that there are not so few updates between queries.
- Make the gutters smaller to reduce the number of updates that must be flushed? -- This could help if we are bottlenecked on the CacheTree.
- Measure in more detail where time is being spent (see the profiling sketch after this list). Are we bottlenecked on the CacheTree, (de)serialization, the WorkQueue, applying sketch deltas, context switching, or network latency?
- Another idea: run a test that inserts one update for each node in the graph and measure how long it takes to flush. If the flush latency is high, that likely rules out the CacheTree as the bottleneck.
- Run the same test with the single-machine version of the code and the same gutter size to compare? -- This rules out everything but network latency.
- One thing to note: with 1000 queries we only achieve about 1 million upds/s, which is worse than single-machine performance (suggesting the bottleneck is serialization or the network); a single-machine experiment should confirm this.
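One generic way to start on the "where is time being spent" question is to sample the main process with Linux perf while a flush or query is in progress. This is only a sketch: it captures on-CPU time (e.g. CacheTree operations, (de)serialization, applying deltas), so time spent blocked on the network would need a separate measurement, and the PID and duration below are placeholders.

```
# Sample call stacks of the Octopus Main process at 99 Hz for 30 seconds
sudo perf record -F 99 -g -p <octopus-main-pid> -- sleep 30

# Summarize which functions the samples landed in
sudo perf report
```

Repeating the same sampling on a worker process would show whether workers spend their time applying sketch deltas or mostly waiting on the WorkQueue.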
| Experiment | Result |
|---|---|
| Performing 0 queries during stream | 34.3s, 130 million upds/sec (DSU not used, query latency ~8.6s, Flush ~8.3s) |
| Performing 2 queries during stream | 39.9s, 112 million upds/sec (DSU not used, query latency ~9s, Flush ~8.1s) |
| Performing 10 queries during stream | 71.7s, 62.39 million upds/sec (DSU not used, query latency ~6s, Flush ~5.4s) |
| Performing 20 queries during stream | 120.8s, 37.05 million upds/sec (DSU not used, query latency ~5.5s, Flush ~5s) |
| Performing 1000 queries during stream | NOT FINISHED, ~1 million upds/sec (DSU not used, query latency ~4.7s, Flush ~4.1s) |
| Proportion of Stream (x/20) | Metric | Kron17 - Zero Queries During Stream | Kron17 - Two Queries | Kron17 - Ten Queries | Kron17 - Twenty Queries |
|---|---|---|---|---|---|
| 2/20 | Query Latency (s) | N/A | N/A | 6.441 | 6.104 |
| 2/20 | Flush Latency (s) | N/A | N/A | 5.956 | 5.466 |
| 2/20 | CC Alg Latency (s) | N/A | N/A | 0.484 | 0.637 |
| 4/20 | Query Latency (s) | N/A | N/A | 6.230 | 5.593 |
| 4/20 | Flush Latency (s) | N/A | N/A | 5.517 | 5.035 |
| 4/20 | CC Alg Latency (s) | N/A | N/A | 0.712 | 0.557 |
| 6/20 | Query Latency (s) | N/A | N/A | 5.860 | 5.3222 |
| 6/20 | Flush Latency (s) | N/A | N/A | 5.536 | 4.98861 |
| 6/20 | CC Alg Latency (s) | N/A | N/A | 0.324 | 0.333403 |
| 8/20 | Query Latency (s) | N/A | N/A | 5.846 | 5.37989 |
| 8/20 | Flush Latency (s) | N/A | N/A | 5.552 | 5.06548 |
| 8/20 | CC Alg Latency (s) | N/A | N/A | 0.293 | 0.314094 |
| 10/20 | Query Latency (s) | N/A | 8.858 | 5.813 | 5.41574 |
| 10/20 | Flush Latency (s) | N/A | 8.145 | 5.532 | 5.09462 |
| 10/20 | CC Alg Latency (s) | N/A | 0.707 | 0.280 | 0.320883 |
| 12/20 | Query Latency (s) | N/A | N/A | 5.772 | 5.42054 |
| 12/20 | Flush Latency (s) | N/A | N/A | 5.494 | 5.09636 |
| 12/20 | CC Alg Latency (s) | N/A | N/A | 0.277 | 0.32406 |
| 14/20 | Query Latency (s) | N/A | N/A | 5.755 | 5.26304 |
| 14/20 | Flush Latency (s) | N/A | N/A | 5.472 | 4.97346 |
| 14/20 | CC Alg Latency (s) | N/A | N/A | 0.282 | 0.289557 |
| 16/20 | Query Latency (s) | N/A | N/A | 5.828 | 5.33229 |
| 16/20 | Flush Latency (s) | N/A | N/A | 5.549 | 4.99865 |
| 16/20 | CC Alg Latency (s) | N/A | N/A | 0.279 | 0.333362 |
| 18/20 | Query Latency (s) | N/A | N/A | 5.725 | 5.31135 |
| 18/20 | Flush Latency (s) | N/A | N/A | 5.450 | 4.96498 |
| 18/20 | CC Alg Latency (s) | N/A | N/A | 0.275 | 0.346306 |
| 20/20 | Query Latency (s) | N/A | 8.949 | 5.693 | 5.36582 |
| 20/20 | Flush Latency (s) | N/A | 8.185 | 5.417 | 5.03582 |
| 20/20 | CC Alg Latency (s) | N/A | 0.759 | 0.276 | 0.329855 |
| Overall | Ingestion Rate (million upds/s) | 133.07 | 111.73 | 63.10 | 37.387 |
| Overall | Ingestion Time (s) | 33.6 | 40.5 | 70.9 | 119.6 |
| Final Query | Query Latency (s) | N/A | 0.378 | 0.386 | 0.354649 |
| Final Query | Flush Latency (s) | N/A | 0.006 | 0.006 | 0.00660373 |
| Final Query | CC Alg Latency (s) | 0.2436 | 0.372 | 0.380 | 0.348037 |
- Octopus Main: 1 c5n.9xlarge
  - 36 vCPUs (Intel(R) Xeon(R) Platinum 8124M @ 3.00 GHz)
  - 96 GiB of RAM
  - Network Bandwidth: 50 Gbits/s
- Octopus Workers: 22 c5.4xlarge
  - 16 vCPUs (Intel(R) Xeon(R) Platinum 8124M @ 3.00 GHz)
  - 32 GiB of RAM
  - Network Bandwidth: up to 10 Gbits/s
  - 80 GiB general purpose 2 (gp2) SSD rated at 240 IOPS
- Reads about 15 million graph updates/s from binary graph streams
20 inserter threads, preloading the file
| Worker Processes | Worker Machines | Insertions/sec (millions) | Marginal Rate Increase (millions/s per added machine) |
|---|---|---|---|
| 16 | 1 | 1.7323 | N/A |
| 32 | 2 | 3.5519 | 1.8196 |
| 48 | 3 | 5.3600 | 1.8081 |
| 64 | 4 | 7.1553 | 1.7953 |
| 80 | 5 | 8.9378 | 1.7825 |
| 96 | 6 | 10.675 | 1.7372 |
| 112 | 7 | 12.586 | 1.911 |
| 128 | 8 | 14.193 | 1.607 |
| 144 | 9 | 15.926 | 1.733 |
| 160 | 10 | 17.642 | 1.716 |
| 176 | 11 | 19.347 | 1.705 |
| 192 | 12 | 21.133 | 1.786 |
| 208 | 13 | 22.810 | 1.677 |
| 224 | 14 | 24.462 | 1.652 |
| 240 | 15 | 26.093 | 1.631 |
| 256 | 16 | 27.728 | 1.635 |
| 272 | 17 | 29.351 | 1.623 |
| 288 | 18 | 30.964 | 1.613 |
| 304 | 19 | 32.543 | 1.579 |
| 320 | 20 | 34.148 | 1.605 |
| 336 | 21 | 35.573 | 1.425 |
| 352 | 22 | 37.071 | 1.498 |
- 4 c6i.4xlarge EC2 instances, all in the same cluster placement group
  - 16 Intel Xeon Platinum vCPUs
  - 32 GiB of RAM
  - 80 GiB general purpose 2 (gp2) SSD rated at 240 IOPS
  - Reads about 15 million graph updates/s from binary graph streams
Used the command `sync; echo 3 | sudo tee -a /proc/sys/vm/drop_caches` to clear the file cache.
| machines, worker_proc | 1, 16 | 2, 16 | 4, 16 | 2, 32 | 4, 48 |
|---|---|---|---|---|---|
| ingestion (million/s) | 1.867 | 2.014 | 2.570 | 3.818 | 5.411 |
| CC algorithm time (s) | 0.43 | 0.14 | 0.14 | 0.14 | 0.14 |
| memory usage (main) | 7.70 GiB | 7.70 GiB | 7.70 GiB | 8.84 GiB | 9.8 GiB |
| memory usage (worker) | 112 MiB | 148 MiB | 148 MiB | 148 MiB | 138 MiB |
Used the command `cat /mnt/ssd1/kron_16_stream_binary > /dev/null` to pre-populate the file cache.
| machines, worker_proc | 1, 16 | 2, 16 | 4, 16 | 2, 32 | 4, 48 |
|---|---|---|---|---|---|
| ingestion (million/s) | 1.869 | 2.019 | 2.584 | 3.829 | 5.463 |
| CC algorithm time (s) | 0.45 | 0.14 | 0.14 | 0.14 | 0.14 |
4 machines, 48 workers, `num_batches=512`
| Kron17 dataset | cold cache | pre-pop |
|---|---|---|
| ingestion (million/s) | 5.550 | 5.593 |
| CC algorithm time (s) | 0.31 | 0.43 |
| memory usage (main) | 18.5 GiB | N/A |
| memory usage (worker) | 167 MiB | N/A |
Pre-populating has less effect than it otherwise might because we can't fit the entire file in RAM, much less the sketches and the file together.
- `ping` to measure round-trip time (RTT)
- `iperf` to measure network throughput (example invocation below)

To install iperf:
`sudo amazon-linux-extras install -y epel`
`sudo yum install -y iperf`
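For reference, a point-to-point run of the kind reported below can be reproduced with a pair of invocations along these lines (`<receiver-private-ip>` is a placeholder; the one-to-many and many-to-one workloads run several such clients concurrently):

```
# On the receiving instance: start an iperf server
iperf -s

# On the sending instance: run a 10-second TCP test against the receiver
iperf -c <receiver-private-ip> -t 10
```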
All instances are within a cluster placement group. One set of instances has 16 cores, 32 GiB of RAM, and network bandwidth of 12.5 Gbits/s; the other set has 32 cores, 64 GiB of RAM, and network bandwidth of 12 Gbits/s.
Ran ICMP ping requests to each worker to measure RTT: sent 5 packets to each worker and report the minimum and maximum latency here.
| EC2 RTT | EMR RTT |
|---|---|
| .096-.157 ms | .129-.158 ms |
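These RTT numbers can be reproduced with a plain ping from one instance to another (`<worker-private-ip>` is a placeholder):

```
# Send 5 ICMP echo requests; the summary line reports rtt min/avg/max/mdev in ms
ping -c 5 <worker-private-ip>
```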
Using default TCP options.
| Workload | EC2 Throughput (Gbits/sec) | EMR Throughput (Gbits/sec) |
|---|---|---|
| Point to Point | 9.03-9.09 | 4.97 |
| One to Many(3) | 12.39 (4.03, 4.23, 4.13) | 11.94 (4.00, 3.99, 3.95) |
| Many(3) to One | 12.38 (3.03, 6.36, 2.99) | 11.94 (4.04, 3.98, 3.92) |
On EC2 the TCP window is 128 KB, whereas on EMR it varies between 426 KB (when acting as the server) and 1.43-1.87 MB. This could be a consequence of some of the distributed computing optimizations applied to EMR.
Maximum achievable throughput = TCP window size / latency (RTT) in seconds. Using the EC2 numbers above:
128 KB / 0.00013 s ≈ 7.88 Gbits/sec
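A quick sanity check of that arithmetic (treating 1 KB as 1000 bytes, which is what the 7.88 figure works out to):

```
# 128 KB window * 8 bits/byte, divided by a 0.13 ms RTT, reported in Gbits/s
python3 -c "print(128e3 * 8 / 0.13e-3 / 1e9)"   # ~7.88
```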
Octopus Main: c6i.12xlarge, 48 Intel Xeon Platinum vCPUs, 96 GiB of RAM, 18.75 Gbits/s
Octopus Workers: c5a.4xlarge, 16 AMD EPYC 7R32 vCPUs, 32 GiB of RAM, up to 10 Gbits/s
NOTE: These workers are probably still too beefy. We should also try lower-end CPUs.
Ran ICMP ping requests to each worker to measure RTT: sent 5 packets to each worker and report the minimum and maximum latency here.
Min (ms): 0.115, 0.103, 0.124, 0.094, 0.196, 0.117
Max (ms): 0.194, 0.199, 0.246, 0.249, 0.133, 0.228
Overall: 0.094 - 0.249 ms
Using default TCP options.
| Workload | EC2 Throughput (Gbits/sec) |
|---|---|
| Point to Point | 9.37-9.38 |
| One to Many(6) | 18.6 (2.37, 2.38, 2.35, 4.67, 2.09, 4.74) |
| Many(6) to One | 18.6 (1.23, 3.77, 3.78, 3.49, 3.85, 2.48) |
With this cluster of 4 c6g.8xlarge instances, each having 32 cores and 64 GiB of RAM.
Kron16 performance with 48 workers
| Kron16 dataset | cold cache | pre-pop |
|---|---|---|
| ingestion (million/s) | 3.672 | 3.566 |
| CC algorithm time (s) | 0.098 | 0.098 |
Performance seems to bottleneck on the main node, at least for the pre-populated run: many workers are waiting on the work queue. This makes sense, as our benchmarks indicate that these instance types do not push data through the StandAloneGutters very fast.