
DistributedCC Experiments

Evan West edited this page Mar 28, 2022 · 16 revisions

DistributedStreamingCC Results: March 26th

Cluster Stats:

  • 4 c6i.4xlarge EC2 instances, all in the same cluster placement group
  • 16 Intel Xeon Platinum vCPUs per instance
  • 32 GiB of RAM per instance

EBS (Disk) Stats:

  • 80 GiB general purpose (gp2) SSD rated at 240 IOPS
  • Reads about 15 million graph updates/s from binary graph streams

DistributedStreamingCC: Kron16, cold file cache, WorkerCluster::num_batches=512

Used the command `sync; echo 3 | sudo tee -a /proc/sys/vm/drop_caches` to clear the file cache.

| machines, worker_proc  | 1, 16    | 2, 16    | 4, 16    | 2, 32    | 4, 48   |
|------------------------|----------|----------|----------|----------|---------|
| ingestion (million/s)  | 1.867    | 2.014    | 2.570    | 3.818    | 5.411   |
| CC algorithm time (s)  | 0.43     | 0.14     | 0.14     | 0.14     | 0.14    |
| memory usage (main)    | 7.70 GiB | 7.70 GiB | 7.70 GiB | 8.84 GiB | 9.8 GiB |
| memory usage (worker)  | 112 MiB  | 148 MiB  | 148 MiB  | 148 MiB  | 138 MiB |

DistributedStreamingCC: Kron16, pre-populated file cache, WorkerCluster::num_batches=512

Used the command `cat /mnt/ssd1/kron_16_stream_binary > /dev/null` to pre-populate the file cache.

| machines, worker_proc | 1, 16 | 2, 16 | 4, 16 | 2, 32 | 4, 48 |
|-----------------------|-------|-------|-------|-------|-------|
| ingestion (million/s) | 1.869 | 2.019 | 2.584 | 3.829 | 5.463 |
| CC algorithm time (s) | 0.45  | 0.14  | 0.14  | 0.14  | 0.14  |

Kron17 results

4 machines, 48 workers, num_batches=512

| Kron17 dataset        | cold cache | pre-pop |
|-----------------------|------------|---------|
| ingestion (million/s) | 5.550      | 5.593   |
| CC algorithm time (s) | 0.31       | 0.43    |
| memory usage (main)   | 18.5 GiB   | N/A     |
| memory usage (worker) | 167 MiB    | N/A     |

Pre-populating has less effect than it otherwise might because the entire file cannot fit in RAM, much less the file and the sketches together.

Networking Performance

Tools Used

  • `ping` to measure round trip time (RTT)
  • `iperf` to measure network throughput

To install iperf:

```
sudo amazon-linux-extras install -y epel
sudo yum install -y iperf
```
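The throughput numbers below were presumably collected with invocations along these lines; the hostnames and the 10-second duration are illustrative placeholders, not taken from the experiment logs:

```shell
# On each receiving node: start an iperf server (TCP, default port 5001)
iperf -s

# Point to point: a single client against one receiver
iperf -c ip-10-0-0-2 -t 10

# One to many (3): run three clients in parallel, one per receiver
for host in ip-10-0-0-2 ip-10-0-0-3 ip-10-0-0-4; do
    iperf -c "$host" -t 10 &
done
wait
```

The many-to-one case is the mirror image: one node runs `iperf -s` while three peers each run `iperf -c` against it concurrently.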

Cluster Setups Compared

1. EC2 c6i.4xlarge

All instances are within a cluster placement group.
These instances have 16 cores, 32 GiB of RAM, and network bandwidth of 12.5 Gbits/s.

2. EMR c6g.8xlarge

These instances have 32 cores, 64 GiB of RAM, and network bandwidth of 12 Gbits/s.

Latency

Ran an ICMP ping against each worker to measure RTT: 5 packets per worker, with the minimum and maximum latency reported below.

| EC2 RTT         | EMR RTT         |
|-----------------|-----------------|
| 0.096-0.157 ms  | 0.129-0.158 ms  |
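The min/max figures above can be pulled straight out of ping's summary line. A small sketch, assuming the iputils `ping` output format; the sample line below is illustrative, mirroring the EC2 numbers:

```shell
# Measure RTT with 5 ICMP echo requests (hostname is a placeholder):
#   ping -c 5 ip-10-0-0-2
# ping ends with a summary line like this; extract min and max RTT from it.
line="rtt min/avg/max/mdev = 0.096/0.118/0.157/0.020 ms"
min=$(echo "$line" | awk -F'[=/ ]+' '{print $6}')
max=$(echo "$line" | awk -F'[=/ ]+' '{print $8}')
echo "RTT range: $min-$max ms"
```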

Throughput

Using default TCP options.

| Workload        | EC2 Throughput (Gbits/sec) | EMR Throughput (Gbits/sec) |
|-----------------|----------------------------|----------------------------|
| Point to Point  | 9.03-9.09                  | 4.97                       |
| One to Many (3) | 12.39 (4.03, 4.23, 4.13)   | 11.94 (4.00, 3.99, 3.95)   |
| Many (3) to One | 12.38 (3.03, 6.36, 2.99)   | 11.94 (4.04, 3.98, 3.92)   |

Some differences in iperf output

On EC2 the TCP window is 128 KB, whereas on EMR it varies: 426 KB when acting as a server and 1.43-1.87 MB otherwise. This could be a consequence of some of the distributed computing optimizations applied to EMR.

Maximum achievable bandwidth (bits/s) = TCP window size / latency in seconds
128 KB / 0.00013 s ≈ 7.88 Gbits/sec
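As a sanity check, the window-limited estimate can be reproduced directly. This sketch assumes KB here means 1000 bytes (which is what makes the arithmetic land on 7.88) and uses the ~0.13 ms RTT from the latency table:

```shell
# Window-limited throughput = window size / RTT.
# 128 KB window (treated as 128,000 bytes = 1,024,000 bits), 0.13 ms RTT.
gbits=$(awk 'BEGIN { printf "%.2f", (128 * 1000 * 8) / 0.00013 / 1e9 }')
echo "Maximum achievable bandwidth: $gbits Gbits/sec"
```

This is close to the 9.03-9.09 Gbits/sec measured point to point on EC2, which suggests the default 128 KB window is near the limiting factor there.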

More about EMR Performance

This cluster consists of 4 c6g.8xlarge instances, each with 32 cores and 64 GiB of RAM.

Kron16 performance with 48 workers

| Kron16 dataset        | cold cache | pre-pop |
|-----------------------|------------|---------|
| ingestion (million/s) | 3.672      | 3.566   |
| CC algorithm time (s) | 0.098      | 0.098   |

Performance seems to bottleneck on the main node, at least for the pre-populated run, with many workers waiting on the work queue. This makes sense, as our benchmarks indicate that these instance types do not push data through the StandAloneGutters very quickly.
