
DistributedCC Experiments

Evan West edited this page Mar 28, 2022 · 16 revisions

DistributedStreamingCC Results: March 26th

Cluster Stats:

  • 4 c6i.4xlarge EC2 instances, all in the same cluster placement group
  • 16 Intel Xeon Platinum vCPUs per instance
  • 32 GiB of RAM per instance

EBS (Disk) Stats:

  • 80 GiB general purpose (gp2) SSD rated at 240 IOPS
  • Reads about 15 million graph updates/s from binary graph streams

DistributedStreamingCC: Kron16, cold file cache, WorkerCluster::num_batches=512

Used the command `sync; echo 3 | sudo tee -a /proc/sys/vm/drop_caches` to clear the file cache.

| machines, worker_proc  | 1, 16    | 2, 16    | 4, 16    | 2, 32    | 4, 48   |
|------------------------|----------|----------|----------|----------|---------|
| ingestion (million/s)  | 1.867    | 2.014    | 2.570    | 3.818    | 5.411   |
| CC algorithm time (s)  | 0.43     | 0.14     | 0.14     | 0.14     | 0.14    |
| memory usage (main)    | 7.70 GiB | 7.70 GiB | 7.70 GiB | 8.84 GiB | 9.8 GiB |
| memory usage (worker)  | 112 MiB  | 148 MiB  | 148 MiB  | 148 MiB  | 138 MiB |

DistributedStreamingCC: Kron16, pre-populated file cache, WorkerCluster::num_batches=512

Used the command `cat /mnt/ssd1/kron_16_stream_binary > /dev/null` to pre-populate the file cache.

| machines, worker_proc | 1, 16 | 2, 16 | 4, 16 | 2, 32 | 4, 48 |
|-----------------------|-------|-------|-------|-------|-------|
| ingestion (million/s) | 1.869 | 2.019 | 2.584 | 3.829 | 5.463 |
| CC algorithm time (s) | 0.45  | 0.14  | 0.14  | 0.14  | 0.14  |

Kron17 results

4 machines, 48 workers, num_batches=512

| Kron17 dataset        | cold cache | pre-pop |
|-----------------------|------------|---------|
| ingestion (million/s) | 5.550      | 5.593   |
| CC algorithm time (s) | 0.31       | 0.43    |
| memory usage (main)   | 18.5 GiB   | N/A     |
| memory usage (worker) | 167 MiB    | N/A     |

Pre-populating has less effect than it otherwise might because the entire file cannot fit in RAM, much less the file and the sketches together.

Networking Performance

Tools Used

  • `ping` to measure round trip time (RTT)
  • `iperf` to measure network throughput

To install iperf:

```
sudo amazon-linux-extras install -y epel
sudo yum install -y iperf
```
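The throughput numbers below were presumably collected with invocations along these lines; the hostnames and the 10-second duration are illustrative placeholders, not taken from the experiment logs:

```shell
# On each receiving node: start an iperf server (TCP, default port 5001)
iperf -s

# Point to point: a single client against one receiver
iperf -c ip-10-0-0-2 -t 10

# One to many (3): run three clients in parallel, one per receiver
for host in ip-10-0-0-2 ip-10-0-0-3 ip-10-0-0-4; do
    iperf -c "$host" -t 10 &
done
wait
```

The many-to-one case is the mirror image: one node runs `iperf -s` while three peers each run `iperf -c` against it concurrently.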

Cluster Setups Compared

1. EC2 c6i.4xlarge

All instances are within a cluster placement group.
These instances have 16 cores, 32 GiB of RAM, and network bandwidth of 12.5 Gbits/s.

2. EMR c6g.8xlarge

These instances have 32 cores, 64 GiB of RAM, and network bandwidth of 12 Gbits/s.

Latency

Ran an ICMP ping against each worker to measure RTT: 5 packets per worker, with the minimum and maximum latency reported below.

| EC2 RTT         | EMR RTT         |
|-----------------|-----------------|
| 0.096-0.157 ms  | 0.129-0.158 ms  |
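The min/max figures above can be pulled straight out of ping's summary line. A small sketch, assuming the iputils `ping` output format; the sample line below is illustrative, mirroring the EC2 numbers:

```shell
# Measure RTT with 5 ICMP echo requests (hostname is a placeholder):
#   ping -c 5 ip-10-0-0-2
# ping ends with a summary line like this; extract min and max RTT from it.
line="rtt min/avg/max/mdev = 0.096/0.118/0.157/0.020 ms"
min=$(echo "$line" | awk -F'[=/ ]+' '{print $6}')
max=$(echo "$line" | awk -F'[=/ ]+' '{print $8}')
echo "RTT range: $min-$max ms"
```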

Throughput

Using default TCP options.

| Workload        | EC2 Throughput (Gbits/sec) | EMR Throughput (Gbits/sec) |
|-----------------|----------------------------|----------------------------|
| Point to Point  | 9.03-9.09                  | 4.97                       |
| One to Many (3) | 12.39 (4.03, 4.23, 4.13)   | 11.94 (4.00, 3.99, 3.95)   |
| Many (3) to One | 12.38 (3.03, 6.36, 2.99)   | 11.94 (4.04, 3.98, 3.92)   |

Some differences in iperf output

On EC2 the TCP window is 128 KB, whereas on EMR it varies: 426 KB when acting as a server and 1.43-1.87 MB otherwise. This could be a consequence of some of the distributed computing optimizations applied to EMR.

Maximum achievable bandwidth (bits/s) = TCP window size / latency in seconds
128 KB / 0.00013 s ≈ 7.88 Gbits/sec
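As a sanity check, the window-limited estimate can be reproduced directly. This sketch assumes KB here means 1000 bytes (which is what makes the arithmetic land on 7.88) and uses the ~0.13 ms RTT from the latency table:

```shell
# Window-limited throughput = window size / RTT.
# 128 KB window (treated as 128,000 bytes = 1,024,000 bits), 0.13 ms RTT.
gbits=$(awk 'BEGIN { printf "%.2f", (128 * 1000 * 8) / 0.00013 / 1e9 }')
echo "Maximum achievable bandwidth: $gbits Gbits/sec"
```

This is close to the 9.03-9.09 Gbits/sec measured point to point on EC2, which suggests the default 128 KB window is near the limiting factor there.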

More about EMR Performance

This cluster consists of 4 c6g.8xlarge instances, each with 32 cores and 64 GiB of RAM.

Kron16 performance with 48 workers

| Kron16 dataset        | cold cache | pre-pop |
|-----------------------|------------|---------|
| ingestion (million/s) | 3.672      | 3.566   |
| CC algorithm time (s) | 0.098      | 0.098   |

Performance seems to bottleneck on the main node, at least for the pre-populated run, with many workers waiting on the work queue. This makes sense, as our benchmarks indicate that these instance types do not push data through the StandAloneGutters very quickly.
