DistributedCC Experiments
- Octopus Main: 1 c5n.18xlarge
  - 72 vCPUs (Intel(R) Xeon(R) Platinum 8124M @ 3.00 GHz)
  - 192 GiB of RAM
  - Network Bandwidth: 100 Gbits/s
- Octopus Workers: 22 c5.4xlarge
  - 16 vCPUs (Intel(R) Xeon(R) Platinum 8124M @ 3.00 GHz)
  - 32 GiB of RAM
  - Network Bandwidth: up to 10 Gbits/s
Used 80 Octopus Workers for a total of 1280 worker threads
- Repeat the stream multiple times so that there are not so few updates between queries.
- Make the gutters smaller to reduce the number of updates that must be flushed? -- This could help if we are bottlenecked on the CacheTree.
- Measure in more detail where time is being spent (see the profiling sketch after this list). Are we bottlenecked on the CacheTree, (de)serialization, the WorkQueue, applying sketch deltas, context switching, or network latency?
- Another idea: run a test that inserts one update for each node in the graph and measure how long it takes to flush. If the flush latency is high, that likely rules out the CacheTree as the bottleneck.
- Run the same test with the single-machine version of the code and the same gutter size to compare? -- This rules out everything but network latency.
- One thing to note: with 1000 queries we only achieve about 1 million upds/s, which is worse than single-machine performance (suggesting the bottleneck is serialization or the network); a single-machine experiment should confirm this.
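One generic way to start on the "where is time being spent" question is to sample the main process with Linux perf while a flush or query is in progress. This is only a sketch: it captures on-CPU time (e.g. CacheTree operations, (de)serialization, applying deltas), so time spent blocked on the network would need a separate measurement, and the PID and duration below are placeholders.

```
# Sample call stacks of the Octopus Main process at 99 Hz for 30 seconds
sudo perf record -F 99 -g -p <octopus-main-pid> -- sleep 30

# Summarize which functions the samples landed in
sudo perf report
```

Repeating the same sampling on a worker process would show whether workers spend their time applying sketch deltas or mostly waiting on the WorkQueue.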
| Experiment | Result |
|---|---|
| Performing 0 queries during stream | 34.3s, 130 million upds/sec (DSU not used, query latency ~8.6s, Flush ~8.3s) |
| Performing 2 queries during stream | 39.9s, 112 million upds/sec (DSU not used, query latency ~9s, Flush ~8.1s) |
| Performing 10 queries during stream | 71.7s, 62.39 million upds/sec (DSU not used, query latency ~6s, Flush ~5.4s) |
| Performing 20 queries during stream | 120.8s, 37.05 million upds/sec (DSU not used, query latency ~5.5s, Flush ~5s) |
| Performing 1000 queries during stream | NOT FINISHED, ~1 million upds/sec (DSU not used, query latency ~4.7s, Flush ~4.1s) |
| Proportion of Stream (x/20) | Metric | Kron17 - Zero Queries During Stream | Kron17 - Two Queries | Kron17 - Ten Queries | Kron17 - Twenty Queries |
|---|---|---|---|---|---|
| 2/20 | Query Latency (s) | N/A | N/A | 6.441 | 6.104 |
| 2/20 | Flush Latency (s) | N/A | N/A | 5.956 | 5.466 |
| 2/20 | CC Alg Latency (s) | N/A | N/A | 0.484 | 0.637 |
| 4/20 | Query Latency (s) | N/A | N/A | 6.230 | 5.593 |
| 4/20 | Flush Latency (s) | N/A | N/A | 5.517 | 5.035 |
| 4/20 | CC Alg Latency (s) | N/A | N/A | 0.712 | 0.557 |
| 6/20 | Query Latency (s) | N/A | N/A | 5.860 | 5.3222 |
| 6/20 | Flush Latency (s) | N/A | N/A | 5.536 | 4.98861 |
| 6/20 | CC Alg Latency (s) | N/A | N/A | 0.324 | 0.333403 |
| 8/20 | Query Latency (s) | N/A | N/A | 5.846 | 5.37989 |
| 8/20 | Flush Latency (s) | N/A | N/A | 5.552 | 5.06548 |
| 8/20 | CC Alg Latency (s) | N/A | N/A | 0.293 | 0.314094 |
| 10/20 | Query Latency (s) | N/A | 8.858 | 5.813 | 5.41574 |
| 10/20 | Flush Latency (s) | N/A | 8.145 | 5.532 | 5.09462 |
| 10/20 | CC Alg Latency (s) | N/A | 0.707 | 0.280 | 0.320883 |
| 12/20 | Query Latency (s) | N/A | N/A | 5.772 | 5.42054 |
| 12/20 | Flush Latency (s) | N/A | N/A | 5.494 | 5.09636 |
| 12/20 | CC Alg Latency (s) | N/A | N/A | 0.277 | 0.32406 |
| 14/20 | Query Latency (s) | N/A | N/A | 5.755 | 5.26304 |
| 14/20 | Flush Latency (s) | N/A | N/A | 5.472 | 4.97346 |
| 14/20 | CC Alg Latency (s) | N/A | N/A | 0.282 | 0.289557 |
| 16/20 | Query Latency (s) | N/A | N/A | 5.828 | 5.33229 |
| 16/20 | Flush Latency (s) | N/A | N/A | 5.549 | 4.99865 |
| 16/20 | CC Alg Latency (s) | N/A | N/A | 0.279 | 0.333362 |
| 18/20 | Query Latency (s) | N/A | N/A | 5.725 | 5.31135 |
| 18/20 | Flush Latency (s) | N/A | N/A | 5.450 | 4.96498 |
| 18/20 | CC Alg Latency (s) | N/A | N/A | 0.275 | 0.346306 |
| 20/20 | Query Latency (s) | N/A | 8.949 | 5.693 | 5.36582 |
| 20/20 | Flush Latency (s) | N/A | 8.185 | 5.417 | 5.03582 |
| 20/20 | CC Alg Latency (s) | N/A | 0.759 | 0.276 | 0.329855 |
| Overall | Ingestion Rate (million upds/s) | 133.07 | 111.73 | 63.10 | 37.387 |
| Overall | Ingestion Time (s) | 33.6 | 40.5 | 70.9 | 119.6 |
| Final Query | Query Latency (s) | N/A | 0.378 | 0.386 | 0.354649 |
| Final Query | Flush Latency (s) | N/A | 0.006 | 0.006 | 0.00660373 |
| Final Query | CC Alg Latency (s) | 0.2436 | 0.372 | 0.380 | 0.348037 |
- Octopus Main: 1 c5n.9xlarge
  - 36 vCPUs (Intel(R) Xeon(R) Platinum 8124M @ 3.00 GHz)
  - 96 GiB of RAM
  - Network Bandwidth: 50 Gbits/s
- Octopus Workers: 22 c5.4xlarge
  - 16 vCPUs (Intel(R) Xeon(R) Platinum 8124M @ 3.00 GHz)
  - 32 GiB of RAM
  - Network Bandwidth: up to 10 Gbits/s
  - 80 GiB general purpose 2 (gp2) SSD rated at 240 IOPS
- Reads about 15 million graph updates/s from binary graph streams
20 inserter threads, preloading the file
| Worker Processes | Worker Machines | Insertions/sec (millions) | Marginal Rate Increase (millions/s per added machine) |
|---|---|---|---|
| 16 | 1 | 1.7323 | N/A |
| 32 | 2 | 3.5519 | 1.8196 |
| 48 | 3 | 5.3600 | 1.8081 |
| 64 | 4 | 7.1553 | 1.7953 |
| 80 | 5 | 8.9378 | 1.7825 |
| 96 | 6 | 10.675 | 1.7372 |
| 112 | 7 | 12.586 | 1.911 |
| 128 | 8 | 14.193 | 1.607 |
| 144 | 9 | 15.926 | 1.733 |
| 160 | 10 | 17.642 | 1.716 |
| 176 | 11 | 19.347 | 1.705 |
| 192 | 12 | 21.133 | 1.786 |
| 208 | 13 | 22.810 | 1.677 |
| 224 | 14 | 24.462 | 1.652 |
| 240 | 15 | 26.093 | 1.631 |
| 256 | 16 | 27.728 | 1.635 |
| 272 | 17 | 29.351 | 1.623 |
| 288 | 18 | 30.964 | 1.613 |
| 304 | 19 | 32.543 | 1.579 |
| 320 | 20 | 34.148 | 1.605 |
| 336 | 21 | 35.573 | 1.425 |
| 352 | 22 | 37.071 | 1.498 |
- 4 c6i.4xlarge EC2 instances, all in the same cluster placement group
  - 16 Intel Xeon Platinum vCPUs
  - 32 GiB of RAM
  - 80 GiB general purpose 2 (gp2) SSD rated at 240 IOPS
  - Reads about 15 million graph updates/s from binary graph streams
Used the command `sync; echo 3 | sudo tee -a /proc/sys/vm/drop_caches` to clear the file cache.
| machines, worker_proc | 1, 16 | 2, 16 | 4, 16 | 2, 32 | 4, 48 |
|---|---|---|---|---|---|
| ingestion (million/s) | 1.867 | 2.014 | 2.570 | 3.818 | 5.411 |
| CC algorithm time (s) | 0.43 | 0.14 | 0.14 | 0.14 | 0.14 |
| memory usage (main) | 7.70 GiB | 7.70 GiB | 7.70 GiB | 8.84 GiB | 9.8 GiB |
| memory usage (worker) | 112 MiB | 148 MiB | 148 MiB | 148 MiB | 138 MiB |
Used the command `cat /mnt/ssd1/kron_16_stream_binary > /dev/null` to pre-populate the file cache.
| machines, worker_proc | 1, 16 | 2, 16 | 4, 16 | 2, 32 | 4, 48 |
|---|---|---|---|---|---|
| ingestion (million/s) | 1.869 | 2.019 | 2.584 | 3.829 | 5.463 |
| CC algorithm time (s) | 0.45 | 0.14 | 0.14 | 0.14 | 0.14 |
4 machines, 48 workers, `num_batches=512`
| Kron17 dataset | cold cache | pre-pop |
|---|---|---|
| ingestion (million/s) | 5.550 | 5.593 |
| CC algorithm time (s) | 0.31 | 0.43 |
| memory usage (main) | 18.5 GiB | N/A |
| memory usage (worker) | 167 MiB | N/A |
Pre-populating has less effect than it otherwise might because we can't fit the entire file in RAM, much less the sketches and the file together.
- `ping` to measure round-trip time (RTT)
- `iperf` to measure network throughput (example invocation below)

To install iperf:
`sudo amazon-linux-extras install -y epel`
`sudo yum install -y iperf`
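For reference, a point-to-point run of the kind reported below can be reproduced with a pair of invocations along these lines (`<receiver-private-ip>` is a placeholder; the one-to-many and many-to-one workloads run several such clients concurrently):

```
# On the receiving instance: start an iperf server
iperf -s

# On the sending instance: run a 10-second TCP test against the receiver
iperf -c <receiver-private-ip> -t 10
```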
All instances are within a cluster placement group. One set of instances has 16 cores, 32 GiB of RAM, and network bandwidth of 12.5 Gbits/s; the other set has 32 cores, 64 GiB of RAM, and network bandwidth of 12 Gbits/s.
Ran ICMP ping requests to each worker to measure RTT: sent 5 packets to each worker and report the minimum and maximum latency here.
| EC2 RTT | EMR RTT |
|---|---|
| .096-.157 ms | .129-.158 ms |
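These RTT numbers can be reproduced with a plain ping from one instance to another (`<worker-private-ip>` is a placeholder):

```
# Send 5 ICMP echo requests; the summary line reports rtt min/avg/max/mdev in ms
ping -c 5 <worker-private-ip>
```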
Using default TCP options.
| Workload | EC2 Throughput (Gbits/sec) | EMR Throughput (Gbits/sec) |
|---|---|---|
| Point to Point | 9.03-9.09 | 4.97 |
| One to Many(3) | 12.39 (4.03, 4.23, 4.13) | 11.94 (4.00, 3.99, 3.95) |
| Many(3) to One | 12.38 (3.03, 6.36, 2.99) | 11.94 (4.04, 3.98, 3.92) |
On EC2 the TCP window is 128 KB, whereas on EMR it varies between 426 KB (when acting as the server) and 1.43-1.87 MB. This could be a consequence of some of the distributed computing optimizations applied to EMR.
Maximum achievable throughput = TCP window size / latency (RTT) in seconds. Using the EC2 numbers above:
128 KB / 0.00013 s ≈ 7.88 Gbits/sec
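A quick sanity check of that arithmetic (treating 1 KB as 1000 bytes, which is what the 7.88 figure works out to):

```
# 128 KB window * 8 bits/byte, divided by a 0.13 ms RTT, reported in Gbits/s
python3 -c "print(128e3 * 8 / 0.13e-3 / 1e9)"   # ~7.88
```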
Octopus Main: c6i.12xlarge, 48 Intel Xeon Platinum vCPUs, 96 GiB of RAM, 18.75 Gbits/s
Octopus Workers: c5a.4xlarge, 16 AMD EPYC 7R32 vCPUs, 32 GiB of RAM, up to 10 Gbits/s
NOTE: These workers are probably still too beefy. We should also try lower-end CPUs.
Ran ICMP ping requests to each worker to measure RTT: sent 5 packets to each worker and report the minimum and maximum latency here.
Min (ms): 0.115, 0.103, 0.124, 0.094, 0.196, 0.117
Max (ms): 0.194, 0.199, 0.246, 0.249, 0.133, 0.228
Overall: 0.094 - 0.249 ms
Using default TCP options.
| Workload | EC2 Throughput (Gbits/sec) |
|---|---|
| Point to Point | 9.37-9.38 |
| One to Many(6) | 18.6 (2.37, 2.38, 2.35, 4.67, 2.09, 4.74) |
| Many(6) to One | 18.6 (1.23, 3.77, 3.78, 3.49, 3.85, 2.48) |
With this cluster of 4 c6g.8xlarge instances, each having 32 cores and 64 GiB of RAM.
Kron16 performance with 48 workers
| Kron16 dataset | cold cache | pre-pop |
|---|---|---|
| ingestion (million/s) | 3.672 | 3.566 |
| CC algorithm time (s) | 0.098 | 0.098 |
Performance seems to bottleneck on the main node, at least for the pre-populated run: many workers are waiting on the work queue. This makes sense, as our benchmarks indicate that these instance types do not push data through the StandAloneGutters very fast.