Commit 8eb2549

Synchronize SSH keys
Bobber relies on SSH keys that are baked into the images to enable multi-node communication. This forces users to build the image on one machine, save the image locally, copy it to all remote nodes, and load the copied image on those hosts. This process is long and tedious. Replacing it with a key synchronization method makes it possible to run the build on each host without copying images to remote nodes.

Signed-Off-By: Robert Clark <roclark@nvidia.com>
1 parent bd12a77 commit 8eb2549

6 files changed

Lines changed: 117 additions & 9 deletions

File tree

README.md

Lines changed: 8 additions & 9 deletions
@@ -105,9 +105,8 @@ pip3 install https://github.com/NVIDIA/Bobber/releases/download/v6.1.1/nvidia_bo
 
 ## Build Bobber container (includes OSU Tests, NCCL Tests, fio, mdtest, DALI RN50 Pipeline, and the base NGC TensorFlow container)
 The Bobber application includes a built-in mechanism to build the Docker
-container where all tests will be run. This command should be run on a single
-system in the cluster as it will be copied in a future step. For single-node
-tests, run the command on the node to be tested.
+container where all tests will be run. Run the following command on all nodes
+that will be tested.
 
 ```bash
 $ bobber build
@@ -123,13 +122,13 @@ $ docker images | grep nvidia/bobber
 nvidia/bobber 6.1.1 c697a75ee482 36 minutes ago 12.4GB
 ```
 
-## Save container
+## Synchronize container keys
 Bobber relies on shared SSH keys to communicate between containers via MPI. This
-is done by generating an SSH key in the image during build time and using that
-same container on all hosts. This requires saving the image to a local tarball
-and transferring the image to all other nodes. The `export` command saves the
-image as a local tarball. Run the command on the node from the previous step
-where the Docker image is located.
+is done by generating an SSH key on a single node and copying that key to all
+other containers in the cluster. Bobber includes a tool to automatically
+synchronize keys amongst all containers on all nodes. To make the process more
+seamless, it is highly recommended to create passwordless SSH keys to
+communicate between the hosts in order to automatically copy the keys.
 
 If running on a single node, this step is not required.
 

bobber/bin/sync-keys.sh

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+#!/bin/bash
+set -e
+# Pass the list of hosts in as a string
+hosts=$1
+# Optionally pass a username to login to remote nodes
+user=$2
+
+# Generate a new RSA key locally for SSH to share across the cluster
+mkdir -p /tmp/bobber
+rm -f /tmp/bobber/*
+ssh-keygen -t rsa -b 4096 -f /tmp/bobber/id_rsa -N ""
+
+echo "Copying keys to containers on all hosts"
+echo "For remote hosts, if passwordless-ssh is not configured, you will be prompted for the password for all nodes"
+
+if [[ "$hosts" == "localhost" || -z "$hosts" ]]; then
+  docker cp /tmp/bobber/id_rsa bobber:/root/.ssh/id_rsa
+  docker cp /tmp/bobber/id_rsa.pub bobber:/root/.ssh/authorized_keys
+fi
+
+# Copy the key to the container
+for host in ${hosts//,/ }; do
+  if [ ! -z "$user" ]; then
+    scp -r /tmp/bobber $user@$host:/tmp/
+    ssh $user@$host 'docker cp /tmp/bobber/id_rsa bobber:/root/.ssh/id_rsa && docker cp /tmp/bobber/id_rsa.pub bobber:/root/.ssh/authorized_keys && rm /tmp/bobber/id_rsa*'
+  else
+    scp -r /tmp/bobber $host:/tmp/
+    ssh $host 'docker cp /tmp/bobber/id_rsa bobber:/root/.ssh/id_rsa && docker cp /tmp/bobber/id_rsa.pub bobber:/root/.ssh/authorized_keys && rm /tmp/bobber/id_rsa*'
+  fi
+done
+
+# Cleanup the local key
+rm -f /tmp/bobber/id_rsa*
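
The host handling in this script can be restated in Python to make the intended cases explicit. This is only an illustrative sketch, not code from the commit: the helper names `split_hosts`, `ssh_target`, and `is_local_only` are hypothetical, and the hostnames and username in the usage lines are made up.

```python
from typing import List


def split_hosts(hosts: str) -> List[str]:
    """Split a comma-separated host string, dropping empty entries."""
    return [h.strip() for h in hosts.split(",") if h.strip()]


def ssh_target(host: str, user: str = "") -> str:
    """Return 'user@host' when a user is given, otherwise just 'host'."""
    return f"{user}@{host}" if user else host


def is_local_only(hosts: str) -> bool:
    """True when keys should only go to the container on the local node."""
    return hosts in ("", "localhost")


# Hypothetical usage: two remote hosts and an explicit login user.
targets = [ssh_target(h, "ubuntu") for h in split_hosts("host1,host2")]
```

Writing the local-only check as an explicit comparison keeps the single-node and multi-node paths easy to tell apart.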

bobber/bobber.py

Lines changed: 19 additions & 0 deletions
@@ -20,10 +20,12 @@
     RUN_STG_BW,
     RUN_STG_IOPS,
     RUN_STG_META,
+    SYNC,
     SYSTEMS
 )
 from bobber.lib.analysis import parse_results
 from bobber.lib.system.file_handler import create_directory
+from bobber.lib.system.shell import copy_keys
 from bobber.lib.tests import run_tests
 from typing import NoReturn
 
@@ -227,6 +229,21 @@ def parse_args(version: str) -> Namespace:
                                'binary')
     load.add_argument('filename', help='Filename of local *.tar file of '
                       'the image to load')
+
+    # Options specific to synchronizing SSH keys in containers
+    sync = commands.add_parser(SYNC, help='Create SSH keys and add them to all'
+                               ' Bobber containers in a cluster. Requires the '
+                               'container to be running on all nodes using '
+                               '"bobber cast".')
+    sync.add_argument('--hosts', help='A comma-separated list of hostnames or '
+                      'IP address of the nodes to add SSH keys to. Required '
+                      'for multi-node tests. If left empty, it is assumed '
+                      'that keys should only be copied to the container on '
+                      'the local node.', type=str, default='')
+    sync.add_argument('--user', help='Optionally specify a user to use to '
+                      'login to remote hosts to copy keys to containers. If '
+                      'left blank, will use the currently logged-in user.',
+                      type=str, default='')
     return parser.parse_args()
 
 
@@ -360,6 +377,8 @@ def execute_command(args: Namespace, version: str) -> NoReturn:
         bobber.lib.docker.cast(args.storage_path, args.ignore_gpu, version)
     elif args.command == LOAD:
         bobber.lib.docker.load(args.filename)
+    elif args.command == SYNC:
+        copy_keys(args.hosts, args.user)
     else:
         # Update the version to be used in filenames
         version_underscore = version.replace('.', '_')
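
To see how the new `sync` subcommand and its two options parse, the following standalone sketch rebuilds only that part of the parser with `argparse`. It is a simplified stand-in for `parse_args`, not the function from the commit: `build_parser` is a hypothetical name, the help text is shortened, and the hostnames and username are made-up examples.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a reduced parser that only knows the 'sync' subcommand."""
    parser = argparse.ArgumentParser(prog="bobber")
    commands = parser.add_subparsers(dest="command")
    sync = commands.add_parser("sync", help="Synchronize SSH keys")
    # Both options default to an empty string, as in the diff above.
    sync.add_argument("--hosts", type=str, default="")
    sync.add_argument("--user", type=str, default="")
    return parser


# Multi-node invocation with hypothetical hosts and user.
multi = build_parser().parse_args(
    ["sync", "--hosts", "host1,host2", "--user", "ubuntu"])

# Single-node invocation: both options fall back to empty strings.
single = build_parser().parse_args(["sync"])
```

With empty defaults, a bare `sync` invocation passes two empty strings on to `copy_keys`, which the help text describes as copying keys only to the container on the local node.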

bobber/lib/constants.py

Lines changed: 1 addition & 0 deletions
@@ -10,6 +10,7 @@
 RUN_STG_BW = 'run-stg-bw'
 RUN_STG_IOPS = 'run-stg-iops'
 RUN_STG_META = 'run-stg-meta'
+SYNC = 'sync'
 
 DGX_A100_SINGLE = {
     'gpus': 8,

bobber/lib/docker/management.py

Lines changed: 21 additions & 0 deletions
@@ -234,3 +234,24 @@ def execute(self, command: str, environment: Optional[dict] = None,
                 print(result.output)
             except StopIteration:
                 break
+
+    def running(self):
+        """
+        Determine if the Bobber container is running on the system.
+
+        Check to see if the current version of the Bobber container is running
+        on the local machine and return the status. This method can be used to
+        determine whether or not to run a command that depends on the container
+        being launched.
+
+        Returns
+        -------
+        boolean
+            Returns `True` when the container is running and `False` when not.
+        """
+        try:
+            bobber = self.client.containers.get('bobber')
+        except docker.errors.NotFound:
+            return False
+        else:
+            return True
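
One thing worth noting for review: in the Docker SDK for Python, `containers.get` looks a container up by name or ID regardless of its state, so the check above reports whether a container named `bobber` exists rather than whether it is currently running. A stricter variant could also inspect the container's `status` attribute. The sketch below is an assumption-laden illustration, not part of the commit: `container_is_running` and its `not_found` parameter are hypothetical, and the exception type is passed in so the function can be exercised without a Docker daemon.

```python
def container_is_running(client, name="bobber", not_found=LookupError):
    """Return True only if the named container exists and is running.

    `client` is expected to behave like a Docker SDK client, i.e. to offer
    `client.containers.get(name)`. With the real SDK, pass
    `not_found=docker.errors.NotFound`.
    """
    try:
        container = client.containers.get(name)
    except not_found:
        # No container with this name exists at all.
        return False
    # A stopped container still exists, so check its reported state.
    return container.status == "running"
```

With the real SDK this would be called as `container_is_running(docker.from_env(), not_found=docker.errors.NotFound)`.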

bobber/lib/system/shell.py

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
+# SPDX-License-Identifier: MIT
+import subprocess
+import sys
+from bobber.lib.docker import manager
+from typing import NoReturn
+
+
+def copy_keys(hosts: str, user: str) -> NoReturn:
+    """
+    Generate and copy SSH keys to all hosts.
+
+    Launch a shell script included with the package which generates a local SSH
+    key that is copied to all Bobber containers on all nodes to allow
+    passwordless communication for MPI.
+
+    Parameters
+    ----------
+    hosts : string
+        A comma-separated list as a ``string`` representing all hosts, such as
+        'host1,host2,host3,...'.
+    user : string
+        A ``string`` of the user to use to login to remote hosts as, if
+        necessary.
+    """
+    if not manager.running():
+        print('Bobber container is not running. Please ensure Bobber is '
+              'running on all nodes using the "bobber cast" command before '
+              'running "bobber sync".')
+        sys.exit(-1)
+    try:
+        subprocess.run(['bobber/bin/sync-keys.sh', hosts, user], check=True)
+    except subprocess.CalledProcessError:
+        print('Error synchronizing keys. See output from the sync script '
+              'above.')
+        sys.exit(-1)
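
The script path `'bobber/bin/sync-keys.sh'` is relative, so `subprocess.run` resolves it against the current working directory. One possible way to make the call independent of where the command is launched is to build the path from the package directory. The sketch below is illustrative only: `sync_command` and `package_root` are hypothetical names, and it assumes the script is installed alongside the package, which this diff does not confirm.

```python
from pathlib import Path
from typing import List


def sync_command(package_root: Path, hosts: str, user: str) -> List[str]:
    """Build the argument list for the key synchronization script.

    `package_root` is assumed to be the directory that contains `bin/`.
    From inside bobber/lib/system/shell.py it could be derived as
    `Path(__file__).resolve().parents[2]`.
    """
    script = package_root / "bin" / "sync-keys.sh"
    return [str(script), hosts, user]


# Hypothetical install location and hosts, for illustration only.
command = sync_command(Path("/opt/bobber"), "host1,host2", "")
```

The resulting list can be passed to `subprocess.run(..., check=True)` in the same way as the call in the diff.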
