Commit 022656a

feat: support for flux framework as hpc manager
Flux supports the majority of MPI flavors/variants and can be used to bootstrap MPI as a plugin. It also adds scheduling and topology features that can be used for simulations and AI/ML jobs. This changeset adds the plugin implementation, including the plugin module, tests, and an example with a small README that serves as documentation for the time being.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
1 parent 1fe3bd3 commit 022656a

32 files changed: 1921 additions and 9 deletions

api/openapi-spec/swagger.json

Lines changed: 30 additions & 0 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

api/python_api/kubeflow_trainer_api/models/__init__.py

Lines changed: 1 addition & 0 deletions

api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_flux_ml_policy_source.py

Lines changed: 91 additions & 0 deletions

api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_hpcml_policy_source.py

Lines changed: 91 additions & 0 deletions

api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_ml_policy.py

Lines changed: 7 additions & 1 deletion

api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_ml_policy_source.py

Lines changed: 7 additions & 1 deletion

build.sh

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+#!/bin/bash
+kind delete cluster
+kind create cluster
+make generate
+make manifests
+docker build -t ghcr.io/kubeflow/trainer/trainer-controller-manager -f ./cmd/trainer-controller-manager/Dockerfile .
+kind load docker-image ghcr.io/kubeflow/trainer/trainer-controller-manager
+kubectl apply --server-side -k ./manifests/overlays/manager
+sleep 20
+kubectl apply -f examples/flux/

charts/kubeflow-trainer/crds/trainer.kubeflow.org_clustertrainingruntimes.yaml

Lines changed: 17 additions & 0 deletions
@@ -47,6 +47,23 @@ spec:
                 description: mlPolicy provides the ML-specific parameters for the
                   model training.
                 properties:
+                  flux:
+                    description: flux policy source defines policy only for Flux
+                    properties:
+                      numProcPerNode:
+                        anyOf:
+                        - type: integer
+                        - type: string
+                        default: auto
+                        description: |-
+                          numProcPerNode is the number of processes per node.
+                          This is defined a level up on the MLPolicy directly.
+                        x-kubernetes-int-or-string: true
+                        x-kubernetes-validations:
+                        - message: NumProcPerNode must be equal to auto, cpu, gpu,
+                            or int value
+                          rule: self > 0 || self in ['auto', 'cpu', 'gpu']
+                    type: object
                   mpi:
                     description: mpi defines the configuration for the MPI Runtime.
                     properties:
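
Note on the numProcPerNode validation above: the rule accepts either a positive integer or one of the keywords auto, cpu, or gpu. The following is a minimal Python sketch of that intent, written only for illustration. The function name is_valid_num_proc_per_node is hypothetical and not part of this changeset, and actual CEL evaluation in the Kubernetes API server may treat mixed int-or-string values differently from this approximation.

```python
from typing import Union

# Keywords taken from the CRD validation message and rule.
ALLOWED_KEYWORDS = {"auto", "cpu", "gpu"}


def is_valid_num_proc_per_node(value: Union[int, str]) -> bool:
    """Approximate the CRD rule: a positive integer or an allowed keyword.

    Hypothetical helper for illustration; not the controller's code.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool):
        return False
    if isinstance(value, int):
        return value > 0
    if isinstance(value, str):
        return value in ALLOWED_KEYWORDS
    return False


if __name__ == "__main__":
    for candidate in (4, "auto", 0, "tpu"):
        print(candidate, is_valid_num_proc_per_node(candidate))
```

A check like this could run client-side before submitting a runtime, so that an invalid value is caught before the API server rejects it.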

charts/kubeflow-trainer/crds/trainer.kubeflow.org_trainingruntimes.yaml

Lines changed: 17 additions & 0 deletions
@@ -47,6 +47,23 @@ spec:
                 description: mlPolicy provides the ML-specific parameters for the
                   model training.
                 properties:
+                  flux:
+                    description: flux policy source defines policy only for Flux
+                    properties:
+                      numProcPerNode:
+                        anyOf:
+                        - type: integer
+                        - type: string
+                        default: auto
+                        description: |-
+                          numProcPerNode is the number of processes per node.
+                          This is defined a level up on the MLPolicy directly.
+                        x-kubernetes-int-or-string: true
+                        x-kubernetes-validations:
+                        - message: NumProcPerNode must be equal to auto, cpu, gpu,
+                            or int value
+                          rule: self > 0 || self in ['auto', 'cpu', 'gpu']
+                    type: object
                   mpi:
                     description: mpi defines the configuration for the MPI Runtime.
                     properties:
