feat: support for Flux Framework as HPC manager #3064
base: master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: (none yet). The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files; approvers can indicate their approval by writing `/approve` in a comment.
Pull request overview
This pull request adds support for Flux Framework as an HPC workload manager plugin for the Kubeflow Trainer. Flux Framework provides sophisticated resource management, supports multiple MPI variants, and enables distributed HPC workloads in Kubernetes environments.
Key Changes
- Implements a new Flux plugin that integrates with the Kubeflow Trainer runtime framework
- Adds automatic Flux installation via init containers and configuration management through ConfigMaps and Secrets (see the illustrative sketch after this list)
- Provides support for both batch execution and interactive HPC cluster modes
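To make the init-container and ConfigMap wiring above concrete, here is a minimal Go sketch of how a plugin *could* inject a Flux bootstrap step into a pod spec. It is not the code from this PR: the container, volume, and image names and the mount paths are all hypothetical placeholders.

```go
// Illustrative sketch only; names, image, and paths are hypothetical.
package fluxsketch

import corev1 "k8s.io/api/core/v1"

// addFluxBootstrap appends an init container that copies a prebuilt Flux tree
// into a shared emptyDir volume, and mounts a ConfigMap with the broker
// configuration into every trainer container.
func addFluxBootstrap(pod *corev1.PodSpec, brokerConfigMap string) {
	pod.Volumes = append(pod.Volumes,
		// Shared volume that will hold the Flux installation.
		corev1.Volume{
			Name:         "flux-install", // hypothetical name
			VolumeSource: corev1.VolumeSource{EmptyDir: &corev1.EmptyDirVolumeSource{}},
		},
		// Broker configuration rendered into a ConfigMap by the plugin.
		corev1.Volume{
			Name: "flux-config", // hypothetical name
			VolumeSource: corev1.VolumeSource{
				ConfigMap: &corev1.ConfigMapVolumeSource{
					LocalObjectReference: corev1.LocalObjectReference{Name: brokerConfigMap},
				},
			},
		},
	)

	// Init container that stages Flux into the shared volume.
	pod.InitContainers = append(pod.InitContainers, corev1.Container{
		Name:    "flux-install",                 // hypothetical
		Image:   "example.com/flux-view:latest", // placeholder, not the real image
		Command: []string{"/bin/sh", "-c", "cp -R /opt/flux/. /mnt/flux/"},
		VolumeMounts: []corev1.VolumeMount{
			{Name: "flux-install", MountPath: "/mnt/flux"},
		},
	})

	// Give every trainer container access to the Flux install and broker config.
	for i := range pod.Containers {
		pod.Containers[i].VolumeMounts = append(pod.Containers[i].VolumeMounts,
			corev1.VolumeMount{Name: "flux-install", MountPath: "/mnt/flux"},
			corev1.VolumeMount{Name: "flux-config", MountPath: "/etc/flux/config"},
		)
	}
}
```

The same pattern would extend to Secrets (for example the generated curve certificate mentioned in the file summary below) by using a `corev1.SecretVolumeSource` instead of a ConfigMap.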
Reviewed changes
Copilot reviewed 33 out of 33 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| pkg/runtime/framework/plugins/flux/*.go | Core plugin implementation including broker configuration, curve certificate generation, hostlist management, and command extraction |
| pkg/runtime/framework/plugins/flux/*_test.go | Comprehensive test coverage for plugin functionality |
| pkg/runtime/framework/plugins/registry.go | Registers the Flux plugin in the framework |
| pkg/runtime/runtime.go | Extends RuntimePolicy to include FluxPolicySource |
| pkg/apis/trainer/v1alpha1/trainingruntime_types.go | Adds FluxMLPolicySource type definition with numProcPerNode parameter |
| pkg/apis/trainer/v1alpha1/zz_generated.* | Generated code for deepcopy, openapi specs, and API types |
| pkg/client/applyconfiguration/**/*.go | Generated apply configurations for Flux types |
| manifests/base/crds/*.yaml | Updated CRDs to include Flux policy configuration |
| charts/kubeflow-trainer/crds/*.yaml | Updated Helm chart CRDs |
| examples/flux/*.yaml | Example runtime and TrainJob configurations demonstrating LAMMPS workload |
| examples/flux/README.md | Comprehensive documentation for using the Flux plugin |
| api/python_api/**/*.py | Python API updates to support Flux policy types |
| api/openapi-spec/swagger.json | OpenAPI specification updates |
| build.sh | Development helper script (should be removed per PR description) |
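For orientation, the rows above for `trainingruntime_types.go` and `runtime.go` amount to a new policy source roughly like the sketch below. This is a reading of the table, not the PR's actual code: the exact field type, JSON tags, and validation markers may differ (the `*int32` type for `numProcPerNode` in particular is an assumption).

```go
// Sketch of the new API surface described above; see
// pkg/apis/trainer/v1alpha1/trainingruntime_types.go in the PR for the real definition.
package v1alpha1

// FluxMLPolicySource configures runtimes that bootstrap distributed workloads
// with a Flux broker.
type FluxMLPolicySource struct {
	// NumProcPerNode is the number of processes (tasks) Flux launches per node.
	// +optional
	NumProcPerNode *int32 `json:"numProcPerNode,omitempty"` // assumed type
}

// MLPolicySource gains a Flux member alongside the existing policy sources
// (existing fields elided here for brevity).
type MLPolicySource struct {
	// +optional
	Flux *FluxMLPolicySource `json:"flux,omitempty"`
}
```

On the runtime side, `RuntimePolicy` extended with a `FluxPolicySource` (per the table) would let the registered plugin detect when a runtime requests Flux and apply the broker wiring.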
from __future__ import annotations
import pprint
import re  # noqa: F401
Copilot AI · Jan 4, 2026
Import of 're' is not used.
import re  # noqa: F401
from __future__ import annotations
import pprint
import re  # noqa: F401
Copilot AI · Jan 4, 2026
Import of 're' is not used.
import re  # noqa: F401
Pull Request Test Coverage Report for Build 20694722111
💛 - Coveralls
051af8c to 3b23be6
Flux supports the majority of MPI flavors/variants, and can be used to bootstrap MPI as a plugin. It adds other features for scheduling and topology that can be used for simulations and AI/ML jobs. This changeset adds the plugin implementation, including the plugin module, tests, and an example with a small README to serve as documentation for the time being.
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
This pull request adds Flux Framework as a plugin to the Kubeflow Trainer. 🌀
Overview
Flux supports the majority of MPI flavors/variants, and can be used to bootstrap MPI as a plugin. It adds other features for scheduling and topology that can be used for simulations and AI/ML jobs. This changeset adds the plugin implementation, including the plugin module, tests, and an example with a small README to serve as its documentation.
What this PR does / why we need it:
See https://github.com/kubeflow/trainer/tree/master/docs/proposals/2841-flux-hpc. To summarize, Flux Framework supports more MPI variants out of the box than the current MPI plugin. It brings more scheduling features, topology awareness, higher throughput, and dynamism / elasticity of the scheduler and jobs. See https://github.com/kubeflow/trainer/tree/master/docs/proposals/2841-flux-hpc#motivation. For full provenance / history, here is the initial discussion in the Kubeflow Trainer meeting.
Which issue(s) this PR fixes
Fixes #2841 (and note here, we should follow up with discussion on next steps for scoped issues)
Checklist:
@andreyvelich some notes for you.
- `build.sh` is not intended to be added. But I had a hard time figuring out "How do I develop this locally?" and that is the solution I came up with. I'm wondering if we can put that somewhere for future developers to make it easier.
- ApplyConfiguration. If I made a mistake in design or process please tell me directly, and give a pointer to a correct way to go about it.
- Here is the first completion of LAMMPS. When you remove the command it turns into an interactive minicluster (fairly simple / straightforward I think).
Thanks in advance for the review! I won't be able to finish the PR work tonight (figuring out the linting still) but I'll pick up tomorrow after some sleep. Really excited about this.
cc @milroy