@vsoch (Contributor) commented Jan 4, 2026

This pull request adds Flux Framework as a plugin to the Kubeflow Trainer. 🌀

Overview

Flux supports the majority of MPI flavors/variants and can be used to bootstrap MPI as a plugin. It also adds scheduling and topology features that can be used for simulations and AI/ML jobs. This changeset adds the plugin implementation, including the plugin module, tests, and an example with a small README to serve as documentation for the example.

What this PR does / why we need it:

See https://github.com/kubeflow/trainer/tree/master/docs/proposals/2841-flux-hpc. To summarize, Flux Framework supports more MPI variants out of the box than the current MPI plugin. It brings more scheduling features, topology awareness, higher throughput, and dynamism / elasticity of the scheduler and jobs. See https://github.com/kubeflow/trainer/tree/master/docs/proposals/2841-flux-hpc#motivation. For full provenance / history, here is the initial discussion in the Kubeflow Trainer meeting.

Which issue(s) this PR fixes
Fixes #2841 (and note here, we should follow up with discussion on next steps for scoped issues)

Checklist:

  • Docs included if any changes are user facing

@andreyvelich some notes for you.

  • I am going to need a pointer for the docs process. It looks like docs live separately in https://github.com/kubeflow/website? Let me know if you'd like the PR there before / after / at the same time as this one.
  • The file build.sh is not intended to be added. But I had a hard time figuring out "How do I develop this locally?" and that is the solution I came up with. I'm wondering if we can put that somewhere for future developers to make it easier.
  • This is my first time developing for Kubeflow, and using ApplyConfiguration. If I made a mistake in design or process please tell me directly, and give a pointer to a correct way to go about it.
  • I am exposing the variables we talked about (network, queue policy) as environment variables that can be defined in the training job. I think this is a simple and reasonable approach because it is flexible, but let me know if there is another idea for discussion.
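
As a rough illustration of the environment-variable approach in the last bullet, the sketch below reads optional settings from an environment mapping and falls back to defaults. The variable names `FLUX_NETWORK_DEVICE` and `FLUX_QUEUE_POLICY`, the default values, and the `FluxSettings` type are hypothetical placeholders, not the names the plugin uses. The plugin itself is written in Go; Python is used here only to keep the example short.

```python
import os
from dataclasses import dataclass

# Hypothetical variable names and defaults, for illustration only.
DEFAULTS = {
    "FLUX_NETWORK_DEVICE": "eth0",
    "FLUX_QUEUE_POLICY": "fcfs",
}


@dataclass(frozen=True)
class FluxSettings:
    network_device: str
    queue_policy: str


def settings_from_env(env=None):
    """Build settings from an environment mapping, falling back to defaults."""
    env = os.environ if env is None else env

    def lookup(name):
        # Treat unset and empty values the same way: use the default.
        return env.get(name) or DEFAULTS[name]

    return FluxSettings(
        network_device=lookup("FLUX_NETWORK_DEVICE"),
        queue_policy=lookup("FLUX_QUEUE_POLICY"),
    )
```

Passing the mapping explicitly (instead of always reading `os.environ`) keeps the function easy to test, and a job that sets nothing still gets a working configuration.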

Here is the first completion of LAMMPS. When you remove the command it turns into an interactive minicluster (fairly simple / straightforward I think).

(two images attached)

Thanks in advance for the review! I won't be able to finish the PR work tonight (figuring out the linting still) but I'll pick up tomorrow after some sleep. Really excited about this.

cc @milroy

Copilot AI review requested due to automatic review settings January 4, 2026 09:11
google-oss-prow bot requested a review from jinchihe January 4, 2026 09:11
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot AI left a comment

Pull request overview

This pull request adds support for Flux Framework as an HPC workload manager plugin for the Kubeflow Trainer. Flux Framework provides sophisticated resource management, supports multiple MPI variants, and enables distributed HPC workloads in Kubernetes environments.

Key Changes

  • Implements a new Flux plugin that integrates with the Kubeflow Trainer runtime framework
  • Adds automatic Flux installation via init containers and configuration management through ConfigMaps and Secrets
  • Provides support for both batch execution and interactive HPC cluster modes

Reviewed changes

Copilot reviewed 33 out of 33 changed files in this pull request and generated 14 comments.

| File | Description |
|---|---|
| pkg/runtime/framework/plugins/flux/*.go | Core plugin implementation including broker configuration, curve certificate generation, hostlist management, and command extraction |
| pkg/runtime/framework/plugins/flux/*_test.go | Comprehensive test coverage for plugin functionality |
| pkg/runtime/framework/plugins/registry.go | Registers the Flux plugin in the framework |
| pkg/runtime/runtime.go | Extends RuntimePolicy to include FluxPolicySource |
| pkg/apis/trainer/v1alpha1/trainingruntime_types.go | Adds FluxMLPolicySource type definition with numProcPerNode parameter |
| pkg/apis/trainer/v1alpha1/zz_generated.* | Generated code for deepcopy, openapi specs, and API types |
| pkg/client/applyconfiguration/**/*.go | Generated apply configurations for Flux types |
| manifests/base/crds/*.yaml | Updated CRDs to include Flux policy configuration |
| charts/kubeflow-trainer/crds/*.yaml | Updated Helm chart CRDs |
| examples/flux/*.yaml | Example runtime and TrainJob configurations demonstrating LAMMPS workload |
| examples/flux/README.md | Comprehensive documentation for using the Flux plugin |
| api/python_api/**/*.py | Python API updates to support Flux policy types |
| api/openapi-spec/swagger.json | OpenAPI specification updates |
| build.sh | Development helper script (should be removed per PR description) |
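
The summary mentions hostlist management. As a minimal sketch of what that can involve, the code below builds and expands a bracketed-range hostlist of the kind Flux accepts (for example `name-[0-3]`). The `prefix-index` naming scheme and both function names are assumptions made for illustration; this thread does not show how the plugin actually names pods or builds its hostlist.

```python
import re


def make_hostlist(prefix: str, count: int) -> str:
    """Compress hosts prefix-0 .. prefix-(count-1) into one hostlist string."""
    if count < 1:
        raise ValueError("count must be at least 1")
    if count == 1:
        return f"{prefix}-0"
    return f"{prefix}-[0-{count - 1}]"


def expand_hostlist(hostlist: str) -> list[str]:
    """Expand a single trailing [lo-hi] range; other strings pass through."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\]", hostlist)
    if match is None:
        return [hostlist]
    stem, low, high = match.group(1), int(match.group(2)), int(match.group(3))
    return [f"{stem}{index}" for index in range(low, high + 1)]
```

For instance, `make_hostlist("lammps-node", 4)` yields `lammps-node-[0-3]`, and `expand_hostlist` turns that back into the four individual host names. Only the single-range case is handled here; the full hostlist syntax also allows comma-separated lists and multiple ranges.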


from __future__ import annotations
import pprint
import re # noqa: F401

Copilot AI Jan 4, 2026

Import of 're' is not used.

Suggested change (delete this line):
import re # noqa: F401

Copilot uses AI. Check for mistakes.
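
For context, `F401` is the flake8/pyflakes code for "imported but unused", and the `# noqa: F401` marker silences it, which is why the linter stays quiet while the import is still flagged in review. The sketch below is a deliberately simplified detector built on the standard `ast` module. The function name `unused_imports` is hypothetical, and the check ignores things a real linter handles, such as `__all__`, re-exports, and names used only inside strings.

```python
import ast


def unused_imports(source: str) -> list[str]:
    """Return imported top-level names that never appear as a Name node."""
    tree = ast.parse(source)
    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # "import a.b" binds the name "a".
                imported.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module != "__future__":
            for alias in node.names:
                imported.add(alias.asname or alias.name)
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return sorted(imported - used)


SAMPLE = (
    "from __future__ import annotations\n"
    "import pprint\n"
    "import re  # noqa: F401\n"
    "\n"
    "print(pprint.pformat({'a': 1}))\n"
)
```

Running `unused_imports(SAMPLE)` reports `re` and nothing else, because `pprint` is referenced and `__future__` imports are skipped.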

@coveralls commented Jan 4, 2026

Pull Request Test Coverage Report for Build 20694722111

Details

  • 493 of 561 (87.88%) changed or added relevant lines in 6 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+6.9%) to 58.328%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/runtime/framework/plugins/registry.go | 0 | 1 | 0.0% |
| pkg/runtime/runtime.go | 0 | 1 | 0.0% |
| pkg/runtime/framework/plugins/flux/curve.go | 54 | 56 | 96.43% |
| pkg/runtime/framework/plugins/flux/flux.go | 412 | 476 | 86.55% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pkg/runtime/runtime.go | 1 | 57.69% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 20682374279: | 6.9% |
| Covered Lines: | 1730 |
| Relevant Lines: | 2966 |
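
The percentages in this report can be reproduced from the line counts. The short check below is a reader's sanity check, not part of the Coveralls output. The helper name `percent` is made up for the example, and the conclusion about the two unlisted files is an inference from the numbers rather than something the report states.

```python
def percent(covered: int, total: int, digits: int = 2) -> float:
    """Coverage as a percentage, rounded the way the report displays it."""
    return round(100.0 * covered / total, digits)


# (covered, changed/added) per file listed under "Changes Missing Coverage".
listed = {
    "registry.go": (0, 1),
    "runtime.go": (0, 1),
    "curve.go": (54, 56),
    "flux.go": (412, 476),
}

changed_pct = percent(493, 561)                   # changed or added lines
overall_pct = percent(1730, 2966, digits=3)       # whole repository
curve_pct = percent(*listed["curve.go"])
flux_pct = percent(*listed["flux.go"])

# Lines in the 6 changed files that are not in the table above.
other_covered = 493 - sum(c for c, _ in listed.values())
other_total = 561 - sum(t for _, t in listed.values())
```

The values come out to 87.88, 58.328, 96.43, and 86.55, matching the report. The remaining 27 changed lines are all covered, which is consistent with the other two of the six files having no missing coverage.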

💛 - Coveralls

@vsoch force-pushed the plugin/flux branch 3 times, most recently from 051af8c to 3b23be6 January 4, 2026 14:55
Flux supports the majority of MPI flavors/variants, and can be used
to bootstrap MPI as a plugin. It adds other features for scheduling
and topology that can be used for simulations and ai/ml jobs.
This changeset adds the plugin implementation, including the
plugin module, tests, and an example with a small README to
serve as documentation for the time being.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

feat: Flux Framework as a plugin for HPC and MPI bootstrap