
KEP-897: First-Class MLflow Integration for Experiment Tracking in Kubeflow #892

Open
mprahl wants to merge 4 commits into kubeflow:master from mprahl:experiment-tracking

Conversation

@mprahl
Contributor

@mprahl mprahl commented Aug 1, 2025

GitHub issue: #897

Instead of building a new experiment tracking backend inside Kubeflow, the KEP proposes that Kubeflow deeply integrate with MLflow as a strong open-source option with an active community. The proposal focuses on making MLflow Kubernetes-native for Kubeflow through donation of the Kubernetes plugins, alignment with Kubeflow Profiles and multi-tenancy, a supported MLflow image and deployment path, and a UI strategy based on either launching out to MLflow or embedding it in the dashboard.

@google-oss-prow google-oss-prow Bot requested a review from johnugeorge August 1, 2025 19:13
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign juliusvonkohout for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@andreyvelich
Member

andreyvelich commented Aug 8, 2025

Thank you for driving this @mprahl! Could you please create a tracking issue under kubeflow/community so you can get the KEP number?

It would also be good to mention the history, as I noted here: #783

As previously discussed in

Member

@andreyvelich andreyvelich left a comment

cc WGs to review
@kubeflow/wg-pipeline-leads @kubeflow/wg-data-leads @kubeflow/wg-automl-leads @kubeflow/wg-data-leads @kubeflow/kubeflow-steering-committee @kubeflow/wg-manifests-leads @kubeflow/wg-notebooks-leads

@mprahl mprahl force-pushed the experiment-tracking branch from 1b99df6 to 56dd509 Compare August 12, 2025 17:05
@mprahl mprahl changed the title from "KEP: Propose centralized experiment tracking in Kubeflow" to "KEP-897: Propose centralized experiment tracking in Kubeflow" Aug 12, 2025
@mprahl mprahl force-pushed the experiment-tracking branch from 56dd509 to 52bd338 Compare August 12, 2025 17:48
@mprahl mprahl requested a review from andreyvelich August 12, 2025 17:48
@mprahl
Contributor Author

mprahl commented Sep 5, 2025

I'm closing this KEP because my team no longer has capacity to take this on. If others want to pursue this, feel free to fork the KEP and I'll be happy to review and advise. 😄

@juliusvonkohout
Member

@mprahl may we keep it open for now? Just to have it tracked.

The stalebot will close it anyway if there is no activity on this topic

@andreyvelich
Member

I agree with @juliusvonkohout!

Maybe we should put out a call for contributors to help us add Experiment Tracking support via MLflow for Kubeflow sub-projects.
This feels like a really important capability that many of our users are asking for, and moving it forward would have a big impact on usability and Kubeflow adoption.

cc @kubeflow/wg-training-leads @kubeflow/wg-pipeline-leads @kubeflow/kubeflow-steering-committee @kubeflow/wg-manifests-leads @kubeflow/wg-notebooks-leads @kubeflow/wg-data-leads @kubeflow/kubeflow-sdk-team @kubeflow/kubeflow-outreach-committee @jbottum

@tarilabs
Member

Rather than tying this strictly to MLflow as the implementation choice, I believe it would be very helpful to add an SPI (strongly inspired by MLflow's Experiment/Run concepts to begin with) so that other integrations in this area can be added later.

This is not to dispute MLflow's popularity, but in other community discussions, alternatives also have their market share. An SPI would also prepare the ground for additional contributors, in line with what Andrey just said.

What would be the @kubeflow/kubeflow-steering-committee pov on this?

@andreyvelich
Member

I fully agree - designing an extensible architecture makes sense, since it will let us easily swap between experiment tracking solutions (e.g., MLflow, W&B, or even a custom option).
My only question is: in the short to medium term, what approach should we take to deliver the most value to users?

@tarilabs
Member

tarilabs commented Sep 26, 2025

> My only question is: in the short to medium term, what approach should we take to deliver the most value to users?

Very IMHO: in the short term, an SPI that maps 1:1 to the MLflow API (with the MLflow integration as its implementation).
I'm aware this is very limiting and naive, but at least it forces us to identify where the boundary for this integration lies. In turn, it should make it easier to "direct" contributors/GSoC students if they want to integrate W&B (found the #892 (comment) ! 😄 ) or another tracking system next.
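
A minimal sketch of what such an SPI boundary could look like, assuming a small Experiment/Run-shaped surface. The names `ExperimentTracker`, `InMemoryTracker`, `Run`, and `train` are hypothetical and come from neither the KEP nor MLflow; the in-memory backend only stands in for a real MLflow or W&B implementation.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Run:
    """One tracked execution, loosely modeled on the MLflow run concept."""
    run_id: str
    experiment: str
    params: dict[str, str] = field(default_factory=dict)
    metrics: dict[str, list[float]] = field(default_factory=dict)


class ExperimentTracker(Protocol):
    """Hypothetical SPI: the only surface Kubeflow components would code against."""

    def start_run(self, experiment: str) -> Run: ...
    def log_param(self, run: Run, key: str, value: str) -> None: ...
    def log_metric(self, run: Run, key: str, value: float) -> None: ...


class InMemoryTracker:
    """Illustrative backend; a real one would forward these calls to MLflow or W&B."""

    def __init__(self) -> None:
        self.runs: dict[str, Run] = {}

    def start_run(self, experiment: str) -> Run:
        run = Run(run_id=uuid.uuid4().hex, experiment=experiment)
        self.runs[run.run_id] = run
        return run

    def log_param(self, run: Run, key: str, value: str) -> None:
        run.params[key] = value

    def log_metric(self, run: Run, key: str, value: float) -> None:
        # Keep the full history so repeated values form a metric series.
        run.metrics.setdefault(key, []).append(value)


def train(tracker: ExperimentTracker) -> Run:
    """Caller depends only on the SPI, never on a specific backend."""
    run = tracker.start_run("demo-experiment")
    tracker.log_param(run, "lr", "0.01")
    for loss in (0.9, 0.5, 0.3):
        tracker.log_metric(run, "loss", loss)
    return run
```

Swapping the backend then means passing a different object that satisfies the same protocol, which is where the integration boundary mentioned above would sit.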

@rareddy
Contributor

rareddy commented Sep 26, 2025

Experiment tracking depends heavily on a registry and a UI to support visualizations and to track models, versions, and metrics. What are the thoughts on that when we talk about this SPI-based integration?

Suppose we say the SPI lets users capture data and then use the native tools they integrated with, for example the MLflow UI separately. My next question is: how do we foresee bringing the champion model back into Kubeflow Model Registry for deployment or management? Or do we need to? For me, this also defines the scope of Model Registry activities going forward. Thoughts?

@franciscojavierarceo
Contributor

I approve this KEP, great work @mprahl

Collaborating with the MLflow community would be wonderful. 👏

Member

@andreyvelich andreyvelich left a comment

Thanks a lot for this @mprahl, overall looks awesome.
Kubeflow users have wanted this capability since 2019 🚀

cc @kubeflow/kubeflow-trainer-team @akshaychitneni @nabuskey @kubeflow/wg-data-leads @kubeflow/kubeflow-sdk-team @kubeflow/kubeflow-kale-team @kubeflow/wg-notebooks-leads appreciate your review too!

Comment thread proposals/897-experiment-tracking/README.md Outdated

### Donated Kubernetes Plugins

Kubeflow should accept donation of
Member

+1 on this.
Shall we discuss maintainers?

Tighten the experiment tracking KEP around shared MLflow conventions,
trusted-ingress authorization, and follow-up terminology decisions so reviewers
can evaluate one consistent direction.

Signed-off-by: mprahl <mprahl@users.noreply.github.com>
@DaoDaoNoCode

@mprahl Hello, our team is interested in contributing to the community and would like to work on this! cc @bobbravo2

@mprahl
Contributor Author

mprahl commented Apr 10, 2026

> @mprahl Hello, our team is interested in contributing to the community and would like to work on this! cc @bobbravo2

Thanks! Specifically, this team is willing to contribute to the embedded UI work. 😄

@mprahl mprahl force-pushed the experiment-tracking branch from e1d6fe2 to 6bf21e8 Compare April 10, 2026 15:18
Clarify namespace mapping, UI scope, and terminology alignment in the experiment tracking KEP while keeping the proposed MLflow deployment and auth model consistent.

Signed-off-by: mprahl <mprahl@users.noreply.github.com>
Comment on lines +453 to +458
## Alternatives

### Expand Model Registry

Kubeflow could continue with the earlier direction of enhancing Model Registry to become the experiment tracking backend
for the platform.
Member

What is the role of Kubeflow Model Registry in the Kubeflow ecosystem, given the direction of this KEP? This may warrant a KEP of its own, but I want to mention it here and hear any high-level thoughts.

For example, does Kubeflow Model Registry continue to serve model lifecycle needs separately from MLflow's experiment tracking and model registry features? Or does MLflow's own model registry eventually replace Kubeflow Model Registry? Maybe the tooling/infra around Kubeflow Model Registry and the Kubeflow SDK serves to integrate various components, extending MLflow's capability and usefulness within Kubeflow.

Contributor Author

I think coexistence is the most pragmatic near-term path.

My current view is that MLflow can cover the experiment tracking experience, and optionally the model registry experience if the admin wants that, while Kubeflow Model Registry can continue to cover registry-oriented and deployment-oriented capabilities. I'd rather get community feedback on how Kubeflow is deployed with MLflow before making any further decisions unless the Kubeflow Model Registry working group has a strong opinion.

For now, I could see us extracting pieces like the catalog experience and the KServe storage adapter into a more backend-agnostic repo so they can work with either MLflow or Kubeflow Model Registry. That would reduce coupling and let us evaluate, based on real usage, whether those capabilities should remain shared, move more toward MLflow, or stay primarily aligned with Model Registry.

@kubeflow/wg-data-leads do you have thoughts?

Member

> extracting pieces like the catalog experience and the KServe storage adapter into a more backend-agnostic repo so they can work with either MLflow or Kubeflow Model Registry

+1. Model Catalog, MCP Catalog, and Catalog capabilities in general are already isolated, so in my view that part is already more or less covered, although the rename currently in progress will help, following:

What's not yet in progress, in my view:

  • extracting the Storage (especially OCI ModelCar) and Signature capabilities into self-sufficient capabilities, possibly in the KF SDK, would be nice
  • making the async-upload job work not only with KF MR as the metadata store, but also with MLflow as the metadata store (of the Models)

Member

One part that needs clarification from the authors of this KEP, @mprahl (IMHO): once MLflow is integrated as model metadata storage in Kubeflow, how would deployment tracking work (in MLflow)?

That would allow us to assess how best to proceed with the Isvc reconciler capabilities of KF MR, which needs an indication of how to remap the metadata currently mapped to KF MR onto MLflow.

Contributor

> whether those capabilities should remain shared, move more toward MLflow, or stay primarily aligned with Model Registry.

@mprahl with the two MR solutions, I'm wondering if you have any suggestions for pulling model metadata from MLflow to show in KF MR for a better user experience. If there is a plugin we can write, it may be worthwhile IMO.

Contributor Author

@tarilabs @rareddy Good questions. I think this is out of scope for this KEP. The intent here is to define MLflow as the first-class experiment tracking integration for Kubeflow, including the shared platform contract around tenancy, auth, deployment, UI hand-off, and alignment with MLflow's GenAI direction.

My view is that Kubeflow Model Registry can continue to cover model registry and model deployment capabilities. This KEP does not currently propose deeper metadata or deployment-tracking integration between MLflow and Kubeflow Model Registry. If, after adoption, we see a strong need for that, I think it would be reasonable optional follow-up work for the Kubeflow Data WG to pursue. I'm happy to help and provide guidance there too.

Signed-off-by: mprahl <mprahl@users.noreply.github.com>
Member

@andreyvelich andreyvelich left a comment

Thanks for the updates @mprahl!
Overall, +1 on this to move forward. Left a few small comments.

Comment on lines +154 to +156
Kubeflow and OpenDataHub maintainers should agree on transferring the repository to Kubeflow community ownership, with
[`mprahl`](https://github.com/mprahl), [`HumairAK`](https://github.com/HumairAK), and any additional volunteers serving
as the initial maintainer group with clear release responsibilities.
Member

Shall we say that WG Pipelines initially own this code?
We can add this repo to the WG assets: https://github.com/kubeflow/community/blob/master/wgs.yaml#L455-L465

Comment on lines +218 to +221
- MLflow experiment: the shared grouping for related work across Kubeflow tools
- Kubeflow Pipelines pipeline run: one parent MLflow run, with nested MLflow runs for component tasks and loop
iterations
- TrainJob or SparkApplication execution: one MLflow run for that execution
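
The mapping in the list above can be sketched as a small run tree. This is only an illustration under stated assumptions: `TrackedRun`, `map_pipeline_run`, and `map_job` are hypothetical names, the model is plain Python rather than the MLflow client, and nesting loop iterations under their task (rather than directly under the parent) is an assumption the quoted text does not settle.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TrackedRun:
    """Hypothetical stand-in for an MLflow run; children model nested runs."""
    name: str
    children: list["TrackedRun"] = field(default_factory=list)


def map_pipeline_run(pipeline_run: str, tasks: dict[str, int]) -> TrackedRun:
    """Build one parent run with a nested run per task and per loop iteration.

    `tasks` maps a task name to its number of loop iterations (0 means no loop).
    Assumption: iterations are nested under the task that owns the loop.
    """
    parent = TrackedRun(name=pipeline_run)
    for task, iterations in tasks.items():
        task_run = TrackedRun(name=task)
        for i in range(iterations):
            task_run.children.append(TrackedRun(name=f"{task}-iteration-{i}"))
        parent.children.append(task_run)
    return parent


def map_job(job_name: str) -> TrackedRun:
    """A TrainJob or SparkApplication execution maps to a single run."""
    return TrackedRun(name=job_name)


tree = map_pipeline_run("pipeline-run-1", {"preprocess": 0, "train": 2})
print([child.name for child in tree.children])
```

With the MLflow Python client itself, the nested relationship is expressed by opening child runs with `mlflow.start_run(nested=True)` inside an active parent run.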
Member

@mprahl @kramaranya I am also curious how we can map the MLflow Experiment concept when a TrainJob or OptimizationJob is submitted via KFP?
Not a blocker, we can discuss it later.

A concrete example of that ingress pattern looks like:

```yaml
apiVersion: security.istio.io/v1beta1
```
Member

Do we have a dependency on Istio in that case? I remember we discussed before that we would like to integrate with Gateway API going forward.


Comment thread .prettierrc
@@ -0,0 +1,11 @@
{
Member

Remove this?

@juliusvonkohout
Member

juliusvonkohout commented Apr 24, 2026

As a member of the KSC, I vote in favor in general. The technical details and open questions are not concerning enough for me to delay the vote. We will find a way to integrate this at the platform level. I can help with maintenance as well, since I need to deal with the integration into the Kubeflow platform anyway.

Member

@andreyvelich andreyvelich left a comment

+1 for this, just left a few small comments @mprahl.

@chasecadet
Contributor

Looks good! Building a deep integration with MLflow for our go-to ML tracking and supporting a Helm chart would benefit the e2e story. I see this as a plugin or integration. Let's call out that Kubeflow (the tools/components) integrates with MLflow, and that MLflow is not a Kubeflow project. It's a dependency, and we are essentially saying "MLflow won" the OSS registry war here, and we want to provide that functionality to our community. Excited to see this in action and help folks build, deploy, and serve models in a more deterministic manner wherever they see fit. I vote yes!
