KEP-897: First-Class MLflow Integration for Experiment Tracking in Kubeflow#892
mprahl wants to merge 4 commits into kubeflow:master
andreyvelich
left a comment
cc WGs to review
@kubeflow/wg-pipeline-leads @kubeflow/wg-data-leads @kubeflow/wg-automl-leads @kubeflow/wg-data-leads @kubeflow/kubeflow-steering-committee @kubeflow/wg-manifests-leads @kubeflow/wg-notebooks-leads
I'm closing this KEP because my team no longer has capacity to take this on. If others want to pursue this, feel free to fork the KEP and I'll be happy to review and advise. 😄

@mprahl may we keep it open for now? Just to have it tracked. The stalebot will close it anyway if there is no activity on this topic.

I agree with @juliusvonkohout! Maybe we should put out a call for contributors to help us add Experiment Tracking support via MLflow for Kubeflow sub-projects. cc @kubeflow/wg-training-leads @kubeflow/wg-pipeline-leads @kubeflow/kubeflow-steering-committee @kubeflow/wg-manifests-leads @kubeflow/wg-notebooks-leads @kubeflow/wg-data-leads @kubeflow/kubeflow-sdk-team @kubeflow/kubeflow-outreach-committee @jbottum

Rather than tying this strictly to MLflow as the implementation choice, I believe it would be very helpful to add an SPI (strongly inspired by MLflow's Experiment/Run model to begin with), so that other integrations in this area could be plugged in later. This is not to dispute MLflow's popularity, but in other community discussions alternatives also have their market share, so an SPI would also prepare the ground for additional contributors, in line with what Andrey just said. What would be the @kubeflow/kubeflow-steering-committee's point of view on this?

I fully agree. Designing an extensible architecture makes sense, since it will let us easily swap between experiment tracking solutions (e.g., MLflow, W&B, or even a custom option).
IMHO, in the short term this could be an SPI that maps 1:1 to the MLflow API, with the MLflow integration as its implementation.
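To make the SPI idea above concrete, here is a minimal sketch in Python, assuming an interface shaped after MLflow's experiment/run concepts. Every name in it (`ExperimentTracker`, `InMemoryTracker`, `record_losses`, and the method names) is hypothetical and does not come from the KEP or from MLflow; the in-memory backend only stands in for a real implementation such as an MLflow-backed one.

```python
import uuid
from dataclasses import dataclass, field
from typing import Protocol


class ExperimentTracker(Protocol):
    """Hypothetical SPI; the method names are assumptions, not an existing API."""

    def create_experiment(self, name: str) -> str: ...
    def start_run(self, experiment_id: str, run_name: str) -> str: ...
    def log_metric(self, run_id: str, key: str, value: float) -> None: ...
    def end_run(self, run_id: str) -> None: ...


@dataclass
class InMemoryTracker:
    """Toy backend used only to show that callers depend on the SPI alone."""

    experiments: dict = field(default_factory=dict)  # experiment id -> name
    runs: dict = field(default_factory=dict)         # run id -> run record

    def create_experiment(self, name: str) -> str:
        experiment_id = uuid.uuid4().hex
        self.experiments[experiment_id] = name
        return experiment_id

    def start_run(self, experiment_id: str, run_name: str) -> str:
        if experiment_id not in self.experiments:
            raise KeyError(f"unknown experiment: {experiment_id}")
        run_id = uuid.uuid4().hex
        self.runs[run_id] = {
            "experiment_id": experiment_id,
            "name": run_name,
            "metrics": {},
            "status": "RUNNING",
        }
        return run_id

    def log_metric(self, run_id: str, key: str, value: float) -> None:
        # Keep the full history of each metric in logging order.
        self.runs[run_id]["metrics"].setdefault(key, []).append(value)

    def end_run(self, run_id: str) -> None:
        self.runs[run_id]["status"] = "FINISHED"


def record_losses(tracker: ExperimentTracker, experiment: str, losses: list) -> str:
    """Caller code is written against the SPI, so the backend can be swapped."""
    experiment_id = tracker.create_experiment(experiment)
    run_id = tracker.start_run(experiment_id, "train")
    for loss in losses:
        tracker.log_metric(run_id, "loss", loss)
    tracker.end_run(run_id)
    return run_id
```

Under this sketch, an MLflow-backed class would implement the same four methods, and a W&B or custom backend could be added without touching `record_losses`.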

Experiment tracking depends heavily on a registry and a UI for visualizations and for tracking models, versions, and metrics. What are the thoughts on that when we talk about this SPI-based integration? Would the SPI let users capture data and then use the native tools they integrated with, for example the MLflow UI, separately? My next question is how we foresee bringing the champion model back into Kubeflow Model Registry for deployment or management, or whether we need to at all. For me, this also defines the scope of Model Registry activities going forward. Thoughts?

I approve this KEP, great work @mprahl. Collaborating with the MLflow community would be wonderful. 👏
andreyvelich
left a comment
Thanks a lot for this @mprahl, overall looks awesome.
Kubeflow users have wanted this capability since 2019 🚀
cc @kubeflow/kubeflow-trainer-team @akshaychitneni @nabuskey @kubeflow/wg-data-leads @kubeflow/kubeflow-sdk-team @kubeflow/kubeflow-kale-team @kubeflow/wg-notebooks-leads appreciate your review too!
### Donated Kubernetes Plugins

Kubeflow should accept donation of
+1 on this.
Shall we discuss maintainers?
Tighten the experiment tracking KEP around shared MLflow conventions, trusted-ingress authorization, and follow-up terminology decisions so reviewers can evaluate one consistent direction. Signed-off-by: mprahl <mprahl@users.noreply.github.com>
@mprahl Hello, our team is interested in contributing to the community and would like to work on this! cc @bobbravo2

Thanks! Specifically, this team is willing to contribute to the embedded UI work. 😄
Clarify namespace mapping, UI scope, and terminology alignment in the experiment tracking KEP while keeping the proposed MLflow deployment and auth model consistent. Signed-off-by: mprahl <mprahl@users.noreply.github.com>
## Alternatives

### Expand Model Registry

Kubeflow could continue with the earlier direction of enhancing Model Registry to become the experiment tracking backend
for the platform.
What is the role of Kubeflow Model Registry in the Kubeflow ecosystem, given the direction of this KEP? This may warrant a KEP of its own, but I want to mention it here and hear any high-level thoughts.
For example, does KF Model Registry continue to serve model lifecycle needs separately from MLflow's experiment tracking and model registry features? Or does MLflow's own model registry eventually replace KF Model Registry? Maybe the tooling and infrastructure around Kubeflow Model Registry and the Kubeflow SDK serve to integrate various components, extending MLflow's capability and usefulness within Kubeflow.
I think coexistence is the most pragmatic near-term path.
My current view is that MLflow can cover the experiment tracking experience and, if the admin wants it, the model registry experience, while Kubeflow Model Registry can continue to cover registry-oriented and deployment-oriented capabilities. I'd rather get community feedback on how Kubeflow is deployed with MLflow before making any further decisions, unless the Kubeflow Model Registry working group has a strong opinion.
For now, I could see us extracting pieces like the catalog experience and the KServe storage adapter into a more backend-agnostic repo so they can work with either MLflow or Kubeflow Model Registry. That would reduce coupling and let us evaluate, based on real usage, whether those capabilities should remain shared, move more toward MLflow, or stay primarily aligned with Model Registry.
@kubeflow/wg-data-leads do you have thoughts?
extracting pieces like the catalog experience and the KServe storage adapter into a more backend-agnostic repo so they can work with either MLflow or Kubeflow Model Registry
+1. Model Catalog, MCP Catalog, and Catalog capabilities in general are already isolated, so in my view that part is already largely covered, although the rename currently in progress will help:
- KEP-907: Renaming "Model Registry" to reflect Registry and Catalog use-cases #907 (review)
- KEP-0003: Technical implementation strategy for Kubeflow Hub rename hub#2239
What is not yet in progress, in my view:
- extracting the Storage (especially OCI ModelCar) and Signature capabilities into self-sufficient capabilities, possibly in the KF SDK
- making the async-upload job work not only with KF MR as the metadata store, but also with MLflow as the metadata store (of the Models)
One part that needs clarification from the authors of this KEP, @mprahl (IMHO): once MLflow is integrated as model metadata storage in Kubeflow, how would deployment tracking work in MLflow?
That would help assess how best to proceed with the Isvc reconciler capabilities of KF MR, which needs an indication of how to remap, on top of MLflow, the metadata currently mapped to KF MR.
whether those capabilities should remain shared, move more toward MLflow, or stay primarily aligned with Model Registry.
@mprahl with the two MR solutions, do you have any suggestions for pulling model metadata from MLflow to show in KF MR for a better user experience? If there is a plugin we can write, it may be worthwhile IMO.
@tarilabs @rareddy Good questions. I think this is out of scope for this KEP. The intent here is to define MLflow as the first-class experiment tracking integration for Kubeflow, including the shared platform contract around tenancy, auth, deployment, UI hand-off, and alignment with MLflow's GenAI direction.
My view is that Kubeflow Model Registry can continue to cover model registry and model deployment capabilities. This KEP does not currently propose deeper metadata or deployment-tracking integration between MLflow and Kubeflow Model Registry. If, after adoption, we see a strong need for that, I think it would be reasonable optional follow-up work for the Kubeflow Data WG to pursue. I'm happy to help and provide guidance there too.
Signed-off-by: mprahl <mprahl@users.noreply.github.com>
andreyvelich
left a comment
Thanks for the updates @mprahl!
Overall, +1 on this to move forward. Left a few small comments.
Kubeflow and OpenDataHub maintainers should agree on transferring the repository to Kubeflow community ownership, with
[`mprahl`](https://github.com/mprahl), [`HumairAK`](https://github.com/HumairAK), and any additional volunteers serving
as the initial maintainer group with clear release responsibilities.
Shall we say that WG Pipelines initially own this code?
We can add this repo to the WG assets: https://github.com/kubeflow/community/blob/master/wgs.yaml#L455-L465
- MLflow experiment: the shared grouping for related work across Kubeflow tools
- Kubeflow Pipelines pipeline run: one parent MLflow run, with nested MLflow runs for component tasks and loop
  iterations
- TrainJob or SparkApplication execution: one MLflow run for that execution
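The mapping in this list can be sketched as a small run tree. This is an illustration only, not the KEP's implementation: the `Run` class and `map_pipeline_run` function are hypothetical, and nesting loop iterations under their task's run (rather than directly under the pipeline run) is an assumption, since the list does not say which. In MLflow itself, child runs of this kind are created with `mlflow.start_run(nested=True)`.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Run:
    """Hypothetical stand-in for an MLflow run; not an MLflow class."""

    name: str
    # Excluded from repr/eq so parent <-> child links cannot recurse forever.
    parent: Optional["Run"] = field(default=None, repr=False, compare=False)
    children: list = field(default_factory=list)

    def child(self, name: str) -> "Run":
        """Create a nested run under this run and return it."""
        run = Run(name=name, parent=self)
        self.children.append(run)
        return run


def map_pipeline_run(pipeline_run: str, tasks: dict) -> Run:
    """Build one parent run with nested runs for tasks and loop iterations.

    `tasks` maps a task name to its number of loop iterations
    (0 means the task is not a loop).
    """
    parent = Run(pipeline_run)
    for task, iterations in tasks.items():
        task_run = parent.child(task)
        # Assumption: each loop iteration nests under its task's run.
        for index in range(iterations):
            task_run.child(f"{task}-iteration-{index}")
    return parent
```

A TrainJob or SparkApplication execution would be the degenerate case under this sketch: a single `Run` with no children.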
@mprahl @kramaranya I am also curious how we can map the MLflow Experiment concept when a TrainJob or OptimizationJob is submitted via KFP.
Not a blocker, we can discuss it later.
A concrete example of that ingress pattern looks like:

```yaml
apiVersion: security.istio.io/v1beta1
```
Does this introduce a dependency on Istio? I remember we discussed earlier that we would like to integrate with the Gateway API going forward.
As a member of the KSC, I vote in favor in general. The technical details and open questions are not concerning enough for me to hold the vote. We will find a way to integrate this at the platform level. I can help with the maintenance as well, since I need to deal with the integration into the Kubeflow platform anyway.
andreyvelich
left a comment
+1 for this, just left a few small comments @mprahl.

Looks good! Building a deep integration with MLflow for our go-to ML tracking and supporting a Helm chart would benefit the e2e story. I see this as a plugin or integration. Let's call out that Kubeflow (the tools/components) integrates with MLflow, and that MLflow is not a Kubeflow project. It's a dependency, and we are essentially saying "MLflow won" the OSS registry war here, and we want to provide that functionality to our community. Excited to see this in action and help folks build, deploy, and serve models in a more deterministic manner wherever they see fit. I vote yes!
GitHub issue: #897
Instead of building a new experiment tracking backend inside Kubeflow, the KEP proposes that Kubeflow deeply integrate with MLflow as a strong open-source option with an active community. The proposal focuses on making MLflow Kubernetes-native for Kubeflow through donation of the Kubernetes plugins, alignment with Kubeflow Profiles and multi-tenancy, a supported MLflow image and deployment path, and a UI strategy based on either launching out to MLflow or embedding it in the dashboard.