cloud-bulldozer/mlflow-scale

# MLflow Scale & Performance Testing Suite

🚀 Automated performance & scalability testing for MLflow with multi-tenant workspaces on OpenShift/Kubernetes

A comprehensive collection of scripts for running performance and scale tests against MLflow with the workspaces multi-tenancy feature. This project automates deploying MLflow and the test artifacts, prefilling test data, running a series of tests, collecting results, and producing a CSV summary along with charts.


## ✨ Features

- **Automated Test Suite** — Full test lifecycle management including setup, execution, and cleanup
- **Multi-Tenant Testing** — Validate MLflow performance across different tenant configurations
- **Database Backend Support** — Test with SQLite (default) or PostgreSQL backends
- **Prometheus Integration** — Automatic CPU/memory metrics collection from the cluster
- **Rich Visualizations** — Auto-generated charts for response times, throughput, and resource utilization
- **Realistic Workloads** — 80/20 read/write split simulating actual MLflow usage patterns

## 📋 Prerequisites

| Requirement | Description |
|---|---|
| `oc` | OpenShift CLI configured with cluster access |
| `jq` | JSON processor for parsing results |
| `curl` | HTTP client for Prometheus queries |
| `envsubst` | Environment variable substitution |
| `python3` | Python 3.x with `pandas` and `matplotlib` |

Install Python dependencies:

```bash
pip install -r scripts/requirements.txt
```
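Before running the suite, it can help to confirm the required CLIs are actually on your `PATH`. The helper below is a hypothetical convenience script, not part of this repository:

```shell
#!/usr/bin/env sh
# check_tools: report whether each named command is available on PATH.
# Hypothetical helper -- not shipped with this repository.
check_tools() {
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "OK      $tool"
    else
      echo "MISSING $tool"
    fi
  done
}

# Check the prerequisites listed above
check_tools oc jq curl envsubst python3
```

Any `MISSING` line should be resolved before `./scripts/run_suite.sh` is started.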

## 🚀 Quick Start

### 1. Deploy Dependencies

Infrastructure prerequisite: an OpenShift cluster with the OpenDataHub operator installed.

```bash
# Apply OpenDataHub manifests
oc apply -f manifests/DSCInitialization.yml
oc apply -f manifests/DataScienceCluster.yml
```

Install the mlflow-operator from its repository:

```bash
make deploy-to-platform IMG=quay.io/mlflow-operator/mlflow-operator:master PLATFORM=odh
```

### 2. Run the Test Suite

```bash
# Set required environment variables
export MLFLOW_URL="https://your-data-science-gateway.example.com/mlflow"
export MLFLOW_TOKEN="sha256~xxxxxxxxxxxx"

# Run the full test suite with the SQLite backend (default)
./scripts/run_suite.sh

# Or run with the PostgreSQL backend
DB_BACKEND=postgres ./scripts/run_suite.sh
```

### 3. View Results

Results are saved to `scripts/results/`:

```bash
ls scripts/results/
# summary_*.json       — Raw k6 test results
# metrics_*.csv        — Prometheus metrics per test
# report_summary.csv   — Consolidated CSV report
# chart_*.png          — Visualization charts
```
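Because the raw k6 summaries are plain JSON, `jq` can pull headline numbers out of them without regenerating the charts. The sketch below assumes the shape k6 produces with `--summary-export` (a top-level `metrics` object); field names may vary across k6 versions, so treat the structure here as illustrative:

```shell
# Build a tiny stand-in for a k6 summary export (structure assumed).
cat > /tmp/summary_example.json <<'EOF'
{
  "metrics": {
    "http_reqs": { "count": 1234, "rate": 4.1 },
    "http_req_duration": { "avg": 210.4, "p(95)": 480.2 }
  }
}
EOF

# Pull out the request count and p95 latency.
jq -r '.metrics | "requests=\(.http_reqs.count) p95_ms=\(."http_req_duration"."p(95)")"' \
  /tmp/summary_example.json
# requests=1234 p95_ms=480.2
```

The same one-liner applied across all `summary_*.json` files gives a quick sanity check before opening `report_summary.csv`.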

βš™οΈ Configuration

Environment Variables

Variable Default Description
RESULTS_DIR ./results Directory to store test results
NAMESPACE opendatahub Kubernetes namespace for k6 pod
K6_POD_NAME k6-benchmark Name for the k6 load generator pod
MLFLOW_NAMESPACE opendatahub Namespace where MLflow is running
MLFLOW_URL β€” MLflow server URL (required)
MLFLOW_TOKEN β€” MLflow authentication token (required)
TEST_DURATION 5m Duration for each test iteration
DB_BACKEND sqlite Database backend: sqlite or postgres
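`run_suite.sh` presumably applies these defaults with standard shell parameter expansion; the snippet below is a minimal illustration of that pattern (variable names from the table, defaulting and guard logic assumed, example values hypothetical):

```shell
# Example values; in real use these come from your environment.
export MLFLOW_URL="https://mlflow.example.com"
export MLFLOW_TOKEN="sha256~xxxxxxxxxxxx"

# Optional settings fall back to the documented defaults when unset.
DB_BACKEND="${DB_BACKEND:-sqlite}"
TEST_DURATION="${TEST_DURATION:-5m}"
NAMESPACE="${NAMESPACE:-opendatahub}"

# Required settings have no default: fail fast with a clear message.
: "${MLFLOW_URL:?MLFLOW_URL is required}"
: "${MLFLOW_TOKEN:?MLFLOW_TOKEN is required}"

echo "backend=${DB_BACKEND} duration=${TEST_DURATION} namespace=${NAMESPACE}"
```

Overriding any variable on the command line (as in `DB_BACKEND=postgres ./scripts/run_suite.sh`) simply wins over the `${VAR:-default}` fallback.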

### Database Backends

The test suite supports two database backends for MLflow:

#### SQLite (Default)

```bash
# Uses manifests/MLflow.yml with embedded SQLite
./scripts/run_suite.sh
```

#### PostgreSQL

```bash
# Deploys PostgreSQL from manifests/Postgres.yml and uses manifests/MLflow_Postgres.yml
DB_BACKEND=postgres ./scripts/run_suite.sh
```

When using PostgreSQL:

- A PostgreSQL deployment, service, and PVC are automatically created
- Between test runs, PostgreSQL is completely torn down (including the PVC) and redeployed for a clean state
- MLflow connects via `postgresql://postgres:postgres@postgres.opendatahub.svc:5432/mlflow`
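If you need pieces of that connection string elsewhere (for example, to point `psql` at the same database for debugging), plain shell parameter expansion is enough. A sketch, with the URI copied from above:

```shell
DB_URI="postgresql://postgres:postgres@postgres.opendatahub.svc:5432/mlflow"

# Strip the scheme and credentials, then split host:port from the database name.
hostport_db="${DB_URI#*://*@}"   # postgres.opendatahub.svc:5432/mlflow
DB_HOST="${hostport_db%%:*}"     # postgres.opendatahub.svc
port_db="${hostport_db#*:}"      # 5432/mlflow
DB_PORT="${port_db%%/*}"         # 5432
DB_NAME="${hostport_db#*/}"      # mlflow

echo "host=${DB_HOST} port=${DB_PORT} db=${DB_NAME}"
```

Nothing here is specific to this suite; it is just the POSIX `${var#pattern}`/`${var%pattern}` trimming idiom applied to the documented URI.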

### Test Matrix

The default test matrix can be modified in `run_suite.sh`:

```bash
TENANT_COUNTS=("1" "10" "100" "500")  # Number of tenants
CONCURRENCY_LEVELS=(5 10 20 50)       # Concurrency per test
TEST_DURATION="10m"                   # Duration per test
```
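With these arrays, the suite runs every tenant-count and concurrency combination. The expansion can be sketched as follows (arrays copied from the snippet above; the loop structure is an assumption about how `run_suite.sh` iterates, shown here only to make the test count explicit):

```shell
TENANT_COUNTS=("1" "10" "100" "500")
CONCURRENCY_LEVELS=(5 10 20 50)

total=0
for tenants in "${TENANT_COUNTS[@]}"; do
  for vus in "${CONCURRENCY_LEVELS[@]}"; do
    echo "test: tenants=${tenants} concurrency=${vus}"
    total=$((total + 1))
  done
done
echo "total tests: ${total}"  # 4 x 4 = 16 combinations
```

At the default 16 combinations and `TEST_DURATION="10m"`, a full matrix takes well over two and a half hours of pure test time, before setup and teardown.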

## 📊 Test Scenarios

The k6 test script (`mlflow_scale_test.js`) simulates realistic MLflow usage:

### Training Scenario (20% of total load)

Simulates ML training pipelines writing to MLflow:

| Operation | Description |
|---|---|
| `create_experiment` | Create a new experiment |
| `create_prompt` | Create 3 prompts per experiment |
| `create_prompt_version` | Create a version for each prompt |
| `create_run` | Start 3 runs per experiment |
| `log_metric` | Log 3 metrics per run |
| `log_parameter` | Log 5 parameters per run |
| `log_artifact` | Upload 2 artifacts (~10KB each) |
| `update_run_status` | Mark run as FINISHED |
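Assuming one API call per operation in the table, a single training iteration over one experiment works out to a fixed request count; a quick back-of-the-envelope check:

```shell
# Per training iteration: 1 experiment, 3 prompts (+1 version each),
# 3 runs, and per run: 3 metrics, 5 params, 2 artifacts, 1 status update.
calls=$((1 + 3 + 3 + 3 + 3*3 + 3*5 + 3*2 + 3))
echo "API calls per training iteration: ${calls}"  # 43
```

This is useful when sizing expectations: at concurrency 50, each training VU iteration contributes roughly 43 write-path requests to the totals in `summary_*.json`.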

### Browsing Scenario (80% of total load)

Simulates users browsing the MLflow UI:

| Operation | Description |
|---|---|
| `list_workspaces` | List available workspaces/tenants |
| `search_prompts` | Search prompts (up to 100 results) |
| `search_experiments` | List up to 25 experiments |
| `get_experiment` | Fetch experiment details |
| `search_runs` | Search runs in an experiment |
| `get_run` | Fetch individual run details |
| `list_artifacts` | List run artifacts |
| `fetch_artifact` | Download artifact content |

## 📈 Generated Charts

| Chart | Description |
|---|---|
| `chart_summary_dashboard.png` | Overview of throughput, requests, and failures |
| `chart_response_times_by_concurrency.png` | P95 latency vs concurrency |
| `chart_response_times_by_tenants.png` | P95 latency vs tenant count |
| `chart_throughput_heatmap.png` | Request rate heatmap (tenants × concurrency) |
| `chart_response_times_p95_heatmap.png` | P95 latency heatmap |
| `chart_passed_counts.png` | Successful operations by config |
| `chart_cpu_utilization.png` | CPU usage by component |
| `chart_memory_utilization.png` | Memory usage by component |
| `chart_mlflow_cpu_by_concurrency.png` | MLflow CPU vs concurrency |
| `chart_mlflow_cpu_by_tenants.png` | MLflow CPU vs tenant count |

πŸ“ Project Structure

mlflow-scale/
β”œβ”€β”€ manifests/                    # Kubernetes/OpenShift manifests
β”‚   β”œβ”€β”€ DataScienceCluster.yml    # OpenDataHub cluster config
β”‚   β”œβ”€β”€ DSCInitialization.yml     # DSC initialization
β”‚   β”œβ”€β”€ MLflow.yml                # MLflow CR (SQLite backend)
β”‚   β”œβ”€β”€ MLflow_Postgres.yml       # MLflow CR (PostgreSQL backend)
β”‚   └── Postgres.yml              # PostgreSQL deployment, service, and PVC
β”‚
β”œβ”€β”€ scripts/
β”‚   β”œβ”€β”€ run_suite.sh              # Main test suite orchestrator
β”‚   β”œβ”€β”€ mlflow_scale_test.js      # k6 load test script
β”‚   β”œβ”€β”€ mlflow_prefill_tenants.js # k6 script to prefill tenant data
β”‚   β”œβ”€β”€ collect_metrics.sh        # Prometheus metrics collector
β”‚   β”œβ”€β”€ report_summary.py         # Report & chart generator
β”‚   β”œβ”€β”€ k6-pod.yml                # k6 pod specification
β”‚   └── requirements.txt          # Python dependencies
β”‚
└── README.md

## 🔧 Advanced Usage

### Running Individual Components

```bash
# Collect Prometheus metrics manually
./scripts/collect_metrics.sh \
  --start-time $(date -d '10 minutes ago' +%s) \
  --end-time $(date +%s) \
  --output metrics.csv

# Generate reports from existing results
cd scripts/results
python3 ../report_summary.py \
  --pattern "summary_*.json" \
  --metrics-pattern "metrics_*.csv" \
  --output-dir .
```

### Running k6 Tests Manually

```bash
# Exec into the k6 pod
oc exec -it k6-benchmark -n opendatahub -- sh

# Run a single-tenant test
k6 run \
  -e MLFLOW_URL=https://mlflow.example.com \
  -e MLFLOW_TOKEN=sha256~xxx \
  -e CONCURRENCY=10 \
  -e DURATION=5m \
  -e TENANT_COUNT=1 \
  /scripts/mlflow_scale_test.js

# Run a multi-tenant test
k6 run \
  -e MLFLOW_URL=https://mlflow.example.com \
  -e MLFLOW_TOKEN=sha256~xxx \
  -e CONCURRENCY=50 \
  -e DURATION=5m \
  -e TENANT_COUNT=100 \
  /scripts/mlflow_scale_test.js
```

πŸ“ License

This project is released under the Apache License 2.0.
