Commit e3ffed0

Refactor: Consolidate TF Modernization changes (Squashed from PR #233)
1 parent b730392 commit e3ffed0

338 files changed

Lines changed: 126585 additions & 1847 deletions

.github/CODEOWNERS

File mode changed: 100644 → 100755

.github/workflows/cla.yml

File mode changed: 100644 → 100755

.github/workflows/test.yml

Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
+name: Tests
+
+on:
+  push:
+    branches: [main, master]
+  pull_request:
+    branches: [main, master]
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: ['3.10', '3.11', '3.12']
+
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+
+      - name: Install system dependencies
+        run: |
+          sudo apt-get update
+          sudo apt-get install -y libopenmpi-dev openmpi-common
+
+      - name: Install package and test dependencies
+        run: |
+          python -m pip install --upgrade pip
+          # Install the package in editable mode without DLIO
+          pip install -e ".[test]"
+
+      - name: Run unit tests
+        run: |
+          pytest tests/unit -v --tb=short
+
+      - name: Run unit tests with coverage
+        run: |
+          pytest tests/unit -v --cov=mlpstorage --cov-report=xml --cov-report=term-missing
+
+      - name: Upload coverage to Codecov
+        uses: codecov/codecov-action@v4
+        with:
+          files: ./coverage.xml
+          fail_ci_if_error: false
+          verbose: true
+        env:
+          CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
+
+  lint:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.11'
+
+      - name: Install lint dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install ruff
+
+      - name: Run ruff check
+        run: |
+          ruff check mlpstorage/ --output-format=github || true
+
+      - name: Run ruff format check
+        run: |
+          ruff format --check mlpstorage/ || true

.gitignore

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+# Python cache
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+
+# Distribution / packaging
+dist/
+build/
+*.egg-info/
+
+# Virtual environments
+venv/
+.venv/
+env/
+
+# IDE
+.idea/
+.vscode/
+*.swp
+*.swo
+
+# Test artifacts
+.pytest_cache/
+.coverage
+htmlcov/
+*.html
+
+# OS files
+.DS_Store
+Thumbs.db

.planning/PROJECT.md

Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
+# MLPerf Storage Benchmark Suite v3.0
+
+## What This Is
+
+A benchmark orchestration framework for the MLCommons MLPerf Storage working group. The suite runs storage benchmarks aligned with MLPerf rules and reports results with verification of rules compliance.
+
+## Core Value
+
+**The ONE thing that must work:** Orchestrate multiple benchmark types (training, checkpointing, kv-cache, vectordb) across distributed systems and produce verified, rules-compliant results.
+
+## Context
+
+### Current State
+- v2.0 release with Claude Code enhancements
+- Training and checkpointing benchmarks use DLIO as underlying engine
+- KV cache benchmark exists in separate directory (`kv_cache_benchmark/`)
+- VectorDB benchmark code exists in external branch
+- MPI-based execution and host collection for DLIO benchmarks
+- Existing error handling and validation pipeline
+
+### Target State (v3.0)
+- Fully integrated KV cache and VectorDB benchmarks as Benchmark subclasses
+- New training models (dlrm, retinanet, flux)
+- Package version management with lockfiles
+- SSH-based host collection for non-MPI benchmarks
+- Time-series /proc/ data collection during benchmark execution
+- Improved error messaging and user guidance
+
+### Timeline
+- **Feature freeze:** 6 weeks
+- **Bugfix period:** 6 weeks
+- **Code freeze:** 12 weeks total
+
+## Requirements
+
+### Validated (Existing)
+
+- ✓ Training benchmark orchestration via DLIO — existing
+- ✓ Checkpointing benchmark orchestration via DLIO — existing
+- ✓ MPI-based distributed execution — existing
+- ✓ Rules validation pipeline — existing
+- ✓ Report generation — existing
+- ✓ CLI with nested subcommands — existing
+- ✓ Benchmark registry pattern — existing
+
+### Active
+
+- [ ] Package version lockfile management
+- [ ] Remove GPU package dependencies (not used)
+- [ ] KV cache Benchmark class (wraps kv-cache.py)
+- [ ] KV cache MPI execution across hosts
+- [ ] VectorDB Benchmark class (wraps load_vdb.py, compact_and_watch.py, simple_bench.py)
+- [ ] SSH-based host collection for non-MPI benchmarks
+- [ ] New training models: dlrm, retinanet, flux
+- [ ] Improved error messaging for missing commands/packages
+- [ ] Clear user guidance for resolving dependency issues
+- [ ] Time-series /proc/ collection (diskstats, vmstat, cpuinfo, etc.)
+- [ ] Parallel collection process (10 sec intervals) without impacting benchmark
+
+### Out of Scope
+
+- GPU support — deliberately not supporting GPU execution
+- Rewriting KV/VDB as native benchmarks — v3.0 wraps existing scripts
+- Real-time monitoring UI — collection only, no visualization
+- Cloud provider integrations — on-premise/bare-metal focus
+
+## Key Decisions
+
+| Decision | Rationale | Outcome |
+|----------|-----------|---------|
+| Lockfile for package versions | Reproducibility across systems, MPI version issues | Pending |
+| Benchmark subclasses for KV/VDB | Minimal integration, reuse CLI and reporting infrastructure | Pending |
+| SSH for non-MPI host collection | KV cache and VectorDB don't require MPI execution | Pending |
+| Parallel process for time-series | Must not impact benchmark performance | Pending |
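
Note on the "Benchmark subclasses for KV/VDB" decision: the sketch below shows one way a wrapper subclass could plug into a registry so the CLI can look it up by name. It is only an illustration. The `Benchmark` base class here is a stand-in (the suite's real interface is not part of this excerpt), and the registry name, the `register` decorator, the script path, and the `--num-users` flag are all hypothetical.

```python
import sys
from abc import ABC, abstractmethod

# Hypothetical registry: maps a CLI benchmark name to its class.
BENCHMARK_REGISTRY = {}


def register(name):
    """Class decorator that records a Benchmark subclass under `name`."""
    def decorator(cls):
        BENCHMARK_REGISTRY[name] = cls
        return cls
    return decorator


class Benchmark(ABC):
    """Stand-in for the suite's base class; the real interface is an assumption."""

    def __init__(self, **options):
        self.options = options

    @abstractmethod
    def build_command(self):
        """Return the argv list that launches the wrapped script."""


@register("kvcache")
class KVCacheBenchmark(Benchmark):
    # Script name comes from the planning doc; the path and flags are assumptions.
    script = "kv_cache_benchmark/kv-cache.py"

    def build_command(self):
        cmd = [sys.executable, self.script]
        # Turn keyword options into `--flag value` pairs in a stable order.
        for key, value in sorted(self.options.items()):
            cmd += [f"--{key.replace('_', '-')}", str(value)]
        return cmd


bench = BENCHMARK_REGISTRY["kvcache"](num_users=4)
print(bench.build_command()[1:])
# ['kv_cache_benchmark/kv-cache.py', '--num-users', '4']
```

Keeping the wrapper down to command construction means the existing script stays untouched, which matches the "minimal integration" rationale in the table.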
+
+## Constraints
+
+- **No GPU dependencies** — storage benchmark, not compute
+- **MPI compatibility** — must work with various MPI implementations
+- **Cross-platform** — Linux primarily, various distributions
+- **Minimal dependencies** — reduce version conflict surface area
+
+## External Code References
+
+| Component | Location | Notes |
+|-----------|----------|-------|
+| KV cache benchmark | `kv_cache_benchmark/` (local) | Also: `mlcommons/storage/TF_KVCache` branch |
+| VectorDB benchmark | `mlcommons/storage/TF_VDBBench` branch | Scripts: load_vdb.py, compact_and_watch.py, simple_bench.py |
+| DLIO benchmark | External package | Upstream dependency for training/checkpointing |
+
+## Success Metrics
+
+- All 4 benchmark types (training, checkpointing, kv-cache, vectordb) runnable from unified CLI
+- Package lockfile prevents version conflicts in CI
+- Error messages guide users to resolution for common issues
+- Host data collected for all benchmark types (MPI or SSH)
+- Time-series collection runs without measurable benchmark impact
+
+---
+*Last updated: 2026-01-23 after initialization*
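
Note on the time-series goal in PROJECT.md: the plan calls for sampling /proc/ files at 10-second intervals in a parallel process. A minimal sketch of that idea follows, assuming a child process that snapshots a few files until it is told to stop. The function names, the sample layout, and the choice of `multiprocessing` are assumptions for illustration, not the commit's implementation.

```python
import multiprocessing as mp
import time
from pathlib import Path

# Files named in the planning doc; the snapshot layout below is an assumption.
PROC_FILES = ("/proc/diskstats", "/proc/vmstat")


def snapshot(paths=PROC_FILES):
    """Read each file once; unreadable files map to None instead of raising."""
    sample = {"timestamp": time.time()}
    for path in paths:
        try:
            sample[path] = Path(path).read_text()
        except OSError:
            sample[path] = None
    return sample


def collect(stop_event, queue, interval=10.0, paths=PROC_FILES):
    """Run in a child process: sample until stop_event is set, then send results."""
    samples = [snapshot(paths)]              # sample at benchmark start
    # wait() doubles as the sleep and returns True as soon as stop is requested.
    while not stop_event.wait(interval):
        samples.append(snapshot(paths))
    samples.append(snapshot(paths))          # final sample at benchmark end
    queue.put(samples)


if __name__ == "__main__":
    stop, results = mp.Event(), mp.Queue()
    worker = mp.Process(target=collect, args=(stop, results, 0.1))
    worker.start()
    time.sleep(0.35)                         # stand-in for the benchmark run
    stop.set()
    samples = results.get(timeout=5)         # drain the queue before joining
    worker.join()
    print(f"collected {len(samples)} samples")
```

Running the collector in a separate process keeps it off the benchmark's interpreter, and between samples it only blocks on the event. Whether that is enough to avoid measurable impact would still have to be checked on the target system.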

.planning/REQUIREMENTS.md

Lines changed: 92 additions & 0 deletions
@@ -0,0 +1,92 @@
+# MLPerf Storage v3.0 Requirements
+
+## v1 Requirements
+
+### Package Management
+
+- [x] **PKG-01**: Lockfile for Python dependencies with pinned versions
+- [x] **PKG-02**: Remove GPU package dependencies from default install
+- [x] **PKG-03**: Validate package versions match lockfile before benchmark execution
+
+### Benchmark Integration
+
+- [x] **BENCH-01**: KVCacheBenchmark class extending Benchmark base (wraps kv-cache.py)
+- [x] **BENCH-02**: KV cache MPI execution across multiple hosts
+- [x] **BENCH-03**: VectorDBBenchmark class extending Benchmark base (wraps VDB scripts)
+- [x] **BENCH-04**: VectorDB CLI commands (run, datagen operations)
+- [x] **BENCH-05**: Integration with existing validation/reporting pipeline
+
+### Training Updates
+
+- [x] **TRAIN-01**: Add dlrm model configuration
+- [x] **TRAIN-02**: Add retinanet model configuration
+- [x] **TRAIN-03**: Add flux model configuration
+- [x] **TRAIN-04**: Update DLIO to support parquet for data loaders, readers, data generation
+- [x] **TRAIN-05**: Production-ready parquet reader with memory-efficient I/O
+- [x] **TRAIN-06**: Update pyproject.toml to reference DLIO fork
+
+### Host Collection
+
+- [x] **HOST-01**: SSH-based host collection for non-MPI benchmarks
+- [x] **HOST-02**: Collect /proc/ data (diskstats, vmstat, cpuinfo, filesystems, cgroups)
+- [x] **HOST-03**: Collection at benchmark start and end
+- [x] **HOST-04**: Time-series collection (10 sec intervals) during execution
+- [x] **HOST-05**: Parallel collection process without benchmark performance impact
+
+### Error Handling & UX
+
+- [x] **UX-01**: Detect missing commands/packages with actionable error messages
+- [x] **UX-02**: Suggest installation steps for missing dependencies
+- [x] **UX-03**: Validate environment before benchmark execution (fail-fast)
+- [x] **UX-04**: Clear progress indication during long operations
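
Note on UX-01 through UX-03: these requirements describe detecting missing commands, suggesting how to install them, and failing before the benchmark starts. The sketch below shows one possible shape for such a check, built on `shutil.which`. The exception class, function names, and install hints are hypothetical, and the hints assume a Debian or Ubuntu host.

```python
import shutil

# Hypothetical install hints for Debian/Ubuntu; package names are illustrative.
INSTALL_HINTS = {
    "mpirun": "sudo apt-get install -y openmpi-bin",
    "ssh": "sudo apt-get install -y openssh-client",
}


class MissingDependencyError(RuntimeError):
    """Raised before a benchmark starts when required commands are absent."""


def find_missing(commands, which=shutil.which):
    """Return the commands that are not on PATH, preserving input order."""
    return [cmd for cmd in commands if which(cmd) is None]


def validate_environment(commands, which=shutil.which):
    """Fail fast with one message listing every missing command and a hint."""
    missing = find_missing(commands, which)
    if not missing:
        return
    lines = ["Missing required commands:"]
    for cmd in missing:
        hint = INSTALL_HINTS.get(cmd, "no install hint available")
        lines.append(f"  - {cmd}: try `{hint}`")
    raise MissingDependencyError("\n".join(lines))
```

Reporting every missing command in a single error, rather than stopping at the first one, saves the user from repeated fix-and-rerun cycles. The `which` parameter exists only so the check can be exercised without touching the real PATH.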
+
+---
+
+## v2 Requirements (Deferred)
+
+- [ ] Deeper KV cache integration (native implementation vs wrapper)
+- [ ] Deeper VectorDB integration (native implementation vs wrapper)
+- [ ] Real-time monitoring dashboard for time-series data
+- [ ] Cloud provider integrations (AWS, GCP, Azure)
+
+---
+
+## Out of Scope
+
+- **GPU support** — Storage benchmark, deliberately not supporting GPU execution
+- **Rewriting KV/VDB as native benchmarks** — v3.0 wraps existing scripts
+- **Real-time visualization** — Collection only, no visualization in v3.0
+- **Windows support** — Linux-only target
+
+---
+
+## Traceability
+
+| Requirement | Phase | Status |
+|-------------|-------|--------|
+| PKG-01 | Phase 1 | Complete |
+| PKG-02 | Phase 1 | Complete |
+| PKG-03 | Phase 1 | Complete |
+| UX-01 | Phase 2 | Complete |
+| UX-02 | Phase 2 | Complete |
+| UX-03 | Phase 2 | Complete |
+| BENCH-01 | Phase 3 | Complete |
+| BENCH-02 | Phase 3 | Complete |
+| BENCH-03 | Phase 4 | Complete |
+| BENCH-04 | Phase 4 | Complete |
+| BENCH-05 | Phase 5 | Complete |
+| HOST-01 | Phase 6 | Complete |
+| HOST-02 | Phase 6 | Complete |
+| HOST-03 | Phase 6 | Complete |
+| HOST-04 | Phase 7 | Complete |
+| HOST-05 | Phase 7 | Complete |
+| TRAIN-01 | Phase 8 | Complete |
+| TRAIN-02 | Phase 8 | Complete |
+| TRAIN-03 | Phase 8 | Complete |
+| TRAIN-04 | Phase 9 | Complete |
+| UX-04 | Phase 10 | Complete |
+| TRAIN-05 | Phase 11 | Complete |
+| TRAIN-06 | Phase 11 | Complete |
+
+---
+*Last updated: 2026-01-25*
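
Note on PKG-03 in REQUIREMENTS.md: the requirement is to validate installed package versions against the lockfile before a benchmark runs. A minimal sketch of such a check is shown below, assuming the lockfile uses simple requirements-style `name==version` lines. The lockfile format, the function names, and the example pins are assumptions; only `importlib.metadata` is a real standard-library API here.

```python
from importlib import metadata


def parse_lockfile(text):
    """Parse `name==version` pins; skip blanks, comments, and other line kinds."""
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        # Drop environment markers such as `; python_version >= "3.10"`.
        pins[name.strip().lower()] = version.split(";", 1)[0].strip()
    return pins


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Return {name: (pinned, installed)} for every package that differs."""
    mismatches = {}
    for name, pinned in pins.items():
        found = lookup(name)
        if found != pinned:
            mismatches[name] = (pinned, found)
    return mismatches


# Illustrative pins and a fake environment; neither reflects the real lockfile.
lock = """
# pinned for reproducibility
numpy==1.26.4
pyyaml==6.0.1  # config parsing
"""
fake_env = {"numpy": "1.26.4", "pyyaml": "6.0.2"}
print(find_mismatches(parse_lockfile(lock), fake_env.get))
# {'pyyaml': ('6.0.1', '6.0.2')}
```

In real use the default `lookup` would query the active environment, and a non-empty result could be turned into an error before the benchmark starts. Lockfiles with hashes, extras, or line continuations would need a fuller parser than this one.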
