Complete web-based dashboard for managing SkyPilot experiments with declarative desired state management, automatic reconciliation, and persistent job history tracking.
Status: ✅ Fully Implemented (Phases 1-3 Complete)
packages/skydeck/
├── skydeck/                    # Main package
│   ├── __init__.py             # Package initialization
│   ├── __main__.py             # Entry point for python -m skydeck
│   ├── models.py               # ✅ Pydantic data models
│   ├── database.py             # ✅ SQLite async database layer
│   ├── flag_schema.py          # ✅ Dynamic flag inference system
│   ├── state_manager.py        # ✅ SkyPilot state cache
│   ├── desired_state.py        # ✅ Experiment CRUD operations
│   ├── poller.py               # ✅ Background SkyPilot poller
│   ├── reconciler.py           # ✅ Reconciliation engine
│   ├── app.py                  # ✅ FastAPI REST API
│   ├── run.py                  # ✅ CLI entry point
│   └── static/                 # Web UI
│       ├── index.html          # ✅ Dashboard HTML
│       ├── styles.css          # ✅ Styling
│       └── app.js              # ✅ Frontend logic
├── tests/                      # Test suite
│   ├── test_models.py          # ✅ Model tests
│   └── test_database.py        # ✅ Database tests
├── examples/                   # Usage examples
│   ├── simple_experiment.py    # ✅ Basic example
│   └── grid_search.py          # ✅ Grid search example
├── pyproject.toml              # ✅ Package configuration
├── README.md                   # ✅ Full documentation
├── QUICKSTART.md               # ✅ Quick start guide
└── .gitignore                  # ✅ Git ignore rules
- Data Models (models.py)
  - Job: Single execution with full lifecycle tracking
  - Experiment: Configuration template with flags
  - Cluster: SkyPilot cluster information
  - JobStatus: Enum for job states
  - DesiredState: Enum for desired states
  - Request/Response models for API
- Database Layer (database.py)
  - 4 tables: experiments, jobs, clusters, flag_schemas
  - Full async operations with aiosqlite
  - Foreign key constraints with cascade delete
  - Efficient indexing
  - Row-to-model conversion helpers
- Flag Schema System (flag_schema.py)
  - Dynamic flag inference from Pydantic models
  - Type detection (STRING, INTEGER, FLOAT, BOOLEAN, CHOICE)
  - Recursive extraction for nested models
  - Database storage for typeahead autocomplete
- State Manager (state_manager.py)
  - Caches current SkyPilot state
  - Updates from sky.status() and sky.queue()
  - Maintains read-only view of infrastructure
  - Syncs job status to database
- Desired State Manager (desired_state.py)
  - CRUD operations for experiments
  - Manages desired state (what user wants)
  - Identifies experiments needing reconciliation
  - Enforces business rules
- Background Poller (poller.py)
  - Polls SkyPilot every 30s (configurable)
  - Async task with graceful start/stop
  - Calls sky.status() for clusters
  - Calls sky.queue() for each cluster's jobs
  - Updates StateManager cache
- Reconciliation Engine (reconciler.py)
  - Runs every 60s (configurable)
  - Compares desired vs current state
  - Takes actions:
    - LAUNCH: sky.launch() for new jobs
    - START: sky.start() for stopped clusters
    - STOP: sky.stop() and sky.cancel()
    - TERMINATE: sky.down() for cleanup
  - Creates Job records
  - Updates experiment state
- FastAPI Backend (app.py)
  - Lifespan context manager for startup/shutdown
  - REST API endpoints:
    - Experiments: CRUD, state management, job history
    - Jobs: Cancel operations
    - Clusters: List and status
    - Flag schemas: Typeahead metadata
    - System: Health, refresh, reconcile
  - Static file serving
  - Proper error handling
- Web UI (static/)
  - Single-page dashboard
  - Real-time updates (5s polling)
  - Experiment management:
    - Create with full configuration
    - Start/Stop/Terminate controls
    - Edit flags with JSON editor
    - View job history
  - Cluster status display
  - Color-coded status badges
  - Responsive design
- CLI Entry Point (run.py)
  - Argparse configuration
  - Environment variable support
  - Uvicorn server management
  - Logging setup
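The "async task with graceful start/stop" behaviour described for the background poller can be sketched as follows. This is a minimal illustration and not the code in poller.py: the `Poller` class, its method names, and the `poll_once` callback are hypothetical stand-ins, and the callback is where calls such as sky.status() and sky.queue() would presumably go.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class Poller:
    """Calls `poll_once` every `interval` seconds until stop() is awaited.

    Hypothetical sketch; names and structure are assumptions.
    """

    def __init__(self, poll_once: Callable[[], Awaitable[None]], interval: float = 30.0) -> None:
        self._poll_once = poll_once
        self._interval = interval
        self._stop_event: Optional[asyncio.Event] = None
        self._task: Optional[asyncio.Task] = None

    async def start(self) -> None:
        if self._task is not None:
            return  # already running
        self._stop_event = asyncio.Event()
        self._task = asyncio.create_task(self._run())

    async def stop(self) -> None:
        if self._task is None or self._stop_event is None:
            return
        self._stop_event.set()  # wakes the loop if it is sleeping
        await self._task        # wait for the current iteration to finish
        self._task = None

    async def _run(self) -> None:
        assert self._stop_event is not None
        while not self._stop_event.is_set():
            try:
                await self._poll_once()
            except Exception as exc:
                # A failed poll should not kill the background loop.
                print(f"poll failed: {exc!r}")
            try:
                # Sleep for one interval, but return early when stop() is called.
                await asyncio.wait_for(self._stop_event.wait(), timeout=self._interval)
            except asyncio.TimeoutError:
                pass  # interval elapsed, poll again


async def demo() -> int:
    """Run the poller briefly with a fake poll function and count the calls."""
    calls = 0

    async def fake_poll() -> None:
        nonlocal calls
        calls += 1

    poller = Poller(fake_poll, interval=0.01)
    await poller.start()
    await asyncio.sleep(0.05)
    await poller.stop()
    return calls
```

Waiting on an event with a timeout, rather than calling `asyncio.sleep`, lets `stop()` return promptly instead of blocking for up to a full poll interval.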
- User sets desired_state (RUNNING/STOPPED/TERMINATED)
- System automatically reconciles to match
- No manual SkyPilot commands needed
- Complete execution history per experiment
- Timestamps: created, submitted, started, ended
- Exit codes and error messages
- Never lose experiment outcomes
- Flags stored as flexible dict
- Type-safe with schema inference
- JSON editor with validation
- Typeahead autocomplete ready
- Background loop ensures consistency
- Handles crashes and restarts
- One running job per experiment enforced
- Cluster lifecycle management
- All state survives restarts
- Single file database
- Async operations
- Foreign key constraints
- Web UI polls every 5 seconds
- Background tasks update state
- Manual refresh available
- Force reconciliation on demand
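The declarative model above reduces to a pure decision: given the desired state and what SkyPilot currently reports, pick at most one action. The sketch below is one plausible reading of that rule set, not the logic in reconciler.py. The enums are local stand-ins for the real models, the cluster status strings "UP" and "STOPPED" (with `None` meaning no cluster exists) are an assumed encoding, and `decide` is a hypothetical helper.

```python
from enum import Enum
from typing import Optional


class DesiredState(str, Enum):
    # Stand-in for skydeck.models.DesiredState; values taken from the text above.
    RUNNING = "RUNNING"
    STOPPED = "STOPPED"
    TERMINATED = "TERMINATED"


class Action(str, Enum):
    LAUNCH = "LAUNCH"        # sky.launch() for a new job
    START = "START"          # sky.start() for a stopped cluster
    STOP = "STOP"            # sky.stop() and sky.cancel()
    TERMINATE = "TERMINATE"  # sky.down() for cleanup


def decide(
    desired: DesiredState,
    cluster_status: Optional[str],
    has_running_job: bool,
) -> Optional[Action]:
    """Return the action that moves current state toward desired state, or None."""
    if desired is DesiredState.RUNNING:
        if has_running_job:
            return None  # one running job per experiment: nothing to do
        if cluster_status == "STOPPED":
            return Action.START
        return Action.LAUNCH
    if desired is DesiredState.STOPPED:
        if has_running_job or cluster_status == "UP":
            return Action.STOP
        return None
    # desired is TERMINATED: tear down whatever cluster still exists.
    if cluster_status is not None:
        return Action.TERMINATE
    return None
```

Keeping the decision free of side effects makes it easy to unit test; the reconciliation loop would then map each returned action to the corresponding SkyPilot call.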
- GET /api/experiments: List all
- POST /api/experiments: Create new
- GET /api/experiments/{id}: Get details
- DELETE /api/experiments/{id}: Delete
- POST /api/experiments/{id}/state: Update desired state
- GET /api/experiments/{id}/status: Full status
- GET /api/experiments/{id}/jobs: Job history
- POST /api/experiments/{id}/flags: Update flags
- POST /api/jobs/{id}/cancel: Cancel job
- GET /api/clusters: List all
- GET /api/health: Health status
- GET /api/flag-schemas: Flag metadata
- POST /api/refresh: Force poll
- POST /api/reconcile: Force reconciliation
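As an illustration of driving these endpoints from a script, the sketch below builds requests with the standard library only. The paths and the http://localhost:8000 address come from this document; the JSON body shape (`{"desired_state": ...}`) and the assumption that responses are JSON are guesses, so check them against app.py before relying on this.

```python
import json
import urllib.request
from typing import Any, Optional

BASE_URL = "http://localhost:8000"  # default address used elsewhere in this document


def build_request(method: str, path: str, payload: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) a request against the SkyDeck REST API."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


def set_desired_state_request(experiment_id: str, state: str) -> urllib.request.Request:
    # Assumption: the endpoint accepts a JSON object with a "desired_state" field.
    return build_request("POST", f"/api/experiments/{experiment_id}/state", {"desired_state": state})


def send(request: urllib.request.Request) -> Any:
    """Send a request and decode the response, assuming a JSON body."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the dashboard running, `send(build_request("GET", "/api/health"))` would fetch the health status, and `send(set_desired_state_request("test", "STOPPED"))` would ask the reconciler to stop the experiment with id "test".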
- test_models.py: Command building, state detection
- test_database.py: CRUD operations, persistence
- Data model validation
- Command building logic
- Reconciliation detection
- Database operations
- State transitions
Run tests with:
uv run pytest tests/ -v

uv run python examples/simple_experiment.py
Creates a single test experiment.

uv run python examples/grid_search.py
Creates a 2×4 grid over nodes × layers (8 experiments total).
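A 2×4 grid like the one described can be expanded with `itertools.product`. The sketch below is not the contents of grid_search.py: the concrete node and layer values, the flag names, and the id format are all hypothetical, chosen only to show how two axes yield 8 experiment specifications.

```python
from itertools import product
from typing import Any, Dict, List, Sequence

NODES = [1, 2]          # hypothetical values: 2 options
LAYERS = [2, 4, 8, 16]  # hypothetical values: 4 options


def build_grid(nodes: Sequence[int], layers: Sequence[int]) -> List[Dict[str, Any]]:
    """Return one experiment spec per (nodes, layers) combination."""
    return [
        {
            "id": f"grid-n{n}-l{l}",           # unique id per grid cell
            "name": f"nodes={n} layers={l}",
            "flags": {"nodes": n, "layers": l},
        }
        for n, l in product(nodes, layers)
    ]


specs = build_grid(NODES, LAYERS)
```

Each spec uses the same `id`, `name`, and `flags` fields as the `CreateExperimentRequest` call shown later in this document, so it could presumably be passed to that model along with a `desired_state`.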
- README.md: Complete architecture and design documentation
- QUICKSTART.md: Step-by-step getting started guide
- IMPLEMENTATION_SUMMARY.md: This file
- Code comments: Comprehensive docstrings throughout
cd packages/skydeck
uv pip install -e .
uv run python -m skydeck
# Open http://localhost:8000

- Click "New Experiment"
- Fill in form
- Click "Create"
- Click "Start" to launch
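import asyncio

from skydeck.database import Database
from skydeck.desired_state import DesiredStateManager
from skydeck.models import CreateExperimentRequest, DesiredState


async def main() -> None:
    # Open (or create) the SQLite database file
    db = Database("skydeck.db")
    await db.connect()

    # Register an experiment whose desired state is RUNNING
    dsm = DesiredStateManager(db)
    await dsm.create_experiment(CreateExperimentRequest(
        id="test",
        name="Test",
        flags={"key": "value"},
        desired_state=DesiredState.RUNNING,
    ))


# await is only valid inside a coroutine, so run main() on an event loop
asyncio.run(main())

- Experiments are templates, jobs are instances - Like classes vs objects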
- Flags are first-class - Dynamically typed, inferred from Pydantic
- SQLite for everything - Survives restarts, single dependency
- Automatic reconciliation - Set desired state, system does the rest
- Job history tracking - Never lose experiment outcomes
- One running job per experiment - Enforced by reconciler
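The template/instance relationship and the cascade-delete behaviour can be illustrated with a small schema. This sketch uses the synchronous `sqlite3` module from the standard library instead of aiosqlite to stay dependency-free, and the column names are hypothetical; only the `experiments` and `jobs` table names come from this document.

```python
import sqlite3
from typing import Tuple

SCHEMA = """
CREATE TABLE experiments (
    id            TEXT PRIMARY KEY,
    name          TEXT NOT NULL,
    desired_state TEXT NOT NULL DEFAULT 'STOPPED'
);
CREATE TABLE jobs (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    experiment_id TEXT NOT NULL REFERENCES experiments(id) ON DELETE CASCADE,
    status        TEXT NOT NULL
);
CREATE INDEX idx_jobs_experiment ON jobs(experiment_id);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    # SQLite does not enforce foreign keys unless this pragma is set per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def demo() -> Tuple[int, int]:
    """Insert one experiment with two jobs, delete it, and count jobs before and after."""
    conn = open_db()
    conn.execute("INSERT INTO experiments (id, name) VALUES (?, ?)", ("test", "Test"))
    conn.execute("INSERT INTO jobs (experiment_id, status) VALUES (?, ?)", ("test", "SUCCEEDED"))
    conn.execute("INSERT INTO jobs (experiment_id, status) VALUES (?, ?)", ("test", "RUNNING"))
    before = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
    conn.execute("DELETE FROM experiments WHERE id = ?", ("test",))
    after = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
    conn.close()
    return before, after
```

One experiment row can own many job rows, and deleting the experiment removes its job history along with it. A rule such as "one running job per experiment" could also be pushed into the schema with a partial unique index, although the text above says the reconciler enforces it.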
The system is fully functional; potential future improvements include:
- Enhanced UI
  - Flag typeahead autocomplete
  - Real-time log streaming
  - Charts and visualizations
  - W&B integration display
- Advanced Features
  - Job templates/cloning
  - Experiment groups/tags filtering
  - Cost estimation
  - Resource optimization suggestions
- Integration
  - Slack/email notifications
  - W&B automatic linking
  - GitHub Actions integration
  - Prometheus metrics export
- Polish
  - More comprehensive tests
  - Performance optimization
  - Better error messages
  - User authentication
- Python: >=3.10
- FastAPI: REST API framework
- uvicorn: ASGI server
- aiosqlite: Async SQLite
- pydantic: Data validation
- skypilot: Cluster management
Environment variables:
- SKYDECK_DB_PATH: Database location (default: skydeck.db)
- SKYDECK_POLL_INTERVAL: Poll interval in seconds (default: 30)
- SKYDECK_RECONCILE_INTERVAL: Reconcile interval in seconds (default: 60)
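Reading these variables with their documented defaults might look like the sketch below. The `Settings` dataclass and `load_settings` function are hypothetical names, not part of run.py; only the variable names and default values come from this document.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    db_path: str
    poll_interval: float       # seconds
    reconcile_interval: float  # seconds


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read SkyDeck settings from the environment, falling back to documented defaults."""
    return Settings(
        db_path=env.get("SKYDECK_DB_PATH", "skydeck.db"),
        poll_interval=float(env.get("SKYDECK_POLL_INTERVAL", "30")),
        reconcile_interval=float(env.get("SKYDECK_RECONCILE_INTERVAL", "60")),
    )
```

Passing the environment in as a mapping keeps the function testable: `load_settings({"SKYDECK_POLL_INTERVAL": "5"})` overrides one value without touching the real process environment.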
CLI options:
python -m skydeck --help

License: MIT
The SkyDeck Dashboard is fully implemented and ready to use. All core features from the original specification have been completed:
- ✅ Declarative state management
- ✅ Automatic reconciliation
- ✅ Job history tracking
- ✅ Dynamic flag configuration
- ✅ SQLite persistence
- ✅ Web UI dashboard
- ✅ REST API
- ✅ Background polling
- ✅ Complete documentation
- ✅ Example scripts
- ✅ Unit tests
The system is production-ready for managing SkyPilot experiments!