
pf9-mngt

Provisioning infrastructure is solved.
Operating it at scale is not.

pf9-mngt is a self-hosted operational control plane for Platform9 / OpenStack. It adds the persistent inventory, automated recovery workflows, and governance layer that Platform9 itself does not provide — built for the teams responsible for what happens after Day-0.

Operational Control Plane for Platform9 / OpenStack
Visibility · Recovery · Operations · Intelligence

License: MIT · Version · CI · Kubernetes · Demo Mode · Buy Me A Coffee

⭐ If pf9-mngt saves your team time, star the repo — it helps others find it.


🔄 What This Actually Replaces

| Without pf9-mngt | With pf9-mngt |
| --- | --- |
| Scripts that dump inventory to CSV, manually maintained | Persistent PostgreSQL inventory, 29 resource types, always current |
| VM restore = manual reconstruction at 3am under SLA pressure | Fully automated restore — flavor, network, IPs, volumes, credentials |
| No snapshot scheduler → custom cron per tenant, no SLA tracking | Policy-driven snapshot automation, cross-tenant, quota-aware, SLA-compliant |
| Migration planning in spreadsheets → guesswork | End-to-end planner: RVTools → risk scoring → wave planning → PCD provisioning |
| Separate ticketing tool + separate runbook wiki + separate billing exports | Built-in: tickets, 25 runbooks, metering, chargeback — one system |
| Tenants call you for every status check → your team is the bottleneck | Tenant self-service portal: customers view their own VMs, snapshots, and restores — scoped, isolated, MFA-protected |

One system. No duct tape.


🧭 What It Gives You

pf9-mngt adds a persistent operational layer on top of Platform9 / OpenStack, combining inventory, automation, recovery workflows, and governance into a single self-hosted system:

  • Full infrastructure visibility — all metadata in your own PostgreSQL, independent of platform uptime, 29 resource types, cross-tenant
  • Automated snapshot & restore workflows — no native equivalent exists in Platform9 or OpenStack; fully automated, SLA-tracked, audited
  • VMware → OpenStack migration planning — end-to-end from RVTools ingestion to PCD auto-provisioning
  • Governance, audit, and Day-2 tooling — runbooks, tickets, metering, chargeback, tenant self-service
  • MSP business value reporting — SLA compliance tracking per tier (Gold/Silver/Bronze), QBR PDF generation per customer, Account Manager Portfolio dashboard (per-tenant SLA status, vCPU usage, leakage alerts), Executive Health dashboard (fleet SLA gauge, MTTR, revenue leakage)

Works alongside Platform9 via its APIs. Not a replacement — an operational layer on top.


🚨 The Day-2 Operations Reality

Provisioning is not the hard part anymore.

Running infrastructure at scale is.

What actually breaks in real Platform9 / OpenStack environments:

  • Snapshot SLAs across tenants — no native scheduler exists
  • VM restore under pressure — no native workflow; everything is manual reconstruction
  • Metadata ownership — resource names, relationships, and topology live on the platform, not with you
  • Cross-tenant visibility at scale — the native UI is per-tenant, not operational-aggregate
  • Multi-region complexity — managing multiple clusters with no unified console
  • Coordination gaps — between support, engineering, and management teams
  • Customer self-service — tenants need to see their own infrastructure status without you being a human API; the native Platform9 UI is admin-only

These are Day-2 operations problems. pf9-mngt solves them.


💡 What pf9-mngt Is

A self-hosted operational platform that extends Platform9 / OpenStack — not replaces it.

  • A persistent inventory engine — all Platform9 / OpenStack metadata in your own PostgreSQL, always available, independent of platform uptime (the RVTools equivalent for OpenStack)
  • A snapshot automation engine — no native scheduler exists in Platform9 or OpenStack; this one is quota-aware, cross-tenant, policy-driven, with SLA compliance reporting
  • A VM restore system — full automation of flavor, network, IPs, credentials, and volumes; two modes (side-by-side and replace); no native equivalent exists in OpenStack
  • A migration planning workbench — from RVTools ingestion through cohort design, wave planning, and PCD auto-provisioning
  • A unified engineering console — 30+ management tabs, RBAC, metering, chargeback, runbooks, tickets, and AI Ops Copilot
  • A tenant self-service portal — a completely isolated, MFA-protected web interface that gives customers read + restore access to their own infrastructure without touching your admin panel; access is opt-in per Keystone user, controlled by you

✔ Works alongside Platform9 via its APIs · ❌ Not a UI replacement · ❌ Not an official Platform9 product


🔑 Four Pillars

Everything in pf9-mngt is built around four operational concerns:

| Pillar | What it covers |
| --- | --- |
| 🔭 Visibility | Cross-tenant, multi-region inventory with drift detection, dependency graph, and historical tracking — metadata owned by you, not the platform |
| ♻️ Recovery | Snapshot automation and full VM restore orchestration — two modes, dry-run validation, SLA compliance, no native equivalent in OpenStack |
| 🎫 Operations | Ticketing, 25 built-in runbooks, metering, chargeback, standardized governance workflows, and tenant self-service portal |
| 🤖 Intelligence | AI Ops Copilot (plain-language queries against live infrastructure), Operational Intelligence Feed (capacity, waste, risk and anomaly engines), SLA compliance tracking and breach detection, QBR PDF generator, Account Manager Portfolio and Executive Health dashboards, revenue leakage detection, VMware migration planning end-to-end |

Everything else in the system — LDAP, multi-region, Kubernetes, export reports — supports one of these four pillars.


🧠 Why This Matters

| Challenge | Native Platform9 | pf9-mngt |
| --- | --- | --- |
| Cross-tenant visibility | Per-tenant only | Centralized persistent inventory |
| Snapshot SLA enforcement | None built-in | Policy-driven, multi-tenant, audited |
| VM restore workflow | Manual reconstruction | Full automation, two modes, dry-run |
| Metadata ownership | Lives on the platform | Your PostgreSQL, always available |
| Multi-region ops | Operationally complex | Unified console, one-click context switch |
| Day-2 workflows | External tools | Built-in tickets, runbooks, metering |
| VMware migration | No native tooling | End-to-end planner: RVTools → PCD |
| Tenant visibility | You are the human API | Self-service portal: MFA-protected, RLS-isolated, scoped to their projects |

🔥 What Makes It Different

Most platforms solve provisioning.

pf9-mngt solves what happens after deployment — the snapshot SLAs that must hold, the 3am restore that must succeed, the compliance report due tomorrow, the capacity forecast before the cluster fills up, the VMware migration that has to go right.

Built from real-world operations. 670+ commits, 270+ releases, 18 containerized services.

Not theory — from what actually breaks in production.


🤔 Why Not Just Use Platform9, Scripts, or Grafana?

Because pf9-mngt combines in one system what would otherwise take 5+ separate tools:

| Problem | Typical approach | pf9-mngt |
| --- | --- | --- |
| Infrastructure inventory | Scripts → CSV dumps | Persistent PostgreSQL, 29 resource types, always yours |
| Snapshot scheduling | No native scheduler | Policy-driven, cross-tenant, quota-aware, SLA-compliant |
| VM restore | Manual reconstruction under pressure | Fully automated, two modes, dry-run, audited |
| VMware migration planning | Spreadsheets + guesswork | End-to-end: RVTools → risk scoring → wave planning → PCD provisioning |
| Operations governance | Separate ticketing + runbook tool | Built-in: 25 runbooks, full ticket lifecycle, approval gates, metering |
| MSP reporting | Manual QBRs + spreadsheet SLA tracking | QBR PDF generator, SLA tier compliance, Account Manager Portfolio dashboard |

A custom script solves one problem once. pf9-mngt enforces operational discipline at scale.

Full technical feature reference: docs/FEATURES_REFERENCE.md


⚡ Try It Now — Demo Mode (No Platform9 Required)

Explore the full dashboard without a Platform9 environment:

git clone https://github.com/erezrozenbaum/pf9-mngt.git
cd pf9-mngt
.\deployment.ps1   # select option 2 — Demo

Populates the database with 3 tenants, 35 VMs, 50+ volumes, snapshots, drift events, compliance reports, and a metrics cache. Every dashboard, report, and workflow is fully functional — no live cluster needed.

UI: http://localhost:5173 · API Docs: http://localhost:8000/docs


⚡ What You Get in 30 Seconds

After running Demo Mode you'll find:

  • 3 tenants preloaded with realistic VM topology and metadata
  • 35 VMs with volumes, snapshot policies, and compliance reports
  • Migration plan example — risk-scored VMs, cohort design, wave planning
  • Ticketing + runbook system — full lifecycle, SLA tracking, 25 built-in procedures
  • Dashboard KPIs, drift events, and audit trail — every workflow wired up

No Platform9 cluster required. Full product experience in under 5 minutes.


πŸ—οΈ Architecture

18-container microservices platform:

Service Stack Port Purpose
nginx (TLS proxy) nginx:1.27-alpine 80/443 HTTPS termination, HTTP→HTTPS redirect, reverse proxy to API and UI
Frontend UI React 19.2+ / TypeScript / Vite 5173 30+ management tabs + admin panel
Backend API FastAPI / Gunicorn / Python 8000 170+ REST endpoints, RBAC middleware, 4 workers + --max-requests 1000
Redis redis:7-alpine internal OpenStack inventory/quota cache (60–300 s TTL, allkeys-lru, 128 MiB cap)
LDAP Server OpenLDAP internal Enterprise authentication directory (not exposed to host)
LDAP Admin phpLDAPadmin 8081 (dev profile) Web-based LDAP management (--profile dev)
Monitoring Service FastAPI / Python 8001 Real-time metrics via Prometheus
Database PostgreSQL 16 internal 160+ tables, audit, metering, migration planner, tenant portal RLS (not exposed to host)
Database Admin pgAdmin4 8080 (dev profile) Web-based PostgreSQL management (--profile dev)
Snapshot Worker Python β€” Automated snapshot management
Notification Worker Python / SMTP β€” Email alerts for drift, snapshots, compliance
Backup Worker Python / PostgreSQL β€” Scheduled DB + LDAP backups to NFS, restore (backup profile)
Scheduler Worker Python β€” Host metrics collection + RVTools inventory (runs inside Docker)
Metering Worker Python / PostgreSQL β€” Resource metering every 15 minutes
Search Worker Python / PostgreSQL β€” Incremental full-text indexing for Ops Assistant
LDAP Sync Worker Python / PostgreSQL / OpenLDAP β€” Bi-directional DB ↔ LDAP sync, polls every 30 s
Tenant Portal API FastAPI / Gunicorn / Python 8010 Tenant self-service portal β€” JWT + RLS, MFA, per-user access allowlist
Tenant Portal UI React 19.2+ / TypeScript / nginx 8083 (dev: 8082) Tenant self-service web interface β€” 10 screens, MFA login, per-customer branding, VM provisioning, SG rule editing, dependency graph

Architecture

pf9_scheduler_worker (Docker container) runs host_metrics_collector.py (every 60 s) and pf9_rvtools.py (configurable interval or daily schedule) for infrastructure discovery and metrics collection. No Windows Task Scheduler dependency.
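
The Redis cache parameters quoted in the table above (128 MiB memory cap, allkeys-lru eviction, 60–300 s TTLs) map directly onto standard Redis options. A minimal sketch of an equivalent setup follows; the flag values come from the table, while the key name and payload are purely illustrative:

# Start Redis with the memory cap and eviction policy described above
redis-server --maxmemory 128mb --maxmemory-policy allkeys-lru

# Cache an inventory payload with a 60 s TTL (key name is hypothetical)
redis-cli SET inventory:servers '{"servers": []}' EX 60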


📊 Feature Status

| Feature | Status |
| --- | --- |
| Inventory Engine (RVTools-style, 29 resource types) | ✅ Production |
| Snapshot Automation | ✅ Production |
| VM Restore (side-by-side + replace modes) | ✅ Production |
| Reports (20 types + CSV export) | ✅ Production |
| Customer Provisioning & Domain Management | ✅ Production |
| Metering & Chargeback | ✅ Production |
| Notifications (SMTP + Slack + Teams) | ✅ Production |
| Drift Detection | ✅ Production |
| Ops Assistant — Full-Text Search & Smart Queries | ✅ Production |
| Runbooks (25 built-in, dept visibility, approval workflows, tenant execution) | ✅ Production |
| External Integrations Framework (billing gate, CRM, webhooks) | ✅ Production |
| Dependency Graph: Health Scores, Blast Radius, Delete Impact | ✅ Production |
| Backup & Restore (DB) with Integrity Validation | ✅ Production |
| Inventory Versioning & Diff | ✅ Production |
| AI Ops Copilot | ✅ Production |
| Migration Planner (end-to-end) | ✅ Production |
| Support Ticket System (SLA, auto-tickets, approvals) | ✅ Production |
| Container Restart Alerting | ✅ Production |
| Multi-Region & Multi-Cluster Support | ✅ Production |
| External LDAP / AD Identity Federation | ✅ Production |
| Kubernetes Deployment (Helm + ArgoCD + Sealed Secrets) | ✅ Production |
| Tenant Self-Service Portal | ✅ Production |
| Tenant VM Provisioning (self-service) | ✅ Production |
| Tenant Network & Security Group Management | ✅ Production |
| SLA Compliance Tracking | ✅ Production |
| Operational Intelligence Feed | ✅ Production |
| Client Health Scoring (Efficiency · Stability · Capacity Runway) | ✅ Production |
| Tenant Observer Role (read-only portal access, invite flow) | ✅ Production |
| Role-Based Dashboard Views (Account Manager Portfolio + Executive Health) | ✅ Production |

🧭 Why This Was Built

Built during a serious Platform9 evaluation — stress-testing real operational workflows revealed four gaps no native tooling covered: metadata ownership (no RVTools equivalent for OpenStack), VM restore (no native workflow exists), snapshot automation (no native scheduler), and VMware migration planning (no native RVTools → PCD workflow).

Rather than pause the evaluation, we solved them. The result is pf9-mngt — 670+ commits, 270+ releases, built using AI as a genuine engineering partner alongside regular responsibilities.

Full engineering story and gap analysis: docs/ENGINEERING_STORY.md


📸 Screenshots

Landing Dashboard

Snapshot Compliance Report

VM Inventory

Drift Detection

Operational Intelligence — Insights Feed, SLA & Capacity

Intelligence Management Views

Metering & Chargeback

Support Ticket System

Tenant Portal — Self-Service Infrastructure

Dependency Graph

Snapshot Restore Process


🎬 Video Walkthrough

A 15-minute explainer video walking through the UI and key features:

Watch on YouTube

▶️ PF9 Management System — Full UI Walkthrough (15 min)


βš™οΈ Core Capabilities

πŸ” Inventory & Drift Detection

Persistent inventory outside Platform9 β€” 29 resource types, historical tracking, drift detection across tenants, domain/project mapping, CSV / Excel export.

πŸ“Έ Snapshot Automation & Compliance

Policy-based snapshots (daily / monthly / custom), cross-tenant execution, quota-aware batching, retention enforcement, SLA compliance tracking, full audit visibility.

♻️ Restore Workflows

Side-by-side and replace modes, dry-run validation, full flavor / network / IP / credentials / volume automation, concurrent-restore prevention, complete audit logging.

πŸ—ΊοΈ Migration Planner

RVTools ingestion β†’ VM risk scoring β†’ tenant scoping β†’ network + flavor mapping β†’ cohort design with ease scoring β†’ wave planning with approval gates β†’ PCD auto-provisioning β†’ migration summary with throughput modeling.

🌍 Multi-Region / Multi-Cluster

Register multiple Platform9 control planes and regions. All inventory, reporting, and workers are region-aware. Unified console with one-click context switch. No restart required to add a new cluster.

🎫 Ticketing System

Full incident / change / request lifecycle, SLA tracking, auto-ticketing from health events (health score < 40, drift, graph deletes, runbook failures), department workflows, approval gates.

πŸ“‹ Runbooks

25 built-in operational procedures covering VM recovery, security audits, quota management, capacity forecasting, and tenant offboarding. Parameterized, dry-run support, approval flows, export to CSV / JSON / PDF β€” integrated with the ticket system.

πŸ“Š Metering & Chargeback

Per-VM resource tracking, snapshot / restore metering, API usage metrics, efficiency scoring (excellent / good / fair / poor / idle), multi-category pricing, one-click CSV chargeback export.

📈 SLA Compliance & Business Intelligence

SLA tier templates (Gold/Silver/Bronze/Custom), per-tenant KPI measurement (uptime %, RTO, RPO, MTTA, MTTR, backup success), monthly compliance scoring with breach and at-risk detection.

QBR PDF Generator — one-click Quarterly Business Review reports with configurable sections: executive summary, ROI interventions, health trend, open items, and methodology. Generated on demand per customer via the tenant detail pane (POST /api/intelligence/qbr/generate/{tenant_id}).
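
Triggering a QBR from the command line might look like the sketch below; the endpoint path is taken from this README, while the host, port, and bearer-token header are illustrative assumptions to adapt to your deployment:

# <tenant_id> is a placeholder; the auth header is an assumption, not the documented scheme
curl -X POST "http://localhost:8000/api/intelligence/qbr/generate/<tenant_id>" \
  -H "Authorization: Bearer $PF9_MNGT_TOKEN" \
  -o qbr_report.pdf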

Account Manager Portfolio Dashboard — per-tenant portfolio grid with SLA status badge, vCPU usage bar, critical/leakage insight counts, and KPI strip (healthy/at-risk/breached). Gives account managers a single-screen view of all their customers without switching tenants.

Executive Health Dashboard — fleet-level stacked SLA bar, 6 KPI cards (fleet health %, breached clients, at-risk clients, open critical insights, estimated revenue leakage/month, average MTTR), and narrative sections for leakage and MTTR compliance.

🤖 AI Ops Copilot — Query Layer for the Entire Platform

Not just an LLM integration — a purpose-built operator assistant that queries your live infrastructure in plain language. Ask "which tenants are over quota?", "show drift events from last week", or "how many VMs are powered off on host X?" and get live SQL-backed answers instantly. 40+ built-in intents with tenant / project / host scoping. Ollama backend keeps all data on your network; OpenAI / Anthropic available with automatic sensitive-data redaction.
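
To make the round-trip concrete, a query could look roughly like this sketch. The /api/copilot/query path and request body are hypothetical (this README does not document the Copilot API surface); only the example question comes from the description above:

# Hypothetical endpoint path and payload, shown only to illustrate the flow
curl -X POST "http://localhost:8000/api/copilot/query" \
  -H "Content-Type: application/json" \
  -d '{"question": "which tenants are over quota?"}'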

🏢 Tenant Self-Service Portal (v1.84.0+, latest v1.94.7)

A completely isolated, MFA-protected web portal that gives your customers read and restore access to their own infrastructure — without exposing your admin panel.

  • Security by design: data isolated at the PostgreSQL Row-Level Security layer (not just application code); separate JWT namespace; IP-bound Redis sessions; per-user rate limiting (a minimal RLS sketch follows this list).
  • Observer role (v1.91.0): grant read-only access (portal_role=observer) to stakeholders (account managers, auditors). Observers see all dashboards but are blocked at the API layer from any state-mutating action — runbooks, restore, VM provisioning, security group changes.
  • 10 self-service screens: Health Overview (default), Dashboard, Infrastructure (VMs + disk + IPs + dependency graph), Snapshot Coverage (30-day calendar), Monitoring, Restore Center (side-by-side restore wizard — non-destructive), Runbooks (execute tenant-visible runbooks, dry-run, execution history), Reports, New VM (🚀 Provision), Activity Log.
  • Controlled access: opt-in per Keystone user; you define which OpenStack projects are visible; set MFA policy, role (manager or observer), and runbook visibility per customer.
  • Admin controls: grant/revoke access, toggle observer/manager role, view active sessions, force-revoke, reset MFA, configure per-customer branding (logo, accent colour, portal title), review full audit log — all from the Admin → 🏢 Tenant Portal UI or REST API.
  • Kubernetes-native: dedicated nginx-ingress-tenant Helm controller on its own MetalLB IP — TLS, WAF rules, and rate limits are isolated from the admin ingress.
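
The row-level isolation called out in the first bullet follows PostgreSQL's standard RLS pattern. Below is a minimal sketch of that pattern, not pf9-mngt's actual schema: the servers table and its join to projects are mentioned elsewhere in this README, but the project_id column, the app.current_project_ids setting, and the policy name are hypothetical:

-- Illustrative RLS sketch only; column, setting, and policy names are hypothetical
ALTER TABLE servers ENABLE ROW LEVEL SECURITY;

-- The portal would scope each request, e.g.:
--   SET LOCAL app.current_project_ids = '<uuid-1>,<uuid-2>';
CREATE POLICY tenant_isolation ON servers
  USING (project_id = ANY (
    string_to_array(current_setting('app.current_project_ids', true), ',')::uuid[]
  ));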

📖 See the dedicated Tenant Portal Operator Guide for step-by-step setup, branding, MFA, and Kubernetes configuration.


🧪 Real Scenario — What a Day-2 Operator Actually Does

A tenant reports a critical VM is down. Here's what happens next with pf9-mngt:

  1. Alert fires — health score drops below 40 → auto-ticket created, team notified via Slack/email
  2. Diagnose — Dependency Graph shows the VM's blast radius: which volumes, ports, and downstream services are affected
  3. Restore — launch side-by-side restore: system reconstructs flavor, network, IPs, and credentials automatically; dry-run validates the plan first
  4. Verify — new VM boots alongside the original; operator confirms, original deleted only after sign-off
  5. Audit — full restore log: who triggered it, what mode, duration, outcome — auto-attached to the ticket
  6. Report — SLA compliance report updated; metering records the restore operation for chargeback

Total operator effort: decisions and approvals. The system handles the rest.

This same workflow applies to snapshot SLA breaches, drift events, capacity warnings, and tenant offboarding — all integrated, all audited.


🚀 Quick Start

Prerequisites

  • Docker & Docker Compose (for the complete platform)
  • Python 3.11+ with packages: requests, openpyxl, psycopg2-binary, aiofiles (install one-liner below)
  • Valid Platform9 credentials (service account recommended) — not required in Demo Mode
  • Network access to the Platform9 cluster and compute nodes — not required in Demo Mode
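
For the standalone scripts, the four Python packages listed above install in one step (assuming pip for Python 3.11+ is on your PATH):

pip install requests openpyxl psycopg2-binary aiofiles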

1. Complete Automated Setup (Recommended)

# Clone repository
git clone https://github.com/erezrozenbaum/pf9-mngt.git
cd pf9-mngt

# Configure environment (CRITICAL: No quotes around values)
cp .env.template .env
# Edit .env with your Platform9 credentials

# One-command complete deployment
.\deployment.ps1

# What deployment.ps1 does:
# ✓ Checks/installs Docker Desktop
# ✓ Creates and validates .env configuration
# ✓ Creates required directories (logs, secrets, cache)
# ✓ Installs Python dependencies
# ✓ Builds and starts all Docker containers
# ✓ Initializes PostgreSQL database schema
# ✓ Configures LDAP directory structure
# ✓ Creates automated scheduled tasks
# ✓ Runs comprehensive health checks

# Alternative quick startup (assumes Docker installed)
.\startup.ps1

# Access services after deployment:
# UI:            http://localhost:5173
# API:           http://localhost:8000
# API Docs:      http://localhost:8000/docs
# Monitoring:    http://localhost:8001
# Database:      http://localhost:8080

1b. Kubernetes Deployment

For production environments, pf9-mngt ships a full Helm chart with ArgoCD GitOps support:

# Add the Helm chart
helm repo add pf9-mngt https://erezrozenbaum.github.io/pf9-mngt
helm repo update

# Install with your values
helm install pf9-mngt pf9-mngt/pf9-mngt \
  --namespace pf9-mngt --create-namespace \
  -f k8s/helm/pf9-mngt/values.yaml \
  -f k8s/helm/pf9-mngt/values.prod.yaml

# Or use the supplied kustomize entrypoint
kubectl apply -k k8s/

Full Kubernetes guide including Sealed Secrets, ArgoCD GitOps pipeline, MetalLB IP pools, and day-2 operations: docs/KUBERNETES_GUIDE.md

1c. Demo Mode (No Platform9 Required)

Want to try the full system without a Platform9 environment? Demo mode populates the database with realistic sample data (3 tenants, 35 VMs, 50+ volumes, snapshots, drift events, compliance reports, etc.) and generates a static metrics cache.

git clone https://github.com/erezrozenbaum/pf9-mngt.git
cd pf9-mngt

# The deployment wizard will ask "Production or Demo?" — choose 2 for Demo
.\deployment.ps1

# Or enable demo mode manually on an existing install:
#   1. Set DEMO_MODE=true in .env
#   2. python seed_demo_data.py          # populates DB + generates metrics cache
#   3. docker compose restart pf9_api    # API picks up DEMO_MODE env var

In demo mode the UI shows an amber DEMO banner, the background metrics collector is skipped, and Platform9 credentials are not required.

2. Environment Configuration

# Platform9 Authentication
PF9_USERNAME=your-service-account@example.com
PF9_PASSWORD=your-secure-password
PF9_AUTH_URL=https://your-cluster.platform9.com/keystone/v3
PF9_USER_DOMAIN=Default
PF9_PROJECT_NAME=service
PF9_PROJECT_DOMAIN=Default
PF9_REGION_NAME=region-one

# Database
POSTGRES_USER=pf9
POSTGRES_PASSWORD=generate-secure-password-here
POSTGRES_DB=pf9_mgmt

# Monitoring
PF9_HOSTS=<HOST_IP_1>,<HOST_IP_2>,<HOST_IP_3>
METRICS_CACHE_TTL=60

# Production image version (docker-compose.prod.yml)
PF9_IMAGE_TAG=latest    # Pin to a release tag (e.g. v1.70.0) to lock images from ghcr.io

3. Manual Docker Setup

docker compose up -d
docker compose ps
docker compose logs pf9_api

4. Standalone Script Usage (No Docker Required)

# Inventory export
python pf9_rvtools.py

# Snapshot automation
python snapshots/p9_auto_snapshots.py --policy daily_5 --dry-run
python snapshots/p9_auto_snapshots.py --policy daily_5

# Compliance reporting
python snapshots/p9_snapshot_compliance_report.py --input latest_export.xlsx --output compliance.xlsx

# Policy assignment
python snapshots/p9_snapshot_policy_assign.py --config snapshots/snapshot_policy_rules.json --dry-run

🔧 Configuration

Snapshot Policies

# Daily snapshots with 5-day retention
openstack volume set --property auto_snapshot=true \
                    --property snapshot_policies=daily_5 \
                    --property retention_daily_5=5 \
                    <volume-id>

# Multiple policies on one volume
openstack volume set --property auto_snapshot=true \
                    --property snapshot_policies=daily_5,monthly_1st \
                    --property retention_daily_5=5 \
                    --property retention_monthly_1st=12 \
                    <volume-id>
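
Bulk policy assignment (p9_snapshot_policy_assign.py, shown under Standalone Script Usage above) reads its rules from snapshots/snapshot_policy_rules.json. The README does not spell out that file's schema, so the fragment below is only an illustrative guess at the kind of mapping such a rule file expresses; every field name is hypothetical:

{
  "_note": "illustrative sketch only; field names are hypothetical",
  "rules": [
    {
      "match": { "project_name": "ORG1" },
      "policies": ["daily_5", "monthly_1st"],
      "retention": { "daily_5": 5, "monthly_1st": 12 }
    }
  ]
}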

Scheduler Worker (v1.62.0+)

# Check scheduler status
docker logs pf9_scheduler_worker --tail 30

# Trigger metrics collection manually
docker exec pf9_scheduler_worker python host_metrics_collector.py --once

# Trigger RVTools collection manually
docker exec pf9_scheduler_worker python pf9_rvtools.py

Metrics collection and RVTools inventory now run inside the pf9_scheduler_worker container automatically. No Windows Task Scheduler setup is required.


πŸ“ Project Structure

pf9-mngt/
β”œβ”€β”€ api/                          # FastAPI backend (170+ endpoints)
β”œβ”€β”€ tenant_portal/                # Tenant self-service portal service (port 8010)
β”œβ”€β”€ pf9-ui/                       # React 19 + TypeScript frontend
β”œβ”€β”€ monitoring/                   # Prometheus metrics service
β”œβ”€β”€ snapshots/                    # Snapshot automation engine
β”‚   β”œβ”€β”€ p9_auto_snapshots.py      # Cross-tenant snapshot automation
β”‚   β”œβ”€β”€ snapshot_service_user.py  # Service user management
β”‚   β”œβ”€β”€ p9_snapshot_compliance_report.py
β”‚   β”œβ”€β”€ p9_snapshot_policy_assign.py
β”‚   └── snapshot_policy_rules.json
β”œβ”€β”€ db/                           # PostgreSQL schema + migrations
β”œβ”€β”€ backup_worker/                # Scheduled backup service
β”œβ”€β”€ metering_worker/              # Resource metering service
β”œβ”€β”€ search_worker/                # Full-text search indexer (Ops Assistant)
β”œβ”€β”€ notifications/                # Email notification service
β”œβ”€β”€ ldap/                         # OpenLDAP configuration
β”œβ”€β”€ docs/                         # Full documentation suite
β”œβ”€β”€ pf9_rvtools.py                # RVTools-style inventory export
β”œβ”€β”€ host_metrics_collector.py     # Prometheus metrics collection
β”œβ”€β”€ seed_demo_data.py             # Demo mode: populate DB + metrics cache
β”œβ”€β”€ p9_common.py                  # Shared utilities
β”œβ”€β”€ docker-compose.yml            # Full stack orchestration
β”œβ”€β”€ deployment.ps1                # One-command deployment
β”œβ”€β”€ startup.ps1                   # Quick start script
└── .env.template                 # Environment configuration template

📚 Documentation

| Document | Purpose |
| --- | --- |
| Deployment Guide | Step-by-step deployment instructions |
| Admin Guide | Day-to-day administration reference |
| Architecture | System design, trust boundaries, data model, auth flow |
| API Reference | Complete API endpoint documentation |
| Security Guide | Security model, authentication, encryption |
| Security Checklist | Pre-production security audit checklist |
| Restore Guide | Snapshot restore feature documentation |
| Snapshot Automation | Snapshot system design and configuration |
| Snapshot Service User | Service user setup and troubleshooting |
| VM Provisioning Setup | Includes provisionsrv service user setup (Runbook 2) |
| Quick Reference | Common commands and URLs cheat sheet |
| Kubernetes Deployment | Helm chart, ArgoCD GitOps, Sealed Secrets, day-2 ops |
| Linux Deployment | Running pf9-mngt on Linux instead of Windows |
| Multi-Region & Multi-Cluster Guide | MSP operator guide: onboarding clusters, Region Selector UI, per-region filtering, workers, migration planning |
| Support Ticket System Guide | Full reference for the ticket lifecycle, API, SLA, email templates, and auto-tickets |
| Tenant Portal Guide | Tenant self-service portal: setup, branding, MFA, access management, Kubernetes deployment |
| CI/CD Guide | CI pipeline, release process, and Docker image publishing |
| Engineering Story | Platform9 evaluation background and the four operational gaps pf9-mngt solves |
| Features Reference | Complete technical deep-dive: auth, inventory, snapshots, restore, runbooks, tickets, copilot, migration planner |
| Contributing | Contribution guidelines |

🆘 Troubleshooting

Common issues and solutions are covered in docs/ADMIN_GUIDE.md.

Quick commands:

  • Container logs: docker logs <container> --tail 50
  • Monitoring issues: .\fix_monitoring.ps1
  • Force inventory sync: docker exec pf9_scheduler_worker python pf9_rvtools.py
  • Database reset: docker compose down -v && docker compose up -d

❓ FAQ

Q: Does this replace the Platform9 UI? No — it is a complementary engineering console adding operational workflows not present in the native UI.

Q: Is this an official Platform9 product? No. Independent project, not endorsed by or affiliated with Platform9 Systems, Inc.

Q: Can I try this without a Platform9 environment? Yes — choose Demo Mode in deployment.ps1 or set DEMO_MODE=true in .env.

Q: Can I run this on Kubernetes? Yes — fully supported since v1.82.0. See docs/KUBERNETES_GUIDE.md.

Q: What are the minimum hardware requirements? A Docker host with at least 4 GB RAM, 2 CPU cores, and network access to your Platform9 region endpoints.

For questions on authentication, RBAC, LDAP/AD, snapshots, and restore see docs/ADMIN_GUIDE.md.


πŸ• Recent Major Releases

Critical decimal type conversion hotfix — v1.94.7

v1.94.7 — Critical hotfix resolving TypeError in chargeback calculations caused by decimal.Decimal and float type mismatch. Ensures proper type conversion for all database numeric values in financial calculations.

Critical hotfix for chargeback endpoint — v1.94.6

v1.94.6 — Critical hotfix resolving 500 internal server errors in the chargeback-summary endpoint caused by a missing datetime import. Ensures reliable metering functionality across all environments including Kubernetes deployments.

Enhanced chargeback system and metering fixes — v1.94.5

v1.94.5 — Complete chargeback overhaul with multi-resource support: (1) Critical chargeback bug fix — resolved zero-cost calculations by fixing the metering worker to capture VM flavor data when the monitoring service returns incomplete information. (2) Comprehensive cost calculations — support for all resource types (VMs, Storage, Network, Snapshots) with period-based analysis (7d, 30d, 90d, custom ranges). (3) Multi-currency support — enhanced API with ILS pricing configuration and proper currency alignment. (4) Database fallback logic — the metering worker automatically falls back to database queries when the monitoring service lacks flavor data.

Dark mode fixes and chargeback enhancements — v1.94.3

v1.94.3 — Comprehensive fixes and enhancements: (1) Dark mode dependency graphs — improved edge visibility in Kubernetes dependency graphs using lighter CSS values for better contrast. (2) Enhanced chargeback system — new per-VM details endpoint with cost attribution, currency selection support, and expanded VM state collection including stopped/suspended VMs. (3) Tenant portal UI modernization — enhanced dashboard and overview screens with improved loading states and theme-aware styling. (4) CI package synchronization — resolved npm ci failures by regenerating package-lock.json with missing dependencies.

Enhanced theme system and UI polish — v1.94.2

v1.94.2 — UI consistency and theme improvements: (1) Docs layout optimization — removed the unnecessary right details panel from the documentation page for full-width content display. (2) Dark mode dependency graphs — fixed edge visibility issues using theme-aware CSS variables instead of hard-coded colors. (3) TypeScript build fixes — resolved import syntax issues in tenant-ui for React 19.2+ compatibility. (4) Theme toggle component — added modern theme switching UI with accessibility features. (5) Unified theme system — shared CSS custom properties between admin and tenant portals for consistent styling.

UI layout and dark mode fixes — v1.94.1

v1.94.1 — Bug fixes and dark mode polish: (1) Sidebar scroll fixed — body/root/app-shell locked to height: 100vh; only the page content area scrolls. (2) Header/sidebar dividers aligned — brand area corrected to 64 px to match the header height. (3) GlobalHealthBar now loads — corrected API URL to /dashboard/health-summary (was returning 404). (4) Dark mode improvements — metric bar tracks visible, minimum fill width on low-utilisation bars, health stat boxes have contrast, card-to-background separation improved, header and sidebar separator lines visible.

Enterprise dashboard overhaul — v1.94.0

v1.94.0 — (1) Grafana-class dark palette: deep navy/slate background with a cyan-sky primary accent, replacing the previous indigo palette. All CSS tokens are fully separated between light and dark themes. (2) Inter font adopted: Google Fonts Inter (weights 400–700) throughout the entire UI. (3) GlobalHealthBar: persistent 32 px top-of-page strip showing live VM counts, host count, and critical/warning counts — refreshes every 30 s. (4) Recharts charts: the VM Hotspots card now renders horizontal BarCharts with colour-coded cells; Top Hosts shows grouped CPU+Memory bars. (5) 7-day sparkline in the System Health card using a new dashboard_health_snapshots table populated daily by the scheduler. (6) StatusBadge component for consistent status pill rendering across the UI. (7) Skeleton loading states replace spinner/text placeholders in the dashboard and Insights tab. (8) Table density reduced for a more compact, information-dense layout.

Monitoring push-cache, host network throughput, Copilot intents, dashboard polish — v1.93.47

v1.93.47 — Fixed the root cause of several "allocation-based usage" and N/A metric issues across all surfaces. The monitoring pod ran with hostNetwork: true, making its K8s Service endpoint resolve to a physical node IP (172.17.30.164). When kube-proxy on pf9-worker02 tried to DNAT ClusterIP traffic to that node IP it timed out, making the monitoring service unreachable from every pod on worker02 (tenant-portal, API pod, all workers). Fixed by: (1) disabling hostNetwork — the CNI masquerade rule NATs pod traffic through the node IP for non-pod destinations, so hypervisor scrapes continue to work without hostNetwork; (2) making the tenant portal proxy all metrics through the main API (pf9-api:8000) instead of calling pf9-monitoring:8001 directly — the API is the single gateway for live metrics; (3) adding API→monitoring egress in the NetworkPolicy; (4) fixing three wrong default MONITORING_SERVICE_URL fallbacks (pf9_monitoring with underscore → pf9-monitoring with hyphen) in main.py and dashboards.py.

Monitoring pod node placement — v1.93.45

v1.93.45 — Fixed the monitoring pod landing on the wrong K8s node (pf9-worker02, 172.17.30.165), which has no route to the hypervisor subnet 172.17.95.0/24. All Prometheus scrapes timed out, leaving the cache at source: database with storage, memory, and network fields all None. Added nodeSelector: kubernetes.io/hostname: pf9-worker01 to the monitoring Helm deployment to pin the pod to the node that has the route.

Dashboard live metrics in K8s — v1.93.44

v1.93.44 — Fixed Dashboard VM Hotspots, Host Utilization, and Health Summary avg CPU/memory showing allocation-based estimates instead of real Prometheus values in Kubernetes. The root cause was _load_metrics_cache() in dashboards.py only searching for a local cache file (written by the monitoring service via a shared Docker volume). In K8s there is no shared volume between the API pod and the monitoring pod, so the function always returned None and all three dashboard widgets fell back to DB allocation data. Fixed by adding an HTTP fallback: when no local cache file is found, the function calls GET /metrics/vms and GET /metrics/hosts on the monitoring service. Added the MONITORING_SERVICE_URL=http://pf9-monitoring:8001 env var to the API pod Helm deployment (it was already set on tenant-portal but missing from the API pod).

Live VM metrics, restore job cleanup, metering VM count, docs highlighting — v1.93.43

v1.93.43 — (1) Fixed live VM metrics (storage/memory/network all None): enabled hostNetwork: true on the monitoring pod so it uses the K8s node IP instead of the blocked pod CIDR, allowing it to reach the libvirt-exporter on hypervisors. (2) Added an SSH+virsh fallback so VM metrics can be collected directly via SSH when the exporter is unreachable. (3) Added restore job deletion: DELETE /restore/jobs/{job_id} endpoint + Clear button in the Restore Audit table for PLANNED/FAILED/INTERRUPTED/CANCELED/SUCCEEDED jobs. (4) Added auto-timeout for stale restore jobs (PLANNED > 2h, RUNNING > 6h → FAILED). (5) Fixed metering overview VM count: now uses the live servers table instead of historical metering records. (6) Added syntax highlighting (highlight.js + github-dark theme) to the in-app Docs viewer.

Hypervisor graph crash, volume assignments, storage display — v1.93.42

v1.93.42 — (1) Fixed "Error: Graph query failed" when opening a hypervisor dependency graph: _fetch_host() was referencing columns that don't exist on the hypervisors table (fields live in raw_json). (2) Fixed the Volume Assignments tab showing empty even when volumes are assigned via Cinder metadata: the assignments endpoint now merges DB-table rows with Cinder-metadata-enrolled volumes. (3) Improved the storage cell display from an ambiguous "—" to "N/A" / "no live data" / "X GB provisioned", and fixed the Storage Used column header tooltip to render on all browsers.

UX fixes: pagination, graph depth, hypervisors panel, metering filters — v1.93.41

v1.93.41 — (1) Fixed Snapshot Audit Trail pagination stuck on page 1 when navigating pages. (2) Domain Dependency Graph now opens at depth 3 (showing domain → tenants → VMs/volumes) instead of stopping at depth 2. (3) Added Hypervisors detail panel with full host info and a dependency graph shortcut. (4) Metering tab domain/project filters now reset when switching sub-tabs, preventing filter carry-over. (5) Snapshot PolicyForm apiFetch migration completed — the create/edit form was missed in the earlier refactor. (6) Improved empty-state messages on Volume Assignments and Monitoring storage column.

Auth fixes, SLA 503, VM metrics, capacity forecast — v1.93.40

v1.93.40 — (1) Fixed HTTP 401 on System Log and API Metrics tabs: cookie-first auth added to both backend handlers and the frontend now uses apiFetch with proper credential passing. (2) Fixed Snapshot Policy Assignments showing no data: raw fetch calls with a fake Bearer token replaced by apiFetch throughout SnapshotPolicyManager. (3) Fixed SLA Compliance Summary returning HTTP 503: unhandled DB exception now caught and returns a graceful 200 with empty summary. (4) Fixed VM Resource Metrics showing a misleading hypervisor-level CPU ratio instead of per-VM usage: DB fallback now returns null with a warning banner. (5) Fixed Capacity Forecast showing no data on new installs: minimum data-point threshold lowered to 2 days and the metering worker seeds an initial quota snapshot on startup. (6) Improved empty-state messages on all Insights tabs to explain data requirements.

volumes:read 403 fix, monitoring Unknown fields, dashboard storage — v1.93.39

v1.93.39 — (1) Fixed HTTP 403 on Change Management, Drift Detection, and Hypervisors tabs for admin/superadmin users: root cause was a corrupt idx_role_permissions_unique index in PostgreSQL; resolved with REINDEX TABLE role_permissions. (2) Admin Monitoring no longer shows "Unknown" for VM IP, Domain, and Tenant — the monitoring service bootstrap cache was discarding identity metadata; now preserved. (3) Dashboard VM Hotspots storage column no longer shows only "N/A" — shows "Provisioned: X GB" when live usage is unavailable.

Release pipeline fix — v1.93.38

v1.93.38 — Release pipeline fix: the v1.93.37 git tag was pushed manually before the GitHub Actions Release workflow ran, causing all build/publish jobs (Docker images, Helm chart, deploy repo update) to be skipped. Version bumped to re-run the full pipeline correctly.

Admin UI fixes — v1.93.37

v1.93.37 — Fixes five admin UI regressions: (1) Flavors "VMs Using" now counts all VMs via a server-side SQL subquery instead of filtering the paginated page. (2) Change Management browser hang fixed by removing large inventory arrays from the loadRecentChanges effect dependency list. (3) Metering tab now enriches stale vm_ip/domain/project_name fields from a live DB JOIN. (4) Tenant portal chargeback no longer shows "unknown" project/flavor by joining servers → flavors → projects. (5) technical role can now access Insights and SLA tabs (sla:read and intelligence:read grants added via migration). Also includes VM-level Prometheus metrics in the inventory table.

Monitoring live metrics now working (NetworkPolicy fix) — v1.93.36

v1.93.36 — The pf9-monitoring Kubernetes NetworkPolicy was missing egress rules for ports 9177 (libvirt-exporter) and 9388 (node-exporter), so every Prometheus scrape against the PF9 compute nodes (172.17.95.x) silently timed out. The monitoring service was permanently stuck serving DB allocation estimates. Added egress rules for TCP 9177 and 9388 so the monitoring pod can now collect real CPU/memory/storage metrics from the hypervisor exporters. Also fixed the tenant portal bypassing Gnocchi (Platform9 native telemetry) when the monitoring cache contained allocation data.

Monitoring storage 100% and wrong banner fix — v1.93.35

v1.93.35 — Storage bar no longer shows 100% when running on DB-fallback metrics (set storage_used_gb=null). Monitoring banner now correctly shows "allocation-based" instead of "live metrics" when hypervisor exporters are unreachable.

Capacity Runway false notice fix — v1.93.34

v1.93.34 — Capacity Runway "no quotas configured" notice no longer fires for tenants that have quotas. quota_configured is now sourced from project_quotas (actual OpenStack quota ceilings) rather than metering_quotas (whose quota columns are NULL in practice).

Monitoring 401 fix, capacity runway notice, test resilience — v1.93.33

v1.93.33 — Monitoring worker bootstrap no longer gets 401 (added /internal/monitoring/vm-metrics endpoint); capacity runway "no quotas" notice no longer fires when quotas are configured but usage is flat; live integration tests now skip gracefully when the local stack is not running.

Tenant portal live metrics, health dial guidance — v1.93.32

v1.93.32 — Current Usage tab now shows real Prometheus/libvirt metrics instead of allocation estimates (libvirt domain-name → OpenStack UUID resolution fixed); Efficiency and Capacity Runway health dials gain explanatory tooltips and contextual advisory text when scores are low.

rvtools run status fix, history table schema fixed — v1.93.31

v1.93.31 — rvtools runs are no longer recorded as failures on every run; a duplicate-key race in the project-quota upsert is isolated with a savepoint; missing columns added to five *_history tables to restore drift/history tracking.

API error hardening, SVG upload restriction, docs validation — v1.93.30

v1.93.30 — Raw exception strings removed from all HTTP 500 responses; SVG removed from accepted branding upload types to prevent stored XSS; docs filename regex tightened to alphanumeric-only.

Security hardening, image pinning, alerting & log aggregation — v1.93.29

v1.93.29 — Branding URLs restricted to safe schemes; autocomplete attributes on all password fields; all Docker base images pinned to exact patch versions; optional pre-migration database backup; migration rollback guidance; Prometheus alerting rules for pods, API, DB pool and workers; Loki+Promtail log aggregation.

Code hardening: timeouts, chmod, SHA256, template validation, token cleanup, nginx rate limit — v1.93.28

v1.93.28 — Worker timeouts configurable via env vars; backup files chmod 0600; SHA256 in cache keys; Jinja2 template dir validated at startup; expired password reset tokens purged nightly; dev nginx rate-limited.

K8s hardening: ResourceQuota, PDB, HPA, imagePullPolicy — v1.93.27

v1.93.27 — Namespace ResourceQuota caps CPU/memory/pods; PodDisruptionBudgets protect API/portal/monitoring during node drains; HPA scaffolding for auto-scaling (disabled until metrics-server confirmed); imagePullPolicy: Always ensures security patches are always fetched.

🔒 K8s image pinning: Postgres + Redis — v1.93.26

v1.93.26 — Completes M4 for Kubernetes: values.yaml Postgres and Redis tags pinned to postgres:16.8-alpine and redis:7.4.3-alpine. No data loss — Postgres data persists in a PVC; Redis is in-memory only.

🔒 Security fixes: console leaks, image pinning, CSP, CSRF — v1.93.25

v1.93.25 — Five medium-severity security fixes. M1: all console.* calls stripped from pf9-ui production builds via Vite esbuild drop. M4: third-party Docker images pinned to exact versions (postgres:16.8-alpine, redis:7.4.3-alpine, osixia/phpldapadmin:0.9.0). M5: Content-Security-Policy and Permissions-Policy headers added to dev nginx config, matching prod. M6: X-Requested-With: XMLHttpRequest added to all mutating API requests in both frontends, defeating simple form-based CSRF. M8: unsafe-inline removed from style-src in prod nginx CSP.

🔒 Security fixes: login enumeration, TOTP rate limit, HTML escape — v1.93.24

v1.93.24 — Three medium-severity security fixes. M2: tenant-ui login form now returns the same generic message for HTTP 401 and 403, preventing username enumeration. M3: all MFA endpoints (/verify, /verify-setup, /disable) limited to 3/minute; /verify adds Redis-based account lockout after 10 consecutive failures for 15 minutes. M7: db_writer.py alert email builders apply html.escape() defensively to all interpolated values.

🩹 Hotfix: tenant-ui CMD reads from wrong template path — v1.93.23

v1.93.23 — Hotfix for v1.93.22 regression. The COPY destination was fixed to /etc/nginx/templates/ but the CMD still read from /etc/nginx/conf.d/tenant-ui.conf.template, causing no such file at startup. Fix: CMD now reads from /etc/nginx/templates/tenant-ui.conf.template and writes the rendered config to /etc/nginx/conf.d/tenant-ui.conf (the writable emptyDir).

🩹 Hotfix: tenant-ui CrashLoopBackOff with readOnlyRootFilesystem — v1.93.22

v1.93.22 — Hotfix for v1.93.21 regression. The nginx entrypoint envsubst script writes a processed config to /etc/nginx/conf.d/ at startup; with readOnlyRootFilesystem: true this caused an immediate crash. Fix: template moved from conf.d/ to nginx/templates/ in the tenant-ui Dockerfile, and a new nginx-conf emptyDir volume added at /etc/nginx/conf.d in the K8s Deployment.

🔒 Security hardening: TLS warnings, backup checksums, readOnlyRootFilesystem, LDAP conn leaks, circuit breakers — v1.93.21

v1.93.21 — Security hardening release. H4 TLS bypass warnings: ldap_sync_worker and api/auth.py now log a WARNING whenever verify_tls_cert=False, making insecure LDAP connections visible without blocking operation. H7 Backup integrity checksums: The backup worker computes a SHA-256 checksum of every .sql.gz file immediately after writing it and stores the hex digest in backup_history.integrity_hash; the restore endpoint verifies the on-disk file before queuing a restore (HTTP 409 on mismatch). New migration: db/migrate_v1_93_21.sql. H8 readOnlyRootFilesystem: All 15 Kubernetes Deployment templates now set readOnlyRootFilesystem: true; each service has /tmp (and nginx cache paths) mounted as emptyDir. H10 LDAP connection leaks: api/auth.py authentication and external LDAP bind paths now guarantee unbind_s() via try/finally. H15 Database circuit breaker: All 9 background workers have a circuit-breaker wrapper that opens after 3 consecutive DB failures and backs off 60 s. 582+ unit tests pass, 0 HIGH Bandit findings.

🩹 Hotfix: K8s JWT TTL + metrics key — v1.93.19

v1.93.19 — Kubernetes config hotfix. JWT TTL corrected: values.yaml had accessTokenExpireMinutes: 480; reduced to 60 to match the Docker Compose default from v1.93.18. Metrics endpoint protection wired into K8s: METRICS_API_KEY is now injected from the pf9-metrics-secret K8s Secret into the API pod; sealed secret committed to the private deploy repo. Cluster check bug fixed: check_cluster.py was false-PASSing the METRICS_API_KEY check when the key was absent ("CONFIGURED" in "NOT_CONFIGURED" matched). 568 unit tests pass, 0 HIGH Bandit findings.

🔒 Auth hardening — v1.93.18

v1.93.18 — Security hardening release. JWT jti revocation: Tokens now include a unique jti claim; logout stores the jti in Redis for immediate invalidation with the DB session as defence-in-depth. Shorter token lifetimes: JWT default TTL reduced 90 → 15 min, MFA challenge TTL 5 → 2 min. Tighter rate limits: Login endpoint 10 → 5/min, password reset 5/min → 3/hour. Metrics endpoint protection: /metrics and /worker-metrics require the X-Metrics-Key header when METRICS_API_KEY is configured (constant-time comparison). Log hygiene: Password reset token no longer logged in plaintext (gated behind DEBUG_SHOW_RESET_TOKEN=true). Secret file permissions: Write bits on secret files now raise PermissionError instead of a warning. Structured logging: config_validator.py outputs via the logging module. 581 unit tests pass, 0 HIGH Bandit findings.

🩹 Hotfix: migration job unblocked — v1.93.17

v1.93.17 — Fixed the pf9-db NetworkPolicy missing db-migrate in its allowed ingress sources. The Helm post-upgrade migration job was stuck in Init:0/1 because the new NetworkPolicy blocked the init container's DB connectivity check.

🔒 NetworkPolicies enabled — v1.93.16

v1.93.16 — NetworkPolicies activated in production. All 16 service-level NetworkPolicies are now enforced in the pf9-mngt namespace following successful --dry-run=server validation against the live cluster. Default-deny between all services except explicitly permitted traffic paths.

πŸ”’ Security hardening β€” v1.93.15

v1.93.15 β€” Security hardening release. Kubernetes NetworkPolicies: Each service now has a dedicated NetworkPolicy with default-deny semantics (disabled by default β€” enable with networkPolicy.enabled=true after dry-run verification). Container security contexts: allowPrivilegeEscalation: false and capabilities.drop: [ALL] added to all application containers; pod-level seccompProfile: RuntimeDefault added to all 15 workloads. Ingress TLS enforcement and rate limiting: Both admin and tenant-UI ingresses now enforce HTTPS redirect and carry rate-limit annotations. 570 unit tests pass (32 new K8s Helm security tests), 0 HIGH Bandit findings.

πŸ”’ Security hardening β€” v1.93.14

v1.93.14 β€” Security fix release. Internal route authentication: The RBAC middleware now validates X-Internal-Secret for all /internal paths instead of passing them through without any check. Upload size limit: POST /onboarding/upload now caps reads at 10 MB and returns HTTP 413 for oversized payloads. Notification digest cap: The per-user notification digest bucket is now capped at 1000 events in a single SQL statement β€” oldest events are trimmed when the cap is reached. Redis authentication: All services (API, workers, tenant portal) now support REDIS_PASSWORD; when set, Redis starts with --requirepass. Kubernetes deployments read the password from a K8s secret. 538 unit tests pass, 0 HIGH Bandit findings.

πŸ”’ Security hardening β€” v1.93.13

v1.93.13 β€” Security fix release. Cache invalidation bug: The cache invalidate() helper was building a different key than wrapper() (missing region_id segment), making every invalidation call a silent no-op. Fixed to use the exact same key structure. HTML injection in welcome email: The inline fallback provisioning email template interpolated user-supplied values (username, domain_name, project_name, etc.) directly into HTML without escaping. All values now use html.escape(). Backup path traversal protection: Backup file deletion now validates the resolved absolute path is within NFS_BACKUP_PATH before calling os.remove(). SSRF prevention in PSA webhooks: Webhook URLs targeting private, loopback, link-local, or reserved IP ranges are now rejected at input validation time. 538 unit tests pass, 0 HIGH Bandit findings.

🩹 Storage % + Efficiency + Capacity Runway display fixes β€” v1.93.12

v1.93.12 β€” Bug-fix release. Storage bar 100% for all VMs: The DB allocation fallback set storage_used_gb = flavor_disk_gb and storage_total_gb = flavor_disk_gb, making the percentage always 100%. Fixed by setting storage_used_gb = None so no misleading bar is drawn β€” the allocated disk size in GB is still shown as a label. Health Overview Efficiency=0: The internal client-health API received the tenant's project UUID but metering_efficiency stores human-readable project names (e.g. ORG1); the UUID matched zero rows, returning COALESCE(AVG, 0) = 0. Fixed by resolving the UUID via the projects table before the query. Capacity Runway red "0": When quotas are not configured, capacity_runway_days is correctly null but the HealthDials component mapped null β†’ 0, rendering a red ring with "no data". Now renders a neutral grey empty ring with "no quota configured" label. 538 unit tests pass, TypeScript clean.

✨ Platform9 Gnocchi Real Telemetry + CI Docker Fix β€” v1.93.10

v1.93.10 β€” Feature + fix release. Real VM metrics from Platform9 Gnocchi: The tenant portal Current Usage tab now queries Platform9's Gnocchi telemetry API for real CPU %, resident memory MB, disk IOPS, and network MB/s β€” the same values visible in Platform9's own resource-utilization UI. Uses existing PF9_AUTH_URL/PF9_USERNAME/PF9_PASSWORD credentials. Fires as step 3 in the metrics fallback chain (after the monitoring-service cache, before DB allocation estimates). Token caching, parallel per-VM queries via asyncio.gather, and graceful degradation to DB allocation when Ceilometer is not installed. New "Live Platform9 telemetry" UI badge. CI Docker build fix: Release pipeline tenant-portal and API images were taking 10+ minutes under QEMU ARM64 due to RUN chown -R 1000:1000 /app recursively chown-ing thousands of pip package files through emulated syscalls; switched to COPY --chown=1000:1000 with targeted directory chown only. 538 unit tests pass, TypeScript clean.

🩹 Monitoring Current Usage β€” DB Fallback Fix β€” v1.93.9

v1.93.9 β€” Bug-fix release. Current Usage "No metrics collected yet": The DB allocation fallback queried jsonb_array_elements(vol.raw_json->'attachments') to resolve disk size from attached volumes; if any volume row stored attachments as a non-array JSONB value the entire query aborted silently, returning an empty VM list. Guarded with jsonb_typeof() = 'array' so malformed rows are skipped. Also broadened the server filter from status = 'ACTIVE' to status NOT IN ('DELETED','SOFT_DELETED') so SHUTOFF/PAUSED/ERROR VMs also appear with allocation data. Fix applied in both tenant_portal/metrics_routes.py and api/main.py. 538 unit tests pass, TypeScript clean.

🩹 Layout Flicker + Monitoring 500 + CI Fix β€” v1.93.8

v1.93.8 β€” Bug-fix release. Admin UI flicker: After the v1.93.6 lazy-init fix, navLoading still started false, causing the legacy flat tab bar to flash before GroupedNavBar loaded; navLoading now initialises to true on authenticated page loads so the sidebar stays invisible until nav data arrives. Monitoring availability 500: last_seen was only assigned inside the legacy else branch but used unconditionally β€” in Kubernetes (real OpenStack statuses) the else was never reached, producing NameError β†’ HTTP 500. CI: test_T01_branding_via_proxy failed on the dev branch because httpx.RemoteProtocolError (server drops connection) was not caught alongside ConnectError. 524 unit tests pass, TypeScript clean.

🩹 Monitoring Status + Usage Bars Fix β€” v1.93.7

v1.93.7 β€” Bug-fix release. Monitoring Availability: All VMs showed "Down" despite being ACTIVE because status was derived from last_seen_at staleness (inventory sync ~2.5h lag); now reads servers.status directly so ACTIVE VMs show "Up" immediately. Monitoring Current Usage: Kubernetes deployments showed static text (1 vCPU, 2 GB) instead of usage bars because the DB fallback returned null percentages; now computes CPU/RAM as VM's share of hypervisor capacity with real progress bars. 524 unit tests pass, TypeScript clean.

🩹 Flicker Fix + Graph Labels + VM Detail Usage β€” v1.93.6

v1.93.6 β€” Bug-fix release. Flicker (Admin UI): On browser refresh isAuthenticated started as false so the login screen flashed before the main app mounted; fixed with lazy useState initialisers that read localStorage synchronously on the first render. Tenant portal auth also hardened: useAuth now initialises to a restoring phase when a token is present, showing a full-screen spinner until apiMe() resolves. Dependency Graph: Node labels were hard-truncated at 12 characters (column spacing 160px); widened to 210px and raised threshold to 18 characters, plus SVG <title> tooltip for hover. VM Detail Panel: "Current Usage" section was hidden when no live metrics were available; now always visible with flavor allocation values as fallback. 524 unit tests pass, 0 HIGH Bandit findings, TypeScript clean.

🩹 VM Provisioning QEMU Channel Fix + Monitoring Allocation View β€” v1.93.5

v1.93.5 β€” Bug-fix release. VM Provisioning: Linux images were never patched with hw_qemu_guest_agent=yes before VM creation; Nova/libvirt therefore never added the virtio-serial channel device to the domain XML, making changePassword always return 409 even for VMs where cloud-init successfully installed the agent. Fixed: provisioning loop now patches Linux images with hw_qemu_guest_agent=yes (same pattern as Windows hw_disk_bus/hw_firmware_type patching). Monitoring: Current Usage cards showed β€” when using the DB allocation fallback; cards now show allocated vCPU/RAM/disk with an info banner. Runbooks: Reset VM Password 409 now shows distro-specific install instructions instead of the generic note; pre-emptive Guest Agent Warning removed from all-Linux flow. 524 unit tests pass, 0 HIGH Bandit findings, TypeScript clean.
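
A hedged sketch of the image patch using openstacksdk; the call pattern mirrors the description above, but the real provisioning loop (image selection, Windows property handling) is more involved:

```python
import openstack

def ensure_guest_agent_property(conn: openstack.connection.Connection,
                                image_id: str) -> None:
    image = conn.image.get_image(image_id)
    props = image.properties or {}
    if props.get("hw_qemu_guest_agent") != "yes":
        # Nova/libvirt only adds the virtio-serial channel device when this
        # property is present at boot time, so patch before VM creation.
        conn.image.update_image(image, hw_qemu_guest_agent="yes")
```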

🩹 New VM Portal Sync + SLA Compliance + 4 More Fixes β€” v1.93.4

v1.93.4 β€” Bug-fix release. Tenant portal: New VMs created after a fresh RVTools sync were invisible in the tenant portal because upsert_servers() never set region_id (left NULL); the tenant portal query WHERE region_id = ANY(%s) silently excluded them β€” fixed by assigning the default region in db_writer.py and backfilling existing NULL rows on startup. My Infrastructure status filter (Running/Stopped/Error dropdown) showed "No VMs found" for all specific selections because the option values ("running", "stopped") didn’t match the OpenStack DB values ("ACTIVE", "SHUTOFF"). Snapshot SLA Compliance card β€” clicking a tenant row showed nothing for compliant tenants (the warnings.length > 0 condition blocked the details row); now always shows either the issues list or an β€œAll volumes compliant” confirmation. Also: monitoring DB fallback when cache empty, chargeback 500 fix, panel widened to 680px, snapshot calendar "OK" vs "success" comparison. 538 unit tests pass, 0 HIGH Bandit findings, TypeScript clean.
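
The region_id repair is a two-part fix: assign the default region on every upsert, and backfill rows written before the fix. A minimal sketch of the startup backfill (schema names from the changelog; the default-region lookup and helper shape are assumed):

```python
BACKFILL_SQL = """
    UPDATE servers
    SET region_id = %(default_region)s
    WHERE region_id IS NULL
"""

def backfill_region(cur, default_region: str) -> int:
    # One-shot repair on startup; idempotent, so safe to run every boot.
    cur.execute(BACKFILL_SQL, {"default_region": default_region})
    return cur.rowcount  # number of rows repaired
```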

🩹 4 Fixes + Chargeback: Tenant Portal β€” v1.93.3

v1.93.3 β€” Bug-fix + feature patch. Tenant portal: VM Health Quick Fix result panel rendered nested check objects as [object Object] β€” replaced with a recursive renderer. Reset VM Password crashed on volume-booted VMs ('str' object has no attribute 'get') and always reported OS type as unknown β€” fixed with isinstance guard and os_distro/image-name heuristics. Monitoring Current Usage was always empty in Kubernetes because _load_metrics_cache() returned early on an empty monitoring response before the DB allocation fallback could run. New Chargeback screen shows per-VM cost estimates scoped to the tenant's own projects, with currency selector, period picker, pricing-basis detail, and a clear estimation disclaimer. 538 unit tests pass, 0 HIGH Bandit findings, TypeScript clean.

🩹 7 Bug Fixes: Tenant Portal + Migration Planner Analysis β€” v1.93.2

v1.93.2 β€” Bug-fix release. Tenant portal (6 fixes): VM Health Quick Fix runbook sent vm_name instead of UUID (server_id param key) β†’ Nova 404, now always sends UUID. Reset VM Password result panel rendered nested objects as [object Object] β€” added striped key-value renderer with URL linkification. VM Rightsizing x-lookup: vms_multi was unhandled β€” added multi-checkbox selector sending a UUID array. Dashboard quota showed 0 used for all resources β€” DB fallback counts from servers+flavors/volumes+snapshots when Nova/Cinder returns flat integers. Snapshot Coverage calendar tooltips and history tab now include error_message (failure reason). Monitoring "service unreachable" banner shown when pod was running and returning empty data β€” fixed by returning the HTTP 200 response immediately regardless of empty vms list. Migration Planner Analysis (1 fix): All Analysis sub-view tabs (VMs, Tenants, Networks, Hosts, Clusters, Stats) returned 404 β€” SourceAnalysis.tsx used project.id (integer PK 1) instead of project.project_id (UUID) to construct API URLs. 538 unit tests pass, 0 HIGH Bandit findings, TypeScript clean.

🩹 Tenant Portal Runbooks Bug Fixes β€” v1.93.0

v1.93.0 β€” Bug-fix release for tenant portal runbooks. Execute dialog was permanently stuck on "Run Dry Run" because supports_dry_run and parameters_schema were missing from the list endpoint response β€” VM-targeted runbooks (VM Health Quick Fix, Snapshot Before Escalation) never rendered the VM selector and always executed without a server_id, returning 0 items. All runbook results showed "0 items found / 0 actioned" because items_found/items_actioned are stored as separate DB columns (not inside the result JSONB) and were never wired through the TypeScript interface or normalisers. Result panel also read from the wrong nesting level (result.result instead of result). Fixed across tenant_portal/environment_routes.py, api/restore_management.py, tenant-ui/src/lib/api.ts, and Runbooks.tsx. Quota Threshold Check description updated to not imply cross-project scope. 538 unit tests pass, 0 HIGH Bandit findings.

πŸ“Š Role-Based Dashboard Layer β€” v1.92.0

v1.92.0 β€” Phase 6: Persona-Aware Dashboards. Two new role-specific views surface existing intelligence data in job-relevant formats. Account Manager Dashboard (My Portfolio tab) β€” per-tenant portfolio grid with SLA status badge, vCPU usage bar, critical/leakage insight counts, and KPI strip (healthy/at-risk/breached/not-configured/critical/leakage totals). Powered by GET /api/sla/portfolio/summary. Executive Dashboard (Portfolio Health tab) β€” fleet-level stacked SLA bar, 6 KPI cards (fleet health %, breached clients, at-risk clients, open critical insights, revenue leakage/month, avg MTTR), and narrative sections for leakage and MTTR compliance. Powered by GET /api/sla/portfolio/executive-summary. New account_manager and executive RBAC roles, two new departments (Account Management, Executive Leadership) with default_nav_item_key so each persona lands on their dashboard at login. unit_price DECIMAL(10,4) column added to msp_contract_entitlements (nullable β€” enables revenue leakage dollar estimates). DB migration migrate_v1_92_0_phase6.sql applied to Docker and Kubernetes. 538 unit tests pass, 0 HIGH bandit findings, TypeScript clean.

πŸ“‹ SLA Commitment Editor & Compliance History β€” v1.91.3

v1.91.3 β€” Tenant detail drawer now includes a full SLA section with two sub-tabs. The Commitment sub-tab lets admins select a tier template (Gold/Silver/Bronze/Custom) or manually enter Uptime %, RTO, RPO, MTTA, MTTR, Backup Frequency, effective date, and notes, then save via PUT /api/sla/commitments/{tenant_id} β€” with the form pre-populated from any existing commitment on open. The History sub-tab shows a 12-month compliance scorecard table with per-cell breach (red) and at-risk (amber) highlighting driven by breach_fields/at_risk_fields from GET /api/sla/compliance/{tenant_id}. SLA data loads in parallel with the existing quota fetch when the detail panel opens. No backend changes required. 538 unit tests pass, 0 HIGH bandit findings.

🩹 PSA Webhooks, Health 500, Clickable Sort Headers β€” v1.91.2

v1.91.2 β€” Bug-fix patch. Fixed GET /api/psa/configs and POST /api/psa/configs/{id}/test-fire missing /api prefix in IntelligenceSettingsPanel.tsx β€” PSA Webhooks tab no longer throws Unexpected token '<', "<!doctype".... Fixed /internal/client-health/{tenant_id} 500: endpoint was querying non-existent resource/runway_days columns on metering_quotas; replaced with correct linear-regression runway logic (_days_runway / _linear_forecast over 14-day quota history). Insights Feed column headers (Entity, Tenant, Status, Detected, Severity, Type) are now clickable sort triggers with triangle indicators; filter-bar sort labelled Sort by:. 538 unit tests pass, 0 HIGH bandit findings.

🧩 Client Health, Observer Role & Insights History β€” v1.91.0

v1.91.0 β€” Full Client Transparency Layer. Added portal_role column (manager | observer) to tenant_portal_access; observer tokens are blocked at the API layer from all write routes. New GET /api/intelligence/client-health/{tenant_id} endpoint returning three-axis health (Efficiency, Stability, Capacity Runway). Tenant UI gains a Health Overview default screen with SVG circular dials. Observer invite flow via magic-link email. Insights History tab (resolved insights with pagination). Operations summary bar. Admin UI role-toggle per portal user. DB migration migrate_v1_91_0_phase5.sql. 538 unit tests pass, 0 HIGH bandit findings.
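
Blocking observers from write routes fits naturally as a FastAPI dependency. A hedged sketch of that pattern; the session field (`request.state.portal_role`) and error wording are assumptions, not the project's actual implementation:

```python
from fastapi import APIRouter, Depends, HTTPException, Request

READ_METHODS = {"GET", "HEAD", "OPTIONS"}

async def require_writable(request: Request) -> None:
    # portal_role is assumed to be populated by the auth middleware;
    # observer sessions may read everything but never mutate.
    role = getattr(request.state, "portal_role", "observer")
    if request.method not in READ_METHODS and role == "observer":
        raise HTTPException(status_code=403, detail="observer role is read-only")

router = APIRouter(dependencies=[Depends(require_writable)])
```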

🩹 Intelligence 500 / Sort / Entitlements UX Fixes β€” v1.90.1

v1.90.1 β€” Hotfix patch for v1.90.0. Fixed /api/intelligence/regions 500 crash (wrong SQL column names hypervisor_id/collected_at on the servers and servers_history tables; root cause of cascading 502/503 pod-restart loop). Fixed cross-region growth-rate always returning 0.0 (same column bug silently swallowed in cross_region.py). Fixed Python syntax error in intelligence_routes.py (_SORT_CLAUSES dict placed between decorator and function). Added Sort dropdown to Insights Feed (server-side, 5 options). Added clickable sort headers to Risk & Capacity and Capacity Forecast tables (client-side, toggle asc/desc). Contract Entitlements tab now includes a full feature explanation, column-reference spec table, downloadable CSV template, and styled import button. All intel-settings-* CSS classes added to InsightsTab.css. 538 unit tests pass, 0 HIGH bandit findings.
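
The restored runway logic is ordinary least squares over the recent quota history, extrapolated to the quota ceiling. A self-contained sketch; function and sample names are illustrative stand-ins for the real _days_runway / _linear_forecast helpers:

```python
def days_until_exhaustion(history: list[tuple[float, float]], quota: float):
    """history: (day_offset, used) samples over e.g. a 14-day window."""
    n = len(history)
    if n < 2:
        return None
    sx = sum(d for d, _ in history)
    sy = sum(u for _, u in history)
    sxx = sum(d * d for d, _ in history)
    sxy = sum(d * u for d, u in history)
    denom = n * sxx - sx * sx
    if denom == 0:
        return None
    slope = (n * sxy - sx * sy) / denom          # usage growth per day
    intercept = (sy - slope * sx) / n
    if slope <= 0:
        return None  # flat or shrinking usage: no exhaustion forecast
    latest_day = history[-1][0]
    # Solve intercept + slope * t = quota, measured from the latest sample.
    return max(0.0, (quota - intercept) / slope - latest_day)
```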

🏒 MSP Business Value Layer β€” v1.90.0

v1.90.0 β€” Revenue Leakage engine detects over-consumption upsell opportunities (leakage_overconsumption) and ghost-resource billing gaps (leakage_ghost). New Quarterly Business Review PDF generator (POST /api/intelligence/qbr/generate/{tenant_id}) with configurable sections (cover, executive summary, ROI interventions, health trend, open items, methodology). PSA outbound webhook integration with per-config severity/type/region filtering and Fernet-encrypted auth headers. Labor rate configuration per insight type for defensible ROI reporting. Intelligence Settings panel (admin-only): labor rates editor, PSA webhook CRUD, CSV contract entitlement import. Business Review button in Tenant Health detail pane. SLA PDF report pipeline consolidated into export_reports.py. DB migration adds 3 new tables; 538 unit tests pass, 0 HIGH bandit findings.

πŸ“ˆ Extended Forecasting, Cross-Region Intelligence & Anomaly Detection β€” v1.89.0

v1.89.0 β€” Capacity engine extended with per-hypervisor compute forecasting and per-project quota-saturation forecasting (vCPUs, RAM, instances, floating IPs) including confidence scoring. New cross-region engine detects utilization imbalance, risk concentration, and growth-rate divergence across regions. New threshold-based anomaly engine fires on snapshot spikes, VM-count spikes, and API error spikes. Two new REST endpoints: GET /api/intelligence/forecast (on-demand runway per project/resource) and GET /api/intelligence/regions (per-region utilization + runway + growth). Intelligence Dashboard gains two tabs: Capacity Forecast and Cross-Region comparison. Department filter upgraded to prefix matching so insight subtypes are correctly routed. 524 unit tests pass, 0 HIGH bandit findings.

🩹 SLA Summary Route Hotfix + Insights Feed Tenant Column β€” v1.88.1

v1.88.1 β€” Hotfix: GET /api/sla/compliance/summary was being shadowed by the earlier GET /api/sla/compliance/{tenant_id} route (FastAPI matches in registration order), causing the SLA Summary tab to always show empty even when tiers were configured. Fixed by reordering the routes. Also adds a Tenant/Project column to the Insights Feed table (from metadata.project), matching the column already present in Risk & Capacity. No DB migration required.
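
The shadowing mechanics are easy to reproduce. A minimal FastAPI sketch (handler bodies are placeholders, not the project's real responses):

```python
from fastapi import APIRouter

router = APIRouter(prefix="/api/sla")

# FastAPI matches routes in registration order, so the static path must be
# registered before the dynamic one, or "summary" is captured as a tenant_id.

@router.get("/compliance/summary")      # registered first: wins
async def compliance_summary():
    return {"summary": []}

@router.get("/compliance/{tenant_id}")  # dynamic route registered after
async def compliance_for_tenant(tenant_id: str):
    return {"tenant_id": tenant_id}
```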

Phase 2 Intelligence β€” Recommendations, Bulk Actions, Copilot Intents β€” v1.88.0

v1.88.0 β€” Phase 2 of Operational Intelligence: idle-VM waste insights now generate actionable recommendations (cleanup runbook β‰₯14 days, downsize suggestion β‰₯7 days). Risk engine auto-creates support tickets for snapshot-gap and critical health-decline insights. New bulk-acknowledge/bulk-resolve API endpoints. Five new Copilot natural-language intents (critical_insights, capacity_warnings, waste_insights, unacknowledged_insights_count, risk_summary). InsightsTab UI: SLA Summary shows only configured tenants sorted by breach status; Risk & Capacity gains Tenant/Project column; bulk-action bar above feed; per-row recommendations panel with dismiss. 524 unit tests pass, 0 HIGH bandit findings.

🩹 SLA & Intelligence Write 500 Hotfix β€” v1.87.2

v1.87.2 β€” PUT /api/sla/commitments and intelligence write endpoints (acknowledge/snooze/resolve) all returned HTTP 500. Root cause: require_permission() returns user.model_dump() (a dict) but the affected handlers called user.username (attribute access). Fixed to user["username"] dict access in both sla_routes.py and intelligence_routes.py. 524 unit tests pass, 0 HIGH bandit findings.

🩹 Intelligence 500 Hotfix β€” v1.87.1

v1.87.1 β€” All GET /api/intelligence/ endpoints returned HTTP 500 after v1.87.0 deployed to Kubernetes. Root cause: # nosec B608 bandit suppression comments placed on the same line as the opening triple-quoted f-string were included in the SQL text sent to PostgreSQL. PostgreSQL raised a syntax error on the # character, crashing every intelligence request. Fix: moved suppression comments to the cur.execute( call line. 524 unit tests pass, 0 HIGH bandit findings.
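
A reduced reproduction of the bug class, since it is easy to reintroduce (the query and whitelist comment are illustrative):

```python
def fetch_insights(cur, kind: str = "sla_risk"):
    # Broken pattern: everything after the opening triple quote, on the same
    # line or not, is string content -- so a "# nosec B608" placed there was
    # shipped to PostgreSQL verbatim and raised a syntax error on '#':
    #
    #   query = f"""  # nosec B608      <-- part of the SQL string!
    #       SELECT id FROM insights WHERE insight_type = '{kind}'
    #   """
    #
    # Fixed pattern: the suppression lives on the execute() call line,
    # outside the string literal.
    cur.execute(  # nosec B608 - insight_type comes from an internal whitelist
        f"""
        SELECT id FROM insights WHERE insight_type = '{kind}'
        """
    )
    return [row[0] for row in cur.fetchall()]
```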

πŸ” Department Workspaces + SLA Tier Modal β€” v1.87.0

v1.87.0 β€” Operational Intelligence workspace selector: four context-aware workspaces (Global / Support / Engineering / Operations) filter the insight feed to relevant insight types with sensible severity presets; workspace preference persists to localStorage; the operator role defaults to Engineering on first load. New intelligence_utils.py is the single source of truth for insight-typeβ†’department routing, consumed by GET /api/intelligence/insights?department= and GET /api/intelligence/insights/summary?department=. Fixed SLA tier assignment modal: the SlaTierTemplate interface used id/name but the API returns tier/display_name, leaving the dropdown empty; the bare KPI summary was replaced with a rich description block per tier (plain-language guidance, 3-column KPI grid, abbreviation legend). 538 tests, 0 HIGH bandit findings.

πŸ”§ SLA Summary Hotfix β€” v1.86.2

v1.86.2 β€” InsightsTab SLA Summary fix: API returns { summary, month } but the component consumed data.projects (undefined), crashing on .length. Also corrected SlaSummaryRow interface and table columns to match the actual summary endpoint response (tenant_id/tenant_name/breach_fields/at_risk_fields instead of KPI values). 524 tests, 0 HIGH bandit findings.

πŸ”§ Kubernetes Hotfix β€” v1.86.1

v1.86.1 β€” K8s CrashLoopBackOff hotfix for sla-worker and intelligence-worker: Helm values.yaml was missing redis.host and redis.port keys. Both worker Deployments inject REDIS_HOST/REDIS_PORT via {{ .Values.redis.host | quote }} / {{ .Values.redis.port | quote }}, which resolved to empty strings when the keys were absent. int("") raised ValueError: invalid literal for int() with base 10: '' at startup. Fixed by adding redis.host: pf9-redis and redis.port: "6379" to values.yaml. Helm chart version bumped from 1.85.7 to 1.86.1. 538 tests, 0 HIGH bandit findings.
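
Beyond fixing the chart values, the crash also motivates defensive env parsing at startup. A small sketch treating an empty value (e.g. an unset Helm key rendered as "") the same as absent; the helper name is illustrative:

```python
import os

def redis_port(default: int = 6379) -> int:
    # int("") raises ValueError, so strip and fall back before converting.
    raw = os.environ.get("REDIS_PORT", "").strip()
    return int(raw) if raw else default
```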

SLA Compliance + Operational Intelligence β€” v1.86.0

v1.86.0 β€” SLA Compliance Tracking and Operational Intelligence Feed: SLA tier templates (bronze/silver/gold/custom), per-tenant commitments, monthly KPI measurement (uptime %, RTO, RPO, MTTA, MTTR, backup success %), and PDF compliance reports. sla_worker computes KPIs every 4 hours; breach detection fires sla_risk insights. intelligence_worker (15-min poll) runs three engine families β€” Capacity (linear-regression storage trend), Waste (idle VMs, unattached volumes, stale snapshots), Risk (snapshot gap, health decline, unacknowledged drift). New πŸ” Insights tab with three sub-views: Insights Feed (ack/snooze/resolve), Risk & Capacity, SLA Summary. Dashboard widget shows insight count by severity.

πŸ”§ Tenant Portal Bug-Fixes β€” v1.85.5–v1.85.12

v1.85.12 β€” K8s CrashLoopBackOff hotfix (tenant-ui nginx + monitoring httpx): pf9-tenant-ui crashed on v1.85.11 because nginx.conf hardcoded proxy_pass http://tenant_portal:8010 (Docker Compose service name), which fails DNS resolution in Kubernetes (service is pf9-tenant-portal). Fixed using an envsubst template β€” same image works in Docker Compose (default tenant_portal:8010) and Kubernetes (TENANT_PORTAL_UPSTREAM=pf9-tenant-portal:8010 via Helm). pf9-monitoring crashed because _bootstrap_cache_from_api() imports httpx at the function level (outside try) but httpx was absent from monitoring/requirements.txt β€” CI-built image raised ModuleNotFoundError on startup. Added httpx==0.27.2. 538 tests, 0 HIGH bandit findings.

v1.85.11 β€” Tenant portal fully operational + branding logo + [object Object] error fix + Restore Center (MANUAL_IP / result panel / email): Tenant portal was completely broken in production β€” tenant-ui nginx had no proxy for /tenant/* so every API call returned index.html; fixed by adding location /tenant/ proxy block. Branding logos uploaded via the admin UI (file-path logo_url in DB) now convert to inline base64 data URLs at read time β€” no nginx re-routing required. Admin UI no longer shows [object Object] on API validation errors (apiFetch in pf9-ui now unwraps FastAPI 422 array detail into readable messages). Restore Center gains MANUAL_IP network/IP strategy, post-restore result panel (new VM name, error details accordion), email summary button, and expandable history rows. Monitoring bootstrap always runs on startup. 538 tests, 0 HIGH bandit findings.

v1.85.10 β€” K8s Branding/Monitoring/Runbook fixes: Branding save 422 fixed (logo URL validator now accepts server-relative /api/ paths); logo upload 400 fixed in K8s (content-type extension fallback when nginx ingress strips multipart part headers); monitoring empty-hosts bug fixed ("".split(",") returns [""], now correctly treated as []); monitoring startup race fixed (5Γ— retry with 5 s gaps); branding_logos emptyDir volume added to the K8s pf9-api pod; runbook results now include items_scanned counts and summary strings for operator visibility; SQL injection B608 fixed in the capacity_forecast engine; 70 new tests (28 integration, 42 unit).

v1.85.9 β€” Branding logo upload + monitoring docker-compose fixes: Admin Branding tab now has an Upload Image button with live preview (PNG/JPEG/GIF/WebP/SVG, ≀512 KB, per-tenant via ?project_id=). Fixed 3 docker-compose bugs that caused "No metrics collected yet": wrong MONITORING_SERVICE_URL DNS name (http://monitoring β†’ http://pf9_monitoring), PF9_HOSTS defaulting to localhost (prevents auto-discovery), missing monitoring/cache volume mount in tenant_portal. 35 new unit tests.

v1.85.8 β€” Quota Usage / Runbooks VM picker / Monitoring host auto-discovery: Dashboard Quota bars now show real in-use figures (Nova/Cinder ?usage=true was missing); vm_health_quickfix + snapshot_before_escalation Execute dialogs now show the Target VM dropdown (server_id field detected via x-lookup: vms); monitoring service auto-discovers hypervisor IPs from DB at startup when PF9_HOSTS is empty (new /internal/prometheus-targets admin API endpoint). 27 new unit tests.

v1.85.7 β€” K8s bug-fix release: "Connection lost" banner on Branding tab eliminated (apiFetch now throws immediately on any HTTP error without retrying); /tenant/quota 400 fixed (CP ID regex now accepts slugs like default); snapshot calendar header labels realigned with cells + today marker added; Runbooks blank page / TypeError on risk_level.toLowerCase() fixed (normalised apiExecuteRunbook response + null guards); Monitoring empty-state now shows distinct message for service-unreachable vs no-data-collected.

v1.85.6 β€” K8s bug-fix release: Active Sessions tab 500 fixed (Redis errors handled gracefully); Branding tab "branding_not_found" error banner fixed (detail string caught alongside HTTP 404); per-tenant branding overrides added (project-scoped rows, admin scope dropdown, useBranding re-fetches on login).

v1.85.5 β€” K8s bug-fix release: Monitoring/Runbooks 401 fixed (added /internal to admin API RBAC exclusions); Volumes "Attached To" column shows VM name; VM list Coverage column populated; Fixed IP picker filters by selected network.

v1.85.4 β€” K8s bug-fix release: VM Disk column now shows boot-volume size for BFV VMs; Volumes table shows last snapshot date; Monitoring/Runbooks 502 fixed by adding NetworkPolicy egress to admin API + monitoring pods; New VM Fixed IP picker shows IPs already in use in the selected network.

v1.85.3 β€” Runbook execution from tenant portal (execute button, parameter form, dry-run toggle, execution history tab); Create VM: RFC-1123 name validation, fixed IP picker, cloud-init user/password; Dependency graph expanded to 5 node types (VM, Network, Subnet, Security Group, Volume) and 4 edge types; VM list and inventory CSV now include disk size and IP addresses; Activity Log shows username + truncated Keystone user ID; Dashboard correctly shows amber "Skipped" for skipped snapshot events.


🏒 Tenant Self-Service Portal β€” v1.84.0 β†’ v1.84.21 (Complete)

v1.84.21 β€” Fix tenant-ui build: api.ts had a second corrupted copy appended after the first clean copy (1341 lines instead of ~661) β€” a prior replace_string_in_file left old interleaved fragments in place. Truncated the file to the first clean copy; Docker build now passes.

v1.84.20 β€” Fix tenant-ui build: api.ts was corrupted by overlapping replacements (code fragments interleaved, missing closing parens, unterminated template literals) β†’ Docker npm run build failed with 10+ TS1005/TS1160 errors. Rewrote the file cleanly; tsc --noEmit passes.

v1.84.19 β€” Tenant portal crash-fix: the restore_jobs table has no region_id column β€” 4 queries wrongly filtered by it β†’ dashboard 500 UndefinedColumn. Full api.ts adapter-layer rewrite β€” all 16 API functions now unwrap backend {key:[...],total:N} envelopes and remap field names to match the TypeScript interfaces, fixing the vms.filter is not a function crash on every tenant screen.

v1.84.18 β€” DB/K8s fixes: tenant_portal_role had INSERT but not SELECT on tenant_action_log β†’ every post-login endpoint returned 500; the K8s secret password was never set on DB user tenant_portal_role in pf9-db-0 β†’ login returned 500 immediately.

v1.84.17 β€” CI fix: httpx was missing from the integration-test job's pip install step; test_tenant_portal_login_integration.py imports it for live HTTP calls, causing ModuleNotFoundError at collection time and aborting the entire CI run. Added httpx to .github/workflows/ci.yml.

v1.84.16 β€” Fix K8s 504: the NetworkPolicy ingress namespace was ingress-nginx but the nginx-tenant controller deploys to ingress-nginx-tenant; egress had no Keystone (443/5000) rule; the login error banner now shows context-aware messages (it was always "Invalid credentials" for any error, including 504/403).

v1.84.15 β€” Fix 504 on tenant portal login: Keystone call made async (it was blocking the uvicorn event loop); VITE_TENANT_API_TARGET added to the docker-compose override (the dev proxy was hitting localhost inside the container); K8s ingress proxy-read/connect-timeout annotations added.

v1.84.14 β€” Domain field on the login form (Keystone multi-domain support); domain field hardened with max_length + regex whitelist; security tests extended to S33.

v1.84.13 β€” Bug-fix & security hardening: log_auth_event TypeError crash on every access grant/revoke fixed; Audit Log sub-tab 500 (wrong column names) fixed; batch-grant transaction poisoning fixed (savepoints); stored XSS via javascript: / data: URIs in branding URLs blocked; field length limits added; security test suite extended to S30.

v1.84.12 β€” Grant Access wizard (3-step: tenant picker β†’ user checkboxes β†’ MFA/notes); batch grant API; CP dropdown.

v1.84.11 β€” Grant Access form gains User Name + Tenant/Org Name fields; access table shows friendly labels; user_name/tenant_name DB + API.

v1.84.10 β€” Nav fix: tenant_portal tab now appears in Admin Tools; DB migration for live environments; guide corrections.

v1.84.9 β€” Tenant Portal complete: GET /tenant/branding unauthenticated branding endpoint (60 s cache); admin GET/PUT /branding/{cp_id} and DELETE /mfa/{cp_id}/{user_id} endpoints; Admin UI "🏒 Tenant Portal" tab with 4 sub-tabs; 27 P8 security tests (S01–S27 across 8 categories). β†’ Tenant Portal Guide

v1.84.4 β€” Tenant-ui SPA: React + TypeScript, 7 screens (Dashboard, Infrastructure, Snapshot Coverage, Monitoring, Restore Center, Runbooks, Activity Log), MFA login, per-customer branding. Kubernetes stability fixes in v1.84.5–v1.84.8 (dedicated nginx-ingress-tenant on separate MetalLB IP).

v1.84.3 β€” Full restore center (6 endpoints), TOTP + email OTP + backup-code MFA, audit logging on all tenant endpoints, ops Slack/Teams + tenant email notifications.

v1.84.0 β€” Tenant Self-Service Portal foundation: tenant_portal_role with RLS on 5 inventory tables; 5 schema tables; isolated FastAPI on port 8010 (JWT role=tenant, Redis sessions, IP binding, per-user rate limiting); 6 admin API endpoints; Helm NetworkPolicy.
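
For readers unfamiliar with the RLS approach, a hypothetical sketch of the pattern (table, column, and policy names are invented; the real migration covers 5 inventory tables and the session plumbing differs):

```python
# Illustrative row-level-security migration: the restricted role can only
# SELECT rows whose project matches the per-session setting.
RLS_MIGRATION = """
    ALTER TABLE servers ENABLE ROW LEVEL SECURITY;
    CREATE POLICY tenant_isolation ON servers
        FOR SELECT TO tenant_portal_role
        USING (project_id = current_setting('app.current_project_id'));
"""
# The portal would then run SET app.current_project_id = '<uuid>' per
# session, so tenant_portal_role only ever sees its own project's rows.
```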


🌍 Multi-Region & Multi-Cluster Support β€” v1.73.0 β†’ v1.79.0

v1.79.0 β€” External LDAP / AD identity federation with group-to-role mapping, credential passthrough, and sync worker.

v1.76.0 β€” Multi-region management UI: RegionSelector nav dropdown, ClusterManagement admin panel (add/delete/test/discover CPs and regions), per-region filtering across all views.

v1.73.0 β€” Full multi-cluster infrastructure: ClusterRegistry, per-region worker loops, cross-region migration planning, SSRF protection, health tracking.


🎫 Support Ticket System β€” v1.58 β†’ v1.60

v1.60 β€” Ticket analytics, bulk actions, LandingDashboard KPI widget, metering and runbook ticket integration.

v1.58 β€” Full ticket lifecycle: 5 types, SLA daemon, 35+ endpoints, auto-ticket triggers (health score, drift, graph deletes, runbook failures), approval workflows, email templates.


Security hardening, performance, CI fixes, and UI polish are documented in the full changelog.

Complete version history for all releases: CHANGELOG.md


πŸ‘₯ Who This Is For

  • MSPs running multi-tenant Platform9 environments β€” multi-region console, per-customer chargeback, SLA enforcement, automated tenant onboarding and offboarding
  • Enterprise OpenStack teams β€” operational governance, snapshot compliance, capacity planning, VMware migration tooling
  • Engineering teams responsible for Day-2 operations β€” not provisioning, but everything that comes after it

❌ When NOT to Use pf9-mngt

  • You manage a single small tenant with no SLA requirements β€” the native Platform9 UI is sufficient
  • You don't need automation or governance β€” if manual workflows are acceptable at your scale, this is over-engineered for you
  • Your team doesn't own Day-2 operations β€” if Platform9 SaaS handles everything and you never touch restore, compliance, or chargeback, you don't need this layer
  • You want a Platform9-supported product β€” pf9-mngt is independent and community-maintained, not an official Platform9 offering

If any of the above applies, save yourself the setup. If they don't β€” this is built for you.


🎯 Positioning

pf9-mngt is:

  • ❌ Not a UI replacement β€” it is an engineering console that adds workflows the native Platform9 UI does not provide
  • ❌ Not a cloud control plane β€” it orchestrates Platform9 / OpenStack via their existing APIs
  • ❌ Not a provisioning tool β€” it operates on what has already been provisioned
  • βœ… The operational layer on top β€” what you reach for when something breaks, needs auditing, or must be tracked at scale

🀝 Contributing

Contributions are welcome β€” code, documentation, bug reports, feature suggestions, or feedback.

See CONTRIBUTING.md for guidelines on:

  • How to report bugs
  • How to suggest features
  • How to submit pull requests
  • Development setup and coding standards

πŸ’ Support the Project

If pf9-mngt saves your team time, consider:

  • ⭐ Star the repository β€” helps others discover the project
  • πŸ› Report bugs β€” open an issue
  • πŸ’» Contribute code β€” PRs are welcome
  • πŸ’¬ Share feedback β€” what would you add?

β˜• Buy Me a Coffee

If this project saves you time or makes your Platform9 operations easier, you can support its continued development:

Buy Me A Coffee


πŸ‘€ About the Creator

Erez Rozenbaum β€” Cloud Engineering Manager & Original Developer

Built as part of a serious Platform9 evaluation to solve real operational gaps for MSP and enterprise teams. 670+ commits, 270+ releases, 18 containerized services, 170+ API endpoints β€” built alongside regular responsibilities.


πŸ“„ License

MIT License β€” see LICENSE for details.

Copyright Β© 2026 Erez Rozenbaum and Contributors


Project Status: Production Ready | Version: 1.94.7 | Last Updated: May 2026

About

Open-source MSP and enterprise operations platform for Platform9/OpenStack, providing multi-cluster / multi-region management, identity federation, inventory intelligence, snapshot automation, restore orchestration, migration planning, governance, and operational analytics in one self-hosted control layer.
