Real-time infrastructure monitoring dashboard with agent-based metrics, host monitoring, inventory management, and system logging.
- Backend: FastAPI + SQLite + psutil
- Frontend: Static HTML/CSS/JS + Chart.js (CDN)
- Agents: Lightweight Python agents for Linux hosts (Ubuntu, Debian, RHEL, CentOS, Rocky Linux)
- Anomaly detection: Rolling z-score over recent metric history
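
A rolling z-score compares each new sample with the mean and standard deviation of a window of recent samples. The sketch below is illustrative only: the class name, the `min_samples` warm-up, and the use of the population standard deviation are assumptions, not the project's actual `anomaly` module. The default window and threshold mirror `ANOMALY_WINDOW_SECONDS` and `ANOMALY_Z_THRESHOLD` at one sample per second.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Tuple


class RollingZScore:
    """Flag samples that deviate strongly from a window of recent values."""

    def __init__(self, window: int = 120, threshold: float = 3.0, min_samples: int = 10):
        self.values = deque(maxlen=window)  # oldest samples fall off automatically
        self.threshold = threshold
        self.min_samples = min_samples      # assumed warm-up before scoring starts

    def update(self, value: float) -> Tuple[float, bool]:
        """Score `value` against the history so far, then add it to the window."""
        z = 0.0
        if len(self.values) >= self.min_samples:
            mu = mean(self.values)
            sigma = pstdev(self.values)
            if sigma > 0:  # a perfectly flat history yields no score
                z = (value - mu) / sigma
        self.values.append(value)
        return z, abs(z) >= self.threshold


detector = RollingZScore(window=120, threshold=3.0)
baseline = [20.0 + (i % 5) for i in range(60)]   # steady CPU load between 20% and 24%
flags = [detector.update(v)[1] for v in baseline]
spike_z, spike_flag = detector.update(95.0)      # sudden jump in CPU load
```

In this run none of the baseline samples are flagged, while the 95% sample is.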
ai-system-health-dashboard/
├── agents/                  # Per-OS agent scripts (centos/debian/rhel/rocky/ubuntu)
├── app/                     # FastAPI backend + static frontend
│   ├── static/
│   │   ├── assets/          # Per-page CSS + JS
│   │   └── *.html
│   └── *.py                 # main, models, storage, auth, config, anomaly, metrics, protocols
├── data/                    # Runtime DBs + logs (gitignored)
├── docs/
│   ├── guides/              # All .md deployment/fix/config guides
│   └── screenshots/         # README screenshots
├── scripts/
│   ├── deploy/              # Per-host deploy_*.sh scripts
│   ├── fix/                 # fix_*.sh / fix_*.py scripts
│   ├── util/                # One-off utility/migration scripts
│   ├── manage_users.py
│   ├── take_screenshots.py
│   └── ...
├── .env.example
├── .gitignore               # Now also ignores data/*.json, cookies.txt, scratch scripts
├── README.md                # Updated with Project Structure section + correct paths
└── requirements.txt
| Page | URL | Description |
|---|---|---|
| Dashboard | `/` | Overview, Problems, Hosts, Maps |
| Host Monitor | `/host?id=<id>` | Per-host real-time charts (CPU, RAM, Disk, Network, GPU) |
| Overview | `/overview` | System overview |
| Inventory | `/inventory` | Asset/equipment tracking (rack, shelf, S/N) |
| Configuration | `/configuration` | System configuration |
| Hosts Management | `/hosts` | Discovered hosts and agent deployment |
| System Logs | `/logs` | Filterable system event log |
| User Management | `/users` | User accounts |
| User Groups | `/user-groups` | Group-based access control |
- Enhanced auto-discovery (`/api/discovery/start`) to classify device types using SNMP sysDescr keyword matching and TCP port fingerprinting
- Detected device types: `rack-server`, `switch`, `router`, `firewall`, `patch-panel`, `linux`, `windows`, `network`, `other`
- `dashboard.js`: Added `DEVICE_TYPE_META` map; device type icons and colours shown in host table and SVG map nodes
- `hosts.html`: Edit modal type dropdown updated with all new device types and icons
- `index.html`: Host type datalist updated with new device types
- `maps.html`: Rebuilt with dynamic SVG topology map and device inventory grid; auto-refreshes every 30 seconds
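
As a rough illustration of sysDescr keyword matching combined with port fingerprinting, the sketch below checks the SNMP description first and falls back to open TCP ports. The keyword lists, port sets, precedence order, and the `classify_device` name are all hypothetical; the rules used by the discovery endpoint may differ.

```python
from typing import Optional, Set

# Hypothetical keyword table: first match wins, so more specific types come first.
SYSDESCR_KEYWORDS = [
    ("firewall", ("firewall", "fortigate", "pfsense")),
    ("router", ("router", "routeros")),
    ("switch", ("switch", "catalyst")),
    ("windows", ("windows",)),
    ("linux", ("linux",)),
]

# Hypothetical port hints, used only when sysDescr gives no answer.
PORT_HINTS = [
    ("windows", {3389, 445}),   # RDP / SMB
    ("linux", {22}),            # SSH
    ("network", {161, 23}),     # SNMP or Telnet only
]


def classify_device(sys_descr: Optional[str], open_ports: Set[int]) -> str:
    """Return one of the device type labels, or "other" if nothing matches."""
    text = (sys_descr or "").lower()
    for device_type, keywords in SYSDESCR_KEYWORDS:
        if any(keyword in text for keyword in keywords):
            return device_type
    for device_type, ports in PORT_HINTS:
        if open_ports & ports:
            return device_type
    return "other"
```

For example, `classify_device("Cisco IOS Software, Catalyst L3 Switch", {22, 161})` returns `"switch"`, and `classify_device(None, {3389})` returns `"windows"`.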
- Users page (`/users`): Action buttons split into separate columns: Edit | Reset Password | Delete | Activate
- Reset Password button: Opens a modal to set a new password for any user (admin only); passwords are hashed correctly
- Fixed `editUser` onclick: was passing the user ID instead of the full user object to the modal
- Fixed `editGroup` onclick: same bug fixed for the user groups page
- Fixed `create_user` API: was using broken `pbkdf2_hmac` with a hardcoded salt; now uses `hash_password()` so created users can log in
- Fixed `update_user` API: same password hashing bug fixed; changing a user's password via the UI now works correctly
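
For context on why a hardcoded salt is a problem, here is a minimal sketch of salted PBKDF2 hashing with a fresh random salt per password. It is not the project's `hash_password()`: the function names, iteration count, and stored string format below are assumptions chosen for illustration.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # assumed work factor, not taken from the project


def hash_password_sketch(password: str) -> str:
    """Hash with a per-password random salt and store salt + digest together."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return f"pbkdf2_sha256${ITERATIONS}${salt.hex()}${digest.hex()}"


def verify_password_sketch(password: str, stored: str) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    _scheme, iterations, salt_hex, digest_hex = stored.split("$")
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), bytes.fromhex(salt_hex), int(iterations)
    )
    return hmac.compare_digest(candidate.hex(), digest_hex)


first = hash_password_sketch("s3cret")
second = hash_password_sketch("s3cret")  # same password, different salt
```

Because the salt is random, the two hashes differ even for the same password, yet both verify. Creating and checking passwords through one shared helper keeps the two paths from drifting apart.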
- Fixed HTTP 400 on save: the `allowed_hosts` list was passed raw to SQLite; it is now JSON-serialised before storage
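
The failure mode can be reproduced with the standard `sqlite3` module, which has no adapter for Python lists. The table and column layout below are hypothetical; only the `allowed_hosts` name comes from the note above.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_groups (id INTEGER PRIMARY KEY, name TEXT, allowed_hosts TEXT)"
)

allowed_hosts = ["web-01", "db-01"]

# Binding the raw list fails because sqlite3 cannot store a Python list directly.
try:
    conn.execute(
        "INSERT INTO user_groups (name, allowed_hosts) VALUES (?, ?)",
        ("ops", allowed_hosts),
    )
    raw_insert_failed = False
except sqlite3.Error:
    raw_insert_failed = True

# Serialising to a JSON string first makes the value a plain TEXT column.
conn.execute(
    "INSERT INTO user_groups (name, allowed_hosts) VALUES (?, ?)",
    ("ops", json.dumps(allowed_hosts)),
)
row = conn.execute(
    "SELECT allowed_hosts FROM user_groups WHERE name = ?", ("ops",)
).fetchone()
restored = json.loads(row[0])  # decode again when reading
```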
- `dashboard.js`: Non-admin (`user` role) users no longer see Users and User groups sidebar links on any page
- If those are the only items in the Administration group, the entire group header is hidden too
- Applied globally via `dashboard.js`; no per-page changes needed
- Fixed `maps.html` CSS/JS structure: `.topoMenuItem` styles were inside a `<script>` tag, causing lint errors
- `host.js`: Added 60-second grace period on page load before flagging "Agent offline" as critical; prevents a false alarm before the first agent report
- Started and enabled the `system-trace-agent` systemd service (was installed but inactive)
- `main.py`: On startup, if the `users` table is empty, the app automatically creates a default `admin` / `admin` account
- A warning is logged reminding the operator to change the password after first login
- Ensures a fresh clone + run is immediately usable without manual DB setup
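
A first-run bootstrap of this kind can be sketched as follows. The column names follow the `INSERT` statement shown later in this README; the `ensure_default_admin` function, the `CREATE TABLE` definition, and the injected `hash_password` callable are assumptions rather than the code in `main.py`.

```python
import logging
import sqlite3
import time
from typing import Callable

log = logging.getLogger("startup")


def ensure_default_admin(conn: sqlite3.Connection, hash_password: Callable[[str], str]) -> bool:
    """Create admin/admin if the users table is empty. Returns True if created."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users ("
        "id INTEGER PRIMARY KEY, username TEXT UNIQUE, password_hash TEXT, "
        "role TEXT, is_active INTEGER, created_at INTEGER)"
    )
    (count,) = conn.execute("SELECT COUNT(*) FROM users").fetchone()
    if count:
        return False  # existing accounts are never touched
    conn.execute(
        "INSERT INTO users (username, password_hash, role, is_active, created_at) "
        "VALUES (?, ?, ?, 1, ?)",
        ("admin", hash_password("admin"), "admin", int(time.time())),
    )
    conn.commit()
    log.warning("Created default admin/admin account; change the password after first login")
    return True


demo_conn = sqlite3.connect(":memory:")
created_first = ensure_default_admin(demo_conn, lambda p: "hash:" + p)   # placeholder hasher
created_second = ensure_default_admin(demo_conn, lambda p: "hash:" + p)  # no-op on second run
```

The check is idempotent: a second call finds the existing row and does nothing.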
- `.env` is gitignored: after a fresh clone, copy `.env.example` to `.env` and set `SESSION_SECRET_KEY`
- `data/auth.db` users are not committed: on first run the app now auto-creates `admin` / `admin`
- To add more users after setup: `python scripts/manage_users.py create --username <name> --role admin --password <pass>`
- Real-time charts updating every ~3 seconds (CPU, RAM, Disk, Network, GPU)
- Frontend rolling ring buffer — 120-point history (2 min at 1s polling)
- Multi-GPU support: per-GPU utilization %, VRAM used/total, temperature tiles and charts (via `nvidia-smi`)
- Problem detection: header blinks red when CPU ≥ 75%, RAM ≥ 80%, Disk ≥ 85%, GPU ≥ 85%, or agent offline
- Per-host error history panel — shows active problems + last 50 log entries for that host
- Protocol checks: ICMP, HTTP, HTTPS, SSH, DNS, SNMP, NTP
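
The ring buffer and threshold checks described above live in the frontend JavaScript. The Python sketch below only mirrors the logic: a fixed-length buffer per metric plus the listed thresholds. The function name, sample shape, and message format are assumptions.

```python
from collections import deque
from typing import Dict, List

# Thresholds as listed above (percent).
THRESHOLDS = {"cpu": 75.0, "ram": 80.0, "disk": 85.0, "gpu": 85.0}
HISTORY_POINTS = 120  # fixed-size history; older points are dropped automatically

history: Dict[str, deque] = {name: deque(maxlen=HISTORY_POINTS) for name in THRESHOLDS}


def record_sample(sample: Dict[str, float], agent_online: bool = True) -> List[str]:
    """Store one sample per metric and return the list of active problems."""
    problems = []
    for name, limit in THRESHOLDS.items():
        value = sample.get(name)
        if value is None:
            continue  # metric not reported, e.g. host without a GPU
        history[name].append(value)
        if value >= limit:
            problems.append(f"{name.upper()} {value:.0f}% >= {limit:.0f}%")
    if not agent_online:
        problems.append("Agent offline")
    return problems
```

A non-empty return value corresponds to the state in which the header would blink red.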
- Auto-logs threshold breaches from agent metrics (CPU, RAM, Disk, GPU util/temp)
- Frontend also reports problems to `/api/logs` when detected on the Host Monitor page
- Filter by level (Critical / Warning / Info) and hostname
- Summary tiles: Critical / Warning / Info / Total counts
- Auto-refreshes every 5 seconds
- Clear all logs button (admin)
- Track servers, racks, switches, PDUs, and other equipment
- Fields: Name, Category, Rack, Shelf/Unit, S/N (serial number), Qty, Notes
- Summary tiles: Total Items, Categories, Racks, Last Refresh
- Search across all fields
- Admin-only add/remove
- Post metrics every 3 seconds (down from 30s)
- CPU sampling interval: 0.5s
- GPU metrics via `nvidia-smi`: multi-GPU list with percent, VRAM, temperature
- Supported: Ubuntu, Debian, RHEL, CentOS, Rocky Linux
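
One way to collect a per-GPU list is to ask `nvidia-smi` for CSV output and parse it line by line. This is a sketch under assumptions: the chosen query fields, the dictionary keys, and the function names are illustrative and may not match the agent scripts.

```python
import subprocess
from typing import Dict, List

# Fields requested from nvidia-smi, in this order.
QUERY = "index,utilization.gpu,memory.used,memory.total,temperature.gpu"


def parse_gpu_csv(text: str) -> List[Dict[str, float]]:
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    gpus = []
    for line in text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 5:
            continue
        try:
            index, percent, used, total, temp = (float(field) for field in fields)
        except ValueError:
            continue  # simplification: skip GPUs reporting non-numeric values such as "[N/A]"
        gpus.append({
            "index": int(index),
            "percent": percent,
            "vram_used_mb": used,
            "vram_total_mb": total,
            "temperature_c": temp,
        })
    return gpus


def read_gpus() -> List[Dict[str, float]]:
    """Run nvidia-smi; return an empty list if it is missing, slow, or fails."""
    try:
        result = subprocess.run(
            ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, timeout=5, check=True,
        )
    except (OSError, subprocess.SubprocessError):
        return []
    return parse_gpu_csv(result.stdout)


sample_output = "0, 37, 2048, 24576, 54\n1, 0, 12, 24576, 41\n"
parsed = parse_gpu_csv(sample_output)
```

Returning an empty list on hosts without an NVIDIA driver lets the agent keep posting CPU, RAM, and disk metrics unchanged.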
- `GET /api/logs`: list logs, filterable by `level` and `hostname`
- `POST /api/logs`: write a log entry
- `DELETE /api/logs`: clear all logs (admin)
- `GET /api/logs/host/{hostname}`: per-host log history
- Stored in `data/system_logs.json`, rolling cap of 2000 entries
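
A rolling cap on a JSON log file can be kept by trimming to the newest entries on every write. The sketch below assumes the file holds a single JSON array and that entries carry `ts`, `level`, `hostname`, and `message` fields; both the layout and the `append_log` helper are hypothetical.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Dict, List

MAX_ENTRIES = 2000  # cap stated above


def append_log(path: Path, entry: Dict, max_entries: int = MAX_ENTRIES) -> List[Dict]:
    """Append one entry, keep only the newest `max_entries`, and rewrite the file."""
    entries = json.loads(path.read_text()) if path.exists() else []
    entries.append(entry)
    entries = entries[-max_entries:]          # drop the oldest entries beyond the cap
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(entries))
    tmp.replace(path)                         # swap in the new file in one step
    return entries


# Demonstration with a small cap so the trimming is visible.
with tempfile.TemporaryDirectory() as tmpdir:
    log_path = Path(tmpdir) / "system_logs.json"
    for i in range(5):
        kept = append_log(
            log_path,
            {"ts": time.time(), "level": "warning", "hostname": "web-01", "message": f"event {i}"},
            max_entries=3,
        )
kept_messages = [entry["message"] for entry in kept]
```

With a cap of 3 and five writes, only `event 2`, `event 3`, and `event 4` remain.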
cd ai-system-health-dashboard
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

- Dashboard: http://localhost:8000/
- UI examples: http://localhost:8000/static/ui-examples.html
- API docs: http://localhost:8000/docs
You can also set UI preferences via URL params:
/?theme=light&layout=stack&density=compact
Environment variables (optional):
- `SAMPLE_INTERVAL_SECONDS` (default: `1.0`)
- `HISTORY_SECONDS` (default: `600`)
- `ANOMALY_WINDOW_SECONDS` (default: `120`)
- `ANOMALY_Z_THRESHOLD` (default: `3.0`)

SQLite persistence:
- `METRICS_DB_PATH` (default: `data/metrics.db`): set to an empty string to disable persistence
- `SQLITE_RETENTION_SECONDS` (default: `86400`): how long to keep samples in SQLite (set `0` to keep forever)
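
To make the defaults and the empty-string rule concrete, here is a sketch of how these variables could be read. The `Settings` dataclass and `load_settings` function are hypothetical and are not the project's `config` module.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    sample_interval_seconds: float
    history_seconds: int
    anomaly_window_seconds: int
    anomaly_z_threshold: float
    metrics_db_path: Optional[str]   # None means persistence is disabled
    sqlite_retention_seconds: int    # 0 means keep samples forever


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from `env`, falling back to the documented defaults."""
    db_path = env.get("METRICS_DB_PATH", "data/metrics.db")
    return Settings(
        sample_interval_seconds=float(env.get("SAMPLE_INTERVAL_SECONDS", "1.0")),
        history_seconds=int(env.get("HISTORY_SECONDS", "600")),
        anomaly_window_seconds=int(env.get("ANOMALY_WINDOW_SECONDS", "120")),
        anomaly_z_threshold=float(env.get("ANOMALY_Z_THRESHOLD", "3.0")),
        metrics_db_path=db_path or None,  # empty string disables persistence
        sqlite_retention_seconds=int(env.get("SQLITE_RETENTION_SECONDS", "86400")),
    )


defaults = load_settings({})
custom = load_settings({"SAMPLE_INTERVAL_SECONDS": "0.5", "METRICS_DB_PATH": ""})
```

Passing a plain mapping instead of `os.environ` keeps the loader easy to test.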
Example:
SAMPLE_INTERVAL_SECONDS=0.5 HISTORY_SECONDS=900 SQLITE_RETENTION_SECONDS=172800 uvicorn app.main:app --reload

This dashboard includes a simple login system (cookie sessions) with a SQLite users database.
- `SESSION_SECRET_KEY`: required. Set this to a long random string.
  - A starter `.env` file is included (gitignored). Edit it before running.
- `AUTH_DB_PATH` (default: `data/auth.db`)
- `SESSION_MAX_AGE_SECONDS` (default: `86400`)
- `SESSION_COOKIE_NAME` (default: `system-trace_session`)
- `SESSION_COOKIE_SAMESITE` (default: `strict`)
- `SESSION_COOKIE_SECURE` (default: `0`): set to `1` only when serving over HTTPS
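
If you need a value for `SESSION_SECRET_KEY`, Python's `secrets` module is one way to produce a long random string. The helper name and the 48-byte length below are arbitrary choices, not project requirements.

```python
import secrets


def generate_session_secret(num_bytes: int = 48) -> str:
    """Return a URL-safe random string built from `num_bytes` of randomness."""
    return secrets.token_urlsafe(num_bytes)


secret = generate_session_secret()
env_line = f"SESSION_SECRET_KEY={secret}"  # paste this line into .env
```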
- `viewer`: can view dashboard + non-admin APIs
- `admin`: can also call `/api/admin/*` endpoints
This repo includes a simple Inventory page for tracking generic assets (e.g. switches, PDUs, storage systems).
- Page: `GET /inventory`
- APIs:
  - `GET /api/inventory`: list inventory items (any authenticated user)
  - `POST /api/admin/inventory`: add item (admin only)
  - `DELETE /api/admin/inventory/{item_id}`: remove item (admin only)

Inventory items are stored in SQLite alongside the metrics DB (same `METRICS_DB_PATH`).
The dashboard has a Problems view (/#problems) that shows:
- Current problems derived from host checks + protocol checks
- Recent structured per-host events (failures/recoveries)
API:
- `GET /api/events/recent`: last ~500 in-memory host events (resets on server restart)
Project UI screenshots are stored under docs/screenshots/.
To regenerate them locally, use the helper script:
scripts/take_screenshots.py
It requires Playwright (Chromium). The script reads credentials from env vars so you don't hardcode passwords:
- `System Trace_USER`
- `System Trace_PASS`
Users are stored in the users table inside AUTH_DB_PATH (default data/auth.db).
If you don't have the sqlite3 CLI installed, you can use the included helper script instead:
cd ai-system-health-dashboard
python scripts/manage_users.py list
python scripts/manage_users.py create --username admin --role admin --password 'changeme'
python scripts/manage_users.py set-password --username admin --password 'newpass'
python scripts/manage_users.py set-role --username viewer --role admin

- Generate a PBKDF2 hash for a password:
cd ai-system-health-dashboard
python -c "from app.auth_storage import auth_storage; print(auth_storage.hash_password('your-password-here'))"

- Insert the user (example admin):

sqlite3 data/auth.db "INSERT INTO users (username, password_hash, role, is_active, created_at) VALUES ('admin', '<PASTE_HASH_HERE>', 'admin', 1, strftime('%s','now'));"

If the DB file doesn't exist yet, start the server once (it will create the table), then insert the user.
- `GET /api/metrics/latest`: latest sample
  - Unauthenticated: `401`
  - `viewer`/`admin`: `200`
- `GET /api/metrics/history?seconds=300`
- `GET /api/insights`
- WebSocket: `ws://localhost:8000/ws/metrics`
- `GET /api/users`: List all users
- `POST /api/users`: Create new user
- `PUT /api/users/{user_id}`: Update user
- `DELETE /api/users/{user_id}`: Delete user
- `GET /api/user-groups`: List all user groups
- `POST /api/user-groups`: Create user group
- `PUT /api/user-groups/{group_id}`: Update user group
- `DELETE /api/user-groups/{group_id}`: Delete user group
- `GET /api/discovery/results`: Get discovery results
- `GET /api/agent/files/{os_type}`: Get agent files for OS type
- `POST /api/agent/deploy/{host_ip}`: Deploy agent to host
Admin (SQLite):
- `GET /api/admin/db`: DB stats (rows, time range, file size)
  - Unauthenticated: `401`
  - `viewer`: `403`
  - `admin`: `200`
- `POST /api/admin/db/prune`: delete rows older than `SQLITE_RETENTION_SECONDS`
- `POST /api/admin/db/vacuum`: rebuild DB file (may pause briefly)
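
Pruning by retention amounts to deleting rows whose timestamp is older than a cutoff. The sketch below assumes a `samples` table with a Unix-time `ts` column; both names are hypothetical, and the real schema may differ.

```python
import sqlite3
import time
from typing import Optional


def prune_samples(conn: sqlite3.Connection, retention_seconds: int,
                  now: Optional[float] = None) -> int:
    """Delete rows older than the retention window; return how many were removed."""
    if retention_seconds <= 0:
        return 0  # 0 means keep forever
    cutoff = (now if now is not None else time.time()) - retention_seconds
    cursor = conn.execute("DELETE FROM samples WHERE ts < ?", (cutoff,))
    conn.commit()
    return cursor.rowcount


demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE samples (ts REAL, cpu REAL)")
demo.executemany("INSERT INTO samples VALUES (?, ?)", [(0, 10.0), (100, 20.0), (200, 30.0)])
demo.commit()

kept_forever = prune_samples(demo, retention_seconds=0, now=250)   # nothing deleted
deleted = prune_samples(demo, retention_seconds=150, now=250)      # cutoff is ts < 100
remaining = demo.execute("SELECT COUNT(*) FROM samples").fetchone()[0]
```

Deleting rows does not shrink the database file by itself, which is why the vacuum endpoint exists as a separate step.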
The dashboard now includes comprehensive network monitoring and agent deployment capabilities:
- Auto-Discovery: Scans network ranges to discover hosts
- Multi-Platform Support: Ubuntu, Debian, RHEL, CentOS, Rocky Linux
- Non-Root Deployment: Secure agent deployment with minimal privileges
- SNMP Monitoring: Full SNMP v2c support with automatic configuration
- NTP Synchronization: Time service management and monitoring
python scripts/auto_discover_hosts.py
bash scripts/deploy/deploy_non_root_centos_docker.sh

- Hosts Dashboard: http://localhost:8001/hosts
  - View discovered hosts and deployment status
  - Generate deployment commands automatically
- System Metrics: CPU, memory, disk, network monitoring
- SNMP Integration: Full v2c support with automatic configuration
- NTP Sync: Time service configuration and monitoring
- Service Management: Systemd integration with auto-restart
- Security: Non-root deployment with minimal sudo privileges
- Firewall Config: Automatic port opening (SNMP:161, NTP:123)
| OS | Package Manager | SNMP Service | NTP Service |
|---|---|---|---|
| Ubuntu | apt | snmpd | ntp |
| Debian | apt | snmpd | ntp |
| RHEL | dnf | snmpd | chronyd |
| CentOS | yum | snmpd | chronyd |
| Rocky Linux | dnf | snmpd | chronyd |
Environment variables for network monitoring:
- `SNMP_HOST` - SNMP target host (default: empty)
- `SNMP_PORT` - SNMP port (default: 161)
- `SNMP_COMMUNITY` - SNMP community string (default: public)
- `SNMP_TIMEOUT_SECONDS` - SNMP timeout (default: 5)
- `ICMP_HOST` - ICMP target host (default: 1.1.1.1)
- `ICMP_TIMEOUT_SECONDS` - ICMP timeout (default: 3)
- `NTP_SERVER` - NTP server (default: pool.ntp.org)
- `PROTOCOL_CHECK_INTERVAL_SECONDS` - Check interval (default: 30)
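
As an illustration of what a protocol check against `NTP_SERVER` can look like, the sketch below sends a minimal SNTP client request over UDP and reads the server's transmit timestamp. The function names and the 3-second timeout are assumptions; this is not the project's `protocols` module.

```python
import socket
import struct
import time
from typing import Optional

NTP_EPOCH_OFFSET = 2_208_988_800  # seconds between 1900-01-01 and 1970-01-01


def build_sntp_request() -> bytes:
    """48-byte client request: LI=0, version 3, mode 3 (client) -> first byte 0x1B."""
    return b"\x1b" + 47 * b"\x00"


def parse_sntp_response(packet: bytes) -> float:
    """Return the server transmit timestamp as Unix time."""
    if len(packet) < 48:
        raise ValueError("short NTP packet")
    seconds, fraction = struct.unpack("!II", packet[40:48])  # transmit timestamp field
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def check_ntp(server: str = "pool.ntp.org", timeout: float = 3.0) -> Optional[float]:
    """Return the approximate clock offset in seconds, or None if the check fails."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(build_sntp_request(), (server, 123))
            packet, _addr = sock.recvfrom(512)
        return parse_sntp_response(packet) - time.time()
    except (OSError, ValueError):
        return None  # DNS failure, timeout, or malformed reply


# Offline demonstration with a synthetic reply (no network access needed).
fake_reply = bytes(40) + struct.pack("!II", NTP_EPOCH_OFFSET + 1_700_000_000, 2**31)
fake_time = parse_sntp_response(fake_reply)
```

The offset returned by `check_ntp` ignores network round-trip delay, so treat it as a reachability and sanity check rather than a precise measurement.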
Example configuration:
SNMP_HOST=<YOUR_AGENT_HOST>
SNMP_COMMUNITY=public
ICMP_HOST=<YOUR_AGENT_HOST>
NTP_SERVER=pool.ntp.org
PROTOCOL_CHECK_INTERVAL_SECONDS=30

# Test SNMP connectivity
snmpwalk -v2c -c public <YOUR_AGENT_HOST> 1.3.6.1.2.1.1.1.0
# Check SNMP service
systemctl status snmpd

# Check NTP status
chronyc sources
# or
ntpq -p
# Force NTP sync
chronyc -a makestep

# Check agent service
systemctl status system-trace-agent
# Check agent logs
journalctl -u system-trace-agent -f
# Test agent manually
sudo -u system-trace-agent python3 /home/system-trace-agent/system-trace-agent/system-trace_agent.py

- Agent Deployment Guide: `docs/guides/AGENT_DEPLOYMENT_GUIDE.md`
- Non-Root Deployment: `docs/guides/NON_ROOT_DEPLOYMENT_GUIDE.md`
- Network Monitoring: `docs/guides/NETWORK_MONITORING_GUIDE.md`
- SNMP Configuration: `docs/guides/SNMP_CONFIGURATION.md`
- CentOS/Docker Deployment: `docs/guides/CENTOS_DOCKER_DEPLOYMENT.md`
| Script | Purpose |
|---|---|
| `scripts/auto_discover_hosts.py` | Network host discovery |
| `scripts/deploy_agents_non_root.py` | Non-root agent deployment |
| `scripts/quick_deploy_agent.py` | Quick single-host deployment |
| `scripts/setup_snmp.py` | SNMP configuration |
| `scripts/test_snmp_devices.py` | SNMP connectivity testing |
| `scripts/manage_users.py` | User management CLI |
| `scripts/take_screenshots.py` | Capture UI screenshots |
| `scripts/deploy/` | Per-host deploy shell scripts |
| `scripts/fix/` | One-off fix scripts |







