Version: 4.0.0 Last Updated: December 2025 Audience: Platform Administrators, SRE Teams, DevOps Engineers
- Introduction
- Daily Operations
- Monitoring and Alerting
- Scaling Operations
- Backup and Recovery
- Secret Management
- User Management
- Certificate Management
- Cost Management
- Security Operations
- Maintenance Windows
- Incident Response
- Runbook: Common Procedures
This Administrator Guide provides everything you need to operate and maintain the Three Horizons platform on a day-to-day basis. It covers routine tasks, monitoring, troubleshooting, and incident response.
💡 Different from Other Guides
- Deployment Guide: How to install the platform (one-time)
- Architecture Guide: How the platform is designed (reference)
- Administrator Guide (this): How to operate the platform (daily)
- Troubleshooting Guide: How to fix specific problems (when issues occur)
| Role | What You'll Learn |
|---|---|
| Platform Administrators | Day-to-day platform operations |
| SRE Engineers | Reliability and monitoring |
| DevOps Engineers | CI/CD and deployment operations |
| Security Engineers | Security operations and compliance |
| On-Call Engineers | Incident response procedures |
Keep this handy for daily operations:
💡 Why Daily Health Checks?
Catching problems early prevents outages. A 5-minute daily check can prevent hours of emergency response later.
Run this script every morning:
#!/bin/bash
# Save as: daily-health-check.sh
# Run with: ./daily-health-check.sh
echo ""
echo "╔══════════════════════════════════════════════════════════════════╗"
echo "║ THREE HORIZONS DAILY HEALTH CHECK ║"
echo "║ $(date) ║"
echo "╚══════════════════════════════════════════════════════════════════╝"
echo ""
# ─────────────────────────────────────────────────────────────────────────────
# SECTION 1: CLUSTER HEALTH
# ─────────────────────────────────────────────────────────────────────────────
echo "┌─────────────────────────────────────────────────────────────────────┐"
echo "│ 1. CLUSTER NODES │"
echo "└─────────────────────────────────────────────────────────────────────┘"
echo ""
# Get node status
NODE_STATUS=$(kubectl get nodes --no-headers 2>/dev/null)
if [ $? -ne 0 ]; then
echo " ❌ ERROR: Cannot connect to cluster!"
echo " → Run: az aks get-credentials --resource-group <rg> --name <aks>"
exit 1
fi
# Count nodes by status
TOTAL_NODES=$(echo "$NODE_STATUS" | wc -l | tr -d ' ')
READY_NODES=$(echo "$NODE_STATUS" | grep -c " Ready ")
NOT_READY=$(echo "$NODE_STATUS" | grep -c -v " Ready ")
if [ "$READY_NODES" -eq "$TOTAL_NODES" ]; then
echo " ✅ All $TOTAL_NODES nodes are Ready"
else
echo " ⚠️ $NOT_READY of $TOTAL_NODES nodes are NOT Ready!"
echo ""
kubectl get nodes | grep -v " Ready "
fi
echo ""
# ─────────────────────────────────────────────────────────────────────────────
# SECTION 2: PROBLEM PODS
# ─────────────────────────────────────────────────────────────────────────────
echo "┌─────────────────────────────────────────────────────────────────────┐"
echo "│ 2. PROBLEM PODS │"
echo "└─────────────────────────────────────────────────────────────────────┘"
echo ""
# Find pods not in Running or Succeeded state
PROBLEM_PODS=$(kubectl get pods -A --no-headers 2>/dev/null | grep -v -E "Running|Completed|Succeeded")
PROBLEM_COUNT=$(echo "$PROBLEM_PODS" | grep -c .)  # grep -c already prints 0 on no match
if [ -z "$PROBLEM_PODS" ] || [ "$PROBLEM_COUNT" -eq 0 ]; then
echo " ✅ No problem pods found"
else
echo " ⚠️ Found $PROBLEM_COUNT problem pods:"
echo ""
echo " NAMESPACE NAME STATUS"
echo " ─────────────────────────────────────────────────────────────────"
echo "$PROBLEM_PODS" | head -10 | awk '{printf " %-20s %-35s %s\n", $1, $2, $4}'
if [ "$PROBLEM_COUNT" -gt 10 ]; then
echo " ... and $((PROBLEM_COUNT - 10)) more"
fi
fi
echo ""
# ─────────────────────────────────────────────────────────────────────────────
# SECTION 3: ARGOCD APPLICATIONS
# ─────────────────────────────────────────────────────────────────────────────
echo "┌─────────────────────────────────────────────────────────────────────┐"
echo "│ 3. ARGOCD APPLICATIONS │"
echo "└─────────────────────────────────────────────────────────────────────┘"
echo ""
APPS=$(kubectl get applications -n argocd --no-headers 2>/dev/null)
if [ -z "$APPS" ]; then
echo " ℹ️ No ArgoCD applications found (ArgoCD may not be installed)"
else
SYNCED=$(echo "$APPS" | grep -c "Synced.*Healthy")  # grep -c prints 0 itself; appending || echo "0" would corrupt the count
TOTAL_APPS=$(echo "$APPS" | wc -l | tr -d ' ')
if [ "$SYNCED" -eq "$TOTAL_APPS" ]; then
echo " ✅ All $TOTAL_APPS applications are Synced and Healthy"
else
echo " ⚠️ Some applications need attention:"
echo ""
echo " APPLICATION SYNC HEALTH"
echo " ─────────────────────────────────────────────────────────────────"
kubectl get applications -n argocd --no-headers | grep -v "Synced.*Healthy" | \
awk '{printf " %-24s %-10s %s\n", $1, $2, $3}'
fi
fi
echo ""
# ─────────────────────────────────────────────────────────────────────────────
# SECTION 4: RESOURCE USAGE
# ─────────────────────────────────────────────────────────────────────────────
echo "┌─────────────────────────────────────────────────────────────────────┐"
echo "│ 4. RESOURCE USAGE │"
echo "└─────────────────────────────────────────────────────────────────────┘"
echo ""
# Check if metrics-server is available
kubectl top nodes &>/dev/null
if [ $? -ne 0 ]; then
echo " ℹ️ Metrics server not available (kubectl top won't work)"
else
echo " NODE CPU MEMORY"
echo " ─────────────────────────────────────────────────────────────────"
kubectl top nodes --no-headers 2>/dev/null | \
awk '{
cpu_pct = $3 + 0; mem_pct = $5 + 0;   # "+ 0" strips the trailing "%" so the comparison is numeric
cpu_warn = (cpu_pct > 80) ? "⚠️" : "✓";
mem_warn = (mem_pct > 85) ? "⚠️" : "✓";
printf " %-24s %s %-6s %s %-6s\n", $1, cpu_warn, $3, mem_warn, $5
}'
fi
echo ""
# ─────────────────────────────────────────────────────────────────────────────
# SECTION 5: RECENT WARNINGS
# ─────────────────────────────────────────────────────────────────────────────
echo "┌─────────────────────────────────────────────────────────────────────┐"
echo "│ 5. RECENT WARNING EVENTS (Last 1 Hour) │"
echo "└─────────────────────────────────────────────────────────────────────┘"
echo ""
# kubectl retains events for about an hour by default, so the newest entries cover roughly the last hour
WARNINGS=$(kubectl get events -A --field-selector type=Warning \
--sort-by='.lastTimestamp' 2>/dev/null | tail -5)
if [ -z "$WARNINGS" ]; then
echo " ✅ No recent warning events"
else
echo " Recent warnings:"
echo ""
echo "$WARNINGS" | awk 'NR>1 {printf " • [%s] %s: %s\n", $1, $5, $7}'
fi
echo ""
# ─────────────────────────────────────────────────────────────────────────────
# SECTION 6: EXTERNAL SECRETS
# ─────────────────────────────────────────────────────────────────────────────
echo "┌─────────────────────────────────────────────────────────────────────┐"
echo "│ 6. EXTERNAL SECRETS STATUS │"
echo "└─────────────────────────────────────────────────────────────────────┘"
echo ""
ES_STATUS=$(kubectl get externalsecrets -A --no-headers 2>/dev/null)
if [ -z "$ES_STATUS" ]; then
echo " ℹ️ No External Secrets configured"
else
SYNCED_ES=$(echo "$ES_STATUS" | grep -c "SecretSynced")  # grep -c prints 0 on no match
TOTAL_ES=$(echo "$ES_STATUS" | wc -l | tr -d ' ')
if [ "$SYNCED_ES" -eq "$TOTAL_ES" ]; then
echo " ✅ All $TOTAL_ES External Secrets are synced"
else
echo " ⚠️ Some External Secrets need attention:"
echo "$ES_STATUS" | grep -v "SecretSynced" | \
awk '{printf " • %s/%s - Status: %s\n", $1, $2, $4}'
fi
fi
echo ""
# ─────────────────────────────────────────────────────────────────────────────
# SUMMARY
# ─────────────────────────────────────────────────────────────────────────────
echo "╔══════════════════════════════════════════════════════════════════╗"
echo "║ HEALTH CHECK COMPLETE ║"
echo "╚══════════════════════════════════════════════════════════════════╝"
echo ""Understanding the output:
| Symbol | Meaning | Action Required |
|---|---|---|
| ✅ | Everything OK | None |
| ⚠️ | Warning | Investigate soon |
| ❌ | Error | Investigate immediately |
| ℹ️ | Information | For your awareness |
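To make the morning check harder to forget, the script can run from cron with the report mailed out. A minimal sketch; the install path and a configured `mail` command are assumptions about your environment:

```bash
# Weekdays at 08:00: run the health check and mail the report
# (the script path and MTA are assumptions; adjust for your environment)
0 8 * * 1-5  /opt/scripts/daily-health-check.sh 2>&1 | mail -s "Three Horizons daily health check" platform-team@company.com
```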
💡 What Metrics Should I Watch?
These are the most important metrics that indicate platform health. Set up alerts for these in Grafana/Prometheus.
| Metric | Normal Range | Warning Threshold | Critical Threshold | What It Means |
|---|---|---|---|---|
| Node CPU | < 70% | > 80% | > 90% | Nodes are overloaded |
| Node Memory | < 75% | > 85% | > 95% | Risk of OOM kills |
| Pod Restarts | 0-2/hour | > 5/hour | > 10/hour | Application instability |
| API Server Latency | < 200ms | > 500ms | > 1s | Control plane issues |
| Failed Pods | 0 | > 0 | > 5 | Application failures |
| PV Usage | < 70% | > 80% | > 90% | Storage running out |
| Certificate Expiry | > 30 days | < 30 days | < 7 days | TLS cert expiring |
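Two thresholds in this table (PV usage and certificate expiry) are not covered by the PromQL examples later in this guide. Hedged queries for them, assuming kubelet volume stats and cert-manager metrics are scraped by your Prometheus:

```promql
# PV usage percentage per PersistentVolumeClaim
100 * kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes

# Days until each cert-manager certificate expires
(certmanager_certificate_expiration_timestamp_seconds - time()) / 86400
```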
Print this and check off each item:

- [ ] All nodes Ready
- [ ] No pods in CrashLoopBackOff, Pending, or Error
- [ ] All ArgoCD applications Synced and Healthy
- [ ] Node CPU below 80% and memory below 85%
- [ ] No unexpected Warning events
- [ ] All External Secrets synced
- [ ] Latest Velero backup completed
💡 How to Access Dashboards
All monitoring tools run inside the Kubernetes cluster. You access them by "port-forwarding" - creating a tunnel from your computer to the service.
# Step 1: Start port-forward to Grafana
kubectl port-forward svc/prometheus-grafana -n observability 3000:80
# Step 2: Get the admin password
kubectl get secret prometheus-grafana -n observability \
-o jsonpath="{.data.admin-password}" | base64 -d && echo
# Step 3: Open browser to http://localhost:3000
# Username: admin
# Password: (from step 2)

What you'll see in Grafana: the kube-prometheus-stack ships with prebuilt dashboards for cluster, node, and pod resource usage, alongside any custom dashboards your team has added.
Access Prometheus:

# Start port-forward to Prometheus
kubectl port-forward svc/prometheus-kube-prometheus-prometheus -n observability 9090:9090
# Open browser to http://localhost:9090

Useful PromQL Queries:
# CPU usage by node (percentage)
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage by node (percentage)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Pod restart count in last hour
increase(kube_pod_container_status_restarts_total[1h])
# HTTP request rate by service
sum(rate(http_requests_total[5m])) by (service)
# HTTP error rate (5xx) by service
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
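The same queries can be run from a shell through the port-forward via Prometheus's HTTP API; a sketch (`jq` is assumed for readability):

```bash
# Instant query: current 5xx rate by service
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)' \
  | jq '.data.result'
```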
Access ArgoCD:

# Start port-forward to ArgoCD
kubectl port-forward svc/argocd-server -n argocd 8080:443
# Get admin password
kubectl -n argocd get secret argocd-initial-admin-secret \
-o jsonpath="{.data.password}" | base64 -d && echo
# Open browser to https://localhost:8080
# Username: admin
# Password: (from above command)

💡 How Alerts Work
- Prometheus evaluates alert rules continuously
- When a rule matches, Prometheus fires an alert to Alertmanager
- Alertmanager groups similar alerts and routes them
- You receive notification via Slack, PagerDuty, email, etc.
Alert Severity Levels:
| Severity | Response Time | Who to Notify | Examples |
|---|---|---|---|
| Critical | Immediate (< 15 min) | On-call + escalation | Platform down, data loss risk |
| Warning | Same day (< 4 hours) | On-call | High CPU, approaching limits |
| Info | Next business day | Team channel | FYI events, non-urgent |
Configuring Alert Routes:
File: prometheus/alertmanager-config.yaml
# This configures WHERE alerts go based on severity
route:
# Default receiver
receiver: 'slack-notifications'
# Group alerts by these labels
group_by: ['alertname', 'namespace']
# Wait before sending (to group similar alerts)
group_wait: 30s
# How long to wait before sending new alerts in the same group
group_interval: 5m
# How often to resend alerts that are still firing
repeat_interval: 4h
# Child routes - more specific matching
routes:
# Critical alerts go to PagerDuty
- match:
severity: critical
receiver: 'pagerduty'
repeat_interval: 15m
# Warning alerts go to Slack
- match:
severity: warning
receiver: 'slack-notifications'
# Info alerts are logged only
- match:
severity: info
receiver: 'null'
receivers:
# PagerDuty for critical alerts
- name: 'pagerduty'
pagerduty_configs:
- service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
description: '{{ .GroupLabels.alertname }}'
# Slack for warnings
- name: 'slack-notifications'
slack_configs:
- api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
channel: '#platform-alerts'
title: '{{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
# Null receiver (drops alerts)
- name: 'null'

You can validate this file with `amtool check-config` before applying it.

Example: Alert when HTTP error rate exceeds 1%
# Add to prometheus/alerting-rules.yaml
groups:
- name: application-alerts
rules:
# Alert: High HTTP Error Rate
- alert: HighHTTPErrorRate
# PromQL expression that triggers the alert
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
) > 0.01
# How long condition must be true before alerting
for: 5m
# Labels for routing
labels:
severity: warning
team: platform
# Human-readable information
annotations:
summary: "High HTTP error rate for {{ $labels.service }}"
description: |
Service {{ $labels.service }} has an error rate of
{{ $value | humanizePercentage }} (threshold: 1%).
Runbook: https://wiki.company.com/runbooks/high-error-rate
dashboard: "https://grafana.company.com/d/api-dashboard"Alert Best Practices:
| Do | Don't |
|---|---|
| ✅ Set a meaningful `for` duration (avoid flapping) | ❌ Alert on every blip |
| ✅ Include runbook links in annotations | ❌ Leave operators guessing |
| ✅ Route by severity to appropriate channels | ❌ Send everything to PagerDuty |
| ✅ Alert on symptoms (error rate high) | ❌ Alert on causes (CPU high) |
| ✅ Test alerts before deploying | ❌ Find out alerts don't work during an incident |
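One way to honor the last row: fire a synthetic alert at Alertmanager and confirm it lands in the right channel. A sketch assuming the kube-prometheus-stack default service name in the observability namespace:

```bash
# Tunnel to Alertmanager (service name is an assumption; adjust for your install)
kubectl port-forward svc/alertmanager-operated -n observability 9093:9093 &

# Post a synthetic alert to the v2 API; it should route like a real warning
curl -XPOST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "RoutingTest", "severity": "warning"},
        "annotations": {"description": "Synthetic test alert - safe to ignore"}}]'

# Alerts posted without an endsAt auto-resolve after Alertmanager's resolve_timeout
```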
💡 Types of Autoscaling
The platform supports three types of autoscaling:
- HPA (Horizontal Pod Autoscaler): Scales pods within a deployment
- VPA (Vertical Pod Autoscaler): Adjusts pod resource requests
- Cluster Autoscaler: Adds/removes nodes from the cluster (enabled per node pool; see the sketch below)
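On AKS, the Cluster Autoscaler is configured per node pool. A minimal sketch using this guide's example resource names:

```bash
# Enable autoscaling on the workload pool with explicit bounds
az aks nodepool update \
  --resource-group rg-threehorizons-dev \
  --cluster-name aks-threehorizons-dev \
  --name workload \
  --enable-cluster-autoscaler \
  --min-count 3 \
  --max-count 10

# Adjust the bounds later without toggling autoscaling off
az aks nodepool update \
  --resource-group rg-threehorizons-dev \
  --cluster-name aks-threehorizons-dev \
  --name workload \
  --update-cluster-autoscaler \
  --min-count 3 \
  --max-count 20
```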
⚠️ When to Scale Manually

Usually, let the Cluster Autoscaler handle scaling. Manual scaling is for:
- Preparing for known traffic spikes
- Cost optimization (scaling down during off-hours)
- Emergency situations
Scale cluster nodes:
# View current node count
kubectl get nodes | wc -l
# Scale the workload node pool
az aks nodepool scale \
--resource-group rg-threehorizons-dev \
--cluster-name aks-threehorizons-dev \
--name workload \
--node-count 5
# Verify new nodes are Ready
watch kubectl get nodes

Understanding the scaling process: Azure provisions the new VMs, they join the cluster, and the scheduler begins placing pods on them once they report Ready; expect a few minutes per node.
Create an HPA for a deployment:
# Example: Autoscale based on CPU usage
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: my-app-hpa
namespace: production
spec:
# Target deployment
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: my-app
# Min and max replicas
minReplicas: 3
maxReplicas: 20
# Scaling triggers
metrics:
# Scale when CPU usage > 70%
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
# Scale when memory usage > 80%
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
# Scale-down settings (prevent flapping)
behavior:
scaleDown:
stabilizationWindowSeconds: 300 # Wait 5 min before scaling down
policies:
- type: Percent
value: 50 # Max 50% reduction at once
periodSeconds: 60

Apply and verify:
# Apply the HPA
kubectl apply -f my-app-hpa.yaml
# Check HPA status
kubectl get hpa -n production
# Watch HPA in action
kubectl get hpa -n production -w

Expected output:
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
my-app-hpa Deployment/my-app 23%/70% 3 20 5 2m

The TARGETS column shows current/target utilization: here CPU is at 23% against the 70% threshold.
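To watch the HPA react, you can generate sustained load, in the style of the Kubernetes HPA walkthrough. A sketch assuming a Service named my-app listening on port 80 in the production namespace:

```bash
# Hammer the service from a throwaway pod
kubectl run load-generator -n production --rm -it --restart=Never --image=busybox:1.36 -- \
  /bin/sh -c "while true; do wget -q -O- http://my-app; done"

# In another terminal, watch replicas climb once CPU passes 70%
kubectl get hpa my-app-hpa -n production -w
```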
💡 Backup Strategy
We use a "belt and suspenders" approach:
- Terraform state: In Azure Storage (versioned)
- Git: All configs in version control
- Velero: Kubernetes resources and PV snapshots
- Azure Backup: Managed services (PostgreSQL, etc.)
Check backup status:
# List all backups
velero backup get
# Check backup details
velero backup describe <backup-name>
# Check backup logs
velero backup logs <backup-name>

Create manual backup:
# Backup everything
velero backup create full-backup-$(date +%Y%m%d)
# Backup specific namespace
velero backup create myapp-backup-$(date +%Y%m%d) \
--include-namespaces myapp-production
# Backup with specific labels
velero backup create critical-apps-$(date +%Y%m%d) \
--selector app.kubernetes.io/part-of=critical
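Manual backups cover ad-hoc needs; the recurring backups the strategy above depends on come from Velero schedules. A sketch; the cron expression and retention are assumptions to adapt:

```bash
# Nightly full backup at 02:00, retained 30 days (720h)
velero schedule create daily-full \
  --schedule "0 2 * * *" \
  --ttl 720h

# Confirm the schedule and the backups it produces
velero schedule get
velero backup get --selector velero.io/schedule-name=daily-full
```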
⚠️ Before Restoring
- Communicate to stakeholders that restoration is happening
- Ensure you have the right backup identified
- Decide: restore to same namespace or new namespace?
Restore from backup:
# List available backups
velero backup get
# Restore entire backup
velero restore create --from-backup full-backup-20241210
# Restore specific namespace
velero restore create --from-backup full-backup-20241210 \
--include-namespaces production
# Restore to different namespace (create mapping)
velero restore create --from-backup full-backup-20241210 \
--namespace-mappings production:production-restored
# Check restore status
velero restore describe <restore-name>

Disaster Recovery Runbook (high level, following the backup strategy above):
1. Declare the incident and notify stakeholders.
2. Re-provision infrastructure from Terraform (state is versioned in Azure Storage).
3. Let ArgoCD re-sync platform and application configs from Git.
4. Restore Kubernetes resources and volumes with Velero from the most recent good backup.
5. Validate workloads, DNS, and certificates before declaring recovery complete.
💡 Where Secrets Live
- Azure Key Vault: Source of truth for all secrets
- External Secrets Operator: Syncs secrets to Kubernetes
- Kubernetes Secrets: What applications actually read
Add a new secret:
# Set secret in Key Vault
az keyvault secret set \
--vault-name kv-threehorizons-dev \
--name "database-password" \
--value "super-secret-password-123"
# Verify secret was created
az keyvault secret show \
--vault-name kv-threehorizons-dev \
--name "database-password" \
--query "value"Create ExternalSecret to sync it:
# externalsecret.yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: my-app-secrets
namespace: my-app
spec:
# How often to sync
refreshInterval: 1h
# Which secret store to use
secretStoreRef:
kind: ClusterSecretStore
name: azure-key-vault
# Target Kubernetes Secret
target:
name: my-app-secrets
# What to sync
data:
# Local key name : Key Vault secret name
- secretKey: DATABASE_PASSWORD
remoteRef:
key: database-password
- secretKey: API_KEY
remoteRef:
key: my-app-api-key

Apply and verify:
# Apply the ExternalSecret
kubectl apply -f externalsecret.yaml
# Check sync status
kubectl get externalsecret my-app-secrets -n my-app
# Verify Kubernetes Secret was created
kubectl get secret my-app-secrets -n my-app
# View secret contents (base64 encoded)
kubectl get secret my-app-secrets -n my-app -o jsonpath='{.data.DATABASE_PASSWORD}' | base64 -d

💡 Why Rotate Secrets?
- Compliance requirements
- After personnel changes
- After suspected compromise
- Best practice: every 90 days
Secret rotation procedure:
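A minimal rotation sketch using the examples from this section; the force-sync annotation is the External Secrets Operator's documented way to trigger an immediate reconcile:

```bash
# 1. Write the new value; Key Vault keeps previous versions for rollback
az keyvault secret set \
  --vault-name kv-threehorizons-dev \
  --name "database-password" \
  --value "<new-password>"

# 2. Force an immediate sync instead of waiting for refreshInterval
kubectl annotate externalsecret my-app-secrets -n my-app \
  force-sync=$(date +%s) --overwrite

# 3. Restart consumers so they pick up the new Secret value
kubectl rollout restart deployment/my-app -n my-app
```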
💡 RBAC (Role-Based Access Control)
We use RBAC at two levels:
- Azure RBAC: Who can access Azure resources
- Kubernetes RBAC: Who can access Kubernetes resources
Step 1: Add to Microsoft Entra ID Group
# Get user's Object ID
az ad user show --id "user@company.com" --query id -o tsv
# Add to appropriate group
# For operators:
az ad group member add \
--group "Platform-Operators" \
--member-id "user-object-id"
# For admins:
az ad group member add \
--group "Platform-Admins" \
--member-id "user-object-id"Step 2: Create Kubernetes RoleBinding (if namespace-specific)
# rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: team-alpha-access
namespace: team-alpha
subjects:
# Bind to Microsoft Entra ID group
- kind: Group
name: "team-alpha-developers" # Microsoft Entra ID group name
apiGroup: rbac.authorization.k8s.io
roleRef:
kind: ClusterRole
name: edit # Kubernetes built-in role
apiGroup: rbac.authorization.k8s.io

Step 3: Verify access
# User should run:
az aks get-credentials --resource-group rg-XXX --name aks-XXX
# Test access
kubectl auth can-i create pods -n team-alpha
# Expected: yes
kubectl auth can-i create pods -n other-team
# Expected: no

Remove a user's access:

# Remove from Microsoft Entra ID group
az ad group member remove \
--group "Platform-Operators" \
--member-id "user-object-id"
# Verify removal
az ad group member check \
--group "Platform-Operators" \
--member-id "user-object-id"
# Expected: false
# User's access will be revoked next time they try to authenticate
# For immediate revocation, delete their kubeconfig credentials

| Certificate Type | Purpose | Managed By | Rotation |
|---|---|---|---|
| TLS for Ingress | HTTPS for web apps | cert-manager | Auto (Let's Encrypt) |
| Kubernetes CA | Internal cluster TLS | AKS | Auto (Azure) |
| Key Vault Certs | Custom certificates | Key Vault | Manual or auto |
# List all certificates managed by cert-manager
kubectl get certificates -A
# Check specific certificate details
kubectl describe certificate my-tls-cert -n my-app
# Check certificate expiry
kubectl get certificate -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.notAfter}{"\n"}{end}'

Automatic renewal (cert-manager):
cert-manager automatically renews certificates 30 days before expiry. If renewal fails:
# Check cert-manager logs
kubectl logs -n cert-manager deployment/cert-manager
# Force renewal with the cert-manager CLI
cmctl renew my-tls-cert -n my-app
# Or delete the Certificate (cert-manager recreates it only when it is managed by an Ingress annotation or reapplied from Git)
kubectl delete certificate my-tls-cert -n my-app
# Or delete the secret to trigger re-issuance
kubectl delete secret my-tls-cert -n my-app

Manual renewal (Key Vault certificates):
# Check certificate expiry
az keyvault certificate show \
--vault-name kv-XXX \
--name my-cert \
--query "attributes.expires"
# Create new version (CSR or import)
az keyvault certificate create \
--vault-name kv-XXX \
--name my-cert \
--policy @cert-policy.json

💡 Major Cost Drivers
In order of typical impact:
- AKS Node VMs: 60-70% of cost
- Azure OpenAI: Variable based on usage
- Storage: Disks, blobs, logs
- Network: Egress traffic
Check current spend:
# Get the last 30 days of spend
# (date -v-30d is BSD/macOS syntax; on GNU/Linux use: date -d '30 days ago' +%Y-%m-%d)
az consumption usage list \
--subscription $SUBSCRIPTION_ID \
--start-date $(date -v-30d +%Y-%m-%d) \
--end-date $(date +%Y-%m-%d) \
--query "[].{Name:instanceName, Cost:pretaxCost}" \
--output table

Set up budget alerts:
# Create budget with alerts
az consumption budget create \
--budget-name "platform-monthly" \
--amount 5000 \
--category Cost \
--time-grain Monthly \
--start-date 2024-01-01 \
--end-date 2025-12-31 \
--resource-group rg-threehorizons-dev \
--notification-key-1 80Percent \
--notification-threshold 80 \
--notification-operator GreaterThan \
--contact-emails "platform-team@company.com" \
--notification-enabled true

| Optimization | Savings | Effort | Risk |
|---|---|---|---|
| Spot instances for workload pool | 60-80% | Low | Medium (interruptions) |
| Reserved instances (1 year) | 30-40% | Low | Low |
| Scale down dev at night | 50% | Medium | Low |
| Right-size VMs | 10-30% | Medium | Low |
| Optimize AI model usage | Variable | High | Low |
Implement spot instances:
# In terraform/terraform.tfvars
additional_node_pools = {
spot = {
vm_size = "Standard_D4s_v5"
count = 3
priority = "Spot"
eviction_policy = "Delete"
spot_max_price = -1 # Pay up to on-demand price
}
}

Note: AKS taints spot nodes with `kubernetes.azure.com/scalesetpriority=spot:NoSchedule`; only workloads that tolerate this taint will be scheduled onto the spot pool.

Daily security checks:
# Check Defender for Cloud recommendations
az security assessment list \
--query "[?status.code=='Unhealthy'].{Name:displayName, Status:status.code}" \
--output table
# Check for security events in the past 24 hours
# (date -v-1d is BSD/macOS syntax; on GNU/Linux use: date -d '1 day ago' +%Y-%m-%dT%H:%M:%SZ)
az monitor activity-log list \
--start-time $(date -v-1d +%Y-%m-%dT%H:%M:%SZ) \
--query "[?contains(authorization.action, 'Microsoft.Security')].{Time:eventTimestamp, Action:authorization.action}" \
--output table

| Maintenance Type | Frequency | Window | Duration | Impact |
|---|---|---|---|---|
| AKS Upgrades | Quarterly | Saturday 2-6 AM | 2-4 hours | Rolling (minimal) |
| Node Pool Updates | Monthly | Saturday 2-4 AM | 1-2 hours | Rolling |
| Certificate Rotation | As needed | Any time | Minutes | None |
| Helm Chart Updates | Weekly | Wednesday 10 PM | 30 min | Rolling |
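For the quarterly AKS upgrade window above, a minimal sketch using this guide's example names:

```bash
# See which Kubernetes versions the cluster can move to
az aks get-upgrades \
  --resource-group rg-threehorizons-dev \
  --name aks-threehorizons-dev \
  --output table

# Upgrade the control plane and node pools (rolls nodes one at a time by default)
az aks upgrade \
  --resource-group rg-threehorizons-dev \
  --name aks-threehorizons-dev \
  --kubernetes-version <target-version>
```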
| Level | Definition | Response Time | Examples |
|---|---|---|---|
| SEV1 | Platform down | 15 min | Cluster unreachable, data loss |
| SEV2 | Major degradation | 1 hour | Multiple apps failing, high error rate |
| SEV3 | Minor degradation | 4 hours | Single app affected, non-critical |
| SEV4 | Informational | Next business day | Cosmetic issues, questions |
Restart a failing deployment:

# Check current status
kubectl get deployment my-app -n production
# Restart all pods (rolling)
kubectl rollout restart deployment/my-app -n production
# Watch rollout progress
kubectl rollout status deployment/my-app -n production
# If rollout fails, rollback
kubectl rollout undo deployment/my-app -n production

Force an ArgoCD sync:

# Trigger a forced sync by patching the Application's operation field
kubectl patch application my-app -n argocd --type merge \
-p '{"operation": {"initiatedBy": {"username": "admin"}, "sync": {"syncStrategy": {"apply": {"force": true}}}}}'
# Or use ArgoCD CLI
argocd app sync my-app --force

Drain a node for maintenance:

# Mark node as unschedulable (cordon)
kubectl cordon node-name
# Safely evict pods (drain)
kubectl drain node-name --ignore-daemonsets --delete-emptydir-data
# After maintenance, uncordon
kubectl uncordon node-name

Emergency scale-up:

# Immediately add nodes
az aks nodepool scale \
--resource-group rg-XXX \
--cluster-name aks-XXX \
--name workload \
--node-count 10
# Scale specific deployment
kubectl scale deployment my-app -n production --replicas=10

View application logs:

# Follow logs for a deployment
kubectl logs -f deployment/my-app -n production
# View logs from crashed pod
kubectl logs deployment/my-app -n production --previous
# View logs with timestamps
kubectl logs -f deployment/my-app -n production --timestamps
# View logs from specific container in multi-container pod
kubectl logs -f deployment/my-app -n production -c sidecar-container

Roll back a deployment:

# View rollout history
kubectl rollout history deployment/my-app -n production
# Rollback to previous version
kubectl rollout undo deployment/my-app -n production
# Rollback to specific version
kubectl rollout undo deployment/my-app -n production --to-revision=3
# Verify rollback
kubectl rollout status deployment/my-app -n production

This Administrator Guide covered:
- Daily Operations: Health checks, monitoring, checklists
- Monitoring: Accessing Grafana, Prometheus, ArgoCD
- Scaling: Manual and automatic scaling procedures
- Backup/Recovery: Velero operations, disaster recovery
- Secrets: Key Vault and External Secrets management
- Users: RBAC and access control
- Certificates: TLS and cert-manager
- Costs: Monitoring and optimization
- Security: Monitoring and incident response
- Maintenance: Windows and procedures
- Incidents: Response and escalation
- Runbooks: Common operational procedures
For specific troubleshooting scenarios, see the Troubleshooting Guide.
The platform includes AI agents that assist with day-2 operations:
| Task | Agent | Example Prompt |
|---|---|---|
| Monitoring & alerts | @sre | "Show me the current error rate and latency for all services" |
| Security operations | @security | "Audit RBAC assignments on the AKS cluster" |
| Scaling decisions | @terraform | "Help me scale the AKS nodepool to 5 nodes" |
| Cost optimization | @architect | "Suggest cost savings for our current Azure setup" |
| Pipeline management | @devops | "Help me set up a new GitHub Actions workflow" |
| Documentation updates | @docs | "Update the README to reflect the new module we added" |
Tip: Use @sre as your daily operations companion. It will triage issues, check metrics, and suggest fixes.
| Document | Description |
|---|---|
| Troubleshooting Guide | Diagnostic workflows for common platform issues |
| Performance Tuning Guide | Optimization recommendations for all components |
| Deployment Guide | Step-by-step platform deployment instructions |
| Module Reference | Detailed inputs/outputs for all Terraform modules |
| Runbooks | Operational runbooks for common procedures |
- Review performance tuning: Optimize cluster resources — see Performance Tuning Guide
- Set up runbooks: Familiarize with operational procedures — see Runbooks
- Configure alerting: Set up alerting rules and notification channels
Document Version: 4.0.0 Last Updated: December 2025 Maintainer: Platform Engineering Team