Version: 4.0.0 Last Updated: December 2025 Audience: Platform Administrators, SRE Teams, DevOps Engineers
- Introduction
- Daily Operations
- Monitoring and Alerting
- Scaling Operations
- Backup and Recovery
- Secret Management
- User Management
- Certificate Management
- Cost Management
- Security Operations
- Maintenance Windows
- Incident Response
- Runbook: Common Procedures
This Administrator Guide provides everything you need to operate and maintain the Three Horizons platform on a day-to-day basis. It covers routine tasks, monitoring, troubleshooting, and incident response.
💡 Different from Other Guides
- Deployment Guide: How to install the platform (one-time)
- Architecture Guide: How the platform is designed (reference)
- Administrator Guide (this): How to operate the platform (daily)
- Troubleshooting Guide: How to fix specific problems (when issues occur)
| Role | What You'll Learn |
|---|---|
| Platform Administrators | Day-to-day platform operations |
| SRE Engineers | Reliability and monitoring |
| DevOps Engineers | CI/CD and deployment operations |
| Security Engineers | Security operations and compliance |
| On-Call Engineers | Incident response procedures |
Keep this handy for daily operations:
💡 Why Daily Health Checks?
Catching problems early prevents outages. A 5-minute daily check can prevent hours of emergency response later.
Run this script every morning:
#!/bin/bash
# Save as: daily-health-check.sh
# Run with: ./daily-health-check.sh
echo ""
echo "╔══════════════════════════════════════════════════════════════════╗"
echo "║ THREE HORIZONS DAILY HEALTH CHECK ║"
echo "║ $(date) ║"
echo "╚══════════════════════════════════════════════════════════════════╝"
echo ""
# ─────────────────────────────────────────────────────────────────────────────
# SECTION 1: CLUSTER HEALTH
# ─────────────────────────────────────────────────────────────────────────────
echo "┌─────────────────────────────────────────────────────────────────────┐"
echo "│ 1. CLUSTER NODES │"
echo "└─────────────────────────────────────────────────────────────────────┘"
echo ""
# Get node status
NODE_STATUS=$(kubectl get nodes --no-headers 2>/dev/null)
if [ $? -ne 0 ]; then
echo " ❌ ERROR: Cannot connect to cluster!"
echo " → Run: az aks get-credentials --resource-group <rg> --name <aks>"
exit 1
fi
# Count nodes by status
TOTAL_NODES=$(echo "$NODE_STATUS" | wc -l | tr -d ' ')
READY_NODES=$(echo "$NODE_STATUS" | grep -c " Ready ")
NOT_READY=$(echo "$NODE_STATUS" | grep -c -v " Ready ")
if [ "$READY_NODES" -eq "$TOTAL_NODES" ]; then
echo " ✅ All $TOTAL_NODES nodes are Ready"
else
echo " ⚠️ $NOT_READY of $TOTAL_NODES nodes are NOT Ready!"
echo ""
kubectl get nodes | grep -v " Ready "
fi
echo ""
# ─────────────────────────────────────────────────────────────────────────────
# SECTION 2: PROBLEM PODS
# ─────────────────────────────────────────────────────────────────────────────
echo "┌─────────────────────────────────────────────────────────────────────┐"
echo "│ 2. PROBLEM PODS │"
echo "└─────────────────────────────────────────────────────────────────────┘"
echo ""
# Find pods not in Running or Succeeded state
PROBLEM_PODS=$(kubectl get pods -A --no-headers 2>/dev/null | grep -v -E "Running|Completed|Succeeded")
PROBLEM_COUNT=$(echo "$PROBLEM_PODS" | grep -c .)  # grep -c already prints 0 on no match
if [ -z "$PROBLEM_PODS" ] || [ "$PROBLEM_COUNT" -eq 0 ]; then
echo " ✅ No problem pods found"
else
echo " ⚠️ Found $PROBLEM_COUNT problem pods:"
echo ""
echo " NAMESPACE NAME STATUS"
echo " ─────────────────────────────────────────────────────────────────"
echo "$PROBLEM_PODS" | head -10 | awk '{printf " %-20s %-35s %s\n", $1, $2, $4}'
if [ "$PROBLEM_COUNT" -gt 10 ]; then
echo " ... and $((PROBLEM_COUNT - 10)) more"
fi
fi
echo ""
# ─────────────────────────────────────────────────────────────────────────────
# SECTION 3: ARGOCD APPLICATIONS
# ─────────────────────────────────────────────────────────────────────────────
echo "┌─────────────────────────────────────────────────────────────────────┐"
echo "│ 3. ARGOCD APPLICATIONS │"
echo "└─────────────────────────────────────────────────────────────────────┘"
echo ""
APPS=$(kubectl get applications -n argocd --no-headers 2>/dev/null)
if [ -z "$APPS" ]; then
echo " ℹ️ No ArgoCD applications found (ArgoCD may not be installed)"
else
SYNCED=$(echo "$APPS" | grep -c "Synced.*Healthy")  # grep -c prints 0 itself; appending || echo "0" would corrupt the count
TOTAL_APPS=$(echo "$APPS" | wc -l | tr -d ' ')
if [ "$SYNCED" -eq "$TOTAL_APPS" ]; then
echo " ✅ All $TOTAL_APPS applications are Synced and Healthy"
else
echo " ⚠️ Some applications need attention:"
echo ""
echo " APPLICATION SYNC HEALTH"
echo " ─────────────────────────────────────────────────────────────────"
kubectl get applications -n argocd --no-headers | grep -v "Synced.*Healthy" | \
awk '{printf " %-24s %-10s %s\n", $1, $2, $3}'
fi
fi
echo ""
# ─────────────────────────────────────────────────────────────────────────────
# SECTION 4: RESOURCE USAGE
# ─────────────────────────────────────────────────────────────────────────────
echo "┌─────────────────────────────────────────────────────────────────────┐"
echo "│ 4. RESOURCE USAGE │"
echo "└─────────────────────────────────────────────────────────────────────┘"
echo ""
# Check if metrics-server is available
kubectl top nodes &>/dev/null
if [ $? -ne 0 ]; then
echo " ℹ️ Metrics server not available (kubectl top won't work)"
else
echo " NODE CPU MEMORY"
echo " ─────────────────────────────────────────────────────────────────"
kubectl top nodes --no-headers 2>/dev/null | \
awk '{
cpu_pct = $3 + 0; mem_pct = $5 + 0;   # "+ 0" strips the trailing "%" so the comparison is numeric
cpu_warn = (cpu_pct > 80) ? "⚠️" : "✓";
mem_warn = (mem_pct > 85) ? "⚠️" : "✓";
printf " %-24s %s %-6s %s %-6s\n", $1, cpu_warn, $3, mem_warn, $5
}'
fi
echo ""
# ─────────────────────────────────────────────────────────────────────────────
# SECTION 5: RECENT WARNINGS
# ─────────────────────────────────────────────────────────────────────────────
echo "┌─────────────────────────────────────────────────────────────────────┐"
echo "│ 5. RECENT WARNING EVENTS (Last 1 Hour) │"
echo "└─────────────────────────────────────────────────────────────────────┘"
echo ""
# kubectl retains events for about an hour by default, so the newest entries cover roughly the last hour
WARNINGS=$(kubectl get events -A --field-selector type=Warning \
--sort-by='.lastTimestamp' 2>/dev/null | tail -5)
if [ -z "$WARNINGS" ]; then
echo " ✅ No recent warning events"
else
echo " Recent warnings:"
echo ""
echo "$WARNINGS" | awk 'NR>1 {printf " • [%s] %s: %s\n", $1, $5, $7}'
fi
echo ""
# ─────────────────────────────────────────────────────────────────────────────
# SECTION 6: EXTERNAL SECRETS
# ─────────────────────────────────────────────────────────────────────────────
echo "┌─────────────────────────────────────────────────────────────────────┐"
echo "│ 6. EXTERNAL SECRETS STATUS │"
echo "└─────────────────────────────────────────────────────────────────────┘"
echo ""
ES_STATUS=$(kubectl get externalsecrets -A --no-headers 2>/dev/null)
if [ -z "$ES_STATUS" ]; then
echo " ℹ️ No External Secrets configured"
else
SYNCED_ES=$(echo "$ES_STATUS" | grep -c "SecretSynced")  # grep -c prints 0 on no match
TOTAL_ES=$(echo "$ES_STATUS" | wc -l | tr -d ' ')
if [ "$SYNCED_ES" -eq "$TOTAL_ES" ]; then
echo " ✅ All $TOTAL_ES External Secrets are synced"
else
echo " ⚠️ Some External Secrets need attention:"
echo "$ES_STATUS" | grep -v "SecretSynced" | \
awk '{printf " • %s/%s - Status: %s\n", $1, $2, $4}'
fi
fi
echo ""
# ─────────────────────────────────────────────────────────────────────────────
# SUMMARY
# ─────────────────────────────────────────────────────────────────────────────
echo "╔══════════════════════════════════════════════════════════════════╗"
echo "║ HEALTH CHECK COMPLETE ║"
echo "╚══════════════════════════════════════════════════════════════════╝"
echo ""Understanding the output:
| Symbol | Meaning | Action Required |
|---|---|---|
| ✅ | Everything OK | None |
| ⚠️ | Warning | Investigate soon |
| ❌ | Error | Investigate immediately |
| ℹ️ | Information | For your awareness |
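To make the morning check harder to forget, the script can run from cron with the report mailed out. A minimal sketch; the install path and a configured `mail` command are assumptions about your environment:

```bash
# Weekdays at 08:00: run the health check and mail the report
# (the script path and MTA are assumptions; adjust for your environment)
0 8 * * 1-5  /opt/scripts/daily-health-check.sh 2>&1 | mail -s "Three Horizons daily health check" platform-team@company.com
```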
💡 What Metrics Should I Watch?
These are the most important metrics that indicate platform health. Set up alerts for these in Grafana/Prometheus.
| Metric | Normal Range | Warning Threshold | Critical Threshold | What It Means |
|---|---|---|---|---|
| Node CPU | < 70% | > 80% | > 90% | Nodes are overloaded |
| Node Memory | < 75% | > 85% | > 95% | Risk of OOM kills |
| Pod Restarts | 0-2/hour | > 5/hour | > 10/hour | Application instability |
| API Server Latency | < 200ms | > 500ms | > 1s | Control plane issues |
| Failed Pods | 0 | > 0 | > 5 | Application failures |
| PV Usage | < 70% | > 80% | > 90% | Storage running out |
| Certificate Expiry | > 30 days | < 30 days | < 7 days | TLS cert expiring |
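Two thresholds in this table (PV usage and certificate expiry) are not covered by the PromQL examples later in this guide. Hedged queries for them, assuming kubelet volume stats and cert-manager metrics are scraped by your Prometheus:

```promql
# PV usage percentage per PersistentVolumeClaim
100 * kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes

# Days until each cert-manager certificate expires
(certmanager_certificate_expiration_timestamp_seconds - time()) / 86400
```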
Print this and check off each item:

- [ ] All nodes Ready
- [ ] No pods in CrashLoopBackOff, Pending, or Error
- [ ] All ArgoCD applications Synced and Healthy
- [ ] Node CPU below 80% and memory below 85%
- [ ] No unexpected Warning events
- [ ] All External Secrets synced
- [ ] Latest Velero backup completed
💡 How to Access Dashboards
All monitoring tools run inside the Kubernetes cluster. You access them by "port-forwarding" - creating a tunnel from your computer to the service.
# Step 1: Start port-forward to Grafana
kubectl port-forward svc/prometheus-grafana -n observability 3000:80
# Step 2: Get the admin password
kubectl get secret prometheus-grafana -n observability \
-o jsonpath="{.data.admin-password}" | base64 -d && echo
# Step 3: Open browser to http://localhost:3000
# Username: admin
# Password: (from step 2)

What you'll see in Grafana: the kube-prometheus-stack ships with prebuilt dashboards for cluster, node, and pod resource usage, alongside any custom dashboards your team has added.
Access Prometheus:

# Start port-forward to Prometheus
kubectl port-forward svc/prometheus-kube-prometheus-prometheus -n observability 9090:9090
# Open browser to http://localhost:9090

Useful PromQL Queries:
# CPU usage by node (percentage)
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage by node (percentage)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Pod restart count in last hour
increase(kube_pod_container_status_restarts_total[1h])
# HTTP request rate by service
sum(rate(http_requests_total[5m])) by (service)
# HTTP error rate (5xx) by service
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
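The same queries can be run from a shell through the port-forward via Prometheus's HTTP API; a sketch (`jq` is assumed for readability):

```bash
# Instant query: current 5xx rate by service
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)' \
  | jq '.data.result'
```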
Access ArgoCD:

# Start port-forward to ArgoCD
kubectl port-forward svc/argocd-server -n argocd 8080:443
# Get admin password
kubectl -n argocd get secret argocd-initial-admin-secret \
-o jsonpath="{.data.password}" | base64 -d && echo
# Open browser to https://localhost:8080
# Username: admin
# Password: (from above command)

💡 How Alerts Work
- Prometheus evaluates alert rules continuously
- When a rule matches, Prometheus fires an alert to Alertmanager
- Alertmanager groups similar alerts and routes them
- You receive notification via Slack, PagerDuty, email, etc.
Alert Severity Levels:
| Severity | Response Time | Who to Notify | Examples |
|---|---|---|---|
| Critical | Immediate (< 15 min) | On-call + escalation | Platform down, data loss risk |
| Warning | Same day (< 4 hours) | On-call | High CPU, approaching limits |
| Info | Next business day | Team channel | FYI events, non-urgent |
Configuring Alert Routes:
File: prometheus/alertmanager-config.yaml
# This configures WHERE alerts go based on severity
route:
# Default receiver
receiver: 'slack-notifications'
# Group alerts by these labels
group_by: ['alertname', 'namespace']
# Wait before sending (to group similar alerts)
group_wait: 30s
# How long to wait before sending new alerts in the same group
group_interval: 5m
# How often to resend alerts that are still firing
repeat_interval: 4h
# Child routes - more specific matching
routes:
# Critical alerts go to PagerDuty
- match:
severity: critical
receiver: 'pagerduty'
repeat_interval: 15m
# Warning alerts go to Slack
- match:
severity: warning
receiver: 'slack-notifications'
# Info alerts are logged only
- match:
severity: info
receiver: 'null'
receivers:
# PagerDuty for critical alerts
- name: 'pagerduty'
pagerduty_configs:
- service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
description: '{{ .GroupLabels.alertname }}'
# Slack for warnings
- name: 'slack-notifications'
slack_configs:
- api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
channel: '#platform-alerts'
title: '{{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
# Null receiver (drops alerts)
- name: 'null'

You can validate this file with `amtool check-config` before applying it.

Example: Alert when HTTP error rate exceeds 1%
# Add to prometheus/alerting-rules.yaml
groups:
- name: application-alerts
rules:
# Alert: High HTTP Error Rate
- alert: HighHTTPErrorRate
# PromQL expression that triggers the alert
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
) > 0.01
# How long condition must be true before alerting
for: 5m
# Labels for routing
labels:
severity: warning
team: platform
# Human-readable information
annotations:
summary: "High HTTP error rate for {{ $labels.service }}"
description: |
Service {{ $labels.service }} has an error rate of
{{ $value | humanizePercentage }} (threshold: 1%).
Runbook: https://wiki.company.com/runbooks/high-error-rate
dashboard: "https://grafana.company.com/d/api-dashboard"Alert Best Practices:
| Do | Don't |
|---|---|
| ✅ Set a meaningful `for` duration (avoid flapping) | ❌ Alert on every blip |
| ✅ Include runbook links in annotations | ❌ Leave operators guessing |
| ✅ Route by severity to appropriate channels | ❌ Send everything to PagerDuty |
| ✅ Alert on symptoms (error rate high) | ❌ Alert on causes (CPU high) |
| ✅ Test alerts before deploying | ❌ Find out alerts don't work during an incident |
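One way to honor the last row: fire a synthetic alert at Alertmanager and confirm it lands in the right channel. A sketch assuming the kube-prometheus-stack default service name in the observability namespace:

```bash
# Tunnel to Alertmanager (service name is an assumption; adjust for your install)
kubectl port-forward svc/alertmanager-operated -n observability 9093:9093 &

# Post a synthetic alert to the v2 API; it should route like a real warning
curl -XPOST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "RoutingTest", "severity": "warning"},
        "annotations": {"description": "Synthetic test alert - safe to ignore"}}]'

# Alerts posted without an endsAt auto-resolve after Alertmanager's resolve_timeout
```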
💡 Types of Autoscaling
The platform supports three types of autoscaling:
- HPA (Horizontal Pod Autoscaler): Scales pods within a deployment
- VPA (Vertical Pod Autoscaler): Adjusts pod resource requests
- Cluster Autoscaler: Adds/removes nodes from the cluster (enabled per node pool; see the sketch below)
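On AKS, the Cluster Autoscaler is configured per node pool. A minimal sketch using this guide's example resource names:

```bash
# Enable autoscaling on the workload pool with explicit bounds
az aks nodepool update \
  --resource-group rg-threehorizons-dev \
  --cluster-name aks-threehorizons-dev \
  --name workload \
  --enable-cluster-autoscaler \
  --min-count 3 \
  --max-count 10

# Adjust the bounds later without toggling autoscaling off
az aks nodepool update \
  --resource-group rg-threehorizons-dev \
  --cluster-name aks-threehorizons-dev \
  --name workload \
  --update-cluster-autoscaler \
  --min-count 3 \
  --max-count 20
```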
⚠️ When to Scale Manually

Usually, let the Cluster Autoscaler handle scaling. Manual scaling is for:
- Preparing for known traffic spikes
- Cost optimization (scaling down during off-hours)
- Emergency situations
Scale cluster nodes:
# View current node count
kubectl get nodes | wc -l
# Scale the workload node pool
az aks nodepool scale \
--resource-group rg-threehorizons-dev \
--cluster-name aks-threehorizons-dev \
--name workload \
--node-count 5
# Verify new nodes are Ready
watch kubectl get nodes

Understanding the scaling process: Azure provisions the new VMs, they join the cluster, and the scheduler begins placing pods on them once they report Ready; expect a few minutes per node.
Create an HPA for a deployment:
# Example: Autoscale based on CPU usage
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: my-app-hpa
namespace: production
spec:
# Target deployment
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: my-app
# Min and max replicas
minReplicas: 3
maxReplicas: 20
# Scaling triggers
metrics:
# Scale when CPU usage > 70%
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
# Scale when memory usage > 80%
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
# Scale-down settings (prevent flapping)
behavior:
scaleDown:
stabilizationWindowSeconds: 300 # Wait 5 min before scaling down
policies:
- type: Percent
value: 50 # Max 50% reduction at once
periodSeconds: 60

Apply and verify:
# Apply the HPA
kubectl apply -f my-app-hpa.yaml
# Check HPA status
kubectl get hpa -n production
# Watch HPA in action
kubectl get hpa -n production -w

Expected output:
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
my-app-hpa Deployment/my-app 23%/70% 3 20 5 2m

The TARGETS column shows current/target utilization: here CPU is at 23% against the 70% threshold.
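To watch the HPA react, you can generate sustained load, in the style of the Kubernetes HPA walkthrough. A sketch assuming a Service named my-app listening on port 80 in the production namespace:

```bash
# Hammer the service from a throwaway pod
kubectl run load-generator -n production --rm -it --restart=Never --image=busybox:1.36 -- \
  /bin/sh -c "while true; do wget -q -O- http://my-app; done"

# In another terminal, watch replicas climb once CPU passes 70%
kubectl get hpa my-app-hpa -n production -w
```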
💡 Backup Strategy
We use a "belt and suspenders" approach:
- Terraform state: In Azure Storage (versioned)
- Git: All configs in version control
- Velero: Kubernetes resources and PV snapshots
- Azure Backup: Managed services (PostgreSQL, etc.)
Check backup status:
# List all backups
velero backup get
# Check backup details
velero backup describe <backup-name>
# Check backup logs
velero backup logs <backup-name>

Create manual backup:
# Backup everything
velero backup create full-backup-$(date +%Y%m%d)
# Backup specific namespace
velero backup create myapp-backup-$(date +%Y%m%d) \
--include-namespaces myapp-production
# Backup with specific labels
velero backup create critical-apps-$(date +%Y%m%d) \
--selector app.kubernetes.io/part-of=critical
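Manual backups cover ad-hoc needs; the recurring backups the strategy above depends on come from Velero schedules. A sketch; the cron expression and retention are assumptions to adapt:

```bash
# Nightly full backup at 02:00, retained 30 days (720h)
velero schedule create daily-full \
  --schedule "0 2 * * *" \
  --ttl 720h

# Confirm the schedule and the backups it produces
velero schedule get
velero backup get --selector velero.io/schedule-name=daily-full
```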
⚠️ Before Restoring
- Communicate to stakeholders that restoration is happening
- Ensure you have the right backup identified
- Decide: restore to same namespace or new namespace?
Restore from backup:
# List available backups
velero backup get
# Restore entire backup
velero restore create --from-backup full-backup-20241210
# Restore specific namespace
velero restore create --from-backup full-backup-20241210 \
--include-namespaces production
# Restore to different namespace (create mapping)
velero restore create --from-backup full-backup-20241210 \
--namespace-mappings production:production-restored
# Check restore status
velero restore describe <restore-name>

Disaster Recovery Runbook (high level, following the backup strategy above):
1. Declare the incident and notify stakeholders.
2. Re-provision infrastructure from Terraform (state is versioned in Azure Storage).
3. Let ArgoCD re-sync platform and application configs from Git.
4. Restore Kubernetes resources and volumes with Velero from the most recent good backup.
5. Validate workloads, DNS, and certificates before declaring recovery complete.
💡 Where Secrets Live
- Azure Key Vault: Source of truth for all secrets
- External Secrets Operator: Syncs secrets to Kubernetes
- Kubernetes Secrets: What applications actually read
Add a new secret:
# Set secret in Key Vault
az keyvault secret set \
--vault-name kv-threehorizons-dev \
--name "database-password" \
--value "super-secret-password-123"
# Verify secret was created
az keyvault secret show \
--vault-name kv-threehorizons-dev \
--name "database-password" \
--query "value"Create ExternalSecret to sync it:
# externalsecret.yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: my-app-secrets
namespace: my-app
spec:
# How often to sync
refreshInterval: 1h
# Which secret store to use
secretStoreRef:
kind: ClusterSecretStore
name: azure-key-vault
# Target Kubernetes Secret
target:
name: my-app-secrets
# What to sync
data:
# Local key name : Key Vault secret name
- secretKey: DATABASE_PASSWORD
remoteRef:
key: database-password
- secretKey: API_KEY
remoteRef:
key: my-app-api-key

Apply and verify:
# Apply the ExternalSecret
kubectl apply -f externalsecret.yaml
# Check sync status
kubectl get externalsecret my-app-secrets -n my-app
# Verify Kubernetes Secret was created
kubectl get secret my-app-secrets -n my-app
# View secret contents (base64 encoded)
kubectl get secret my-app-secrets -n my-app -o jsonpath='{.data.DATABASE_PASSWORD}' | base64 -d

💡 Why Rotate Secrets?
- Compliance requirements
- After personnel changes
- After suspected compromise
- Best practice: every 90 days
Secret rotation procedure:
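A minimal rotation sketch using the examples from this section; the force-sync annotation is the External Secrets Operator's documented way to trigger an immediate reconcile:

```bash
# 1. Write the new value; Key Vault keeps previous versions for rollback
az keyvault secret set \
  --vault-name kv-threehorizons-dev \
  --name "database-password" \
  --value "<new-password>"

# 2. Force an immediate sync instead of waiting for refreshInterval
kubectl annotate externalsecret my-app-secrets -n my-app \
  force-sync=$(date +%s) --overwrite

# 3. Restart consumers so they pick up the new Secret value
kubectl rollout restart deployment/my-app -n my-app
```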
💡 RBAC (Role-Based Access Control)
We use RBAC at two levels:
- Azure RBAC: Who can access Azure resources
- Kubernetes RBAC: Who can access Kubernetes resources
Step 1: Add to Microsoft Entra ID Group
# Get user's Object ID
az ad user show --id "user@company.com" --query id -o tsv
# Add to appropriate group
# For operators:
az ad group member add \
--group "Platform-Operators" \
--member-id "user-object-id"
# For admins:
az ad group member add \
--group "Platform-Admins" \
--member-id "user-object-id"Step 2: Create Kubernetes RoleBinding (if namespace-specific)
# rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: team-alpha-access
namespace: team-alpha
subjects:
# Bind to Microsoft Entra ID group
- kind: Group
name: "team-alpha-developers" # Microsoft Entra ID group name
apiGroup: rbac.authorization.k8s.io
roleRef:
kind: ClusterRole
name: edit # Kubernetes built-in role
apiGroup: rbac.authorization.k8s.io

Step 3: Verify access
# User should run:
az aks get-credentials --resource-group rg-XXX --name aks-XXX
# Test access
kubectl auth can-i create pods -n team-alpha
# Expected: yes
kubectl auth can-i create pods -n other-team
# Expected: no

Remove a user's access:

# Remove from Microsoft Entra ID group
az ad group member remove \
--group "Platform-Operators" \
--member-id "user-object-id"
# Verify removal
az ad group member check \
--group "Platform-Operators" \
--member-id "user-object-id"
# Expected: false
# User's access will be revoked next time they try to authenticate
# For immediate revocation, delete their kubeconfig credentials

| Certificate Type | Purpose | Managed By | Rotation |
|---|---|---|---|
| TLS for Ingress | HTTPS for web apps | cert-manager | Auto (Let's Encrypt) |
| Kubernetes CA | Internal cluster TLS | AKS | Auto (Azure) |
| Key Vault Certs | Custom certificates | Key Vault | Manual or auto |
# List all certificates managed by cert-manager
kubectl get certificates -A
# Check specific certificate details
kubectl describe certificate my-tls-cert -n my-app
# Check certificate expiry
kubectl get certificate -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.notAfter}{"\n"}{end}'

Automatic renewal (cert-manager):
cert-manager automatically renews certificates 30 days before expiry. If renewal fails:
# Check cert-manager logs
kubectl logs -n cert-manager deployment/cert-manager
# Force renewal with the cert-manager CLI
cmctl renew my-tls-cert -n my-app
# Or delete the Certificate (cert-manager recreates it only when it is managed by an Ingress annotation or reapplied from Git)
kubectl delete certificate my-tls-cert -n my-app
# Or delete the secret to trigger re-issuance
kubectl delete secret my-tls-cert -n my-app

Manual renewal (Key Vault certificates):
# Check certificate expiry
az keyvault certificate show \
--vault-name kv-XXX \
--name my-cert \
--query "attributes.expires"
# Create new version (CSR or import)
az keyvault certificate create \
--vault-name kv-XXX \
--name my-cert \
--policy @cert-policy.json

💡 Major Cost Drivers
In order of typical impact:
- AKS Node VMs: 60-70% of cost
- Azure OpenAI: Variable based on usage
- Storage: Disks, blobs, logs
- Network: Egress traffic
Check current spend:
# Get the last 30 days of spend
# (date -v-30d is BSD/macOS syntax; on GNU/Linux use: date -d '30 days ago' +%Y-%m-%d)
az consumption usage list \
--subscription $SUBSCRIPTION_ID \
--start-date $(date -v-30d +%Y-%m-%d) \
--end-date $(date +%Y-%m-%d) \
--query "[].{Name:instanceName, Cost:pretaxCost}" \
--output table

Set up budget alerts:
# Create budget with alerts
az consumption budget create \
--budget-name "platform-monthly" \
--amount 5000 \
--category Cost \
--time-grain Monthly \
--start-date 2024-01-01 \
--end-date 2025-12-31 \
--resource-group rg-threehorizons-dev \
--notification-key-1 80Percent \
--notification-threshold 80 \
--notification-operator GreaterThan \
--contact-emails "platform-team@company.com" \
--notification-enabled true

| Optimization | Savings | Effort | Risk |
|---|---|---|---|
| Spot instances for workload pool | 60-80% | Low | Medium (interruptions) |
| Reserved instances (1 year) | 30-40% | Low | Low |
| Scale down dev at night | 50% | Medium | Low |
| Right-size VMs | 10-30% | Medium | Low |
| Optimize AI model usage | Variable | High | Low |
Implement spot instances:
# In terraform/terraform.tfvars
additional_node_pools = {
spot = {
vm_size = "Standard_D4s_v5"
count = 3
priority = "Spot"
eviction_policy = "Delete"
spot_max_price = -1 # Pay up to on-demand price
}
}

Note: AKS taints spot nodes with `kubernetes.azure.com/scalesetpriority=spot:NoSchedule`; only workloads that tolerate this taint will be scheduled onto the spot pool.

Daily security checks:
# Check Defender for Cloud recommendations
az security assessment list \
--query "[?status.code=='Unhealthy'].{Name:displayName, Status:status.code}" \
--output table
# Check for security events in the past 24 hours
# (date -v-1d is BSD/macOS syntax; on GNU/Linux use: date -d '1 day ago' +%Y-%m-%dT%H:%M:%SZ)
az monitor activity-log list \
--start-time $(date -v-1d +%Y-%m-%dT%H:%M:%SZ) \
--query "[?contains(authorization.action, 'Microsoft.Security')].{Time:eventTimestamp, Action:authorization.action}" \
--output table

| Maintenance Type | Frequency | Window | Duration | Impact |
|---|---|---|---|---|
| AKS Upgrades | Quarterly | Saturday 2-6 AM | 2-4 hours | Rolling (minimal) |
| Node Pool Updates | Monthly | Saturday 2-4 AM | 1-2 hours | Rolling |
| Certificate Rotation | As needed | Any time | Minutes | None |
| Helm Chart Updates | Weekly | Wednesday 10 PM | 30 min | Rolling |
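For the quarterly AKS upgrade window above, a minimal sketch using this guide's example names:

```bash
# See which Kubernetes versions the cluster can move to
az aks get-upgrades \
  --resource-group rg-threehorizons-dev \
  --name aks-threehorizons-dev \
  --output table

# Upgrade the control plane and node pools (rolls nodes one at a time by default)
az aks upgrade \
  --resource-group rg-threehorizons-dev \
  --name aks-threehorizons-dev \
  --kubernetes-version <target-version>
```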
| Level | Definition | Response Time | Examples |
|---|---|---|---|
| SEV1 | Platform down | 15 min | Cluster unreachable, data loss |
| SEV2 | Major degradation | 1 hour | Multiple apps failing, high error rate |
| SEV3 | Minor degradation | 4 hours | Single app affected, non-critical |
| SEV4 | Informational | Next business day | Cosmetic issues, questions |
Restart a failing deployment:

# Check current status
kubectl get deployment my-app -n production
# Restart all pods (rolling)
kubectl rollout restart deployment/my-app -n production
# Watch rollout progress
kubectl rollout status deployment/my-app -n production
# If rollout fails, rollback
kubectl rollout undo deployment/my-app -n production

Force an ArgoCD sync:

# Trigger a forced sync by patching the Application's operation field
kubectl patch application my-app -n argocd --type merge \
-p '{"operation": {"initiatedBy": {"username": "admin"}, "sync": {"syncStrategy": {"apply": {"force": true}}}}}'
# Or use ArgoCD CLI
argocd app sync my-app --force

Drain a node for maintenance:

# Mark node as unschedulable (cordon)
kubectl cordon node-name
# Safely evict pods (drain)
kubectl drain node-name --ignore-daemonsets --delete-emptydir-data
# After maintenance, uncordon
kubectl uncordon node-name

Emergency scale-up:

# Immediately add nodes
az aks nodepool scale \
--resource-group rg-XXX \
--cluster-name aks-XXX \
--name workload \
--node-count 10
# Scale specific deployment
kubectl scale deployment my-app -n production --replicas=10

View application logs:

# Follow logs for a deployment
kubectl logs -f deployment/my-app -n production
# View logs from crashed pod
kubectl logs deployment/my-app -n production --previous
# View logs with timestamps
kubectl logs -f deployment/my-app -n production --timestamps
# View logs from specific container in multi-container pod
kubectl logs -f deployment/my-app -n production -c sidecar-container

Roll back a deployment:

# View rollout history
kubectl rollout history deployment/my-app -n production
# Rollback to previous version
kubectl rollout undo deployment/my-app -n production
# Rollback to specific version
kubectl rollout undo deployment/my-app -n production --to-revision=3
# Verify rollback
kubectl rollout status deployment/my-app -n production

This Administrator Guide covered:
- Daily Operations: Health checks, monitoring, checklists
- Monitoring: Accessing Grafana, Prometheus, ArgoCD
- Scaling: Manual and automatic scaling procedures
- Backup/Recovery: Velero operations, disaster recovery
- Secrets: Key Vault and External Secrets management
- Users: RBAC and access control
- Certificates: TLS and cert-manager
- Costs: Monitoring and optimization
- Security: Monitoring and incident response
- Maintenance: Windows and procedures
- Incidents: Response and escalation
- Runbooks: Common operational procedures
For specific troubleshooting scenarios, see the Troubleshooting Guide.
The platform includes AI agents that assist with day-2 operations:
| Task | Agent | Example Prompt |
|---|---|---|
| Monitoring & alerts | @sre | "Show me the current error rate and latency for all services" |
| Security operations | @security | "Audit RBAC assignments on the AKS cluster" |
| Scaling decisions | @terraform | "Help me scale the AKS nodepool to 5 nodes" |
| Cost optimization | @architect | "Suggest cost savings for our current Azure setup" |
| Pipeline management | @devops | "Help me set up a new GitHub Actions workflow" |
| Documentation updates | @docs | "Update the README to reflect the new module we added" |
Tip: Use @sre as your daily operations companion. It will triage issues, check metrics, and suggest fixes.
| Document | Description |
|---|---|
| Troubleshooting Guide | Diagnostic workflows for common platform issues |
| Performance Tuning Guide | Optimization recommendations for all components |
| Deployment Guide | Step-by-step platform deployment instructions |
| Module Reference | Detailed inputs/outputs for all Terraform modules |
| Runbooks | Operational runbooks for common procedures |
- Review performance tuning: Optimize cluster resources — see Performance Tuning Guide
- Set up runbooks: Familiarize with operational procedures — see Runbooks
- Configure alerting: Set up alerting rules and notification channels
Document Version: 4.0.0 Last Updated: December 2025 Maintainer: Platform Engineering Team