
Graph‑RL Autoscaler: GNN‑Driven Agent for Cluster‑Based Microservice Scaling


Description 📝

Introduction

Current autoscalers (e.g., the Kubernetes HPA) are mostly reactive and do not account for inter-service dependencies. To address this limitation, this project combines two complementary ideas:

  • Reinforcement Learning: lets an agent learn scaling decisions by interacting with a real environment, in this case a live Kubernetes cluster.
  • Graph Neural Networks: microservices and their call relationships naturally form a graph, so GNNs are well suited to model and learn inter-service dependencies (a minimal sketch of this encoding follows the list).
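
To make the GNN idea concrete, here is a minimal, hypothetical sketch of encoding a toy service-call graph with one graph-convolution layer. The topology, node features, and layer sizes are illustrative assumptions, not the repository's actual model (see src/gym_hpa/gnn for that):

# Hypothetical sketch: dependency-aware encoding of a service-call graph.
# Assumes torch and torch_geometric are installed.
import torch
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

# Toy call graph: frontend -> cart, frontend -> catalog, cart -> redis.
edge_index = torch.tensor([[0, 0, 1],
                           [1, 2, 3]], dtype=torch.long)

# One feature row per service, e.g. [CPU usage, RAM usage, replicas].
x = torch.tensor([[0.60, 0.40, 3.0],
                  [0.30, 0.20, 2.0],
                  [0.10, 0.70, 2.0],
                  [0.05, 0.90, 1.0]], dtype=torch.float)

graph = Data(x=x, edge_index=edge_index)

# One graph convolution mixes each service's features with those of
# its neighbors, yielding dependency-aware service embeddings.
conv = GCNConv(in_channels=3, out_channels=8)
embeddings = conv(graph.x, graph.edge_index)
print(embeddings.shape)  # torch.Size([4, 8])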

Prerequisites 📋

Before starting, ensure you have:

  • An active Kubernetes cluster (v1.30+)

  • Python installed (v3.11+)

  • Storage requirements (a quick programmatic check follows this list):

    • ~7 GB for Python packages and dependencies
    • ~25 MB per model checkpoint (checkpoints are saved every N steps)
    • A minimum of 15 GB of free disk space is recommended
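
The version and storage figures above can be verified quickly from Python; a small illustrative snippet (the thresholds simply mirror the list):

import shutil
import sys

# The interpreter must satisfy the v3.11+ prerequisite.
print("python >= 3.11:", sys.version_info >= (3, 11))

# At least 15 GB of free disk space is recommended.
free_gb = shutil.disk_usage(".").free / 1e9
print(f"free disk: {free_gb:.1f} GB (>= 15 GB recommended)")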

Cluster Stack 🏗️

Figure: the agent's training loop and its interaction with the environment.

Tool               Utility
Kubernetes         Container orchestration
Locust             User load simulation
Online Boutique    Benchmark application
Istio              Injects a proxy at the pod level to monitor traffic between services
Prometheus         Gathers metrics from the deployed tools (Istio, Locust, Online Boutique)

Project Files 📂

├── k8s_config_files
│   ├── locust_files
│   │   ├── Dockerfile
│   │   ├── locustfile.py
│   │   └── locust.yaml
│   ├── onlineboutique.yaml
│   └── prometheus.yaml
├── results
│   ├── runs/
│   │   └── <run_name>/
│   │       ├── models/
│   │       ├── run.log
│   │       └── results.csv
│   └── tensorboard/
├── requirements.txt
├── setup.py
└── src
    ├── gym_hpa
    │   ├── gnn
    │   │   ├── gnn.py
    │   │   └── graphCreation.py
    │   ├── paths.py
    │   └── rl_environments
    │       ├── deployment.py
    │       ├── online_boutique.py
    │       └── util.py
    └── policies
        ├── run
        │   └── run.py
        └── util
            └── util.py
  • k8s_config_files: Configuration files needed to deploy the cluster stack described above.

  • results: Directory where training outputs are saved (e.g., training logs and per-episode metrics).

  • requirements.txt: Generated by the pip-compile command from the pip-tools package; it pins the modules required to run this project.

  • setup.py: A Python script that describes a package (metadata, dependencies, packaging instructions), in our case the local package gym-hpa.

  • src: Holds the entire codebase of the framework.

    • gym_hpa:

      • gnn: the graph-construction (graphCreation.py) and GNN model (gnn.py) code live here.
      • rl_environments: the logic for driving the Online Boutique app and its interaction with the K8s cluster (a rough sketch of the environment shape follows this list).
    • policies/run: the main entry point for launching a training or testing run of an agent.
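
For orientation, here is a minimal, hypothetical sketch of what a scaling environment of this kind can look like, using the classic Gym interface. The spaces, reward, and scaling logic below are illustrative assumptions, not the code in src/gym_hpa/rl_environments:

import gym
import numpy as np
from gym import spaces

NUM_SERVICES = 11  # assumed service count, for illustration only

class ScalingEnvSketch(gym.Env):
    """Illustrative only: one discrete choice per service (scale in,
    hold, scale out); observations are per-service resource metrics."""

    def __init__(self):
        # Three choices (in / hold / out) for each service.
        self.action_space = spaces.MultiDiscrete([3] * NUM_SERVICES)
        # e.g. CPU, RAM, replicas per service, normalized to [0, 1].
        self.observation_space = spaces.Box(
            low=0.0, high=1.0, shape=(NUM_SERVICES, 3), dtype=np.float32)

    def reset(self):
        return self.observation_space.sample()  # placeholder metrics

    def step(self, action):
        # A real environment would apply replica changes through the
        # Kubernetes API here, then pull fresh metrics from Prometheus.
        obs = self.observation_space.sample()
        reward = 0.0  # e.g. trade latency against replica cost
        return obs, reward, False, {}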

Usage 🚀

The following steps require files from this repository. First, clone the project and access the directory:

  • git clone https://github.com/SlyPex/graph-gym-hpa.git
  • cd graph-gym-hpa/

Cluster Setup 🛠️

Our cluster consists of the following VMs (Nodes):

  • One master node (10 CPU cores, 10 GB of RAM)
  • Two worker nodes (8 CPU cores, 8 GB of RAM each)

Important

The following steps assume an existing Kubernetes cluster has already been set up. The exact specs above and the setup method (kubeadm, Minikube, kind, etc.) are irrelevant; just make sure the resources (CPU cores, RAM) available to the cluster are more than enough to handle the stack above.

  1. Start by installing Istio with the istioctl command-line tool; follow the official "Installation steps using istioctl" guide.
  2. Deploy Prometheus using the file prometheus.yaml:
  • kubectl apply -f k8s_config_files/prometheus.yaml

Note

This prometheus.yaml file should work out of the box with Istio: it is the same file shipped in the Istio project's addons, with some minor changes:

  • The number of replicas is set to 2 to ensure high availability.
  • A Prometheus Service of type NodePort was added to provide stable connectivity to the Prometheus API.
  • Prometheus was changed to scrape and aggregate metrics at the service level instead of the pod level, reducing scrape-target cardinality and RAM usage (RAM usage, which previously plateaued around 3.7 GB, now stabilizes at about 1.8 GB). A sample service-level query is sketched below.
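
As an illustration of the service-level view, the hypothetical snippet below queries Istio's standard istio_requests_total metric through the NodePort Service described above (the endpoint and query are assumptions based on this setup):

import requests

# Per-service request rate over the last minute, aggregated by
# destination service -- a typical service-level Istio query.
PROM_URL = "http://localhost:31090/api/v1/query"
query = "sum by (destination_service) (rate(istio_requests_total[1m]))"

resp = requests.get(PROM_URL, params={"query": query}, timeout=10)
for sample in resp.json()["data"]["result"]:
    print(sample["metric"].get("destination_service"), sample["value"][1])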
  3. Deploy the benchmark application, Online Boutique:
  • Create a new namespace named onlineboutique
    • kubectl create ns onlineboutique
  • Label the newly created namespace so that Istio can inject the sidecars
    • kubectl label namespace onlineboutique istio-injection=enabled
  • Finally, deploy the application using the file onlineboutique.yaml
    • kubectl apply -n onlineboutique -f k8s_config_files/onlineboutique.yaml
  4. Deploy Locust, the load generator, via the file locust.yaml:
  • kubectl apply -f k8s_config_files/locust_files/locust.yaml

Note

The Locust pod runs two containers: Locust v2.10.2 and a locust-exporter that exposes metrics (such as latency and average response time) to Prometheus. The exporter requires Locust v2.10.2, which is why that version is used.
Load generation is implemented in locustfile.py, which is also packed into a Docker image via the Dockerfile. If you need changes, adjust these files and point the deployment's image line at your own build, e.g. ghcr.io/<org>/locust:vX.Y.Z. An illustrative locustfile shape is sketched below.
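
For orientation, a minimal locustfile has roughly this shape (the endpoints, weights, and timings here are placeholders, not the repository's actual load profile):

from locust import HttpUser, between, task

class BoutiqueUser(HttpUser):
    # Simulated users wait 1-3 seconds between requests.
    wait_time = between(1, 3)

    @task(3)
    def browse_home(self):
        self.client.get("/")  # hypothetical endpoint

    @task(1)
    def view_product(self):
        # Hypothetical product page; the real paths live in locustfile.py.
        self.client.get("/product/OLJCESPC7Z")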

Caution

All the YAML files used to deploy the stack above set a nodeAffinity that prevents their Deployments from being scheduled on the control-plane node (the master node VM). This may cause issues in some setups; double-check that the nodeAffinity matches your cluster topology.

Agent Setup & Training 🧠

Important

Before Training - Critical Setup:

  1. Kubernetes API Access: ensure the Kubernetes API is accessible from your training machine

  2. Prometheus Accessibility: verify Prometheus is reachable at localhost:31090 (both checks are also sketched in Python after this block)

    kubectl get svc -n istio-system prometheus
    curl http://localhost:31090/-/healthy
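
Both checks can also be run from Python; a hypothetical sketch, assuming the official kubernetes client and requests packages are installed and a valid kubeconfig is present:

import requests
from kubernetes import client, config

# 1. Kubernetes API access: load the local kubeconfig and list the
#    Online Boutique Deployments to confirm the API is reachable.
config.load_kube_config()
apps = client.AppsV1Api()
for dep in apps.list_namespaced_deployment("onlineboutique").items:
    print(dep.metadata.name, dep.spec.replicas)

# 2. Prometheus accessibility: hit the health endpoint on the NodePort.
health = requests.get("http://localhost:31090/-/healthy", timeout=5)
print("prometheus healthy:", health.ok)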
  1. Install the packages listed in requirements.txt along with the local gym-hpa package needed to run the framework:
  • pip install -r requirements.txt && pip install -e .

  2. Change to the directory containing the run.py script:
  • cd src/policies/run

  3. Finally, launch a training run:
  • python run.py --training --total_steps 1000 --alg a2c
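
Once training finishes, per-episode metrics land in results/runs/<run_name>/results.csv (see the Project Files tree above). A quick, hypothetical way to inspect them, assuming pandas is installed; substitute your actual run name, and note that the column names depend on the run:

import pandas as pd

# Placeholder path: replace <run_name> with your run's directory.
df = pd.read_csv("results/runs/<run_name>/results.csv")
print(df.columns.tolist())  # discover the logged metric names
print(df.tail())            # last few episodes of the run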

Tip

Available Options: Run python run.py -h (or --help) to list all the available options and their possible values.

Monitor Resources: Keep an eye on CPU and RAM usage to avoid OOM (Out of Memory) errors:

# Monitor node resources
kubectl top nodes

# Monitor pods sorted by memory (all namespaces)
kubectl top pods -A --sort-by=memory

# Monitor pods sorted by CPU (all namespaces)
kubectl top pods -A --sort-by=cpu

# Watch resources in real-time
watch -n 2 'kubectl top nodes && echo && kubectl top pods -A --sort-by=memory'

License 📄

This project is a derivative work of Gym-HPA and is licensed for non-commercial educational and research use only.

See LICENSE.md for complete terms.

For commercial use, contact:

Collaborators 🤝
