Current autoscalers (e.g., the Kubernetes HPA) are mostly reactive and do not account for inter-service dependencies. To address this limitation, this project combines two complementary ideas:
- Reinforcement Learning: lets an agent learn a scaling policy by interacting with a real environment, in this case a live Kubernetes cluster.
- Graph Neural Networks: because microservices and their call relationships form a graph, GNNs are well suited to model and learn inter-service dependencies.
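As a minimal sketch of the idea, inter-service call relationships can be represented as a directed graph whose adjacency structure is what a GNN consumes. The service names below are an illustrative subset of Online Boutique's topology, not the exact graph the framework builds:

```python
# Sketch: represent microservice call relationships as a directed graph.
# Service names are illustrative, not the framework's actual graph.
from collections import defaultdict

calls = [
    ("frontend", "cartservice"),
    ("frontend", "productcatalogservice"),
    ("frontend", "checkoutservice"),
    ("checkoutservice", "cartservice"),
    ("checkoutservice", "paymentservice"),
]

adjacency = defaultdict(list)
for caller, callee in calls:
    adjacency[caller].append(callee)

def downstream(service, adj):
    """All services transitively reachable from `service` (its dependencies)."""
    seen, stack = set(), [service]
    while stack:
        for nxt in adj.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

print(sorted(downstream("frontend", adjacency)))
# -> ['cartservice', 'checkoutservice', 'paymentservice', 'productcatalogservice']
```

A dependency-aware autoscaler can use exactly this structure: scaling `frontend` without considering its downstream services misses the services that will absorb the propagated load.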
Before starting, ensure you have:
- An active Kubernetes cluster (v1.30+)
- Python installed (v3.11+)
Storage Requirements:
- ~7 GB for Python packages and dependencies
- ~25 MB per model checkpoint (checkpoints are saved every N steps)
- A minimum of 15 GB of free disk space is recommended
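A quick way to verify the recommendation above before training; the 15 GB threshold matches the figure stated here, and the path should point to wherever checkpoints and logs will be written (the `results/` directory in this repository):

```python
# Sketch: check free disk space against the recommended minimum (15 GB).
import shutil

def enough_space(free_bytes: int, required_gb: float = 15.0) -> bool:
    """Return True if `free_bytes` covers `required_gb` gigabytes."""
    return free_bytes >= required_gb * 1024 ** 3

free = shutil.disk_usage(".").free
print(f"free: {free / 1024 ** 3:.1f} GB, sufficient: {enough_space(free)}")
```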
```
├── k8s_config_files
│   ├── locust_files
│   │   ├── Dockerfile
│   │   ├── locustfile.py
│   │   └── locust.yaml
│   ├── onlineboutique.yaml
│   └── prometheus.yaml
├── results
│   ├── runs/
│   │   └── <run_name>/
│   │       ├── models/
│   │       ├── run.log
│   │       └── results.csv
│   └── tensorboard/
├── requirements.txt
├── setup.py
└── src
    ├── gym_hpa
    │   ├── gnn
    │   │   ├── gnn.py
    │   │   └── graphCreation.py
    │   ├── paths.py
    │   └── rl_environments
    │       ├── deployment.py
    │       ├── online_boutique.py
    │       └── util.py
    ├── policies
    │   └── run
    │       └── run.py
    └── util
        └── util.py
```
- k8s_config_files: Configuration files needed to properly deploy the cluster stack.
- results: Directory where the outputs of a training run are saved (e.g., training logs, episode metrics).
- requirements.txt: Generated by the pip-compile command from the pip-tools package; it lists the modules required to run this project.
- setup.py: A Python script used to describe a package (metadata, dependencies, packaging instructions), in our case the local package.
- src: Holds the whole codebase of the developed framework.
The following steps require files from this repository. First, clone the project and enter the directory:

```shell
git clone https://github.com/SlyPex/graph-gym-hpa.git
cd graph-gym-hpa/
```
Our cluster consists of the following VMs (nodes):
- One master node (10 CPU cores, 10 GB of RAM)
- Two workers (8 CPU cores, 8 GB of RAM each)

Important

The following steps assume a Kubernetes cluster has already been set up. The exact specs above and the setup method (kubeadm, Minikube, kind, etc.) are irrelevant; just make sure the resources (CPU cores, RAM) given to the cluster are sufficient to handle the stack described above.
- Start by installing Istio using the istioctl command-line tool; follow the steps in the official Istio installation guide.
- Deploy Prometheus using the provided file:

  ```shell
  kubectl apply -f k8s_config_files/prometheus.yaml
  ```
Note

This file should work out of the box with Istio, because it is the same file from the addons provided by the Istio project, with some minor changes:
- The number of replicas is set to 2 to ensure high availability.
- A Prometheus service of type NodePort was added to ensure continuous connectivity with the Prometheus API.
- Prometheus was changed to scrape and aggregate metrics at the service level instead of the pod level, to reduce scrape target cardinality and lower RAM usage (RAM usage previously plateaued around 3.7 GB and now stabilizes at about 1.8 GB).
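Once deployed, the framework reads metrics through the Prometheus HTTP API exposed on the NodePort. As a hedged illustration, an instant query can be composed like this; the metric and label names are typical Istio metrics, not necessarily the exact queries the framework issues:

```python
# Sketch: compose a Prometheus HTTP API instant-query URL against the
# NodePort service. The PromQL below is an illustrative Istio metric.
from urllib.parse import urlencode

PROMETHEUS_URL = "http://localhost:31090"

def build_query_url(base: str, promql: str) -> str:
    """Compose an instant-query URL for the Prometheus HTTP API."""
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"

url = build_query_url(
    PROMETHEUS_URL,
    'sum(rate(istio_requests_total{destination_service_name="frontend"}[1m]))',
)
print(url)
```

Fetching this URL returns a JSON body whose `data.result` field holds the sampled values, which is the raw material for the agent's observation space.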
- Deploy the benchmark application Online Boutique:
  - Create a new namespace named onlineboutique:

    ```shell
    kubectl create ns onlineboutique
    ```

  - Label the newly created namespace so that Istio can inject the sidecars:

    ```shell
    kubectl label namespace onlineboutique istio-injection=enabled
    ```

  - Finally, deploy the application using the provided manifest:

    ```shell
    kubectl apply -n onlineboutique -f k8s_config_files/onlineboutique.yaml
    ```
- Deploy the load generator Locust:

  ```shell
  kubectl apply -f k8s_config_files/locust_files/locust.yaml
  ```
Note

The Locust pod runs two containers: Locust v2.10.2 and a locust-exporter that exposes metrics (such as latency and average response time) to Prometheus.
The exporter requires Locust v2.10.2, which is why we use that version.
Locust load generation is implemented in locustfile.py, which is also packed into a Docker image using the Dockerfile under k8s_config_files/locust_files/.
If any changes are needed, you can adjust these files to your needs and use your own custom Docker image by setting the image field in locust.yaml to your built image, e.g. ghcr.io/<org>/locust:vX.Y.Z.
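For reference, a custom Locust image of this kind is typically built by pinning the version the exporter requires and bundling the load script. This is an assumed sketch; the repository's actual Dockerfile may differ:

```dockerfile
# Assumed sketch: pin the Locust version the exporter requires and bundle
# the load-generation script. The repo's actual Dockerfile may differ.
FROM locustio/locust:2.10.2
COPY locustfile.py /home/locust/locustfile.py
```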
Caution

All the YAML files used to deploy the stack above include a nodeAffinity rule that prevents their deployments from being scheduled on the control-plane node (the master-node VM). This may cause issues in some setups; please double-check that the nodeAffinity matches your cluster topology.
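For reference, a typical rule of this kind (not necessarily the exact one in the repository's manifests) keeps pods off control-plane nodes like so:

```yaml
# Typical anti-control-plane nodeAffinity; the repo's manifests may differ.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node-role.kubernetes.io/control-plane
              operator: DoesNotExist
```

If your control-plane node is labeled differently (e.g., an older cluster using node-role.kubernetes.io/master), adjust the key accordingly or remove the affinity block entirely.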
Important

Before Training - Critical Setup:

- Kubernetes API Access: Ensure the Kubernetes API is accessible from your training machine.
  - Set up kubectl proxy if needed:

    ```shell
    kubectl proxy --port=8080
    ```

  - Update HOST in deployment.py if using a different endpoint.
  - See the Kubernetes API Access Guide.
- Prometheus Accessibility: Verify Prometheus is reachable at localhost:31090:

  ```shell
  kubectl get svc -n istio-system prometheus
  curl http://localhost:31090/-/healthy
  ```

  - If you modified the NodePort in prometheus.yaml, update PROMETHEUS_URL accordingly.
- Install the required packages listed in requirements.txt, along with the local package, in order to run the framework:

  ```shell
  pip install -r requirements.txt && pip install -e .
  ```
- Change directory to where the run.py script is:

  ```shell
  cd src/policies/run
  ```

- Finally, launch a training run:

  ```shell
  python run.py --training --total_steps 1000 --alg a2c
  ```
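After a run finishes, per-episode metrics land in results/runs/&lt;run_name&gt;/results.csv. A quick way to summarize them is sketched below; the column name `episode_reward` is an assumed example, since the actual headers are project-specific, so adapt it to what your runs produce:

```python
# Sketch: summarize one numeric column of a run's results.csv.
# `episode_reward` is an assumed column name; adjust to your CSV headers.
import csv

def summarize(csv_path: str, column: str) -> dict:
    """Return row count and mean of a numeric column in a results CSV."""
    values = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            if row.get(column, "") != "":
                values.append(float(row[column]))
    mean = sum(values) / len(values) if values else 0.0
    return {"rows": len(values), "mean": mean}
```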
Tip

Available Options: Run python run.py -h (or --help) to list all the available options and their possible values.

Monitor Resources: Keep an eye on CPU and RAM usage to avoid OOM (Out of Memory) errors:

```shell
# Monitor node resources
kubectl top nodes
# Monitor pods sorted by memory (all namespaces)
kubectl top pods -A --sort-by=memory
# Monitor pods sorted by CPU (all namespaces)
kubectl top pods -A --sort-by=cpu
# Watch resources in real time
watch -n 2 'kubectl top nodes && echo && kubectl top pods -A --sort-by=memory'
```

This project is a derivative work of Gym-HPA and is licensed for non-commercial educational and research use only.
See LICENSE.md for complete terms.
For commercial use, contact:
- Original work: Ghent University & IMEC (info@imec.be)
- This fork: (s.meharzi@esi-sba.dz)
