These are container-based supplementary services.
We use NGINX as our reverse proxy, which forwards users' HTTPS requests from their web browsers to our various backend services.
We are currently offering these web services:
- Determined AI Master
- Homepage
- Nextcloud
- Determined AI
- Gitea
- Harbor
- Grafana
- NGINX
- Prometheus
- node-exporter
- cAdvisor
- DCGM-Exporter
- V2Ray Exporter
- frp
Install the Compose plugin to enable GPU support, instead of using the older version of docker-compose shipped with Ubuntu 20.04:
sudo apt install docker-compose-plugin
Check the notes to configure env variables for Gitea.
Check the notes to install Harbor.
P.S. The Harbor service is not in the all-in-one file, so it needs to be launched separately.
Check the note to add the configuration files
Fix the ownership of the data directories:
sudo chown -R 472:0 grafana/*
sudo chown -R 1000:1000 prometheus/*
sudo chown -R 999:0 wandb/vol
Contains some key configurations in /etc.
To launch the all-in-one services, run this command on the management node:
docker compose up -d
To rebuild one service, for example the NGINX reverse proxy, run:
docker compose build nginx
To force-recreate some services (after changing their configuration), run:
docker compose up -d --force-recreate --remove-orphans [service1 service2 ...]
To force-recreate all services:
docker compose up -d --force-recreate --remove-orphans
This docker-compose.yaml starts monitoring tools similar to those in the Determined AI Docs - Configure Determined with Prometheus and Grafana. The difference is in how cAdvisor and dcgm-exporter are configured: the official document uses provider: startup_script: |, which only works with the GCP and Azure providers, whereas we run our own on-premise cluster.
Instead of using that startup script, we need to launch this docker-compose.yaml manually on each agent node (Ansible could automate this in the future).
Monitoring tools:
These tools will run on the cluster agents to be monitored.
On every node that needs to be monitored:
Copy docker-compose.yaml in node-exporter to every node, then run the following to collect data from every machine:
# Using `docker compose` instead of `docker-compose`
docker compose up -d --force-recreate --remove-orphans
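Until this is automated, the copy-and-launch step above could be scripted with a plain SSH loop. The following is only a sketch: the hostnames agent-01 and agent-02, the remote directory, and the RUN dry-run switch are hypothetical, and it assumes passwordless SSH access to every agent node.

```shell
# Hypothetical values: replace with the real agent hostnames and target directory.
NODES="${NODES-agent-01 agent-02}"
REMOTE_DIR="${REMOTE_DIR-node-exporter}"
# Dry run by default: commands are printed, not executed. Set RUN= to execute.
RUN="${RUN-echo}"

# deploy_node HOST: copy the compose file to HOST and (re)start the exporters.
deploy_node() {
  $RUN ssh "$1" "mkdir -p $REMOTE_DIR"
  $RUN scp node-exporter/docker-compose.yaml "$1:$REMOTE_DIR/docker-compose.yaml"
  $RUN ssh "$1" "cd $REMOTE_DIR && docker compose up -d --force-recreate --remove-orphans"
}

for node in $NODES; do
  deploy_node "$node"
done
```

Run as-is, the script only prints the ssh and scp commands it would issue, which makes it easy to review before executing with RUN= set to an empty value.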
Update static_configs[targets] in prometheus/config/prometheus.yml if any new nodes are added to the cluster.
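A quick way to spot nodes that are missing from the Prometheus configuration is to search the file for each hostname. This is a rough sketch: the function name missing_targets and the example hostnames are hypothetical, and it does a plain substring match, so agent-1 would also match agent-10.

```shell
# missing_targets FILE NODE...: print every NODE that does not appear in FILE.
# Assumption: nodes are listed in FILE by the same name passed here.
missing_targets() {
  config=$1
  shift
  for node in "$@"; do
    # -F: match the name literally; -q: only report via the exit status.
    grep -qF -- "$node" "$config" || printf '%s\n' "$node"
  done
}

# Example call (hypothetical hostnames):
# missing_targets prometheus/config/prometheus.yml agent-01 agent-02
```

An empty output means every listed node already appears somewhere in the file.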
Scraping the Determined AI master's metrics (/prom/det-state-metrics) through the Determined AI API requires a bearer_token. You can obtain this token with:
curl -s "http://10.0.1.66:8080/api/v1/auth/login" \
  -H 'Content-Type: application/json' \
  --data-binary '{"username":"admin","password":"********"}'
Then you can use this token in prometheus.yml.
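The login call returns JSON, so the token still has to be picked out of the response. As a minimal sketch, assuming the response carries the token in a "token" string field (check this against your Determined version), it can be extracted with sed alone. The function name extract_token and the sample response are made up for illustration.

```shell
# extract_token: read a login response on stdin and print its "token" value.
# Assumption: the token is a plain string that contains no escaped quotes.
extract_token() {
  sed -n 's/.*"token"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Example with a made-up response body (not real server output):
printf '%s\n' '{"token":"v2.public.example","user":{"username":"admin"}}' | extract_token
```

Against the real master, pipe the output of the curl command above into extract_token and place the result in the bearer_token field of the corresponding scrape job.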
Reference:
Determined AI Docs - Configure Determined with Prometheus and Grafana
Determined AI's det-state-metrics (to view it in your browser, first log in to https://gpu.cvgl.lab) provides enough information about tasks and containers. However, neither the official documentation nor the repo provides a Grafana dashboard that combines these data with cAdvisor and dcgm-exporter to report usage per user or per task. Further development is required for more precise cluster management.
For example, in https://gpu.cvgl.lab/prom/det-state-metrics, each job has an allocation_id. With this allocation_id, you can look up the corresponding container_id in det_container_id_allocation_id.
With this container_id, you can:
- Get container_runtime_id in det_container_id_runtime_container_id
- Get gpu_uuid in det_gpu_uuid_container_id
With container_runtime_id, you can get the container stats of this job with cAdvisor.
With gpu_uuid, you can get the GPU stats of this job with dcgm-exporter.
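The lookup chain above can be sketched as a small filter over the Prometheus text format. This is an illustration under assumptions: the helper name lookup_label is hypothetical, the sample metrics are made up, and the label names (allocation_id, container_id, container_runtime_id, gpu_uuid) are assumed to match the metric names, so verify them against the real /prom/det-state-metrics output. It also assumes label values contain no commas.

```shell
# lookup_label METRIC KEY VALUE WANTED
# Reads Prometheus text-format metrics on stdin and prints the WANTED label of
# every METRIC sample whose KEY label equals VALUE.
lookup_label() {
  awk -v metric="$1" -v key="$2" -v val="$3" -v want="$4" '
    index($0, metric "{") == 1 {
      body = substr($0, length(metric) + 2)   # text after the opening brace
      sub(/[}].*/, "", body)                  # drop the closing brace and value
      n = split(body, pairs, ",")
      found = 0; out = ""
      for (i = 1; i <= n; i++) {
        eq = index(pairs[i], "=")
        name = substr(pairs[i], 1, eq - 1)
        value = substr(pairs[i], eq + 1)
        gsub(/"/, "", value)                  # strip the surrounding quotes
        if (name == key && value == val) found = 1
        if (name == want) out = value
      }
      if (found && out != "") print out
    }'
}

# Made-up sample; real data would come from /prom/det-state-metrics.
sample='det_container_id_allocation_id{allocation_id="42.1",container_id="c-1"} 1
det_container_id_runtime_container_id{container_id="c-1",container_runtime_id="rt-9"} 1
det_gpu_uuid_container_id{container_id="c-1",gpu_uuid="GPU-aaa"} 1'

# allocation_id -> container_id -> container_runtime_id and gpu_uuid
cid=$(printf '%s\n' "$sample" | lookup_label det_container_id_allocation_id allocation_id 42.1 container_id)
printf '%s\n' "$sample" | lookup_label det_container_id_runtime_container_id container_id "$cid" container_runtime_id
printf '%s\n' "$sample" | lookup_label det_gpu_uuid_container_id container_id "$cid" gpu_uuid
```

The resulting container_runtime_id and gpu_uuid are the keys to match against the cAdvisor and dcgm-exporter series, which is the join a future Grafana dashboard or watchdog would need to perform.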
TODOs:
- A Grafana dashboard that integrates and visualizes these data
- A management watchdog that utilizes these data and kills tasks
https://github.com/stefan0us/xray-traefik
https://github.com/nginx/nginx
https://github.com/determined-ai/determined
https://github.com/nextcloud/server
https://github.com/go-gitea/gitea
https://github.com/goharbor/harbor
https://github.com/XTLS/Xray-core
https://github.com/grafana/grafana
https://github.com/prometheus/prometheus
https://github.com/prometheus/node_exporter
https://github.com/google/cadvisor
https://github.com/NVIDIA/dcgm-exporter
https://github.com/wi1dcard/v2ray-exporter
https://github.com/soulteary/docker-flare