Supplementary Services

These are container-based supplementary services.

Services

We use NGINX as our reverse proxy, which forwards users' HTTPS requests from their web browsers to our various backend services.

We currently offer the following services:

Core service

  • Determined AI Master

Web service

Background services

  • NGINX
  • Prometheus
  • node-exporter
  • cAdvisor
  • DCGM-Exporter
  • V2Ray Exporter
  • frp

HOW-TO

Requirements

Install the Docker Compose plugin (docker compose), which supports GPUs, instead of the older docker-compose package shipped with Ubuntu 20.04.

sudo apt install docker-compose-plugin

First-time configurations

1. Gitea

Check the notes to configure env variables for Gitea.

2. Harbor

Check the notes to install Harbor.

P.S. The Harbor service is not included in the all-in-one Compose file, so it must be launched separately.

3. Xray

Check the note to add the configuration files.

4. Grafana, Prometheus and Wandb

Fix the ownership of the data directories:

sudo chown -R 472:0 grafana/*

sudo chown -R 1000:1000 prometheus/*

sudo chown -R 999:0 wandb/vol

5. System-configurations

Contains some key configuration files from /etc.

6. All-in-one services (except Harbor and node-exporter)

To launch the all-in-one services, run this command on the management node:

docker compose up -d

To rebuild a single service, for example the NGINX reverse proxy, run:

docker compose build nginx

To force-recreate specific services (for example, after changing their configuration), run:

docker compose up -d --force-recreate --remove-orphans [service1 service2 ...]

To force recreate all services:

docker compose up -d --force-recreate --remove-orphans

7. Set up endpoints for Node-exporter and other monitoring services

7.1. Introduction

This docker-compose.yaml starts monitoring tools similar to those described in Determined AI Docs - Configure Determined with Prometheus and Grafana. The difference is in how cAdvisor and dcgm-exporter are set up: the official document uses provider: startup_script: |, which only works with the GCP and Azure providers, whereas we run our own on-premises cluster.

Instead of using that start-up script, we need to launch this docker-compose.yaml manually on each agent node (we may use Ansible for this in the future).

Monitoring tools:

These tools will run on the cluster agents to be monitored.

7.2. Run

On every node that needs to be monitored:

Copy the docker-compose.yaml from the node-exporter directory to each node, then run

# Using `docker compose` instead of `docker-compose`
docker compose up -d --force-recreate --remove-orphans

to collect data from every machine.
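Until something like Ansible is in place, the copy-and-launch step above can be scripted. The sketch below only prints the commands for review rather than running them. The function name print_deploy_commands, the remote directory, and the hostnames are hypothetical, and it assumes passwordless SSH access to each node.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (not run) the commands that copy the Compose file
# to each node and start the monitoring stack there.
# Usage: print_deploy_commands REMOTE_DIR HOST...
print_deploy_commands() {
  local remote_dir="$1"
  shift
  local host
  for host in "$@"; do
    # Make sure the target directory exists on the node.
    printf 'ssh %s mkdir -p %s\n' "$host" "$remote_dir"
    # Copy the Compose file from the node-exporter directory of this repo.
    printf 'scp node-exporter/docker-compose.yaml %s:%s/\n' "$host" "$remote_dir"
    # Start (or recreate) the monitoring containers on the node.
    printf 'ssh %s "cd %s && docker compose up -d --force-recreate --remove-orphans"\n' \
      "$host" "$remote_dir"
  done
}

# Example (directory and hostnames are placeholders):
# print_deploy_commands /opt/monitoring node01 node02
```

After checking the printed commands, they can be piped to sh to execute them.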

Update static_configs[targets] in prometheus/config/prometheus.yml whenever new nodes are added to the cluster.
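To illustrate what such an update might look like, the helper below prints a scrape job with one target per node. This is a sketch only: the function name print_scrape_job, the job name, and the hostnames are hypothetical. Port 9100 is node-exporter's default and may not match this deployment, and the actual layout of prometheus.yml in this repository may differ.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a Prometheus scrape job for the given hosts.
# Usage: print_scrape_job JOB_NAME PORT HOST...
print_scrape_job() {
  local job="$1" port="$2"
  shift 2
  # "--" stops printf from treating the leading "-" as an option.
  printf -- '- job_name: "%s"\n' "$job"
  printf '  static_configs:\n'
  printf '    - targets:\n'
  local host
  for host in "$@"; do
    printf '        - "%s:%s"\n' "$host" "$port"
  done
}

# Example (hostnames are placeholders; 9100 is assumed, not confirmed):
# print_scrape_job node-exporter 9100 node01 node02
```

The printed block would sit under scrape_configs. Adding a node then means adding one more line to the targets list.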

7.3. Prometheus authentication for Determined AI (Bearer token)

Scraping the Determined AI master's metrics (/prom/det-state-metrics) requires a bearer_token issued by the Determined AI API. You can obtain this token with:

curl -s "http://10.0.1.66:8080/api/v1/auth/login" \
  -H 'Content-Type: application/json' \
  --data-binary '{"username":"admin","password":"********"}'

Then you can use this token in prometheus/config/prometheus.yml.
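A minimal sketch for extracting the token from the login response, assuming the response is a JSON object with a top-level "token" field (an assumption about the response shape; verify it against your Determined version). The function name extract_token is hypothetical, and jq would be a more robust parser where it is installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a login response on stdin and print the value
# of its "token" field. Assumes the token contains no escaped quotes.
extract_token() {
  sed -n 's/.*"token"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Example, reusing the login request shown above:
# curl -s "http://10.0.1.66:8080/api/v1/auth/login" \
#   -H 'Content-Type: application/json' \
#   --data-binary '{"username":"admin","password":"********"}' | extract_token
```

The printed value is what goes into the bearer_token setting of the scrape job for the Determined AI master.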

Reference:

Determined AI Docs - Configure Determined with Prometheus and Grafana

Determined AI Docs - REST API - Authentication

Notes

Determined AI's det-state-metrics endpoint provides enough information about tasks and containers (to view it in your browser, first log in to https://gpu.cvgl.lab). However, the official documentation and repository do not provide a Grafana dashboard that combines these data with cAdvisor and dcgm-exporter to show usage statistics per user or per task. Further development is required for more precise cluster management.

For example, in https://gpu.cvgl.lab/prom/det-state-metrics, each job will have an allocation_id. With this allocation_id, you can get the corresponding container_id in det_container_id_allocation_id.

With this container_id, you can:

  • Get container_runtime_id in det_container_id_runtime_container_id
  • Get gpu_uuid in det_gpu_uuid_container_id

With container_runtime_id, you can get container stats of this job with cAdvisor.

With gpu_uuid, you can get GPU stats of this job with dcgm-exporter.
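The lookup chain above can be sketched as a small filter over the metrics text. The metric names come from the description above, but the label names (allocation_id, container_id, container_runtime_id, gpu_uuid) are assumptions about how the labels are spelled in the exported metrics. The function name label_lookup is hypothetical. The parser assumes label values contain no commas or closing braces.

```shell
#!/usr/bin/env bash
# Hypothetical helper: from Prometheus text-format metrics on stdin, select
# samples of METRIC whose MATCH_LABEL equals MATCH_VALUE and print the value
# of WANT_LABEL for each of them.
# Usage: label_lookup METRIC MATCH_LABEL MATCH_VALUE WANT_LABEL < metrics.txt
label_lookup() {
  awk -v m="$1" -v ml="$2" -v mv="$3" -v wl="$4" '
    index($0, m "{") == 1 {
      # Keep only the text between "{" and the first "}".
      labels = substr($0, length(m) + 2)
      close_pos = index(labels, "}")
      labels = substr(labels, 1, close_pos - 1)
      n = split(labels, parts, ",")
      found = 0; want = ""
      for (i = 1; i <= n; i++) {
        eq = index(parts[i], "=")
        key = substr(parts[i], 1, eq - 1)
        val = substr(parts[i], eq + 1)
        gsub(/^"|"$/, "", val)          # strip the surrounding quotes
        if (key == ml && val == mv) found = 1
        if (key == wl) want = val
      }
      if (found && want != "") print want
    }'
}

# Example, with the metrics saved to metrics.txt and ALLOC set to an
# allocation_id (label names are assumptions):
# cid=$(label_lookup det_container_id_allocation_id allocation_id "$ALLOC" container_id < metrics.txt)
# label_lookup det_container_id_runtime_container_id container_id "$cid" container_runtime_id < metrics.txt
# label_lookup det_gpu_uuid_container_id container_id "$cid" gpu_uuid < metrics.txt
```

The last two lookups yield the identifiers to match against cAdvisor and dcgm-exporter series respectively. A job with several GPUs would print one gpu_uuid per line.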

TODOs:

  • A Grafana dashboard that integrates and visualizes these data
  • A management watchdog that utilizes these data and kills tasks

Acknowledgments

https://github.com/stefan0us/xray-traefik

https://github.com/nginx/nginx

https://github.com/determined-ai/determined

https://github.com/nextcloud/server

https://github.com/go-gitea/gitea

https://github.com/goharbor/harbor

https://github.com/XTLS/Xray-core

https://github.com/grafana/grafana

https://github.com/prometheus/prometheus

https://github.com/prometheus/node_exporter

https://github.com/google/cadvisor

https://github.com/NVIDIA/dcgm-exporter

https://github.com/wi1dcard/v2ray-exporter

https://github.com/soulteary/docker-flare

https://github.com/fatedier/frp

https://github.com/snowdreamtech/frp