Prometheus and Grafana for monitoring

Installing and running Prometheus exporters, Prometheus database and Grafana

To collect system metrics for stability and performance monitoring, we recommend Prometheus. To visualize the metrics collected by Prometheus, you can use Grafana.

The System Controller receives many internal metrics from the Agile Live components. A Prometheus instance can pull these from the endpoint https://system-controller-host:8080/metrics.
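As a quick sanity check before configuring Prometheus, the endpoint can be fetched by hand. The sketch below assumes that curl is installed, that the host name in the URL above is replaced with your own, and that the certificate may be self-signed (hence the -k flag). The helper name count_samples is hypothetical and not part of Agile Live or Prometheus.

```shell
# count_samples: hypothetical helper that counts the metric samples in
# Prometheus text-format output read from stdin, i.e. every line that is
# neither a comment (# HELP / # TYPE) nor blank.
count_samples() {
  grep -cv -e '^#' -e '^[[:space:]]*$'
}

# Usage (not executed here): -s silences progress output, -k skips TLS
# certificate verification and should be dropped if the certificate is trusted.
#   curl -sk https://system-controller-host:8080/metrics | count_samples
```

A count greater than zero suggests that the System Controller is reachable and exposing metrics.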

It is also possible to install external exporters for various hardware and OS metrics.

Prometheus

Installation

Use this guide to install Prometheus: https://prometheus.io/docs/prometheus/latest/installation/.

Configuration

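Configure Prometheus to scrape the System Controller by adding something like the following to the prometheus.yml file. Note that insecure_skip_verify: true disables TLS certificate verification, which is typically only appropriate when the System Controller uses a self-signed certificate: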

scrape_configs:
  - job_name: 'system_controller_exporter'
    scrape_interval: 5s
    scheme: https
    tls_config:
      insecure_skip_verify: true
    static_configs:
    - targets: ['system-controller-host:8080']
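Before restarting Prometheus with the new configuration, it can be worth validating the file. The sketch below assumes that promtool, which ships with Prometheus, is on the PATH and that the configuration file is named prometheus.yml in the current directory. The function name check_prometheus_config is hypothetical.

```shell
# check_prometheus_config: hypothetical wrapper around "promtool check config".
# Takes the path to the configuration file (default: prometheus.yml).
check_prometheus_config() {
  config="${1:-prometheus.yml}"
  # Fail early with a clear message if promtool is not installed.
  if ! command -v promtool >/dev/null 2>&1; then
    echo "promtool not found on PATH" >&2
    return 127
  fi
  promtool check config "$config"
}

# Usage (not executed here):
#   check_prometheus_config /etc/prometheus/prometheus.yml
```

The function returns a non-zero exit code if promtool is missing or reports a problem with the file.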

External exporters

Node Exporter

Node Exporter is an exporter used for general hardware and OS metrics, such as CPU load and memory usage.

Instructions for installation and configuration can be found here: https://github.com/prometheus/node_exporter

Add a new entry under scrape_configs in prometheus.yml, for example:

  - job_name: 'node_exporter'
    scrape_interval: 15s
    static_configs:
    - targets: ['node-exporter-host:9100']

DCGM Exporter

This exporter uses Nvidia DCGM to gather metrics from Nvidia GPUs, including encoder and decoder utilization.

More information and installation instructions can be found here: https://github.com/NVIDIA/dcgm-exporter

Add a new entry under scrape_configs in prometheus.yml, for example:

  - job_name: 'dcgm_exporter'
    scrape_interval: 15s
    static_configs:
    - targets: ['dcgm-exporter-host:9400']

Grafana

Installation of Grafana is described here: https://grafana.com/docs/grafana/latest/setup-grafana/installation/

As a starting point, the following dashboards can be used to visualize the Node Exporter and DCGM Exporter data:

Example of running Node Exporter and DCGM Exporter with Docker Compose

To simplify setting up the Node Exporter and DCGM Exporter on each machine you want to monitor, you can use the example Docker Compose file below. First, after installing Docker and the Docker Compose plugin as usual, install and configure the Nvidia Container Toolkit so that Docker containers can access the Nvidia GPU. The commands below assume a Debian or Ubuntu system where the Nvidia package repository has already been added:

sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
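To confirm that containers can see the GPU before starting the exporters, a throwaway container can run nvidia-smi. This is a minimal sketch that assumes the current user is allowed to run docker (otherwise prefix the command with sudo) and that the ubuntu image can be pulled from Docker Hub. The function name check_gpu_in_docker is hypothetical.

```shell
# check_gpu_in_docker: hypothetical helper that runs nvidia-smi inside a
# temporary container with all GPUs attached. The container is removed
# again on exit (--rm).
check_gpu_in_docker() {
  docker run --rm --gpus all ubuntu nvidia-smi
}

# Usage (not executed here):
#   check_gpu_in_docker
```

If the toolkit is configured correctly, the usual nvidia-smi table listing the GPUs is printed.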

Then the following example docker-compose.yml file can be used to start both the Node Exporter and the DCGM Exporter:

version: '3.8'

services:
  node_exporter:
    image: quay.io/prometheus/node-exporter:latest
    container_name: node_exporter
    command:
      - '--path.rootfs=/host'
    network_mode: host
    pid: host
    restart: unless-stopped
    volumes:
      - '/:/host:ro,rslave'

  dcgm_exporter:
    image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.3-3.3.1-ubuntu22.04
    container_name: dcgm_exporter
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [ gpu ]
    restart: unless-stopped
    environment:
      - DCGM_EXPORTER_NO_HOSTNAME=1
    cap_add:
      - SYS_ADMIN
    ports:
      - "9400:9400"

Start the Docker containers as usual with docker compose up -d. To verify that the exporters work, you can use curl to fetch the metrics, for example curl localhost:9100/metrics for the Node Exporter and curl localhost:9400/metrics for the DCGM Exporter. Note that the DCGM Exporter might take several seconds to collect its first metrics, so the first requests might return an empty response body.
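Because the first requests to the DCGM Exporter can come back empty, a small retry loop is one way to script the check. The sketch below assumes curl is installed. The function name wait_for_metrics and its default of 10 attempts with a 2 second delay are hypothetical choices, not values taken from the exporters.

```shell
# wait_for_metrics: hypothetical helper that polls a metrics URL until the
# response body is non-empty or the attempts run out.
# Arguments: URL, number of attempts (default 10), delay in seconds (default 2).
wait_for_metrics() {
  url="$1"
  attempts="${2:-10}"
  delay="${3:-2}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f treats HTTP errors as failures; a failed request counts as empty.
    body=$(curl -fsS "$url" 2>/dev/null) || body=""
    if [ -n "$body" ]; then
      echo "metrics available after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "no metrics from $url after $attempts attempts" >&2
  return 1
}

# Usage (not executed here):
#   wait_for_metrics http://localhost:9100/metrics
#   wait_for_metrics http://localhost:9400/metrics 15 2
```

The function returns a non-zero exit code if no metrics appear in time, so it can be used as a gate in a provisioning script.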