Prometheus and Grafana for monitoring

Installing and running Prometheus exporters, Prometheus database and Grafana

    To collect useful system metrics for stability and performance monitoring, we recommend Prometheus. To visualize the metrics that Prometheus collects, you can use Grafana.

    The System controller receives many internal metrics from the Agile Live components. A Prometheus instance can pull these from the endpoint https://system-controller-host:8080/metrics.
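
    As a quick check that the endpoint responds, the following sketch counts the samples it returns. The helper name count_samples is made up for this example, and the -k flag assumes the System controller presents a certificate that curl cannot verify; drop it if the certificate is trusted.

```shell
# Count metric samples in Prometheus text exposition format read from stdin.
# Comment lines (# HELP, # TYPE) and blank lines are not counted.
count_samples() {
  # grep -c exits non-zero when the count is 0, so keep the exit status clean.
  grep -Ecv '^(#|[[:space:]]*$)' || true
}

# Usage (host name is the placeholder used in the text above):
#   curl -sk https://system-controller-host:8080/metrics | count_samples
```

    A count of 0 suggests the endpoint is reachable but has not yet received metrics from any component.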

    It is also possible to install external exporters for various hardware and OS metrics.

    Prometheus

    Installation

    Use this guide to install Prometheus: https://prometheus.io/docs/prometheus/latest/installation/.

    Configuration

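    Configure Prometheus to scrape the System controller by adding something like the following to the prometheus.yml file. Note that insecure_skip_verify: true disables verification of the System controller's TLS certificate, which is convenient with self-signed certificates but should be removed once a trusted certificate is in place: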

    scrape_configs:
      - job_name: 'system_controller_exporter'
        scrape_interval: 5s
        scheme: https
        tls_config:
          insecure_skip_verify: true
        static_configs:
        - targets: ['system-controller-host:8080']
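
    Before restarting Prometheus, it can help to validate the edited file. The sketch below assumes promtool, which is distributed together with Prometheus, is available on the PATH; the wrapper name check_prometheus_config is hypothetical.

```shell
# Validate a Prometheus configuration file with promtool.
# $1 is the path to the file (defaults to prometheus.yml in the current directory).
check_prometheus_config() {
  if ! command -v promtool >/dev/null 2>&1; then
    echo "promtool not found in PATH" >&2
    return 127
  fi
  promtool check config "${1:-prometheus.yml}"
}

# Usage:
#   check_prometheus_config /etc/prometheus/prometheus.yml
```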
    

    External exporters

    Node Exporter

    Node Exporter is an exporter used for general hardware and OS metrics, such as CPU load and memory usage.

    Instructions for installation and configuration can be found here: https://github.com/prometheus/node_exporter

    Add a new scrape_config in prometheus.yml like so:

      - job_name: 'node_exporter'
        scrape_interval: 15s
        static_configs:
        - targets: ['node-exporter-host:9100']
    

    DCGM Exporter

    This exporter uses Nvidia DCGM to gather metrics from Nvidia GPUs, including encoder and decoder utilization.

    More information and installation instructions can be found here: https://github.com/NVIDIA/dcgm-exporter

    Add a new scrape_config in prometheus.yml like so:

      - job_name: 'dcgm_exporter'
        scrape_interval: 15s
        static_configs:
        - targets: ['dcgm-exporter-host:9400']
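
    To look at a single metric from an exporter without going through Prometheus, a small filter like the one below can be used. The helper name metric_lines is made up for this example. The metric names in the usage lines (node_load1 from the Node Exporter, DCGM_FI_DEV_ENC_UTIL and DCGM_FI_DEV_DEC_UTIL from the DCGM Exporter) are assumed to be enabled in your exporter configuration.

```shell
# Print the sample lines of one metric from Prometheus text format on stdin.
# $1 is the exact metric name; HELP/TYPE comment lines are not matched,
# and neither are longer names that merely start with $1.
metric_lines() {
  grep -E "^$1(\{| )" || true
}

# Usage (host names are the placeholders used in the scrape configs above):
#   curl -s node-exporter-host:9100/metrics | metric_lines node_load1
#   curl -s dcgm-exporter-host:9400/metrics | metric_lines DCGM_FI_DEV_ENC_UTIL
#   curl -s dcgm-exporter-host:9400/metrics | metric_lines DCGM_FI_DEV_DEC_UTIL
```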
    

    Grafana

    Installation of Grafana is described here: https://grafana.com/docs/grafana/latest/setup-grafana/installation/

    As a start, the following Dashboards can be used to visualize the Node Exporter and DCGM Exporter data:

    Example of running Node Exporter and DCGM Exporter with Docker Compose

    To simplify setting up the Node Exporter and DCGM Exporter on multiple monitored machines, you can use the following example Docker Compose file. First, after a normal installation of Docker and the Docker Compose plugin, install and configure the Nvidia Container Toolkit so that Docker containers can access the Nvidia GPU:

    sudo apt-get install -y nvidia-container-toolkit
    sudo nvidia-ctk runtime configure --runtime=docker
    sudo systemctl restart docker
    

    Then the following example docker-compose.yml file can be used to start both the Node Exporter and the DCGM Exporter:

    version: '3.8'
    
    services:
      node_exporter:
        image: quay.io/prometheus/node-exporter:latest
        container_name: node_exporter
        command:
          - '--path.rootfs=/host'
        network_mode: host
        pid: host
        restart: unless-stopped
        volumes:
          - '/:/host:ro,rslave'
    
      dcgm_exporter:
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.3-3.3.1-ubuntu22.04
        container_name: dcgm_exporter
        deploy:
          resources:
            reservations:
              devices:
                - driver: nvidia
                  count: all
                  capabilities: [ gpu ]
        restart: unless-stopped
        environment:
          - DCGM_EXPORTER_NO_HOSTNAME=1
        cap_add:
          - SYS_ADMIN
        ports:
          - "9400:9400"
    

    Start the Docker containers as usual with docker compose up -d. To verify that the exporters work, use curl to fetch the metrics data: curl localhost:9100/metrics for the Node Exporter and curl localhost:9400/metrics for the DCGM Exporter. Note that the DCGM Exporter might take several seconds to collect its first metrics, so the first requests might return an empty response body.
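
    Because the first requests to the DCGM Exporter can come back empty, a scripted check may want to retry for a while. The following is a minimal sketch; the function name retry_until_output and the attempt count and delay in the usage line are arbitrary choices for this example.

```shell
# Re-run a command until it prints non-empty output or the attempts run out.
# $1 = maximum number of attempts, $2 = delay in seconds between attempts,
# remaining arguments = the command to run.
retry_until_output() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # A failing command is treated the same as an empty response.
    out=$("$@" 2>/dev/null) || out=""
    if [ -n "$out" ]; then
      printf '%s\n' "$out"
      return 0
    fi
    # Only wait if another attempt will follow.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Usage: try up to 10 times, 2 seconds apart.
#   retry_until_output 10 2 curl -s localhost:9400/metrics
```

    The function returns a non-zero status if every attempt produced an empty body, which makes it usable in provisioning scripts.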