Metrics & Monitoring Guide

Monitoring architecture and metrics collection

Overview

The CDN Manager includes a comprehensive monitoring stack based on VictoriaMetrics for time-series data storage, Telegraf for metrics collection, and Grafana for visualization. This guide describes the monitoring architecture and how to access and use the monitoring capabilities.

Architecture

Components

| Component | Purpose |
| --- | --- |
| Telegraf | Metrics collector running on each node, gathering system and application metrics |
| VictoriaMetrics Agent | Metrics scraper and forwarder; scrapes Prometheus endpoints and forwards to VictoriaMetrics |
| VictoriaMetrics (Short-term) | Time-series database for operational dashboards (30-90 day retention) |
| VictoriaMetrics (Long-term) | Time-series database for billing and compliance (1+ year retention) |
| Grafana | Visualization and dashboard platform |
| Alertmanager | Alert routing and notification management |

Metrics Flow

The following diagram illustrates how metrics flow through the monitoring stack:

flowchart TB
    subgraph External["External Sources"]
        Streamers[Streamers/External Clients]
    end

    subgraph Cluster["Kubernetes Cluster"]
        Telegraf[Telegraf DaemonSet]

        subgraph Applications["Application Components"]
            Director[CDN Director]
            Kafka[Kafka]
            Redis[Redis]
            Manager[ACD Manager]
            Alertmanager[Alertmanager]
        end

        VMAgent[VictoriaMetrics Agent]

        subgraph Storage["Storage"]
            VMShort[VictoriaMetrics<br/>Short-term]
            VMLong[VictoriaMetrics<br/>Long-term]
        end
    end

    Grafana[Grafana]

    Streamers -->|Push metrics| Telegraf
    Telegraf -->|remote_write| VMShort
    Telegraf -->|remote_write| VMLong

    Director -->|Scrape| VMAgent
    Kafka -->|Scrape| VMAgent
    Redis -->|Scrape| VMAgent
    Manager -->|Scrape| VMAgent
    Alertmanager -->|Scrape| VMAgent

    VMAgent -->|remote_write| VMShort
    VMAgent -->|remote_write| VMLong

    VMShort -->|Query| Grafana
    VMLong -->|Query| Grafana

Metrics Flow Summary:

  1. External metrics ingestion:

    • External clients (streamers) push metrics to Telegraf
    • Telegraf forwards metrics via remote_write to both VictoriaMetrics instances
  2. Internal metrics scraping:

    • VictoriaMetrics Agent scrapes Prometheus endpoints from:
      • CDN Director instances
      • Kafka cluster
      • Redis
      • ACD Manager components
      • Alertmanager
    • VMAgent forwards scraped metrics via remote_write to both VictoriaMetrics instances
  3. Data visualization:

    • Grafana queries whichever VictoriaMetrics database a dashboard requires
    • Operational dashboards use short-term storage
    • Billing and compliance dashboards use long-term storage

Accessing Grafana

Grafana is deployed as part of the metrics stack and is accessible through the ingress:

URL: https://<manager-host>/grafana

Default credentials are listed in the Glossary.

Important: Change all default passwords after first login.
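
Before logging in, it can help to confirm that the ingress path actually reaches Grafana. The sketch below only builds a probe URL for Grafana's /api/health endpoint. It assumes the Grafana API is served under the same /grafana sub-path as the UI; grafana_health_url is a hypothetical helper name, not part of the product.

```shell
# Build the health-check URL for Grafana behind the manager ingress.
# Assumption: the Grafana API lives under the /grafana sub-path shown above.
# grafana_health_url is a hypothetical helper, not a shipped command.
grafana_health_url() {
    printf 'https://%s/grafana/api/health\n' "$1"
}
```

For example, curl -fsS "$(grafana_health_url <manager-host>)" should return a small JSON status document when Grafana is reachable.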

Metrics Collection

Application Metrics

Applications expose metrics on Prometheus-compatible endpoints. VictoriaMetrics Agent (VMAgent) scrapes these endpoints and forwards metrics to VictoriaMetrics via remote_write.
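
To see what an endpoint exposes before VMAgent scrapes it, you can summarize its output. The following is a minimal sketch that reads Prometheus text exposition format on standard input and lists the distinct metric names; metric_names is a hypothetical helper name, and the sketch ignores edge cases such as exemplars.

```shell
# List the distinct metric names found in Prometheus text-format output.
# metric_names is a hypothetical helper; it reads from standard input.
metric_names() {
    # Drop "# HELP" / "# TYPE" comment lines and blank lines,
    # then cut each sample line at the first "{" or space to keep only the name.
    grep -v -e '^#' -e '^[[:space:]]*$' | sed 's/[{ ].*$//' | sort -u
}
```

A typical use is to pipe an endpoint's response into it, for example kubectl exec <pod-name> -- curl -s localhost:8080/metrics | metric_names, where the port is whatever the application actually listens on.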

System Metrics

Telegraf collects system-level metrics including:

  • CPU usage
  • Memory utilization
  • Disk I/O
  • Network statistics
  • Process metrics

Kubernetes Metrics

The following cluster metrics are collected:

  • Pod resource usage
  • Node status
  • Deployment status
  • Persistent volume usage

Grafana Dashboards

Accessing Dashboards

After logging into Grafana:

  1. Navigate to Dashboards in the left menu
  2. Browse available dashboards
  3. Click on a dashboard to view metrics

Dashboard Types

The included dashboards provide visibility into:

  • Cluster Health: Overall cluster resource utilization
  • Application Performance: Request rates, latency, error rates
  • Component Status: Individual component health indicators

CDN Director Metrics

Director DNS Names in Grafana

CDN Director instances are identified in Grafana by their DNS name, which is derived from the name field in global.hosts.routers:

global:
  hosts:
    routers:
      - name: my-router-1
        address: 192.0.2.1

The DNS name used in Grafana dashboards will be: my-router-1.external

This naming convention is applied automatically to all configured directors.
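
If you need the Grafana names for several directors at once, for example to build a dashboard filter, the convention can be sketched as a small shell function. It assumes the .external suffix shown in the example holds for every entry; director_dns_names is a hypothetical helper name.

```shell
# Print the Grafana DNS name for each router name given as an argument.
# Assumption: every director follows the "<name>.external" pattern shown above.
# director_dns_names is a hypothetical helper, not part of the product.
director_dns_names() {
    for name in "$@"; do
        printf '%s.external\n' "$name"
    done
}
```

Calling director_dns_names my-router-1 prints my-router-1.external, matching the example above.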

Metrics Retention

VictoriaMetrics ships with a default retention policy. To change it, set retentionPeriod in the VictoriaMetrics section of your values.yaml:

acd-metrics:
  victoria-metrics-single:
    retentionPeriod: "3"  # Retention period in months
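
If you would rather override the value at deploy time than edit values.yaml, the same key can be targeted from the command line. This sketch assumes the manager is deployed with Helm and only prints the flag; retention_set_flag is a hypothetical helper name, and the release and chart names are placeholders you must supply.

```shell
# Print a Helm flag that overrides the retention period shown above.
# Assumption: the deployment is managed with Helm and uses this values path.
# --set-string keeps the value a string, matching the quoted "3" in values.yaml.
retention_set_flag() {
    printf '%s\n' "--set-string acd-metrics.victoria-metrics-single.retentionPeriod=$1"
}
```

For example: helm upgrade <release> <chart> --reuse-values $(retention_set_flag 6). Upstream VictoriaMetrics also accepts suffixed durations such as 90d, but whether this chart passes them through unchanged is an assumption to verify before relying on it.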

Troubleshooting

Metrics Not Appearing

If metrics are not appearing in Grafana:

  1. Check Telegraf pods:

    kubectl get pods -l app.kubernetes.io/component=telegraf
    
  2. Check Telegraf logs:

    kubectl logs -l app.kubernetes.io/component=telegraf
    
  3. Verify VictoriaMetrics is running:

    kubectl get pods -l app.kubernetes.io/component=victoria-metrics
    
  4. Check application metrics endpoints:

    kubectl exec <pod-name> -- curl localhost:8080/metrics
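
When a selector matches many pods, scanning the list by eye is error-prone. The sketch below filters the default kubectl get pods table down to the pods that need attention. It assumes the default column layout (NAME, READY, STATUS, RESTARTS, AGE) without -A or -o wide; not_running_pods is a hypothetical helper name.

```shell
# Print the names of pods whose STATUS is neither Running nor Completed.
# Reads the default `kubectl get pods` table on standard input.
# Assumption: STATUS is the third column (no -A or custom output format).
not_running_pods() {
    awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1 }'
}
```

For example: kubectl get pods -l app.kubernetes.io/component=telegraf | not_running_pods. Empty output means every matching pod reports Running or Completed.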
    

Dashboard Loading Issues

If dashboards fail to load:

  1. Check Grafana pods:

    kubectl get pods -l app.kubernetes.io/component=grafana
    
  2. Review Grafana logs:

    kubectl logs -l app.kubernetes.io/component=grafana
    
  3. Verify datasource configuration in Grafana UI

Next Steps

After setting up monitoring, continue with the following guides:

  1. Operations Guide - Day-to-day operational procedures
  2. Troubleshooting Guide - Resolve monitoring issues
  3. API Guide - Access metrics via API