Metrics & Monitoring Guide

Monitoring architecture and metrics collection
You're viewing a development version of manager; the latest released version is v1.4.1.

Overview

The CDN Manager includes a comprehensive monitoring stack based on VictoriaMetrics for time-series data storage, Telegraf for metrics collection, and Grafana for visualization. This guide describes the monitoring architecture and how to access and use the monitoring capabilities.

Guide | Description
--- | ---
Grafana Dashboards | Using and customising the built-in and advanced Grafana dashboards
Grafana Authentication & Roles | Configuring Grafana authentication, roles, and permissions
Alerts & Alarms | Configuring and managing alerts and alarms

Architecture

Components

Component | Purpose
--- | ---
Telegraf | Metrics collector running on each node, gathering system and application metrics
VictoriaMetrics Agent | Metrics scraper and forwarder; scrapes Prometheus endpoints and forwards to VictoriaMetrics
VictoriaMetrics (Short-term) | Time-series database for operational dashboards (30-90 day retention)
VictoriaMetrics (Long-term) | Time-series database for billing and compliance (1+ year retention)
Grafana | Visualization and dashboard platform; deployed as two replicas for high availability
Alertmanager | Alert routing and notification management

Metrics Flow

The following diagram illustrates how metrics flow through the monitoring stack:

flowchart TB
    subgraph External["External Sources"]
        Streamers[Streamers/External Clients]
    end

    subgraph Cluster["Kubernetes Cluster"]
        Telegraf[Telegraf DaemonSet]

        subgraph Applications["Application Components"]
            Director[CDN Director]
            Kafka[Kafka]
            Redis[Redis]
            Manager[ACD Manager]
            Alertmanager[Alertmanager]
        end

        VMAgent[VictoriaMetrics Agent]

        subgraph Storage["Storage"]
            VMShort[VictoriaMetrics<br/>Short-term]
            VMLong[VictoriaMetrics<br/>Long-term]
        end

        Grafana[Grafana<br/>2 replicas, HA]
        PostgreSQL[(PostgreSQL)]
        Zitadel[Zitadel]
    end

    Streamers -->|Push metrics| Telegraf
    Telegraf -->|remote_write| VMShort
    Telegraf -->|remote_write| VMLong

    Director -->|Scrape| VMAgent
    Kafka -->|Scrape| VMAgent
    Redis -->|Scrape| VMAgent
    Manager -->|Scrape| VMAgent
    Alertmanager -->|Scrape| VMAgent

    VMAgent -->|remote_write| VMShort
    VMAgent -->|remote_write| VMLong

    VMShort -->|Query| Grafana
    VMLong -->|Query| Grafana

    Grafana <-->|Shared state| PostgreSQL
    Grafana -->|OAuth2 / OIDC| Zitadel

Metrics Flow Summary:

  1. External metrics ingestion:

    • External clients (streamers) push metrics to Telegraf
    • Telegraf forwards metrics via remote_write to both VictoriaMetrics instances
  2. Internal metrics scraping:

    • VictoriaMetrics Agent scrapes Prometheus endpoints from:
      • CDN Director instances
      • Kafka cluster
      • Redis
      • ACD Manager components
      • Alertmanager
    • VictoriaMetrics Agent forwards the scraped metrics via remote_write to both VictoriaMetrics instances
  3. Data visualization:

    • Grafana queries either VictoriaMetrics database, depending on what the dashboard requires
    • Operational dashboards use short-term storage
    • Billing and compliance dashboards use long-term storage
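
The storage choice in step 3 can be sketched as a small helper. This is an illustration only, not the manager's actual logic: the function name, the datasource labels, and the 90-day cutoff (the upper end of the short-term retention range listed under Components) are all assumptions.

```python
from datetime import timedelta

# Assumed cutoff: upper end of the 30-90 day short-term retention range.
SHORT_TERM_RETENTION = timedelta(days=90)


def pick_datasource(lookback: timedelta, purpose: str = "operational") -> str:
    """Return a hypothetical datasource label for a dashboard query."""
    if purpose in ("billing", "compliance"):
        # Billing and compliance dashboards use long-term storage.
        return "victoriametrics-long-term"
    if lookback > SHORT_TERM_RETENTION:
        # Data older than the short-term window is only kept long-term.
        return "victoriametrics-long-term"
    return "victoriametrics-short-term"
```

For example, a 7-day operational panel would map to the short-term label, while a 12-month view or any billing panel would map to the long-term one.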

Metrics Collection

Application Metrics

Applications expose metrics on Prometheus-compatible endpoints. VictoriaMetrics Agent (VMAgent) scrapes these endpoints and forwards metrics to VictoriaMetrics via remote_write.
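
To make the endpoint format concrete, the sketch below parses a small sample of the Prometheus text exposition format. It is a simplified illustration, not what VMAgent runs: the metric names in EXAMPLE are made up, and the parser ignores timestamps and does not handle escaped quotes or braces inside label values.

```python
import re

# Matches lines such as: http_requests_total{method="get",code="200"} 1027
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Parse a simplified subset of the Prometheus text exposition format."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser cannot handle
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Hypothetical endpoint output, for illustration only.
EXAMPLE = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027
process_open_fds 12
"""
```

Calling parse_metrics(EXAMPLE) yields one tuple per sample line, each holding the metric name, a label dictionary, and the value as a float.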

System Metrics

Telegraf collects system-level metrics including:

  • CPU usage
  • Memory utilization
  • Disk I/O
  • Network statistics
  • Process metrics

Kubernetes Metrics

Collected cluster metrics include:

  • Pod resource usage
  • Node status
  • Deployment status
  • Persistent volume usage

Metrics Retention

VictoriaMetrics is configured with default retention policies. For custom retention settings, modify the VictoriaMetrics configuration in your values.yaml:

acd-metrics:
  victoria-metrics-single:
    retentionPeriod: "3"  # Retention period in months
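
As a rough aid for choosing a value, the helper below estimates how many days a retentionPeriod setting covers. It assumes the chart passes the value through to VictoriaMetrics' -retentionPeriod flag, where a bare number means months and suffixes such as h, d, w, and y select other units; verify this against your chart version. The function name and the 31-day month approximation are assumptions made for illustration.

```python
# Approximate day counts per unit suffix.
_UNIT_DAYS = {"h": 1.0 / 24.0, "d": 1.0, "w": 7.0, "y": 365.0}
# Assumption for illustration: one month is treated as 31 days.
_DAYS_PER_MONTH = 31.0


def retention_days(value: str) -> float:
    """Estimate the number of days covered by a retentionPeriod value."""
    value = value.strip()
    if not value:
        raise ValueError("retentionPeriod must not be empty")
    suffix = value[-1]
    if suffix.isdigit():
        # No suffix: the number is interpreted as months.
        return float(value) * _DAYS_PER_MONTH
    if suffix not in _UNIT_DAYS:
        raise ValueError(f"unsupported retentionPeriod suffix: {suffix!r}")
    return float(value[:-1]) * _UNIT_DAYS[suffix]
```

Under these assumptions, the "3" shown above corresponds to roughly 93 days, which is close to the upper end of the short-term retention range.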

Troubleshooting

Metrics Not Appearing

If metrics are not appearing in Grafana:

  1. Check Telegraf pods:

    kubectl get pods -l app.kubernetes.io/component=telegraf
    
  2. Check Telegraf logs:

    kubectl logs -l app.kubernetes.io/component=telegraf
    
  3. Verify VictoriaMetrics is running:

    kubectl get pods -l app.kubernetes.io/component=victoria-metrics
    
  4. Check application metrics endpoints:

    kubectl exec <pod-name> -- curl localhost:8080/metrics
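
As a complement to the checks above, scrape health can also be inspected through the Prometheus-compatible query API that VictoriaMetrics exposes at /api/v1/query. The sketch below builds an instant query and lists series whose `up` value is 0. The function names are hypothetical, and the base URL used in the usage note is an assumption (for example, a local port-forward to the VictoriaMetrics service). Note that `up` is generated for scraped targets, so metrics pushed through Telegraf are not covered by this check.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus-compatible HTTP API."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def down_targets(response: dict) -> list[dict]:
    """Return the label sets of series whose latest `up` value is 0."""
    if response.get("status") != "success":
        raise RuntimeError(f"query failed: {response}")
    return [
        series["metric"]
        for series in response["data"]["result"]
        if float(series["value"][1]) == 0.0
    ]


def fetch(url: str, timeout: float = 5.0) -> dict:
    """Fetch and decode a JSON response (needs network access to the cluster)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

With a port-forward in place, something like `down_targets(fetch(build_query_url("http://localhost:8428", "up")))` would return the label sets of targets that failed their last scrape; the port is an assumption and depends on your deployment.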
    

For dashboard and authentication issues, see the Grafana Dashboards and Grafana Authentication & Roles guides.

Next Steps

After setting up monitoring:

  1. Grafana Authentication & Roles - Configure SSO and permissions before accessing Grafana
  2. Grafana Dashboards - Explore and customise dashboards
  3. Alerts & Alarms - Set up alerting and notifications
  4. Operations Guide - Day-to-day operational procedures
  5. Troubleshooting Guide - Resolve monitoring issues
  6. API Guide - Access metrics via API
