Metrics & Monitoring Guide

Monitoring architecture and metrics collection

You're viewing a development version of manager. The latest released version is v1.4.1.

Overview

The CDN Manager includes a comprehensive monitoring stack based on VictoriaMetrics for time-series data storage, Telegraf for metrics collection, and Grafana for visualization. This guide describes the monitoring architecture and how to access and use the monitoring capabilities.

Quick Links

| Guide | Description |
|---|---|
| Grafana Dashboards | Using and customising the built-in and advanced Grafana dashboards |
| Grafana Authentication & Roles | Configuring Grafana authentication, roles, and permissions |
| Alerts & Alarms | Configuring and managing alerts and alarms |

Architecture

Components

| Component | Purpose |
|---|---|
| Telegraf | Metrics collector running on each node, gathering system and application metrics |
| VictoriaMetrics Agent | Metrics scraper and forwarder; scrapes Prometheus endpoints and forwards to VictoriaMetrics |
| VictoriaMetrics (Short-term) | Time-series database for operational dashboards (30-90 day retention) |
| VictoriaMetrics (Long-term) | Time-series database for billing and compliance (1+ year retention) |
| Grafana | Visualization and dashboard platform; deployed as two replicas for high availability |
| Alertmanager | Alert routing and notification management |

Metrics Flow

The following diagram illustrates how metrics flow through the monitoring stack:

flowchart TB
    subgraph External["External Sources"]
        Streamers[Streamers/External Clients]
    end

    subgraph Cluster["Kubernetes Cluster"]
        Telegraf[Telegraf DaemonSet]

        subgraph Applications["Application Components"]
            Director[CDN Director]
            Kafka[Kafka]
            Redis[Redis]
            Manager[ACD Manager]
            Alertmanager[Alertmanager]
        end

        VMAgent[VictoriaMetrics Agent]

        subgraph Storage["Storage"]
            VMShort[VictoriaMetrics<br/>Short-term]
            VMLong[VictoriaMetrics<br/>Long-term]
        end

        Grafana[Grafana<br/>2 replicas, HA]
        PostgreSQL[(PostgreSQL)]
        Zitadel[Zitadel]
    end

    Streamers -->|Push metrics| Telegraf
    Telegraf -->|remote_write| VMShort
    Telegraf -->|remote_write| VMLong

    Director -->|Scrape| VMAgent
    Kafka -->|Scrape| VMAgent
    Redis -->|Scrape| VMAgent
    Manager -->|Scrape| VMAgent
    Alertmanager -->|Scrape| VMAgent

    VMAgent -->|remote_write| VMShort
    VMAgent -->|remote_write| VMLong

    VMShort -->|Query| Grafana
    VMLong -->|Query| Grafana

    Grafana <-->|Shared state| PostgreSQL
    Grafana -->|OAuth2 / OIDC| Zitadel

Metrics Flow Summary:

External metrics ingestion:
- External clients (streamers) push metrics to Telegraf
- Telegraf forwards metrics via remote_write to both VictoriaMetrics instances

Internal metrics scraping:
- VictoriaMetrics Agent (VMAgent) scrapes Prometheus endpoints from:
  - CDN Director instances
  - Kafka cluster
  - Redis
  - ACD Manager components
  - Alertmanager
- VMAgent forwards scraped metrics via remote_write to both VictoriaMetrics instances

Data visualization:
- Grafana queries whichever VictoriaMetrics database a dashboard requires
- Operational dashboards use short-term storage
- Billing and compliance dashboards use long-term storage
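
Because both collectors write to both stores, one way to confirm that a series arrived in each is to run the same instant query against the two instances. The sketch below assumes each VictoriaMetrics instance serves the Prometheus-compatible `/api/v1/query` endpoint; the variable names `VM_SHORT_URL` and `VM_LONG_URL`, the example URLs, and the function names are hypothetical and need to be replaced with the addresses from your deployment.

```shell
# Hypothetical base URLs (assumption): replace with the addresses of your two
# VictoriaMetrics instances, for example after a kubectl port-forward to each.
VM_SHORT_URL=${VM_SHORT_URL:-http://localhost:8428}
VM_LONG_URL=${VM_LONG_URL:-http://localhost:18428}

# Run an instant PromQL query against one instance.
# $1: base URL, $2: PromQL expression
vm_query() {
  curl -s -G --data-urlencode "query=$2" "$1/api/v1/query"
}

# Run the same query against both stores and label each JSON response.
query_both_stores() {
  for base_url in "$VM_SHORT_URL" "$VM_LONG_URL"; do
    printf '%s\n' "== $base_url"
    vm_query "$base_url" "$1"
    printf '\n'
  done
}
```

For example, `query_both_stores 'up'` prints the response from each store under its base URL, so a series that is missing from one of them stands out.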

Metrics Collection

Application Metrics

Applications expose metrics on Prometheus-compatible endpoints. VictoriaMetrics Agent (VMAgent) scrapes these endpoints and forwards metrics to VictoriaMetrics via remote_write.
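
To see what a scrape would pick up from a single endpoint, it can help to count the samples per metric name in the endpoint's text output. The helper below is a minimal sketch, assuming the endpoint returns the Prometheus text exposition format; `summarize_metrics` is a hypothetical name, not part of the product.

```shell
# Count samples per metric name in Prometheus text exposition read from stdin.
summarize_metrics() {
  awk '
    /^#/ || /^[[:space:]]*$/ { next }   # skip HELP/TYPE comments and blank lines
    {
      name = $1
      sub(/\{.*/, "", name)             # drop the label set, keep the metric name
      count[name]++
    }
    END { for (n in count) printf "%s %d\n", n, count[n] }
  ' | sort
}
```

It reads from standard input, so it can be chained after any command that fetches a metrics page, for example `kubectl exec <pod-name> -- curl -s localhost:8080/metrics | summarize_metrics`.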

System Metrics

Telegraf collects system-level metrics including:

- CPU usage
- Memory utilization
- Disk I/O
- Network statistics
- Process metrics

Kubernetes Metrics

The following cluster metrics are collected:

- Pod resource usage
- Node status
- Deployment status
- Persistent volume usage

Metrics Retention

VictoriaMetrics is configured with default retention policies. For custom retention settings, modify the VictoriaMetrics configuration in your values.yaml:

acd-metrics:
  victoria-metrics-single:
    retentionPeriod: "3"  # Retention period in months
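
When comparing a retention value against the day-based targets in the Components table, a small conversion helper can be handy. The sketch below assumes the value is passed through to VictoriaMetrics' `-retentionPeriod` flag, where a bare number means months and the suffixes `h`, `d`, `w`, and `y` select other units. Treating a month as 31 days and a year as 365 days is an approximation made here, and `retention_to_days` is a hypothetical name.

```shell
# Convert a retention value such as "3", "90d", "2w", or "1y" to whole days.
# Assumption: a bare number is months (taken as 31 days); a year is 365 days.
retention_to_days() {
  value=$1
  case "$value" in
    *h) num=${value%h}; hours_per_unit=1 ;;
    *d) num=${value%d}; hours_per_unit=24 ;;
    *w) num=${value%w}; hours_per_unit=168 ;;
    *y) num=${value%y}; hours_per_unit=8760 ;;
    *)  num=$value;     hours_per_unit=744 ;;   # bare number: months
  esac
  # Reject anything whose numeric part is empty or contains a non-digit.
  case "$num" in
    ''|*[!0-9]*) echo "unsupported retention value: $value" >&2; return 1 ;;
  esac
  echo $(( num * hours_per_unit / 24 ))
}
```

Under these assumptions, `retention_to_days 3` prints `93` and `retention_to_days 1y` prints `365`.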

Troubleshooting

Metrics Not Appearing

If metrics are not appearing in Grafana:

Check Telegraf pods:

kubectl get pods -l app.kubernetes.io/component=telegraf

Check Telegraf logs:

kubectl logs -l app.kubernetes.io/component=telegraf

Verify VictoriaMetrics is running:

kubectl get pods -l app.kubernetes.io/component=victoria-metrics

Check application metrics endpoints:

kubectl exec <pod-name> -- curl localhost:8080/metrics
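
The pod checks above can be condensed into a filter that lists only the pods needing attention. This is a minimal sketch that assumes the default column layout of `kubectl get pods --no-headers` (NAME, READY, STATUS, ...); `not_ready_pods` and `check_monitoring_pods` are hypothetical helper names.

```shell
# Print pods that are not Running or whose containers are not all ready.
# Reads `kubectl get pods --no-headers` output from stdin.
not_ready_pods() {
  awk '{
    split($2, ready, "/")   # READY column, e.g. "1/1"
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $2, $3
  }'
}

# Apply the filter to the two components checked in the steps above.
check_monitoring_pods() {
  for component in telegraf victoria-metrics; do
    echo "== $component"
    kubectl get pods -l "app.kubernetes.io/component=$component" --no-headers | not_ready_pods
  done
}
```

An empty section under a component heading means every pod with that label is running and ready; any listed pod is a candidate for `kubectl logs` or `kubectl describe pod`.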

For dashboard and authentication issues, see the Grafana Dashboards and Grafana Authentication & Roles guides.

Next Steps

After setting up monitoring:

- Grafana Authentication & Roles: Configure SSO and permissions before accessing Grafana
- Grafana Dashboards: Explore and customise dashboards
- Alerts & Alarms: Set up alerting and notifications
- Operations Guide: Day-to-day operational procedures
- Troubleshooting Guide: Resolve monitoring issues
- API Guide: Access metrics via API