1 - Getting Started

Introduction to AgileTV CDN Manager

Overview

The AgileTV CDN Manager (product code ESB3027) is a cloud-native control plane for managing CDN deployments. It provides centralized orchestration for authentication, configuration, routing, and metrics collection across CDN infrastructure.

Before You Start:

  • Deployment type: Lab (single-node) or Production (multi-node)? See the Installation Guide
  • Hardware: Nodes meeting specifications for your deployment type
  • OS: RHEL 9 or compatible clone (Oracle Linux, AlmaLinux, Rocky Linux)
  • Software: Installation ISO from the AgileTV customer portal; Extras ISO for air-gapped installations
  • Network: Firewall ports configured per Networking Guide

Deployment Models

| Deployment Model | Description | Typical Use Case |
|---|---|---|
| Self-Hosted | K3s Kubernetes cluster on customer premises | Production deployments |
| Lab/Single-Node | Minimal single-node installation | Acceptance testing, demonstrations, development |

Functionality remains consistent across deployment models.

Prerequisites

  • Installation ISO: Obtain esb3027-acd-manager-X.Y.Z.iso from AgileTV customer portal
  • Extras ISO (air-gapped): Obtain esb3027-acd-manager-extras-X.Y.Z.iso for offline installations
  • OS: RHEL 9 or compatible clone (Oracle Linux, AlmaLinux, Rocky Linux)
  • Kubernetes familiarity: Basic understanding of pods, deployments, and Helm charts

For detailed hardware, network, and operating system requirements, see the System Requirements Guide.

Installation

Ready to install? The Installation Guide provides step-by-step procedures for both lab and production deployments:

  • Lab/Single-Node: Quick deployment for testing and demonstrations
  • Production/Multi-Node: High-availability cluster with 3+ nodes

See the Installation Guide to get started.

Accessing the System

After a successful deployment, the following interfaces are available:

| Service | URL Path | Authentication |
|---|---|---|
| MIB Frontend | /gui | Zitadel SSO |
| API Gateway | /api | Bearer token |
| Zitadel Console | /ui/console | See Glossary |
| Grafana | /grafana | See Glossary |

All services are accessed via https://<cluster-host><path>.

Note: A self-signed SSL certificate is deployed by default. When accessing services through a browser, you will need to accept the self-signed certificate warning. For production deployments, configure a valid SSL certificate before exposing the system to users.
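As a quick post-deployment check, the paths in the table above can be probed from a workstation. The following is a minimal sketch, not an official procedure: the host name cdn-manager.example.com and the function names are placeholders, and -k is passed to curl only because the default certificate is self-signed.

```shell
# Hypothetical host name; replace with your cluster address.
CLUSTER_HOST="${CLUSTER_HOST:-cdn-manager.example.com}"

# Build the full URL for one of the service paths listed above.
service_url() {
  printf 'https://%s%s\n' "$CLUSTER_HOST" "$1"
}

# Print the HTTP status returned for each interface.
# -k skips certificate verification (self-signed certificate only).
check_interfaces() {
  for path in /gui /api /ui/console /grafana; do
    code=$(curl -k -s -o /dev/null -w '%{http_code}' "$(service_url "$path")")
    printf '%s -> %s\n' "$path" "$code"
  done
}
```

Call check_interfaces once the cluster is reachable. Drop -k after a valid certificate has been installed.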

Initial user configuration is performed through Zitadel. Refer to the Configuration Guide for authentication setup procedures. For detailed guidance on managing users, roles, and permissions in the Zitadel Console, see Zitadel’s User Management Documentation.

Documentation Navigation

The following guides provide detailed information for specific operational tasks:

| Guide | Description |
|---|---|
| System Requirements | Hardware, operating system, and network specifications |
| Architecture | Detailed system architecture and scaling guidance |
| Installation | Step-by-step installation and upgrade procedures |
| Configuration | System configuration and customization |
| Performance Tuning | Optimization tips for improved performance |
| API Guide | REST API reference and integration examples |
| Operations | Day-to-day operational procedures |
| Metrics & Monitoring | Monitoring dashboards and alerting configuration |
| Troubleshooting | Common issues and resolution procedures |
| Glossary | Definitions of technical terms |
| Release Notes | Version-specific changes and known issues |

2 - System Requirements Guide

Hardware, operating system, and networking requirements

Overview

This document specifies the hardware, operating system, and networking requirements for deploying the AgileTV CDN Manager (ESB3027). Requirements vary based on deployment type and node role within the cluster.

Cluster Sizing

Production Deployments

Production deployments require a minimum of three nodes to achieve high availability. The cluster architecture employs distinct node roles:

| Role | Description |
|---|---|
| Server Node (Control Plane Only) | Runs control plane components (etcd, Kubernetes API server) only; does not host application workloads; requires separate Agent nodes |
| Server Node (Combined) | Runs control plane components and hosts application workloads; default configuration |
| Agent Node | Executes application workloads only; does not participate in cluster quorum |

Server nodes can be deployed in either Control Plane Only or Combined role configurations. The choice depends on your deployment requirements:

  • Control Plane Only: Dedicated control plane nodes with lower resource requirements; requires separate Agent nodes for workloads
  • Combined: Server nodes run both control plane and workloads; minimum 3 nodes required for HA

Why Use Control Plane Only Nodes?

Dedicated Control Plane Only nodes provide several benefits for larger deployments:

  • Resource Isolation: Control plane components (etcd, API server, scheduler) run on dedicated hardware without competing with application workloads for CPU and memory
  • Stability: Application workload spikes or misbehaving pods cannot impact control plane performance
  • Security: Smaller attack surface on control plane nodes; fewer containers and services running
  • Predictable Performance: Control plane responsiveness remains consistent regardless of application load
  • Flexible Sizing: Control Plane Only nodes can use lower-specification hardware (2 cores, 4 GiB) since they don’t run application workloads

For most small to medium deployments, Combined role servers are simpler and more cost-effective. Control Plane Only nodes are recommended for larger deployments with significant workload requirements or where control plane stability is critical.

High Availability Considerations

Production deployments require 3 nodes running control plane (etcd) and 3 nodes capable of running workloads. These can be the same nodes (Combined role) or separate nodes (CP-Only + Agent).

Node Role Combinations:

| Configuration | Control Plane Nodes | Workload Nodes | Total Nodes |
|---|---|---|---|
| All Combined | 3 Combined servers | 3 Combined servers | 3 |
| Separated | 3 CP-Only servers | 3 Agent nodes | 6 |
| Hybrid | 2 CP-Only + 1 Combined | 1 Combined + 2 Agent | 5 |

Any combination works as long as you have 3 control plane nodes and 3 workload-capable nodes.

Note: Regardless of the deployment configuration, a minimum of 3 nodes capable of running workloads is required for production deployments. This ensures both high availability and sufficient capacity for application pods.
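The rule above reduces to two counts: control plane nodes are CP-Only plus Combined servers, and workload-capable nodes are Combined servers plus Agents. The sketch below checks a planned layout against both minimums. The function name is illustrative and not part of the product tooling.

```shell
# Succeeds when the planned node counts give at least 3 control plane
# nodes and at least 3 workload-capable nodes.
# Arguments: <cp_only_servers> <combined_servers> <agents>
meets_ha_minimum() {
  cp_only=$1 combined=$2 agents=$3
  control_plane=$((cp_only + combined))
  workload=$((combined + agents))
  [ "$control_plane" -ge 3 ] && [ "$workload" -ge 3 ]
}

meets_ha_minimum 0 3 0 && echo "All Combined: ok"
meets_ha_minimum 3 0 3 && echo "Separated: ok"
meets_ha_minimum 2 1 2 && echo "Hybrid: ok"
meets_ha_minimum 3 0 2 || echo "3 CP-Only + 2 Agent: too few workload nodes"
```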

For detailed fault tolerance information and data replication strategies, see the Architecture Guide.

Hardware Requirements

Single-Node Lab Deployment

Lab deployments are intended for acceptance testing, demonstrations, and development only. These configurations are not suitable for production workloads.

| Resource | Minimum | Recommended |
|---|---|---|
| CPU | 8 cores | 12 cores |
| Memory | 16 GiB | 24 GiB |
| Disk* | 128 GiB | 128 GiB |

Production Cluster - Server Node (Control Plane Only)

Server nodes dedicated to control plane functions have modest resource requirements:

| Resource | Minimum | Recommended |
|---|---|---|
| CPU | 2 cores | 4 cores |
| Memory | 4 GiB | 8 GiB |
| Disk* | 64 GiB | 128 GiB |

These nodes run only control plane components and require separate Agent nodes to run application workloads.

Production Cluster - Server Node (Control Plane + Workloads)

Combined role nodes require resources for both control plane and application workloads:

| Resource | Minimum | Recommended |
|---|---|---|
| CPU | 4 cores | 16 cores |
| Memory | 8 GiB | 32 GiB |
| Disk* | 100 GiB | 500 GiB |

Production Cluster - Agent Node

Agent nodes execute application workloads and require the following resources:

| Resource | Minimum | Recommended |
|---|---|---|
| CPU | 4 cores | 16 cores |
| Memory | 8 GiB | 32 GiB |
| Disk* | 100 GiB | 500 GiB |

Storage Notes

* Disk Space: All disk space values must be available in the /var/lib/longhorn partition. It is recommended that /var/lib/longhorn be a separate partition on a fast SSD for optimal performance, though SSD is not strictly required.

Longhorn Capacity: Longhorn storage requires an additional 30% capacity headroom for internal operations and scaling. If less than 30% of the total partition capacity is available, Longhorn may mark volumes as “full” and prevent further writes. Plan disk capacity accordingly.
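The 30% threshold can be checked on a running node with standard tools. This is a rough sketch under the assumption that free space reported by df is a fair proxy for what Longhorn sees. The function names are hypothetical.

```shell
# Succeeds when at least 30% of the partition is free.
# Arguments: <total> <available>, both in the same unit (e.g. KiB).
has_longhorn_headroom() {
  total=$1
  avail=$2
  [ $((avail * 100)) -ge $((total * 30)) ]
}

# Check a mounted path using POSIX df output (1024-byte blocks).
check_longhorn_path() {
  path=${1:-/var/lib/longhorn}
  # Column 2 is the total size, column 4 the available space.
  set -- $(df -Pk "$path" | awk 'NR==2 {print $2, $4}')
  if has_longhorn_headroom "$1" "$2"; then
    echo "OK: $path has at least 30% free"
  else
    echo "WARNING: $path has less than 30% free"
  fi
}
```

Run check_longhorn_path on each node that hosts Longhorn storage. The comparison uses integer arithmetic, so no floating-point tools are needed.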

Storage Performance

For optimal performance, the following storage characteristics are recommended:

  • Disk Type: SSD or NVMe storage for Longhorn volumes
  • Filesystem: XFS or ext4 with default mount options
  • Partition Layout: Dedicated /var/lib/longhorn partition for persistent storage

Virtual machines and bare-metal hardware are both supported. Nested virtualization (running multiple nodes under a single hypervisor) may impact performance and is not recommended for production deployments.

Operating System Requirements

Supported Operating Systems

The CDN Manager supports Red Hat Enterprise Linux and compatible distributions:

| Operating System | Status |
|---|---|
| Red Hat Enterprise Linux 9 | Supported |
| Red Hat Enterprise Linux 10 | Untested |
| Red Hat Enterprise Linux 8 | Not supported |

Compatible Clones

The following RHEL-compatible distributions are supported when major version requirements are satisfied:

  • Oracle Linux 9
  • AlmaLinux 9
  • Rocky Linux 9
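A node can be checked against this list before installation by reading its os-release file. The sketch below assumes the usual ID values these distributions publish (rhel, ol, almalinux, rocky). It is an illustration only and is not the check performed by the installer.

```shell
# Succeeds when the os-release file describes RHEL 9 or one of the
# compatible clones listed above. The file path is a parameter so the
# check can be run against a copy; it defaults to /etc/os-release.
is_supported_os() {
  file=${1:-/etc/os-release}
  (
    # Source in a subshell so ID and VERSION_ID do not leak to the caller.
    . "$file" || exit 1
    case "$ID" in
      rhel|ol|almalinux|rocky) ;;
      *) exit 1 ;;
    esac
    # Compare only the major version, e.g. "9" from "9.4".
    [ "${VERSION_ID%%.*}" = "9" ]
  )
}
```

Calling is_supported_os with no argument checks the local node.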

Air-Gapped Deployments

Important: For air-gapped deployments (no internet access), the OS installation ISO must be mounted on all nodes before running the installer or join commands. The installer needs to install one or more packages from the distribution’s repository.

Oracle Linux UEK Kernel

Note: For Oracle Linux 9.7 and later using the Unbreakable Enterprise Kernel (UEK), you must install the kernel-uek-modules-extra-netfilter-$(uname -r) package before running the installer:

# Mount OS ISO first (required for air-gapped)
mount -o loop /path/to/oracle-linux-9.iso /mnt/iso

# Install required kernel modules
dnf install kernel-uek-modules-extra-netfilter-$(uname -r)

This package provides netfilter kernel modules required by K3s and Longhorn.
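To confirm whether a node is affected, the running kernel and the installed package can be inspected. The sketch below assumes that UEK release strings contain "uek", and its function names are illustrative. It does not check the Oracle Linux minor version.

```shell
# Succeeds when the given kernel release string is a UEK build.
is_uek_kernel() {
  case "$1" in
    *uek*) return 0 ;;
    *) return 1 ;;
  esac
}

# Report whether the netfilter extras package is needed and present.
check_uek_netfilter() {
  release=$(uname -r)
  if ! is_uek_kernel "$release"; then
    echo "Not a UEK kernel; package not required"
  elif rpm -q "kernel-uek-modules-extra-netfilter-$release" >/dev/null 2>&1; then
    echo "Package installed for $release"
  else
    echo "Package missing for $release"
  fi
}
```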

SELinux

SELinux is supported when installed in “Enforcing” mode. The installation process will configure appropriate SELinux policies automatically.

Networking Requirements

Network Interface

Each cluster node must have at least one network interface card (NIC) with a default route (default gateway) configured. If the node lacks a pre-configured default route, one must be established prior to installation.
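Whether a default route exists can be confirmed from the routing table. The following is a small sketch: the helper reads route text on standard input so it can be tried on saved output, and its name is hypothetical.

```shell
# Reads routing-table text on stdin; succeeds if a default route exists.
has_default_route() {
  grep -q '^default '
}

# On a node with iproute2 installed, check the live routing table.
if command -v ip >/dev/null 2>&1; then
  if ip route show | has_default_route; then
    echo "default route present"
  else
    echo "no default route: configure one before installing"
  fi
fi
```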

Port Requirements

The cluster requires the following network connectivity:

| Category | Ports | Purpose |
|---|---|---|
| Inter-Node | 2379-2380, 6443, 8472/UDP, 10250, 5001, 9500, 8500 | etcd, API server, Flannel VXLAN, Kubelet, Spegel, Longhorn |
| External Access | 80, 443 | HTTP redirect and HTTPS ingress |
| Application (optional) | 6379, 8125 TCP/UDP, 9093, 9095 | Redis, Telegraf, Alertmanager, Kafka external |

Important: Complete port requirements, network ranges, and firewall configuration procedures are provided in the Networking Guide. Do not expose VictoriaMetrics (8428, 8429), Grafana (3000), or PostgreSQL (5432) directly—access these services only through the secure HTTPS ingress (port 443).

Resource Planning

Calculating Cluster Capacity

When planning cluster capacity, consider the following factors:

  1. Base Overhead: Kubernetes system components consume approximately 1-2 cores and 2-4 GiB memory per node
  2. Application Workloads: Refer to individual component resource requirements in the Architecture Guide
  3. Headroom: Maintain 20-30% resource headroom for workload spikes and automatic scaling
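These factors can be combined into a rough estimate of the CPU left for application pods. The sketch below is one possible way to do the arithmetic, not an official sizing formula. It subtracts the per-node overhead first and then reserves the headroom percentage.

```shell
# Estimate the CPU cores left for application pods.
# Arguments: <nodes> <cores_per_node> <overhead_cores_per_node> <headroom_percent>
schedulable_cores() {
  nodes=$1 cores=$2 overhead=$3 headroom=$4
  after_overhead=$(( nodes * (cores - overhead) ))
  # Integer division rounds down, which errs on the conservative side.
  echo $(( after_overhead * (100 - headroom) / 100 ))
}

# Five 4-core nodes, worst-case 2 cores of overhead each, 30% headroom.
schedulable_cores 5 4 2 30   # prints 7
```

The same calculation applies to memory when the inputs are given in GiB.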

Scaling Considerations

The CDN Manager supports horizontal scaling for most components. The Horizontal Pod Autoscaler (HPA) can automatically adjust replica counts based on resource utilization. Detailed scaling guidance is available in the Architecture Guide.

Example Production Deployment

A minimal production deployment with 3 server nodes (combined role) and 2 agent nodes would require:

| Node Type | Count | CPU Total | Memory Total | Disk Total |
|---|---|---|---|---|
| Server (Combined) | 3 | 12 cores | 24 GiB | 300 GiB |
| Agent | 2 | 8 cores | 16 GiB | 200 GiB |
| Total | 5 | 20 cores | 40 GiB | 500 GiB |

This configuration provides:

  • High availability (survives loss of 1 server node)
  • Capacity for application workloads across all nodes
  • Headroom for horizontal scaling

Next Steps

After verifying system requirements:

  1. Review the Installation Guide for deployment procedures
  2. Consult the Networking Guide for firewall configuration
  3. Examine the Architecture Guide for component resource requirements

3 - Networking Guide

Network architecture and configuration guides

Network Architecture

Physical Network

Each cluster node must have at least one network interface card (NIC) with a default route (default gateway) configured. If the node lacks a pre-configured default route, one must be established prior to installation.

K3s requires a default route to auto-detect the node’s primary IP and for kube-proxy ClusterIP routing to function properly. If no default route exists, create a dummy interface as a workaround:

ip link add dummy0 type dummy
ip link set dummy0 up
ip addr add 203.0.113.254/31 dev dummy0
ip route add default via 203.0.113.255 dev dummy0 metric 1000

Overlay Network

Kubernetes creates virtual network interfaces for pods that are typically not associated with any specific firewalld zone. The cluster uses the following network ranges:

| Network | CIDR | Purpose |
|---|---|---|
| Pod | 10.42.0.0/16 | Inter-pod communication |
| Service | 10.43.0.0/16 | Kubernetes service discovery |

Firewall rules should target the primary physical interface. The overlay network traffic is handled by Flannel VXLAN.
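Physical node addresses should not fall inside the pod or service ranges. The guide does not state this requirement; it is general Kubernetes practice and is treated here as an assumption. The sketch below tests an IPv4 address against both CIDRs using shell integer arithmetic. All function names are illustrative.

```shell
# Convert a dotted IPv4 address to an integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1          # split the address on dots
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeeds when address $1 lies inside CIDR block $2 (e.g. 10.42.0.0/16).
ip_in_cidr() {
  net=${2%/*}
  bits=${2#*/}
  mask=$(( (4294967295 << (32 - bits)) & 4294967295 ))
  [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

# Warn if a node address collides with the pod or service range.
check_node_ip() {
  for cidr in 10.42.0.0/16 10.43.0.0/16; do
    if ip_in_cidr "$1" "$cidr"; then
      echo "WARNING: $1 overlaps $cidr"
      return 1
    fi
  done
  echo "OK: $1 is outside the overlay ranges"
}
```

For example, check_node_ip 192.168.1.10 reports that the address is outside both ranges.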

Port Requirements

Inter-Node Communication

The following ports must be permitted between all cluster nodes for Kubernetes and cluster infrastructure:

| Port | Protocol | Source | Destination | Purpose |
|---|---|---|---|---|
| 2379-2380 | TCP | Server nodes | Server nodes | etcd cluster communication |
| 6443 | TCP | All nodes | Server nodes | Kubernetes API server |
| 8472 | UDP | All nodes | All nodes | Flannel VXLAN overlay network |
| 10250 | TCP | All nodes | All nodes | Kubelet metrics and management |
| 5001 | TCP | All nodes | Server nodes | Spegel registry mirror |
| 9500-9503 | TCP | All nodes | All nodes | Longhorn management API |
| 8500-8504 | TCP | All nodes | All nodes | Longhorn agent communication |
| 10000-30000 | TCP | All nodes | All nodes | Longhorn data replication |
| 3260 | TCP | All nodes | All nodes | Longhorn iSCSI |
| 2049 | TCP | All nodes | All nodes | Longhorn RWX (NFS) |

Application Services Ports

The following ports must be accessible for application services within the cluster:

| Port | Protocol | Service |
|---|---|---|
| 6379 | TCP | Redis |
| 9093 | TCP | Alertmanager |
| 9095 | TCP | Kafka |
| 8086 | TCP | Vector (InfluxDB line protocol listener) |

External Access Ports

The following ports must be accessible from external clients to cluster nodes:

| Port | Protocol | Service |
|---|---|---|
| 80 | TCP | HTTP ingress (Optional, redirects to HTTPS) |
| 443 | TCP | HTTPS ingress (Required, all services) |
| 9095 | TCP | Kafka (external client connections) |
| 6379 | TCP | Redis (external client connections) |
| 8086 | TCP | Vector (InfluxDB line protocol, external metrics senders) |

Network Configuration Guides

Deployment Type

Choose the guide that matches your deployment architecture:

| Guide | Description | Who Should Use This |
|---|---|---|
| Configuring Segregated Networks | Multi-NIC deployments with air-gapped cluster backplane | Most users - If you have separate interfaces for cluster traffic and external internet access |
| Shared Interface Setup | Single-NIC deployments where all traffic shares one interface | Users with a single network interface for both cluster traffic and external access |

Not sure which to use? If you have explicitly separate interfaces for cluster communication and external access, start with Configuring Segregated Networks. Only use the shared interface guide if your hardware is limited to a single NIC.

3.1 - Shared Interface Network Setup

Network configuration for standard single-NIC deployments where all traffic shares a single interface.

Overview

This guide covers network configuration for standard single-NIC deployments. In this architecture, all traffic—including internal cluster communication (East-West) and external internet access (North-South)—is routed through a single network interface.

Security Warning: Because all traffic shares the same interface and firewall zone, there is no physical or logical isolation between cluster management traffic and public-facing service traffic. For production environments requiring security isolation, see Configuring Segregated Networks.

Note: The installer script automatically detects if firewalld is enabled. If so, it will verify that the required inter-node ports are open through the firewall in the default zone before proceeding. If any required ports are missing, the installer will report an error and exit. Application service ports (such as Kafka, VictoriaMetrics, and Vector) are not checked by the installer as they are configurable.

For network architecture, port requirements, and general information, see the Network Architecture Overview section in the main Networking Guide.

Firewall Configuration

Assign Interface to Default Zone

Assign your primary network interface to the default zone:

firewall-cmd --permanent --zone=public --change-interface=<interface>
firewall-cmd --reload

Replace <interface> with your actual interface name (e.g., eth0).

Configure Firewall Rules

In a shared interface setup, you must manually configure firewall rules for both internal cluster traffic and external access, as K3s does not automatically manage the public zone.

# 1. Allow pod and service networks (Internal CIDRs)
firewall-cmd --permanent --zone=public --add-source=10.42.0.0/16
firewall-cmd --permanent --zone=public --add-source=10.43.0.0/16

# 2. Kubernetes and Cluster Infrastructure (East-West Traffic)
# These ports must be opened manually for the cluster to function on a single interface.
firewall-cmd --permanent --zone=public --add-port=2379-2380/tcp
firewall-cmd --permanent --zone=public --add-port=6443/tcp
firewall-cmd --permanent --zone=public --add-port=8472/udp
firewall-cmd --permanent --zone=public --add-port=10250/tcp
firewall-cmd --permanent --zone=public --add-port=5001/tcp
firewall-cmd --permanent --zone=public --add-port=9500-9503/tcp
firewall-cmd --permanent --zone=public --add-port=8500-8504/tcp
firewall-cmd --permanent --zone=public --add-port=10000-30000/tcp
firewall-cmd --permanent --zone=public --add-port=3260/tcp
firewall-cmd --permanent --zone=public --add-port=2049/tcp

# 3. External Access Ports (North-South Traffic)
firewall-cmd --permanent --zone=public --add-port=80/tcp
firewall-cmd --permanent --zone=public --add-port=443/tcp
firewall-cmd --permanent --zone=public --add-port=9095/tcp
firewall-cmd --permanent --zone=public --add-port=6379/tcp
firewall-cmd --permanent --zone=public --add-port=8086/tcp

# Apply changes
firewall-cmd --reload

Verification

Verify all port rules are applied:

firewall-cmd --zone=public --list-all

Expected output:

public (active)
  target: default
  icmp-block-inversion: no
  interfaces: eth0
  sources: 10.42.0.0/16 10.43.0.0/16
  services: dhcpv6-client ssh
  ports: 2379-2380/tcp 6443/tcp 8472/udp 10250/tcp 5001/tcp 9500-9503/tcp 8500-8504/tcp 10000-30000/tcp 3260/tcp 2049/tcp 80/tcp 443/tcp 9095/tcp 6379/tcp 8086/tcp
  protocols:
  masquerade: no
  forward-ports:
  source-ports:
  icmp-blocks:
  rich rules:

Note: Additional interfaces may appear in the zone (e.g., eth0 eth1) if firewalld auto-assigned them based on network configuration. This is expected and does not affect functionality.
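Comparing a long ports line by eye is error-prone. The following sketch lists any required port that is absent from the zone. It assumes the space-separated format printed by firewall-cmd --list-ports. The function and variable names are illustrative.

```shell
# Print every required port that is absent from a space-separated
# list of opened ports.
# Arguments: <opened_ports_string> <required_port>...
missing_ports() {
  opened=" $1 "
  shift
  for port in "$@"; do
    case "$opened" in
      *" $port "*) ;;        # exact, space-delimited match
      *) echo "$port" ;;
    esac
  done
}

# Ports opened by the commands in this guide.
REQUIRED="2379-2380/tcp 6443/tcp 8472/udp 10250/tcp 5001/tcp 9500-9503/tcp 8500-8504/tcp 10000-30000/tcp 3260/tcp 2049/tcp 80/tcp 443/tcp 9095/tcp 6379/tcp 8086/tcp"

# On a node:
#   missing_ports "$(firewall-cmd --zone=public --list-ports)" $REQUIRED
```

An empty result means every listed port is open in the zone.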

Verify the interface is correctly assigned to the public zone:

firewall-cmd --get-active-zones

Expected output will show eth0 listed under the public zone:

public (active)
  interfaces: eth0

Troubleshooting

Nodes Cannot Communicate

Verify firewall rules allow inter-node traffic in the public zone:

firewall-cmd --list-all

Test basic connectivity between nodes:

ping <node-ip>

Post-Installation Troubleshooting

Once the cluster is installed, if you encounter issues with pod-to-pod communication or service access, verify the following:

  1. Flannel Interface: Ensure the flannel.1 interface is up and has the correct IP addresses.
  2. Network Routes: Verify that the pod and service CIDR routes are present in the routing table.
  3. Firewall Rules: Ensure all required Kubernetes and cluster ports are allowed in the public zone.

For detailed troubleshooting of Kubernetes-specific components (like Ingress or Pod connectivity), please refer to the Kubernetes Troubleshooting Guide.
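The first two checks in the list above can be run with iproute2. The sketch below counts routes that fall inside the pod network. It reads route text on standard input so it can also be applied to saved output, and the helper name is hypothetical.

```shell
# Count routes (from "ip route show" text on stdin) that fall inside
# the pod network 10.42.0.0/16.
count_pod_routes() {
  grep -c '^10\.42\.'
}

# On a node, the checks might look like this:
#   ip -br addr show flannel.1
#   ip route show | count_pod_routes
```

A count of zero on a node that is running pods suggests that the overlay routes are missing.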

3.2 - Configuring Segregated Networks

Multi-NIC deployment guide for air-gapped or segregated network setups

Overview

This guide covers configuring a cluster with separate interfaces for internal cluster communication and external internet access (also known as segregated or dual-homed deployments). In this setup, eth1 handles the internal cluster traffic (pod-to-pod, control plane) while eth0 provides public internet access.

Security Benefit: This configuration provides physical isolation between East-West (cluster) and North-South (external) traffic. The trusted zone allows unrestricted internal communication, while the public zone handles external access with controlled port exposure.

When configuring segregated networks with K3s, proper interface binding is essential. K3s uses the --flannel-iface flag to keep pod traffic on the private network and the --node-external-ip flag to advertise the public address for external access. Server nodes additionally require --advertise-address=<ETH1_IP> so that the API server advertises its internal (private) address. Without this flag, K3s promotes the external IP to the advertise address whenever --node-external-ip is set, and the kubernetes service ClusterIP endpoint is then registered with an address that is unreachable from within the cluster.

Important: K3s manages pod masquerading and service routing automatically. You only need to configure firewalld zones correctly and pass the proper flags to the K3s installer.
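The flag combination described above differs only slightly between server and agent nodes. The sketch below assembles the flags for one node. The IP addresses are documentation examples and the function name is hypothetical. How the flags are passed to the product installer is not shown here.

```shell
# Print the K3s networking flags for one node.
# Arguments: <server|agent> <internal_ip> <external_ip> [internal_iface]
# The internal IP is only used for server nodes.
k3s_network_flags() {
  role=$1 internal_ip=$2 external_ip=$3 iface=${4:-eth1}
  flags="--flannel-iface=$iface --node-external-ip=$external_ip"
  if [ "$role" = "server" ]; then
    # Servers must also advertise the internal address for the API server.
    flags="$flags --advertise-address=$internal_ip"
  fi
  echo "$flags"
}

k3s_network_flags server 10.0.0.11 198.51.100.11
k3s_network_flags agent 10.0.0.21 198.51.100.21
```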

Complete, step-by-step instructions follow.

Prerequisites

Before starting, ensure:

  • Operating system is installed and updated on all nodes
  • Network connectivity between nodes is available
  • SSH access is configured for all cluster nodes

Configure Firewalld Zones

This guide configures separate zones for internal cluster traffic and external access.

Assign Interfaces to Zones

The internal network uses the trusted zone to allow unrestricted pod-to-pod and control plane traffic:

# Assign eth0 (external/internet) to public zone
firewall-cmd --permanent --zone=public --change-interface=eth0

# Assign eth1 (internal/cluster) to trusted zone
firewall-cmd --permanent --zone=trusted --change-interface=eth1

# Allow pod and service CIDRs in trusted zone (required for pod communication)
firewall-cmd --permanent --zone=trusted --add-source=10.42.0.0/16
firewall-cmd --permanent --zone=trusted --add-source=10.43.0.0/16

# Reload firewall
firewall-cmd --reload

Configure Firewall Ports

Open the necessary ports on the public zone for external access:

# External access ports
firewall-cmd --permanent --zone=public --add-port=80/tcp
firewall-cmd --permanent --zone=public --add-port=443/tcp
firewall-cmd --permanent --zone=public --add-port=9095/tcp
firewall-cmd --permanent --zone=public --add-port=6379/tcp
firewall-cmd --permanent --zone=public --add-port=8086/tcp

# Apply changes
firewall-cmd --reload

Note: K3s automatically creates iptables rules for internal cluster ports (6443, 10250, 2379-2380, 8472, 5001, 9500-9503, 8500-8504, 10000-30000, 3260, 2049) when using --flannel-iface=eth1. Pod and service CIDRs (10.42.0.0/16 and 10.43.0.0/16) are already allowed in the trusted zone via the --add-source commands above.

Verify Zone Configuration

firewall-cmd --zone=public --list-all
firewall-cmd --zone=trusted --list-all

Expected output for public zone:

public (active)
  target: default
  icmp-block-inversion: no
  interfaces: eth0 eth2
  sources: 
  services: dhcpv6-client ssh cockpit
  ports: 80/tcp 443/tcp 9095/tcp 6379/tcp 8086/tcp
  protocols: 
  forward: yes
  masquerade: no
  forward-ports: 
  source-ports: 
  icmp-blocks: 
  rich rules:

Expected output for trusted zone:

trusted (active)
  target: ACCEPT
  icmp-block-inversion: no
  interfaces: eth1
  sources: 10.42.0.0/16 10.43.0.0/16
  services: ssh mdns
  ports: 
  protocols: 
  forward: yes
  masquerade: no
  forward-ports: 
  source-ports: 
  icmp-blocks: 
  rich rules:

Note: Additional interfaces may appear in a zone (e.g., eth0 eth2) if firewalld auto-assigned them based on network configuration. This is expected and does not affect functionality.

Single-NIC Alternative

If you only have a single network interface, see the Shared Interface Setup guide instead. This guide is specifically for multi-NIC deployments with separate interfaces for cluster and external traffic.

Troubleshooting

Verify Zone Configuration

If pods cannot communicate with services, verify the trusted zone has the correct sources configured:

firewall-cmd --zone=trusted --list-all

Expected output:

trusted (active)
  target: ACCEPT
  icmp-block-inversion: no
  interfaces: eth1
  sources: 10.42.0.0/16 10.43.0.0/16
  services: ssh mdns
  ports: 
  protocols: 
  forward: yes
  masquerade: no
  forward-ports: 
  source-ports: 
  icmp-blocks: 
  rich rules:

Ensure both 10.42.0.0/16 (pod network) and 10.43.0.0/16 (service network) are listed under sources. If missing, re-run:

firewall-cmd --permanent --zone=trusted --add-source=10.42.0.0/16
firewall-cmd --permanent --zone=trusted --add-source=10.43.0.0/16
firewall-cmd --reload

4 - Architecture Guide

Detailed system architecture and component overview

Overview

The AgileTV CDN Manager (ESB3027) is a cloud-native Kubernetes application designed for managing CDN operations. This guide provides a detailed description of the system architecture, component interactions, and scaling considerations.

High-Level Architecture

The CDN Manager follows a microservices architecture deployed on Kubernetes. The system is organized into logical layers:

graph LR
    Clients[API Clients] --> Ingress[Ingress Controller]
    Ingress --> Manager[Core Manager]
    Ingress --> Frontend[MIB Frontend]
    Ingress --> Grafana[Grafana]
    Manager --> Redis[(Redis)]
    Manager --> Kafka[(Kafka)]
    Manager --> PostgreSQL[(PostgreSQL)]
    Manager --> Zitadel[Zitadel IAM]
    Manager --> Confd[Configuration Service]
    Grafana --> VM[(VictoriaMetrics)]
    Confd -.-> Gateway[NGinx Gateway]
    Gateway --> Director[CDN Director]

Component Architecture

Ingress Layer

The ingress layer manages all incoming traffic to the cluster:

| Component | Role |
|---|---|
| Ingress Controller | Primary ingress for all cluster traffic; routes requests to internal services based on path |
| NGinx Gateway | Reverse proxy for routing traffic to external CDN Directors; used by MIB Frontend to communicate with remote Confd instances on CDN Director nodes |

Traffic flow:

  • API clients and Operator UI connect via the Ingress Controller at /api and /gui paths respectively
  • Grafana dashboards are accessed via the Ingress Controller at /grafana
  • Zitadel authentication console is accessed via the Ingress Controller at /ui/console
  • MIB Frontend uses NGinx Gateway when communicating with external Confd instances on CDN Director nodes

Application Services

The application layer contains the core CDN Manager services:

| Component | Role | Scaling |
|---|---|---|
| Core Manager | Main REST API server (v1/v2 endpoints); handles authentication, configuration, routing, and discovery | Horizontally scalable via HPA |
| MIB Frontend | Web-based configuration GUI for operators | Horizontally scalable via HPA |
| Confd | Configuration service for routing configuration; synchronizes with Core Manager application | Single instance |
| Grafana | Monitoring and visualization dashboards | Single instance |
| Selection Input Worker | Consumes selection input events from Kafka and updates configuration | Single instance |
| Metrics Aggregator | Collects and aggregates metrics from CDN components | Single instance |
| Telegraf | System-level metrics collection from cluster nodes | DaemonSet (one per node) |
| Alertmanager | Alert routing and notification management | Single instance |

Data Layer

The data layer provides persistent and ephemeral storage:

| Component | Role | Scaling |
|---|---|---|
| Redis | In-memory caching, session storage, and ephemeral state | Master + replicas (read-only) |
| Kafka | Event streaming for selection input and metrics; provides durable message queue | Controller cluster (odd count) |
| PostgreSQL | Persistent configuration and state storage | 3-node cluster with HA |
| VictoriaMetrics (Analytics) | Real-time and short-term metrics for operational dashboards | Single instance |
| VictoriaMetrics (Billing) | Long-term metrics retention (1+ years) for billing and license compliance | Single instance |

External Integrations

| Component | Role |
|---|---|
| Zitadel IAM | Identity and access management; provides OAuth2/OIDC authentication |
| CDN Director (ESB3024) | Edge routing infrastructure; receives configuration from Confd |

Detailed Component Descriptions

Core Manager

The Core Manager is the central application server that exposes the REST API. It is implemented in Rust using the Actix-web framework.

Key Responsibilities:

  • Authentication and session management via Zitadel
  • Configuration document storage and retrieval
  • Selection input CRUD operations
  • Routing rule evaluation and GeoIP lookups
  • Service discovery for CDN Directors and edge servers
  • Operator UI helper endpoints

API Endpoints:

  • /api/v1/auth/* - Authentication (login, token, logout)
  • /api/v1/configuration - Configuration management
  • /api/v1/selection_input/* - Selection input operations
  • /api/v2/selection_input/* - Enhanced selection input with list operations
  • /api/v1/routing/* - Routing evaluation and validation
  • /api/v1/discovery/* - Host and namespace discovery
  • /api/v1/metrics - System metrics
  • /api/v1/health/* - Liveness and readiness probes
  • /api/v1/operator_ui/* - Operator helper endpoints
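As an illustration of calling one of these endpoints, the sketch below sends an authenticated GET request with curl. The host name and the API_TOKEN variable are placeholders. How a bearer token is obtained is not covered here, and the response format is not assumed.

```shell
# Placeholders: set these for your environment.
CLUSTER_HOST="${CLUSTER_HOST:-cdn-manager.example.com}"
API_TOKEN="${API_TOKEN:-}"

# Build the URL for an API path such as /api/v1/metrics.
api_url() {
  printf 'https://%s%s\n' "$CLUSTER_HOST" "$1"
}

# GET an API path with the bearer token.
# -f makes curl exit non-zero on HTTP errors; -sS hides progress only.
api_get() {
  curl -fsS -H "Authorization: Bearer $API_TOKEN" "$(api_url "$1")"
}

# Example (requires a reachable cluster and a valid token):
#   api_get /api/v1/metrics
```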

Runtime Modes: The Core Manager supports multiple runtime modes, each deployed as a separate container:

  • http-server - Primary HTTP API server (default)
  • metrics-aggregator - Background worker for metrics collection
  • selection-input - Background worker for Kafka selection input consumption

MIB Frontend

The MIB Frontend provides a web-based GUI for configuration management.

Key Features:

  • Intuitive web interface for CDN configuration
  • Real-time configuration validation
  • Integration with Zitadel for SSO authentication
  • Uses NGinx Gateway for external Director communication

Confd (Configuration Service)

Confd provides routing configuration services and synchronizes with the Core Manager application.

Key Responsibilities:

  • Hosts the service configuration for routing decisions
  • Provides API and CLI for configuration management
  • Synchronizes routing configuration with Core Manager
  • Maintains configuration state in PostgreSQL

Selection Input Worker

The Selection Input Worker processes selection input events from the Kafka stream.

Key Responsibilities:

  • Consumes messages from the selection_input Kafka topic
  • Validates and transforms input data
  • Updates configuration in the data store
  • Maintains message ordering within partitions

Scaling Limitation: The Selection Input Worker cannot be scaled beyond a single consumer per Kafka partition, as message ordering must be preserved.
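The limitation can be put in numbers. Assuming the worker uses a standard Kafka consumer group, in which each partition is read by at most one group member, the replicas that actually receive messages are capped by the partition count. The function below is a sketch of that arithmetic only.

```shell
# Number of worker replicas that would actually receive messages.
# Arguments: <replicas> <partitions>
active_consumers() {
  replicas=$1 partitions=$2
  if [ "$replicas" -lt "$partitions" ]; then
    echo "$replicas"
  else
    echo "$partitions"
  fi
}

active_consumers 3 1   # prints 1: two replicas would sit idle
```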

Metrics Aggregator

The Metrics Aggregator collects and processes metrics from CDN components.

Key Responsibilities:

  • Polls metrics from Director instances
  • Aggregates usage statistics
  • Writes data to VictoriaMetrics (Analytics) for dashboards
  • Writes long-term data to VictoriaMetrics (Billing) for compliance

Telegraf

Telegraf is deployed as a DaemonSet to collect host-level metrics.

Key Responsibilities:

  • CPU, memory, disk, and network metrics from each node
  • Container-level resource usage
  • Kubernetes cluster metrics
  • Forwards metrics to VictoriaMetrics

Grafana

Grafana provides visualization and dashboard capabilities.

Features:

  • Pre-built dashboards for CDN monitoring
  • Custom dashboard support
  • VictoriaMetrics as data source
  • Alerting integration with Alertmanager

Access: https://<host>/grafana

Alertmanager

Alertmanager handles alert routing and notifications.

Key Responsibilities:

  • Receives alerts from Grafana and other sources
  • Deduplicates and groups alerts
  • Routes to notification channels (email, webhook, etc.)
  • Manages alert silencing and inhibition

Data Storage

Redis

Redis provides in-memory storage for:

  • User sessions and authentication tokens
  • Ephemeral configuration cache
  • Real-time state synchronization

Deployment: Master + read replicas for high availability

Kafka

Kafka provides durable event streaming for:

  • Selection input events
  • Metrics data streams
  • Inter-service communication

Deployment: Controller cluster with 3 replicas for production, 1 replica for lab deployments

Node Affinity: Kafka replicas must be scheduled on separate nodes to ensure high availability. The Helm chart configures pod anti-affinity rules to enforce this distribution.

Topics:

  • selection_input - Selection input events
  • metrics - Metrics data streams

Note: For lab/single-node deployments, the Kafka replica count must be set to 1 in the Helm values. Production deployments require 3 replicas for fault tolerance.

PostgreSQL

PostgreSQL provides persistent storage for:

  • Configuration documents
  • User and permission data
  • System state

Deployment: 3-node cluster managed by Cloudnative PG (CNPG) operator

High Availability: The CNPG operator manages automatic failover and ensures high availability:

  • One primary node handles read/write operations
  • Two replica nodes provide redundancy and can be promoted to primary on failure
  • Automatic failover occurs within seconds of primary node failure
  • Synchronous replication ensures data consistency

Note: The PostgreSQL cluster is deployed and managed automatically by the CNPG operator. Manual intervention is typically not required for normal operations.

VictoriaMetrics

Two VictoriaMetrics instances serve different purposes:

VictoriaMetrics (Analytics):

  • Real-time and short-term metrics storage
  • Supports Grafana dashboards
  • Retention: Configurable (typically 30-90 days)

VictoriaMetrics (Billing):

  • Long-term metrics retention
  • Billing and license compliance data
  • Retention: Minimum 1 year

Authentication and Authorization

Zitadel Integration

Zitadel provides identity and access management:

Authentication Flow:

  1. User accesses MIB Frontend or API
  2. Redirected to Zitadel for authentication
  3. Zitadel validates credentials and issues session token
  4. Session token exchanged for access token
  5. Access token included in API requests (Bearer authentication)

Default Credentials: See the Glossary for default login credentials.

Access Paths:

  • Zitadel Console: /ui/console
  • API authentication: /api/v1/auth/*

CORS Configuration

Zitadel enforces Cross-Origin Resource Sharing (CORS) policies. The external hostname configured in Zitadel must match the first entry in global.hosts.manager in the Helm values.

Network Architecture

Traffic Flow

graph TB
    External[External Clients] --> Ingress[Ingress Controller]
    External --> Redis[(Redis)]
    External --> Kafka[(Kafka)]
    External --> Telegraf[Telegraf]
    Ingress --> Manager[Core Manager]
    Ingress --> Frontend[MIB Frontend]
    Ingress --> Grafana[Grafana]
    Ingress --> Zitadel[Zitadel]

Note: Certain services (Redis, Kafka, Telegraf) can be accessed directly by external clients without traversing the ingress controller. This is typically used for metrics collection, event streaming, and direct data access scenarios.

Internal Communication

All internal services communicate over the Kubernetes overlay network (Flannel VXLAN). Services discover each other via Kubernetes DNS.

External Communication

  • CDN Directors: Accessed via NGinx Gateway for simplified routing
  • MaxMind GeoIP: Local database files (no external calls)

Scaling

Horizontal Pod Autoscaler (HPA)

The following components support automatic horizontal scaling via HPA:

| Component | Minimum | Maximum | Scale Metrics |
|---|---|---|---|
| Core Manager | 3 | 8 | CPU (50%), Memory (80%) |
| NGinx Gateway | 2 | 4 | CPU (75%), Memory (80%) |
| MIB Frontend | 2 | 4 | CPU (75%), Memory (90%) |

Note: HPA is enabled by default in the Helm chart. The default configuration is tuned for production deployments. Adjust min/max values based on expected load and available cluster capacity.
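To illustrate how the thresholds in the table translate into replica counts, the sketch below applies the standard Kubernetes HPA rule, desired = ceil(current replicas × current utilization / target utilization), clamped to the configured minimum and maximum. The function name is hypothetical and not part of the product. Real HPA behavior also includes a tolerance band and stabilization windows, and with several metrics it takes the largest result, so treat this as an approximation.

```shell
# hpa_desired CURRENT_REPLICAS CURRENT_UTIL TARGET_UTIL MIN MAX
# Prints the replica count suggested by the basic HPA formula.
# Utilization values are whole percentages (for example 80 for 80%).
hpa_desired() {
  local current="$1" util="$2" target="$3" min="$4" max="$5"
  # Integer ceiling of (current * util / target)
  local desired=$(( (current * util + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired="$min"; fi
  if [ "$desired" -gt "$max" ]; then desired="$max"; fi
  echo "$desired"
}

# Core Manager (CPU target 50%, min 3, max 8) at 80% average CPU on 3 replicas:
#   hpa_desired 3 80 50 3 8    # prints 5
```

With the same settings, a sustained 200% utilization would request 12 replicas and be capped at the maximum of 8.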

Manual Scaling

Components can also be scaled manually by setting replica counts in the Helm values:

manager:
  replicaCount: 3
mib-frontend:
  replicaCount: 2

Important: When manually setting replica counts, you must disable the Horizontal Pod Autoscaler (HPA) for the corresponding component. If HPA remains enabled, it will override manual replica settings. To disable HPA, set autoscaling.hpa.enabled: false for the component in your Helm values.
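As a sketch of the combined setting, the helper below prints a Helm values fragment that pins a replica count and disables HPA for one component. The function name is hypothetical, and the assumption that the autoscaling block nests directly under the component key is inferred from the text above; confirm the exact key layout against the chart's values file before using it.

```shell
# manual_scaling_values COMPONENT REPLICAS
# Prints a values fragment with a fixed replica count and HPA turned off.
# Assumption: autoscaling.hpa.enabled sits under the component's top-level key.
manual_scaling_values() {
  local component="$1" replicas="$2"
  cat <<EOF
${component}:
  replicaCount: ${replicas}
  autoscaling:
    hpa:
      enabled: false
EOF
}

# Example: write an overlay for the Core Manager with 3 fixed replicas
#   manual_scaling_values manager 3 > manual-scaling.yaml
```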

Components That Do Not Scale

The following components do not support horizontal scaling:

| Component | Reason |
|---|---|
| Confd | Single instance required for configuration consistency |
| PostgreSQL | Cloudnative PG cluster; scaled by adding replicas via operator configuration |
| Kafka | Scaled by adding controllers, not via replica count |
| VictoriaMetrics | Stateful; single instance per role |
| Redis | Master is single; replicas are read-only |
| Grafana | Single instance sufficient for dashboard access |
| Alertmanager | Single instance for alert routing |
| Selection Input Worker | Kafka message ordering requires single consumer |
| Metrics Aggregator | Single instance for consistent metrics aggregation |

Node Scaling

Additional Agent nodes can be added to the cluster at any time to increase workload capacity. Kubernetes automatically schedules pods to nodes with available resources.

Cluster Balancing

The CDN Manager deployment includes the Kubernetes Descheduler to maintain balanced resource utilization across cluster nodes:

  • Automatic Rebalancing: The descheduler periodically analyzes pod distribution and evicts pods from overutilized nodes
  • Node Balance: Helps prevent resource hotspots by redistributing workloads across available nodes
  • Integration with HPA: Works in conjunction with Horizontal Pod Autoscaler to optimize both pod count and placement

The descheduler runs as a background process and does not require manual intervention under normal operating conditions.

Resource Configuration

For detailed resource preset configurations and planning guidance, see the Configuration Guide.

High Availability

Server Node Redundancy

Production deployments require a minimum of 3 Server nodes:

  • Survives loss of 1 server node
  • Maintains quorum for etcd and Kafka

For enhanced availability, use 5 Server nodes:

  • Survives loss of 2 server nodes
  • Recommended for critical production environments

For large-scale deployments, 7 or more Server nodes can be used:

  • Survives loss of 3+ server nodes
  • Suitable for high-capacity production environments
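The node counts above follow from majority quorum arithmetic: a cluster of N voting members needs floor(N/2) + 1 members available, so it tolerates floor((N-1)/2) failures. The short sketch below (function names are illustrative) prints these figures and shows why an even server count adds no fault tolerance over the next lower odd count.

```shell
# quorum_size N: members required for a majority of N voting members.
quorum_size() { echo $(( $1 / 2 + 1 )); }

# fault_tolerance N: members that can fail while a majority remains.
fault_tolerance() { echo $(( ($1 - 1) / 2 )); }

for n in 3 4 5 7; do
  echo "servers=$n quorum=$(quorum_size "$n") tolerates=$(fault_tolerance "$n")"
done
```

Three and four servers both tolerate a single failure, which is why the recommended sizes are 3, 5, and 7.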

Pod Distribution

Kubernetes automatically distributes pods across nodes to maximize availability:

  • Pods belonging to the same deployment are scheduled on different nodes when possible
  • Pod Disruption Budgets (PDB) ensure minimum availability during maintenance

Data Replication

| Component | Replication Strategy |
|---|---|
| Redis | Single instance (backup via Longhorn snapshots) |
| Kafka | Replicated partitions (default: 3) |
| PostgreSQL | 3-node cluster via Cloudnative PG |
| VictoriaMetrics | Single instance (backup via snapshots) |
| Longhorn | Single replica with pod-node affinity |

Longhorn Storage: Longhorn volumes are configured with a single replica by default. Node affinity is set so that pods prefer the node that holds their persistent volume data, which keeps volume I/O local to that node.

Next Steps

After understanding the architecture:

  1. Installation Guide - Deploy the CDN Manager
  2. Configuration Guide - Configure components for your environment
  3. Operations Guide - Day-to-day operational procedures
  4. Performance Tuning Guide - Optimize system performance
  5. Metrics & Monitoring - Set up monitoring and alerting

5 - Installation Guide

Step-by-step installation and upgrade procedures

Overview

This guide provides detailed instructions for installing the AgileTV CDN Manager (ESB3027) in various deployment scenarios. The installation process varies depending on the target environment and desired configuration.

Estimated Installation Time:

| Deployment Type | Time |
|---|---|
| Single-Node (Lab) | ~15 minutes |
| Multi-Node (3 servers) | ~30 minutes |

Actual installation time may vary depending on hardware performance, network speed, and whether air-gapped procedures are required.

Note: These estimates assume the operating system is already installed on all nodes. OS installation is outside the scope of this guide.

Installation Types

| Installation Type | Description | Use Case |
|---|---|---|
| Single-Node (Lab) | Minimal installation on a single host | Acceptance testing, demonstrations, development |
| Multi-Node (Production) | Full high-availability cluster with 3+ server nodes | Production deployments |

Installation Process Summary

The installation follows a sequential process:

  1. Prepare the host system - Verify requirements and mount the installation ISO
  2. Install the Kubernetes cluster - Deploy K3s, Longhorn storage, and PostgreSQL
  3. Join additional nodes (production only) - Expand the cluster for HA or capacity
  4. Deploy the Manager application - Install the CDN Manager Helm chart
  5. Post-installation configuration - Configure authentication, networking, and users

| Guide | Description |
|---|---|
| Installation Checklist | Step-by-step checklist to track progress |
| Single-Node Installation | Lab and acceptance testing deployment |
| Multi-Node Installation | Production high-availability deployment |
| Air-Gapped Deployment | Air-gapped environment installation |
| Helm Chart Installation | Common Helm chart deployment steps |
| Upgrade Guide | Upgrading from previous versions |
| Next Steps | Post-installation configuration tasks |

Prerequisites

Before beginning installation, ensure the following requirements are met:

  • Hardware: Nodes meeting the System Requirements including CPU, memory, and disk specifications
  • Operating System: RHEL 9 or compatible clone (details); air-gapped deployments require the OS ISO mounted on all nodes
  • Network: Proper firewall configuration between nodes (port requirements, firewall configuration)
  • Software: Installation ISO obtained from AgileTV; air-gapped deployments also require the Extras ISO
  • Kernel Tuning: For production deployments, apply recommended sysctl settings (Performance Tuning Guide)

We recommend using the Installation Checklist to track your progress through the installation process.

Getting Help

If you encounter issues during installation, see the Troubleshooting section of the Installation Checklist and the Troubleshooting Guide.

5.1 - Installation Checklist

Step-by-step checklist to track installation progress

Overview

Use this checklist to track your installation progress. Print this page or keep it open during your installation to ensure all steps are completed correctly.

Pre-Installation

Hardware and Software

  • Verify hardware meets System Requirements
  • Confirm operating system is supported (RHEL 9 or compatible clone)
  • Configure firewall rules between nodes (details)
  • Apply recommended sysctl settings (details)
  • Obtain installation ISO (esb3027-acd-manager-X.Y.Z.iso)

Air-Gapped Deployments

  • Obtain Extras ISO (esb3027-acd-manager-extras-X.Y.Z.iso)
  • Mount OS ISO on all nodes before installation
  • Verify OS packages are accessible from mounted ISO

Special Requirements

  • Oracle Linux UEK: Install kernel-uek-modules-extra-netfilter-$(uname -r) package
  • Control Plane Only nodes: Set SKIP_REQUIREMENTS_CHECK=1 if below lab minimums
  • SELinux: Set to “Enforcing” mode before running installer (cannot enable after)

Cluster Installation

Single-Node Deployment

Follow the Single-Node Installation Guide.

  • Mount installation ISO (Step 1)
  • Install the base cluster (Step 2)
  • Verify cluster status (Step 3)
  • Air-gapped only: Load container images (Step 4)
  • Create configuration file (Step 5)
  • Optional: Load MaxMind GeoIP databases (Step 6)
  • Deploy the Manager Helm chart (Step 7)
  • Verify deployment (Step 8)

Multi-Node Deployment

Follow the Multi-Node Installation Guide.

Primary Server Node

  • Mount installation ISO (Step 1)
  • Install the base cluster (Step 2)
  • Verify system pods are running (Step 2)
  • Retrieve the node token (Step 3)

Additional Server Nodes

  • Mount installation ISO (Step 5)
  • Join the cluster (Step 5)
  • Verify each node joins (Step 5)
  • Optional: Taint Control Plane Only nodes (Step 5b)

Agent Nodes (Optional)

  • Mount installation ISO (Step 6)
  • Join the cluster as an agent (Step 6)
  • Verify each agent joins (Step 6)

Cluster Verification

  • Verify all nodes are ready (Step 7)
  • Verify system pods running on all nodes (Step 7)
  • Air-gapped only: Load container images on each node (Step 9)

Application Deployment

  • Create configuration file (Step 10)
  • Optional: Load MaxMind GeoIP databases (Step 11)
  • Optional: Configure TLS certificates from trusted CA (Step 12)
  • Deploy the Manager Helm chart (Step 13)
  • Verify all pods are running and distributed (Step 14)
  • Configure DNS records for manager hostname (Step 15)

Post-Installation

Initial Access

  • Access the system via HTTPS
  • Accept self-signed certificate warning (if using default certificate)
  • Log in with default credentials (see Glossary)

Security Configuration

  • Create new administrator account in Zitadel
  • Delete or secure the default admin account
  • Configure additional users and permissions
  • Review Zitadel Administrator Documentation for role assignments

Monitoring and Operations

  • Access Grafana dashboards at /grafana
  • Review pre-built monitoring dashboards
  • Configure alerting rules (optional)
  • Set up notification channels (optional)

Next Steps

  • Review Next Steps Guide for additional configuration
  • Configure CDN routing rules
  • Set up GeoIP-based routing (if using MaxMind databases)
  • Review Operations Guide for day-to-day procedures

Troubleshooting

If you encounter issues during installation:

  1. Check pod status: kubectl describe pod <pod-name>
  2. Review logs: kubectl logs <pod-name>
  3. Check cluster events: kubectl get events --sort-by='.lastTimestamp'
  4. Review the Troubleshooting Guide for common issues

5.2 - Single-Node Installation

Lab and acceptance testing deployment

Warning: Single-node deployments are for lab environments, acceptance testing, and demonstrations only. This configuration is not suitable for production workloads. For production deployments, see the Multi-Node Installation Guide, which requires a minimum of 3 server nodes for high availability.

Air-Gapped Deployment? This guide assumes internet connectivity. For air-gapped deployments, see the Air-Gapped Deployment Guide for additional requirements and procedures.

Overview

This guide describes the installation of the AgileTV CDN Manager on a single node. This configuration is intended for lab environments, acceptance testing, and demonstrations only. It is not suitable for production workloads.

Prerequisites

Hardware Requirements

Refer to the System Requirements Guide for hardware specifications. Single-node deployments require the “Single-Node (Lab)” configuration.

Operating System

Refer to the System Requirements Guide for supported operating systems.

Software Access

  • Installation ISO: esb3027-acd-manager-X.Y.Z.iso
  • Extras ISO (air-gapped only): esb3027-acd-manager-extras-X.Y.Z.iso

Network Configuration

Ensure that required firewall ports are configured before installation. See the Networking Guide for complete firewall configuration requirements.

SELinux

If SELinux is to be used, it must be set to “Enforcing” mode before running the installer script. The installer will configure appropriate SELinux policies automatically. SELinux cannot be enabled after installation.
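A minimal pre-flight sketch for this requirement, assuming the standard getenforce utility found on RHEL-family systems. The helper name is hypothetical; it only checks the reported mode and changes nothing.

```shell
# selinux_mode_ok MODE: succeeds only when MODE is exactly "Enforcing".
selinux_mode_ok() {
  case "$1" in
    Enforcing) return 0 ;;
    *)         return 1 ;;
  esac
}

# Typical use on a node before running the installer:
#   if selinux_mode_ok "$(getenforce)"; then
#     echo "SELinux is enforcing; safe to run the installer"
#   else
#     echo "SELinux is not enforcing; it cannot be enabled after installation" >&2
#   fi
```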

Installation Steps

Step 1: Mount the ISO

Create a mount point and mount the installation ISO:

mkdir -p /mnt/esb3027
mount -o loop,ro esb3027-acd-manager-X.Y.Z.iso /mnt/esb3027

Replace X.Y.Z with the actual version number.

Step 2: Install the Base Cluster

Run the installer to set up the K3s Kubernetes cluster:

/mnt/esb3027/install

This installs:

  • K3s Kubernetes distribution
  • Longhorn distributed storage
  • Cloudnative PG operator for PostgreSQL
  • Base system dependencies

The installer will configure the node as both a server and agent node.

Step 3: Verify Cluster Status

After the installer completes, verify that all components are operational before proceeding. This verification serves as an important checkpoint to confirm the installation is progressing correctly.

1. Verify the node is ready:

kubectl get nodes

Expected output:

NAME         STATUS   ROLES                       AGE   VERSION
k3s-server   Ready    control-plane,etcd,master   2m    v1.33.4+k3s1

2. Verify system pods in both namespaces are running:

# Check kube-system namespace (Kubernetes core components)
kubectl get pods -n kube-system

# Check longhorn-system namespace (distributed storage)
kubectl get pods -n longhorn-system

All pods should show Running status. If any pods are still Pending or ContainerCreating, wait until they are ready. Proceeding with incomplete system pods can cause subsequent steps to fail in unpredictable ways.

This verification confirms:

  • K3s cluster is operational
  • Longhorn distributed storage is running
  • Cloudnative PG operator is deployed
  • All core components are healthy before continuing
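The pod check can be scripted rather than read by eye. The sketch below filters the output of kubectl get pods --no-headers and lists every pod that is not yet healthy. The helper name is hypothetical. Treating Completed as healthy is an assumption made here because one-off job pods finish in that state; the guide itself only mentions Running.

```shell
# pods_not_ready: reads `kubectl get pods --no-headers` output on stdin and
# prints "NAME STATUS" for each pod that is neither Running nor Completed.
# Exit status is 0 only when no such pod is found.
pods_not_ready() {
  awk '$3 != "Running" && $3 != "Completed" { print $1, $3; bad = 1 }
       END { exit (bad ? 1 : 0) }'
}

# Typical use:
#   for ns in kube-system longhorn-system; do
#     kubectl get pods -n "$ns" --no-headers | pods_not_ready \
#       || echo "namespace $ns is not ready yet" >&2
#   done
```

An empty listing also passes this check, so confirm that both namespaces actually contain pods.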

Step 4: Air-Gapped Deployments (If Applicable)

If deploying in an air-gapped environment, load container images from the extras ISO:

mkdir -p /mnt/esb3027-extras
mount -o loop,ro esb3027-acd-manager-extras-X.Y.Z.iso /mnt/esb3027-extras
/mnt/esb3027-extras/load-images

Step 5: Deploy the Manager Helm Chart

For complete instructions on deploying the CDN Manager Helm chart, including configuration file setup, MaxMind GeoIP database loading, TLS certificate configuration, deployment commands, and verification steps, see the Helm Chart Installation Guide.

This guide covers the common deployment steps that apply to all installation types. After completing the helm chart installation steps, proceed to Post-Installation below.

Post-Installation

After installation completes, proceed to the Next Steps guide for:

  • Initial user configuration
  • Accessing the web interfaces
  • Configuring authentication
  • Setting up monitoring

Accessing the System

Refer to the Accessing the System section in the Getting Started guide for service URLs and default credentials.

Note: A self-signed SSL certificate is deployed by default. You will need to accept the certificate warning in your browser.

Troubleshooting

If pods fail to start:

  1. Check pod status: kubectl describe pod <pod-name>
  2. Review logs: kubectl logs <pod-name>
  3. Verify resources: kubectl top pods

See the Troubleshooting Guide for additional assistance.

Next Steps

After successful installation:

  1. Next Steps Guide - Post-installation configuration
  2. Configuration Guide - System configuration
  3. Operations Guide - Day-to-day operations

Appendix: Lab Configuration File

The installation ISO includes a pre-built lab configuration at /mnt/esb3027/values-lab.yaml, designed specifically for single-node deployments. It handles single-replica settings for Kafka and Zitadel, resource sizing, and TLS configuration automatically.

Copy it as your starting point:

cp /mnt/esb3027/values-lab.yaml ~/values.yaml

At minimum, update these two fields to match your environment before deploying:

global:
  hosts:
    manager:
      - host: manager.local   # Replace with your hostname or IP

zitadel:
  zitadel:
    configmapConfig:
      ExternalDomain: manager.local   # Must match global.hosts.manager[0].host

These two values must match exactly or authentication will fail. For a full description of all available options in values-lab.yaml, see the Configuration Guide.
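Because a mismatch only surfaces later as an authentication failure, it can help to compare the two fields before deploying. The sketch below is a deliberately simple text check, not a YAML parser: it assumes unquoted values and the layout shown above, where the manager entry is the first "- host:" line in the file. All function names are hypothetical.

```shell
# Prints the first "- host:" value found in a Helm values file.
first_manager_host() {
  awk '/^[ \t]*-[ \t]*host:/ { sub(/[ \t]*#.*/, ""); print $3; exit }' "$1"
}

# Prints the first ExternalDomain value found in a Helm values file.
zitadel_external_domain() {
  awk '/^[ \t]*ExternalDomain:/ { sub(/[ \t]*#.*/, ""); print $2; exit }' "$1"
}

# Succeeds only when both values are present and identical.
hosts_match() {
  local host domain
  host=$(first_manager_host "$1")
  domain=$(zitadel_external_domain "$1")
  [ -n "$host" ] && [ "$host" = "$domain" ]
}

# Example:
#   hosts_match ~/values.yaml || echo "manager host and ExternalDomain differ" >&2
```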

5.3 - Multi-Node Installation

Production high-availability deployment

Overview

This guide describes the installation of the AgileTV CDN Manager across multiple nodes for production deployments. This configuration provides high availability and horizontal scaling capabilities.

Air-Gapped Deployment? This guide assumes internet connectivity. For air-gapped deployments, see the Air-Gapped Deployment Guide for additional requirements and procedures.

Prerequisites

Hardware Requirements

Refer to the System Requirements Guide for hardware specifications. Production deployments require:

  • Minimum 3 Server nodes (Control Plane Only or Combined role)
  • Optional Agent nodes for additional workload capacity

Operating System

Refer to the System Requirements Guide for supported operating systems.

Software Access

  • Installation ISO: esb3027-acd-manager-X.Y.Z.iso (for each node)
  • Extras ISO (air-gapped only): esb3027-acd-manager-extras-X.Y.Z.iso

Network Configuration

Ensure that required firewall ports are configured between all nodes before installation. See the Configuring Segregated Networks guide for the standard firewall configuration.

Note: When using segregated networks, the K3s API server on the primary node will be reachable via its internal/private interface. Consequently, when joining additional nodes, the <primary-server-ip> provided to the join script must be the internal/private IP address of the primary node to ensure the join request is routed correctly through the private network.

Single-NIC Deployments: If your nodes have only a single network interface, see the Shared Interface Setup guide instead. This guide assumes segregated networks with separate interfaces for cluster traffic (eth1) and external access (eth0).

Segregated Network Configuration

If your nodes have multiple network interfaces and you want to use a separate interface for cluster traffic (not the default route interface), configure the INSTALL_K3S_EXEC environment variable before installing the cluster or joining nodes.

For segregated networks (private cluster network on eth1 + public external access on eth0), set the following K3s flags. Server nodes take all four; agent nodes omit --advertise-address:

# For server nodes
export INSTALL_K3S_EXEC="server --node-ip=<ETH1_IP> --node-external-ip=<ETH0_IP> --flannel-iface=eth1 --advertise-address=<ETH1_IP>"

# For agent nodes  
export INSTALL_K3S_EXEC="agent --node-ip=<ETH1_IP> --node-external-ip=<ETH0_IP> --flannel-iface=eth1"

Where:

  • Mode: Use server for the primary node establishing the cluster, or for additional server nodes. Use agent for agent nodes joining the cluster.
  • --node-ip=<ETH1_IP>: The internal/private IP address of eth1 for cluster communication
  • --node-external-ip=<ETH0_IP>: The public IP address of eth0 for external access (LoadBalancer services, ingress)
  • --flannel-iface=eth1: The network interface name for Flannel VXLAN overlay traffic
  • --advertise-address=<ETH1_IP>: The address the API server advertises to cluster members. In a segregated-network deployment it must be the internal/private IP address: when --node-external-ip is set and this flag is omitted, K3s defaults to the external IP, and the kubernetes service endpoint registers as an unreachable address. This flag is required for server nodes only; agent nodes do not run an API server.

Set this variable on each node before running the install or join scripts.
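To avoid copy-and-paste errors when repeating this on every node, the value can be assembled by a small helper. The sketch below builds the same strings shown above; the function name is hypothetical, the interface defaults to eth1, and the addresses in the example are placeholders.

```shell
# k3s_exec_args ROLE INTERNAL_IP EXTERNAL_IP [IFACE]
# Prints the INSTALL_K3S_EXEC value for a segregated-network node.
k3s_exec_args() {
  local role="$1" internal_ip="$2" external_ip="$3" iface="${4:-eth1}"
  local args="$role --node-ip=$internal_ip --node-external-ip=$external_ip --flannel-iface=$iface"
  case "$role" in
    # Only server nodes run an API server, so only they advertise an address.
    server) args="$args --advertise-address=$internal_ip" ;;
    agent)  ;;
    *) echo "role must be 'server' or 'agent'" >&2; return 1 ;;
  esac
  echo "$args"
}

# Example (placeholder addresses):
#   export INSTALL_K3S_EXEC="$(k3s_exec_args server 10.0.0.11 203.0.113.11)"
```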

SELinux

If SELinux is to be used, it must be set to “Enforcing” mode before running the installer script. The installer will configure appropriate SELinux policies automatically. SELinux cannot be enabled after installation.

Installation Steps

Step 1: Prepare the Primary Server Node

Mount the installation ISO on the primary server node:

mkdir -p /mnt/esb3027
mount -o loop,ro esb3027-acd-manager-X.Y.Z.iso /mnt/esb3027

Replace X.Y.Z with the actual version number.

Step 2: Install the Base Cluster on Primary Server

Segregated Networks: If your node has multiple network interfaces, set the INSTALL_K3S_EXEC environment variable with the complete segregated network configuration before running the installer (see Segregated Network Configuration):

export INSTALL_K3S_EXEC="server --node-ip=<ETH1_IP> --node-external-ip=<ETH0_IP> --flannel-iface=eth1 --advertise-address=<ETH1_IP>"

Replace <ETH1_IP> with the internal/private IP address and <ETH0_IP> with the public IP address.

If your node has only a single network interface, do not set INSTALL_K3S_EXEC. K3s will use the default interface automatically.

Run the installer to set up the K3s Kubernetes cluster:

/mnt/esb3027/install

This installs:

  • K3s Kubernetes distribution
  • Longhorn distributed storage
  • Cloudnative PG operator for PostgreSQL
  • Base system dependencies

Important: After the installer completes, verify that all system pods in both namespaces are in the Running state before proceeding:

# Check kube-system namespace (Kubernetes core components)
kubectl get pods -n kube-system

# Check longhorn-system namespace (distributed storage)
kubectl get pods -n longhorn-system

All pods should show Running status. If any pods are still Pending or ContainerCreating, wait until they are ready. Proceeding with incomplete system pods can cause subsequent steps to fail in unpredictable ways.

This verification confirms:

  • K3s cluster is operational
  • Longhorn distributed storage is running
  • Cloudnative PG operator is deployed
  • All core components are healthy before continuing

Step 3: Retrieve the Node Token

Retrieve the node token for joining additional nodes:

cat /var/lib/rancher/k3s/server/node-token

Save this token for use on additional nodes. Also note the IP address of the primary server node.

Step 4: Server vs Agent Node Roles

Before joining additional nodes, determine which nodes will serve as Server nodes vs Agent nodes:

| Role | Control Plane | Workloads | HA Quorum | Use Case |
|---|---|---|---|---|
| Server Node (Combined) | Yes (etcd, API server) | Yes | Participates | Default production role; minimum 3 nodes |
| Server Node (Control Plane Only) | Yes (etcd, API server) | No | Participates | Dedicated control plane; requires separate Agent nodes |
| Agent Node | No | Yes | No | Additional workload capacity only |

Guidance:

  • Combined role (default): Server nodes run both control plane and workloads; minimum 3 nodes required for HA
  • Control Plane Only: Dedicate nodes to control plane functions; requires at least 3 Server nodes plus 3+ Agent nodes for workloads
  • Agent nodes are required if using Control Plane Only servers; optional if using Combined role servers
  • For most deployments, 3 Server nodes (Combined role) with no Agent nodes is sufficient
  • Add Agent nodes to scale workload capacity without affecting control plane quorum

Proceed to Step 5 to join Server nodes. Agent nodes are joined after all Server nodes are ready.

Step 5: Join Additional Server Nodes

On each additional server node:

  1. Mount the ISO:

    mkdir -p /mnt/esb3027
    mount -o loop,ro esb3027-acd-manager-X.Y.Z.iso /mnt/esb3027
    
  2. Join the cluster:

Segregated Networks: If your node has multiple network interfaces, set the INSTALL_K3S_EXEC environment variable with the complete segregated network configuration before running the join script (see Segregated Network Configuration):

export INSTALL_K3S_EXEC="server --node-ip=<ETH1_IP> --node-external-ip=<ETH0_IP> --flannel-iface=eth1 --advertise-address=<ETH1_IP>"

Replace <ETH1_IP> with the internal/private IP address and <ETH0_IP> with the public IP address.

If your node has only a single network interface, do not set INSTALL_K3S_EXEC. K3s will use the default interface automatically.

Note for Segregated Networks: When joining nodes in a segregated network environment, ensure the <primary-server-ip> used in the join command is the internal/private IP address (the eth1 address) of the primary server. Using the external IP will cause the join attempt to fail as the service will be listening on the private interface.

Run the join script:

/mnt/esb3027/join-server https://<primary-server-ip>:6443 <node-token>

Replace <primary-server-ip> with the IP address of the primary server and <node-token> with the token retrieved in Step 3.

  3. Verify the node joined successfully:

    kubectl get nodes

Repeat for each server node. A minimum of 3 server nodes is required for high availability.

Step 5b: Taint Control Plane Only Nodes (Optional)

If you are using dedicated Control Plane Only nodes (not Combined role), apply taints to prevent workload scheduling:

kubectl taint nodes <node-name> CriticalAddonsOnly=true:NoSchedule

Apply this taint to each Control Plane Only node. Verify taints are applied:

kubectl describe nodes | grep -A 5 "Taints"

Note: This step is only required if you want dedicated control plane nodes. For Combined role deployments, do not apply taints.
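When several Control Plane Only nodes need the taint, generating the commands first makes them easy to review before anything is applied. This sketch only prints the kubectl taint command from this step for each node name given; the function name is hypothetical and the node names in the example are placeholders.

```shell
# taint_commands NODE...: prints one kubectl taint command per node name.
# Nothing is executed; review the output before applying it.
taint_commands() {
  local node
  for node in "$@"; do
    echo "kubectl taint nodes $node CriticalAddonsOnly=true:NoSchedule"
  done
}

# Example:
#   taint_commands k3s-server-0 k3s-server-1 k3s-server-2        # review
#   taint_commands k3s-server-0 k3s-server-1 k3s-server-2 | sh   # apply
```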

Important: Control Plane Only Server nodes can be deployed with lower hardware specifications (2 cores, 4 GiB memory, 64 GiB disk) than the installer’s default minimum requirements. If a Control Plane Only Server node does not meet the Single-Node Lab configuration minimums (8 cores, 16 GiB memory, 128 GiB disk), you must set the SKIP_REQUIREMENTS_CHECK environment variable before running the installer or join command on that node:

# For the primary server node
export SKIP_REQUIREMENTS_CHECK=1
/mnt/esb3027/install

# For additional Control Plane Only Server nodes
export SKIP_REQUIREMENTS_CHECK=1
/mnt/esb3027/join-server https://<primary-server-ip>:6443 <node-token>

Note: This applies to Server nodes only. Agent nodes have separate minimum requirements.
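The decision of whether the override is needed can be sketched as a simple comparison against the lab minimums quoted above. This is an illustration only: the function name is hypothetical, the three figures are read here as CPU cores, memory, and disk, and the installer's actual check may measure resources differently.

```shell
# below_lab_minimums CORES MEM_GIB DISK_GIB
# Succeeds when any value is under the lab minimums (8 cores, 16 GiB, 128 GiB),
# which is the case where SKIP_REQUIREMENTS_CHECK=1 would be needed.
below_lab_minimums() {
  local cores="$1" mem_gib="$2" disk_gib="$3"
  [ "$cores" -lt 8 ] || [ "$mem_gib" -lt 16 ] || [ "$disk_gib" -lt 128 ]
}

# Example for a Control Plane Only node with 2 cores, 4 GiB, 64 GiB:
#   if below_lab_minimums 2 4 64; then export SKIP_REQUIREMENTS_CHECK=1; fi
```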

Step 6: Join Agent Nodes (Optional)

On each agent node:

  1. Mount the ISO:

    mkdir -p /mnt/esb3027
    mount -o loop,ro esb3027-acd-manager-X.Y.Z.iso /mnt/esb3027
    
  2. Join the cluster as an agent:

Segregated Networks: If your node has multiple network interfaces, set the INSTALL_K3S_EXEC environment variable with the complete segregated network configuration before running the join script (see Segregated Network Configuration):

export INSTALL_K3S_EXEC="agent --node-ip=<ETH1_IP> --node-external-ip=<ETH0_IP> --flannel-iface=eth1"

Replace <ETH1_IP> with the internal/private IP address and <ETH0_IP> with the public IP address.

If your node has only a single network interface, do not set INSTALL_K3S_EXEC. K3s will use the default interface automatically.

Run the join script:

/mnt/esb3027/join-agent https://<primary-server-ip>:6443 <node-token>

Note for Segregated Networks: When joining nodes in a segregated network environment, ensure the <primary-server-ip> used in the join command is the internal/private IP address (the eth1 address) of the primary server. Using the external IP will cause the join attempt to fail as the service will be listening on the private interface.

  3. Verify the node joined successfully from an existing server node:

    kubectl get nodes

Agent nodes provide additional workload capacity but do not participate in the control plane quorum.

Step 7: Verify Cluster Status

After all nodes are joined, verify the cluster is operational:

1. Verify all nodes are ready:

kubectl get nodes

Expected output:

NAME                 STATUS   ROLES                       AGE   VERSION
k3s-server-0         Ready    control-plane,etcd,master   5m    v1.33.4+k3s1
k3s-server-1         Ready    control-plane,etcd,master   3m    v1.33.4+k3s1
k3s-server-2         Ready    control-plane,etcd,master   2m    v1.33.4+k3s1
k3s-agent-1          Ready    <none>                      1m    v1.33.4+k3s1
k3s-agent-2          Ready    <none>                      1m    v1.33.4+k3s1

2. Verify system pods in both namespaces are running:

# Check kube-system namespace (Kubernetes core components)
kubectl get pods -n kube-system

# Check longhorn-system namespace (distributed storage)
kubectl get pods -n longhorn-system

All pods should show Running status. If any pods are still Pending or ContainerCreating, wait until they are ready.

This verification confirms:

  • K3s cluster is operational across all nodes
  • Longhorn distributed storage is running
  • Cloudnative PG operator is deployed
  • All core components are healthy before proceeding to application deployment
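The pod check above can be reduced to a filter that prints only the pods still needing attention. This is a sketch under two assumptions that go slightly beyond the text: pods in Completed status (finished job pods) are treated as healthy, and a Running pod whose READY count is incomplete (for example 0/1) is reported. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Print "<pod> <ready> <status>" for pods that are not yet settled.
# Reads `kubectl get pods --no-headers` rows on stdin (NAME READY STATUS ...).
pods_not_settled() {
  awk 'NF >= 3 {
    if ($3 == "Completed") next          # finished job pods are fine
    split($2, r, "/")                    # READY column, e.g. "1/1"
    if ($3 != "Running" || r[1] != r[2]) print $1, $2, $3
  }'
}

# Check both system namespaces named in this step (requires kubectl).
check_system_pods() {
  local ns
  for ns in kube-system longhorn-system; do
    kubectl get pods -n "$ns" --no-headers | pods_not_settled
  done
}
```

Calling check_system_pods on a server node prints nothing once both namespaces are healthy.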

Step 9: Air-Gapped Deployments (If Applicable)

If deploying in an air-gapped environment, on each node:

mkdir -p /mnt/esb3027-extras
mount -o loop,ro esb3027-acd-manager-extras-X.Y.Z.iso /mnt/esb3027-extras
/mnt/esb3027-extras/load-images

Step 10: Deploy the Manager Helm Chart

For complete instructions on deploying the CDN Manager Helm chart, including configuration file setup, MaxMind GeoIP database loading, TLS certificate configuration, deployment commands, and verification steps, see the Helm Chart Installation Guide.

That guide covers the common deployment steps that apply to all installation types. After completing the Helm chart installation steps, proceed to Post-Installation below.

Step 15: Configure DNS (Optional)

Add DNS records for the manager hostname. For high availability, configure multiple A records pointing to different server nodes:

manager.example.com.  IN  A  <server-1-ip>
manager.example.com.  IN  A  <server-2-ip>
manager.example.com.  IN  A  <server-3-ip>

Alternatively, configure a load balancer to distribute traffic across nodes.
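Once the records are published, it can be worth confirming that the hostname resolves to more than one address. The sketch below counts distinct IPv4 addresses from resolver output; it assumes a tool such as dig (from the bind-utils package) is available on the machine where the check is run, and count_a_records is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Count distinct IPv4 addresses read from stdin, one address per line,
# as printed by `dig +short A <hostname>`. Non-address lines are ignored.
count_a_records() {
  grep -E '^[0-9]{1,3}(\.[0-9]{1,3}){3}$' | sort -u | wc -l | tr -d ' '
}

# Usage (hostname is the example from this guide):
#   dig +short A manager.example.com | count_a_records
```

A result lower than the number of server nodes you published suggests a missing or duplicated record.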

Post-Installation

After installation completes, proceed to the Next Steps guide for:

  • Initial user configuration
  • Accessing the web interfaces
  • Configuring authentication
  • Setting up monitoring

Accessing the System

Refer to the Accessing the System section in the Getting Started guide for service URLs and default credentials.

Note: A self-signed SSL certificate is deployed by default. For production deployments, configure a valid SSL certificate before exposing the system to users.

High Availability Considerations

Pod Distribution

The Helm chart configures pod anti-affinity rules to ensure:

  • Kafka controllers are scheduled on separate nodes
  • PostgreSQL cluster members are distributed across nodes
  • Application pods are spread across available nodes

Data Replication and Failure Tolerance

For detailed information on data replication strategies and failure scenario tolerance, refer to the Architecture Guide and System Requirements Guide.

Troubleshooting

If pods fail to start or nodes fail to join:

  1. Check node status: kubectl get nodes
  2. Describe problematic pods: kubectl describe pod <pod-name>
  3. Review logs: kubectl logs <pod-name>
  4. Check cluster events: kubectl get events --sort-by='.lastTimestamp'

Nodes Ready but Workloads Cannot Reach the API Server (Segregated Networks)

Symptom: All nodes show Ready status, but cluster components (kubelet, controller-manager, scheduler) or workloads fail to communicate with the API server. Pods in kube-system or longhorn-system may fail to start or remain in a crash loop.

Cause: The server-node INSTALL_K3S_EXEC omitted --advertise-address. When --node-external-ip is set without --advertise-address, K3s defaults the API server’s advertise address to the external IP (eth0). In a segregated-network topology where nodes cannot route to each other over eth0, the kubernetes service endpoint then registers an unreachable address.

Diagnostic check:

kubectl get endpoints kubernetes -n default

If the IP shown is the eth0 (external) address rather than the eth1 (internal) address, the cluster was installed without --advertise-address.
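The comparison can be scripted so that the result is unambiguous. This is a sketch, not part of the product tooling: endpoint_uses_internal_ip is a hypothetical helper, the addresses are placeholders, and the jsonpath expression in the usage comment assumes the standard layout of an Endpoints object.

```shell
#!/usr/bin/env bash
# Succeed if the expected internal (eth1) IP appears among the endpoint IPs.
# $1 = expected internal IP of this server node; remaining args = endpoint IPs.
# Returns 0 on match, 1 on no match, 2 if no endpoint IPs were supplied.
endpoint_uses_internal_ip() {
  local expected="$1" ip
  shift
  if [ "$#" -eq 0 ]; then
    return 2
  fi
  for ip in "$@"; do
    if [ "$ip" = "$expected" ]; then
      return 0
    fi
  done
  return 1
}

# Usage on a server node (10.0.0.10 is a placeholder for its eth1 address):
#   ips="$(kubectl get endpoints kubernetes -n default \
#            -o jsonpath='{.subsets[*].addresses[*].ip}')"
#   endpoint_uses_internal_ip 10.0.0.10 $ips && echo "advertise address OK"
```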

Remediation: The kubernetes service endpoint cannot be corrected by reconfiguration alone. K3s must be reinstalled on all server nodes with the correct flags:

export INSTALL_K3S_EXEC="server --node-ip=<ETH1_IP> --node-external-ip=<ETH0_IP> --flannel-iface=eth1 --advertise-address=<ETH1_IP>"

After reinstallation, re-run the diagnostic check to confirm the endpoint IP is now the eth1 (internal) address.

See the Troubleshooting Guide for additional assistance.

Next Steps

After successful installation:

  1. Next Steps Guide - Post-installation configuration
  2. Configuration Guide - System configuration
  3. Operations Guide - Day-to-day operations

5.4 - Air-Gapped Deployment

Installation procedures for air-gapped environments

Overview

This guide describes the installation of the AgileTV CDN Manager in air-gapped environments (no internet access). Air-gapped deployments require additional preparation compared to connected deployments.

Key differences from connected deployments:

  • Both Installation ISO and Extras ISO are required on all nodes
  • OS installation ISO must be mounted on all nodes for package access
  • Container images must be loaded from the Extras ISO on each node
  • Additional firewall considerations for OS package repositories

Prerequisites

Required ISOs

Before beginning installation, obtain the following:

ISO                  Filename                               Purpose
Installation ISO     esb3027-acd-manager-X.Y.Z.iso          Kubernetes cluster and Manager application
Extras ISO           esb3027-acd-manager-extras-X.Y.Z.iso   Container images for air-gapped environments
OS Installation ISO  RHEL 9 or compatible clone             Operating system packages (required on all nodes)

Hardware Requirements

Refer to the System Requirements Guide for hardware specifications.

  • Single-Node (Lab): Minimum 8 cores, 16 GiB RAM, 128 GiB disk
  • Multi-Node (Production): Minimum 3 Server nodes for high availability

Network Configuration

Air-gapped environments may have internal network mirrors for OS packages. If no internal mirror exists, the OS installation ISO must be mounted on each node to provide packages during installation.

Ensure that required firewall ports are configured before installation. See the Networking Guide for complete firewall configuration requirements.

SELinux

If SELinux is to be used, it must be set to “Enforcing” mode before running the installer script. The installer will configure appropriate SELinux policies automatically. SELinux cannot be enabled after installation.

Installation Steps

Step 1: Prepare All Nodes

On each node (primary server, additional servers, and agents):

  1. Mount the OS installation ISO:
mkdir -p /mnt/os
mount -o loop,ro /path/to/rhel-9.iso /mnt/os
  2. Configure local repository (if no internal mirror):
cat > /etc/yum.repos.d/local.repo <<EOF
[local]
name=Local OS Repository
baseurl=file:///mnt/os/BaseOS
enabled=1
gpgcheck=0
EOF

# Also configure AppStream if needed
cat >> /etc/yum.repos.d/local.repo <<EOF

[appstream]
name=AppStream Repository
baseurl=file:///mnt/os/AppStream
enabled=1
gpgcheck=0
EOF
  3. Verify repository is accessible:
dnf repolist
dnf makecache
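Because the same repository file is needed on every node, the two heredocs can be wrapped in one function. The sketch below is an illustration only: write_local_repo is a hypothetical helper, and its defaults simply mirror the mount point and file name used in this step.

```shell
#!/usr/bin/env bash
# Write a dnf repository file for an OS ISO mounted at $1 into the file $2.
# Defaults match the paths used in this guide.
write_local_repo() {
  local mount_point="${1:-/mnt/os}"
  local repo_file="${2:-/etc/yum.repos.d/local.repo}"
  cat > "$repo_file" <<EOF
[local]
name=Local OS Repository
baseurl=file://${mount_point}/BaseOS
enabled=1
gpgcheck=0

[appstream]
name=AppStream Repository
baseurl=file://${mount_point}/AppStream
enabled=1
gpgcheck=0
EOF
}

# Usage (as root, after mounting the OS ISO):
#   write_local_repo /mnt/os && dnf makecache
```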

Step 2: Prepare the Primary Server Node

Mount the installation ISOs on the primary server node:

# Mount Installation ISO
mkdir -p /mnt/esb3027
mount -o loop,ro esb3027-acd-manager-X.Y.Z.iso /mnt/esb3027

# Mount Extras ISO
mkdir -p /mnt/esb3027-extras
mount -o loop,ro esb3027-acd-manager-extras-X.Y.Z.iso /mnt/esb3027-extras

Step 3: Install the Base Cluster on Primary Server

Run the installer to set up the K3s Kubernetes cluster:

/mnt/esb3027/install

This installs:

  • K3s Kubernetes distribution
  • Longhorn distributed storage
  • Cloudnative PG operator for PostgreSQL
  • Base system dependencies

Important: After the installer completes, verify that all system pods in both namespaces are in the Running state before proceeding:

# Check kube-system namespace (Kubernetes core components)
kubectl get pods -n kube-system

# Check longhorn-system namespace (distributed storage)
kubectl get pods -n longhorn-system

All pods should show Running status. If any pods are still Pending or ContainerCreating, wait until they are ready. Proceeding with incomplete system pods can cause subsequent steps to fail in unpredictable ways.

This verification confirms:

  • K3s cluster is operational
  • Longhorn distributed storage is running
  • Cloudnative PG operator is deployed
  • All core components are healthy before continuing

Step 4: Retrieve the Node Token

Retrieve the node token for joining additional nodes:

cat /var/lib/rancher/k3s/server/node-token

Save this token for use on additional nodes. Also note the IP address of the primary server node.

Step 5: Join Additional Server Nodes (Multi-Node Only)

On each additional server node:

  1. Mount the OS ISO:
mkdir -p /mnt/os
mount -o loop,ro /path/to/rhel-9.iso /mnt/os

# Configure local repository
cat > /etc/yum.repos.d/local.repo <<EOF
[local]
name=Local OS Repository
baseurl=file:///mnt/os/BaseOS
enabled=1
gpgcheck=0

[appstream]
name=AppStream Repository
baseurl=file:///mnt/os/AppStream
enabled=1
gpgcheck=0
EOF

dnf makecache
  2. Mount the Installation ISOs:
# Mount Installation ISO
mkdir -p /mnt/esb3027
mount -o loop,ro esb3027-acd-manager-X.Y.Z.iso /mnt/esb3027

# Mount Extras ISO
mkdir -p /mnt/esb3027-extras
mount -o loop,ro esb3027-acd-manager-extras-X.Y.Z.iso /mnt/esb3027-extras
  3. Join the cluster by running the join script:

/mnt/esb3027/join-server https://<primary-server-ip>:6443 <node-token>

Replace <primary-server-ip> with the IP address of the primary server and <node-token> with the token retrieved in Step 4.

  4. Verify the node joined successfully:
kubectl get nodes

Repeat for each server node. A minimum of 3 server nodes is required for high availability.

Step 6: Join Agent Nodes (Optional)

On each agent node:

  1. Mount the OS ISO:
mkdir -p /mnt/os
mount -o loop,ro /path/to/rhel-9.iso /mnt/os

# Configure local repository
cat > /etc/yum.repos.d/local.repo <<EOF
[local]
name=Local OS Repository
baseurl=file:///mnt/os/BaseOS
enabled=1
gpgcheck=0

[appstream]
name=AppStream Repository
baseurl=file:///mnt/os/AppStream
enabled=1
gpgcheck=0
EOF

dnf makecache
  2. Mount the Installation ISOs:
# Mount Installation ISO
mkdir -p /mnt/esb3027
mount -o loop,ro esb3027-acd-manager-X.Y.Z.iso /mnt/esb3027

# Mount Extras ISO
mkdir -p /mnt/esb3027-extras
mount -o loop,ro esb3027-acd-manager-extras-X.Y.Z.iso /mnt/esb3027-extras
  3. Join the cluster as an agent by running the join script:

/mnt/esb3027/join-agent https://<primary-server-ip>:6443 <node-token>
  4. Verify the node joined successfully from an existing server node:
kubectl get nodes

Agent nodes provide additional workload capacity but do not participate in the control plane quorum.

Step 7: Load Container Images

On each node in the cluster:

/mnt/esb3027-extras/load-images

This script loads all container images from the Extras ISO into the local container runtime.

Important: This step must be performed on every node (primary server, additional servers, and agents) before deploying the Manager application.

Step 8: Verify Cluster Status

After all nodes are joined and images are loaded, verify the cluster is operational:

1. Verify all nodes are ready:

kubectl get nodes

Expected output:

NAME                 STATUS   ROLES                       AGE   VERSION
k3s-server-0         Ready    control-plane,etcd,master   5m    v1.33.4+k3s1
k3s-server-1         Ready    control-plane,etcd,master   3m    v1.33.4+k3s1
k3s-server-2         Ready    control-plane,etcd,master   2m    v1.33.4+k3s1
k3s-agent-1          Ready    <none>                      1m    v1.33.4+k3s1

2. Verify system pods in both namespaces are running:

# Check kube-system namespace (Kubernetes core components)
kubectl get pods -n kube-system

# Check longhorn-system namespace (distributed storage)
kubectl get pods -n longhorn-system

All pods should show Running status.

3. Verify container images are loaded:

crictl images | grep acd-manager

Step 9: Deploy the Manager Helm Chart

For complete instructions on deploying the CDN Manager Helm chart, including configuration file setup, MaxMind GeoIP database loading, TLS certificate configuration, deployment commands, and verification steps, see the Helm Chart Installation Guide.

That guide covers the common deployment steps that apply to all installation types. After completing the Helm chart installation steps, proceed to Post-Installation below.

Post-Installation

After installation completes, proceed to the Next Steps guide for:

  • Initial user configuration
  • Accessing the web interfaces
  • Configuring authentication
  • Setting up monitoring

Accessing the System

Refer to the Accessing the System section in the Getting Started guide for service URLs and default credentials.

Note: A self-signed SSL certificate is deployed by default. You will need to accept the certificate warning in your browser.

Updating MaxMind GeoIP Databases

If using GeoIP-based routing, load the MaxMind databases:

/mnt/esb3027/generate-maxmind-volume

The utility will prompt for the database file locations and volume name. Reference the volume in your values.yaml:

manager:
  maxmindDbVolume: maxmind-geoip-2026-04

See the Operations Guide for database update procedures.

Troubleshooting

Image Pull Errors

If pods fail with image pull errors:

  1. Verify the load-images script completed successfully on all nodes
  2. Check container runtime image list:
    crictl images | grep <image-name>
    
  3. Ensure image tags in Helm chart match tags on the Extras ISO

OS Package Errors

If the installer reports missing OS packages:

  1. Verify OS ISO is mounted on the affected node
  2. Check repository configuration:
    dnf repolist
    dnf info <package-name>
    
  3. Ensure the ISO matches the installed OS version

Longhorn Volume Issues

If Longhorn volumes fail to mount:

  1. Verify the load-images script completed successfully on all nodes
  2. Check Longhorn system pods:
    kubectl get pods -n longhorn-system
    
  3. Review Longhorn UI via port-forward:
    kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80
    

Next Steps

After successful installation:

  1. Next Steps Guide - Post-installation configuration
  2. Configuration Guide - System configuration
  3. Operations Guide - Day-to-day operational procedures
  4. Troubleshooting Guide - Common issues and resolution

5.5 - Helm Chart Installation

Common procedure for deploying the CDN Manager Helm chart across all deployment types

Overview

This guide covers the common steps for deploying the CDN Manager Helm chart. These steps apply to all deployment types (single-node, multi-node, and air-gapped) after the Kubernetes cluster is fully operational.

Prerequisites: This guide assumes the Kubernetes cluster is already installed and all system pods are running. If you haven’t installed the cluster yet, refer to the Installation Guide for your deployment type.

Prerequisites

Before proceeding, verify the following:

  • Cluster operational: All nodes show Ready status
  • System pods running: All pods in kube-system and longhorn-system namespaces are Running
  • ISO mounted: Installation ISO is mounted at /mnt/esb3027
  • Extras ISO mounted (air-gapped only): Extras ISO is mounted at /mnt/esb3027-extras and images are loaded on all nodes

Step 1: Create Configuration File

The installation ISO includes environment-specific configuration files as the recommended starting points. Choose the file that matches your deployment type:

Deployment             Starting file                         Copy command
Single-node lab        /mnt/esb3027/values-lab.yaml          cp /mnt/esb3027/values-lab.yaml ~/values.yaml
Multi-node production  /mnt/esb3027/values-production.yaml   cp /mnt/esb3027/values-production.yaml ~/values.yaml

After copying, edit ~/values.yaml for your environment. The two fields that must be updated in either file are:

global:
  hosts:
    manager:
      - host: manager.example.com   # Your manager hostname

zitadel:
  zitadel:
    configmapConfig:
      ExternalDomain: manager.example.com   # Must match global.hosts.manager[0].host exactly

Important: global.hosts.manager[0].host and zitadel.zitadel.configmapConfig.ExternalDomain must match exactly or authentication will fail due to CORS policy violations.
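A quick pre-flight comparison of the two values can catch a mismatch before deployment. The sketch below is deliberately naive: it is plain text matching, not a YAML parser, and it assumes the first "- host:" entry in the file is the manager host and that neither value is quoted. The function name check_domains_match is hypothetical.

```shell
#!/usr/bin/env bash
# Compare the first "- host:" value with the first "ExternalDomain:" value
# in a values file. Returns 0 on match, 1 on mismatch, 2 if either is missing.
check_domains_match() {
  local file="${1:-$HOME/values.yaml}" host domain
  host="$(awk '$1 == "-" && $2 == "host:" { print $3; exit }' "$file")"
  domain="$(awk '$1 == "ExternalDomain:" { print $2; exit }' "$file")"
  if [ -z "$host" ] || [ -z "$domain" ]; then
    echo "could not find both values in $file" >&2
    return 2
  fi
  if [ "$host" != "$domain" ]; then
    echo "mismatch: host=$host ExternalDomain=$domain" >&2
    return 1
  fi
  echo "match: $host"
}

# Usage:
#   check_domains_match ~/values.yaml
```

If the configuration is split across several files, run the check against the file that contains both keys, or verify the merged values by hand.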

For a full description of what each file configures and the complete list of required changes per environment, see the Configuration Guide.

Complete reference: /mnt/esb3027/values.yaml documents every available option with its default value. Use this as a reference, not as a starting point for your configuration.

Split configuration files: For better organisation, split your configuration into multiple files and specify them with repeated --values flags. Later files override earlier files:

helm install acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml \
  --values ~/values-tls.yaml

Step 2: Load MaxMind GeoIP Databases (Optional)

If you plan to use GeoIP-based routing or validation features, load the MaxMind GeoIP databases. The following databases are used by the manager:

  • GeoIP2-City.mmdb - The City Database
  • GeoLite2-ASN.mmdb - The ASN Database
  • GeoIP2-Anonymous-IP.mmdb - The VPN and Anonymous IP Database

Create the Kubernetes volume using the helper utility:

/mnt/esb3027/generate-maxmind-volume

The utility will prompt for:

  1. Location of GeoIP2-City.mmdb
  2. Location of GeoLite2-ASN.mmdb
  3. Location of GeoIP2-Anonymous-IP.mmdb
  4. Name of the volume

After running this command, reference the volume in your configuration file:

manager:
  maxmindDbVolume: maxmind-db-volume

Replace maxmind-db-volume with the volume name you specified when running the utility.

Tip: When naming the volume, include a revision number or date (e.g., maxmind-db-volume-2026-04 or maxmind-db-volume-v2). This simplifies future updates: create a new volume with an updated name, update the values.yaml to reference the new volume, and delete the old volume after verification.
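One way to follow the naming tip consistently is to derive the name from the current date. This is a small sketch; maxmind_volume_name is a hypothetical helper, and the prefix is the example name used in this step.

```shell
#!/usr/bin/env bash
# Print a volume name with a year-month suffix, e.g. <prefix>-YYYY-MM.
# $1 = name prefix (default: maxmind-db-volume).
maxmind_volume_name() {
  local prefix="${1:-maxmind-db-volume}"
  printf '%s-%s\n' "$prefix" "$(date +%Y-%m)"
}

# Usage: enter the printed name when generate-maxmind-volume prompts for the
# volume name, then set the same value as manager.maxmindDbVolume.
#   maxmind_volume_name
```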

Step 3: Configure TLS Certificates (Optional)

For production deployments, configure a valid TLS certificate from a trusted Certificate Authority (CA). A self-signed certificate is deployed by default if no certificate is provided.

Method 1: Create TLS Secret Manually

Create a Kubernetes TLS secret with your certificate and key:

kubectl create secret tls acd-manager-tls --cert=tls.crt --key=tls.key

Method 2: Helm-Managed Secret

Add the certificate directly to your values.yaml:

ingress:
  secrets:
    acd-manager-tls: |
      -----BEGIN CERTIFICATE-----
      ...
      -----END CERTIFICATE-----
  tls:
    - hosts:
        - manager.example.com
      secretName: acd-manager-tls

Configuring All Ingress Controllers

All ingress controllers must be configured with the same certificate secret and hostname:

ingress:
  hostname: manager.example.com
  tls: true
  secretName: acd-manager-tls

zitadel:
  ingress:
    tls:
      - hosts:
          - manager.example.com
        secretName: acd-manager-tls

confd:
  ingress:
    hostname: manager.example.com
    tls: true
    secretName: acd-manager-tls

mib-frontend:
  ingress:
    hostname: manager.example.com
    tls: true
    secretName: acd-manager-tls

Important: The hostname must match the first entry in global.hosts.manager for Zitadel CORS compatibility. The secret name has a maximum length of 53 characters.

Step 4: Deploy the Manager Helm Chart

Deploy the CDN Manager application:

helm install acd-manager /mnt/esb3027/charts/acd-manager --values ~/values.yaml

Real-time output: By default, helm install runs silently until completion. To see real-time output during deployment, add the --debug flag:

helm install acd-manager /mnt/esb3027/charts/acd-manager --values ~/values.yaml --debug

Monitor deployment:

kubectl get pods --watch

Wait for all pods to show Running status before proceeding.

Timeout handling: The default Helm timeout is 5 minutes. If the installation fails due to a rollout timeout, retry with a larger timeout value:

helm install acd-manager /mnt/esb3027/charts/acd-manager --values ~/values.yaml --timeout 10m

Retry failed installation: If a previous installation attempt failed and you receive an error that the release name is already in use, uninstall the previous release before retrying:

helm uninstall acd-manager
helm install acd-manager /mnt/esb3027/charts/acd-manager --values ~/values.yaml

Step 5: Verify Deployment

Verify all application pods are running:

kubectl get pods

Expected Output: Single-Node

NAME                                              READY   STATUS      RESTARTS   AGE
acd-manager-5b98d569d9-abc12                      1/1     Running     0          3m
acd-manager-confd-6fb78548c4-xnrh4                1/1     Running     0          3m
acd-manager-gateway-8bc8446fc-chs26               1/1     Running     0          3m
acd-manager-kafka-controller-0                    2/2     Running     0          3m
acd-manager-metrics-aggregator-76d96c4964-lwdcj   1/1     Running     0          3m
acd-manager-mib-frontend-7bdb69684b-6qxn8         1/1     Running     0          3m
acd-manager-postgresql-0                          1/1     Running     0          3m
acd-manager-redis-master-0                        2/2     Running     0          3m
acd-manager-redis-replicas-0                      2/2     Running     0          3m
acd-manager-selection-input-5fb694b857-qxt67      1/1     Running     0          3m
acd-manager-zitadel-8448b4c4fc-2pkd8              1/1     Running     0          3m
acd-manager-zitadel-init-hh6j7                    0/1     Completed   0          4m
acd-manager-zitadel-setup-nwp8k                   0/2     Completed   0          4m
alertmanager-0                                    1/1     Running     0          3m
grafana-6d948cfdc6-77ggk                          1/1     Running     0          3m
victoria-metrics-agent-dc87df588-tn8wv            1/1     Running     0          3m
victoria-metrics-alert-757c44c58f-kk9lp           1/1     Running     0          3m
victoria-metrics-longterm-server-0                1/1     Running     0          3m
victoria-metrics-server-0                         1/1     Running     0          3m

Expected Output: Multi-Node

NAME                                              READY   STATUS      RESTARTS        AGE
acd-cluster-postgresql-1                          1/1     Running     0               11m
acd-cluster-postgresql-2                          1/1     Running     0               11m
acd-cluster-postgresql-3                          1/1     Running     0               10m
acd-manager-5b98d569d9-2pbph                      1/1     Running     0               3m
acd-manager-5b98d569d9-m54f9                      1/1     Running     0               3m
acd-manager-5b98d569d9-pq26f                      1/1     Running     0               3m
acd-manager-confd-6fb78548c4-xnrh4                1/1     Running     0               3m
acd-manager-gateway-8bc8446fc-chs26               1/1     Running     0               3m
acd-manager-gateway-8bc8446fc-wzrml               1/1     Running     0               3m
acd-manager-kafka-controller-0                    2/2     Running     0               3m
acd-manager-kafka-controller-1                    2/2     Running     0               3m
acd-manager-kafka-controller-2                    2/2     Running     0               3m
acd-manager-metrics-aggregator-76d96c4964-lwdcj   1/1     Running     2               3m
acd-manager-mib-frontend-7bdb69684b-6qxn8         1/1     Running     0               3m
acd-manager-mib-frontend-7bdb69684b-pkjrw         1/1     Running     0               3m
acd-manager-redis-master-0                        2/2     Running     0               3m
acd-manager-redis-replicas-0                      2/2     Running     0               3m
acd-manager-selection-input-5fb694b857-qxt67      1/1     Running     2               3m
acd-manager-zitadel-8448b4c4fc-2pkd8              1/1     Running     0               3m
acd-manager-zitadel-8448b4c4fc-vchp9              1/1     Running     0               3m
acd-manager-zitadel-init-hh6j7                    0/1     Completed   0               4m
acd-manager-zitadel-setup-nwp8k                   0/2     Completed   0               4m
alertmanager-0                                    1/1     Running     0               3m
grafana-6d948cfdc6-77ggk                          1/1     Running     0               3m
telegraf-54779f5f46-2jfj5                         1/1     Running     0               3m
victoria-metrics-agent-dc87df588-tn8wv            1/1     Running     0               3m
victoria-metrics-alert-757c44c58f-kk9lp           1/1     Running     0               3m
victoria-metrics-longterm-server-0                1/1     Running     0               3m
victoria-metrics-server-0                         1/1     Running     0               3m

Pod Distribution Verification

Verify pods are distributed across nodes:

kubectl get pods -o wide

Expected Behavior

  • Init pods (such as zitadel-init and zitadel-setup) will show Completed status after successful initialization. This is expected behavior.
  • Multi-node deployments: Some pods may enter CrashLoopBackOff state during initial deployment depending on the timing of other containers starting up. This is expected behavior as some services wait for dependencies (such as databases or Kafka) to become available. The deployment should stabilize automatically after a few minutes.
  • Restart counts: Some pods may show restart counts as they wait for dependencies to become available. This is normal during initial deployment.
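Since transient states are expected, a bounded polling loop is often more convenient than watching the pod list by hand. The sketch below polls any command that prints rows in the `kubectl get pods --no-headers` layout until every row is Running or Completed, or until a timeout expires. The function name wait_for_pods and the timeout values are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Poll until all pods are Running or Completed.
# $1 = timeout in seconds, $2 = poll interval in seconds, rest = command to run.
# Returns 0 once settled, 1 on timeout. Empty or failed output counts as unsettled.
wait_for_pods() {
  local timeout="$1" interval="$2"
  shift 2
  local waited=0 rows pending
  while :; do
    if rows="$("$@")" && [ -n "$rows" ]; then
      pending="$(printf '%s\n' "$rows" |
        awk 'NF >= 3 && $3 != "Running" && $3 != "Completed" { n++ } END { print n + 0 }')"
      if [ "$pending" -eq 0 ]; then
        return 0
      fi
    else
      pending="unknown"
    fi
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out; pods not settled: $pending" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}

# Usage (10-minute limit, 10-second interval):
#   wait_for_pods 600 10 kubectl get pods --no-headers
```

If the loop times out, fall back to kubectl describe pod and kubectl logs for the pods that have not settled.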

Next Steps

After successful deployment:

  1. Next Steps Guide - Post-installation configuration
  2. Getting Started Guide - Accessing the system
  3. Configuration Guide - System configuration
  4. Operations Guide - Day-to-day operations

5.6 - Upgrade Guide

Upgrading the CDN Manager to a newer version

Overview

This guide describes the procedure for upgrading the AgileTV CDN Manager (ESB3027) to a newer version. The upgrade process involves updating the Kubernetes cluster components and redeploying the Helm chart with the new version.

Prerequisites

Backup Requirements

Before beginning any upgrade, ensure you have:

  • PostgreSQL Backup: Verify recent backups are available via the Cloudnative PG operator
  • Configuration Backup: Save your current values.yaml file(s)
  • TLS Certificates: Ensure certificate files are backed up
  • MaxMind Volumes: Note the current volume names if using GeoIP databases

Version Compatibility

Review the Release Notes for the target version to check for:

  • Breaking changes requiring manual intervention
  • Required intermediate upgrade steps
  • New configuration options that should be set

Cluster Health

Verify the cluster is healthy before upgrading:

kubectl get nodes
kubectl get pods
kubectl get pvc

All nodes should show Ready status and all pods should be Running (or Completed for job pods).
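The three health checks can share one small filter that prints only the rows needing attention. This is a sketch under the assumption that the default column layout of kubectl get is in use (STATUS in column 2 for nodes and PVCs, column 3 for pods); rows_not_matching is a hypothetical helper, and treating Bound as the only healthy PVC status is an assumption rather than something stated above.

```shell
#!/usr/bin/env bash
# Print rows whose given column does not fully match an extended regex.
# $1 = 1-based column number, $2 = regex of acceptable values; rows on stdin.
rows_not_matching() {
  awk -v col="$1" -v ok="$2" 'NF >= col && $col !~ ("^(" ok ")$") { print }'
}

# Pre-upgrade usage on the primary server node; no output means healthy:
#   kubectl get nodes --no-headers | rows_not_matching 2 'Ready'
#   kubectl get pods  --no-headers | rows_not_matching 3 'Running|Completed'
#   kubectl get pvc   --no-headers | rows_not_matching 2 'Bound'
```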

Upgrade Methods

There are three upgrade methods available. Choose the one that best fits your situation:

Method           Downtime  Use Case
Rolling Upgrade  Minimal   Patch releases; minor version upgrades; configuration updates
Clean Upgrade    Brief     Major version upgrades; component changes; troubleshooting
Full Reinstall   Extended  Cluster rebuilds; troubleshooting persistent issues; ensuring clean state

Method Selection Guidance:

  • Rolling Upgrade (Method 1) is the default choice for most upgrades. Use this for patch releases (e.g., 1.6.0 → 1.6.1) and for minor version upgrades (e.g., 1.4.0 → 1.6.0) where no breaking changes are documented. This method preserves all existing resources and performs an in-place update. Note: If the upgrade fails, this method allows quick recovery to the previous release with helm rollback.

  • Clean Upgrade (Method 2) is recommended for major version upgrades (e.g., 1.x → 2.x) or when the release notes indicate significant component changes. This method ensures all resources are recreated with the new version, avoiding potential issues with stale configurations. Also use this method when troubleshooting upgrade failures from Method 1.

  • Full Reinstall (Method 3) should only be used when a completely clean cluster state is required. This includes troubleshooting persistent cluster-level issues, recovering from failed upgrades that cannot be rolled back, or when migrating between significantly different deployment configurations. This method requires verified backups and should be planned for extended downtime.
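As a first approximation, the guidance above can be expressed as a comparison of major version numbers. The sketch below looks at version strings only and assumes the X.Y.Z format; it cannot see breaking changes documented in the release notes and never suggests Method 3, so treat its output as a starting point. The function name suggest_upgrade_method is hypothetical.

```shell
#!/usr/bin/env bash
# Suggest an upgrade method from the current and target versions (X.Y.Z).
# Prints "rolling" (Method 1) or "clean" (Method 2); returns 2 on bad input.
suggest_upgrade_method() {
  local current="$1" target="$2"
  local cur_major="${current%%.*}" tgt_major="${target%%.*}" v
  for v in "$cur_major" "$tgt_major"; do
    case "$v" in
      ''|*[!0-9]*)
        echo "usage: suggest_upgrade_method <current X.Y.Z> <target X.Y.Z>" >&2
        return 2
        ;;
    esac
  done
  if [ "$cur_major" -ne "$tgt_major" ]; then
    echo "clean"      # major version change: Method 2
  else
    echo "rolling"    # patch or minor version change: Method 1
  fi
}

# Usage:
#   suggest_upgrade_method 1.4.0 1.6.0
```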

Upgrade Steps

Method 1: Rolling Upgrade

This method performs an in-place rolling upgrade with minimal downtime. All upgrade commands are executed from the primary server node.

Step 1: Obtain the New Installation ISO

Unmount the old ISO (if mounted) and mount the new installation ISO:

umount /mnt/esb3027 2>/dev/null || true
mount -o loop,ro esb3027-acd-manager-X.Y.Z.iso /mnt/esb3027

Replace X.Y.Z with the target version number.

Step 2: Update Containers and Cluster Software

Run the installation script to update the container images and cluster software:

/mnt/esb3027/install

Wait for the script to complete.

Step 2b: Air-Gapped Environments (If Applicable)

If deploying in an air-gapped environment, also mount and load the extras ISO:

# Mount the Extras ISO
mkdir -p /mnt/esb3027-extras
mount -o loop,ro esb3027-acd-manager-extras-X.Y.Z.iso /mnt/esb3027-extras

# Load container images from the extras ISO
/mnt/esb3027-extras/load-images

Replace X.Y.Z with the target version number.

Step 4: Review and Update Configuration

Compare the default values.yaml from the new ISO with your current configuration:

diff /mnt/esb3027/values.yaml ~/values.yaml

Update your configuration file to include any new required settings. Common updates include:

# ~/values.yaml
global:
  hosts:
    manager:
      - host: manager.example.com
    routers:
      - name: director-1
        address: 192.0.2.1

zitadel:
  zitadel:
    configmapConfig:
      ExternalDomain: manager.example.com

# Add any new required settings for the target version

Important: Do not modify settings unrelated to the upgrade unless specifically documented in the release notes.

Step 5: Update MaxMind GeoIP Volumes (If Applicable)

If you use MaxMind GeoIP databases, use the utility from the new ISO to create an updated volume:

/mnt/esb3027/generate-maxmind-volume

Update your values.yaml to reference the new volume name:

manager:
  maxmindDbVolume: maxmind-geoip-2026-04

Tip: Using dated or versioned volume names (e.g., maxmind-geoip-2026-04) allows you to create new volumes during upgrades and delete old ones after verification.

Step 6: Update TLS Certificates (If Needed)

If your TLS certificates need renewal or the new version requires certificate updates, create or update the secret:

kubectl create secret tls acd-manager-tls --cert=tls.crt --key=tls.key --dry-run=client -o yaml | kubectl apply -f -

Step 7: Upgrade the Helm Release

Perform a Helm upgrade with the new chart:

helm upgrade acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml

Note: The upgrade performs a rolling update of each deployment in the chart. Deployments are upgraded one at a time, with pods being terminated and recreated sequentially. StatefulSets (PostgreSQL, Kafka, Redis) roll out one pod at a time to maintain data availability.

Monitor the upgrade progress:

kubectl get pods --watch

Wait for all pods to stabilize and show Running status before considering the upgrade complete. Some pods may temporarily enter CrashLoopBackOff during the transition as they wait for dependencies to become available.

Step 8: Verify the Upgrade

Check the deployed version:

helm list
kubectl get deployments -o wide

Verify application functionality:

  • Access the MIB Frontend and confirm it loads
  • Test API connectivity
  • Verify Grafana dashboards are accessible
  • Check that Zitadel authentication is working

Step 9: Clean Up

After confirming the upgrade is successful:

  1. Unmount the installation ISO (if no longer needed):

    umount /mnt/esb3027
    
  2. Delete old MaxMind volumes (if replaced):

    kubectl get pvc
    kubectl delete pvc <old-volume-name>
    
  3. Remove old configuration files if no longer needed.


Method 2: Clean Upgrade (Helm Uninstall/Install)

This method removes the existing Helm release before installing the new version. This is useful for major version upgrades or when troubleshooting upgrade issues. All upgrade commands are executed from the primary server node.

Warning: This method causes brief downtime as all resources are deleted before reinstallation.

Step 1: Obtain the New Installation ISO

Mount the new installation ISO:

umount /mnt/esb3027 2>/dev/null || true
mount -o loop,ro esb3027-acd-manager-X.Y.Z.iso /mnt/esb3027

Step 2: Backup Configuration

Save your current Helm values:

helm get values acd-manager -o yaml > ~/values-backup.yaml

Step 3: Uninstall the Existing Release

Remove the existing Helm release:

helm uninstall acd-manager

Wait for pods to terminate:

kubectl get pods --watch

Note: Helm uninstall does not remove PersistentVolumes (PVs) or PersistentVolumeClaims (PVCs). All data stored in PostgreSQL, Kafka, Redis, and Longhorn volumes is preserved during the uninstall process. When the new version is installed, it will reattach to the existing PVCs and restore data automatically.

Step 4: Review and Update Configuration

Compare the default values.yaml from the new ISO with your configuration:

diff /mnt/esb3027/values.yaml ~/values.yaml

Update your configuration file as needed.

Step 5: Install the New Release

Install the new version:

helm install acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml

Monitor the deployment:

kubectl get pods --watch

Wait for all pods to stabilize before proceeding.

Step 6: Verify the Upgrade

Verify the upgrade as described in Method 1, Step 8.

Method 3: Full Reinstall (Cluster Rebuild)

This method completely removes Kubernetes and reinstalls from scratch. Use only for cluster rebuilds or when other upgrade methods fail.

Warning: This method causes extended downtime and permanent data loss, and should be used only when necessary. The K3s uninstall process destroys all Longhorn PersistentVolumes (PVs) and PersistentVolumeClaims (PVCs). All data stored in PostgreSQL, Kafka, Redis, and application volumes will be permanently lost. Verified backups are required before proceeding.

Step 1: Stop Kubernetes Services

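On all nodes (server and agent), stop the K3s service. On agent nodes installed with the standard K3s installer, the unit is typically named k3s-agent rather than k3s: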

systemctl stop k3s

Step 2: Uninstall K3s (Server Nodes Only)

On the primary server node first, then each additional server node:

/usr/local/bin/k3s-uninstall.sh

Step 3: Clean Up Residual State (All Nodes)

On all nodes, remove residual state:

/usr/local/bin/k3s-kill-all.sh
rm -rf /var/lib/rancher/k3s/*

Warning: This removes all cluster data including Longhorn PersistentVolumes (PVs) and PersistentVolumeClaims (PVCs). All data stored in PostgreSQL, Kafka, Redis, and application volumes will be permanently lost. Ensure verified backups are available before proceeding.

Step 4: Reinstall K3s Cluster and Deploy Manager

Follow the installation procedure in the Installation Guide to reinstall the cluster and deploy the Helm chart. At this point the nodes are in the same state as before a fresh installation, so complete each stage in order:

  • Primary server installation
  • Additional server joins (if applicable)
  • Agent joins (if applicable)
  • Helm chart deployment

Note: The K3s node token is regenerated during reinstallation. Retrieve the new token from /var/lib/rancher/k3s/server/node-token on the primary server after installation if you need to join additional nodes.


Rollback Procedure

Rollback procedures vary by upgrade method:

Method 1 (Rolling Upgrade)

Use Helm’s built-in rollback command:

helm rollback acd-manager

With no revision number given, this reverts to the previous Helm release revision.

Alternatively, redeploy the previous version manually from the previous ISO (mounted here at /mnt/esb3027-old):

helm upgrade acd-manager /mnt/esb3027-old/helm/charts/acd-manager \
  --values ~/values.yaml

Note: If you use multiple --values files for organization, ensure they are specified in the same order as the original installation.

Method 2 (Clean Upgrade)

Reinstall the previous version:

helm uninstall acd-manager
helm install acd-manager /mnt/esb3027-old/helm/charts/acd-manager \
  --values ~/values-backup.yaml

Method 3 (Full Reinstall)

Rollback requires repeating the full cluster reinstall procedure using the old installation ISO. Follow Method 3 steps with the previous version’s ISO. Ensure verified backups are available before attempting.

Troubleshooting

Pods Fail to Start

  1. Check pod status and events:

    kubectl describe pod <pod-name>
    kubectl get events --sort-by='.lastTimestamp'
    
  2. Review pod logs:

    kubectl logs <pod-name>
    kubectl logs <pod-name> -p  # Previous instance logs
    

Database Migration Issues

If PostgreSQL migrations fail:

  1. Check CloudNativePG cluster status:

    kubectl get clusters
    kubectl describe cluster <cluster-name>
    
  2. Review migration job logs:

    kubectl get jobs
    kubectl logs job/<migration-job-name>
    

Helm Upgrade Fails

If helm upgrade fails:

  1. Check Helm release status:

    helm status acd-manager
    helm history acd-manager
    
  2. Review the error message for specific failures

  3. Attempt rollback if necessary

Post-Upgrade

After a successful upgrade:

  1. Review the Release Notes for any post-upgrade tasks
  2. Update monitoring dashboards if new metrics are available
  3. Test all critical functionality
  4. Document the upgrade in your change management system

Next Steps

After completing the upgrade:

  1. Next Steps Guide - Review post-installation tasks
  2. Operations Guide - Day-to-day operational procedures
  3. Release Notes - Review new features and changes

5.7 - Next Steps

Post-installation configuration tasks

Overview

After completing the installation of the AgileTV CDN Manager (ESB3027), several post-installation configuration tasks must be performed before the system is ready for production use. This guide walks you through the essential next steps.

Prerequisites

Before proceeding, ensure:

  • The CDN Manager Helm chart is successfully deployed
  • All pods are in Running status
  • You have network access to the cluster hostname or IP
  • You have the default credentials available

Step 1: Access Zitadel Console

The first step is to configure user authentication through Zitadel Identity and Access Management (IAM).

  1. Navigate to the Zitadel Console:

    https://<manager-host>/ui/console
    

    Replace <manager-host> with your configured hostname (e.g., manager.local or manager.example.com).

    Important: The <manager-host> must match the first entry in global.hosts.manager from your Helm values exactly. Zitadel uses name-based virtual hosting and CORS validation. If the hostname does not match, authentication will fail.

  2. Log in with the default administrator credentials (also listed in the Glossary):

    • Username: admin@agiletv.dev
    • Password: Password1!
  3. Important: If prompted to configure Multi-Factor Authentication (MFA), you must skip this step for now. MFA is not currently supported. Attempting to configure MFA may lock you out of the administrator account.

  4. Security Recommendation: After logging in, create a new administrator account with proper roles. Once verified, disable or delete the default admin@agiletv.dev account. For details on required roles and administrator permissions, see Zitadel’s Administrator Documentation.

Step 2: Configure SMTP Settings

Zitadel requires an SMTP server to send email notifications and perform email validations.

  1. In the Zitadel Console, navigate to Settings > Default Settings

  2. Configure the SMTP settings:

    • SMTP Host: Your mail server hostname
    • SMTP Port: Typically 587 (TLS) or 465 (SSL)
    • SMTP Username: Mail account username
    • SMTP Password: Mail account password
    • Sender Address: Email address for outgoing mail (e.g., noreply@example.com)
  3. Save the configuration

Note: Without SMTP configuration, email-based user validation and password recovery features will not function.

Step 3: Create Additional User Accounts

Create user accounts for operators and administrators:

Tip: For detailed guidance on managing users, roles, and permissions in the Zitadel Console, see Zitadel’s User Management Documentation.

  1. In the Zitadel Console, navigate to Users > Add User

  2. Fill in the user details:

    • Username: Unique username
    • First Name: User’s first name
    • Last Name: User’s last name
    • Email: User’s email address (this is their login username)

    Known Issue: Due to a limitation in this release of Zitadel, the username must match the local part (the portion before the @) of the email address. For example, if the email is foo@example.com, the username must be foo.

    If these do not match, Zitadel may allow login with the mismatched local part while blocking the full email address. For instance, if username is foo but email is foo.bar@example.com, login with foo@example.com may succeed while foo.bar@example.com is blocked.

    Workaround: Always ensure the username matches the email local part exactly.

  3. Important: The following options must be configured:

    • Email Verified: Check this box to skip email verification
    • Set Initial Password: Enter a temporary password for the user

    Note: If you configured SMTP settings in Step 2, the user will receive an email asking to verify their address and set their initial password. If SMTP is not configured, you must check the “Email Verified” box and set an initial password manually, otherwise the user account will not be enabled.

  4. Click Create User

  5. Provide the user with:

    • Their username
    • The temporary password (if set manually)
    • The Zitadel Console URL
  6. Instruct the user to change their password on first login
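
The username rule from the Known Issue above can be checked before creating an account. This is a minimal sketch; the function name username_matches_email is hypothetical and the check runs locally, without contacting Zitadel.

```shell
# Hypothetical pre-check: succeeds only if the username equals the part of
# the email address before the final "@".
username_matches_email() {
  username="$1"
  email="$2"
  local_part="${email%@*}"   # strip the shortest suffix starting at "@"
  [ "$username" = "$local_part" ]
}
```

For example, username_matches_email foo foo@example.com succeeds, while username_matches_email foo foo.bar@example.com fails and signals that the account would hit the login limitation.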

Step 4: Configure User Roles and Permissions

Zitadel manages roles and permissions for accessing the CDN Manager:

  1. In the Zitadel Console, navigate to Roles

  2. Assign appropriate roles to users:

    • Admin: Full administrative access
    • Operator: Operational access without administrative functions
    • Viewer: Read-only access
  3. To assign a role:

    • Select the user
    • Click Add Role
    • Select the appropriate role
    • Save the assignment

Step 5: Access the MIB Frontend

The MIB Frontend is the web-based configuration GUI for CDN operators:

  1. Navigate to the MIB Frontend:

    https://<manager-host>/gui
    
  2. Log in using your Zitadel credentials

  3. Verify you can access the configuration interface

Step 6: Verify API Access

Test API connectivity to ensure the system is functioning:

curl -k https://<manager-host>/api/v1/health/ready

Expected response:

{
  "status": "ready"
}

See the API Guide for detailed API documentation.
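
Immediately after deployment the endpoint may not answer yet, so a short polling loop can be more convenient than a single request. This is a sketch under stated assumptions: the function name wait_for_api_ready, the attempt count, and the delay are illustrative, and -k is kept only because the example above uses it (drop it once a trusted certificate is installed).

```shell
# Hypothetical helper: poll the readiness endpoint until it reports "ready".
# Arguments: host, number of attempts (default 30), delay in seconds (default 5).
wait_for_api_ready() {
  host="$1"
  attempts="${2:-30}"
  delay="${3:-5}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -s silences progress output, -f makes HTTP errors return a failure
    if curl -ksf "https://$host/api/v1/health/ready" |
      grep -q '"status"[[:space:]]*:[[:space:]]*"ready"'; then
      echo "ready"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "not ready after $attempts attempts" >&2
  return 1
}
```

Calling wait_for_api_ready manager.example.com 12 10 waits up to about two minutes before giving up.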

Step 7: Configure TLS Certificates (If Not Done During Installation)

For production deployments, a valid TLS certificate from a trusted Certificate Authority should be configured. If you did not configure TLS certificates during installation, refer to Step 12: Configure TLS Certificates in the Installation Guide.

Step 8: Set Up Monitoring and Alerting

Configure monitoring dashboards and alerting:

  1. Access Grafana:

    • Navigate to https://<manager-host>/grafana
    • Log in with default credentials (also listed in the Glossary):
      • Username: admin
      • Password: edgeware
  2. Review Pre-built Dashboards:

    • System health dashboards are included by default
    • CDN metrics dashboards show routing and usage statistics

    Note: CDN Director instances automatically have DNS names configured for use in Grafana dashboards. The DNS name is derived from the name field in global.hosts.routers with .external appended. For example, a router named my-router-1 will have the DNS name my-router-1.external in Grafana configuration.

Step 9: Verify Kafka and PostgreSQL Health

Ensure the data layer components are healthy:

kubectl get pods

Verify the following pods are running:

| Component | Pod Name Pattern | Expected Status |
| --- | --- | --- |
| Kafka | acd-manager-kafka-controller-* | Running (3 pods for production) |
| PostgreSQL | acd-cluster-postgresql-0, acd-cluster-postgresql-1, acd-cluster-postgresql-2 | Running (3-node HA cluster) |
| Redis | acd-manager-redis-master-* | Running |

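All pods should show Running status. A few restarts shortly after deployment can occur while pods wait for their dependencies, but the restart count should not keep increasing.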
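
To pick out only the pods that need attention, the pod list can be filtered. This is a rough sketch that relies on the default column layout of kubectl get pods (NAME, READY, STATUS, RESTARTS, AGE); the function name unhealthy_pods is hypothetical.

```shell
# Hypothetical helper: print name, status, and restart count for pods that
# are neither Running nor Completed, or that have restarted at least once.
unhealthy_pods() {
  kubectl get pods --no-headers |
    awk '($3 != "Running" && $3 != "Completed") || $4 != "0" { print $1, $3, $4 }'
}
```

An empty result means every pod is Running (or a finished job) with a restart count of zero.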

Step 10: Configure Availability Zones (Optional)

For improved network performance, configure availability zones to enable Topology Aware Hints. This optimizes service-to-pod routing by keeping traffic within the same zone when possible.

See the Performance Tuning Guide for detailed instructions on:

  • Labeling nodes with zone and region topology
  • Verifying topology configuration
  • Requirements for Topology Aware Hints to activate
  • Integration with pod anti-affinity rules

Note: This step is optional. If zone labels are not configured, the system will fall back to random load-balancing.

Step 11: Review System Configuration

Verify the initial configuration:

  1. Review Helm Values:

    helm get values acd-manager -o yaml
    
  2. Check Ingress Configuration:

    kubectl get ingress
    
  3. Verify Service Endpoints:

    kubectl get endpoints
    

Step 12: Document Your Deployment

Maintain documentation for your deployment:

  • Cluster hostname and IP addresses
  • Configuration file locations
  • User accounts and roles created
  • TLS certificate expiration dates
  • Backup procedures and schedules
  • Monitoring and alerting contacts

Next Steps

After completing post-installation configuration:

  1. Configuration Guide - Detailed system configuration options
  2. Operations Guide - Day-to-day operational procedures
  3. Metrics & Monitoring Guide - Comprehensive monitoring setup
  4. API Guide - REST API reference and integration examples

Troubleshooting

Cannot Access Zitadel Console

  • Verify DNS resolution or hosts file configuration
  • Check that Traefik ingress is running: kubectl get pods -n kube-system | grep traefik
  • Review Traefik logs: kubectl logs -n kube-system -l app.kubernetes.io/name=traefik

Authentication Failures

  • Verify Zitadel pods are healthy: kubectl get pods | grep zitadel
  • Check Zitadel logs: kubectl logs <zitadel-pod-name>
  • Ensure the external domain matches your hostname in Zitadel configuration

MIB Frontend Not Loading

  • Verify MIB Frontend pods are running: kubectl get pods | grep mib-frontend
  • Check for connectivity issues to Confd and API services
  • Review browser console for JavaScript errors

API Returns 401 Unauthorized

  • Verify you have a valid bearer token
  • Check token expiration
  • Ensure Zitadel authentication is functioning

For additional troubleshooting assistance, refer to the Troubleshooting Guide.

6 - Configuration Guide

Helm chart configuration reference

Overview

The CDN Manager is deployed via Helm chart with configuration supplied through values.yaml files. This guide explains the configuration structure, how to apply changes, and provides a reference for all configurable options.

Configuration Files

The installation ISO provides three configuration files at /mnt/esb3027/:

| File | Purpose |
| --- | --- |
| values-lab.yaml | Recommended starting point for single-node lab deployments |
| values-production.yaml | Recommended starting point for multi-node production deployments |
| values.yaml | Complete reference of all configurable options with their defaults |

You only need to specify fields that differ from the defaults. Helm applies configuration hierarchically — values from your file override the chart’s built-in defaults, and any key you omit retains its default value.

Lab Configuration (values-lab.yaml)

values-lab.yaml is the recommended starting point for single-node lab, acceptance testing, and demonstration deployments. It pre-configures settings appropriate for a constrained single-node environment:

  • Single Kafka controller replica (the default 3 replicas require 3 separate nodes to satisfy pod anti-affinity rules)
  • Single Zitadel replica
  • Self-signed TLS by default, with real certificate configuration commented out for reference
  • Minimal resource requests suited to a single node

Copy the file to a writable location and edit it before deploying:

cp /mnt/esb3027/values-lab.yaml ~/values.yaml

The minimum required changes are:

  1. Set global.hosts.manager[0].host to your node’s hostname or IP address
  2. Set zitadel.zitadel.configmapConfig.ExternalDomain to the same value

These two values must match exactly or authentication will fail due to CORS policy violations. See Global Settings for details.
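
One way to keep the two values from drifting apart is to generate them from a single argument. The sketch below writes a small override file; the function name write_host_overrides and the file name used in the example are hypothetical, and only the two keys named above are set.

```shell
# Hypothetical helper: write an override file that sets both hostname
# values from one argument so they always match.
write_host_overrides() {
  host="$1"
  out_file="$2"
  cat > "$out_file" <<EOF
global:
  hosts:
    manager:
      - host: $host
zitadel:
  zitadel:
    configmapConfig:
      ExternalDomain: $host
EOF
}
```

For example, write_host_overrides manager.example.com ~/values-host.yaml produces a file that can be passed as an additional --values argument after ~/values.yaml. Helm replaces lists rather than merging them, so the single-entry global.hosts.manager list in the override replaces any list defined in the earlier file.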

Production Configuration (values-production.yaml)

values-production.yaml is the recommended starting point for multi-node production deployments on a cluster of at least three nodes. It pre-configures settings appropriate for a high-availability environment:

  • Three Zitadel replicas spread across nodes for HA
  • Production-grade resource requests and limits for all major components
  • Kafka with a dedicated single-replica StorageClass (avoiding unnecessary triple-redundancy on top of Kafka’s own quorum)
  • Manager HPA configured to scale between 3 and 8 replicas
  • TLS certificate configuration with clearly marked placeholders

Copy the file to a writable location and edit it before deploying:

cp /mnt/esb3027/values-production.yaml ~/values.yaml

The minimum required changes before deploying are:

  1. Set global.hosts.manager[0].host to your primary manager hostname
  2. Set zitadel.zitadel.configmapConfig.ExternalDomain to the same hostname
  3. Replace the placeholder TLS certificate and key in the ingress.secrets section, and update the secretName values in mibFrontend.ingress.extraTls and zitadel.ingress.tls to match
  4. Update global.hosts.routers with your CDN Director instances

See TLS Configuration and Global Settings for full details.

Hardware requirements: For per-node hardware specifications, refer to the System Requirements Guide. The System Requirements Guide is the authoritative source — the hardware comments in the header of values-production.yaml may not reflect the current requirements.

Complete Reference (values.yaml)

The full default values file at /mnt/esb3027/values.yaml documents every configurable option with its default value and inline comments. Use this as a reference when looking up available settings or understanding what the environment-specific files override.

Note: values.yaml is not intended to be used directly as your deployment configuration. Use values-lab.yaml or values-production.yaml as your starting point instead.

Configuration Merging

Helm merges configuration files from left to right, with later files overriding earlier values. This allows you to split your configuration into multiple files — for example, keeping TLS certificates separate from the main configuration:

# Multiple files merged left-to-right
helm install acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml \
  --values ~/values-tls.yaml

Individual Value Overrides

For temporary changes, you can override individual values with --set:

helm upgrade acd-manager /mnt/esb3027/helm/charts/acd-manager \
  --values ~/values.yaml \
  --set manager.logLevel=debug

Note: Using --set is discouraged for permanent changes, as the same arguments must be specified for every Helm operation.

Applying Configuration

Initial Installation

helm install acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml

Updating Configuration

helm upgrade acd-manager /mnt/esb3027/helm/charts/acd-manager \
  --values ~/values.yaml

Dry Run

Before applying changes, validate the configuration with a dry run:

helm upgrade acd-manager /mnt/esb3027/helm/charts/acd-manager \
  --values ~/values.yaml \
  --dry-run

Rollback

If an upgrade fails, rollback to the previous revision:

# View revision history
helm history acd-manager

# Rollback to previous revision
helm rollback acd-manager

# Rollback to specific revision
helm rollback acd-manager <revision_number>

Note: Rollback reverts the Helm release but does not modify your values.yaml file. You must manually revert configuration file changes.

Force Reinstall

If an upgrade fails and rollback is not sufficient, you can perform a clean reinstall:

helm uninstall acd-manager
helm install acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml

Warning: This is service-affecting as all pods will be destroyed and recreated.

Configuration Reference

Global Settings

The global section contains cluster-wide settings. The most critical configuration is global.hosts.

global:
  hosts:
    manager:
      - host: manager.local
    routers:
      - name: default
        address: 127.0.0.1
    edns_proxy: []
    geoip: []

| Key | Type | Description |
| --- | --- | --- |
| global.hosts.manager | Array | External IP addresses or DNS hostnames for all Manager cluster nodes |
| global.hosts.routers | Array | CDN Director (ESB3024) instances |
| global.hosts.edns_proxy | Array | EDNS Proxy addresses (currently unused) |
| global.hosts.geoip | Array | GeoIP Proxy addresses for Frontend GUI |

Important: The first entry in global.hosts.manager must match zitadel.zitadel.configmapConfig.ExternalDomain exactly. Zitadel enforces CORS protection, and authentication will fail if these do not match.

Manager Configuration

Core Manager API server settings:

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| manager.image.registry | String | ghcr.io | Container image registry |
| manager.image.repository | String | edgeware/acd-manager | Container image repository |
| manager.image.tag | String | | Image tag override (uses latest if empty) |
| manager.logLevel | String | info | Log level (trace, debug, info, warn, error) |
| manager.replicaCount | Number | 1 | Number of replicas (HPA manages this when enabled) |
| manager.containerPorts.http | Number | 80 | HTTP container port |
| manager.maxmindDbVolume | String | | Name of PVC containing MaxMind GeoIP databases |

Manager Resources

The chart supports both resource presets and explicit resource specifications:

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| manager.resourcesPreset | String | (empty) | Resource preset (see Resource Presets table). Ignored if manager.resources is set. |
| manager.resources.requests.cpu | String | 300m | CPU request |
| manager.resources.requests.memory | String | 512Mi | Memory request |
| manager.resources.limits.cpu | String | 1 | CPU limit |
| manager.resources.limits.memory | String | 4Gi | Memory limit |

Note: For production workloads, explicitly set manager.resources rather than using presets.

Manager Datastore

manager:
  datastore:
    type: redis
    namespace: "cdn_manager_ds"
    default_ttl: ""
    compression: zstd

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| manager.datastore.type | String | redis | Datastore backend type |
| manager.datastore.namespace | String | cdn_manager_ds | Redis namespace for manager data |
| manager.datastore.default_ttl | String | (empty) | Default TTL for entries |
| manager.datastore.compression | String | zstd | Compression algorithm (none, zstd, etc.) |

Manager Discovery

manager:
  discovery: []
  # Example:
  # - namespace: "other"
  #   hosts:
  #     - other-host1
  #     - other-host2
  #   pattern: "other-.*"

| Key | Type | Description |
| --- | --- | --- |
| manager.discovery | Array | Array of discovery host configurations. Each entry can specify hosts (list of hostnames), pattern (regex pattern), or both |

Manager Tuning

manager:
  tuning:
    enable_cache_control: true
    cache_control_max_age: "5m"
    cache_control_miss_max_age: ""

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| manager.tuning.enable_cache_control | Boolean | true | Enable cache control headers in responses |
| manager.tuning.cache_control_max_age | String | 5m | Maximum age for cache control headers |
| manager.tuning.cache_control_miss_max_age | String | (empty) | Maximum age for cache control headers on cache misses |

Manager Container Arguments

manager:
  args:
    - --config-file=/etc/manager/config.toml
    - http-server

Gateway Configuration

NGINX Gateway settings for external Director communication:

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| gateway.replicaCount | Number | 1 | Number of gateway replicas |
| gateway.resources.requests.cpu | String | 100m | CPU request |
| gateway.resources.requests.memory | String | 128Mi | Memory request |
| gateway.resources.limits.cpu | String | 150m | CPU limit |
| gateway.resources.limits.memory | String | 192Mi | Memory limit |
| gateway.service.type | String | ClusterIP | Service type |

MIB Frontend Configuration

Web-based configuration GUI settings:

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| mib-frontend.enabled | Boolean | true | Enable the frontend GUI |
| mib-frontend.frontend.resourcePreset | String | nano | Resource preset |
| mib-frontend.frontend.autoscaling.hpa.enabled | Boolean | true | Enable HPA |
| mib-frontend.frontend.autoscaling.hpa.minReplicas | Number | 2 | Minimum replicas |
| mib-frontend.frontend.autoscaling.hpa.maxReplicas | Number | 4 | Maximum replicas |

Confd Configuration

Confd settings for configuration management:

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| confd.enabled | Boolean | true | Enable Confd |
| confd.service.ports.internal | Number | 15000 | Internal service port |

VictoriaMetrics Configuration

Time-series database for metrics:

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| acd-metrics.enabled | Boolean | true | Enable metrics components |
| acd-metrics.victoria-metrics-single.enabled | Boolean | true | Enable VictoriaMetrics |
| acd-metrics.grafana.enabled | Boolean | true | Enable Grafana |
| acd-metrics.telegraf.enabled | Boolean | true | Enable Telegraf |
| acd-metrics.prometheus.enabled | Boolean | true | Enable Prometheus metrics |

Ingress Configuration

Traffic exposure settings:

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| ingress.enabled | Boolean | true | Enable ingress record generation |
| ingress.pathType | String | Prefix | Ingress path type |
| ingress.hostname | String | (empty) | Primary hostname (defaults to manager.local via global.hosts) |
| ingress.path | String | /api | Default path for ingress |
| ingress.tls | Boolean | false | Enable TLS configuration |
| ingress.selfSigned | Boolean | false | Generate self-signed certificate via Helm |
| ingress.secrets | Array | | Custom TLS certificate secrets |

Ingress Extra Paths

The chart includes default extra paths for Confd and GeoIP:

ingress:
  extraPaths:
    - path: /confd
      pathType: Prefix
      backend:
        service:
          name: acd-manager-gateway
          port:
            name: http
    - path: /geoip
      pathType: Prefix
      backend:
        service:
          name: acd-manager-gateway
          port:
            name: http

TLS Certificate Secrets

For production TLS certificates:

ingress:
  secrets:
    - name: manager.local-tls
      key: |-
        -----BEGIN RSA PRIVATE KEY-----
        ...
        -----END RSA PRIVATE KEY-----
      certificate: |-
        -----BEGIN CERTIFICATE-----
        ...
        -----END CERTIFICATE-----
  tls: true

Resource Configuration

Resource Presets

Predefined resource configurations for common deployment sizes:

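| Preset | Request CPU | Request Memory | Limit CPU | Limit Memory | Ephemeral Storage Limit |
| --- | --- | --- | --- | --- | --- |
| nano | 100m | 128Mi | 150m | 192Mi | 2Gi |
| micro | 250m | 256Mi | 375m | 384Mi | 2Gi |
| small | 500m | 512Mi | 750m | 768Mi | 2Gi |
| medium | 500m | 1024Mi | 750m | 1536Mi | 2Gi |
| large | 1000m | 2048Mi | 1500m | 3072Mi | 2Gi |
| xlarge | 1000m | 3072Mi | 3000m | 6144Mi | 2Gi |
| 2xlarge | 1000m | 3072Mi | 6000m | 12288Mi | 2Gi |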

Note: Limits are calculated as requests plus 50% (except for xlarge/2xlarge and ephemeral-storage).
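
The rule in the note can be reproduced with integer arithmetic. This is illustrative only: the function name preset_limit is hypothetical, the input is a plain number without its unit suffix, and the rule does not apply to the xlarge and 2xlarge presets or to ephemeral storage.

```shell
# Illustrative arithmetic: limit = request + 50%, computed as request * 3 / 2.
preset_limit() {
  request="$1"
  echo "$((request * 3 / 2))"
}
```

For the micro preset, preset_limit 250 gives 375 (millicores) and preset_limit 256 gives 384 (Mi), matching the table.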

Custom Resources

Override preset with custom values:

manager:
  resources:
    requests:
      cpu: "300m"
      memory: "512Mi"
    limits:
      cpu: "1"
      memory: "1Gi"

Note:

  • CPU values use millicores (1000m = 1 core)
  • Memory values use binary (power-of-two) units (1024Mi = 1Gi)
  • Requests represent minimum guaranteed resources
  • Limits represent maximum consumable resources

Capacity Planning

When sizing resources:

  • Requests determine scheduling (node must have available capacity)
  • Limits prevent resource starvation
  • Maintain 20-30% cluster headroom for scaling
  • Total capacity = sum over all components of (request × replica count), plus headroom
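
The sizing rule can be worked through with a short calculation. This sketch assumes that headroom is applied as a percentage on top of the summed requests, which is one reading of the guidance above; the function name required_capacity and the input format (one "request replicas" pair per line) are hypothetical.

```shell
# Hypothetical helper: read "request replicas" pairs from standard input and
# print the summed requests plus a headroom percentage (default 25).
required_capacity() {
  headroom_pct="${1:-25}"
  awk -v h="$headroom_pct" '
    NF >= 2 { total += $1 * $2 }
    END { printf "%d\n", total * (100 + h) / 100 }
  '
}
```

For example, three Manager replicas requesting 300m each and one gateway replica requesting 100m sum to 1000m, so printf '300 3\n100 1\n' | required_capacity 25 prints 1250.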

Security Contexts

Pod Security Context

manager:
  podSecurityContext:
    enabled: true
    fsGroup: 1001
    fsGroupChangePolicy: Always
    sysctls: []
    supplementalGroups: []

Container Security Context

manager:
  containerSecurityContext:
    enabled: true
    runAsUser: 1001
    runAsGroup: 1001
    runAsNonRoot: true
    readOnlyRootFilesystem: true
    privileged: false
    allowPrivilegeEscalation: false
    capabilities:
      drop: ["ALL"]
    seccompProfile:
      type: "RuntimeDefault"

Health Probes

Probe Types

| Probe | Purpose | Failure Action |
| --- | --- | --- |
| startupProbe | Initial startup verification | Container restart |
| readinessProbe | Traffic readiness check | Remove from load balancer |
| livenessProbe | Health monitoring | Container restart |

Default Probe Configuration

Liveness Probe

manager:
  livenessProbe:
    enabled: true
    initialDelaySeconds: 5
    periodSeconds: 30
    timeoutSeconds: 10
    failureThreshold: 5
    successThreshold: 1
    httpGet:
      path: /api/v1/health/alive
      port: http

Readiness Probe

manager:
  readinessProbe:
    enabled: true
    initialDelaySeconds: 5
    periodSeconds: 10
    timeoutSeconds: 7
    failureThreshold: 3
    successThreshold: 1
    httpGet:
      path: /api/v1/health/ready
      port: http

Startup Probe

manager:
  startupProbe:
    enabled: true
    initialDelaySeconds: 0
    periodSeconds: 5
    timeoutSeconds: 3
    failureThreshold: 10
    successThreshold: 1
    httpGet:
      path: /api/v1/health/alive
      port: http
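
The probe settings translate into an approximate time budget of initialDelaySeconds + failureThreshold × periodSeconds before Kubernetes acts on a failing probe. The sketch below is a rough estimate that ignores timeoutSeconds and scheduling jitter; the function name probe_budget is hypothetical.

```shell
# Approximate seconds before a continuously failing probe triggers its
# failure action: initialDelaySeconds + failureThreshold * periodSeconds.
probe_budget() {
  initial_delay="$1"
  period="$2"
  failure_threshold="$3"
  echo "$((initial_delay + failure_threshold * period))"
}
```

With the startup probe defaults above, probe_budget 0 5 10 gives 50, so a container has roughly 50 seconds to start answering before it is restarted. Raising failureThreshold is the usual way to allow slower startups.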

Autoscaling Configuration

Horizontal Pod Autoscaler (HPA)

manager:
  autoscaling:
    hpa:
      enabled: true
      minReplicas: 3
      maxReplicas: 8
      targetCPU: 50
      targetMemory: 80

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| manager.autoscaling.hpa.enabled | Boolean | true | Enable HPA |
| manager.autoscaling.hpa.minReplicas | Number | 3 | Minimum number of replicas |
| manager.autoscaling.hpa.maxReplicas | Number | 8 | Maximum number of replicas |
| manager.autoscaling.hpa.targetCPU | Number | 50 | Target CPU utilization percentage |
| manager.autoscaling.hpa.targetMemory | Number | 80 | Target Memory utilization percentage |
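
To see how these settings interact, the core Kubernetes HPA calculation, desired = ceil(current replicas × current utilization / target utilization) clamped to the configured bounds, can be reproduced for a single metric. This is a simplified sketch: the real controller also applies a tolerance band and stabilization windows, and takes the largest result when several metrics are configured. The function name hpa_desired is hypothetical.

```shell
# Simplified single-metric HPA estimate.
# Arguments: current replicas, current utilization %, target %, min, max.
hpa_desired() {
  current="$1"
  utilization="$2"
  target="$3"
  min="$4"
  max="$5"
  # Integer ceiling of (current * utilization) / target
  desired="$(( (current * utilization + target - 1) / target ))"
  if [ "$desired" -lt "$min" ]; then desired="$min"; fi
  if [ "$desired" -gt "$max" ]; then desired="$max"; fi
  echo "$desired"
}
```

With the values above, three replicas averaging 90% CPU against a 50% target give hpa_desired 3 90 50 3 8, which prints 6.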

Network Policy

networkPolicy:
  enabled: true
  allowExternal: true
  allowExternalEgress: true
  addExternalClientAccess: true

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| networkPolicy.enabled | Boolean | true | Enable NetworkPolicy |
| networkPolicy.allowExternal | Boolean | true | Allow connections from any source (don’t require pod label) |
| networkPolicy.allowExternalEgress | Boolean | true | Allow pod to access any range of port and destinations |
| networkPolicy.addExternalClientAccess | Boolean | true | Allow access from pods with client label set to “true” |

Pod Affinity and Anti-Affinity

manager:
  podAffinityPreset: ""
  podAntiAffinityPreset: soft
  nodeAffinityPreset:
    type: ""
    key: ""
    values: []
  affinity: {}

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| manager.podAffinityPreset | String | (empty) | Pod affinity preset (soft or hard). Ignored if affinity is set |
| manager.podAntiAffinityPreset | String | soft | Pod anti-affinity preset (soft or hard). Ignored if affinity is set |
| manager.nodeAffinityPreset.type | String | (empty) | Node affinity preset type (soft or hard) |
| manager.affinity | Object | {} | Custom affinity rules (overrides presets) |

Service Configuration

service:
  type: ClusterIP
  ports:
    http: 80
  annotations:
    service.kubernetes.io/topology-mode: Auto
  externalTrafficPolicy: Cluster
  sessionAffinity: None

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| service.type | String | ClusterIP | Service type |
| service.ports.http | Number | 80 | HTTP service port |
| service.annotations | Object | service.kubernetes.io/topology-mode: Auto | Service annotations |
| service.externalTrafficPolicy | String | Cluster | External traffic policy |

Persistence Configuration

persistence:
  enabled: false
  mountPath: /agiletv/manager/data
  storageClass: ""
  accessModes:
    - ReadWriteOnce
  size: 8Gi

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| persistence.enabled | Boolean | false | Enable persistence using PVC |
| persistence.mountPath | String | /agiletv/manager/data | Mount path |
| persistence.storageClass | String | (empty) | Storage class (uses cluster default if empty) |
| persistence.size | String | 8Gi | Size of data volume |

RBAC and Service Account

rbac:
  create: false
  rules: []

serviceAccount:
  create: true
  name: ""
  automountServiceAccountToken: true
  annotations: {}

Metrics

metrics:
  enabled: false
  serviceMonitor:
    enabled: false
    namespace: ""
    annotations: {}
    labels: {}
    interval: ""
    scrapeTimeout: ""

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| metrics.enabled | Boolean | false | Enable Prometheus metrics export |
| metrics.serviceMonitor.enabled | Boolean | false | Create Prometheus Operator ServiceMonitor |

Next Steps

After configuration:

  1. Installation Guide - Deploy with your configuration
  2. Operations Guide - Day-to-day management
  3. Performance Tuning Guide - Optimize system performance
  4. Architecture Guide - Understand component relationships

7 - Performance Tuning Guide

Optimization tips for improving CDN Manager performance

Overview

This guide provides performance tuning recommendations for the AgileTV CDN Manager (ESB3027). While the default configuration is suitable for most deployments, certain environments may benefit from additional optimizations.

Network Topology Optimization

Topology Aware Hints

The CDN Manager uses Kubernetes Topology Aware Hints (renamed Topology Aware Routing in Kubernetes 1.27) to prefer routing traffic to pods in the same zone as the source of that traffic. This reduces cross-zone latency and improves overall system responsiveness.

How It Works

When nodes are labeled with topology zones, Kubernetes automatically routes traffic to pods in the same zone when possible. This is particularly beneficial for:

  • Low-latency requirements: Keeps traffic local to reduce round-trip time
  • Cost optimization: Reduces cross-zone data transfer costs in cloud environments
  • Load distribution: Prevents hotspots by distributing load across zones

Configuring Availability Zones

Each node must have zone and region labels applied for Topology Aware Hints to function:

# Label a node with a zone
kubectl label nodes <node-name> topology.kubernetes.io/zone=us-east-1a

# Label a node with a region
kubectl label nodes <node-name> topology.kubernetes.io/region=us-east-1

Replace <node-name> with your actual node names and adjust the zone/region values to match your deployment geography.
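When several nodes need labels, the commands above can be generated from a list. The following is a minimal sketch only: print_label_commands is a hypothetical helper, and deriving the region by dropping the last character of the zone is an assumption that fits only AWS-style names such as us-east-1a. It prints the commands instead of running them so they can be reviewed first.

```shell
# Print (do not run) kubectl label commands for "node zone" pairs read from stdin.
# Assumption: the region is the zone name without its trailing letter.
print_label_commands() {
  while read -r node zone; do
    [ -z "$node" ] && continue
    region="${zone%?}"
    echo "kubectl label nodes $node topology.kubernetes.io/zone=$zone topology.kubernetes.io/region=$region --overwrite"
  done
}

# Example: review the output, then pipe it to sh to apply it.
#   printf 'node1 us-east-1a\nnode2 us-east-1b\n' | print_label_commands
```

The --overwrite flag lets the same list be re-applied after a zone assignment changes.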

Note: Labels applied via kubectl label are automatically persistent and will survive node restarts.

Verify Topology Configuration

Verify labels are applied:

kubectl get nodes --show-labels | grep topology.kubernetes.io

Verify EndpointSlices are being generated with hints:

kubectl get endpointslices
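The plain listing does not show whether hints were actually assigned. In the EndpointSlice API, a hinted endpoint carries a hints block with a forZones list, so counting those entries is a quick check. The helper below is a sketch; count_zone_hints is a hypothetical name and it does a simple text match, not a full YAML parse.

```shell
# Count endpoints that carry a zone hint in EndpointSlice YAML read from stdin.
# Each hinted endpoint contains a "hints:" block with a "forZones:" list.
count_zone_hints() {
  # grep -c exits non-zero when the count is 0, so mask that exit status.
  grep -c 'forZones:' || true
}

# Against a live cluster (requires kubectl access):
#   kubectl get endpointslices -o yaml | count_zone_hints
```

A result of 0 suggests that hints are not active, for example because zone labels are missing.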

Requirements for Topology Aware Hints

For Topology Aware Hints to activate:

  • Minimum Nodes: At least one node must be labeled with each zone referenced by endpoints
  • Symmetry: The control plane checks for sufficient CPU capacity across zones to balance traffic
  • Zone Coverage: All zones with endpoints should have at least one ready node

Integration with Pod Anti-Affinity

Topology labels complement the pod anti-affinity rules already configured in the Helm chart:

  • Pod Anti-Affinity: Handles pod-to-node placement to ensure high availability
  • Topology Aware Hints: Handles service-to-pod traffic routing to keep requests within the same zone

Together, these features optimize both placement and routing for improved performance.

Fallback Behavior

If zone labels are not configured, the system falls back to random load-balancing across all available pods. This is functionally correct but may result in:

  • Increased cross-zone traffic
  • Higher latency for some requests
  • Less predictable performance characteristics

Kernel Network Tuning (sysctl)

For high-throughput deployments, tuning Linux kernel network parameters can significantly improve connection handling and overall system performance. These settings are particularly beneficial for environments with high connection rates or large numbers of concurrent connections.

Apply the following settings to optimize network performance:

# Networking
net.core.somaxconn = 1024
net.core.netdev_max_backlog = 2048
net.ipv4.tcp_max_syn_backlog = 2048

# Connection Tracking
net.netfilter.nf_conntrack_max = 131072
net.netfilter.nf_conntrack_tcp_timeout_established = 1200

# Port Reuse
net.ipv4.ip_local_port_range = 10240 65535
net.ipv4.tcp_tw_reuse = 1

# Memory Buffers
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608

Setting Descriptions

| Parameter | Recommended Value | Purpose |
|-----------|-------------------|---------|
| net.core.somaxconn | 1024 | Maximum socket listen backlog. Increases pending connection queue size. |
| net.core.netdev_max_backlog | 2048 | Maximum packets queued at network device level. Helps handle burst traffic. |
| net.ipv4.tcp_max_syn_backlog | 2048 | Maximum SYN requests queued. Improves handling of connection floods. |
| net.netfilter.nf_conntrack_max | 131072 | Maximum tracked connections. Prevents connection tracking table exhaustion. |
| net.netfilter.nf_conntrack_tcp_timeout_established | 1200 | Timeout for established connections (seconds). Reduces stale entry buildup. |
| net.ipv4.ip_local_port_range | 10240 65535 | Range of local ports for outbound connections. Expands available ephemeral ports. |
| net.ipv4.tcp_tw_reuse | 1 | Allows reusing TIME_WAIT sockets. Reduces port exhaustion under high load. |
| net.core.rmem_max | 8388608 | Maximum receive socket buffer size (8MB). Improves high-bandwidth transfers. |
| net.core.wmem_max | 8388608 | Maximum send socket buffer size (8MB). Improves high-bandwidth transfers. |

Applying Settings

Temporary (Until Reboot)

Apply settings immediately; they are lost on reboot:

sudo sysctl -w net.core.somaxconn=1024
sudo sysctl -w net.core.netdev_max_backlog=2048
# ... repeat for each parameter

Persistent (Across Reboots)

Add settings to /etc/sysctl.conf or a file in /etc/sysctl.d/:

# Create a dedicated config file
cat <<EOF | sudo tee /etc/sysctl.d/99-cdn-manager.conf
# CDN Manager Network Tuning
net.core.somaxconn = 1024
net.core.netdev_max_backlog = 2048
net.ipv4.tcp_max_syn_backlog = 2048
net.netfilter.nf_conntrack_max = 131072
net.netfilter.nf_conntrack_tcp_timeout_established = 1200
net.ipv4.ip_local_port_range = 10240 65535
net.ipv4.tcp_tw_reuse = 1
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
EOF

# Apply all settings
sudo sysctl -p /etc/sysctl.d/99-cdn-manager.conf
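After applying the file, it can help to confirm that the live kernel values match it. The sketch below reads each key = value line and compares it with the corresponding entry under /proc/sys. The function names are hypothetical. Keys under net.netfilter exist only while the nf_conntrack kernel module is loaded, so a missing entry is reported separately.

```shell
# Convert a sysctl key to its /proc/sys path (dots become slashes).
sysctl_path() {
  printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Compare every "key = value" line of a sysctl config file with the live value.
check_sysctl_file() {
  while IFS='=' read -r key value; do
    key=$(printf '%s' "$key" | tr -d '[:space:]')
    case "$key" in ''|\#*) continue ;; esac   # skip blank lines and comments
    want=$(echo "$value" | xargs)             # trim and normalize whitespace
    path=$(sysctl_path "$key")
    if [ ! -r "$path" ]; then
      echo "MISS $key (not readable; is the kernel module loaded?)"
    else
      have=$(xargs < "$path")                 # tabs in /proc become single spaces
      if [ "$have" = "$want" ]; then
        echo "OK   $key = $have"
      else
        echo "DIFF $key: live '$have', wanted '$want'"
      fi
    fi
  done < "$1"
}

# Usage:
#   check_sysctl_file /etc/sysctl.d/99-cdn-manager.conf
```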

Kubernetes Considerations

For Kubernetes deployments, these sysctl settings can be applied via:

  1. Node-level configuration: Use DaemonSets or node provisioning scripts
  2. Pod-level safe sysctls: Some sysctls can be set per-pod via securityContext.sysctls
  3. Container runtime configuration: Configure via container runtime options

Note that some sysctls require privileged containers or node-level configuration.
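As an illustration of the pod-level option: of the parameters listed in this guide, net.ipv4.ip_local_port_range belongs to the Kubernetes safe set and can be set through securityContext.sysctls without extra kubelet configuration. Others, such as net.core.somaxconn, are treated as unsafe and must be allowed on the kubelet first. The manifest below is a sketch; the pod name and image are placeholders and are not part of the CDN Manager chart.

```shell
# Print an example Pod manifest that sets one safe sysctl for the pod's
# own network namespace. Name and image are placeholders for illustration.
print_sysctl_pod() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: sysctl-example
spec:
  securityContext:
    sysctls:
      - name: net.ipv4.ip_local_port_range
        value: "10240 65535"
  containers:
    - name: app
      image: registry.example.com/app:latest
EOF
}

# Validate locally without creating anything (requires kubectl):
#   print_sysctl_pod | kubectl apply --dry-run=client -f -
```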

Monitoring Impact

After applying these settings, monitor:

  • Connection establishment rates
  • TIME_WAIT socket count: netstat -n | grep TIME_WAIT | wc -l
  • Connection tracking table usage: cat /proc/sys/net/netfilter/nf_conntrack_count
  • Network buffer utilization via Grafana dashboards

Resource Configuration

Horizontal Pod Autoscaler (HPA)

The default HPA configuration is tuned for production workloads. For environments with variable load, consider adjusting the scale metrics:

| Component | Default Scale Metrics | Tuning Consideration |
|-----------|-----------------------|----------------------|
| Core Manager | CPU 50%, Memory 80% | Lower CPU threshold for faster scale-out |
| NGINX Gateway | CPU 75%, Memory 80% | Increase for cost optimization |
| MIB Frontend | CPU 75%, Memory 90% | Adjust based on operator concurrency |

For detailed HPA configuration, see the Architecture Guide.
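To judge how a threshold change affects scaling, recall that the Kubernetes HPA computes the desired replica count as ceil(currentReplicas × currentMetricValue / targetMetricValue). The sketch below reproduces only that formula with integer arithmetic. It ignores the default tolerance band and stabilization windows, and hpa_desired_replicas is a hypothetical helper name.

```shell
# desired = ceil(current_replicas * current_metric / target_metric)
# Arguments: current replica count, observed utilization (%), target utilization (%).
hpa_desired_replicas() {
  current=$1; metric=$2; target=$3
  echo $(( (current * metric + target - 1) / target ))
}

# Usage:
#   hpa_desired_replicas 4 80 50   # 4 replicas at 80% CPU against a 50% target
#   hpa_desired_replicas 4 80 75   # same load against a 75% target
```

With the Core Manager's 50% CPU target, 4 replicas averaging 80% CPU give 7 desired replicas; against a 75% target the same load gives 5.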

Resource Requests and Limits

Ensure resource requests and limits are appropriately sized for your workload. Under-provisioned resources can cause:

  • Pod evictions during high load
  • Increased latency due to CPU throttling
  • Slow scaling responses

Refer to the Configuration Guide for preset configurations and planning guidance.

Database Optimization

PostgreSQL

The PostgreSQL cluster is managed by the CloudNativePG operator. For improved performance:

  • Connection Pooling: The application uses connection pooling by default
  • Replica Usage: Read queries can be offloaded to replicas for read-heavy workloads
  • Backup Scheduling: Schedule backups during low-traffic periods to minimize I/O impact

Redis

Redis provides in-memory caching for sessions and ephemeral state:

  • Memory Allocation: Ensure sufficient memory for cache hit rates
  • Persistence: RDB snapshots are enabled; adjust frequency based on durability needs

Kafka

Kafka handles event streaming for selection input and metrics:

  • Partition Count: Default partitions are sized for typical workloads
  • Replication Factor: Production deployments use 3 replicas for fault tolerance
  • Consumer Groups: The Selection Input Worker is limited to one consumer per partition

Monitoring Performance

Key Metrics to Watch

Monitor the following metrics for performance insights:

  • API Response Time: Track via Grafana dashboards
  • Pod CPU/Memory Usage: Identify resource bottlenecks
  • Kafka Lag: Monitor consumer lag for selection input processing
  • Database Connections: Watch for connection pool exhaustion

Grafana Dashboards

Pre-built dashboards are available at https://<manager-host>/grafana:

  • System Health: Overall cluster and application health
  • CDN Metrics: Routing and usage statistics
  • Resource Utilization: CPU, memory, and network usage per component

Troubleshooting Performance Issues

High Latency

  1. Check pod distribution across nodes: kubectl get pods -o wide
  2. Verify topology labels are applied: kubectl get nodes --show-labels
  3. Review network latency between nodes
  4. Check for resource contention: kubectl top pods

Slow Scaling

  1. Verify HPA is enabled: kubectl get hpa
  2. Check cluster capacity for scheduling new pods
  3. Review HPA metrics: kubectl describe hpa acd-manager

Database Performance

  1. Check PostgreSQL cluster status: kubectl get pods -l app.kubernetes.io/name=postgresql
  2. Review slow query logs (if enabled)
  3. Monitor connection pool usage

Next Steps

After reviewing performance tuning:

  1. Architecture Guide - Understand component interactions
  2. Configuration Guide - Detailed configuration options
  3. Metrics & Monitoring Guide - Comprehensive monitoring setup
  4. Troubleshooting Guide - Resolve performance issues

8 - Operations Guide

Day-to-day operational procedures and maintenance tasks

Overview

This guide covers day-to-day operational procedures for managing the AgileTV CDN Manager (ESB3027). Topics include routine maintenance, backup procedures, log management, and common operational tasks.

Prerequisites

Before performing operations, ensure you have:

  • kubectl access to the cluster
  • helm CLI installed
  • Access to the node where values.yaml is stored
  • Appropriate RBAC permissions for administrative tasks

Cluster Access

There are two supported methods for accessing the Kubernetes cluster:

  1. SSH to a Server Node (Recommended for operations staff) - SSH into any Server node and run kubectl commands directly
  2. Remote kubectl - Install kubectl on your local machine and configure it to connect to the cluster remotely

Method 1: SSH to a Server Node

The kubectl command-line tool is pre-configured on all Server nodes and can be used directly without additional setup:

# SSH to any Server node
ssh root@<server-ip>

# Run kubectl commands directly
kubectl get nodes
kubectl get pods

This method is recommended for day-to-day operations as it requires no local configuration and provides direct access to the cluster.

Method 2: Remote kubectl from Local Machine

To use kubectl from your local workstation or laptop:

Step 1: Install kubectl

Download and install kubectl for your operating system:

  • Official Documentation: Install kubectl
  • macOS (Homebrew): brew install kubectl
  • Linux: Download from the official Kubernetes release page
  • Windows: Download from the official Kubernetes release page

Step 2: Copy kubeconfig from Server Node

# Copy kubeconfig from any Server node
scp root@<server-ip>:/etc/rancher/k3s/k3s.yaml ~/.kube/config

Step 3: Update kubeconfig

Edit the kubeconfig file to point to the correct server address:

# Replace the loopback address with the actual server IP.
# Run only the variant that matches your operating system:
sed -i '' 's/127\.0\.0\.1/<server-ip>/g' ~/.kube/config  # macOS (BSD sed)
sed -i 's/127\.0\.0\.1/<server-ip>/g' ~/.kube/config     # Linux (GNU sed)

# Or manually edit ~/.kube/config and change:
# server: https://127.0.0.1:6443
# to:
# server: https://<server-ip>:6443
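Because the sed -i syntax differs between macOS and Linux, a small wrapper that writes through a temporary file works the same on both. This is a sketch; set_kubeconfig_server is a hypothetical helper, and it assumes the kubeconfig contains the default https://127.0.0.1:6443 server entry shown above.

```shell
# Rewrite the API server address in a kubeconfig without using sed -i.
# Arguments: path to the kubeconfig file, server IP or hostname.
set_kubeconfig_server() {
  file=$1; ip=$2
  tmp=$(mktemp) || return 1
  # mktemp creates the file with mode 0600, which suits a kubeconfig.
  sed "s#https://127\.0\.0\.1:6443#https://${ip}:6443#" "$file" > "$tmp" \
    && mv "$tmp" "$file"
}

# Usage:
#   set_kubeconfig_server ~/.kube/config 192.0.2.10
```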

Step 4: Verify connectivity

kubectl get nodes

Managing Multiple Clusters

If you manage multiple Kubernetes clusters from the same machine, you can maintain multiple kubeconfig files:

# Set KUBECONFIG environment variable to include multiple config files
export KUBECONFIG=~/.kube/config-prod:~/.kube/config-lab

# View all contexts
kubectl config get-contexts

# Switch between clusters
kubectl config use-context <context-name>

# View current context
kubectl config current-context

For more information, see the official Kubernetes documentation: Organizing Cluster Access

Helm Commands

Helm releases are managed cluster-wide:

# List all releases
helm list

# View release history
helm history acd-manager

# Get deployed values
helm get values acd-manager -o yaml

# Get deployed manifest
helm get manifest acd-manager

Note: If using remote kubectl, ensure helm is installed on your local machine. See Helm Installation for instructions.

Backup Procedures

PostgreSQL Backup

PostgreSQL is managed by the CloudNativePG operator, which provides continuous backup capabilities.

# Check backup status
kubectl get backup

# Create manual backup
kubectl apply -f - <<EOF
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: manual-backup-$(date +%Y%m%d-%H%M%S)
spec:
  cluster:
    name: acd-cluster-postgresql
EOF

# List available backups
kubectl get backup -o wide

# Restore from backup (requires downtime)
# See Upgrade Guide for restore procedures
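When a manual backup is scripted, it is convenient to keep the generated name in a variable so the same object can be checked afterwards. The sketch below only renders the manifest. backup_manifest is a hypothetical helper, and the kubectl wait line in the usage comment assumes the operator reports a finished backup as status.phase completed.

```shell
# Render a Backup manifest for the given backup name on stdout.
backup_manifest() {
  cat <<EOF
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: $1
spec:
  cluster:
    name: acd-cluster-postgresql
EOF
}

# Usage (requires kubectl access):
#   name="manual-backup-$(date +%Y%m%d-%H%M%S)"
#   backup_manifest "$name" | kubectl apply -f -
#   kubectl wait --for=jsonpath='{.status.phase}'=completed "backup/$name" --timeout=30m
```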

Longhorn Volume Backups

Longhorn provides snapshot and backup capabilities for persistent volumes:

# List all volumes
kubectl get volumes -n longhorn-system

# Create snapshot via Longhorn UI
# Port-forward to Longhorn UI (do not expose via ingress)
kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80

# Access: http://localhost:8080
# WARNING: Longhorn UI grants access to sensitive storage information
# and should never be exposed through the ingress controller

Accessing Internal Services

For debugging and troubleshooting, you may need direct access to internal services.

PostgreSQL

PostgreSQL is managed by the CloudNativePG operator. Connection details are stored in the acd-cluster-postgresql-app Secret:

# View connection details
kubectl describe secret acd-cluster-postgresql-app

# Extract individual fields
PG_HOST=$(kubectl get secret acd-cluster-postgresql-app -o jsonpath='{.data.host}' | base64 -d)
PG_USER=$(kubectl get secret acd-cluster-postgresql-app -o jsonpath='{.data.username}' | base64 -d)
PG_PASS=$(kubectl get secret acd-cluster-postgresql-app -o jsonpath='{.data.password}' | base64 -d)
PG_DB=$(kubectl get secret acd-cluster-postgresql-app -o jsonpath='{.data.dbname}' | base64 -d)

# Connect via psql
kubectl exec -it acd-cluster-postgresql-0 -- psql -U $PG_USER -d $PG_DB

Secret fields: The CNPG operator populates the following fields: username, password, host, port, dbname, uri, jdbc-uri, fqdn-uri, fqdn-jdbc-uri, pgpass.

Redis

Redis runs on port 6379 with no authentication:

# Connect via redis-cli
kubectl exec -it acd-manager-redis-master-0 -- redis-cli

# Or connect from another pod
kubectl run redis-test --rm -it --image=redis -- redis-cli -h acd-manager-redis-master

Kafka

kafka-topics.sh --bootstrap-server :9095 --list

The selection_input topic is pre-configured for selection input events.

Kubernetes Port Forwarding

For accessing internal Kubernetes services that are not exposed via ingress or services, use kubectl port-forward to create a secure tunnel from your local machine to the service.

Basic Port Forwarding

# Forward local port to a service
kubectl port-forward -n <namespace> svc/<service-name> <local-port>:<service-port>

# Example: Forward local port 8080 to Grafana (port 3000)
kubectl port-forward -n default svc/acd-manager-grafana 8080:3000

Note: “Local” refers to the machine where you run kubectl. This can be:

  • A Server node in the cluster (common for administrative tasks)
  • A remote machine with kubectl configured to access the cluster

Accessing the Forwarded Service

Once the port-forward is established, access the service at http://localhost:<local-port> from the machine where you ran kubectl port-forward.

If running on a Server node: kubectl port-forward listens only on 127.0.0.1 by default. To access the forwarded port from your local workstation:

  • Start the port-forward with --address 0.0.0.0 (or a specific node IP) so that it accepts non-local connections
  • Ensure the firewall on the Server node allows traffic on the forwarded port from your network
  • Use the Server node’s IP address instead of localhost from your workstation

# From your workstation (if firewall allows)
curl http://<server-node-ip>:<local-port>

For simplicity, consider running port-forward from your local machine (if kubectl is configured for remote cluster access) rather than from a Server node.

Background Port Forwarding

To run port-forward in the background:

kubectl port-forward -n <namespace> svc/<service-name> <local-port>:<service-port> &
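A backgrounded tunnel keeps running until it is killed, so it is easy to leave one behind. The sketch below records the process ID and stops the tunnel when the shell exits. start_tunnel and TUNNEL_PID are hypothetical names used only for illustration.

```shell
# Run a command in the background and stop it automatically when the shell exits.
start_tunnel() {
  "$@" >/dev/null 2>&1 &
  TUNNEL_PID=$!
  trap 'kill "$TUNNEL_PID" 2>/dev/null' EXIT
}

# Usage:
#   start_tunnel kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80
#   kill "$TUNNEL_PID"   # stop it earlier if needed
```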

Security Considerations

Port forwarding is recommended for:

  • Administrative interfaces (e.g., Longhorn UI) that should not be publicly exposed
  • Debugging and troubleshooting internal services
  • Temporary access to services without modifying ingress configuration

The port-forward tunnel remains active only while the kubectl port-forward command is running. Press Ctrl+C to terminate the tunnel.

Example: The Longhorn storage UI is intentionally not exposed via ingress due to security risks. Access it via port-forward:

kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80

Then navigate to http://localhost:8080 in your browser.

Longhorn Storage

Longhorn is a distributed block storage system for Kubernetes that provides persistent volumes for stateful applications such as PostgreSQL and Kafka.

Architecture

Longhorn deploys controller and replica engines on each node, forming a distributed storage system. When a volume is created, Longhorn replicates data across multiple nodes to ensure durability even in the event of node failures.

Storage Protocols:

  • iSCSI: Used for standard Read-Write-Once (RWO) volumes
  • NFS: Used for Read-Write-Many (RWX) volumes that can be mounted by multiple pods simultaneously

Configuration

The CDN Manager deploys Longhorn with a single replica configuration, which differs from the Longhorn default of 3 replicas. This configuration is optimized for the cluster architecture where:

  • Pod-node affinity is configured to schedule pods on the same node as their persistent volume data
  • This optimizes I/O performance by reducing network traffic
  • Data locality is maintained while still providing volume portability

Capacity Planning

Longhorn storage requires an additional 30% capacity headroom for internal operations and scaling. If less than 30% of the total partition capacity is available, Longhorn may mark volumes as “full” and prevent further writes.
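The 30% rule can be checked with simple arithmetic. The helpers below are a sketch with hypothetical names. The usage comment assumes the Longhorn data directory is at its upstream default, /var/lib/longhorn, which may differ in your installation.

```shell
# Largest amount of data that still leaves 30% of a partition free.
# Argument: total partition size (any unit; the result uses the same unit).
longhorn_usable() {
  echo $(( $1 * 70 / 100 ))
}

# Succeeds when used/total (same unit) still leaves at least 30% free.
has_longhorn_headroom() {
  used=$1; total=$2
  [ $(( (total - used) * 100 )) -ge $(( total * 30 )) ]
}

# Usage on a node (df -Pk prints total in column 2 and used in column 3):
#   set -- $(df -Pk /var/lib/longhorn | awk 'NR==2 {print $3, $2}')
#   has_longhorn_headroom "$1" "$2" || echo "less than 30% free"
```

For example, a 1000 GiB partition should hold no more than 700 GiB of volume data.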

For detailed storage requirements and disk partitioning guidance, see the System Requirements Guide.

Configuration Backup

Always back up your Helm values before making changes:

# Export the currently deployed values
helm get values acd-manager -o yaml > ~/values-deployed-$(date +%Y%m%d).yaml

# Back up your custom values file (separate file name, so the export above is not overwritten)
cp ~/values.yaml ~/values-backup-$(date +%Y%m%d).yaml

Backup Schedule Recommendations

| Component | Frequency | Retention |
|-----------|-----------|-----------|
| PostgreSQL | Daily | 30 days |
| Longhorn Snapshots | Before changes | 7 days |
| Configuration | Before each change | Indefinite |

Updating MaxMind GeoIP Databases

The MaxMind GeoIP databases (GeoIP2-City, GeoLite2-ASN, GeoIP2-Anonymous-IP) are used for GeoIP-based routing and validation features. These databases should be updated periodically to ensure accurate IP geolocation data.

Prerequisites

  • Updated MaxMind database files (.mmdb format) obtained from MaxMind
  • Access to the cluster via kubectl
  • Helm CLI installed

Update Procedure

Step 1: Create New Volume with Updated Databases

Run the volume generation utility with a unique volume name that includes a revision identifier:

# Mount the installation ISO if not already mounted
mkdir -p /mnt/esb3027
mount -o loop,ro esb3027-acd-manager-X.Y.Z.iso /mnt/esb3027

# Generate new volume with updated databases
/mnt/esb3027/generate-maxmind-volume

When prompted:

  1. Provide the paths to the three database files:
    • GeoIP2-City.mmdb
    • GeoLite2-ASN.mmdb
    • GeoIP2-Anonymous-IP.mmdb
  2. Enter a unique volume name with a revision number or date, for example:
    • maxmind-geoip-2026-04
    • maxmind-geoip-v2

Tip: Using a revision-based naming convention simplifies rollback if needed.

Step 2: Update Helm Configuration

Edit your values.yaml file to reference the new volume:

manager:
  maxmindDbVolume: maxmind-geoip-2026-04

Replace maxmind-geoip-2026-04 with the volume name you specified in Step 1.

Step 3: Apply Configuration Update

Upgrade the Helm release with the updated configuration:

helm upgrade acd-manager /mnt/esb3027/charts/acd-manager --values ~/values.yaml

Step 4: Rolling Restart (Optional)

To ensure all pods immediately use the new database files, perform a rolling restart of the manager deployment:

kubectl rollout restart deployment acd-manager

Monitor the rollout status:

kubectl rollout status deployment acd-manager

Step 5: Verify Update

Verify the pods are running with the new volume:

kubectl get pods
kubectl describe pod -l app.kubernetes.io/component=manager | grep -A 5 "Volumes"

Step 6: Clean Up Old Volume (Optional)

After verifying the new databases are working correctly, you can delete the old persistent volume:

# List persistent volumes to find the old one
kubectl get pv

# Delete the old volume
kubectl delete pv <old-volume-name>

Caution: Ensure the new volume is functioning correctly before deleting the old volume. Keep the old volume for at least 24-48 hours as a rollback option.

Rollback Procedure

If issues occur after updating the databases:

  1. Revert the maxmindDbVolume value in your values.yaml to the previous volume name
  2. Run helm upgrade with the reverted configuration
  3. Optionally restart the deployment: kubectl rollout restart deployment acd-manager

Update Frequency Recommendations

| Database | Recommended Update Frequency |
|----------|------------------------------|
| GeoIP2-City | Weekly or monthly |
| GeoLite2-ASN | Monthly |
| GeoIP2-Anonymous-IP | Weekly or monthly |

MaxMind releases database updates on a regular schedule. Subscribe to MaxMind notifications to stay informed of new releases.

Log Management

Application Logs

# View manager logs
kubectl logs -l app.kubernetes.io/component=manager

# Follow logs in real-time
kubectl logs -l app.kubernetes.io/component=manager -f

# View logs from specific pod
kubectl logs <pod-name>

# View previous instance logs (after crash)
kubectl logs <pod-name> -p

# View logs with timestamps
kubectl logs <pod-name> --timestamps

# View logs from all containers in pod
kubectl logs <pod-name> --all-containers

Component-Specific Logs

# Zitadel logs
kubectl logs -l app.kubernetes.io/name=zitadel

# Gateway logs
kubectl logs -l app.kubernetes.io/component=gateway

# Confd logs
kubectl logs -l app.kubernetes.io/component=confd

# MIB Frontend logs
kubectl logs -l app.kubernetes.io/component=mib-frontend

# PostgreSQL logs
kubectl logs -l app.kubernetes.io/name=postgresql

# Kafka logs
kubectl logs -l app.kubernetes.io/name=kafka

# Redis logs
kubectl logs -l app.kubernetes.io/name=redis

Log Aggregation

Logs are collected by Telegraf and sent to VictoriaMetrics:

# Access Grafana for log visualization
# https://<manager-host>/grafana

# Query logs via Grafana Explore
# Select VictoriaMetrics datasource and use log queries

Log Rotation

Container logs are automatically rotated by Kubernetes:

  • Default max size: 10MB per container
  • Default max files: 5 rotated files
  • Total per container: ~50MB maximum (5 files × 10MB)

Scaling Operations

Manual Scaling

Note: If HPA (Horizontal Pod Autoscaler) is enabled for a deployment, manual scaling changes will be overridden by the HPA. To manually scale, you must first disable the HPA.

# Check if HPA is enabled
kubectl get hpa

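# Pin the HPA to a fixed replica count before manual scaling.
# maxReplicas is a required field and cannot be set to null, so set both bounds to the same value.
kubectl patch hpa acd-manager -p '{"spec": {"minReplicas": 3, "maxReplicas": 3}}'

# Or delete the HPA entirely (a later helm upgrade may recreate it)
kubectl delete hpa acd-manager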

# Scale manager replicas
kubectl scale deployment acd-manager --replicas=3

# Scale gateway replicas
kubectl scale deployment acd-manager-gateway --replicas=2

# Scale MIB frontend replicas
kubectl scale deployment acd-manager-mib-frontend --replicas=2

HPA Configuration

# View HPA status
kubectl get hpa

# Describe HPA details
kubectl describe hpa acd-manager

# Edit HPA configuration
kubectl edit hpa acd-manager

Configuration Updates

Updating Helm Values

# Edit values file
vi ~/values.yaml

# Validate with dry-run
helm upgrade acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml \
  --dry-run

# Apply changes
helm upgrade acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml

# Verify rollout
kubectl rollout status deployment/acd-manager

Rolling Back Changes

# View revision history
helm history acd-manager

# Rollback to previous revision
helm rollback acd-manager

# Rollback to specific revision
helm rollback acd-manager <revision>

# Verify rollback
helm history acd-manager

Certificate Management

Checking Certificate Expiration

# Check TLS secret expiration
kubectl get secret acd-manager-tls -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates

# Check via Grafana dashboard
# Certificate expiration metrics are available in Grafana
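For routine checks, openssl can report directly whether a certificate expires within a given window through its -checkend option, which avoids comparing dates by hand. The helper below is a sketch; cert_valid_for_days is a hypothetical name and reads a PEM certificate from stdin.

```shell
# Succeeds if the PEM certificate on stdin is still valid N days from now.
# Argument: number of days.
cert_valid_for_days() {
  openssl x509 -noout -checkend "$(( $1 * 86400 ))" >/dev/null
}

# Usage (requires kubectl access):
#   kubectl get secret acd-manager-tls -o jsonpath='{.data.tls\.crt}' | base64 -d \
#     | cert_valid_for_days 30 || echo "certificate expires within 30 days"
```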

Renewing Certificates

# For Helm-managed self-signed certificates
helm upgrade acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml \
  --set ingress.selfSigned=true

# For manual certificates, update the secret
kubectl create secret tls acd-manager-tls \
  --cert=new-tls.crt \
  --key=new-tls.key \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart pods to pick up new certificate
kubectl rollout restart deployment acd-manager

Health Checks

Component Health

# Check all pods
kubectl get pods

# Check specific component
kubectl get pods -l app.kubernetes.io/component=manager

# Check persistent volumes
kubectl get pvc

# Check cluster status
kubectl get nodes

# Check ingress
kubectl get ingress

API Health Endpoints

# Liveness check
curl -k https://<manager-host>/api/v1/health/alive

# Readiness check
curl -k https://<manager-host>/api/v1/health/ready
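After a restart or upgrade, the readiness endpoint may take some time to respond successfully. A generic retry loop can poll it. The sketch below is illustrative: retry is a hypothetical helper, and the usage line assumes the endpoint returns a non-2xx status while the service is not ready, so that curl -f fails.

```shell
# Run a command up to N times, sleeping between attempts.
# Arguments: attempts, delay in seconds, then the command to run.
# Returns 0 on the first success and 1 if every attempt fails.
retry() {
  attempts=$1; delay=$2; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage: poll readiness for up to 5 minutes (30 attempts, 10 seconds apart)
#   retry 30 10 curl -kfsS -o /dev/null https://<manager-host>/api/v1/health/ready
```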

Database Health

# PostgreSQL cluster status
kubectl get clusters -n default

# Check PostgreSQL pods
kubectl get pods -l app.kubernetes.io/name=postgresql

# Kafka cluster status
kubectl get pods -l app.kubernetes.io/name=kafka

# Redis status
kubectl get pods -l app.kubernetes.io/name=redis

Maintenance Windows

Planned Maintenance

Before performing maintenance:

  1. Notify users of potential service impact
  2. Verify backups are current
  3. Document the maintenance procedure
  4. Prepare rollback plan

Node Maintenance

# Cordon node to prevent new pods
kubectl cordon <node-name>

# Drain node (evict pods)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Perform maintenance

# Uncordon node
kubectl uncordon <node-name>

Cluster Upgrades

See the Upgrade Guide for cluster upgrade procedures.

Troubleshooting Quick Reference

Common Commands

# Describe problematic pod
kubectl describe pod <pod-name>

# View pod events
kubectl get events --sort-by='.lastTimestamp'

# Check resource usage
kubectl top pods
kubectl top nodes

# Exec into container
kubectl exec -it <pod-name> -- /bin/sh

# Check network policies
kubectl get networkpolicies

# Check service endpoints
kubectl get endpoints

Restarting Components

# Restart deployment
kubectl rollout restart deployment/<deployment-name>

# Restart statefulset
kubectl rollout restart statefulset/<statefulset-name>

# Delete pod (auto-recreated)
kubectl delete pod <pod-name>

Security Operations

Rotating Service Account Tokens

# Delete service account secret (auto-regenerated)
kubectl delete secret <service-account-token-secret>

# Tokens are automatically regenerated

Updating RBAC Permissions

# View current roles
kubectl get roles
kubectl get clusterroles

# View role bindings
kubectl get rolebindings
kubectl get clusterrolebindings

# Edit role
kubectl edit role <role-name>

Audit Log Access

# K3s audit logs location
/var/lib/rancher/k3s/server/logs/audit.log

# View recent audit events
tail -f /var/lib/rancher/k3s/server/logs/audit.log

Disaster Recovery

Pod Recovery

Pods are automatically recreated if they fail:

# Check pod status
kubectl get pods

# If pod is stuck in Terminating
kubectl delete pod <pod-name> --force --grace-period=0

# If pod is stuck in Pending, check resources
kubectl describe pod <pod-name>
kubectl get events --sort-by='.lastTimestamp'

Node Failure Recovery

When a node fails:

  1. Automatic: Pods are rescheduled on healthy nodes (after timeout)
  2. Manual: Force delete stuck pods

# Force delete pods on failed node
kubectl delete pods --force --grace-period=0 \
  --field-selector spec.nodeName=<failed-node>

Data Recovery

For data recovery scenarios, refer to:

  • PostgreSQL: CloudNativePG backup/restore procedures
  • Longhorn: Volume snapshot restoration
  • Kafka: Partition replication handles node failures

Routine Maintenance Checklist

Daily

  • Review Grafana dashboards for anomalies
  • Check alert notifications
  • Verify backup completion

Weekly

  • Review pod restart counts
  • Check certificate expiration dates
  • Review log storage usage
  • Verify HPA is functioning correctly

Monthly

  • Test backup restoration procedure
  • Review and rotate credentials if needed
  • Update documentation if configuration changed
  • Review resource utilization trends

Next Steps

After mastering operations:

  1. Troubleshooting Guide - Deep dive into problem resolution
  2. Performance Tuning Guide - Optimize system performance
  3. Metrics & Monitoring Guide - Comprehensive monitoring setup
  4. API Guide - REST API reference and automation

9 - Metrics & Monitoring Guide

Monitoring architecture and metrics collection

Overview

The CDN Manager includes a comprehensive monitoring stack based on VictoriaMetrics for time-series data storage, Telegraf for metrics collection, and Grafana for visualization. This guide describes the monitoring architecture and how to access and use the monitoring capabilities.

| Guide | Description |
|-------|-------------|
| Grafana Dashboards | Using and customising the built-in and advanced Grafana dashboards |
| Grafana Authentication & Roles | Configuring Grafana authentication, roles, and permissions |
| Alerts & Alarms | Configuring and managing alerts and alarms |

Architecture

Components

| Component | Purpose |
|-----------|---------|
| Telegraf | Metrics collector running on each node, gathering system and application metrics |
| VictoriaMetrics Agent | Metrics scraper and forwarder; scrapes Prometheus endpoints and forwards to VictoriaMetrics |
| VictoriaMetrics (Short-term) | Time-series database for operational dashboards (30-90 day retention) |
| VictoriaMetrics (Long-term) | Time-series database for billing and compliance (1+ year retention) |
| Grafana | Visualization and dashboard platform; deployed as two replicas for high availability |
| Alertmanager | Alert routing and notification management |

Metrics Flow

The following diagram illustrates how metrics flow through the monitoring stack:

flowchart TB
    subgraph External["External Sources"]
        Streamers[Streamers/External Clients]
    end

    subgraph Cluster["Kubernetes Cluster"]
        Telegraf[Telegraf DaemonSet]

        subgraph Applications["Application Components"]
            Director[CDN Director]
            Kafka[Kafka]
            Redis[Redis]
            Manager[ACD Manager]
            Alertmanager[Alertmanager]
        end

        VMAgent[VictoriaMetrics Agent]

        subgraph Storage["Storage"]
            VMShort[VictoriaMetrics<br/>Short-term]
            VMLong[VictoriaMetrics<br/>Long-term]
        end

        Grafana[Grafana<br/>2 replicas, HA]
        PostgreSQL[(PostgreSQL)]
        Zitadel[Zitadel]
    end

    Streamers -->|Push metrics| Telegraf
    Telegraf -->|remote_write| VMShort
    Telegraf -->|remote_write| VMLong

    Director -->|Scrape| VMAgent
    Kafka -->|Scrape| VMAgent
    Redis -->|Scrape| VMAgent
    Manager -->|Scrape| VMAgent
    Alertmanager -->|Scrape| VMAgent

    VMAgent -->|remote_write| VMShort
    VMAgent -->|remote_write| VMLong

    VMShort -->|Query| Grafana
    VMLong -->|Query| Grafana

    Grafana <-->|Shared state| PostgreSQL
    Grafana -->|OAuth2 / OIDC| Zitadel

Metrics Flow Summary:

  1. External metrics ingestion:

    • External clients (streamers) push metrics to Telegraf
    • Telegraf forwards metrics via remote_write to both VictoriaMetrics instances
  2. Internal metrics scraping:

    • VictoriaMetrics Agent scrapes Prometheus endpoints from:
      • CDN Director instances
      • Kafka cluster
      • Redis
      • ACD Manager components
      • Alertmanager
    • VMAgent forwards scraped metrics via remote_write to both VictoriaMetrics instances
  3. Data visualization:

    • Grafana queries either VictoriaMetrics database, depending on the dashboard's requirements
    • Operational dashboards use short-term storage
    • Billing and compliance dashboards use long-term storage

Metrics Collection

Application Metrics

Applications expose metrics on Prometheus-compatible endpoints. VictoriaMetrics Agent (VMAgent) scrapes these endpoints and forwards metrics to VictoriaMetrics via remote_write.

System Metrics

Telegraf collects system-level metrics including:

  • CPU usage
  • Memory utilization
  • Disk I/O
  • Network statistics
  • Process metrics

Kubernetes Metrics

The following cluster metrics are collected:

  • Pod resource usage
  • Node status
  • Deployment status
  • Persistent volume usage

Metrics Retention

VictoriaMetrics is configured with default retention policies. For custom retention settings, modify the VictoriaMetrics configuration in your values.yaml:

acd-metrics:
  victoria-metrics-single:
    retentionPeriod: "3"  # Retention period in months
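
When choosing a value, it can help to convert it to days and compare it against the retention targets listed under Components. The sketch below is illustrative only and not part of the product. It assumes the VictoriaMetrics convention that a bare number means months and that the suffixes h, d, w, and y are accepted, and it approximates a month as 31 days. The helper name retention_days is hypothetical.

```python
# Illustrative helper (hypothetical, not shipped with the product): convert a
# VictoriaMetrics retentionPeriod string into an approximate number of days.
UNIT_DAYS = {"h": 1 / 24, "d": 1, "w": 7, "y": 365}
DAYS_PER_MONTH = 31  # assumption: a bare number counts months of about 31 days


def retention_days(value: str) -> float:
    value = value.strip()
    if not value:
        raise ValueError("empty retention period")
    suffix = value[-1]
    if suffix in UNIT_DAYS:
        return float(value[:-1]) * UNIT_DAYS[suffix]
    return float(value) * DAYS_PER_MONTH


if __name__ == "__main__":
    for candidate in ("3", "90d", "1y"):
        print(candidate, "->", retention_days(candidate), "days")
```

Under these assumptions, the "3" in the snippet above corresponds to roughly three months, which sits at the upper end of the 30-90 day short-term range.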

Troubleshooting

Metrics Not Appearing

If metrics are not appearing in Grafana:

  1. Check Telegraf pods:

    kubectl get pods -l app.kubernetes.io/component=telegraf
    
  2. Check Telegraf logs:

    kubectl logs -l app.kubernetes.io/component=telegraf
    
  3. Verify VictoriaMetrics is running:

    kubectl get pods -l app.kubernetes.io/component=victoria-metrics
    
  4. Check application metrics endpoints:

    kubectl exec <pod-name> -- curl localhost:8080/metrics
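
The endpoint checked in step 4 returns the Prometheus text exposition format. A quick sanity check is to count the sample lines in that output: zero samples suggests the application is not exporting anything. The sketch below is a minimal illustration; the function name count_samples and the metric names in SAMPLE are hypothetical.

```python
# Hypothetical sample of Prometheus text exposition output, for illustration.
SAMPLE = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{code="200"} 1027
http_requests_total{code="500"} 3
"""


def count_samples(exposition_text: str) -> int:
    """Count sample lines, skipping blank lines and '#' comment lines."""
    count = 0
    for line in exposition_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            count += 1
    return count


if __name__ == "__main__":
    print(count_samples(SAMPLE), "samples found")
```

In practice the text passed to count_samples would be the saved output of the curl command in step 4.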
    

For dashboard and authentication issues, see the Grafana Dashboards and Grafana Authentication & Roles guides.

Next Steps

After setting up monitoring:

  1. Grafana Authentication & Roles - Configure SSO and permissions before accessing Grafana
  2. Grafana Dashboards - Explore and customise dashboards
  3. Alerts & Alarms - Set up alerting and notifications
  4. Operations Guide - Day-to-day operational procedures
  5. Troubleshooting Guide - Resolve monitoring issues
  6. API Guide - Access metrics via API

9.1 - Grafana Authentication & Roles

Configuring Grafana authentication, roles, and permissions via Zitadel

Overview

Grafana authentication is delegated entirely to Zitadel via OAuth2/OIDC. Local username/password login is not available to end users. When a user logs into Grafana, they are redirected to Zitadel to authenticate, and their Grafana role is automatically determined by the Zitadel project roles assigned to their account.

The OIDC integration between Grafana and Zitadel is configured automatically at install time — no manual Zitadel application registration is required.

How It Works

During installation, an init container runs before Grafana starts and:

  1. Authenticates with Zitadel using a machine-account service key.
  2. Registers a Grafana OIDC application in the Zitadel project (or re-uses an existing one if already registered).
  3. Writes the resulting client_id and client_secret into a Kubernetes Secret, which Grafana picks up on startup.

This means the Grafana OIDC application in Zitadel is managed automatically and does not need to be created or modified manually.

Role Mapping

Grafana roles are mapped from Zitadel project roles using the following rule:

| Zitadel Project Role | Grafana Role |
| --- | --- |
| grafana_admin | Admin (full access; can manage users, datasources, and dashboards) |
| (any other role, or no role) | Viewer (read-only access to dashboards) |

Note: There is no Grafana Editor role mapped by default. All authenticated users who are not explicitly granted grafana_admin receive Viewer access. If you need an Editor tier, see Customising the Role Mapping.

The mapping is enforced on every login. If a user’s Zitadel role changes, the change takes effect the next time they log into Grafana.
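
The rule can be sketched in plain Python. This is an emulation for illustration, not the expression Grafana actually evaluates. It assumes the roles claim is a JSON object keyed by role name, and the organisation ID and domain in the usage example are made up. The function name grafana_role is hypothetical.

```python
ROLES_CLAIM = "urn:zitadel:iam:org:project:roles"


def grafana_role(claims: dict) -> str:
    """Emulate the default mapping: Admin if the grafana_admin key is
    present in the roles claim, otherwise Viewer."""
    roles = claims.get(ROLES_CLAIM) or {}  # absent claim is treated as no roles
    return "Admin" if "grafana_admin" in roles else "Viewer"


if __name__ == "__main__":
    # Made-up organisation ID and domain, for illustration only.
    admin_claims = {ROLES_CLAIM: {"grafana_admin": {"1234": "example.com"}}}
    print(grafana_role(admin_claims))
    print(grafana_role({}))
```

Treating an absent claim as an empty role set mirrors the behaviour described in this guide, where a missing claim results in Viewer access.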

Prerequisites

Accessing Grafana

Grafana is accessible at:

https://<manager-host>/grafana

Important: Grafana must be accessed using the DNS name specified in the first entry of global.hosts.manager in your configuration. Accessing Grafana via an IP address or an alternative hostname will cause OAuth2 redirect URI mismatches and CORS errors, preventing login from completing successfully.

To log in:

  1. Navigate to https://<manager-host>/grafana
  2. Click “Login with Zitadel”
  3. Authenticate with your Zitadel account credentials

Granting Admin Access

By default, all Zitadel users who log into Grafana receive Viewer access. To grant a user Admin access, assign them the grafana_admin project role in Zitadel.

Step 1: Ensure the grafana_admin Role Exists

  1. Log into the Zitadel Console at https://<manager-host>/ui/console
  2. Navigate to Projects and open the ZITADEL project
  3. Click the Roles tab
  4. Check whether a role named grafana_admin already exists
  5. If it does not exist, click New Role and create it:
    • Key: grafana_admin
    • Display Name: Grafana Admin (or any label you prefer)
    • Click Save

Step 2: Assign the Role to a User

  1. In the Zitadel Console, navigate to Users and open the user you want to grant admin access to
  2. Click the Authorizations tab
  3. Click New Authorization
  4. Select the ZITADEL project
  5. Select the grafana_admin role
  6. Click Save

The user will have Grafana Admin access the next time they log in.

Revoking Admin Access

To demote a user back to Viewer, remove the grafana_admin authorization from their account:

  1. In the Zitadel Console, open the user’s Authorizations tab
  2. Find the grafana_admin authorization on the ZITADEL project
  3. Click the delete icon to remove it

The change takes effect on their next Grafana login.

Customising the Role Mapping

The role mapping expression is configured in values.yaml under grafana."grafana.ini".auth.generic_oauth.role_attribute_path. It uses JMESPath syntax evaluated against the OIDC token’s role claims.

The default expression is:

grafana:
  "grafana.ini":
    auth.generic_oauth:
      role_attribute_path: >-
        contains(keys("urn:zitadel:iam:org:project:roles"), 'grafana_admin') && 'Admin' || 'Viewer'

Example: Adding an Editor Tier

To map a grafana_editor Zitadel role to Grafana’s Editor role, create the grafana_editor role in Zitadel (following the same steps as above) and extend the expression:

grafana:
  "grafana.ini":
    auth.generic_oauth:
      role_attribute_path: >-
        contains(keys("urn:zitadel:iam:org:project:roles"), 'grafana_admin') && 'Admin'
        || contains(keys("urn:zitadel:iam:org:project:roles"), 'grafana_editor') && 'Editor'
        || 'Viewer'

Apply the change using the standard upgrade procedure in the Configuration Guide.

Restricting Access to Users with an Explicit Role

By default, role_attribute_strict is set to false, which means any authenticated Zitadel user can log into Grafana as a Viewer even if they have no explicit Grafana role assigned. To restrict Grafana access to only users who have been explicitly granted a role, set this to true:

grafana:
  "grafana.ini":
    auth.generic_oauth:
      role_attribute_strict: true

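With role_attribute_strict: true, users for whom the role_attribute_path expression returns no valid role are denied access entirely. Note that the default expression ends with a || 'Viewer' fallback, which returns a role for every user; for strict mode to deny anyone, that fallback must also be removed from role_attribute_path.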

Managing Users in Grafana

User accounts in Grafana are created automatically on first login via Zitadel. There is no need to pre-create users in the Grafana UI.

To view and manage users who have logged in:

  1. Log into Grafana as an Admin
  2. Navigate to Administration > Users and access > Users

From here you can see each user’s current role, last login time, and authentication provider. Always change roles in Zitadel (as described above) rather than directly in Grafana: any role set in Grafana is overwritten on the user’s next login.

Break-Glass Admin Access

A local Grafana admin account is available as a break-glass fallback for situations where Zitadel is unavailable. This account is not accessible via the standard login page (which only shows the Zitadel SSO button).

To use the local admin account, navigate directly to:

https://<manager-host>/grafana/login

The default credentials are listed in the Glossary. Change the default password immediately after installation.

Security recommendation: The break-glass account should be used only for emergency access. Do not use it for routine administration.

Troubleshooting

OAuth2 Redirect URI Mismatch / CORS Errors

Grafana is registered in Zitadel with the redirect URI https://<manager-host>/grafana/login/generic_oauth, derived from the first entry of global.hosts.manager. Accessing Grafana via a different hostname or IP address will not match this URI and will cause the login to fail.

Resolution: Always access Grafana via the configured hostname. If the hostname has changed, re-run the helm upgrade to re-register the application with the updated URI.
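
The hostname check can be expressed as a short sketch. It assumes, for illustration only, that global.hosts.manager is available as a list of hostname strings; the actual structure of that setting may differ. The function names and example hostnames are hypothetical.

```python
from urllib.parse import urlsplit


def expected_redirect_uri(manager_hosts: list[str]) -> str:
    """Build the redirect URI from the first entry of the manager host list."""
    return f"https://{manager_hosts[0]}/grafana/login/generic_oauth"


def host_matches(browser_url: str, manager_hosts: list[str]) -> bool:
    """True if the URL in the browser uses the registered hostname."""
    registered = urlsplit(expected_redirect_uri(manager_hosts)).hostname
    return urlsplit(browser_url).hostname == registered


if __name__ == "__main__":
    hosts = ["manager.example.com", "alt.example.com"]  # hypothetical values
    print(expected_redirect_uri(hosts))
    print(host_matches("https://192.0.2.10/grafana", hosts))
```

Only the first list entry is used, which is why a login attempted through a secondary hostname or an IP address fails.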

User Receives Viewer Instead of Admin

The grafana_admin role is not included in the user’s Zitadel token.

Resolution:

  1. Confirm the grafana_admin role exists on the ZITADEL project in the Zitadel Console
  2. Confirm the role is assigned to the user under their Authorizations tab
  3. Ask the user to log out of Grafana and log back in — role changes are applied on the next login, not the current session

Login Fails with “Role not found” or Access Denied

role_attribute_strict may be set to true and the user has no matching Zitadel role.

Resolution: Either assign the user an appropriate Zitadel project role, or set role_attribute_strict: false in values.yaml to allow all authenticated users Viewer access.

Admin Role Assigned in Zitadel but User Still Gets Viewer

The grafana_admin role is correctly assigned to the user in Zitadel, but Grafana still grants them Viewer access. This indicates that role claims are not being included in the Zitadel userinfo response.

Grafana determines roles by calling the Zitadel userinfo endpoint (/oidc/v1/userinfo) and evaluating the urn:zitadel:iam:org:project:roles claim. Zitadel only includes this claim when the Grafana OIDC application has Access Token Role Assertions enabled. If the claim is absent, the role_attribute_path expression always falls through to 'Viewer'.

To verify and fix:

  1. Log into the Zitadel Console at https://<manager-host>/ui/console
  2. Navigate to Projects > ZITADEL > Applications > Grafana
  3. Open the Token Settings tab
  4. Ensure Access Token Role Assertions is enabled
  5. Save the change

The fix takes effect on the user’s next login — no Grafana or Helm changes are required.

Grafana OIDC App Not Registered in Zitadel

If the init container failed during installation, the Grafana OIDC application may not have been created in Zitadel.

Resolution: Check the init container logs for errors:

kubectl logs -l app.kubernetes.io/component=grafana --previous -c zitadel-oauth-setup

Common causes are Zitadel not being ready when the init container ran, or a machine-key permission issue. Re-running the helm upgrade will re-trigger the init container and attempt registration again.

Next Steps

  1. Grafana Dashboards - Using and customising dashboards
  2. Alerts & Alarms - Configure alerting and notifications
  3. Metrics & Monitoring Overview - Return to the monitoring overview

9.2 - Grafana Dashboards

Using and customising Grafana dashboards

Overview

Grafana is the primary visualization platform for the CDN Manager monitoring stack. It provides pre-built dashboards for cluster health, application performance, and billing analytics, and is accessible via the manager ingress.

Prerequisites

  • Grafana is deployed and running (verify with kubectl get pods -l app.kubernetes.io/component=grafana)
  • A Zitadel user account is available for login
  • Grafana is accessed via the correct DNS hostname (see Grafana Authentication & Roles)

Accessing Grafana

Grafana is accessible via the manager ingress:

URL: https://<manager-host>/grafana

To log in:

  1. Navigate to https://<manager-host>/grafana
  2. Click the “Login with Zitadel” button
  3. Authenticate with your Zitadel account credentials

Important: Grafana must be accessed using the DNS name specified in the first entry of global.hosts.manager in your configuration. Accessing Grafana via an IP address or an alternative hostname will cause OAuth2 redirect URI mismatches and CORS errors, preventing login from completing successfully.

For details on authentication and role configuration, see Grafana Authentication & Roles.

Standard Dashboards

Accessing Dashboards

After logging into Grafana:

  1. Navigate to Dashboards in the left menu
  2. Browse the folder structure to find the dashboard you need
  3. Click on a dashboard to open it

Dashboards are organised into the following folders:

  • Alerting — alert state history and alerting system health
  • Billing — redirect counts for billing analytics
  • CDN Manager — ACD Manager API performance
  • Hardware — host-level CPU, memory, disk, and network telemetry
  • Infrastructure — Kubernetes cluster, Kafka, Longhorn, and Redis health
  • Streaming — CDN routing, streamer performance, and QoE
  • Internal Debugging — low-level ACD Director diagnostics

Alerting

Active Alarms

A live view of all currently firing alerts. Shows the alert name, severity, affected host, and description. Use this as the first stop when investigating an active incident.

Alert Statistics

Historical view of alert firing activity over time. Shows which alert groups and individual rules have been firing, with timelines and trend charts. Useful for identifying recurring or flapping alerts.

vmalert

Operational health dashboard for the vmalert component itself. Covers evaluation rate, evaluation errors, alerting and recording rule counts, remote write throughput, and resource usage. Use this to verify the alerting pipeline is functioning correctly.


Billing

Billing Dashboard

Tracks redirect volumes for billing and usage analytics. Shows initial managed and unmanaged redirects, segment redirects, and endpoint redirects — both as totals and as ratios over time. Data is sourced from long-term VictoriaMetrics storage to support historical reporting.


CDN Manager

CDN Manager API

Health and performance dashboard for the ACD Manager REST API. Covers:

  • Overview: API health status, active pod count, total request volume, 5xx error rate, and average latency
  • Traffic: Request rate by pod, distribution across API endpoints
  • Errors: 5xx errors per endpoint, response code breakdown per endpoint, error rate by pod
  • Latency: P99 and average latency by endpoint, overall API response latency
  • Resources & Auth: Route validation API activity

Hardware

HW Metrics

Condensed host hardware overview covering CPU usage and load averages, memory utilisation, network interface throughput, swap usage, and root filesystem disk space. Suitable for day-to-day health checks across all cluster nodes.

An expanded HW Metrics (Advanced) dashboard is available as part of the Advanced Dashboards licence.


Infrastructure

k3s Cluster Infrastructure

Kubernetes cluster health overview using node-exporter and kube-state-metrics. Covers:

  • Cluster Overview: Node count, running pod count, OOMKilled containers, and overall cluster health status
  • Compute: CPU usage, memory usage, and load average per node
  • Network: Inbound and outbound bytes per node
  • Disk: Read/write throughput and I/O pressure per node
  • Longhorn PVC Disk Usage: Usage percentage per persistent volume
  • Workload Health: Pod restart counts and OOMKill occurrences

Kafka

Kafka broker health using JMX exporter metrics. Covers:

  • Cluster Health: Active controller, broker state, topic and partition counts, offline and under-replicated partitions, active and fenced broker counts, metadata log lag
  • Throughput: Bytes in/out and messages in by topic, replication bytes in/out
  • Internals: Request handler idle percentage, network processor idle percentage

Longhorn Storage

Persistent storage health for the Longhorn distributed block storage layer. Covers:

  • Overview: Total, healthy, degraded, and faulted volume counts; nodes down
  • Capacity: Total cluster capacity, used, and available storage
  • Volume Detail: Usage percentage per volume, actual size per volume, volume robustness state, volumes approaching capacity (>85%)
  • Node & Disk: Disk usage percentage and available bytes per node, node condition checks

Redis

Redis instance health using redis-exporter metrics. Covers:

  • Instance Health: Status, uptime, connected and blocked clients, slow log length, rejected connections
  • Memory: Usage and fragmentation ratio
  • Throughput & Keyspace: Commands processed, network I/O, keyspace hit rate, keys per database
  • Evictions & Persistence: Evictions, expirations, RDB unsaved changes
  • CPU & Connections: CPU usage and connection metrics
  • Command Analysis: Per-command breakdown

Streaming

Extended Monitoring

The primary operational dashboard for CDN routing activity. This is the home dashboard displayed on Grafana login. Covers:

  • Latency Statistics: ACD router latency and CDN latency over time
  • Redirects: Total redirect volume, status code breakdown, managed vs unmanaged ratio
  • Content Popularity: Top 10 requested content and top 10 most rapidly increasing popularity scores
  • CDN Selection: Redirect distribution across CDN endpoints, current and historical ratios
  • CDN Failovers and Retries: Failover events and retry rates by CDN
  • Host Selection: Endpoint request distribution
  • Session Statistics: Active session counts and session type breakdown
  • Client Responses: Client-facing HTTP status code distribution
  • Incoming Requests: Raw request volume
  • HTTPS Certificate Statistics: Certificate validity and expiry indicators
  • Warnings & Errors: Application-level warnings and errors over time
  • LUA Statistics: Lua exception counts and execution time
  • Configuration Change History: Timeline of routing configuration changes

Router Monitoring

External-facing view of ACD Director routing activity. Shows the number of initial routing decisions made, HTTP status code distribution, incoming HTTP/HTTPS request volumes, and selection input metrics. Useful for a high-level view of traffic hitting the directors.

QoE Monitoring

Quality of Experience scoring dashboard. Shows average QoE scores per host, per session group, per CDN, and per agent, as well as the initial CDN selection rate. Use this to identify CDN providers or content hosts that are delivering a degraded experience.

Streamer Statistics

Condensed view of streamer node performance, covering network ingress/egress throughput, TCP and HTTP connection counts, active session counts, HTTP request rates and response codes, response times (ingress and egress), and storage/memory/CPU. Suitable for routine streamer health monitoring.

An expanded Streamer Statistics (Advanced) dashboard is available as part of the Advanced Dashboards licence.


Internal Debugging

These dashboards expose low-level ACD Director internals and are primarily intended for advanced diagnostics and support investigations.

Debugging Information

Lua runtime statistics from the ACD Director: exception counts, active Lua context count, time spent in Lua execution, and router latency. Use when investigating unexpected Director behaviour or Lua errors.

ACD: Incoming Internet Connections

SSL-level connection statistics at the Director: SSL warnings and errors, valid and invalid HTTP and HTTPS request counts from external clients. Use when investigating TLS handshake failures or unexpected rejection rates.

Performance Metrics

ACD Director process-level resource usage: router CPU utilisation, router memory usage, and Lua memory consumption. Useful for identifying resource pressure on the Director process itself rather than the host.

Prometheus: ACD

ACD application metrics exposed via Prometheus: active and total session counts, session type breakdown, managed and unmanaged redirect counts, QoE corrections, manifest parse failures, initial endpoint request counts, HTTP request rates, and logged warnings and errors.

CDN Failures

CDN-level failure tracking: response code distribution from CDN backends, CDN-level failover events, host-level failovers, and host retry counts. Use when investigating CDN reliability issues or failover behaviour.

ACD: CDN Latencies Detail

Detailed CDN latency analysis with configurable percentile plots and a full latency histogram. Use when investigating tail latency issues on specific CDN backends.

ACD: Router Latencies

ACD Director routing latency distributions for both 2xx (successful) and 3xx (redirect) responses, visualised as heatmap buckets over time. Use alongside CDN Latencies Detail to separate router processing time from CDN response time.

Prometheus/ACD: SubRunners

Internal async processing queue depth and throughput metrics for the ACD Director subrunner system: client connection counts, low/medium/high priority queue depths (current and max), send/receive data block usage, wakeup counts, overload events, and autopause activations. Use when investigating Director throughput bottlenecks or queue backpressure.

Advanced Dashboards

Advanced dashboards are a paid add-on that unlocks more detailed variants of two standard dashboards, providing deeper visibility for performance investigation and capacity planning.

Licensing: Advanced dashboards require a separate licence key. To obtain a key for your deployment, contact your AgileTV account representative.

Enabling Advanced Dashboards

Once you have your licence key, add the following to your values.yaml:

dashboards:
  advanced:
    licenceKey: "<your-licence-key>"

Then apply the change by following the upgrade procedure in the Configuration Guide. The advanced dashboards will become available in Grafana automatically once the upgrade completes.

HW Metrics (Advanced)

Expanded hardware telemetry that supplements the standard HW Metrics dashboard with greater depth and extra sections: kernel metrics, TCP and UDP network stack statistics, per-interface error counters, disk IOPS, and metrics collection velocity. Use this when investigating performance anomalies surfaced by the standard dashboard or by alerts.

Streamer Statistics (Advanced)

Full streamer telemetry that supplements the standard Streamer Statistics dashboard. Includes all standard metrics plus:

  • OTT JCQ: Server group and node request rates and ratios, circuit breaker state (closed/open backends), global backend stats, pending requests per backend, and pop-out per backend
  • Account Records: Session counts, total traffic in/out, HTTP request rates (ingress/egress), cache hit ratio, backend request rates, and response times
  • Detailed network error and drop counters from /proc/net/dev

CDN Director Metrics

Director DNS Names in Grafana

CDN Director instances are identified in Grafana by their DNS name, which is derived from the name field in global.hosts.routers:

global:
  hosts:
    routers:
      - name: my-router-1
        address: 192.0.2.1

The DNS name used in Grafana dashboards will be: my-router-1.external

This naming convention is automatically applied for all configured directors.
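
As a minimal sketch of the naming rule, assuming the routers list has been loaded from values.yaml into Python dictionaries (the function name director_dns_names is hypothetical):

```python
def director_dns_names(routers: list[dict]) -> list[str]:
    """Append the '.external' suffix to each configured router name."""
    return [f"{router['name']}.external" for router in routers]


if __name__ == "__main__":
    routers = [{"name": "my-router-1", "address": "192.0.2.1"}]
    print(director_dns_names(routers))
```

The resulting names are the values to look for in dashboard host selectors.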

Customising Dashboards

Permissions: Creating and importing dashboards requires Grafana Admin access. See Grafana Authentication & Roles for details on granting admin rights.

The pre-provisioned dashboards are read-only and managed by the Helm chart — changes made to them in the Grafana UI will not persist across upgrades. To create persistent custom dashboards:

  1. In Grafana, navigate to Dashboards > New > New Dashboard
  2. Add panels using the VictoriaMetrics or Prometheus datasource
  3. Save the dashboard to a folder of your choice

Custom dashboards saved this way are stored in the Grafana PostgreSQL database and are unaffected by Helm upgrades.

Note: Do not save custom dashboards into the provisioned folders (Alerting, Billing, CDN Manager, etc.). Grafana marks these folders as provisioned and may behave unexpectedly if user dashboards are mixed in.

Customising a Pre-provisioned Dashboard

If you want a modified version of one of the built-in dashboards as a starting point:

  1. Open the dashboard you want to customise
  2. Click the Share icon (top toolbar) > Export > Save to file to download the dashboard JSON
  3. In Grafana, navigate to Dashboards > New > Import
  4. Upload the downloaded JSON file
  5. On the import screen, give the dashboard a new name to distinguish it from the original, and choose a destination folder outside the provisioned set
  6. Click Import

You now have an independently editable copy. The original provisioned dashboard remains unchanged and will continue to be updated by future Helm upgrades. Your copy is stored in PostgreSQL and persists across upgrades independently.
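
If you prefer to prepare the exported file before uploading it, the renaming step can be scripted. The sketch below is one possible approach, not a required procedure. It assumes the export contains the usual top-level title, id, and uid fields, and it clears id and uid as a precaution so the import is treated as a new dashboard. The function name prepare_copy is hypothetical.

```python
import json


def prepare_copy(dashboard_json: str, new_title: str) -> str:
    """Return dashboard JSON with a new title and cleared id/uid."""
    dashboard = json.loads(dashboard_json)
    dashboard["title"] = new_title
    dashboard["id"] = None   # assumption: lets Grafana assign a new numeric id
    dashboard["uid"] = None  # assumption: lets Grafana generate a new uid
    return json.dumps(dashboard, indent=2)


if __name__ == "__main__":
    exported = '{"id": 7, "uid": "abc", "title": "HW Metrics", "panels": []}'
    print(prepare_copy(exported, "HW Metrics (custom)"))
```

All other fields, including panels and datasource references, are passed through unchanged.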

Troubleshooting

Dashboard Loading Issues

If dashboards fail to load:

  1. Check Grafana pods:

    kubectl get pods -l app.kubernetes.io/component=grafana
    
  2. Review Grafana logs:

    kubectl logs -l app.kubernetes.io/component=grafana
    
  3. Verify datasource configuration in Grafana UI

For login and authentication issues, see Grafana Authentication & Roles.

Next Steps

  1. Alerts & Alarms - Set up alerting and notifications
  2. Operations Guide - Day-to-day operational procedures
  3. Metrics & Monitoring Overview - Return to the monitoring overview

9.3 - Alerts & Alarms

Configuring and managing alerts and alarms

Overview

The CDN Manager ships a set of pre-configured alerting rules evaluated by vmalert against VictoriaMetrics. When a rule fires, the alert is routed to Alertmanager, which handles deduplication, grouping, silencing, and delivery to configured notification channels.

This page documents every built-in alert rule, what it means, its severity, and the recommended operator action.

Alert Severity Levels

| Severity | Meaning |
| --- | --- |
| critical | Immediate action required. The condition poses a risk to data integrity, service availability, or active traffic. |
| warning | Investigate soon. The condition is not immediately harmful but will degrade into a critical state if left unattended. |

Alert Groups

Alerts are organised into the following groups, each evaluated on a 15-second interval.
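
Each rule below also lists a "Must persist for" duration. The following sketch models how that duration interacts with the 15-second evaluation interval, using the common Prometheus-style semantics in which an alert is pending until its condition has held for the full duration. It is a simplified model for illustration; vmalert's actual behaviour may differ in detail, and the function name alert_states is hypothetical.

```python
EVAL_INTERVAL_S = 15  # evaluation interval stated above


def alert_states(condition_results: list[bool], for_seconds: int) -> list[str]:
    """Return 'inactive', 'pending', or 'firing' for each evaluation."""
    states = []
    active_since = None  # time at which the condition first became true
    for step, is_true in enumerate(condition_results):
        now = step * EVAL_INTERVAL_S
        if not is_true:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


if __name__ == "__main__":
    # Condition true for nine consecutive evaluations, 2-minute duration.
    print(alert_states([True] * 9, for_seconds=120))
```

In this model a brief spike that clears before the duration elapses never reaches the firing state, which is why short-lived conditions do not produce notifications.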


infra-disk

Monitors disk space utilisation and I/O latency on cluster nodes.

StorageFillingUp

| Property | Value |
| --- | --- |
| Severity | warning |
| Condition | Root filesystem usage exceeds 85% |
| Must persist for | 2 minutes |

What it means: A node’s root filesystem is running low on space. If left unchecked this will progress to a full disk, which can cause pod evictions, write failures, and potential data loss.

Recommended actions:

  1. Identify the node from the host label in the alert.
  2. Log into the node and check disk usage:
    df -h /
    du -sh /var/log/* | sort -rh | head -20
    
  3. Clear old log files, unused container images, or temporary files:
    # On the node
    journalctl --vacuum-size=500M
    crictl rmi --prune
    
  4. If disk usage is due to application data growth, consider expanding the volume or adjusting retention settings. See Metrics Retention.
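
To check a node against the same threshold from a script, a sketch along the following lines could be used. It computes usage as used divided by total, which may differ slightly from both the figure reported by df and the expression used by the alert rule. The function names are hypothetical.

```python
import shutil

THRESHOLD_PERCENT = 85  # StorageFillingUp condition from the table above


def usage_percent(total_bytes: int, used_bytes: int) -> float:
    """Used space as a percentage of total space."""
    return 100.0 * used_bytes / total_bytes


def root_fs_over_threshold(path: str = "/") -> bool:
    """True if the filesystem containing `path` is above the threshold."""
    usage = shutil.disk_usage(path)
    return usage_percent(usage.total, usage.used) > THRESHOLD_PERCENT


if __name__ == "__main__":
    print("over threshold:", root_fs_over_threshold("/"))
```

Run on the node itself, this gives a quick yes/no answer before and after clean-up.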

HighDiskLatency

| Property | Value |
| --- | --- |
| Severity | warning |
| Condition | Average disk write latency exceeds 100 ms |
| Must persist for | 2 minutes |

What it means: Disk write operations are taking longer than 100 ms on average. High disk latency can degrade database performance (PostgreSQL, VictoriaMetrics) and cause timeouts in write-heavy components.

Recommended actions:

  1. Identify the affected disk from the name label in the alert.
  2. Check for I/O-intensive processes on the node:
    iostat -x 2 5
    iotop -o
    
  3. Check for Longhorn replica rebuilds or rebalancing activity, which can saturate disk I/O.
  4. If the issue persists on a production node, review whether the storage hardware meets the System Requirements.
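
Average write latency is commonly derived from two readings of cumulative disk counters: the increase in time spent writing divided by the increase in completed writes. The sketch below illustrates that arithmetic only; it is an assumption that the alert rule uses an equivalent calculation, and the DiskSample type and function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DiskSample:
    writes_completed: int  # cumulative count of completed writes
    write_time_ms: int     # cumulative milliseconds spent writing


def avg_write_latency_ms(earlier: DiskSample, later: DiskSample) -> float:
    """Average latency per write between two counter readings."""
    writes = later.writes_completed - earlier.writes_completed
    if writes <= 0:
        return 0.0  # no writes in the interval, so no latency to report
    return (later.write_time_ms - earlier.write_time_ms) / writes


if __name__ == "__main__":
    first = DiskSample(writes_completed=1000, write_time_ms=50_000)
    second = DiskSample(writes_completed=1100, write_time_ms=65_000)
    print(avg_write_latency_ms(first, second), "ms per write")
```

In this example, 100 writes took 15,000 ms in total, giving an average that is above the 100 ms threshold.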

infra-compute

Monitors CPU and memory utilisation on cluster nodes.

CpuSaturation

| Property | Value |
| --- | --- |
| Severity | warning |
| Condition | Total CPU usage exceeds 90% |
| Must persist for | 5 minutes |

What it means: A node is running at near-full CPU capacity. Sustained CPU saturation causes request latency increases across all workloads on that node and may result in pod throttling.

Recommended actions:

  1. Identify the saturated node from the host label in the alert.
  2. Check which pods are consuming CPU:
    kubectl top pods --sort-by=cpu -A
    
  3. Check for runaway processes on the node:
    top -b -n 1 | head -20
    
  4. If saturation is caused by a legitimate workload spike (e.g. CDN traffic burst), consider scaling the deployment or redistributing load across nodes.

MemoryCriticallyLow

| Property | Value |
| --- | --- |
| Severity | critical |
| Condition | Available RAM falls below 10% |
| Must persist for | 2 minutes |

What it means: The node has very little free memory remaining. The Linux OOM killer may begin terminating processes, which can cause abrupt pod restarts, data corruption in in-memory caches, and service unavailability.

Recommended actions:

  1. Identify the affected node from the host label in the alert.
  2. Immediately check for memory-leaking or oversized pods:
    kubectl top pods --sort-by=memory -A
    
  3. Identify and restart any pods showing abnormal memory consumption:
    kubectl rollout restart deployment/<name>
    
  4. Check kernel OOM kill log for any processes already killed:
    dmesg | grep -i "oom\|killed"
    
  5. Review memory resource limits and requests for affected deployments and adjust if necessary.
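
When working directly on a node, the available-memory percentage can be read from /proc/meminfo. The sketch below is a minimal illustration that assumes the alert condition corresponds to MemAvailable as a share of MemTotal; the rule's actual expression may differ. The function name available_percent and the sample values are hypothetical.

```python
def available_percent(meminfo_text: str) -> float:
    """Parse /proc/meminfo-style text; return MemAvailable as % of MemTotal."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])  # first field is the number
    return 100.0 * values["MemAvailable"] / values["MemTotal"]


if __name__ == "__main__":
    # Made-up sample values for illustration.
    sample = (
        "MemTotal:       16000000 kB\n"
        "MemFree:          500000 kB\n"
        "MemAvailable:    1200000 kB\n"
    )
    print(available_percent(sample), "% available")
```

On a live node, pass the contents of /proc/meminfo to the function instead of the sample text.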

SwapUsageDetected

| Property | Value |
| --- | --- |
| Severity | warning |
| Condition | Swap usage exceeds 5% |
| Must persist for | 1 minute |

What it means: The node is swapping memory to disk. Swap usage in a Kubernetes cluster is a strong indicator of memory pressure. It degrades performance significantly and may mask an underlying memory shortage that could escalate to a MemoryCriticallyLow event.

Recommended actions:

  1. Treat this as an early warning for the same conditions as MemoryCriticallyLow.
  2. Identify memory-intensive pods and investigate whether resource limits are configured appropriately.
  3. Swap should ideally never be active on a production Kubernetes node. If it persists, escalate to a memory capacity review.

infra-network

Monitors network interface errors and traffic anomalies on cluster nodes.

NetworkInterfaceErrors

| Property | Value |
| --- | --- |
| Severity | critical |
| Condition | Any non-zero rate of inbound or outbound packet errors on a network interface |
| Must persist for | 1 minute |

What it means: A network interface is dropping or corrupting packets. Even a low error rate can cause TCP retransmissions, increased latency, and connection failures — directly impacting CDN Director communication and external traffic delivery.

Recommended actions:

  1. Identify the affected host and interface from the host and interface labels in the alert.
  2. Check interface error counters on the node:
    ip -s link show <interface>
    ethtool -S <interface> | grep -i error
    
  3. Check for duplex/speed mismatches between the node NIC and the upstream switch:
    ethtool <interface> | grep -E "Speed|Duplex"
    
  4. Escalate to network/hardware team if errors are persistent and cannot be attributed to a software configuration issue.

SuddenNetworkEgressDrop

Property | Value
Severity | critical
Condition | Egress throughput drops to less than 50% of the 2-minute baseline, when baseline traffic is above 1 Mbit/s
Must persist for | 1 minute

What it means: A significant, sudden reduction in outbound traffic has been detected. This typically indicates an upstream network failure, a link fault, or a routing issue. A CDN node that stops transmitting traffic is effectively out of service.

Recommended actions:

  1. Identify the affected node and interface from the alert labels.
  2. Verify the node’s network connectivity:
    ping <gateway-ip>
    traceroute <upstream-endpoint>
    
  3. Check for interface errors or link-down events:
    ip link show
    dmesg | grep -i "link\|eth\|nic"
    
  4. Verify that upstream routing and firewall rules have not changed.
  5. If the node is healthy and the traffic reduction is expected and understood (e.g. a CDN traffic shift), the alert can be silenced.

SuddenNetworkIngressSpike

Property | Value
Severity | warning
Condition | Ingress throughput exceeds twice the 5-minute baseline
Must persist for | 1 minute

What it means: A sudden surge of inbound traffic has been detected. This may indicate a legitimate traffic event (e.g. a large stream audience spike), a DDoS attempt, or a misconfigured client sending unexpected volume.

Recommended actions:

  1. Identify the affected node and interface from the alert labels.
  2. Review active connections and top talkers:
    ss -s
    netstat -an | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head
    
  3. Correlate with CDN Director metrics in Grafana to determine whether the spike is legitimate CDN traffic.
  4. If the spike is unexpected and sustained, consider rate-limiting or blocking the source at the network edge.

longhorn

Monitors the health of Longhorn distributed block storage, which backs persistent volumes for PostgreSQL, VictoriaMetrics, and other stateful components.

Note: Longhorn alert rules are always present in the alert configuration, but will not fire in environments where Longhorn is not installed (e.g. cloud deployments using external storage).

LonghornVolumeDegraded

Property | Value
Severity | warning
Condition | A Longhorn volume’s robustness state is Degraded
Must persist for | 2 minutes

What it means: A Longhorn volume has fewer healthy replicas than its configured replication factor. Data is not at immediate risk, but the volume has reduced redundancy. A single additional node or disk failure could result in data loss or volume unavailability.

Recommended actions:

  1. Identify the affected volume from the volume label in the alert.
  2. Open the Longhorn UI and inspect the volume’s replica status.
  3. Check whether a replica is in the process of rebuilding (this is normal after a node restart). Rebuilding may take several minutes depending on volume size.
  4. If a replica has failed and is not rebuilding, attempt to evict and re-schedule it via the Longhorn UI.
  5. Investigate the health of the node that hosted the failed replica:
    kubectl get nodes
    kubectl describe node <node-name>
    

LonghornVolumeFaulted

Property | Value
Severity | critical
Condition | A Longhorn volume’s robustness state is Faulted
Must persist for | 1 minute

What it means: A Longhorn volume has lost all healthy replicas and is no longer accessible. Any workload that depends on this volume (e.g. PostgreSQL, VictoriaMetrics) will be unable to write and may crash. Data may be at risk.

Recommended actions:

  1. Identify the affected volume from the volume label.
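  2. Immediately check which pods are using the volume. A dynamically provisioned Longhorn volume normally shares its name with the PersistentVolume, so find the bound claim and read its Used By field:
    kubectl get pvc -A | grep -i <volume-name>
    kubectl describe pvc <pvc-name> -n <namespace>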
    
  3. Open the Longhorn UI. Check whether any replicas are still present and whether they can be recovered.
  4. Do not delete faulted replicas without first attempting recovery — they may contain the only copy of the data.
  5. Contact AgileTV support if the volume cannot be recovered, providing Longhorn UI screenshots and node logs.

LonghornNodeDown

Property | Value
Severity | critical
Condition | A Longhorn node reports a non-ready state
Must persist for | 2 minutes

What it means: A storage node is unreachable or unhealthy from Longhorn’s perspective. All volumes with replicas on this node are at reduced redundancy. If more than one node goes down simultaneously, faulted volumes and data loss become a risk.

Recommended actions:

  1. Identify the affected node from the node label in the alert.
  2. Check the node’s status in Kubernetes:
    kubectl get nodes
    kubectl describe node <node-name>
    
  3. Attempt to SSH to the node and check system health:
    ssh root@<node-ip>
    systemctl status k3s
    
  4. If the node has crashed and cannot be recovered quickly, consider evicting its Longhorn replicas to allow rebuilding on healthy nodes — but only if the remaining healthy nodes have sufficient capacity.

LonghornDiskSpaceLow

Property | Value
Severity | warning
Condition | Available Longhorn disk space on a node falls below 15%
Must persist for | 2 minutes

What it means: A node’s Longhorn-managed disk is running low on storage. When Longhorn disk space is exhausted, it cannot schedule new replicas or accommodate volume growth, which can lead to LonghornVolumeDegraded or LonghornVolumeFaulted conditions.

Recommended actions:

  1. Identify the affected node and disk from the node and disk labels in the alert.
  2. Open the Longhorn UI and check which volumes have replicas on this disk.
  3. Check for snapshots or backups that can be cleaned up to reclaim space.
  4. If space cannot be reclaimed, consider adding a disk to the node or expanding the underlying block device.
  5. Review Metrics Retention settings — reducing VictoriaMetrics retention is often the fastest way to reclaim Longhorn disk space in a monitoring-heavy deployment.

Adding Custom Alert Rules

Additional alert rules can be defined by extending the victoria_metrics_alert.server.config.alerts.groups list in your values.yaml. Rules follow the Prometheus alerting rule format.

Example: Adding a Custom Alert

The following example adds an alert group that fires when Kafka consumer lag exceeds a threshold:

victoria_metrics_alert:
  server:
    config:
      alerts:
        groups:
          # ... existing groups are preserved alongside your additions ...
          - name: kafka
            interval: 15s
            rules:
              - alert: KafkaConsumerLagHigh
                expr: kafka_consumer_group_lag > 10000
                for: 5m
                labels:
                  severity: warning
                annotations:
                  summary: "High consumer lag on {{ $labels.topic }}"
                  description: "Consumer group {{ $labels.group }} is {{ $value }} messages behind on topic {{ $labels.topic }}."

Apply the change using the standard upgrade procedure in the Configuration Guide.

Rule Fields Reference

Field | Required | Description
alert | Yes | Alert name. Must be unique within the group.
expr | Yes | PromQL expression. The alert fires for each time series the expression returns, so an empty result means no alert.
for | No | How long the condition must hold before the alert fires. If omitted, the alert fires on the first evaluation in which the condition holds.
labels.severity | Recommended | Set to critical or warning to match the built-in routing rules.
annotations.summary | Recommended | Short human-readable description. Supports Go template labels (e.g. {{ $labels.host }}).
annotations.description | Recommended | Detailed description with context for the on-call operator.

Tip: Use the Alertmanager UI (https://<manager-host>/alertmanager) to verify that fired alerts are being received and routed correctly after adding new rules.
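
Before applying a new group, it can help to check it against the field table above. The following is a minimal sketch, not part of the product: check_group is a hypothetical helper, and it assumes the group has already been loaded from values.yaml into a plain Python dict.

```python
from collections import Counter

ALLOWED_SEVERITIES = {"critical", "warning"}  # values matched by the built-in routes


def check_group(group):
    """Return a list of problems found in one alert group (empty list = OK)."""
    problems = []
    rules = group.get("rules", [])
    for index, rule in enumerate(rules):
        label = rule.get("alert", f"rule #{index}")
        for field in ("alert", "expr"):  # the two required fields
            if not rule.get(field):
                problems.append(f"{label}: missing required field '{field}'")
        severity = rule.get("labels", {}).get("severity")
        if severity not in ALLOWED_SEVERITIES:
            problems.append(f"{label}: severity {severity!r} will not match the built-in routes")
    # Alert names must be unique within the group.
    names = Counter(rule["alert"] for rule in rules if rule.get("alert"))
    for name, count in names.items():
        if count > 1:
            problems.append(f"{name}: alert name used {count} times in group")
    return problems


kafka_group = {
    "name": "kafka",
    "interval": "15s",
    "rules": [{
        "alert": "KafkaConsumerLagHigh",
        "expr": "kafka_consumer_group_lag > 10000",
        "for": "5m",
        "labels": {"severity": "warning"},
    }],
}
print(check_group(kafka_group))  # → []
```

The check only covers structure; it does not validate the PromQL expression itself.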


Configuring Alert Routes

By default, all alerts are routed to the built-in null receiver, which silently discards them. To receive alerts, configure one or more receivers and update the routing rules — all within the alertmanager.config section of your values.yaml.

Route Structure

The top-level route defines the default behaviour. Child routes under routes match alerts by label and direct them to specific receivers:

alertmanager:
  config:
    route:
      receiver: 'null'          # Default: discard unmatched alerts
      group_by: ['alertname']
      group_wait: 10s           # Wait before sending first notification for a new group
      group_interval: 10s       # Wait before sending updated notifications for a group
      repeat_interval: 1h       # Re-notify if an alert is still firing after this period
      routes:
        - matchers:
            - severity="critical"
          receiver: 'slack'
        - matchers:
            - severity="warning"
          receiver: 'email-warning'

Routes are evaluated top-to-bottom. The first matching route wins unless continue: true is set on the route.
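
The evaluation order can be illustrated with a small simulation. This is a simplified sketch and not Alertmanager's implementation: matches and receivers_for are hypothetical helpers, only equality matchers of the form name="value" are handled, nested child routes are ignored, and the route dict is a hypothetical variant of the structure above with continue set on the first child.

```python
def matches(route, labels):
    """True when every name="value" matcher on the route equals the alert's label."""
    for matcher in route.get("matchers", []):
        name, _, value = matcher.partition("=")
        if labels.get(name) != value.strip('"'):
            return False
    return True


def receivers_for(root, labels):
    """Walk child routes top-to-bottom; stop at the first match unless continue is set."""
    selected = []
    for child in root.get("routes", []):
        if matches(child, labels):
            selected.append(child["receiver"])
            if not child.get("continue", False):
                break
    # No child matched: the top-level (default) receiver handles the alert.
    return selected or [root["receiver"]]


route = {
    "receiver": "null",
    "routes": [
        {"matchers": ['severity="critical"'], "receiver": "slack", "continue": True},
        {"matchers": ['severity="critical"'], "receiver": "email-critical"},
        {"matchers": ['severity="warning"'], "receiver": "email-warning"},
    ],
}

print(receivers_for(route, {"severity": "critical"}))  # → ['slack', 'email-critical']
print(receivers_for(route, {"severity": "info"}))      # → ['null']
```

Without continue on the first child, a critical alert would stop at slack and never reach email-critical.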


Notification Channels

Email

Email requires an SMTP server to be configured globally. Separate receivers can be defined for critical and warning alerts.

alertmanager:
  config:
    global:
      smtp_smarthost: 'smtp.example.com:587'
      smtp_from: 'alertmanager@example.com'
      smtp_require_tls: true
    route:
      routes:
        - matchers:
            - severity="critical"
          receiver: 'email-critical'
        - matchers:
            - severity="warning"
          receiver: 'email-warning'
    receivers:
      - name: 'null'
      - name: 'email-critical'
        email_configs:
          - to: 'oncall@example.com'
            send_resolved: true
      - name: 'email-warning'
        email_configs:
          - to: 'alerts@example.com'
            send_resolved: true

Slack

Requires an incoming webhook URL created in your Slack workspace.

alertmanager:
  config:
    route:
      routes:
        - matchers:
            - severity="critical"
          receiver: 'slack'
    receivers:
      - name: 'null'
      - name: 'slack'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
            channel: '#alerts'
            send_resolved: true
            title: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
            text: |
              *Severity:* {{ .CommonLabels.severity }}
              *Host:* {{ .CommonLabels.host }}
              {{ range .Alerts }}{{ .Annotations.description }}{{ end }}

Telegram

Requires a Telegram bot token and a channel/group chat ID. Create a bot via @BotFather and add it to your alert channel before configuring.

alertmanager:
  config:
    route:
      routes:
        - matchers:
            - severity="critical"
          receiver: 'telegram'
    receivers:
      - name: 'null'
      - name: 'telegram'
        telegram_configs:
          - bot_token: 'your-bot-token'
            chat_id: -1234567890
            parse_mode: 'Markdown'
            send_resolved: true
            message: |
              *Alert:* {{ .CommonLabels.alertname }}
              *Severity:* {{ .CommonLabels.severity }}
              *Host:* {{ .CommonLabels.host }}
              {{ range .Alerts }}
                {{ .Annotations.description }}
              {{ end }}

Finding your chat ID: Add your bot to the channel or group, send a message, then call https://api.telegram.org/bot<token>/getUpdates and read the chat.id from the response. Note that group and channel chat IDs are negative numbers.
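
Reading the chat ID out of the getUpdates response can be scripted. The sketch below assumes the response has already been decoded from JSON into a dict; chat_ids is a hypothetical helper and the sample payload is illustrative, trimmed to the fields the helper reads.

```python
def chat_ids(get_updates_response):
    """Collect {chat_id: title} from a decoded getUpdates JSON response."""
    found = {}
    for update in get_updates_response.get("result", []):
        # Group messages arrive under "message", channel posts under "channel_post".
        for key in ("message", "channel_post"):
            chat = update.get(key, {}).get("chat")
            if chat:
                found[chat["id"]] = chat.get("title") or chat.get("username", "")
    return found


sample = {
    "ok": True,
    "result": [
        {
            "update_id": 1,
            "message": {
                "chat": {"id": -1234567890, "title": "cdn-alerts", "type": "group"},
                "text": "hello",
            },
        },
    ],
}
print(chat_ids(sample))  # → {-1234567890: 'cdn-alerts'}
```

An empty result list usually means no message has been sent to the chat since the bot was added.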


Combining Multiple Receivers

Routes and receivers can be combined to send different alert severities to different channels simultaneously. For example, critical alerts to both Slack and email, warnings to email only:

alertmanager:
  config:
    route:
      receiver: 'null'
      routes:
        - matchers:
            - severity="critical"
          receiver: 'slack'
          continue: true        # Continue matching so the next route also fires
        - matchers:
            - severity="critical"
          receiver: 'email-critical'
        - matchers:
            - severity="warning"
          receiver: 'email-warning'
    receivers:
      - name: 'null'
      - name: 'slack'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
            channel: '#critical-alerts'
            send_resolved: true
      - name: 'email-critical'
        email_configs:
          - to: 'oncall@example.com'
            send_resolved: true
      - name: 'email-warning'
        email_configs:
          - to: 'alerts@example.com'
            send_resolved: true

Apply any receiver or routing changes using the standard upgrade procedure in the Configuration Guide.


Silencing Alerts

Silences suppress alert notifications for a defined time window without disabling the underlying alert rule. They are useful during planned maintenance, known incidents, or when investigating a non-urgent condition.

Silences are managed via the Alertmanager UI, accessible at:

https://<manager-host>/alertmanager

Creating a Silence

  1. Navigate to the Alertmanager UI and click Silences in the top navigation.
  2. Click Create Silence.
  3. Set the Start and End times for the silence window.
  4. Add one or more matchers to scope which alerts are suppressed. For example:
    • alertname = StorageFillingUp — silence a specific alert
    • severity = warning — silence all warnings
    • host = node-01 — silence all alerts from a specific host
  5. Add a Comment describing the reason for the silence (e.g. Planned disk expansion on node-01).
  6. Click Create. The silence takes effect immediately.

Expiring a Silence

Silences expire automatically at the configured end time. To remove a silence early, navigate to Silences in the Alertmanager UI, locate the silence, and click Expire.
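
Silences can also be prepared programmatically. The sketch below only builds the JSON body that Alertmanager's v2 API accepts at POST /api/v2/silences; whether that API is reachable under https://<manager-host>/alertmanager in a given deployment is an assumption, and build_silence is a hypothetical helper.

```python
import json
from datetime import datetime, timedelta, timezone


def build_silence(matchers, hours, comment, created_by, now=None):
    """Build a silence body: exact-match matchers, starting now, lasting `hours`."""
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False, "isEqual": True}
            for name, value in matchers.items()
        ],
        "startsAt": start.isoformat(),  # RFC 3339 timestamp with UTC offset
        "endsAt": end.isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


body = build_silence(
    {"alertname": "StorageFillingUp", "host": "node-01"},
    hours=2,
    comment="Planned disk expansion on node-01",
    created_by="ops-team",  # hypothetical author name
)
print(json.dumps(body, indent=2))
```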

Next Steps

  1. Operations Guide - Day-to-day operational procedures
  2. Troubleshooting Guide - Resolve underlying issues surfaced by alerts
  3. Metrics & Monitoring Overview - Return to the monitoring overview

10 - API Guide

REST API reference and integration examples

Overview

The CDN Manager exposes versioned HTTP APIs under /api (v1 and v2), using JSON payloads by default. When sending request bodies, set Content-Type: application/json. Error responses carry either an empty body with the relevant status code or, where a message is available, a JSON body of the form { "message": "..." }.

Authentication uses a two-step flow:

  1. Create a session
  2. Exchange that session for an access token with grant_type=session

Use the access token in Authorization: Bearer <token> when calling bearer-protected routes. CORS preflight (OPTIONS) is supported and wildcard origins are accepted by default.

Durations such as TTLs use humantime strings (for example, 60s, 5m, 1h).

Base URL

All API endpoints are relative to:

https://<manager-host>/api

API Reference Guides

The API documentation is organized by functional area:

Guide | Description
Authentication API | Login, token exchange, logout, and session management
Health API | Liveness and readiness probes
Selection Input API | Key-value and list storage with search capabilities
Data Store API | Generic JSON key/value storage
Subnets API | CIDR-to-value mappings for routing decisions
Routing API | GeoIP lookups and IP validation
Discovery API | Host and namespace discovery
Metrics API | Metrics submission and aggregation
Configuration API | Configuration document management
Operator UI API | Blocked tokens, user agents, and referrers
OpenAPI Specification | Complete OpenAPI 3.0 specification

Authentication Flow

All authenticated API calls follow the same authentication flow. For detailed instructions, see the Authentication API Guide.

Quick Start:

# Step 1: Login to get session
curl -s -X POST "https://cdn-manager/api/v1/auth/login" \
  -H "Content-Type: application/json" \
  -d '{
    "email": "user@example.com",
    "password": "Password1!"
  }' | tee /tmp/session.json

SESSION_ID=$(jq -r '.session_id' /tmp/session.json)
SESSION_TOKEN=$(jq -r '.session_token' /tmp/session.json)

# Step 2: Exchange session for access token
curl -s -X POST "https://cdn-manager/api/v1/auth/token" \
  -H "Content-Type: application/json" \
  -d "$(jq -nc --arg sid "$SESSION_ID" --arg st "$SESSION_TOKEN" \
    '{session_id:$sid,session_token:$st,grant_type:"session",scope:"openid"}')" \
  | tee /tmp/token.json

ACCESS_TOKEN=$(jq -r '.access_token' /tmp/token.json)

# Step 3: Call a protected endpoint
curl -s "https://cdn-manager/api/v1/metrics" \
  -H "Authorization: Bearer ${ACCESS_TOKEN}"

Error Responses

The API uses standard HTTP response codes to indicate the success or failure of an API request.

Most errors return an empty response body with the relevant HTTP status code (e.g., 404 Not Found or 409 Conflict).

In some cases, the server may return a JSON body containing a user-facing error message:

{
  "message": "Human-readable error message"
}

Next Steps

After learning the API:

  1. Operations Guide - Day-to-day operational procedures
  2. Troubleshooting Guide - Resolve API issues
  3. Configuration Guide - Full configuration reference

10.1 - Authentication API

Authentication and session management

Overview

The Authentication API provides endpoints for user authentication, session management, and token exchange. All authenticated API calls require a valid access token obtained through the authentication flow.

Base URL

https://<manager-host>/api/v1/auth

Endpoints

POST /api/v1/auth/login

Create a session from email/password credentials.

Request:

POST /api/v1/auth/login
Content-Type: application/json

{
  "email": "user@example.com",
  "password": "Password1!"
}

Success Response (200):

{
  "session_id": "session-1",
  "session_token": "token-1",
  "verified_at": "2024-01-01T00:00:00Z",
  "expires_at": "2024-01-01T01:00:00Z"
}

Errors:

  • 401 - Authentication failure (invalid credentials)
  • 500 - Backend/state errors

POST /api/v1/auth/token

Exchange a session for an access token (required for bearer auth).

Request:

POST /api/v1/auth/token
Content-Type: application/json

{
  "session_id": "session-1",
  "session_token": "token-1",
  "grant_type": "session",
  "scope": "openid profile"
}

Success Response (200):

{
  "access_token": "<token>",
  "scope": "openid profile",
  "expires_in": 3600,
  "token_type": "bearer"
}

Token Scopes

The scope parameter in the token exchange request is a space-separated string of permissions requested for the access token.

Scope Resolution: When a token is requested, the backend system filters the requested scopes against the user’s actual permissions. The resulting access token will only contain the subset of requested scopes that the user is authorized to possess.

Naming and Design: Scope names are defined by the applications that consume the tokens, not by the central IAM system. To prevent collisions between different applications or modules, it is highly recommended that application developers use URN-style prefixes for scope names (e.g., urn:acd:manager:config:read).

Errors:

  • 401 - Authentication failure (invalid session)
  • 500 - Backend/state errors

POST /api/v1/auth/logout

Revoke a session. Note: This does not revoke issued access tokens; they remain valid until expiration.

Request:

POST /api/v1/auth/logout
Content-Type: application/json

{
  "session_id": "session-1",
  "session_token": "token-1"
}

Success Response (200):

{
  "status": "Ok"
}

Errors:

  • 400 - Invalid session parameters
  • 500 - Backend/state errors

Complete Authentication Flow Example

# Step 1: Login to get session
curl -s -X POST "https://cdn-manager/api/v1/auth/login" \
  -H "Content-Type: application/json" \
  -d '{
    "email": "user@example.com",
    "password": "Password1!"
  }' | tee /tmp/session.json

SESSION_ID=$(jq -r '.session_id' /tmp/session.json)
SESSION_TOKEN=$(jq -r '.session_token' /tmp/session.json)

# Step 2: Exchange session for access token
curl -s -X POST "https://cdn-manager/api/v1/auth/token" \
  -H "Content-Type: application/json" \
  -d "$(jq -nc --arg sid "$SESSION_ID" --arg st "$SESSION_TOKEN" \
    '{session_id:$sid,session_token:$st,grant_type:"session",scope:"openid"}')" \
  | tee /tmp/token.json

ACCESS_TOKEN=$(jq -r '.access_token' /tmp/token.json)

# Step 3: Call a protected endpoint
curl -s "https://cdn-manager/api/v1/metrics" \
  -H "Authorization: Bearer ${ACCESS_TOKEN}"
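
The same two-step flow can be sketched in Python using only the standard library. The endpoints and field names are taken from the sections above; BASE_URL reuses the placeholder host from the curl example, the helper names are hypothetical, and error handling is left out for brevity.

```python
import json
import urllib.request

BASE_URL = "https://cdn-manager/api/v1/auth"  # placeholder host, as in the curl example


def token_request_body(session, scope="openid"):
    """Build the token-exchange payload from a login response."""
    return {
        "session_id": session["session_id"],
        "session_token": session["session_token"],
        "grant_type": "session",
        "scope": scope,
    }


def post_json(url, payload):
    """POST a JSON payload and decode the JSON response body."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def get_access_token(email, password):
    """Step 1: log in to create a session. Step 2: exchange it for an access token."""
    session = post_json(f"{BASE_URL}/login", {"email": email, "password": password})
    token = post_json(f"{BASE_URL}/token", token_request_body(session))
    return token["access_token"]
```

Calling get_access_token("user@example.com", "Password1!") returns the string to place after Bearer in the Authorization header. A failed login surfaces as urllib.error.HTTPError with code 401.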

Using the Access Token

Once you have obtained an access token, include it in the Authorization header of all API requests:

Authorization: Bearer <access_token>

Example:

curl -s "https://cdn-manager/api/v1/configuration" \
  -H "Authorization: Bearer ${ACCESS_TOKEN}"

Token Expiration

Access tokens expire after the duration specified in expires_in (typically 3600 seconds / 1 hour). When a token expires, you must re-authenticate to obtain a new token.
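
A client that makes many calls may want to reuse a token until shortly before it expires rather than re-authenticating on every request. The following is one possible sketch: TokenCache is a hypothetical class, fetch_token stands for any callable that performs the login and token exchange and returns the token response, and the 60-second safety margin is an arbitrary choice.

```python
import time


class TokenCache:
    """Reuse an access token until shortly before it expires."""

    def __init__(self, fetch_token, margin_seconds=60, clock=time.monotonic):
        self._fetch_token = fetch_token  # callable returning the token response dict
        self._margin = margin_seconds    # refresh this long before the real expiry
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def get(self):
        """Return a valid token, re-authenticating only when needed."""
        if self._token is None or self._clock() >= self._expires_at:
            response = self._fetch_token()
            self._token = response["access_token"]
            self._expires_at = self._clock() + response["expires_in"] - self._margin
        return self._token
```

Because logout does not revoke issued access tokens, a cached token stays usable until its expiry even if the originating session has been closed.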

10.2 - Health API

Liveness and readiness probe endpoints

Overview

The Health API provides endpoints for Kubernetes health probes and service health checking.

Base URL

https://<manager-host>/api/v1/health

Endpoints

GET /api/v1/health/alive

Liveness probe that indicates whether the service is running. Always returns 200 OK.

Request:

GET /api/v1/health/alive

Response (200):

{
  "status": "Ok"
}

Use Case: Kubernetes liveness probe to determine if the pod should be restarted.


GET /api/v1/health/ready

Readiness probe that checks service readiness including downstream dependencies.

Request:

GET /api/v1/health/ready

Success Response (200):

{
  "status": "Ok"
}

Failure Response (503):

{
  "status": "Fail"
}

Use Case: Kubernetes readiness probe to determine if the pod should receive traffic. Returns 503 if any downstream dependencies (database, Kafka, Redis) are unavailable.


Kubernetes Configuration

Example Kubernetes probe configuration:

livenessProbe:
  httpGet:
    path: /api/v1/health/alive
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /api/v1/health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
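
Outside Kubernetes, for example in a deployment script, the readiness endpoint can be polled until the service reports ready. This is a sketch under assumptions: the helper names, the default URL host, and the retry counts are illustrative and not part of the product.

```python
import time
import urllib.error
import urllib.request


def ready_status(url="https://cdn-manager/api/v1/health/ready"):
    """Return the HTTP status of the readiness endpoint, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # e.g. 503 while a downstream dependency is unavailable
    except urllib.error.URLError:
        return None        # connection refused, DNS failure, timeout


def wait_until_ready(probe, attempts=10, delay_seconds=5, sleep=time.sleep):
    """Poll probe() until it returns 200; return True if the service became ready."""
    for attempt in range(attempts):
        if probe() == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

A typical call is wait_until_ready(ready_status), which gives up after ten attempts spaced five seconds apart.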

10.3 - Selection Input API

Key-value and list storage with search capabilities

Overview

The Selection Input API provides JSON key/value storage with search capabilities. It supports two API versions (v1 and v2) with different operation models.

Base URL

https://<manager-host>/api/v1/selection_input
https://<manager-host>/api/v2/selection_input

Version Comparison

Feature | v1 /api/v1/selection_input | v2 /api/v2/selection_input
Primary operation | Merge/UPSERT (POST) | Insert/Replace (PUT)
List append | N/A | POST to push to list
Search syntax | Wildcard prefix (foo* implicit) | Full wildcard (foo* explicit)
Query params | search, sort, limit, ttl | search, ttl, correlation_id
Sort support | Yes (asc/desc) | No
Limit support | Yes | No
Use case | Simple key-value with optional search | List-like operations, full wildcard

When to Use Each Version

Scenario | Recommended Version
Simple key-value storage | v1
List/queue operations (append to array) | v2 POST
Full wildcard pattern matching | v2
Need to sort or paginate results | v1
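
The difference between the two search syntaxes can be modelled locally. The sketch below is an approximation, not the server's implementation: it uses Python's fnmatch, which also treats ? and [...] as special characters, and it assumes that matching is case-sensitive and that v1 sorts keys lexicographically.

```python
from fnmatch import fnmatchcase


def search_v1(stored, search, sort="asc", limit=None):
    """v1 model: the search term is a prefix, so '*' is appended implicitly."""
    keys = [key for key in stored if fnmatchcase(key, search + "*")]
    keys.sort(reverse=(sort == "desc"))
    if limit is not None:
        keys = keys[:limit]
    return {key: stored[key] for key in keys}


def search_v2(stored, pattern):
    """v2 model: the pattern is used as written, so wildcards must be explicit."""
    return {key: value for key, value in stored.items() if fnmatchcase(key, pattern)}


data = {"foo": 1, "foobar": 2, "bar": 3}
print(search_v1(data, "foo"))    # → {'foo': 1, 'foobar': 2}
print(search_v2(data, "foo"))    # → {'foo': 1}
print(search_v2(data, "*bar*"))  # → {'foobar': 2, 'bar': 3}
```

In this model, search=foo on v1 and search=foo* on v2 select the same keys.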

v1 Endpoints

GET /api/v1/selection_input/{path}

Fetch stored JSON. If the stored value is an object, the optional search, limit, and sort parameters apply to its keys.

Query Parameters:

  • search - Wildcard prefix search (adds * implicitly)
  • sort - Sort order (asc or desc)
  • limit - Maximum results (must be > 0)

Success Response (200):

{
  "foo": 1,
  "foobar": 2
}

Errors:

  • 404 - Path does not exist
  • 400 - Invalid search/sort/limit parameters
  • 500 - Backend failure

Example:

curl -s "https://cdn-manager/api/v1/selection_input/config?search=foo&limit=2"

POST /api/v1/selection_input/{path}

Upsert (merge) JSON at path. Nested objects are merged recursively.

Query Parameters:

  • ttl - Expiry time as humantime string (e.g., 10m, 1h)

Request:

{
  "feature_flag": true,
  "ratio": 0.5
}

Success: 201 Created, with the submitted payload echoed in the response body

Errors:

  • 500 / 503 - Backend failure

Example:

curl -s -X POST "https://cdn-manager/api/v1/selection_input/config?ttl=10m" \
  -H "Content-Type: application/json" \
  -d '{
    "feature_flag": true,
    "ratio": 0.5
  }'
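
Recursive merging means that a POST only needs to carry the keys that change. A minimal sketch of such a merge follows. It is an illustration, not the server's code: deep_merge is a hypothetical helper, and the assumption that arrays and scalars are overwritten rather than combined is not confirmed by this guide.

```python
def deep_merge(existing, update):
    """Return a copy of existing with update merged in; nested objects merge key by key."""
    merged = dict(existing)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value  # assumption: scalars, lists and type changes overwrite
    return merged


stored = {"feature_flag": False, "limits": {"max": 10, "min": 1}}
posted = {"feature_flag": True, "ratio": 0.5, "limits": {"max": 20}}
print(deep_merge(stored, posted))
# → {'feature_flag': True, 'limits': {'max': 20, 'min': 1}, 'ratio': 0.5}
```

To discard the old value entirely instead of merging, use the v2 PUT operation.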

DELETE /api/v1/selection_input/{path}

Delete stored value.

Success: 204 No Content

Errors: 503 - Backend failure


v2 Endpoints

GET /api/v2/selection_input/{path}

Fetch stored JSON with optional wildcard filtering.

Query Parameters:

  • search - Full wildcard pattern (e.g., foo*, *bar*)
  • correlation_id - Accepted but currently ignored (logging only)

Success Response (200):

{
  "foo": 1,
  "foobar": 2
}

Errors:

  • 400 - Invalid search pattern
  • 404 - Path does not exist
  • 500 - Backend failure

Example:

curl -s "https://cdn-manager/api/v2/selection_input/config?search=foo*"

PUT /api/v2/selection_input/{path}

Insert/replace value. Old value is discarded.

Query Parameters:

  • ttl - Expiry time as humantime string

Request:

{
  "items": ["a", "b", "c"]
}

Success: 200 OK

Example:

curl -s -X PUT "https://cdn-manager/api/v2/selection_input/catalog" \
  -H "Content-Type: application/json" \
  -d '{
    "items": ["a", "b", "c"]
  }'

POST /api/v2/selection_input/{path}

Push a value to the back of a list-like entry (append to array).

Query Parameters:

  • ttl - Expiry time as humantime string

Request (any JSON value):

{
  "item": 42
}

Or a simple string:

"ready-for-publish"

Success: 200 OK

Example:

curl -s -X POST "https://cdn-manager/api/v2/selection_input/queue" \
  -H "Content-Type: application/json" \
  -d '"ready-for-publish"'

DELETE /api/v2/selection_input/{path}

Delete stored value.

Success: 204 No Content


10.4 - Data Store API

Generic JSON key/value storage

Overview

The Data Store API provides generic JSON key/value storage for short-lived or simple structured data.

Base URL

https://<manager-host>/api/v1/datastore

Endpoints

GET /api/v1/datastore

List all known keys.

Query Parameters:

  • show_hidden - Boolean (default false). When true, includes internal keys starting with _.

Success Response (200):

["user:123", "config:settings", "session:abc"]

Hidden Keys: Keys starting with _ are reserved for internal use (e.g., subnet service). Writing to hidden keys via the datastore API returns 400 Bad Request.


GET /api/v1/datastore/{key}

Retrieve the JSON value for a specific key.

Success Response (200): The stored JSON value

Errors:

  • 404 - Key does not exist
  • 500 - Backend failure

Example:

curl -s "https://cdn-manager/api/v1/datastore/user:123"

POST /api/v1/datastore/{key}

Create a new JSON value at the specified key. Fails if the key already exists.

Query Parameters:

  • ttl - Expiry time as humantime string (e.g., 60s, 1h)

Request:

{
  "id": 123,
  "name": "alice"
}

Success: 201 Created

Errors:

  • 409 Conflict - Key already exists
  • 500 - Backend failure

Example:

curl -s -X POST "https://cdn-manager/api/v1/datastore/user:123?ttl=1h" \
  -H "Content-Type: application/json" \
  -d '{"id":123,"name":"alice"}'

PUT /api/v1/datastore/{key}

Update or replace the JSON value at an existing key.

Query Parameters:

  • ttl - Expiry time as humantime string

Success: 200 OK

Errors:

  • 404 - Key does not exist
  • 500 - Backend failure

Example:

curl -s -X PUT "https://cdn-manager/api/v1/datastore/user:123" \
  -H "Content-Type: application/json" \
  -d '{"id":123,"name":"alice-updated"}'

DELETE /api/v1/datastore/{key}

Delete the value at the specified key. Idempotent operation.

Success: 204 No Content

Errors: 500 - Backend failure

Example:

curl -s -X DELETE "https://cdn-manager/api/v1/datastore/user:123"
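
The status codes documented above can be summarised as a small in-memory model, which is handy as a test double for client code. It is a sketch only: DataStoreModel is hypothetical, TTL handling is omitted, and the check order (hidden-key rejection before existence checks) as well as 204 for deleting a missing key are assumptions.

```python
class DataStoreModel:
    """In-memory model of the documented Data Store status codes."""

    def __init__(self):
        self._items = {}

    def list_keys(self, show_hidden=False):
        return [key for key in self._items if show_hidden or not key.startswith("_")]

    def post(self, key, value):
        if key.startswith("_"):
            return 400  # hidden keys are reserved for internal use
        if key in self._items:
            return 409  # POST only creates new keys
        self._items[key] = value
        return 201

    def put(self, key, value):
        if key.startswith("_"):
            return 400
        if key not in self._items:
            return 404  # PUT only replaces existing keys
        self._items[key] = value
        return 200

    def delete(self, key):
        self._items.pop(key, None)  # idempotent: a missing key is not an error
        return 204


store = DataStoreModel()
print(store.post("user:123", {"id": 123, "name": "alice"}))  # → 201
print(store.post("user:123", {"id": 123, "name": "alice"}))  # → 409
```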

10.5 - Subnets API

CIDR-to-value mappings for routing decisions

Overview

The Subnets API manages CIDR-to-value mappings used for routing decisions. This allows classification of IP ranges for routing purposes.

Base URL

https://<manager-host>/api/v1/subnets

Endpoints

PUT /api/v1/subnets

Create or update subnet mappings.

Request:

{
  "192.168.1.0/24": "office",
  "10.0.0.0/8": "internal",
  "203.0.113.0/24": "external"
}

Success: 200 OK

Errors:

  • 400 - Invalid CIDR format
  • 500 - Backend failure

Example:

curl -s -X PUT "https://cdn-manager/api/v1/subnets" \
  -H "Content-Type: application/json" \
  -d '{
    "192.168.1.0/24": "office",
    "10.0.0.0/8": "internal"
  }'

GET /api/v1/subnets

List all subnet mappings.

Success Response (200): JSON object of CIDR-to-value mappings

Example:

curl -s "https://cdn-manager/api/v1/subnets" | jq '.'

DELETE /api/v1/subnets

Delete all subnet mappings.

Success: 204 No Content


GET /api/v1/subnets/byKey/{subnet}

Retrieve subnet mappings whose CIDR begins with the given prefix.

Example:

curl -s "https://cdn-manager/api/v1/subnets/byKey/192.168" | jq '.'

GET /api/v1/subnets/byValue/{value}

Retrieve subnet mappings with the given classification value.

Example:

curl -s "https://cdn-manager/api/v1/subnets/byValue/office" | jq '.'

DELETE /api/v1/subnets/byKey/{subnet}

Delete subnet mappings whose CIDR begins with the given prefix.


DELETE /api/v1/subnets/byValue/{value}

Delete subnet mappings with the given classification value.
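
The byKey and byValue filters, and one way a client might use the mappings, can be sketched as follows. The function names are hypothetical. by_key and by_value mirror the descriptions above (string prefix and exact value), while classify uses longest-prefix matching, which is an assumption about how the mappings are consumed and is not stated in this guide.

```python
import ipaddress


def by_key(mappings, prefix):
    """Mappings whose CIDR string begins with the given prefix."""
    return {cidr: value for cidr, value in mappings.items() if cidr.startswith(prefix)}


def by_value(mappings, wanted):
    """Mappings carrying the given classification value."""
    return {cidr: value for cidr, value in mappings.items() if value == wanted}


def classify(mappings, ip):
    """Value of the most specific CIDR containing ip, or None if nothing matches."""
    address = ipaddress.ip_address(ip)
    best = None
    for cidr, value in mappings.items():
        network = ipaddress.ip_network(cidr)
        if address.version == network.version and address in network:
            if best is None or network.prefixlen > best[0].prefixlen:
                best = (network, value)
    return best[1] if best else None


subnets = {"192.168.1.0/24": "office", "10.0.0.0/8": "internal", "203.0.113.0/24": "external"}
print(by_key(subnets, "192.168"))     # → {'192.168.1.0/24': 'office'}
print(classify(subnets, "10.1.2.3"))  # → internal
```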


10.6 - Routing API

GeoIP lookups and IP validation

Overview

The Routing API provides GeoIP information lookup and IP address validation for routing decisions.

Base URL

https://<manager-host>/api/v1/routing

Endpoints

GET /api/v1/routing/geoip

Look up GeoIP information for an IP address.

Query Parameters:

  • ip - IP address to look up

Success Response (200):

{
  "city": {
    "name": "Washington"
  },
  "asn": 64512
}

Errors:

  • 400 - Invalid IP format
  • 500 - Backend failure

Caching: Cache-Control: public, max-age=86400 (24 hours)

Example:

curl -s "https://cdn-manager/api/v1/routing/geoip?ip=149.101.100.0"

GET /api/v1/routing/validate

Validate if an IP address is allowed (not blocked).

Query Parameters:

  • ip - IP address to validate

Success Response (200): Empty body (IP is allowed)

Forbidden Response (403):

Access Denied

Errors:

  • 400 - Invalid IP format
  • 500 - Backend failure

Caching: Cache-Control headers included (default: max-age=300, configurable via [tuning] section)

Example:

curl -i "https://cdn-manager/api/v1/routing/validate?ip=149.101.100.0"

Use Cases

GeoIP-Based Routing

Use the /geoip endpoint to determine the geographic location and ASN of an IP address for routing decisions:

# Get location data for routing
IP_INFO=$(curl -s "https://cdn-manager/api/v1/routing/geoip?ip=203.0.113.50")
CITY=$(echo "$IP_INFO" | jq -r '.city.name')
ASN=$(echo "$IP_INFO" | jq -r '.asn')

echo "Routing based on city: $CITY, ASN: $ASN"

IP Validation

Use the /validate endpoint to check if an IP is allowed before processing requests:

# Check if IP is allowed
RESPONSE=$(curl -s -o /dev/null -w "%{http_code}" \
  "https://cdn-manager/api/v1/routing/validate?ip=203.0.113.50")

if [ "$RESPONSE" = "200" ]; then
  echo "IP is allowed"
elif [ "$RESPONSE" = "403" ]; then
  echo "IP is blocked"
fi
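
Because both routing endpoints send Cache-Control headers, a client can avoid repeating lookups for the same IP. The sketch below parses max-age and keeps a small local cache. The class and function names are hypothetical, and whether 403 responses carry the same caching headers as 200 responses is an assumption.

```python
import time


def max_age_seconds(cache_control):
    """Extract max-age from a Cache-Control header value, or None if absent."""
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age" and value.isdigit():
            return int(value)
    return None


class ValidationCache:
    """Remember allow/deny results per IP for as long as max-age permits."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # ip -> (allowed, expires_at)

    def store(self, ip, allowed, cache_control):
        ttl = max_age_seconds(cache_control)
        if ttl:  # skip responses without a usable max-age
            self._entries[ip] = (allowed, self._clock() + ttl)

    def lookup(self, ip):
        """Return True/False for a fresh entry, or None when the API must be asked."""
        entry = self._entries.get(ip)
        if entry and self._clock() < entry[1]:
            return entry[0]
        return None


print(max_age_seconds("public, max-age=86400"))  # → 86400
```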

10.7 - Discovery API

Host and namespace discovery

Overview

The Discovery API provides information about discovered hosts and namespaces. Discovery is configured via the Helm chart values.yaml file. Each entry defines a namespace with a list of hostnames.

Base URL

https://<manager-host>/api/v1/discovery

Endpoints

GET /api/v1/discovery/hosts

Return discovered hosts grouped by namespace.

Success Response (200):

{
  "directors": [
    { "name": "director-1.example.com" }
  ],
  "edge-servers": [
    { "name": "cdn1.example.com" },
    { "name": "cdn2.example.com" }
  ]
}

Example:

curl -s "https://cdn-manager/api/v1/discovery/hosts"

GET /api/v1/discovery/namespaces

Return discovery namespaces with their corresponding Confd URIs.

Success Response (200):

[
  {
    "namespace": "edge-servers",
    "confd_uri": "/api/v1/confd/edge-servers"
  },
  {
    "namespace": "directors",
    "confd_uri": "/api/v1/confd/directors"
  }
]

Example:

curl -s "https://cdn-manager/api/v1/discovery/namespaces"

Configuration

Discovery is configured via the Helm chart values.yaml file under manager.discovery:

manager:
  discovery:
    - namespace: "directors"
      hosts:
        - director-1.example.com
        - director-2.example.com
    - namespace: "edge-servers"
      hosts:
        - cdn1.example.com
        - cdn2.example.com

Each entry defines a namespace with a list of hostnames. Optionally, a pattern field can be specified for regex-based host matching.


10.8 - Metrics API

Metrics submission and aggregation

Overview

The Metrics API allows submission and retrieval of metrics data from CDN components.

Base URL

https://<manager-host>/api/v1/metrics

Endpoints

POST /api/v1/metrics

Submit metrics data.

Request:

{
  "example.com": {
    "metric1": 100,
    "metric2": 200
  }
}

Success: 200 OK

Errors: 500 - Validation/backend errors

Example:

curl -s -X POST "https://cdn-manager/api/v1/metrics" \
  -H "Content-Type: application/json" \
  -d '{
    "example.com": {
      "metric1": 100,
      "metric2": 200
    }
  }'
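Because the endpoint reports validation problems as 500, it can help to check the payload shape on the client before sending. The sketch below checks the structure shown in the request example (an object keyed by host, each holding numeric metrics); the function name and error messages are illustrative, and the server may apply further rules not covered here.

```python
from numbers import Real


def metrics_payload_errors(payload) -> list:
    """Return a list of problems; an empty list means the shape is valid."""
    if not isinstance(payload, dict):
        return ["payload must be a JSON object keyed by host"]
    errors = []
    for host, metrics in payload.items():
        if not isinstance(metrics, dict):
            errors.append(f"{host}: metrics must be an object")
            continue
        for name, value in metrics.items():
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, Real):
                errors.append(f"{host}.{name}: value must be a number")
    return errors


print(metrics_payload_errors({"example.com": {"metric1": 100, "metric2": 200}}))
print(metrics_payload_errors({"example.com": {"metric1": "100"}}))
```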

GET /api/v1/metrics

Return aggregated metrics per host.

Response: JSON object with aggregated metrics per host

Note: Metrics are stored per host for up to 5 minutes. Hosts that stop reporting disappear from the aggregation after that window. When no metrics are being reported, the endpoint returns an empty object, {}.

Example:

curl -s "https://cdn-manager/api/v1/metrics"

Metrics Retention

  • Metrics are stored for up to 5 minutes in the aggregation layer
  • For long-term metrics storage, data is forwarded to VictoriaMetrics
  • Query historical metrics via Grafana dashboards at /grafana
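The 5-minute window can be pictured with a toy model. The class below is not the manager's implementation: it keeps only the latest report per host and ignores how values are aggregated, which the source does not describe. It only illustrates hosts dropping out once they stop reporting; the class name, method names, and injectable clock are assumptions made for the example.

```python
import time

WINDOW_SECONDS = 300  # the 5-minute retention described above


class MetricsWindow:
    """Toy model: remember each host's latest report for a fixed window."""

    def __init__(self, window=WINDOW_SECONDS, clock=time.monotonic):
        self._window = window
        self._clock = clock
        self._latest = {}  # host -> (timestamp, metrics)

    def submit(self, payload: dict) -> None:
        now = self._clock()
        for host, metrics in payload.items():
            self._latest[host] = (now, dict(metrics))

    def snapshot(self) -> dict:
        # Only hosts that reported within the window are returned.
        cutoff = self._clock() - self._window
        return {h: m for h, (ts, m) in self._latest.items() if ts >= cutoff}


# Demo with a controllable clock instead of real time.
now = [0.0]
window = MetricsWindow(clock=lambda: now[0])
window.submit({"cdn1.example.com": {"metric1": 100}})
now[0] = 200.0
window.submit({"cdn2.example.com": {"metric1": 5}})
now[0] = 350.0  # cdn1 last reported 350 s ago, cdn2 150 s ago
print(window.snapshot())
```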

10.9 - Configuration API

Configuration document management

Overview

The Configuration API provides endpoints for managing the system configuration document. The endpoint supports ETags: send If-None-Match on a GET to make it conditional, in which case the server may return 304 Not Modified.

Operational Note: This API is intended for internal verification only. Behavior is undefined in multi-replica clusters because pods do not coordinate config writes.

Base URL

https://<manager-host>/api/v1/configuration

Endpoints

GET /api/v1/configuration

Retrieve the configuration document.

Success: 200 OK with configuration JSON

Conditional GET: Returns 304 Not Modified if If-None-Match header matches current ETag

Example:

# Get ETag from response headers (header names are case-insensitive; strip the trailing CR)
etag=$(curl -s -D- -o /dev/null "https://cdn-manager/api/v1/configuration" | awk 'tolower($1)=="etag:"{print $2}' | tr -d '\r')

# Conditional GET - returns 304 if config unchanged
curl -s -H "If-None-Match: $etag" "https://cdn-manager/api/v1/configuration" -o /tmp/cfg.json -w "%{http_code}\n"
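HTTP header names are case-insensitive, and lines in a raw header dump (for example the output of curl -D) end in CRLF, so ETag extraction has to handle both. A minimal Python sketch follows; the function name and sample header text are illustrative.

```python
def extract_etag(raw_headers: str):
    """Return the ETag value from a raw HTTP header dump, or None."""
    for line in raw_headers.splitlines():  # splitlines() also handles CRLF
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "etag":
            return value.strip()
    return None


# Hypothetical header dump; HTTP/2 servers send lowercase header names.
dump = 'HTTP/1.1 200 OK\r\ncontent-type: application/json\r\netag: "abc123"\r\n\r\n'
print(extract_etag(dump))
```

The returned value keeps its surrounding quotes, which is the form expected in If-None-Match.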

PUT /api/v1/configuration

Replace the configuration document.

Request:

{
  "feature_flag": false,
  "ratio": 0.25
}

Success: 200 OK

Errors:

  • 400 - Invalid configuration format
  • 500 - Backend failure

DELETE /api/v1/configuration

Delete the configuration document.

Success: 200 OK


ETag Usage

The configuration API supports ETags for optimistic concurrency control:

# 1. Get current config and ETag
response=$(curl -s -D headers.txt "https://cdn-manager/api/v1/configuration")
etag=$(grep -i ETag headers.txt | cut -d' ' -f2 | tr -d '\r')

# 2. Modify the config as needed
modified_config=$(echo "$response" | jq '.feature_flag = true')

# 3. Update with ETag to prevent overwriting concurrent changes
curl -s -X PUT "https://cdn-manager/api/v1/configuration" \
  -H "Content-Type: application/json" \
  -H "If-Match: $etag" \
  -d "$modified_config"
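The update step can be sketched in Python with the standard library. The function below only builds the PUT request with the If-Match header and sends nothing, so it can be inspected or passed to urllib.request.urlopen by the caller. The function name and base URL are illustrative, and how the server responds to a stale ETag is not documented in this section.

```python
import json
import urllib.request


def build_update_request(base_url: str, config: dict, etag: str) -> urllib.request.Request:
    """Build a conditional PUT for the configuration document."""
    body = json.dumps(config).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/v1/configuration", data=body, method="PUT"
    )
    req.add_header("Content-Type", "application/json")
    # Send the ETag captured when the document was read.
    req.add_header("If-Match", etag)
    return req


req = build_update_request("https://cdn-manager", {"feature_flag": True}, '"abc123"')
print(req.get_method(), req.full_url)
```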

10.10 - Operator UI API

Blocked tokens, user agents, and referrers

Overview

The Operator UI API provides read-only helpers exposing curated selection input content for the operator interface.

Query Parameters: search, sort, limit (same as selection input v1)

Note: Stored keys for user agents/referrers are URL-safe base64; responses decode them to human-readable values.

Base URL

https://<manager-host>/api/v1/operator_ui

Endpoints

Blocked Household Tokens

GET /api/v1/operator_ui/modules/blocked_tokens

List all blocked household tokens.

Success Response (200):

[
  {
    "household_token": "house-001_token-abc",
    "expire_time": 1625247600
  }
]

GET /api/v1/operator_ui/modules/blocked_tokens/{token}

Get details for a specific blocked token.

Success Response (200):

{
  "household_token": "house-001_token-abc",
  "expire_time": 1625247600
}

Blocked User Agents

GET /api/v1/operator_ui/modules/blocked_user_agents

List all blocked user agents.

Success Response (200):

[
  {
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
  },
  {
    "user_agent": "curl/7.68.0"
  }
]

GET /api/v1/operator_ui/modules/blocked_user_agents/{encoded}

Get details for a specific blocked user agent. The path variable is URL-safe base64 encoded.

Example:

# Encode the user agent
ENC=$(python3 -c "import base64; print(base64.urlsafe_b64encode(b'curl/7.68.0').decode().rstrip('='))")

# Get details
curl -s "https://cdn-manager/api/v1/operator_ui/modules/blocked_user_agents/$ENC"

Blocked Referrers

GET /api/v1/operator_ui/modules/blocked_referrers

List all blocked referrers.

Success Response (200):

[
  {
    "referrer": "https://spam-example.com"
  }
]

GET /api/v1/operator_ui/modules/blocked_referrers/{encoded}

Get details for a specific blocked referrer. The path variable is URL-safe base64 encoded.

Example:

# Encode the referrer
ENC=$(python3 -c "import base64; print(base64.urlsafe_b64encode(b'spam-example.com').decode().rstrip('='))")

# Get details
curl -s "https://cdn-manager/api/v1/operator_ui/modules/blocked_referrers/$ENC"

URL-Safe Base64 Encoding

The Operator UI API uses URL-safe base64 encoding for path parameters. To encode values:

Python:

import base64

# Encode
encoded = base64.urlsafe_b64encode(b'value').decode().rstrip('=')

# Decode
decoded = base64.urlsafe_b64decode(encoded + '=' * (-len(encoded) % 4)).decode()

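Bash (with coreutils base64):

# Encode: swap to the URL-safe alphabet, then strip padding and newlines
printf '%s' "value" | base64 | tr '+/' '-_' | tr -d '=\n'

# Decode: restore the standard alphabet and padding first
enc="dmFsdWU"
pad=$(( (4 - ${#enc} % 4) % 4 ))
printf '%s%*s' "$enc" "$pad" '' | tr '_-' '/+' | tr ' ' '=' | base64 -d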
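The encode and decode steps can be wrapped into helpers that also build the lookup URL. This is a sketch: the helper names and base URL are illustrative, while the path and the unpadded URL-safe encoding follow the endpoint descriptions above.

```python
import base64


def encode_key(value: str) -> str:
    """URL-safe base64 without padding, as used for the path parameter."""
    raw = base64.urlsafe_b64encode(value.encode("utf-8"))
    return raw.decode("ascii").rstrip("=")


def decode_key(encoded: str) -> str:
    """Restore the stripped padding, then decode."""
    padded = encoded + "=" * (-len(encoded) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8")


def blocked_user_agent_url(base_url: str, user_agent: str) -> str:
    """Build the detail URL for one blocked user agent."""
    return (
        f"{base_url}/api/v1/operator_ui/modules/blocked_user_agents/"
        f"{encode_key(user_agent)}"
    )


print(blocked_user_agent_url("https://cdn-manager", "curl/7.68.0"))
```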

10.11 - OpenAPI Specification

Complete OpenAPI 3.0 specification

Overview

The CDN Manager API is documented using the OpenAPI 3.0 specification. This appendix provides the complete specification for reference and for generating API clients.

OpenAPI Specification (YAML)

openapi: 3.0.3
info:
  title: AgileTV CDN Manager API
  version: "1.0"
servers:
  - url: https://<manager-host>/api
    description: CDN Manager API server
paths:
  /v1/auth/login:
    post:
      summary: Login and create session
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/LoginRequest'
      responses:
        '200':
          description: Session created
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/LoginResponse'
        '401': { description: Unauthorized, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
        '500': { description: Internal error, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
  /v1/auth/token:
    post:
      summary: Exchange session for access token
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/TokenRequest'
      responses:
        '200':
          description: Access token
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/TokenResponse'
        '401': { description: Unauthorized, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
        '500': { description: Internal error, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
  /v1/auth/logout:
    post:
      summary: Revoke session
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/LogoutRequest'
      responses:
        '200': { description: Revoked, content: { application/json: { schema: { $ref: '#/components/schemas/LogoutResponse' } } } }
        '401': { description: Unauthorized, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
        '500': { description: Internal error, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
  /v1/selection_input{tail}:
    get:
      summary: Read selection input
      parameters:
        - $ref: '#/components/parameters/Tail'
        - $ref: '#/components/parameters/Search'
        - $ref: '#/components/parameters/Sort'
        - $ref: '#/components/parameters/Limit'
      responses:
        '200': { description: JSON value }
        '400': { description: Bad request, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
        '404': { description: Not found }
        '500': { description: Backend failure }
    post:
      summary: Merge selection input
      parameters:
        - $ref: '#/components/parameters/Tail'
        - $ref: '#/components/parameters/Ttl'
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/AnyJson'
      responses:
        '201': { description: Created, content: { application/json: { schema: { $ref: '#/components/schemas/AnyJson' } } } }
        '500': { description: Backend failure }
        '503': { description: Service unavailable }
    delete:
      summary: Delete selection input
      parameters:
        - $ref: '#/components/parameters/Tail'
      responses:
        '204': { description: Deleted }
        '503': { description: Service unavailable }
  /v2/selection_input{tail}:
    get:
      summary: Read selection input v2
      parameters:
        - $ref: '#/components/parameters/TailV2'
        - $ref: '#/components/parameters/Search'
      responses:
        '200': { description: JSON value }
        '400': { description: Invalid search pattern }
        '404': { description: Not found }
        '500': { description: Backend failure }
    put:
      summary: Replace selection input v2
      parameters:
        - $ref: '#/components/parameters/TailV2'
        - $ref: '#/components/parameters/Ttl'
        - $ref: '#/components/parameters/CorrelationId'
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/AnyJson'
      responses:
        '200': { description: Updated }
        '500': { description: Backend failure }
    post:
      summary: Push to selection input v2
      parameters:
        - $ref: '#/components/parameters/TailV2'
        - $ref: '#/components/parameters/Ttl'
        - $ref: '#/components/parameters/CorrelationId'
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/AnyJson'
      responses:
        '200': { description: Pushed }
        '500': { description: Backend failure }
    delete:
      summary: Delete selection input v2
      parameters:
        - $ref: '#/components/parameters/TailV2'
      responses:
        '204': { description: Deleted }
        '500': { description: Backend failure }
  /v1/configuration:
    get:
      summary: Read configuration
      responses:
        '200': { description: Configuration, content: { application/json: { schema: { $ref: '#/components/schemas/AnyJson' } } }, headers: { ETag: { schema: { type: string } } } }
        '304': { description: Not modified }
        '500': { description: Backend failure }
    put:
      summary: Replace configuration
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/AnyJson'
      responses:
        '200': { description: Replaced }
        '500': { description: Backend failure }
    delete:
      summary: Delete configuration
      responses:
        '200': { description: Deleted }
        '500': { description: Backend failure }
  /v1/routing/geoip:
    get:
      summary: GeoIP lookup
      parameters:
        - name: ip
          in: query
          required: true
          schema: { type: string }
      responses:
        '200': { description: GeoIP data, content: { application/json: { schema: { $ref: '#/components/schemas/GeoIpResponse' } } } }
        '400': { description: Invalid IP }
        '500': { description: Backend failure }
  /v1/routing/validate:
    get:
      summary: Validate routing
      parameters:
        - name: ip
          in: query
          required: true
          schema: { type: string }
      responses:
        '200': { description: Allowed }
        '403': { description: Access Denied }
        '400': { description: Invalid IP }
        '500': { description: Backend failure }
  /v1/metrics:
    post:
      summary: Ingest metrics
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/MetricsIngress'
      responses:
        '200': { description: Stored }
        '500': { description: Validation/back-end error }
    get:
      summary: Aggregate metrics
      responses:
        '200': { description: Aggregated metrics, content: { application/json: { schema: { $ref: '#/components/schemas/AnyJson' } } } }
        '500': { description: Backend failure }
  /v1/discovery/hosts:
    get:
      summary: List discovered hosts by namespace
      responses:
        '200':
          description: Discovered hosts keyed by namespace
          content:
            application/json:
              schema:
                type: object
                additionalProperties:
                  type: array
                  items:
                    $ref: '#/components/schemas/DiscoveryHost'
        '500': { description: Backend failure }
  /v1/discovery/namespaces:
    get:
      summary: List discovery namespaces with Confd URIs
      responses:
        '200':
          description: Namespaces with Confd links
          content:
            application/json:
              schema:
                type: array
                items:
                  $ref: '#/components/schemas/DiscoveryNamespace'
        '500': { description: Backend failure }
  /v1/datastore:
    get:
      summary: List datastore keys
      responses:
        '200': { description: Keys list, content: { application/json: { schema: { type: array, items: { type: string } } } } }
        '500': { description: Backend failure }
  /v1/datastore/{key}:
    get:
      summary: Get a JSON value by key
      parameters:
        - name: key
          in: path
          required: true
          schema: { type: string }
      responses:
        '200': { description: JSON value, content: { application/json: { schema: { $ref: '#/components/schemas/AnyJson' } } } }
        '404': { description: Not found }
        '500': { description: Backend failure }
    post:
      summary: Create a JSON value at key
      parameters:
        - name: key
          in: path
          required: true
          schema: { type: string }
        - $ref: '#/components/parameters/Ttl'
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/AnyJson'
      responses:
        '201': { description: Created }
        '409': { description: Conflict (already exists) }
        '500': { description: Backend failure }
    put:
      summary: Update/replace a JSON value at key
      parameters:
        - name: key
          in: path
          required: true
          schema: { type: string }
        - $ref: '#/components/parameters/Ttl'
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/AnyJson'
      responses:
        '200': { description: Updated }
        '404': { description: Not found }
        '500': { description: Backend failure }
    delete:
      summary: Delete a datastore key
      parameters:
        - name: key
          in: path
          required: true
          schema: { type: string }
      responses:
        '204': { description: Deleted }
        '500': { description: Backend failure }
  /v1/subnets:
    get:
      summary: List all subnet mappings
      responses:
        '200': { description: Subnet mappings, content: { application/json: { schema: { type: object, additionalProperties: { type: string } } } } }
        '500': { description: Backend failure }
    put:
      summary: Create or update subnet mappings
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              additionalProperties:
                type: string
              description: Map of CIDR strings to classification values
      responses:
        '200': { description: Created }
        '400': { description: Invalid CIDR format }
        '500': { description: Backend failure }
    delete:
      summary: Delete all subnet mappings
      responses:
        '204': { description: Deleted }
        '500': { description: Backend failure }
  /v1/subnets/byKey/{subnet}:
    get:
      summary: Get subnet mappings by CIDR prefix
      parameters:
        - name: subnet
          in: path
          required: true
          schema: { type: string }
      responses:
        '200': { description: Subnet mappings, content: { application/json: { schema: { type: object, additionalProperties: { type: string } } } } }
        '500': { description: Backend failure }
    delete:
      summary: Delete subnet mappings by CIDR prefix
      parameters:
        - name: subnet
          in: path
          required: true
          schema: { type: string }
      responses:
        '204': { description: Deleted }
        '500': { description: Backend failure }
  /v1/subnets/byValue/{value}:
    get:
      summary: Get subnet mappings by value
      parameters:
        - name: value
          in: path
          required: true
          schema: { type: string }
      responses:
        '200': { description: Subnet mappings, content: { application/json: { schema: { type: object, additionalProperties: { type: string } } } } }
        '500': { description: Backend failure }
    delete:
      summary: Delete subnet mappings by value
      parameters:
        - name: value
          in: path
          required: true
          schema: { type: string }
      responses:
        '204': { description: Deleted }
        '500': { description: Backend failure }
  /v1/operator_ui/modules/blocked_tokens:
    get:
      summary: List blocked tokens
      parameters:
        - $ref: '#/components/parameters/Search'
        - $ref: '#/components/parameters/Sort'
        - $ref: '#/components/parameters/Limit'
      responses:
        '200': { description: Blocked tokens, content: { application/json: { schema: { type: array, items: { $ref: '#/components/schemas/BlockedToken' } } } } }
        '400': { description: Parse error, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
  /v1/operator_ui/modules/blocked_tokens/{token}:
    get:
      summary: Get blocked token
      parameters:
        - name: token
          in: path
          required: true
          schema: { type: string }
      responses:
        '200': { description: Blocked token, content: { application/json: { schema: { $ref: '#/components/schemas/BlockedToken' } } } }
        '404': { description: Not found }
        '400': { description: Parse error, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
  /v1/operator_ui/modules/blocked_user_agents:
    get:
      summary: List blocked user agents
      parameters:
        - $ref: '#/components/parameters/Search'
        - $ref: '#/components/parameters/Sort'
        - $ref: '#/components/parameters/Limit'
      responses:
        '200': { description: Blocked user agents, content: { application/json: { schema: { type: array, items: { $ref: '#/components/schemas/BlockedUserAgent' } } } } }
        '400': { description: Parse error, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
  /v1/operator_ui/modules/blocked_user_agents/{encoded}:
    get:
      summary: Get blocked user agent
      parameters:
        - name: encoded
          in: path
          required: true
          schema: { type: string }
      responses:
        '200': { description: Blocked user agent, content: { application/json: { schema: { $ref: '#/components/schemas/BlockedUserAgent' } } } }
        '400': { description: Parse error, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
  /v1/operator_ui/modules/blocked_referrers:
    get:
      summary: List blocked referrers
      parameters:
        - $ref: '#/components/parameters/Search'
        - $ref: '#/components/parameters/Sort'
        - $ref: '#/components/parameters/Limit'
      responses:
        '200': { description: Blocked referrers, content: { application/json: { schema: { type: array, items: { $ref: '#/components/schemas/BlockedReferrer' } } } } }
        '400': { description: Parse error, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
  /v1/operator_ui/modules/blocked_referrers/{encoded}:
    get:
      summary: Get blocked referrer
      parameters:
        - name: encoded
          in: path
          required: true
          schema: { type: string }
      responses:
        '200': { description: Blocked referrer, content: { application/json: { schema: { $ref: '#/components/schemas/BlockedReferrer' } } } }
        '400': { description: Parse error, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
  /v1/health/alive:
    get:
      summary: Liveness check
      responses:
        '200': { description: Alive, content: { application/json: { schema: { $ref: '#/components/schemas/HealthStatus' } } } }
  /v1/health/ready:
    get:
      summary: Readiness check
      responses:
        '200': { description: Ready, content: { application/json: { schema: { $ref: '#/components/schemas/HealthStatus' } } } }
        '503': { description: Unready, content: { application/json: { schema: { $ref: '#/components/schemas/HealthStatus' } } } }
components:
  parameters:
    Tail:
      name: tail
      in: path
      required: true
      schema: { type: string }
    TailV2:
      name: tail
      in: path
      required: true
      schema: { type: string }
    Search:
      name: search
      in: query
      required: false
      schema: { type: string }
    Sort:
      name: sort
      in: query
      required: false
      schema: { type: string, enum: [asc, desc] }
    Limit:
      name: limit
      in: query
      required: false
      schema: { type: integer, minimum: 1 }
    Ttl:
      name: ttl
      in: query
      required: false
      schema: { type: string, description: Humantime duration }
    CorrelationId:
      name: correlation_id
      in: query
      required: false
      schema: { type: string }
  schemas:
    LoginRequest:
      type: object
      required: [email, password]
      properties:
        email: { type: string, format: email }
        password: { type: string, format: password }
    LoginResponse:
      type: object
      properties:
        session_id: { type: string }
        session_token: { type: string }
        verified_at: { type: string, format: date-time }
        expires_at: { type: string, format: date-time }
    LogoutRequest:
      type: object
      required: [session_id]
      properties:
        session_id: { type: string }
        session_token: { type: string }
    LogoutResponse:
      type: object
      properties:
        status: { $ref: '#/components/schemas/StatusValue' }
    TokenRequest:
      type: object
      required: [session_id, session_token, grant_type]
      properties:
        session_id: { type: string }
        session_token: { type: string }
        scope: { type: string }
        grant_type: { type: string, enum: [session] }
    TokenResponse:
      type: object
      required: [access_token, scope, expires_in, token_type]
      properties:
        access_token: { type: string }
        scope: { type: string }
        expires_in: { type: integer, format: int64 }
        token_type: { type: string, enum: [bearer] }
    ErrorResponse:
      type: object
      properties:
        message: { type: string }
    AnyJson:
      description: Arbitrary JSON value
    MetricsIngress:
      type: object
      additionalProperties:
        type: object
        additionalProperties: { type: number }
    GeoIpResponse:
      type: object
      properties:
        city:
          type: object
          properties:
            name: { type: string }
        asn: { type: integer }
        is_anonymous: { type: boolean }
    BlockedToken:
      type: object
      properties:
        household_token: { type: string }
        expire_time: { type: integer, format: int64 }
    BlockedUserAgent:
      type: object
      properties:
        user_agent: { type: string }
    BlockedReferrer:
      type: object
      properties:
        referrer: { type: string }
    DiscoveryHost:
      type: object
      properties:
        name: { type: string }
    DiscoveryNamespace:
      type: object
      properties:
        namespace: { type: string }
        confd_uri: { type: string }
    HealthStatus:
      type: object
      properties:
        status: { $ref: '#/components/schemas/StatusValue' }
    StatusValue:
      type: string
      enum: [Ok, Fail]

Using the OpenAPI Specification

Generating API Clients

The OpenAPI specification can be used to generate client libraries in multiple languages:

Using openapi-generator:

# Generate Python client
openapi-generator generate -i openapi.yaml -g python -o ./python-client

# Generate TypeScript client
openapi-generator generate -i openapi.yaml -g typescript-axios -o ./typescript-client

# Generate Go client
openapi-generator generate -i openapi.yaml -g go -o ./go-client

Using swagger-codegen:

swagger-codegen generate -i openapi.yaml -l python -o ./python-client

Validating the Specification

To validate the OpenAPI specification:

# Using swagger-cli
swagger-cli validate openapi.yaml

# Using spectral
spectral lint openapi.yaml
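Generated clients are optional. As a dependency-free sketch, the login and token request bodies can be assembled directly from the LoginRequest and TokenRequest schemas in the specification. The helper names login_body and token_body and the sample values are illustrative; only the field names and the grant_type value session come from the specification.

```python
import json


def login_body(email: str, password: str) -> bytes:
    """Body for POST /v1/auth/login (LoginRequest)."""
    return json.dumps({"email": email, "password": password}).encode("utf-8")


def token_body(login_response: dict, scope=None) -> bytes:
    """Body for POST /v1/auth/token (TokenRequest), built from a LoginResponse."""
    body = {
        "session_id": login_response["session_id"],
        "session_token": login_response["session_token"],
        "grant_type": "session",  # the only value allowed by the schema enum
    }
    if scope is not None:
        body["scope"] = scope  # optional in the schema
    return json.dumps(body).encode("utf-8")


# Hypothetical LoginResponse values for demonstration.
fake_login = {"session_id": "sid-1", "session_token": "tok-1"}
print(token_body(fake_login).decode("utf-8"))
```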

11 - Troubleshooting Guide

Common issues and resolution procedures

Overview

This guide provides troubleshooting procedures for common issues encountered when operating the AgileTV CDN Manager (ESB3027). Use the diagnostic commands and resolution steps to identify and resolve problems.

Diagnostic Tools

Cluster Status

# Check node status
kubectl get nodes

# Check all pods
kubectl get pods -A

# Check events sorted by time
kubectl get events --sort-by='.lastTimestamp'

# Check resource usage
kubectl top nodes
kubectl top pods

Component Status

# Check deployments
kubectl get deployments

# Check statefulsets
kubectl get statefulsets

# Check persistent volumes
kubectl get pvc
kubectl get pv

# Check services
kubectl get services

# Check ingress
kubectl get ingress
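For larger clusters it can be quicker to filter the JSON form of the pod list than to read the table output. The Python sketch below scans a parsed kubectl get pods -A -o json document for pods that are Pending or have a container in a waiting state such as ContainerCreating or CrashLoopBackOff. The function name and sample data are illustrative; the field names are those of the Kubernetes Pod API.

```python
def problem_pods(pod_list: dict) -> list:
    """Return (namespace, name, reason) for pods that look unhealthy."""
    findings = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        reason = "Pending" if status.get("phase") == "Pending" else None
        for cs in status.get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting")
            if waiting and waiting.get("reason"):
                # A waiting reason is more specific than the pod phase.
                reason = waiting["reason"]
        if reason:
            findings.append((meta.get("namespace"), meta.get("name"), reason))
    return findings


# Hypothetical, heavily trimmed pod list for demonstration.
sample = {
    "items": [
        {"metadata": {"namespace": "default", "name": "ok"},
         "status": {"phase": "Running",
                    "containerStatuses": [{"state": {"running": {}}}]}},
        {"metadata": {"namespace": "default", "name": "unscheduled"},
         "status": {"phase": "Pending"}},
        {"metadata": {"namespace": "default", "name": "looping"},
         "status": {"phase": "Running",
                    "containerStatuses": [
                        {"state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}},
    ]
}
for finding in problem_pods(sample):
    print(finding)
```

To use it against a live cluster, save the kubectl output to a file and pass the result of json.load to problem_pods.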

Common Issues

Pods Stuck in Pending State

Symptoms: Pods remain in Pending state indefinitely.

Causes:

  • Insufficient cluster resources (CPU/memory)
  • No nodes match scheduling constraints
  • PersistentVolume not available

Diagnosis:

# Describe the pending pod
kubectl describe pod <pod-name>

# Check events for scheduling failures
kubectl get events --field-selector reason=FailedScheduling

# Check node capacity
kubectl describe nodes | grep -A 5 "Allocated resources"

# Check available PVs
kubectl get pv

Resolution:

# Free up resources by scaling down non-critical workloads
kubectl scale deployment <deployment> --replicas=0

# Or add additional nodes to the cluster

# If the PVC is stuck, delete and recreate it
# Warning: deleting a PVC can permanently remove its data, depending on the reclaim policy
kubectl delete pvc <pvc-name>
kubectl delete pod <pod-name>

Pods Stuck in ContainerCreating

Symptoms: Pods remain in ContainerCreating state.

Causes:

  • Image pull failures
  • Volume mount issues
  • Network configuration problems

Diagnosis:

kubectl describe pod <pod-name>

# Check for image pull errors
kubectl get events | grep -i "failed to pull"

# Check volume mount status
kubectl get events | grep -i "mount"

Resolution:

# For image pull issues, verify image exists and credentials
kubectl get secret <pull-secret-name> -o yaml

# For volume issues, check Longhorn volume status
kubectl get volumes -n longhorn-system

# Delete stuck pod to trigger recreation
kubectl delete pod <pod-name> --force --grace-period=0

Persistent Volume Mount Failures

Symptoms: Pod fails to start with error “AttachVolume.Attach failed for volume… is not ready for workloads” or similar volume attachment errors.

Causes:

  • Longhorn volume was created but cannot be mounted
  • Network connectivity issues between nodes (Longhorn requires iSCSI and NFS traffic)
  • Longhorn service unhealthy
  • Incorrect storage class configuration

Diagnosis:

# Describe the failing pod to see the error
kubectl describe pod <pod-name>

# Check Longhorn volumes status
kubectl get volumes -n longhorn-system

# Check Longhorn UI for detailed volume status
kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80
# Access: http://localhost:8080

Resolution:

# Verify firewall allows Longhorn traffic between nodes
# Ports 9500 and 8500 must be open (see Networking Guide)

# Check Longhorn is healthy
kubectl get pods -n longhorn-system

# If volume is stuck, delete PVC and pod to trigger recreation
kubectl delete pvc <pvc-name>
kubectl delete pod <pod-name>

Pods in CrashLoopBackOff

Symptoms: Pods repeatedly crash and restart.

Causes:

  • Application configuration errors
  • Missing dependencies (database not ready)
  • Resource limits too low
  • Liveness probe failures

Diagnosis:

# View current logs
kubectl logs <pod-name>

# View previous instance logs
kubectl logs <pod-name> -p

# Describe pod for restart reasons
kubectl describe pod <pod-name>

# Check if dependencies are healthy
kubectl get pods | grep -E "(postgres|kafka|redis)"

Resolution:

# For dependency issues, wait for dependencies to be ready
kubectl wait --for=condition=Ready pod/<dependency-pod> --timeout=300s

# For resource issues, increase limits
kubectl edit deployment <deployment-name>

# For configuration issues, check ConfigMaps and Secrets
kubectl get configmap <configmap-name> -o yaml
kubectl get secret <secret-name> -o yaml

# Restart the deployment
kubectl rollout restart deployment/<deployment-name>

Pods in Terminating State

Symptoms: Pods stuck in Terminating state indefinitely.

Causes:

  • Volume detachment issues
  • Node communication problems
  • Finalizer blocking deletion

Diagnosis:

kubectl describe pod <pod-name>

# Check if node is reachable
kubectl get nodes

# Check finalizers
kubectl get pod <pod-name> -o jsonpath='{.metadata.finalizers}'

Resolution:

# Force delete the pod
kubectl delete pod <pod-name> --force --grace-period=0

# If node is unreachable, drain and remove from cluster
kubectl drain <node-name> --ignore-daemonsets --force
kubectl delete node <node-name>
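Before force-deleting, it helps to see which pods are actually marked for deletion and what is holding them. The Python sketch below filters a parsed kubectl get pods -o json document for pods with a deletionTimestamp and reports their node and finalizers. The function name and sample data are illustrative, including the finalizer string; the field names are those of the Kubernetes Pod API.

```python
def stuck_terminating(pod_list: dict) -> list:
    """List pods marked for deletion, with their node and finalizers."""
    stuck = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        # deletionTimestamp is set once deletion of the pod has been requested.
        if "deletionTimestamp" in meta:
            stuck.append({
                "name": meta.get("name"),
                "node": pod.get("spec", {}).get("nodeName"),
                "finalizers": meta.get("finalizers", []),
            })
    return stuck


# Hypothetical, trimmed pod list for demonstration.
sample = {
    "items": [
        {"metadata": {"name": "healthy"}, "spec": {"nodeName": "node-1"}},
        {"metadata": {"name": "going-away",
                      "deletionTimestamp": "2024-01-01T00:00:00Z",
                      "finalizers": ["example.com/cleanup"]},
         "spec": {"nodeName": "node-2"}},
    ]
}
print(stuck_terminating(sample))
```

Pods that share one node in this list point to a node communication problem rather than a per-pod finalizer.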

Service Unreachable

Symptoms: Service endpoints not accessible.

Causes:

  • No ready pods backing the service
  • Network policy blocking traffic
  • Service port mismatch

Diagnosis:

# Check service endpoints
kubectl get endpoints <service-name>

# Check if pods are ready
kubectl get pods -l app=<label>

# Check network policies
kubectl get networkpolicies

# Test connectivity from within cluster
kubectl run test --rm -it --image=busybox -- wget -O- <service-name>:<port>

Resolution:

# Ensure pods are ready and matching service selector
kubectl get pods --show-labels

# Check service selector matches pod labels
kubectl get service <service-name> -o jsonpath='{.spec.selector}'

# Temporarily disable network policy for testing
kubectl edit networkpolicy <policy-name>
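Comparing the selector with pod labels by eye is error-prone. As a sketch, the service's selector can be turned into a label query and used to list exactly the pods the service would target. The function name pods_for_service is hypothetical; the go-template output format is built into kubectl.

```shell
# List the pods matched by a service's selector.
# Usage: pods_for_service <service-name>
pods_for_service() {
  svc=$1
  # Render the selector map as "key=value,key=value,"
  sel=$(kubectl get service "$svc" -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}')
  sel=${sel%,}   # drop the trailing comma
  if [ -z "$sel" ]; then
    echo "service $svc has no selector" >&2
    return 1
  fi
  kubectl get pods -l "$sel" -o wide
}
```

If this returns no pods, or only pods that are not Ready, the service has no usable endpoints, which matches the first cause listed above.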

Ingress Not Working

Symptoms: External access via ingress fails.

Causes:

  • Traefik ingress controller not running
  • Ingress configuration errors
  • TLS certificate issues
  • DNS resolution problems

Diagnosis:

# Check Traefik pods
kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik

# Check ingress resources
kubectl get ingress

# Describe ingress for errors
kubectl describe ingress <ingress-name>

# Check Traefik logs
kubectl logs -n kube-system -l app.kubernetes.io/name=traefik

# Test DNS resolution
nslookup <hostname>

Resolution:

# Restart Traefik
kubectl rollout restart deployment -n kube-system traefik

# Fix ingress configuration
kubectl edit ingress <ingress-name>

# Renew or recreate TLS secret
kubectl create secret tls <secret-name> --cert=tls.crt --key=tls.key \
  --dry-run=client -o yaml | kubectl apply -f -

# Verify hostname matches certificate (clients match against the subjectAltName entries)
openssl x509 -in tls.crt -noout -subject -issuer -ext subjectAltName

Database Connection Failures

Symptoms: Application cannot connect to PostgreSQL.

Causes:

  • PostgreSQL cluster not ready
  • Connection pool exhausted
  • Network connectivity issues
  • Authentication failures

Diagnosis:

# Check PostgreSQL cluster status
kubectl get clusters

# Check PostgreSQL pods
kubectl get pods -l app.kubernetes.io/name=postgresql

# Check PostgreSQL logs
kubectl logs -l app.kubernetes.io/name=postgresql

# Test connectivity
kubectl exec -it <app-pod> -- psql -h <postgres-service> -U <user> -d <database>

Resolution:

# Wait for PostgreSQL to be ready
kubectl wait --for=condition=Ready pod -l app.kubernetes.io/name=postgresql --timeout=300s

# Check connection string in application config (decode one key at a time;
# each value under .data is base64-encoded separately)
kubectl get secret <secret-name> -o jsonpath='{.data.<key>}' | base64 -d

# Restart application pods
kubectl rollout restart deployment/<deployment-name>

Kafka Connection Issues

Symptoms: Application cannot connect to Kafka.

Causes:

  • Kafka controllers not ready
  • Topic not created
  • Network connectivity issues

Diagnosis:

# Check Kafka pods
kubectl get pods -l app.kubernetes.io/name=kafka

# Check Kafka logs
kubectl logs -l app.kubernetes.io/name=kafka

# List topics
kubectl exec -it <kafka-pod> -- kafka-topics.sh --bootstrap-server localhost:9092 --list

Resolution:

# Wait for Kafka controllers to be ready
kubectl wait --for=condition=Ready pod -l app.kubernetes.io/name=kafka --timeout=300s

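# Create missing topic (the replication factor cannot exceed the number of brokers;
# lower it if fewer than three brokers are running, for example on a single-node lab deployment)
kubectl exec -it <kafka-pod> -- kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic <topic-name> --partitions 3 --replication-factor 3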

# Restart application to reconnect
kubectl rollout restart deployment/<deployment-name>
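After creating a topic, describing it confirms the partition count, replication factor, and in-sync replicas. A minimal sketch follows; the function name describe_topic is hypothetical, and the script name and bootstrap address are taken from the commands above.

```shell
# Describe a single topic from inside a Kafka pod.
# Usage: describe_topic <kafka-pod> <topic-name>
describe_topic() {
  kubectl exec "$1" -- kafka-topics.sh --bootstrap-server localhost:9092 \
    --describe --topic "$2"
}
```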

Redis Connection Issues

Symptoms: Application cannot connect to Redis.

Diagnosis:

# Check Redis pods
kubectl get pods -l app.kubernetes.io/name=redis

# Check Redis logs
kubectl logs -l app.kubernetes.io/name=redis

# Test connectivity
kubectl exec -it <redis-pod> -- redis-cli ping

Resolution:

# Wait for Redis to be ready
kubectl wait --for=condition=Ready pod -l app.kubernetes.io/name=redis --timeout=300s

# Restart application
kubectl rollout restart deployment/<deployment-name>

High Memory Usage

Symptoms: Pods approaching or hitting memory limits.

Diagnosis:

# Check memory usage
kubectl top pods

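# Check for OOMKilled containers (a restarted container leaves the pod Running,
# so inspect the last termination reason rather than the pod phase)
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' | grep OOMKilled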

# Check for memory leaks in logs
kubectl logs <pod-name> | grep -i "memory\|oom"

Resolution:

# Temporarily increase memory limit
kubectl edit deployment <deployment-name>

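# Or scale horizontally (if an HPA manages the deployment it overrides a manual
# replica count; check with "kubectl get hpa" and adjust the HPA instead)
kubectl scale deployment <deployment-name> --replicas=<n>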

# Long-term: Update values.yaml and perform helm upgrade
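Before raising a limit it is useful to see what is currently set, and the change itself can be made without opening an editor. The sketch below uses two hypothetical helper names, show_memory_limits and set_memory_limit. A change made this way is not recorded in values.yaml, so the long-term step above still applies.

```shell
# Show the memory request and limit of each container in a deployment (tab-separated).
# Usage: show_memory_limits <deployment-name>
show_memory_limits() {
  kubectl get deployment "$1" -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\t"}{.resources.requests.memory}{"\t"}{.resources.limits.memory}{"\n"}{end}'
}

# Set the memory limit on all containers of a deployment.
# Usage: set_memory_limit <deployment-name> <limit, e.g. 2Gi>
set_memory_limit() {
  kubectl set resources deployment "$1" --limits=memory="$2"
}
```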

High CPU Usage

Symptoms: Pods consistently using high CPU.

Diagnosis:

# Check CPU usage
kubectl top pods

# Check for runaway processes
kubectl top pods --sort-by=cpu

Resolution:

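# Scale horizontally (if an HPA manages the deployment it overrides a manual
# replica count; check with "kubectl get hpa" and adjust the HPA instead)
kubectl scale deployment <deployment-name> --replicas=<n>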

# Or increase CPU limits
kubectl edit deployment <deployment-name>

Persistent Volume Issues

Symptoms: PVC not binding or volume errors.

Diagnosis:

# Check PVC status
kubectl get pvc

# Check PV status
kubectl get pv

# Check Longhorn volumes
kubectl get volumes -n longhorn-system

# Check Longhorn UI for details
kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80

Resolution:

# For stuck PVC, delete and recreate
# Warning: deleting a bound PVC can delete its volume and data, depending on the reclaim policy
kubectl delete pvc <pvc-name>
kubectl delete pod <pod-name>

# For Longhorn issues, check Longhorn UI
# Access via http://localhost:8080

# Recreate Longhorn volume if necessary

Zitadel Authentication Failures

Symptoms: Users cannot authenticate via Zitadel.

Causes:

  • CORS configuration mismatch
  • External domain misconfigured
  • Zitadel pods not healthy

Diagnosis:

# Check Zitadel pods
kubectl get pods -l app.kubernetes.io/name=zitadel

# Check Zitadel logs
kubectl logs -l app.kubernetes.io/name=zitadel

# Verify external domain configuration
helm get values acd-manager -o yaml | grep -A 5 zitadel

Resolution:

# Ensure global.hosts.manager[0].host matches zitadel.zitadel.ExternalDomain
# Update values.yaml if needed

helm upgrade acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml

# Restart Zitadel
kubectl rollout restart deployment -l app.kubernetes.io/name=zitadel

Certificate Errors

Symptoms: TLS/SSL errors in browser or API calls.

Diagnosis:

# Check certificate expiration
kubectl get secret <tls-secret> -o jsonpath='{.data.tls\.crt}' | base64 -d | \
  openssl x509 -noout -dates

# Check certificate subject
kubectl get secret <tls-secret> -o jsonpath='{.data.tls\.crt}' | base64 -d | \
  openssl x509 -noout -subject -issuer

Resolution:

# Renew self-signed certificate
helm upgrade acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml \
  --set ingress.selfSigned=true

# Or update manual certificate
kubectl create secret tls <secret-name> \
  --cert=new-cert.crt --key=new-key.key \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart pods to pick up new certificate
kubectl rollout restart deployment <deployment-name>
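The expiration check above can be turned into a pass/fail test, which is convenient for catching certificates before they expire. The following is a sketch under stated assumptions: the function name cert_expires_within is hypothetical, and it relies on the -checkend option of openssl x509, which exits non-zero when the certificate expires within the given number of seconds.

```shell
# Report whether the certificate in a TLS secret expires within N days.
# Usage: cert_expires_within <tls-secret> <days>
cert_expires_within() {
  secret=$1
  days=$2
  if kubectl get secret "$secret" -o jsonpath='{.data.tls\.crt}' \
      | base64 -d \
      | openssl x509 -noout -checkend "$((days * 86400))" >/dev/null; then
    echo "$secret: valid for at least $days more days"
  else
    echo "$secret: expires within $days days (or could not be read)"
    return 1
  fi
}
```

For example, cert_expires_within <tls-secret> 30 returns non-zero when the certificate has 30 days or less remaining.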

Log Collection

Collecting Logs for Support

# Capture timestamp once to ensure consistency
TS=$(date +%Y%m%d-%H%M%S)

# Create log collection directory
mkdir -p ~/cdn-logs-$TS
cd ~/cdn-logs-$TS

# Collect pod logs (all containers in each pod)
for pod in $(kubectl get pods -o name); do
  kubectl logs "$pod" --all-containers > "${pod#pod/}.log" 2>&1
  kubectl logs "$pod" --all-containers -p > "${pod#pod/}.previous.log" 2>&1 || true
done

# Collect cluster events
kubectl get events --sort-by='.lastTimestamp' > events.log

# Collect pod descriptions
for pod in $(kubectl get pods -o name); do
  kubectl describe $pod > ${pod#pod/}.describe.txt
done

# Compress for transfer
tar czf cdn-logs-$TS.tar.gz *.log *.txt

Emergency Procedures

Complete Cluster Recovery

If the cluster is completely down:

  1. Assess node status:

    kubectl get nodes
    
  2. Restart K3s on nodes:

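    # On each server node
    systemctl restart k3s

    # On each agent node (the service is named k3s-agent in a standard K3s install)
    systemctl restart k3s-agent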
    
  3. If primary server failed:

    • Promote another server node
    • Update load balancer/DNS to point to new primary

  4. Restore from backup if necessary:

    • See Upgrade Guide for restore procedures
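Once K3s is running again, the recovery can be verified by waiting for all nodes to become Ready and then listing any pods that have not returned to a healthy phase. A minimal sketch, with the hypothetical function name verify_cluster_recovery:

```shell
# Wait for every node to report Ready, then list pods in all namespaces
# that are neither Running nor Succeeded.
# Usage: verify_cluster_recovery [timeout, default 300s]
verify_cluster_recovery() {
  timeout=${1:-300s}
  kubectl wait --for=condition=Ready node --all --timeout="$timeout" || return 1
  kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
}
```

An empty pod list after the wait succeeds suggests the workloads have been rescheduled; anything that remains can be investigated with the per-symptom sections above.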

Data Recovery

For data recovery scenarios:

  • PostgreSQL: Use Cloudnative PG backup/restore
  • Longhorn: Restore from volume snapshots
  • Kafka: Replication handles most failures

Getting Help

If issues persist:

  1. Collect logs using the procedure above
  2. Check release notes for known issues
  3. Contact support with log bundle and issue description

Next Steps

After resolving issues:

  1. Operations Guide - Preventive maintenance procedures
  2. Configuration Guide - Verify configuration is correct
  3. Architecture Guide - Understand component dependencies

12 - Glossary

Terminology and definitions

Overview

This glossary defines key terms and acronyms used throughout the AgileTV CDN Manager (ESB3027) documentation.

A

ACD (Agile Content Delivery)

The overall CDN solution comprising the Manager (ESB3027) and Director (ESB3024) components.

Agent Node

A Kubernetes node that runs workloads but does not participate in the control plane. Agent nodes provide additional capacity for running application pods.

API Gateway

See NGinx Gateway.

ASN (Autonomous System Number)

A unique identifier for a network on the internet. Used in GeoIP-based routing decisions.

C

CDN Director

The ESB3024 component that handles content routing and delivery. Multiple Directors can be managed by a single CDN Manager.

Cloudnative PG (CNPG)

A Kubernetes operator that manages PostgreSQL clusters. Provides high availability, automatic failover, and backup capabilities for the Manager’s database layer.

Confd

Configuration daemon that synchronizes configuration from the Manager to CDN Directors. Runs as a sidecar or separate deployment.

CORS (Cross-Origin Resource Sharing)

A security mechanism that allows web applications to make requests to a different domain. Zitadel enforces CORS policies requiring the external domain to match the configured hostname.

CrashLoopBackOff

A Kubernetes pod state indicating the container is repeatedly crashing and being restarted. Typically indicates a configuration or dependency issue.

D

Datastore

The internal key-value storage system used by the Manager for short-lived or simple structured data. Backed by Redis.

Descheduler

A Kubernetes component that periodically analyzes pod distribution and evicts pods from overutilized nodes to optimize cluster balance.

Director

See CDN Director.

E

EDB (EnterpriseDB)

A company that provides PostgreSQL-related software and services. The Cloudnative PG operator was originally developed by EDB.

Ephemeral Storage

Temporary storage available to pods. Used for temporary files and caches. Not persistent across pod restarts.

ESB (Edge Server Business)

The product family designation for CDN components. ESB3027 is the Manager, ESB3024 is the Director.

etcd

A distributed key-value store used by Kubernetes for cluster state management. Runs on Server nodes as part of the control plane.

F

FailedScheduling

A Kubernetes event indicating a pod could not be scheduled due to insufficient resources or scheduling constraints.

Flannel

A network overlay solution for Kubernetes. Provides VXLAN-based networking for pod-to-pod communication.

Frontend GUI

See MIB Frontend.

G

GeoIP

Geographic IP lookup service using MaxMind databases. Used for location-based routing decisions.

Grafana

A visualization and dashboard platform for time-series data. Used to display metrics collected by Telegraf and stored in VictoriaMetrics.

H

Helm Chart

A package of pre-configured Kubernetes resources. The CDN Manager is deployed via a Helm chart that handles all component installation.

HPA (Horizontal Pod Autoscaler)

A Kubernetes feature that automatically scales the number of pods based on CPU/memory utilization or custom metrics.

HTTP Server

The main API server component of the Manager, built with Actix Web (Rust framework).

I

Ingress

A Kubernetes resource that exposes HTTP/HTTPS routes from outside the cluster to services within. The CDN Manager uses Traefik as the ingress controller.

Ingress Controller

A component that implements ingress rules. The CDN Manager uses Traefik for primary ingress and NGinx for external Director communication.

K

Kafka

A distributed event streaming platform used by the Manager for asynchronous communication and event processing.

K3s

A lightweight Kubernetes distribution optimized for edge and production deployments. Used as the underlying cluster technology.

Kubernetes (K8s)

An open-source container orchestration platform. The CDN Manager runs on a K3s-based Kubernetes cluster.

L

Longhorn

A distributed block storage system for Kubernetes. Provides persistent volumes for stateful components like PostgreSQL and Kafka.

Liveness Probe

A Kubernetes health check that determines if a container is running properly. Failed liveness probes trigger container restart.

M

Manager

The central management component (ESB3027) for configuring and monitoring CDN Directors.

MaxMind

A provider of IP intelligence databases including GeoIP City, GeoLite2 ASN, and Anonymous IP databases used by the Manager.

MIB Frontend

The web-based configuration GUI for CDN operators. Provides a user interface for managing streams, routers, and other configuration.

Multi-Factor Authentication (MFA)

An authentication method requiring multiple forms of verification. Note: MFA is not currently supported in the CDN Manager and should be skipped during setup.

N

Name-based Virtual Hosting

A technique where multiple hostnames are served from the same IP address. Zitadel uses this for CORS validation.

Namespace

A Kubernetes abstraction for organizing cluster resources. The CDN Manager uses namespaces to group related components.

NGinx Gateway

An NGinx-based gateway that handles external communication with CDN Directors.

Node Token

A secret token used to authenticate new nodes joining a K3s cluster. Located at /var/lib/rancher/k3s/server/node-token on Server nodes.

O

Operator

A method of packaging, deploying, and managing a Kubernetes application. Cloudnative PG is an operator for PostgreSQL.

OOMKilled

A container termination reason reported in Kubernetes pod status, indicating the container was killed for exceeding its memory limit.

P

PDB (Pod Disruption Budget)

A Kubernetes feature that ensures a minimum number of pods remain available during voluntary disruptions like maintenance.

PersistentVolume (PV)

A piece of storage in the Kubernetes cluster. Created dynamically by Longhorn for stateful components.

PersistentVolumeClaim (PVC)

A request for storage by a pod. Bound to a PersistentVolume.

Pod

The smallest deployable unit in Kubernetes. Contains one or more containers.

PostgreSQL

An open-source relational database. Used by the Manager for persistent data storage, managed by Cloudnative PG.

Probe

A Kubernetes health check mechanism. Types include liveness, readiness, and startup probes.

Prometheus

An open-source monitoring and alerting toolkit. Telegraf exports metrics in Prometheus format.

R

RBAC (Role-Based Access Control)

A method of regulating access to resources based on user roles. Used by Kubernetes for authorization.

Readiness Probe

A Kubernetes health check that determines if a container is ready to receive traffic. Failed readiness probes remove the pod from service load balancing.

Redis

An in-memory data structure store used for caching and as the datastore backend for the Manager.

Replica

A copy of a pod. Multiple replicas provide high availability and load distribution.

Resource Preset

Predefined resource configurations (nano, micro, small, medium, large, xlarge, 2xlarge) for common deployment sizes.

Rolling Update

A deployment strategy that replaces pods incrementally, a few at a time, to maintain availability during upgrades.

S

Selection Input

A key-value storage mechanism used for configuration data that can be queried with wildcard patterns. Available in v1 and v2 APIs with different semantics.

Server Node

A Kubernetes node that participates in the control plane (etcd, API server). Can also run workloads unless tainted.

Service

A Kubernetes abstraction that defines a logical set of pods and a policy for accessing them. Provides stable networking endpoints.

ServiceAccount

A Kubernetes identity for processes running in pods. Used for authentication between Kubernetes components.

StatefulSet

A Kubernetes workload API object for managing stateful applications. Used for PostgreSQL and Kafka deployments.

Startup Probe

A Kubernetes health check that determines if a container application has started. Disables liveness and readiness checks until it succeeds.

Stream

A content stream configuration defining source and routing parameters.

T

Telegraf

An agent for collecting, processing, aggregating, and writing metrics. Runs on each node to gather system and application metrics.

TLS (Transport Layer Security)

A cryptographic protocol for secure communication. The CDN Manager uses TLS for all external HTTPS connections.

Topology Aware Hints

A Kubernetes feature that prefers routing traffic to pods in the same zone as the source. Reduces latency by keeping traffic local.

Traefik

A modern HTTP reverse proxy and ingress controller. Used as the primary ingress controller for the CDN Manager.

TTL (Time To Live)

The duration after which data expires. Used in the datastore and selection input APIs.

V

Values.yaml

The Helm chart configuration file. Contains all configurable parameters for the CDN Manager deployment.

VictoriaMetrics

A time-series database used for storing metrics data. Provides long-term storage and querying capabilities.

VXLAN

Virtual Extensible LAN. A network virtualization technology used by Flannel for pod networking.

Z

Zitadel

An identity and access management (IAM) platform used for authentication and authorization in the CDN Manager. Provides OAuth2/OIDC capabilities.

Default Credentials

The following table lists all default credentials used by the CDN Manager. Change these defaults before deploying to production.

Service          Username           Password    Notes
Zitadel Console  admin@agiletv.dev  Password1!  Primary identity management; accessed at /ui/console

Security Warning: Use the default admin@agiletv.dev account only to create a new administrator account with proper roles. After verifying the new account works, disable or delete the default admin account before exposing the system to users. For details on required roles and administrator permissions, see Zitadel’s Administrator Documentation. See the Next Steps Guide for initial configuration procedures.

Common Abbreviations

Abbreviation  Meaning
API           Application Programming Interface
ASN           Autonomous System Number
CORS          Cross-Origin Resource Sharing
CPU           Central Processing Unit
DNS           Domain Name System
EDB           EnterpriseDB
ESB           Edge Server Business
GUI           Graphical User Interface
HA            High Availability
Helm          Helm Package Manager
HPA           Horizontal Pod Autoscaler
HTTP          Hypertext Transfer Protocol
HTTPS         HTTP Secure
IAM           Identity and Access Management
IP            Internet Protocol
JSON          JavaScript Object Notation
K8s           Kubernetes
MFA           Multi-Factor Authentication
MIB           Management Information Base
NIC           Network Interface Card
OAuth         Open Authorization
OIDC          OpenID Connect
PVC           PersistentVolumeClaim
PV            PersistentVolume
RBAC          Role-Based Access Control
SSL           Secure Sockets Layer
TCP           Transmission Control Protocol
TLS           Transport Layer Security
TTL           Time To Live
UDP           User Datagram Protocol
UI            User Interface
VPA           Vertical Pod Autoscaler
VXLAN         Virtual Extensible LAN

Next Steps

After reviewing terminology:

  1. Architecture Guide - Understand component relationships
  2. Configuration Guide - Full configuration reference
  3. Operations Guide - Day-to-day operational procedures