AgileTV CDN Manager (esb3027)

Centralized Management of AgileTV CDN Director

1: Getting Started
2: System Requirements Guide
3: Networking Guide

3.1: Shared Interface Network Setup
3.2: Configuring Segregated Networks

4: Architecture Guide
5: Installation Guide

5.1: Installation Checklist
5.2: Single-Node Installation
5.3: Multi-Node Installation
5.4: Air-Gapped Deployment
5.5: Helm Chart Installation
5.6: Upgrade Guide
5.7: Next Steps

6: Configuration Guide
7: Performance Tuning Guide
8: Operations Guide
9: Metrics & Monitoring Guide

9.1: Grafana Authentication & Roles
9.2: Grafana Dashboards
9.3: Alerts & Alarms

10: API Guide

10.1: Authentication API
10.2: Health API
10.3: Selection Input API
10.4: Data Store API
10.5: Subnets API
10.6: Routing API
10.7: Discovery API
10.8: Metrics API
10.9: Configuration API
10.10: Operator UI API
10.11: OpenAPI Specification

11: Troubleshooting Guide
12: Glossary

1 - Getting Started

Introduction to AgileTV CDN Manager

Overview

The AgileTV CDN Manager (product code ESB3027) is a cloud-native control plane for managing CDN deployments. It provides centralized orchestration for authentication, configuration, routing, and metrics collection across CDN infrastructure.

Before You Start:

Deployment type: Lab (single-node) or Production (multi-node)? See Installation Guide
Hardware: Nodes meeting specifications for your deployment type
OS: RHEL 9 or compatible clone (Oracle Linux, AlmaLinux, Rocky Linux)
Software: Installation ISO from AgileTV customer portal; Extras ISO for air-gapped
Network: Firewall ports configured per Networking Guide

Deployment Models

Deployment Model	Description	Typical Use Case
Self-Hosted	K3s Kubernetes cluster on customer premises	Production deployments
Lab/Single-Node	Minimal single-node installation	Acceptance testing, demonstrations, development

Functionality remains consistent across deployment models.

Prerequisites

Installation ISO: Obtain esb3027-acd-manager-X.Y.Z.iso from AgileTV customer portal
Extras ISO (air-gapped): Obtain esb3027-acd-manager-extras-X.Y.Z.iso for offline installations
OS: RHEL 9 or compatible clone (Oracle Linux, AlmaLinux, Rocky Linux)
Kubernetes familiarity: Basic understanding of pods, deployments, and Helm charts

For detailed hardware, network, and operating system requirements, see the System Requirements Guide.

Installation

Ready to install? The Installation Guide provides step-by-step procedures for both lab and production deployments:

Lab/Single-Node: Quick deployment for testing and demonstrations
Production/Multi-Node: High-availability cluster with 3+ nodes

See the Installation Guide to get started.

Accessing the System

After a successful deployment, the following interfaces are available:

Service	URL Path	Authentication
MIB Frontend	`/gui`	Zitadel SSO
API Gateway	`/api`	Bearer token
Zitadel Console	`/ui/console`	See Glossary
Grafana	`/grafana`	See Glossary

All services are accessed via https://<cluster-host><path>.

Note: A self-signed SSL certificate is deployed by default. When accessing services through a browser, you will need to accept the self-signed certificate warning. For production deployments, configure a valid SSL certificate before exposing the system to users.
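
As a quick reachability check, the service URLs from the table above can be assembled and probed from a shell. This is a minimal sketch and not part of the product tooling: the helper names `service_url` and `probe_service` and the host name `cdn-manager.example.com` are assumptions for illustration. Certificate verification is skipped only because a self-signed certificate is deployed by default.

```shell
# Build the HTTPS URL for a service path (paths taken from the table above).
service_url() {
  local host="$1" path="$2"
  printf 'https://%s%s\n' "$host" "$path"
}

# Print the HTTP status code returned for a service path.
# --insecure skips certificate verification; drop it once a valid
# certificate has been installed.
probe_service() {
  curl --insecure --silent --output /dev/null --write-out '%{http_code}\n' \
    "$(service_url "$1" "$2")"
}

# Example (hypothetical host name):
#   for p in /gui /api /ui/console /grafana; do
#     probe_service cdn-manager.example.com "$p"
#   done
```

The status code alone does not prove a service is healthy, but a connection failure at this stage usually points to ingress or firewall configuration rather than to the application.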

Initial user configuration is performed through Zitadel. Refer to the Configuration Guide for authentication setup procedures. For detailed guidance on managing users, roles, and permissions in the Zitadel Console, see Zitadel’s User Management Documentation.

The following guides provide detailed information for specific operational tasks:

Guide	Description
System Requirements	Hardware, operating system, and network specifications
Architecture	Detailed system architecture and scaling guidance
Installation	Step-by-step installation and upgrade procedures
Configuration	System configuration and customization
Performance Tuning	Optimization tips for improved performance
API Guide	REST API reference and integration examples
Operations	Day-to-day operational procedures
Metrics & Monitoring	Monitoring dashboards and alerting configuration
Troubleshooting	Common issues and resolution procedures
Glossary	Definitions of technical terms
Release Notes	Version-specific changes and known issues

2 - System Requirements Guide

Hardware, operating system, and networking requirements

Overview

This document specifies the hardware, operating system, and networking requirements for deploying the AgileTV CDN Manager (ESB3027). Requirements vary based on deployment type and node role within the cluster.

Cluster Sizing

Production Deployments

Production deployments require a minimum of three nodes to achieve high availability. The cluster architecture employs distinct node roles:

Role	Description
Server Node (Control Plane Only)	Runs control plane components (etcd, Kubernetes API server) only; does not host application workloads; requires separate Agent nodes
Server Node (Combined)	Runs control plane components and hosts application workloads; default configuration
Agent Node	Executes application workloads only; does not participate in cluster quorum

Server nodes can be deployed in either Control Plane Only or Combined role configurations. The choice depends on your deployment requirements:

Control Plane Only: Dedicated control plane nodes with lower resource requirements; requires separate Agent nodes for workloads
Combined: Server nodes run both control plane and workloads; minimum 3 nodes required for HA

Why Use Control Plane Only Nodes?

Dedicated Control Plane Only nodes provide several benefits for larger deployments:

Resource Isolation: Control plane components (etcd, API server, scheduler) run on dedicated hardware without competing with application workloads for CPU and memory
Stability: Application workload spikes or misbehaving pods cannot impact control plane performance
Security: Smaller attack surface on control plane nodes; fewer containers and services running
Predictable Performance: Control plane responsiveness remains consistent regardless of application load
Flexible Sizing: Control Plane Only nodes can use lower-specification hardware (2 cores, 4 GiB) since they don’t run application workloads

For most small to medium deployments, Combined role servers are simpler and more cost-effective. Control Plane Only nodes are recommended for larger deployments with significant workload requirements or where control plane stability is critical.

High Availability Considerations

Production deployments require 3 nodes running the control plane (etcd) and 3 nodes capable of running workloads. These can be the same nodes (Combined role) or separate nodes (CP-Only + Agent).

Node Role Combinations:

Configuration	Control Plane Nodes	Workload Nodes	Total Nodes
All Combined	3 Combined servers	3 Combined servers	3
Separated	3 CP-Only servers	3 Agent nodes	6
Hybrid	2 CP-Only + 1 Combined	1 Combined + 2 Agent	5

Any combination works as long as you have 3 control plane nodes and 3 workload-capable nodes.

Note: Regardless of the deployment configuration, a minimum of 3 nodes capable of running workloads is required for production deployments. This ensures both high availability and sufficient capacity for application pods.
For detailed fault tolerance information and data replication strategies, see the Architecture Guide.
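
The rule behind the table above (at least 3 control plane nodes and at least 3 workload-capable nodes) can be sketched as a small planning check. The function name `check_ha_roles` is hypothetical; the counting simply follows the role definitions given earlier, where Combined servers count toward both totals.

```shell
# Usage: check_ha_roles <combined_servers> <cp_only_servers> <agents>
check_ha_roles() {
  local combined="$1" cp_only="$2" agents="$3"
  # Combined servers run the control plane and workloads, so they count twice.
  local control_plane=$(( combined + cp_only ))
  local workload=$(( combined + agents ))
  if [ "$control_plane" -ge 3 ] && [ "$workload" -ge 3 ]; then
    echo "ok: ${control_plane} control plane, ${workload} workload-capable"
  else
    echo "insufficient: ${control_plane} control plane, ${workload} workload-capable"
    return 1
  fi
}

check_ha_roles 3 0 0   # All Combined
check_ha_roles 0 3 3   # Separated
check_ha_roles 1 2 2   # Hybrid
```

All three calls correspond to rows of the table and report 3 control plane and 3 workload-capable nodes.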

Hardware Requirements

Single-Node Lab Deployment

Lab deployments are intended for acceptance testing, demonstrations, and development only. These configurations are not suitable for production workloads.

Resource	Minimum	Recommended
CPU	8 cores	12 cores
Memory	16 GiB	24 GiB
Disk*	128 GiB	128 GiB

Production Cluster - Server Node (Control Plane Only)

Server nodes dedicated to control plane functions have modest resource requirements:

Resource	Minimum	Recommended
CPU	2 cores	4 cores
Memory	4 GiB	8 GiB
Disk*	64 GiB	128 GiB

These nodes run only control plane components and require separate Agent nodes to run application workloads.

Production Cluster - Server Node (Control Plane + Workloads)

Combined role nodes require resources for both control plane and application workloads:

Resource	Minimum	Recommended
CPU	4 cores	16 cores
Memory	8 GiB	32 GiB
Disk*	100 GiB	500 GiB

Production Cluster - Agent Node

Agent nodes execute application workloads and require the following resources:

Resource	Minimum	Recommended
CPU	4 cores	16 cores
Memory	8 GiB	32 GiB
Disk*	100 GiB	500 GiB

Storage Notes

* Disk Space: All disk space values must be available in the /var/lib/longhorn partition. It is recommended that /var/lib/longhorn be a separate partition on a fast SSD for optimal performance, though SSD is not strictly required.
Longhorn Capacity: Longhorn storage requires an additional 30% capacity headroom for internal operations and scaling. If less than 30% of the total partition capacity is available, Longhorn may mark volumes as “full” and prevent further writes. Plan disk capacity accordingly.
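
The headroom rule can be checked with simple integer arithmetic. The sketch below assumes the rule means "at least 30% of the partition's total capacity must remain available"; the function name `longhorn_headroom_ok` is hypothetical, and the commented `df` invocation assumes GNU coreutils.

```shell
# Usage: longhorn_headroom_ok <total_gib> <available_gib>
# Succeeds when at least 30% of the partition is still available.
longhorn_headroom_ok() {
  local total="$1" avail="$2"
  [ $(( avail * 100 )) -ge $(( total * 30 )) ]
}

# Example against the live partition (GNU df; sizes in GiB):
#   set -- $(df --output=size,avail -BG /var/lib/longhorn | tail -n 1 | tr -d 'G')
#   longhorn_headroom_ok "$1" "$2" || echo "less than 30% free on /var/lib/longhorn"
```

For a 500 GiB partition, this check passes while 150 GiB or more remains available and fails below that.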

Storage Performance

For optimal performance, the following storage characteristics are recommended:

Disk Type: SSD or NVMe storage for Longhorn volumes
Filesystem: XFS or ext4 with default mount options
Partition Layout: Dedicated /var/lib/longhorn partition for persistent storage

Virtual machines and bare-metal hardware are both supported. Nested virtualization (running multiple nodes under a single hypervisor) may impact performance and is not recommended for production deployments.

Operating System Requirements

Supported Operating Systems

The CDN Manager supports Red Hat Enterprise Linux and compatible distributions:

Operating System	Status
Red Hat Enterprise Linux 9	Supported
Red Hat Enterprise Linux 10	Untested
Red Hat Enterprise Linux 8	Not supported

Compatible Clones

The following RHEL-compatible distributions are supported when major version requirements are satisfied:

Oracle Linux 9
AlmaLinux 9
Rocky Linux 9

Air-Gapped Deployments

Important: For air-gapped deployments (no internet access), the OS installation ISO must be mounted on all nodes before running the installer or join commands. The installer needs to install one or more packages from the distribution’s repository.

Oracle Linux UEK Kernel

Note: For Oracle Linux 9.7 and later using the Unbreakable Enterprise Kernel (UEK), you must install the kernel-uek-modules-extra-netfilter-$(uname -r) package before running the installer:
# Mount OS ISO first (required for air-gapped)
mkdir -p /mnt/iso
mount -o loop /path/to/oracle-linux-9.iso /mnt/iso

# Install required kernel modules
dnf install kernel-uek-modules-extra-netfilter-$(uname -r)
This package provides netfilter kernel modules required by K3s and Longhorn.

SELinux

SELinux is supported when running in “Enforcing” mode. The installation process will configure appropriate SELinux policies automatically.

Networking Requirements

Network Interface

Each cluster node must have at least one network interface card (NIC) that carries the node's default route. If the node lacks a pre-configured default route, one must be established prior to installation.

Port Requirements

The cluster requires the following network connectivity:

Category	Ports	Purpose
Inter-Node	2379-2380, 6443, 8472/UDP, 10250, 5001, 9500, 8500	etcd, API server, Flannel VXLAN, Kubelet, Spegel, Longhorn
External Access	80, 443	HTTP redirect and HTTPS ingress
Application (optional)	6379, 8125 TCP/UDP, 9093, 9095	Redis, Telegraf, Alertmanager, Kafka external

Important: Complete port requirements, network ranges, and firewall configuration procedures are provided in the Networking Guide. Do not expose VictoriaMetrics (8428, 8429), Grafana (3000), or PostgreSQL (5432) directly—access these services only through the secure HTTPS ingress (port 443).

Resource Planning

Calculating Cluster Capacity

When planning cluster capacity, consider the following factors:

Base Overhead: Kubernetes system components consume approximately 1-2 cores and 2-4 GiB memory per node
Application Workloads: Refer to individual component resource requirements in the Architecture Guide
Headroom: Maintain 20-30% resource headroom for workload spikes and automatic scaling

Scaling Considerations

The CDN Manager supports horizontal scaling for most components. The Horizontal Pod Autoscaler (HPA) can automatically adjust replica counts based on resource utilization. Detailed scaling guidance is available in the Architecture Guide.

Example Production Deployment

A minimal production deployment with 3 server nodes (combined role) and 2 agent nodes, each at the minimum per-node specification listed above, would require:

Node Type	Count	CPU Total	Memory Total	Disk Total
Server (Combined)	3	12 cores	24 GiB	300 GiB
Agent	2	8 cores	16 GiB	200 GiB
Total	5	20 cores	40 GiB	500 GiB

This configuration provides:

High availability (survives loss of 1 server node)
Capacity for application workloads across all nodes
Headroom for horizontal scaling
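
The totals in the table are per-node minimums multiplied by node count. A minimal sketch of that arithmetic follows, so it can be repeated for other node counts or for the recommended specifications; the helper names `node_totals` and `add_totals` are hypothetical.

```shell
# Usage: node_totals <count> <cores_per_node> <mem_gib_per_node> <disk_gib_per_node>
# Prints "<cores> <mem_gib> <disk_gib>" for that group of nodes.
node_totals() {
  echo "$(( $1 * $2 )) $(( $1 * $3 )) $(( $1 * $4 ))"
}

# Sum two "<cores> <mem_gib> <disk_gib>" triples.
add_totals() {
  set -- $1 $2
  echo "$(( $1 + $4 )) $(( $2 + $5 )) $(( $3 + $6 ))"
}

servers=$(node_totals 3 4 8 100)   # 3 Combined servers at minimum spec
agents=$(node_totals 2 4 8 100)    # 2 Agent nodes at minimum spec
add_totals "$servers" "$agents"    # prints: 20 40 500
```

These figures do not include the Kubernetes base overhead or the 20-30% headroom described under Calculating Cluster Capacity.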

Next Steps

After verifying system requirements:

Review the Installation Guide for deployment procedures
Consult the Networking Guide for firewall configuration
Examine the Architecture Guide for component resource requirements

3 - Networking Guide

Network architecture and configuration guides

Network Architecture

Physical Network

Each cluster node must have at least one network interface card (NIC) that carries the node's default route. If the node lacks a pre-configured default route, one must be established prior to installation.

K3s requires a default route to auto-detect the node’s primary IP and for kube-proxy ClusterIP routing to function properly. If no default route exists, create a dummy interface as a workaround:

ip link add dummy0 type dummy
ip link set dummy0 up
ip addr add 203.0.113.254/31 dev dummy0
ip route add default via 203.0.113.255 dev dummy0 metric 1000
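
Before applying the workaround, it may be worth confirming that the node really has no default route. The following sketch filters `ip route` output; the function name `has_default_route` is hypothetical, and it reads from standard input so that it can also be tried against saved output.

```shell
# Reads `ip route` output on stdin; succeeds if a default route is present.
has_default_route() {
  grep -q '^default '
}

# Usage on a node:
#   ip route show | has_default_route \
#     || echo "no default route: apply the dummy0 workaround"
```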

Overlay Network

Kubernetes creates virtual network interfaces for pods that are typically not associated with any specific firewalld zone. The cluster uses the following network ranges:

Network	CIDR	Purpose
Pod	10.42.0.0/16	Inter-pod communication
Service	10.43.0.0/16	Kubernetes service discovery

Firewall rules should target the primary physical interface. The overlay network traffic is handled by Flannel VXLAN.

Port Requirements

Inter-Node Communication

The following ports must be permitted between all cluster nodes for Kubernetes and cluster infrastructure:

Port	Protocol	Source	Destination	Purpose
2379-2380	TCP	Server nodes	Server nodes	etcd cluster communication
6443	TCP	All nodes	Server nodes	Kubernetes API server
8472	UDP	All nodes	All nodes	Flannel VXLAN overlay network
10250	TCP	All nodes	All nodes	Kubelet metrics and management
5001	TCP	All nodes	Server nodes	Spegel registry mirror
9500-9503	TCP	All nodes	All nodes	Longhorn management API
8500-8504	TCP	All nodes	All nodes	Longhorn agent communication
10000-30000	TCP	All nodes	All nodes	Longhorn data replication
3260	TCP	All nodes	All nodes	Longhorn iSCSI
2049	TCP	All nodes	All nodes	Longhorn RWX (NFS)

Application Services Ports

The following ports must be accessible for application services within the cluster:

Port	Protocol	Service
6379	TCP	Redis
9093	TCP	Alertmanager
9095	TCP	Kafka
8086	TCP	Vector (InfluxDB line protocol listener)

External Access Ports

The following ports must be accessible from external clients to cluster nodes:

Port	Protocol	Service
80	TCP	HTTP ingress (Optional, redirects to HTTPS)
443	TCP	HTTPS ingress (Required, all services)
9095	TCP	Kafka (external client connections)
6379	TCP	Redis (external client connections)
8086	TCP	Vector (InfluxDB line protocol, external metrics senders)

Network Configuration Guides

Deployment Type

Choose the guide that matches your deployment architecture:

Guide	Description	Who Should Use This
Configuring Segregated Networks	Multi-NIC deployments with air-gapped cluster backplane	Most users - If you have separate interfaces for cluster traffic and external internet access
Shared Interface Setup	Single-NIC deployments where all traffic shares one interface	Users with a single network interface for both cluster traffic and external access

Not sure which to use? If you have explicitly separate interfaces for cluster communication and external access, start with Configuring Segregated Networks. Only use the shared interface guide if your hardware is limited to a single NIC.

3.1 - Shared Interface Network Setup

Network configuration for standard single-NIC deployments where all traffic shares a single interface.

Overview

This guide covers network configuration for standard single-NIC deployments. In this architecture, all traffic—including internal cluster communication (East-West) and external internet access (North-South)—is routed through a single network interface.

Security Warning: Because all traffic shares the same interface and firewall zone, there is no physical or logical isolation between cluster management traffic and public-facing service traffic. For production environments requiring security isolation, see Configuring Segregated Networks.

Note: The installer script automatically detects if firewalld is enabled. If so, it will verify that the required inter-node ports are open through the firewall in the default zone before proceeding. If any required ports are missing, the installer will report an error and exit. Application service ports (such as Kafka, VictoriaMetrics, and Vector) are not checked by the installer as they are configurable.

For network architecture, port requirements, and general information, see the Network Architecture Overview section in the main Networking Guide.

Firewall Configuration

Assign Interface to Default Zone

Assign your primary network interface to the default zone:

firewall-cmd --permanent --zone=public --change-interface=<interface>
firewall-cmd --reload

Replace <interface> with your actual interface name (e.g., eth0).

Configure Firewall Rules

In a shared interface setup, you must manually configure firewall rules for both internal cluster traffic and external access, as K3s does not automatically manage the public zone.

# 1. Allow pod and service networks (Internal CIDRs)
firewall-cmd --permanent --zone=public --add-source=10.42.0.0/16
firewall-cmd --permanent --zone=public --add-source=10.43.0.0/16

# 2. Kubernetes and Cluster Infrastructure (East-West Traffic)
# These ports must be opened manually for the cluster to function on a single interface.
firewall-cmd --permanent --zone=public --add-port=2379-2380/tcp
firewall-cmd --permanent --zone=public --add-port=6443/tcp
firewall-cmd --permanent --zone=public --add-port=8472/udp
firewall-cmd --permanent --zone=public --add-port=10250/tcp
firewall-cmd --permanent --zone=public --add-port=5001/tcp
firewall-cmd --permanent --zone=public --add-port=9500-9503/tcp
firewall-cmd --permanent --zone=public --add-port=8500-8504/tcp
firewall-cmd --permanent --zone=public --add-port=10000-30000/tcp
firewall-cmd --permanent --zone=public --add-port=3260/tcp
firewall-cmd --permanent --zone=public --add-port=2049/tcp

# 3. External Access Ports (North-South Traffic)
firewall-cmd --permanent --zone=public --add-port=80/tcp
firewall-cmd --permanent --zone=public --add-port=443/tcp
firewall-cmd --permanent --zone=public --add-port=9095/tcp
firewall-cmd --permanent --zone=public --add-port=6379/tcp
firewall-cmd --permanent --zone=public --add-port=8086/tcp

# Apply changes
firewall-cmd --reload
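
The same port rules can be generated from a list, which is easier to review and to keep in sync with the port tables. This is a sketch only: `port_rules`, `INTER_NODE`, and `EXTERNAL` are hypothetical names, and the script prints the commands rather than running them. The two `--add-source` commands for the pod and service CIDRs are not covered here.

```shell
# Print the firewall-cmd commands that open each port in a zone.
# Usage: port_rules <zone> <port/proto>...
port_rules() {
  local zone="$1"; shift
  for port in "$@"; do
    echo "firewall-cmd --permanent --zone=${zone} --add-port=${port}"
  done
}

# Port lists copied from the commands above.
INTER_NODE="2379-2380/tcp 6443/tcp 8472/udp 10250/tcp 5001/tcp 9500-9503/tcp 8500-8504/tcp 10000-30000/tcp 3260/tcp 2049/tcp"
EXTERNAL="80/tcp 443/tcp 9095/tcp 6379/tcp 8086/tcp"

# Review the generated commands first; pipe them to `sh` as root to apply them.
port_rules public $INTER_NODE $EXTERNAL
echo "firewall-cmd --reload"
```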

Verification

Verify all port rules are applied:

firewall-cmd --zone=public --list-all

Expected output:

public (active)
  target: default
  icmp-block-inversion: no
  interfaces: eth0
  sources: 10.42.0.0/16 10.43.0.0/16
  services: dhcpv6-client ssh
  ports: 2379-2380/tcp 6443/tcp 8472/udp 10250/tcp 5001/tcp 9500-9503/tcp 8500-8504/tcp 10000-30000/tcp 3260/tcp 2049/tcp 80/tcp 443/tcp 9095/tcp 6379/tcp 8086/tcp
  protocols:
  masquerade: no
  forward-ports:
  source-ports:
  icmp-blocks:
  rich rules:

Note: Additional interfaces may appear in the zone (e.g., eth0 eth1) if firewalld auto-assigned them based on network configuration. This is expected and does not affect functionality.
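
Rather than comparing the output by eye, the open ports can be compared against a required list. The sketch below assumes the space-separated output of `firewall-cmd --zone=public --list-ports`; the function name `missing_ports` is hypothetical.

```shell
# Usage: missing_ports "<open ports>" <required port/proto>...
# Prints each required port that is absent from the open-ports list.
missing_ports() {
  local open=" $1 "; shift
  for port in "$@"; do
    case "$open" in
      *" $port "*) ;;          # already open
      *) echo "$port" ;;       # not found in the zone
    esac
  done
}

# On a node:
#   missing_ports "$(firewall-cmd --zone=public --list-ports)" \
#     2379-2380/tcp 6443/tcp 8472/udp 10250/tcp 5001/tcp 80/tcp 443/tcp
```

The comparison is textual, so a required port only matches when it is listed in exactly the same form (for example `9500-9503/tcp` as a range, not as four separate entries).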

Verify the interface is correctly assigned to the public zone:

firewall-cmd --get-active-zones

Expected output will show eth0 listed under the public zone:

public (active)
  interfaces: eth0

Troubleshooting

Nodes Cannot Communicate

Verify firewall rules allow inter-node traffic in the public zone:

firewall-cmd --zone=public --list-all

Test basic connectivity between nodes:

ping <node-ip>

Post-Installation Troubleshooting

Once the cluster is installed, if you encounter issues with pod-to-pod communication or service access, verify the following:

Flannel Interface: Ensure the flannel.1 interface is up and has the correct IP addresses.
Network Routes: Verify that the pod and service CIDR routes are present in the routing table.
Firewall Rules: Ensure all required Kubernetes and cluster ports are allowed in the public zone.
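
The first two checks can be sketched with standard `ip` commands. The helper name `has_route_prefix` is hypothetical, and the example looks only for routes in the pod network range (10.42.0.0/16) as a textual prefix match; the prefix is a parameter, so other ranges can be checked the same way.

```shell
# Reads `ip route` output on stdin; succeeds if any route destination
# starts with the given textual prefix.
# Usage: ip route show | has_route_prefix 10.42.
has_route_prefix() {
  local prefix="$1"
  awk -v p="$prefix" 'index($1, p) == 1 { found = 1 } END { exit !found }'
}

# On a node (these need a running cluster, so they are shown as comments):
#   ip -brief addr show flannel.1
#   ip route show | has_route_prefix 10.42. || echo "no pod network routes found"
```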

For detailed troubleshooting of Kubernetes-specific components (like Ingress or Pod connectivity), please refer to the Kubernetes Troubleshooting Guide.

3.2 - Configuring Segregated Networks

Multi-NIC deployment guide for air-gapped or segregated network setups

Overview

This guide covers configuring a cluster with separate interfaces for internal cluster communication and external internet access (also known as segregated or dual-homed deployments). In this setup, eth1 handles the internal cluster traffic (pod-to-pod, control plane) while eth0 provides public internet access.

Security Benefit: This configuration provides physical isolation between East-West (cluster) and North-South (external) traffic. The trusted zone allows unrestricted internal communication, while the public zone handles external access with controlled port exposure.

When configuring segregated networks with K3s, proper interface binding is essential. K3s uses the --flannel-iface flag to keep pod traffic on the private network and the --node-external-ip flag to advertise the public address for external access. Server nodes additionally require --advertise-address=<ETH1_IP> so that the API server advertises its internal (private) address. Without this flag, K3s promotes the external IP to the advertise address when --node-external-ip is set, and the kubernetes service ClusterIP endpoint then registers an address that is unreachable from within the cluster.

Important: K3s manages pod masquerading and service routing automatically. You only need to configure firewalld zones correctly and pass the proper flags to the K3s installer.
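
To make the flag combinations concrete, the sketch below assembles the network-related K3s flags for a server and for an agent. It only prints the flags; how they are handed to the installer is not shown here. The function name `k3s_network_flags` and all IP addresses are hypothetical example values.

```shell
# Usage: k3s_network_flags <server|agent> <internal_iface> <internal_ip> <external_ip>
k3s_network_flags() {
  local role="$1" iface="$2" internal_ip="$3" external_ip="$4"
  local flags="--flannel-iface=${iface} --node-external-ip=${external_ip}"
  if [ "$role" = "server" ]; then
    # Servers also pin the API server advertise address to the internal IP.
    flags="${flags} --advertise-address=${internal_ip}"
  fi
  echo "$flags"
}

k3s_network_flags server eth1 10.0.0.11 198.51.100.11
k3s_network_flags agent  eth1 10.0.0.21 198.51.100.21
```

The server line ends with `--advertise-address=10.0.0.11`, while the agent line carries only the two shared flags.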

Complete, step-by-step instructions follow.

Prerequisites

Before starting, ensure:

Operating system is installed and updated on all nodes
Network connectivity between nodes is available
SSH access is configured for all cluster nodes

Configure Firewalld Zones

This guide configures separate zones for internal cluster traffic and external access.

Assign Interfaces to Zones

K3s uses the trusted zone for the internal network to allow unrestricted pod-to-pod and control plane traffic:

# Assign eth0 (external/internet) to public zone
firewall-cmd --permanent --zone=public --change-interface=eth0

# Assign eth1 (internal/cluster) to trusted zone
firewall-cmd --permanent --zone=trusted --change-interface=eth1

# Allow pod and service CIDRs in trusted zone (required for pod communication)
firewall-cmd --permanent --zone=trusted --add-source=10.42.0.0/16
firewall-cmd --permanent --zone=trusted --add-source=10.43.0.0/16

# Reload firewall
firewall-cmd --reload

Configure Firewall Ports

Open the necessary ports on the public zone for external access:

# External access ports
firewall-cmd --permanent --zone=public --add-port=80/tcp
firewall-cmd --permanent --zone=public --add-port=443/tcp
firewall-cmd --permanent --zone=public --add-port=9095/tcp
firewall-cmd --permanent --zone=public --add-port=6379/tcp
firewall-cmd --permanent --zone=public --add-port=8086/tcp

# Apply changes
firewall-cmd --reload

Note: K3s automatically creates iptables rules for internal cluster ports (6443, 10250, 2379-2380, 8472, 5001, 9500-9503, 8500-8504, 10000-30000, 3260, 2049) when using --flannel-iface=eth1. Pod and service CIDRs (10.42.0.0/16 and 10.43.0.0/16) are already allowed in the trusted zone via the --add-source commands above.

Verify Zone Configuration

firewall-cmd --zone=public --list-all
firewall-cmd --zone=trusted --list-all

Expected output for public zone:

public (active)
  target: default
  icmp-block-inversion: no
  interfaces: eth0 eth2
  sources: 
  services: dhcpv6-client ssh cockpit
  ports: 80/tcp 443/tcp 9095/tcp 6379/tcp 8086/tcp
  protocols: 
  forward: yes
  masquerade: no
  forward-ports: 
  source-ports: 
  icmp-blocks: 
  rich rules:

Expected output for trusted zone:

trusted (active)
  target: ACCEPT
  icmp-block-inversion: no
  interfaces: eth1
  sources: 10.42.0.0/16 10.43.0.0/16
  services: ssh mdns
  ports: 
  protocols: 
  forward: yes
  masquerade: no
  forward-ports: 
  source-ports: 
  icmp-blocks: 
  rich rules:

Note: Additional interfaces may appear in a zone (e.g., eth0 eth2) if firewalld auto-assigned them based on network configuration. This is expected and does not affect functionality.
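
The interface-to-zone mapping can also be checked programmatically. This sketch parses the output of `firewall-cmd --get-active-zones`; the function name `zone_of_interface` is hypothetical, and the parser assumes zone names start in the first column with their details indented beneath them.

```shell
# Reads `firewall-cmd --get-active-zones` output on stdin and prints the
# zone that contains the given interface.
# Usage: firewall-cmd --get-active-zones | zone_of_interface eth1
zone_of_interface() {
  awk -v ifc="$1" '
    /^[^[:space:]]/     { zone = $1 }                 # unindented line: zone name
    $1 == "interfaces:" { for (i = 2; i <= NF; i++)   # indented interface list
                            if ($i == ifc) print zone }
  '
}

# On a node:
#   [ "$(firewall-cmd --get-active-zones | zone_of_interface eth1)" = "trusted" ] \
#     || echo "eth1 is not in the trusted zone"
```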

Single-NIC Alternative

If you only have a single network interface, see the Shared Interface Setup guide instead. This guide is specifically for multi-NIC deployments with separate interfaces for cluster and external traffic.

Troubleshooting

Verify Zone Configuration

If pods cannot communicate with services, verify the trusted zone has the correct sources configured:

firewall-cmd --zone=trusted --list-all

Expected output:

trusted (active)
  target: ACCEPT
  icmp-block-inversion: no
  interfaces: eth1
  sources: 10.42.0.0/16 10.43.0.0/16
  services: ssh mdns
  ports: 
  protocols: 
  forward: yes
  masquerade: no
  forward-ports: 
  source-ports: 
  icmp-blocks: 
  rich rules:

Ensure both 10.42.0.0/16 (pod network) and 10.43.0.0/16 (service network) are listed under sources. If missing, re-run:

firewall-cmd --permanent --zone=trusted --add-source=10.42.0.0/16
firewall-cmd --permanent --zone=trusted --add-source=10.43.0.0/16
firewall-cmd --reload

4 - Architecture Guide

Detailed system architecture and component overview

Overview

The AgileTV CDN Manager (ESB3027) is a cloud-native Kubernetes application designed for managing CDN operations. This guide provides a detailed description of the system architecture, component interactions, and scaling considerations.

High-Level Architecture

The CDN Manager follows a microservices architecture deployed on Kubernetes. The system is organized into logical layers:

graph LR
    Clients[API Clients] --> Ingress[Ingress Controller]
    Ingress --> Manager[Core Manager]
    Ingress --> Frontend[MIB Frontend]
    Ingress --> Grafana[Grafana]
    Manager --> Redis[(Redis)]
    Manager --> Kafka[(Kafka)]
    Manager --> PostgreSQL[(PostgreSQL)]
    Manager --> Zitadel[Zitadel IAM]
    Manager --> Confd[Configuration Service]
    Grafana --> VM[(VictoriaMetrics)]
    Confd -.-> Gateway[NGinx Gateway]
    Gateway --> Director[CDN Director]

Component Architecture

Ingress Layer

The ingress layer manages all incoming traffic to the cluster:

Component	Role
Ingress Controller	Primary ingress for all cluster traffic; routes requests to internal services based on path
NGinx Gateway	Reverse proxy for routing traffic to external CDN Directors; used by MIB Frontend to communicate with remote Confd instances on CDN Director nodes

Traffic flow:

API clients and Operator UI connect via the Ingress Controller at /api and /gui paths respectively
Grafana dashboards are accessed via the Ingress Controller at /grafana
Zitadel authentication console is accessed via the Ingress Controller at /ui/console
MIB Frontend uses NGinx Gateway when communicating with external Confd instances on CDN Director nodes

Application Services

The application layer contains the core CDN Manager services:

Component	Role	Scaling
Core Manager	Main REST API server (v1/v2 endpoints); handles authentication, configuration, routing, and discovery	Horizontally scalable via HPA
MIB Frontend	Web-based configuration GUI for operators	Horizontally scalable via HPA
Confd	Configuration service for routing configuration; synchronizes with Core Manager application	Single instance
Grafana	Monitoring and visualization dashboards	Single instance
Selection Input Worker	Consumes selection input events from Kafka and updates configuration	Single instance
Metrics Aggregator	Collects and aggregates metrics from CDN components	Single instance
Telegraf	System-level metrics collection from cluster nodes	DaemonSet (one per node)
Alertmanager	Alert routing and notification management	Single instance

Data Layer

The data layer provides persistent and ephemeral storage:

Component	Role	Scaling
Redis	In-memory caching, session storage, and ephemeral state	Master + replicas (read-only)
Kafka	Event streaming for selection input and metrics; provides durable message queue	Controller cluster (odd count)
PostgreSQL	Persistent configuration and state storage	3-node cluster with HA
VictoriaMetrics (Analytics)	Real-time and short-term metrics for operational dashboards	Single instance
VictoriaMetrics (Billing)	Long-term metrics retention (1+ years) for billing and license compliance	Single instance

External Integrations

Component	Role
Zitadel IAM	Identity and access management; provides OAuth2/OIDC authentication
CDN Director (ESB3024)	Edge routing infrastructure; receives configuration from Confd

Detailed Component Descriptions

Core Manager

The Core Manager is the central application server that exposes the REST API. It is implemented in Rust using the Actix-web framework.

Key Responsibilities:

Authentication and session management via Zitadel
Configuration document storage and retrieval
Selection input CRUD operations
Routing rule evaluation and GeoIP lookups
Service discovery for CDN Directors and edge servers
Operator UI helper endpoints

API Endpoints:

/api/v1/auth/* - Authentication (login, token, logout)
/api/v1/configuration - Configuration management
/api/v1/selection_input/* - Selection input operations
/api/v2/selection_input/* - Enhanced selection input with list operations
/api/v1/routing/* - Routing evaluation and validation
/api/v1/discovery/* - Host and namespace discovery
/api/v1/metrics - System metrics
/api/v1/health/* - Liveness and readiness probes
/api/v1/operator_ui/* - Operator helper endpoints
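
As an illustration of calling one of these endpoints with a bearer token, consider the sketch below. Several details are assumptions rather than documented behavior: the helper names `api_url` and `api_get`, the host name, the use of an HTTP GET on `/api/v1/configuration`, and a `TOKEN` environment variable that already holds a valid token (obtaining one is not shown).

```shell
# Build the URL for an API path on the cluster host.
api_url() {
  printf 'https://%s%s\n' "$1" "$2"
}

# Send a GET request with a bearer token.
# Assumes TOKEN is set in the environment and that the endpoint accepts GET.
# Add --insecure while the default self-signed certificate is in place.
api_get() {
  curl --silent --show-error \
    --header "Authorization: Bearer ${TOKEN}" \
    "$(api_url "$1" "$2")"
}

# Example (hypothetical host name):
#   export TOKEN=<token>
#   api_get cdn-manager.example.com /api/v1/configuration
```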

Runtime Modes: The Core Manager supports multiple runtime modes, each deployed as a separate container:

http-server - Primary HTTP API server (default)
metrics-aggregator - Background worker for metrics collection
selection-input - Background worker for Kafka selection input consumption

MIB Frontend

The MIB Frontend provides a web-based GUI for configuration management.

Key Features:

Intuitive web interface for CDN configuration
Real-time configuration validation
Integration with Zitadel for SSO authentication
Uses NGinx Gateway for external Director communication

Confd (Configuration Service)

Confd provides routing configuration services and synchronizes with the Core Manager application.

Key Responsibilities:

Hosts the service configuration for routing decisions
Provides API and CLI for configuration management
Synchronizes routing configuration with Core Manager
Maintains configuration state in PostgreSQL

Selection Input Worker

The Selection Input Worker processes selection input events from the Kafka stream.

Key Responsibilities:

Consumes messages from the selection_input Kafka topic
Validates and transforms input data
Updates configuration in the data store
Maintains message ordering within partitions

Scaling Limitation: The Selection Input Worker cannot be scaled beyond a single consumer per Kafka partition, as message ordering must be preserved.
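
The limitation follows from how Kafka consumer groups work: each partition is assigned to at most one consumer in a group, so additional replicas beyond the partition count stay idle. A minimal sketch of that relationship follows; the function name `active_consumers` is hypothetical, and the partition count of the selection_input topic used in the example is an assumed value.

```shell
# Usage: active_consumers <worker_replicas> <topic_partitions>
# Prints how many replicas would actually receive messages.
active_consumers() {
  if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi
}

active_consumers 3 1   # prints 1: two of the three replicas would sit idle
```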

Metrics Aggregator

The Metrics Aggregator collects and processes metrics from CDN components.

Key Responsibilities:

Polls metrics from Director instances
Aggregates usage statistics
Writes data to VictoriaMetrics (Analytics) for dashboards
Writes long-term data to VictoriaMetrics (Billing) for compliance

Telegraf

Telegraf is deployed as a DaemonSet to collect host-level metrics.

Key Responsibilities:

CPU, memory, disk, and network metrics from each node
Container-level resource usage
Kubernetes cluster metrics
Forwards metrics to VictoriaMetrics

Grafana

Grafana provides visualization and dashboard capabilities.

Features:

Pre-built dashboards for CDN monitoring
Custom dashboard support
VictoriaMetrics as data source
Alerting integration with Alertmanager

Access: https://<host>/grafana

Alertmanager

Alertmanager handles alert routing and notifications.

Key Responsibilities:

Receives alerts from Grafana and other sources
Deduplicates and groups alerts
Routes to notification channels (email, webhook, etc.)
Manages alert silencing and inhibition

Data Storage

Redis

Redis provides in-memory storage for:

User sessions and authentication tokens
Ephemeral configuration cache
Real-time state synchronization

Deployment: Master + read replicas for high availability

Kafka

Kafka provides durable event streaming for:

Selection input events
Metrics data streams
Inter-service communication

Deployment: Controller cluster with 3 replicas for production, 1 replica for lab deployments

Node Affinity: Kafka replicas must be scheduled on separate nodes to ensure high availability. The Helm chart configures pod anti-affinity rules to enforce this distribution.

Topics:

selection_input - Selection input events
metrics - Metrics data streams

Note: For lab/single-node deployments, the Kafka replica count must be set to 1 in the Helm values. Production deployments require 3 replicas for fault tolerance.

PostgreSQL

PostgreSQL provides persistent storage for:

Configuration documents
User and permission data
System state

Deployment: 3-node cluster managed by Cloudnative PG (CNPG) operator

High Availability: The CNPG operator manages automatic failover and ensures high availability:

One primary node handles read/write operations
Two replica nodes provide redundancy and can be promoted to primary on failure
Automatic failover occurs within seconds of primary node failure
Synchronous replication ensures data consistency

Note: The PostgreSQL cluster is deployed and managed automatically by the CNPG operator. Manual intervention is typically not required for normal operations.

VictoriaMetrics

Two VictoriaMetrics instances serve different purposes:

VictoriaMetrics (Analytics):

Real-time and short-term metrics storage
Supports Grafana dashboards
Retention: Configurable (typically 30-90 days)

VictoriaMetrics (Billing):

Long-term metrics retention
Billing and license compliance data
Retention: Minimum 1 year

Authentication and Authorization

Zitadel Integration

Zitadel provides identity and access management:

Authentication Flow:

User accesses MIB Frontend or API
Redirected to Zitadel for authentication
Zitadel validates credentials and issues session token
Session token exchanged for access token
Access token included in API requests (Bearer authentication)
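The last step of the flow can be sketched with curl. This is only an illustration: the helper name `print_api_request` is hypothetical, the host and token are placeholders, and the use of `/api/v1/metrics` (taken from the endpoint list) as a token-protected endpoint is an assumption. The helper prints the command rather than running it.

```shell
# Hypothetical helper: print (not run) a curl command that sends an access
# token as a Bearer header.
# Arguments: <manager-host> <api-path> <access-token>
print_api_request() {
    if [ "$#" -ne 3 ]; then
        echo "usage: print_api_request <host> <path> <token>" >&2
        return 2
    fi
    # -k accepts the default self-signed certificate; drop it once a trusted
    # certificate is installed.
    printf 'curl -k -H "Authorization: Bearer %s" https://%s%s\n' "$3" "$1" "$2"
}

# Example (placeholders, not real values):
#   print_api_request manager.example.com /api/v1/metrics "$ACCESS_TOKEN"
```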

Default Credentials: See the Glossary for default login credentials.

Access Paths:

Zitadel Console: /ui/console
API authentication: /api/v1/auth/*

CORS Configuration

Zitadel enforces Cross-Origin Resource Sharing (CORS) policies. The external hostname configured in Zitadel must match the first entry in global.hosts.manager in the Helm values.

Network Architecture

Traffic Flow

graph TB
    External[External Clients] --> Ingress[Ingress Controller]
    External --> Redis[(Redis)]
    External --> Kafka[(Kafka)]
    External --> Telegraf[Telegraf]
    Ingress --> Manager[Core Manager]
    Ingress --> Frontend[MIB Frontend]
    Ingress --> Grafana[Grafana]
    Ingress --> Zitadel[Zitadel]

Note: Certain services (Redis, Kafka, Telegraf) can be accessed directly by external clients without traversing the ingress controller. This is typically used for metrics collection, event streaming, and direct data access scenarios.

Internal Communication

All internal services communicate over the Kubernetes overlay network (Flannel VXLAN). Services discover each other via Kubernetes DNS.

External Communication

CDN Directors: Accessed via NGinx Gateway for simplified routing
MaxMind GeoIP: Local database files (no external calls)

Scaling

Horizontal Pod Autoscaler (HPA)

The following components support automatic horizontal scaling via HPA:

Component	Minimum	Maximum	Scale Metrics
Core Manager	3	8	CPU (50%), Memory (80%)
NGinx Gateway	2	4	CPU (75%), Memory (80%)
MIB Frontend	2	4	CPU (75%), Memory (90%)

Note: HPA is enabled by default in the Helm chart. The default configuration is tuned for production deployments. Adjust min/max values based on expected load and available cluster capacity.

Manual Scaling

Components can also be scaled manually by setting replica counts in the Helm values:

manager:
  replicaCount: 3
mib-frontend:
  replicaCount: 2

Important: When manually setting replica counts, you must disable the Horizontal Pod Autoscaler (HPA) for the corresponding component. If HPA remains enabled, it will override manual replica settings. To disable HPA, set autoscaling.hpa.enabled: false for the component in your Helm values.
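The same override can be expressed as Helm `--set` flags. The sketch below only prints the flags; it assumes that `autoscaling.hpa.enabled` sits directly under the component key (for example `manager.autoscaling.hpa.enabled`), which should be confirmed against the chart's values file. The helper name is hypothetical.

```shell
# Hypothetical helper: print Helm --set flags that pin a replica count and
# disable the HPA for one component.
# ASSUMPTION: autoscaling.hpa.enabled sits directly under the component key.
# Arguments: <component> <replica-count>
manual_scale_flags() {
    case "$2" in
        ''|*[!0-9]*)
            echo "replica count must be a non-negative integer" >&2
            return 2
            ;;
    esac
    printf '%s\n' "--set $1.autoscaling.hpa.enabled=false --set $1.replicaCount=$2"
}

# Example:
#   manual_scale_flags manager 3
```

The printed flags can be appended to the Helm upgrade command described in the Helm Chart Installation Guide.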

Components That Do Not Scale

The following components are not scaled by the HPA or by changing a replica count. Some can be scaled by other means, as noted:

Component	Reason
Confd	Single instance required for configuration consistency
PostgreSQL	Cloudnative PG cluster; scaled by adding replicas via operator configuration
Kafka	Scaled by adding controllers, not via replica count
VictoriaMetrics	Stateful; single instance per role
Redis	Master is single; replicas are read-only
Grafana	Single instance sufficient for dashboard access
Alertmanager	Single instance for alert routing
Selection Input Worker	Kafka message ordering requires single consumer
Metrics Aggregator	Single instance for consistent metrics aggregation

Node Scaling

Additional Agent nodes can be added to the cluster at any time to increase workload capacity. Kubernetes automatically schedules pods to nodes with available resources.

Cluster Balancing

The CDN Manager deployment includes the Kubernetes Descheduler to maintain balanced resource utilization across cluster nodes:

Automatic Rebalancing: The descheduler periodically analyzes pod distribution and evicts pods from overutilized nodes
Node Balance: Helps prevent resource hotspots by redistributing workloads across available nodes
Integration with HPA: Works in conjunction with Horizontal Pod Autoscaler to optimize both pod count and placement

The descheduler runs as a background process and does not require manual intervention under normal operating conditions.

Resource Configuration

For detailed resource preset configurations and planning guidance, see the Configuration Guide.

High Availability

Server Node Redundancy

Production deployments require a minimum of 3 Server nodes:

Survives loss of 1 server node
Maintains quorum for etcd and Kafka

For enhanced availability, use 5 Server nodes:

Survives loss of 2 server nodes
Recommended for critical production environments

For large-scale deployments, 7 or more Server nodes can be used:

Survives loss of 3+ server nodes
Suitable for high-capacity production environments
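These figures follow from majority quorum: a cluster of n server nodes stays available while more than half of its members remain, so it tolerates the loss of floor((n - 1) / 2) nodes. A minimal sketch of the arithmetic (the function name is illustrative):

```shell
# Number of server-node failures an n-member quorum can tolerate.
# A majority must remain, so floor((n - 1) / 2) members may be lost.
tolerated_failures() {
    echo $(( ($1 - 1) / 2 ))
}

for n in 3 5 7; do
    echo "$n server nodes: survives loss of $(tolerated_failures "$n")"
done
```

The same formula gives 1 for a 4-node cluster, which is why odd server counts are used: an even count adds a node without adding failure tolerance.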

Pod Distribution

Kubernetes automatically distributes pods across nodes to maximize availability:

Pods from the same deployment are scheduled on different nodes when possible
Pod Disruption Budgets (PDB) ensure minimum availability during maintenance

Data Replication

Component	Replication Strategy
Redis	Single instance (backup via Longhorn snapshots)
Kafka	Replicated partitions (default: 3)
PostgreSQL	3-node cluster via Cloudnative PG
VictoriaMetrics	Single instance (backup via snapshots)
Longhorn	Single replica with pod-node affinity

Longhorn Storage: Longhorn volumes are configured with a single replica by default. Pod scheduling is configured with node affinity to prefer scheduling pods on the same node as their persistent volume data. This approach optimizes I/O performance while maintaining data locality.

Next Steps

After understanding the architecture:

Installation Guide - Deploy the CDN Manager
Configuration Guide - Configure components for your environment
Operations Guide - Day-to-day operational procedures
Performance Tuning Guide - Optimize system performance
Metrics & Monitoring - Set up monitoring and alerting

5 - Installation Guide

Step-by-step installation and upgrade procedures

Overview

This guide provides detailed instructions for installing the AgileTV CDN Manager (ESB3027) in various deployment scenarios. The installation process varies depending on the target environment and desired configuration.

Estimated Installation Time:

Deployment Type	Time
Single-Node (Lab)	~15 minutes
Multi-Node (3 servers)	~30 minutes

Actual installation time may vary depending on hardware performance, network speed, and whether air-gapped procedures are required.

Note: These estimates assume the operating system is already installed on all nodes. OS installation is outside the scope of this guide.

Installation Types

Installation Type	Description	Use Case
Single-Node (Lab)	Minimal installation on a single host	Acceptance testing, demonstrations, development
Multi-Node (Production)	Full high-availability cluster with 3+ server nodes	Production deployments

Installation Process Summary

The installation follows a sequential process:

Prepare the host system - Verify requirements and mount the installation ISO
Install the Kubernetes cluster - Deploy K3s, Longhorn storage, and PostgreSQL
Join additional nodes (production only) - Expand the cluster for HA or capacity
Deploy the Manager application - Install the CDN Manager Helm chart
Post-installation configuration - Configure authentication, networking, and users

Quick Links

Guide	Description
Installation Checklist	Step-by-step checklist to track progress
Single-Node Installation	Lab and acceptance testing deployment
Multi-Node Installation	Production high-availability deployment
Air-Gapped Deployment	Air-gapped environment installation
Helm Chart Installation	Common helm chart deployment steps
Upgrade Guide	Upgrading from previous versions
Next Steps	Post-installation configuration tasks

Prerequisites

Before beginning installation, ensure the following requirements are met:

Hardware: Nodes meeting the System Requirements including CPU, memory, and disk specifications
Operating System: RHEL 9 or compatible clone (details); air-gapped deployments require the OS ISO mounted on all nodes
Network: Proper firewall configuration between nodes (port requirements, firewall configuration)
Software: Installation ISO obtained from AgileTV; air-gapped deployments also require the Extras ISO
Kernel Tuning: For production deployments, apply recommended sysctl settings (Performance Tuning Guide)

We recommend using the Installation Checklist to track your progress through the installation process.

Getting Help

If you encounter issues during installation:

Review the Troubleshooting Guide for common issues
Check the System Requirements to verify your environment
Consult the Release Notes for version-specific known issues

5.1 - Installation Checklist

Step-by-step checklist to track installation progress

Overview

Use this checklist to track your installation progress. Print this page or keep it open during your installation to ensure all steps are completed correctly.

Pre-Installation

Hardware and Software

Verify hardware meets System Requirements
Confirm operating system is supported (RHEL 9 or compatible clone)
Configure firewall rules between nodes (details)
Apply recommended sysctl settings (details)
Obtain installation ISO (esb3027-acd-manager-X.Y.Z.iso)

Air-Gapped Deployments

Obtain Extras ISO (esb3027-acd-manager-extras-X.Y.Z.iso)
Mount OS ISO on all nodes before installation
Verify OS packages are accessible from mounted ISO

Special Requirements

Oracle Linux UEK: Install kernel-uek-modules-extra-netfilter-$(uname -r) package
Control Plane Only nodes: Set SKIP_REQUIREMENTS_CHECK=1 if below lab minimums
SELinux: Set to “Enforcing” mode before running installer (cannot enable after)

Cluster Installation

Single-Node Deployment

Follow the Single-Node Installation Guide.

Mount installation ISO (Step 1)
Install the base cluster (Step 2)
Verify cluster status (Step 3)
Air-gapped only: Load container images (Step 4)
Create configuration file (Step 5)
Optional: Load MaxMind GeoIP databases (Step 6)
Deploy the Manager Helm chart (Step 7)
Verify deployment (Step 8)

Multi-Node Deployment

Follow the Multi-Node Installation Guide.

Primary Server Node

Mount installation ISO (Step 1)
Install the base cluster (Step 2)
Verify system pods are running (Step 2)
Retrieve the node token (Step 3)

Additional Server Nodes

Mount installation ISO (Step 5)
Join the cluster (Step 5)
Verify each node joins (Step 5)
Optional: Taint Control Plane Only nodes (Step 5b)

Agent Nodes (Optional)

Mount installation ISO (Step 6)
Join the cluster as an agent (Step 6)
Verify each agent joins (Step 6)

Cluster Verification

Verify all nodes are ready (Step 7)
Verify system pods running on all nodes (Step 7)
Air-gapped only: Load container images on each node (Step 9)

Application Deployment

Create configuration file (Step 10)
Optional: Load MaxMind GeoIP databases (Step 11)
Optional: Configure TLS certificates from trusted CA (Step 12)
Deploy the Manager Helm chart (Step 13)
Verify all pods are running and distributed (Step 14)
Configure DNS records for manager hostname (Step 15)

Post-Installation

Initial Access

Access the system via HTTPS
Accept self-signed certificate warning (if using default certificate)
Log in with default credentials (see Glossary)

Security Configuration

Create new administrator account in Zitadel
Delete or secure the default admin account
Configure additional users and permissions
Review Zitadel Administrator Documentation for role assignments

Monitoring and Operations

Access Grafana dashboards at /grafana
Review pre-built monitoring dashboards
Configure alerting rules (optional)
Set up notification channels (optional)

Next Steps

Review Next Steps Guide for additional configuration
Configure CDN routing rules
Set up GeoIP-based routing (if using MaxMind databases)
Review Operations Guide for day-to-day procedures

Troubleshooting

If you encounter issues during installation:

Check pod status: kubectl describe pod <pod-name>
Review logs: kubectl logs <pod-name>
Check cluster events: kubectl get events --sort-by='.lastTimestamp'
Review the Troubleshooting Guide for common issues

5.2 - Single-Node Installation

Lab and acceptance testing deployment

Warning: Single-node deployments are for lab environments, acceptance testing, and demonstrations only. This configuration is not suitable for production workloads. For production deployments, see the Multi-Node Installation Guide, which requires a minimum of 3 server nodes for high availability.

Air-Gapped Deployment? This guide assumes internet connectivity. For air-gapped deployments, see the Air-Gapped Deployment Guide for additional requirements and procedures.

Overview

This guide describes the installation of the AgileTV CDN Manager on a single node. This configuration is intended for lab environments, acceptance testing, and demonstrations only. It is not suitable for production workloads.

Prerequisites

Hardware Requirements

Refer to the System Requirements Guide for hardware specifications. Single-node deployments require the “Single-Node (Lab)” configuration.

Operating System

Refer to the System Requirements Guide for supported operating systems.

Software Access

Installation ISO: esb3027-acd-manager-X.Y.Z.iso
Extras ISO (air-gapped only): esb3027-acd-manager-extras-X.Y.Z.iso

Network Configuration

Ensure that required firewall ports are configured before installation. See the Networking Guide for complete firewall configuration requirements.

SELinux

If SELinux is to be used, it must be set to “Enforcing” mode before running the installer script. The installer will configure appropriate SELinux policies automatically. SELinux cannot be enabled after installation.
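A pre-flight check along these lines can catch a wrong mode before the installer runs. This is a sketch, not part of the installer: the helper name is hypothetical, and it only classifies the mode string that `getenforce` reports.

```shell
# Hypothetical pre-flight helper: classify an SELinux mode string as reported
# by `getenforce` (Enforcing, Permissive, or Disabled).
# Returns 0 when the mode is Enforcing, 1 otherwise.
check_selinux_mode() {
    case "$1" in
        Enforcing)
            echo "OK: SELinux is Enforcing"
            ;;
        Permissive|Disabled)
            echo "WARNING: SELinux is $1; switch to Enforcing before running the installer if SELinux is to be used"
            return 1
            ;;
        *)
            echo "UNKNOWN: unrecognized SELinux mode '$1'"
            return 1
            ;;
    esac
}

# Example, on a node where the SELinux tools are installed:
#   check_selinux_mode "$(getenforce)"
```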

Installation Steps

Step 1: Mount the ISO

Create a mount point and mount the installation ISO:

mkdir -p /mnt/esb3027
mount -o loop,ro esb3027-acd-manager-X.Y.Z.iso /mnt/esb3027

Replace X.Y.Z with the actual version number.
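If the step is scripted, the mount can be made safe to re-run. The sketch below is one possible wrapper; the function names and the version `1.2.3` are placeholders, and it assumes `mountpoint` (from util-linux) is available on the node.

```shell
# Hypothetical helpers for Step 1. The version string is a placeholder.

# Print the installation ISO file name for a given version.
installer_iso_name() {
    printf 'esb3027-acd-manager-%s.iso\n' "$1"
}

# Mount the ISO read-only at /mnt/esb3027 unless it is already mounted.
# Arguments: <version> [directory-containing-the-iso, default "."]
mount_installer_iso() {
    iso="${2:-.}/$(installer_iso_name "$1")"
    target=/mnt/esb3027
    if [ ! -f "$iso" ]; then
        echo "ISO not found: $iso" >&2
        return 1
    fi
    if mountpoint -q "$target" 2>/dev/null; then
        echo "$target is already mounted"
        return 0
    fi
    mkdir -p "$target" && mount -o loop,ro "$iso" "$target"
}

# Example (run as root; 1.2.3 is a placeholder version):
#   mount_installer_iso 1.2.3 /root
```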

Step 2: Install the Base Cluster

Run the installer to set up the K3s Kubernetes cluster:

/mnt/esb3027/install

This installs:

K3s Kubernetes distribution
Longhorn distributed storage
Cloudnative PG operator for PostgreSQL
Base system dependencies

The installer will configure the node as both a server and agent node.

Step 3: Verify Cluster Status

After the installer completes, verify that all components are operational before proceeding.

1. Verify the node is ready:

kubectl get nodes

Expected output:

NAME         STATUS   ROLES                       AGE   VERSION
k3s-server   Ready    control-plane,etcd,master   2m    v1.33.4+k3s1

2. Verify system pods in both namespaces are running:

# Check kube-system namespace (Kubernetes core components)
kubectl get pods -n kube-system

# Check longhorn-system namespace (distributed storage)
kubectl get pods -n longhorn-system

All pods should show Running status. If any pods are still Pending or ContainerCreating, wait until they are ready. Proceeding with incomplete system pods can cause subsequent steps to fail in unpredictable ways.
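Rather than scanning the output by eye, the status column can be filtered. The helper below is a sketch with a hypothetical name; it reads `kubectl get pods --no-headers` output on stdin, where the third column is STATUS. It treats `Completed` as healthy on the assumption that one-shot job pods may legitimately finish.

```shell
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin
# and print the name and status of every pod that is neither Running nor
# Completed. Returns 0 when nothing is printed, 1 otherwise.
pods_not_ready() {
    awk '$3 != "Running" && $3 != "Completed" { print $1, $3; bad = 1 }
         END { exit (bad ? 1 : 0) }'
}

# Example: check both system namespaces.
#   for ns in kube-system longhorn-system; do
#       kubectl get pods -n "$ns" --no-headers | pods_not_ready
#   done
```

An empty result for both namespaces means it is safe to continue.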

This verification confirms:

K3s cluster is operational
Longhorn distributed storage is running
Cloudnative PG operator is deployed
All core components are healthy before continuing

Step 4: Air-Gapped Deployments (If Applicable)

If deploying in an air-gapped environment, load container images from the extras ISO:

mkdir -p /mnt/esb3027-extras
mount -o loop,ro esb3027-acd-manager-extras-X.Y.Z.iso /mnt/esb3027-extras
/mnt/esb3027-extras/load-images

Step 5: Deploy the Manager Helm Chart

For complete instructions on deploying the CDN Manager Helm chart, including configuration file setup, MaxMind GeoIP database loading, TLS certificate configuration, deployment commands, and verification steps, see the Helm Chart Installation Guide.

That guide covers the common deployment steps that apply to all installation types. After completing the Helm chart installation steps, return here and proceed to Post-Installation below.

Post-Installation

After installation completes, proceed to the Next Steps guide for:

Initial user configuration
Accessing the web interfaces
Configuring authentication
Setting up monitoring

Accessing the System

Refer to the Accessing the System section in the Getting Started guide for service URLs and default credentials.

Note: A self-signed SSL certificate is deployed by default. You will need to accept the certificate warning in your browser.

Troubleshooting

If pods fail to start:

Check pod status: kubectl describe pod <pod-name>
Review logs: kubectl logs <pod-name>
Verify resources: kubectl top pods

See the Troubleshooting Guide for additional assistance.

Next Steps

After successful installation:

Next Steps Guide - Post-installation configuration
Configuration Guide - System configuration
Operations Guide - Day-to-day operations

Appendix: Lab Configuration File

The installation ISO includes a pre-built lab configuration at /mnt/esb3027/values-lab.yaml, designed specifically for single-node deployments. It handles single-replica settings for Kafka and Zitadel, resource sizing, and TLS configuration automatically.

Copy it as your starting point:

cp /mnt/esb3027/values-lab.yaml ~/values.yaml

At minimum, update these two fields to match your environment before deploying:

global:
  hosts:
    manager:
      - host: manager.local   # Replace with your hostname or IP

zitadel:
  zitadel:
    configmapConfig:
      ExternalDomain: manager.local   # Must match global.hosts.manager[0].host

These two values must match exactly or authentication will fail. For a full description of all available options in values-lab.yaml, see the Configuration Guide.
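Because a mismatch only shows up later as an authentication failure, it can be worth checking the file before deploying. The sketch below is a naive line-based check, not a YAML parser: it assumes the layout shown above (the first `- host:` entry after a `manager:` line, and a single `ExternalDomain:` key). The function names are hypothetical.

```shell
# Hypothetical helpers: extract the two values from a values file.
# ASSUMPTION: the file follows the layout shown above; this is plain text
# matching and will not handle arbitrary YAML.

# Print the first "- host:" value that follows a "manager:" line.
first_manager_host() {
    awk '/^[ \t]*manager:[ \t]*$/ { seen = 1; next }
         seen && /^[ \t]*-[ \t]*host:/ {
             sub(/^[^:]*:[ \t]*/, "")   # drop everything up to "host:"
             sub(/[ \t]*#.*$/, "")      # drop a trailing comment
             sub(/[ \t]+$/, "")         # drop trailing whitespace
             print; exit
         }' "$1"
}

# Print the value of the first "ExternalDomain:" key.
external_domain() {
    awk '/^[ \t]*ExternalDomain:/ {
             sub(/^[^:]*:[ \t]*/, "")
             sub(/[ \t]*#.*$/, "")
             sub(/[ \t]+$/, "")
             print; exit
         }' "$1"
}

# Return 0 when both values are present and identical, 1 otherwise.
check_manager_host_matches() {
    host_value=$(first_manager_host "$1")
    domain_value=$(external_domain "$1")
    if [ -n "$host_value" ] && [ "$host_value" = "$domain_value" ]; then
        echo "OK: $host_value"
    else
        echo "MISMATCH: host='$host_value' ExternalDomain='$domain_value'"
        return 1
    fi
}

# Example:
#   check_manager_host_matches ~/values.yaml
```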

5.3 - Multi-Node Installation

Production high-availability deployment

Overview

This guide describes the installation of the AgileTV CDN Manager across multiple nodes for production deployments. This configuration provides high availability and horizontal scaling capabilities.

Air-Gapped Deployment? This guide assumes internet connectivity. For air-gapped deployments, see the Air-Gapped Deployment Guide for additional requirements and procedures.

Prerequisites

Hardware Requirements

Refer to the System Requirements Guide for hardware specifications. Production deployments require:

Minimum 3 Server nodes (Control Plane Only or Combined role)
Optional Agent nodes for additional workload capacity

Operating System

Refer to the System Requirements Guide for supported operating systems.

Software Access

Installation ISO: esb3027-acd-manager-X.Y.Z.iso (for each node)
Extras ISO (air-gapped only): esb3027-acd-manager-extras-X.Y.Z.iso

Network Configuration

Ensure that required firewall ports are configured between all nodes before installation. See the Configuring Segregated Networks guide for the standard firewall configuration.

Note: When using segregated networks, the K3s API server on the primary node will be reachable via its internal/private interface. Consequently, when joining additional nodes, the <primary-server-ip> provided to the join script must be the internal/private IP address of the primary node to ensure the join request is routed correctly through the private network.

Single-NIC Deployments: If your nodes have only a single network interface, see the Shared Interface Setup guide instead. This guide assumes segregated networks with separate interfaces for cluster traffic (eth1) and external access (eth0).

Segregated Network Configuration

If your nodes have multiple network interfaces and you want to use a separate interface for cluster traffic (not the default route interface), configure the INSTALL_K3S_EXEC environment variable before installing the cluster or joining nodes.

For segregated networks (private cluster network on eth1 + public external access on eth0), set the following K3s flags. Server nodes take all four; agent nodes omit --advertise-address:

# For server nodes
export INSTALL_K3S_EXEC="server --node-ip=<ETH1_IP> --node-external-ip=<ETH0_IP> --flannel-iface=eth1 --advertise-address=<ETH1_IP>"

# For agent nodes  
export INSTALL_K3S_EXEC="agent --node-ip=<ETH1_IP> --node-external-ip=<ETH0_IP> --flannel-iface=eth1"

Where:

Mode: Use server for the primary node establishing the cluster, or for additional server nodes. Use agent for agent nodes joining the cluster.
--node-ip=<ETH1_IP>: The internal/private IP address of eth1 for cluster communication
--node-external-ip=<ETH0_IP>: The public IP address of eth0 for external access (LoadBalancer services, ingress)
--flannel-iface=eth1: The network interface name for Flannel VXLAN overlay traffic
--advertise-address=<ETH1_IP>: The address the API server uses to advertise itself to cluster members. In a segregated-network deployment this must be the internal/private IP address. Without this flag, K3s defaults to the external IP when --node-external-ip is set, which causes the kubernetes service endpoint to register an unreachable address. Required on server nodes only; agent nodes do not run an API server.

Set this variable on each node before running the install or join scripts.
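When many nodes are involved, building the value from its parts reduces copy-and-paste errors. The helper below is a sketch with a hypothetical name; it uses only the flags described above and adds --advertise-address for the server role alone. The addresses in the example are documentation-range placeholders.

```shell
# Hypothetical helper: assemble the INSTALL_K3S_EXEC value for a node.
# Arguments: <server|agent> <eth1-ip> <eth0-ip> [cluster-interface, default eth1]
k3s_exec_flags() {
    role=$1
    internal_ip=$2
    external_ip=$3
    iface=${4:-eth1}
    flags="$role --node-ip=$internal_ip --node-external-ip=$external_ip --flannel-iface=$iface"
    case "$role" in
        server) flags="$flags --advertise-address=$internal_ip" ;;  # API server nodes only
        agent)  ;;
        *)
            echo "role must be 'server' or 'agent'" >&2
            return 2
            ;;
    esac
    echo "$flags"
}

# Example (placeholder addresses):
#   export INSTALL_K3S_EXEC="$(k3s_exec_flags server 10.0.0.11 192.0.2.11)"
```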

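SELinux

If SELinux is to be used, it must be set to “Enforcing” mode before running the installer script. The installer will configure appropriate SELinux policies automatically. SELinux cannot be enabled after installation.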

Installation Steps

Step 1: Prepare the Primary Server Node

Mount the installation ISO on the primary server node:

mkdir -p /mnt/esb3027
mount -o loop,ro esb3027-acd-manager-X.Y.Z.iso /mnt/esb3027

Replace X.Y.Z with the actual version number.

Step 2: Install the Base Cluster on Primary Server

Segregated Networks: If your node has multiple network interfaces, set the INSTALL_K3S_EXEC environment variable with the complete segregated network configuration before running the installer (see Segregated Network Configuration):

export INSTALL_K3S_EXEC="server --node-ip=<ETH1_IP> --node-external-ip=<ETH0_IP> --flannel-iface=eth1 --advertise-address=<ETH1_IP>"

Replace <ETH1_IP> with the internal/private IP address and <ETH0_IP> with the public IP address.

If your node has only a single network interface, do not set INSTALL_K3S_EXEC. K3s will use the default interface automatically.

Run the installer to set up the K3s Kubernetes cluster:

/mnt/esb3027/install

This installs:

K3s Kubernetes distribution
Longhorn distributed storage
Cloudnative PG operator for PostgreSQL
Base system dependencies

Important: After the installer completes, verify that all system pods in both namespaces are in the Running state before proceeding:

# Check kube-system namespace (Kubernetes core components)
kubectl get pods -n kube-system

# Check longhorn-system namespace (distributed storage)
kubectl get pods -n longhorn-system

This verification confirms:

K3s cluster is operational
Longhorn distributed storage is running
Cloudnative PG operator is deployed
All core components are healthy before continuing

Step 3: Retrieve the Node Token

Retrieve the node token for joining additional nodes:

cat /var/lib/rancher/k3s/server/node-token

Save this token for use on additional nodes. Also note the IP address of the primary server node.

Step 4: Server vs Agent Node Roles

Before joining additional nodes, determine which nodes will serve as Server nodes vs Agent nodes:

Role	Control Plane	Workloads	HA Quorum	Use Case
Server Node (Combined)	Yes (etcd, API server)	Yes	Participates	Default production role; minimum 3 nodes
Server Node (Control Plane Only)	Yes (etcd, API server)	No	Participates	Dedicated control plane; requires separate Agent nodes
Agent Node	No	Yes	No	Additional workload capacity only

Guidance:

Combined role (default): Server nodes run both control plane and workloads; minimum 3 nodes required for HA
Control Plane Only: Dedicate nodes to control plane functions; requires at least 3 Server nodes plus 3+ Agent nodes for workloads
Agent nodes are required if using Control Plane Only servers; optional if using Combined role servers
For most deployments, 3 Server nodes (Combined role) with no Agent nodes is sufficient
Add Agent nodes to scale workload capacity without affecting control plane quorum

Proceed to Step 5 to join Server nodes. Agent nodes are joined after all Server nodes are ready.

Step 5: Join Additional Server Nodes

On each additional server node:

Mount the ISO:

mkdir -p /mnt/esb3027
mount -o loop,ro esb3027-acd-manager-X.Y.Z.iso /mnt/esb3027

Join the cluster:

Segregated Networks: If your node has multiple network interfaces, set the INSTALL_K3S_EXEC environment variable with the complete segregated network configuration before running the join script (see Segregated Network Configuration):

export INSTALL_K3S_EXEC="server --node-ip=<ETH1_IP> --node-external-ip=<ETH0_IP> --flannel-iface=eth1 --advertise-address=<ETH1_IP>"

Replace <ETH1_IP> with the internal/private IP address and <ETH0_IP> with the public IP address.

If your node has only a single network interface, do not set INSTALL_K3S_EXEC. K3s will use the default interface automatically.

Note for Segregated Networks: When joining nodes in a segregated network environment, ensure the <primary-server-ip> used in the join command is the internal/private IP address (the eth1 address) of the primary server. Using the external IP will cause the join attempt to fail, because the K3s API server listens on the private interface.

Run the join script:

/mnt/esb3027/join-server https://<primary-server-ip>:6443 <node-token>

Replace <primary-server-ip> with the IP address of the primary server and <node-token> with the token retrieved in Step 3.

Verify the node joined successfully:

kubectl get nodes

Repeat for each server node. A minimum of 3 server nodes is required for high availability.

Step 5b: Taint Control Plane Only Nodes (Optional)

If you are using dedicated Control Plane Only nodes (not Combined role), apply taints to prevent workload scheduling:

kubectl taint nodes <node-name> CriticalAddonsOnly=true:NoSchedule

Apply this taint to each Control Plane Only node. Verify taints are applied:

kubectl describe nodes | grep -A 5 "Taints"

Note: This step is only required if you want dedicated control plane nodes. For Combined role deployments, do not apply taints.

Important: Control Plane Only Server nodes can be deployed with lower hardware specifications (2 cores, 4 GiB RAM, 64 GiB disk) than the installer’s default minimum requirements. If your Control Plane Only Server nodes do not meet the Single-Node Lab configuration minimums (8 cores, 16 GiB RAM, 128 GiB disk), you must set the SKIP_REQUIREMENTS_CHECK environment variable before running the installer or join command:

# For the primary server node
export SKIP_REQUIREMENTS_CHECK=1
/mnt/esb3027/install

# For additional Control Plane Only Server nodes
export SKIP_REQUIREMENTS_CHECK=1
/mnt/esb3027/join-server https://<primary-server-ip>:6443 <node-token>

Note: This applies to Server nodes only. Agent nodes have separate minimum requirements.
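Whether the override is needed on a given node can be worked out ahead of time. The sketch below compares a node against the lab minimums quoted above, reading them as 8 cores, 16 GiB of memory, and 128 GiB of disk. The function name is hypothetical, and the installer's own measurement may differ from this approximation.

```shell
# Hypothetical pre-flight helper: compare a node against the Single-Node Lab
# minimums (8 cores, 16 GiB memory, 128 GiB disk).
# Arguments: <cores> <memory-GiB> <disk-GiB>
# Returns 0 when all minimums are met, 1 otherwise.
meets_lab_minimums() {
    [ "$1" -ge 8 ] && [ "$2" -ge 16 ] && [ "$3" -ge 128 ]
}

# Example: decide whether SKIP_REQUIREMENTS_CHECK is needed on this node.
# The disk figure is supplied by hand because which filesystem the installer
# inspects is not stated in this guide. Memory is rounded to the nearest GiB.
#   cores=$(nproc)
#   mem_gib=$(awk '/^MemTotal:/ { printf "%.0f", $2 / 1048576 }' /proc/meminfo)
#   meets_lab_minimums "$cores" "$mem_gib" 64 || export SKIP_REQUIREMENTS_CHECK=1
```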

Step 6: Join Agent Nodes (Optional)

On each agent node:

Mount the ISO:

mkdir -p /mnt/esb3027
mount -o loop,ro esb3027-acd-manager-X.Y.Z.iso /mnt/esb3027

Join the cluster as an agent:

Segregated Networks: If your node has multiple network interfaces, set the INSTALL_K3S_EXEC environment variable with the complete segregated network configuration before running the join script (see Segregated Network Configuration):

export INSTALL_K3S_EXEC="agent --node-ip=<ETH1_IP> --node-external-ip=<ETH0_IP> --flannel-iface=eth1"

Replace <ETH1_IP> with the internal/private IP address and <ETH0_IP> with the public IP address.

If your node has only a single network interface, do not set INSTALL_K3S_EXEC. K3s will use the default interface automatically.

Run the join script:

/mnt/esb3027/join-agent https://<primary-server-ip>:6443 <node-token>

Note for Segregated Networks: When joining nodes in a segregated network environment, ensure the <primary-server-ip> used in the join command is the internal/private IP address (the eth1 address) of the primary server. Using the external IP will cause the join attempt to fail, because the K3s API server listens on the private interface.

Verify the node joined successfully from an existing server node:

kubectl get nodes

Agent nodes provide additional workload capacity but do not participate in the control plane quorum.

Step 7: Verify Cluster Status

After all nodes are joined, verify the cluster is operational:

1. Verify all nodes are ready:

kubectl get nodes

Expected output:

NAME                 STATUS   ROLES                       AGE   VERSION
k3s-server-0         Ready    control-plane,etcd,master   5m    v1.33.4+k3s1
k3s-server-1         Ready    control-plane,etcd,master   3m    v1.33.4+k3s1
k3s-server-2         Ready    control-plane,etcd,master   2m    v1.33.4+k3s1
k3s-agent-1          Ready    <none>                      1m    v1.33.4+k3s1
k3s-agent-2          Ready    <none>                      1m    v1.33.4+k3s1
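On larger clusters the same check can be filtered. The helper below is a sketch with a hypothetical name; it reads `kubectl get nodes --no-headers` output on stdin, where the second column is STATUS, and reports any node whose status is not exactly `Ready`. A cordoned node, which reports `Ready,SchedulingDisabled`, is flagged as well.

```shell
# Hypothetical helper: read `kubectl get nodes --no-headers` output on stdin
# and print every node whose STATUS column is not exactly "Ready".
# Returns 0 when nothing is printed, 1 otherwise.
nodes_not_ready() {
    awk '$2 != "Ready" { print $1, $2; bad = 1 }
         END { exit (bad ? 1 : 0) }'
}

# Example:
#   kubectl get nodes --no-headers | nodes_not_ready
```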

2. Verify system pods in both namespaces are running:

# Check kube-system namespace (Kubernetes core components)
kubectl get pods -n kube-system

# Check longhorn-system namespace (distributed storage)
kubectl get pods -n longhorn-system

All pods should show Running status. If any pods are still Pending or ContainerCreating, wait until they are ready.

This verification confirms:

K3s cluster is operational across all nodes
Longhorn distributed storage is running
CloudNativePG operator is deployed
All core components are healthy before proceeding to application deployment
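The two checks above can also be scripted. The following is a minimal sketch, not part of the product tooling: the function names not_ready_nodes and unsettled_pods are hypothetical, and both simply filter the plain-text output of kubectl when it is run with the standard --no-headers flag.

```shell
# Print the names of nodes whose STATUS column is not exactly "Ready".
# Input: output of `kubectl get nodes --no-headers` on stdin.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Print "<pod> <status>" for pods that are neither Running nor Completed.
# Input: output of `kubectl get pods -n <namespace> --no-headers` on stdin.
unsettled_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Typical use on a server node (no output means nothing needs attention):
#   kubectl get nodes --no-headers | not_ready_nodes
#   kubectl get pods -n kube-system --no-headers | unsettled_pods
#   kubectl get pods -n longhorn-system --no-headers | unsettled_pods
```

Note that a cordoned node reports a status such as Ready,SchedulingDisabled and is therefore listed by not_ready_nodes; whether that should block the next step is a judgment call for the operator.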

Step 8: Air-Gapped Deployments (If Applicable)

If deploying in an air-gapped environment, run the following on each node to mount the Extras ISO and load its container images:

mkdir -p /mnt/esb3027-extras
mount -o loop,ro esb3027-acd-manager-extras-X.Y.Z.iso /mnt/esb3027-extras
/mnt/esb3027-extras/load-images

Step 9: Deploy the Manager Helm Chart

Follow the Helm Chart Installation guide (section 5.5), which covers the common deployment steps that apply to all installation types. After completing those steps, proceed to Post-Installation below.

Step 10: Configure DNS (Optional)

Add DNS records for the manager hostname. For high availability, configure multiple A records pointing to different server nodes:

manager.example.com.  IN  A  <server-1-ip>
manager.example.com.  IN  A  <server-2-ip>
manager.example.com.  IN  A  <server-3-ip>

Alternatively, configure a load balancer to distribute traffic across nodes.
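Once the A records are published, it can be useful to confirm that every server node address is actually returned by the resolver. The sketch below assumes the dig utility is available for the lookup; the function name missing_a_records and the 192.0.2.x addresses are illustrative only.

```shell
# Print each expected IP address that is absent from the resolver answer.
# Usage: missing_a_records "<resolved IPs, one per line>" <ip> [<ip> ...]
missing_a_records() (
  resolved=$1
  shift
  for ip in "$@"; do
    # -F: fixed string, -x: match the whole line, -q: quiet
    printf '%s\n' "$resolved" | grep -Fxq -- "$ip" || printf '%s\n' "$ip"
  done
)

# Typical use (prints nothing when all records are present):
#   missing_a_records "$(dig +short manager.example.com A)" \
#     192.0.2.11 192.0.2.12 192.0.2.13
```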

Post-Installation

After installation completes, proceed to the Next Steps guide for:

Initial user configuration
Accessing the web interfaces
Configuring authentication
Setting up monitoring

Accessing the System

Refer to the Accessing the System section in the Getting Started guide for service URLs and default credentials.

Note: A self-signed SSL certificate is deployed by default. For production deployments, configure a valid SSL certificate before exposing the system to users.

High Availability Considerations

Pod Distribution

The Helm chart configures pod anti-affinity rules to ensure:

Kafka controllers are scheduled on separate nodes
PostgreSQL cluster members are distributed across nodes
Application pods are spread across available nodes

Data Replication and Failure Tolerance

For detailed information on data replication strategies and failure scenario tolerance, refer to the Architecture Guide and System Requirements Guide.

Troubleshooting

If pods fail to start or nodes fail to join:

Check node status: kubectl get nodes
Describe problematic pods: kubectl describe pod <pod-name>
Review logs: kubectl logs <pod-name>
Check cluster events: kubectl get events --sort-by='.lastTimestamp'

Nodes Ready but Workloads Cannot Reach the API Server (Segregated Networks)

Symptom: All nodes show Ready status, but cluster components (kubelet, controller-manager, scheduler) or workloads fail to communicate with the API server. Pods in kube-system or longhorn-system may fail to start or remain in a crash loop.

Cause: This is caused by omitting --advertise-address from the server-node INSTALL_K3S_EXEC. When --node-external-ip is set without --advertise-address, k3s defaults the API server’s advertise address to the external IP (eth0). In a segregated-network topology where nodes are not routable to each other over eth0, the kubernetes service ClusterIP endpoint registers as an unreachable address.

Diagnostic check:

kubectl get endpoints kubernetes -n default

If the IP shown is the eth0 (external) address rather than the eth1 (internal) address, the cluster was installed without --advertise-address.
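The comparison in the diagnostic check can be automated. This is a sketch under stated assumptions: the function name endpoints_are_internal is hypothetical, the 10.0.0.x addresses stand in for your eth1 addresses, and the endpoint addresses are read with kubectl's standard jsonpath output.

```shell
# Succeeds (exit 0) only if every endpoint IP read from stdin is one of the
# internal (eth1) addresses passed as arguments.
# Usage: <endpoint IPs on stdin> | endpoints_are_internal <eth1-ip> [...]
endpoints_are_internal() (
  seen=0
  status=0
  for ip in $(cat); do
    seen=1
    match=no
    for internal in "$@"; do
      if [ "$ip" = "$internal" ]; then
        match=yes
      fi
    done
    if [ "$match" = no ]; then
      printf 'endpoint %s is not a listed internal address\n' "$ip" >&2
      status=1
    fi
  done
  if [ "$seen" -eq 0 ]; then
    echo 'no endpoint addresses read from stdin' >&2
    status=1
  fi
  exit "$status"
)

# Typical use on a server node:
#   kubectl get endpoints kubernetes -n default \
#     -o jsonpath='{.subsets[*].addresses[*].ip}' \
#     | endpoints_are_internal 10.0.0.11 10.0.0.12 10.0.0.13
```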

Remediation: The kubernetes service endpoint cannot be corrected by reconfiguration alone. K3s must be reinstalled on all server nodes with the correct flags:

export INSTALL_K3S_EXEC="server --node-ip=<ETH1_IP> --node-external-ip=<ETH0_IP> --flannel-iface=eth1 --advertise-address=<ETH1_IP>"

After reinstallation, re-run the diagnostic check to confirm the endpoint IP is now the eth1 (internal) address.

See the Troubleshooting Guide for additional assistance.

Next Steps

After successful installation:

Next Steps Guide - Post-installation configuration
Configuration Guide - System configuration
Operations Guide - Day-to-day operations

5.4 - Air-Gapped Deployment

Installation procedures for air-gapped environments

Overview

This guide describes the installation of the AgileTV CDN Manager in air-gapped environments (no internet access). Air-gapped deployments require additional preparation compared to connected deployments.

Key differences from connected deployments:

Both Installation ISO and Extras ISO are required on all nodes
OS installation ISO must be mounted on all nodes for package access
Container images must be loaded from the Extras ISO on each node
Additional firewall considerations for OS package repositories

Prerequisites

Required ISOs

Before beginning installation, obtain the following:

ISO	Filename	Purpose
Installation ISO	`esb3027-acd-manager-X.Y.Z.iso`	Kubernetes cluster and Manager application
Extras ISO	`esb3027-acd-manager-extras-X.Y.Z.iso`	Container images for air-gapped environments
OS Installation ISO	RHEL 9 or compatible clone	Operating system packages (required on all nodes)

Hardware Requirements

Refer to the System Requirements Guide for hardware specifications.

Single-Node (Lab): Minimum 8 cores, 16 GiB RAM, 128 GiB disk
Multi-Node (Production): Minimum 3 Server nodes for high availability

Network Configuration

Air-gapped environments may have internal network mirrors for OS packages. If no internal mirror exists, the OS installation ISO must be mounted on each node to provide packages during installation.

Ensure that required firewall ports are configured before installation. See the Networking Guide for complete firewall configuration requirements.

SELinux

Installation Steps

Step 1: Prepare All Nodes

On each node (primary server, additional servers, and agents):

Mount the OS installation ISO:

mkdir -p /mnt/os
mount -o loop,ro /path/to/rhel-9.iso /mnt/os

Configure local repository (if no internal mirror):

cat > /etc/yum.repos.d/local.repo <<EOF
[local]
name=Local OS Repository
baseurl=file:///mnt/os/BaseOS
enabled=1
gpgcheck=0
EOF

# Also configure AppStream if needed
cat >> /etc/yum.repos.d/local.repo <<EOF

[appstream]
name=AppStream Repository
baseurl=file:///mnt/os/AppStream
enabled=1
gpgcheck=0
EOF

Verify repository is accessible:

dnf repolist
dnf makecache
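The same repository file is written again on each additional server and agent node in the later steps of this guide. A minimal sketch of a reusable helper follows; the function name write_local_repo is hypothetical, and the file contents are the ones shown above with the mount point made a parameter.

```shell
# Write a dnf repository file that points at a mounted OS installation ISO.
# Usage: write_local_repo <iso-mount-point> <repo-file>
write_local_repo() (
  mount_point=$1
  repo_file=$2
  cat > "$repo_file" <<EOF
[local]
name=Local OS Repository
baseurl=file://${mount_point}/BaseOS
enabled=1
gpgcheck=0

[appstream]
name=AppStream Repository
baseurl=file://${mount_point}/AppStream
enabled=1
gpgcheck=0
EOF
)

# Typical use on each node, after mounting the OS ISO at /mnt/os:
#   write_local_repo /mnt/os /etc/yum.repos.d/local.repo
#   dnf makecache
```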

Step 2: Prepare the Primary Server Node

Mount the installation ISOs on the primary server node:

# Mount Installation ISO
mkdir -p /mnt/esb3027
mount -o loop,ro esb3027-acd-manager-X.Y.Z.iso /mnt/esb3027

# Mount Extras ISO
mkdir -p /mnt/esb3027-extras
mount -o loop,ro esb3027-acd-manager-extras-X.Y.Z.iso /mnt/esb3027-extras

Step 3: Install the Base Cluster on Primary Server

Run the installer to set up the K3s Kubernetes cluster:

/mnt/esb3027/install

This installs:

K3s Kubernetes distribution
Longhorn distributed storage
CloudNativePG operator for PostgreSQL
Base system dependencies

Important: After the installer completes, verify that all system pods in both namespaces are in the Running state before proceeding:

# Check kube-system namespace (Kubernetes core components)
kubectl get pods -n kube-system

# Check longhorn-system namespace (distributed storage)
kubectl get pods -n longhorn-system

This verification confirms:

K3s cluster is operational
Longhorn distributed storage is running
CloudNativePG operator is deployed
All core components are healthy before continuing

Step 4: Retrieve the Node Token

Retrieve the node token for joining additional nodes:

cat /var/lib/rancher/k3s/server/node-token

Save this token for use on additional nodes. Also note the IP address of the primary server node.

Step 5: Join Additional Server Nodes (Multi-Node Only)

On each additional server node:

Mount the OS ISO:

mkdir -p /mnt/os
mount -o loop,ro /path/to/rhel-9.iso /mnt/os

# Configure local repository
cat > /etc/yum.repos.d/local.repo <<EOF
[local]
name=Local OS Repository
baseurl=file:///mnt/os/BaseOS
enabled=1
gpgcheck=0

[appstream]
name=AppStream Repository
baseurl=file:///mnt/os/AppStream
enabled=1
gpgcheck=0
EOF

dnf makecache

Mount the Installation ISOs:

# Mount Installation ISO
mkdir -p /mnt/esb3027
mount -o loop,ro esb3027-acd-manager-X.Y.Z.iso /mnt/esb3027

# Mount Extras ISO
mkdir -p /mnt/esb3027-extras
mount -o loop,ro esb3027-acd-manager-extras-X.Y.Z.iso /mnt/esb3027-extras

Join the cluster by running the join script:

/mnt/esb3027/join-server https://<primary-server-ip>:6443 <node-token>

Replace <primary-server-ip> with the IP address of the primary server and <node-token> with the token retrieved in Step 4.

Verify the node joined successfully:

kubectl get nodes

Repeat for each server node. A minimum of 3 server nodes is required for high availability.

Step 6: Join Agent Nodes (Optional)

On each agent node:

Mount the OS ISO:

mkdir -p /mnt/os
mount -o loop,ro /path/to/rhel-9.iso /mnt/os

# Configure local repository
cat > /etc/yum.repos.d/local.repo <<EOF
[local]
name=Local OS Repository
baseurl=file:///mnt/os/BaseOS
enabled=1
gpgcheck=0

[appstream]
name=AppStream Repository
baseurl=file:///mnt/os/AppStream
enabled=1
gpgcheck=0
EOF

dnf makecache

Mount the Installation ISOs:

# Mount Installation ISO
mkdir -p /mnt/esb3027
mount -o loop,ro esb3027-acd-manager-X.Y.Z.iso /mnt/esb3027

# Mount Extras ISO
mkdir -p /mnt/esb3027-extras
mount -o loop,ro esb3027-acd-manager-extras-X.Y.Z.iso /mnt/esb3027-extras

Join the cluster as an agent by running the join script:

/mnt/esb3027/join-agent https://<primary-server-ip>:6443 <node-token>

Verify the node joined successfully from an existing server node:

kubectl get nodes

Agent nodes provide additional workload capacity but do not participate in the control plane quorum.

Step 7: Load Container Images

On each node in the cluster:

/mnt/esb3027-extras/load-images

This script loads all container images from the Extras ISO into the local container runtime.

Important: This step must be performed on every node (primary server, additional servers, and agents) before deploying the Manager application.
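Because the script has to run on every node, it is easy to miss one in a larger cluster. The sketch below only prints the commands so they can be reviewed first. It assumes root SSH access to each node and that the Extras ISO is already mounted there; the function name load_images_commands and the node names are illustrative.

```shell
# Print one ssh command per node name given as an argument.
# Nothing is executed; review the output, then pipe it to `sh` if it is correct.
load_images_commands() {
  for node in "$@"; do
    printf 'ssh root@%s /mnt/esb3027-extras/load-images\n' "$node"
  done
}

# Typical use:
#   load_images_commands k3s-server-0 k3s-server-1 k3s-server-2 k3s-agent-1
#   load_images_commands k3s-server-0 k3s-server-1 | sh
```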

Step 8: Verify Cluster Status

After all nodes are joined and images are loaded, verify the cluster is operational:

1. Verify all nodes are ready:

kubectl get nodes

Expected output:

NAME                 STATUS   ROLES                       AGE   VERSION
k3s-server-0         Ready    control-plane,etcd,master   5m    v1.33.4+k3s1
k3s-server-1         Ready    control-plane,etcd,master   3m    v1.33.4+k3s1
k3s-server-2         Ready    control-plane,etcd,master   2m    v1.33.4+k3s1
k3s-agent-1          Ready    <none>                      1m    v1.33.4+k3s1

2. Verify system pods in both namespaces are running:

# Check kube-system namespace (Kubernetes core components)
kubectl get pods -n kube-system

# Check longhorn-system namespace (distributed storage)
kubectl get pods -n longhorn-system

All pods should show Running status.

3. Verify container images are loaded:

crictl images | grep acd-manager

Step 9: Deploy the Manager Helm Chart

Follow the Helm Chart Installation guide (section 5.5), which covers the common deployment steps that apply to all installation types. After completing those steps, proceed to Post-Installation below.

Post-Installation

After installation completes, proceed to the Next Steps guide for:

Initial user configuration
Accessing the web interfaces
Configuring authentication
Setting up monitoring

Accessing the System

Refer to the Accessing the System section in the Getting Started guide for service URLs and default credentials.

Note: A self-signed SSL certificate is deployed by default. You will need to accept the certificate warning in your browser.

Updating MaxMind GeoIP Databases

If using GeoIP-based routing, load the MaxMind databases:

/mnt/esb3027/generate-maxmind-volume

The utility will prompt for the database file locations and volume name. Reference the volume in your values.yaml:

manager:
  maxmindDbVolume: maxmind-geoip-2026-04

See the Operations Guide for database update procedures.

Troubleshooting

Image Pull Errors

If pods fail with image pull errors:

Verify the load-images script completed successfully on all nodes
Check container runtime image list:
```
crictl images | grep <image-name>
```
Ensure image tags in Helm chart match tags on the Extras ISO

OS Package Errors

If the installer reports missing OS packages:

Verify OS ISO is mounted on the affected node
Check repository configuration:
```
dnf repolist
dnf info <package-name>
```
Ensure the ISO matches the installed OS version

Longhorn Volume Issues

If Longhorn volumes fail to mount:

Verify the load-images script completed successfully on all nodes
Check Longhorn system pods:
```
kubectl get pods -n longhorn-system
```

Review Longhorn UI via port-forward:

kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80

Next Steps

After successful installation:

Next Steps Guide - Post-installation configuration
Configuration Guide - System configuration
Operations Guide - Day-to-day operational procedures
Troubleshooting Guide - Common issues and resolution

5.5 - Helm Chart Installation

Common procedure for deploying the CDN Manager Helm chart across all deployment types

Overview

This guide covers the common steps for deploying the CDN Manager Helm chart. These steps apply to all deployment types (single-node, multi-node, and air-gapped) after the Kubernetes cluster is fully operational.

Prerequisites: This guide assumes the Kubernetes cluster is already installed and all system pods are running. If you haven’t installed the cluster yet, refer to:
Single-Node Installation for lab environments
Multi-Node Installation for production deployments
Air-Gapped Deployment for air-gapped environments

Prerequisites

Before proceeding, verify the following:

Cluster operational: All nodes show Ready status
System pods running: All pods in kube-system and longhorn-system namespaces are Running
ISO mounted: Installation ISO is mounted at /mnt/esb3027
Extras ISO mounted (air-gapped only): Extras ISO is mounted at /mnt/esb3027-extras and images are loaded on all nodes

Step 1: Create Configuration File

The installation ISO includes environment-specific configuration files as the recommended starting points. Choose the file that matches your deployment type:

Deployment	Starting file	Copy command
Single-node lab	`/mnt/esb3027/values-lab.yaml`	`cp /mnt/esb3027/values-lab.yaml ~/values.yaml`
Multi-node production	`/mnt/esb3027/values-production.yaml`	`cp /mnt/esb3027/values-production.yaml ~/values.yaml`

After copying, edit ~/values.yaml for your environment. The two fields that must be updated in either file are:

global:
  hosts:
    manager:
      - host: manager.example.com   # Your manager hostname

zitadel:
  zitadel:
    configmapConfig:
      ExternalDomain: manager.example.com   # Must match global.hosts.manager[0].host exactly

Important: global.hosts.manager[0].host and zitadel.zitadel.configmapConfig.ExternalDomain must match exactly or authentication will fail due to CORS policy violations.
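A quick pre-flight check can catch a mismatch before the chart is installed. The following sketch is a line-based text check rather than a YAML parser: it assumes the file is laid out as in the example above, with the manager host as the first "- host:" entry and a single ExternalDomain key. The function name manager_hosts_match is hypothetical.

```shell
# Succeeds only if the first "- host:" entry and the ExternalDomain value in
# the given values file are both present and identical.
# Usage: manager_hosts_match <values-file>
manager_hosts_match() (
  file=$1
  host=$(awk '$1 == "-" && $2 == "host:" { print $3; exit }' "$file")
  domain=$(awk '$1 == "ExternalDomain:" { print $2; exit }' "$file")
  if [ -n "$host" ] && [ "$host" = "$domain" ]; then
    echo "match: $host"
  else
    echo "mismatch: host='$host' ExternalDomain='$domain'" >&2
    exit 1
  fi
)

# Typical use before running helm install:
#   manager_hosts_match ~/values.yaml
```

If the values are quoted in one place and unquoted in the other, this check reports a mismatch even though YAML treats them as equal, so treat a failure as a prompt to inspect the file rather than as proof of an error.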

For a full description of what each file configures and the complete list of required changes per environment, see the Configuration Guide.

Complete reference: /mnt/esb3027/values.yaml documents every available option with its default value. Use this as a reference, not as a starting point for your configuration.

Split configuration files: For better organisation, split your configuration into multiple files and specify them with repeated --values flags. Later files override earlier files:

helm install acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml \
  --values ~/values-tls.yaml

Step 2: Load MaxMind GeoIP Databases (Optional)

If you plan to use GeoIP-based routing or validation features, load the MaxMind GeoIP databases. The following databases are used by the manager:

GeoIP2-City.mmdb - The City Database
GeoLite2-ASN.mmdb - The ASN Database
GeoIP2-Anonymous-IP.mmdb - The VPN and Anonymous IP Database

Create the Kubernetes volume using the helper utility:

/mnt/esb3027/generate-maxmind-volume

The utility will prompt for:

Location of GeoIP2-City.mmdb
Location of GeoLite2-ASN.mmdb
Location of GeoIP2-Anonymous-IP.mmdb
Name of the volume

After running this command, reference the volume in your configuration file:

manager:
  maxmindDbVolume: maxmind-db-volume

Replace maxmind-db-volume with the volume name you specified when running the utility.

Tip: When naming the volume, include a revision number or date (e.g., maxmind-db-volume-2026-04 or maxmind-db-volume-v2). This simplifies future updates: create a new volume with an updated name, update the values.yaml to reference the new volume, and delete the old volume after verification.
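One way to follow the naming tip consistently is to derive the suffix from the current date. This is a small sketch; the function name maxmind_volume_name is hypothetical, and the default prefix is the example name used above.

```shell
# Print a volume name with a year-month suffix, e.g. <prefix>-YYYY-MM.
# Usage: maxmind_volume_name [<prefix>]   (prefix defaults to maxmind-db-volume)
maxmind_volume_name() {
  printf '%s-%s\n' "${1:-maxmind-db-volume}" "$(date +%Y-%m)"
}

# Typical use: enter the printed name when generate-maxmind-volume prompts
# for the volume name, then set the same name as manager.maxmindDbVolume.
#   maxmind_volume_name
```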

Step 3: Configure TLS Certificates (Optional)

For production deployments, configure a valid TLS certificate from a trusted Certificate Authority (CA). A self-signed certificate is deployed by default if no certificate is provided.

Method 1: Create TLS Secret Manually

Create a Kubernetes TLS secret with your certificate and key:

kubectl create secret tls acd-manager-tls --cert=tls.crt --key=tls.key

Method 2: Helm-Managed Secret

Add the certificate directly to your values.yaml:

ingress:
  secrets:
    acd-manager-tls: |
      -----BEGIN CERTIFICATE-----
      ...
      -----END CERTIFICATE-----
  tls:
    - hosts:
        - manager.example.com
      secretName: acd-manager-tls

Configuring All Ingress Controllers

All ingress controllers must be configured with the same certificate secret and hostname:

ingress:
  hostname: manager.example.com
  tls: true
  secretName: acd-manager-tls

zitadel:
  ingress:
    tls:
      - hosts:
          - manager.example.com
        secretName: acd-manager-tls

confd:
  ingress:
    hostname: manager.example.com
    tls: true
    secretName: acd-manager-tls

mib-frontend:
  ingress:
    hostname: manager.example.com
    tls: true
    secretName: acd-manager-tls

Important: The hostname must match the first entry in global.hosts.manager for Zitadel CORS compatibility. The secret name has a maximum length of 53 characters.
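The length limit can be checked before the secret is created. The sketch below enforces the 53-character maximum stated above and approximates the Kubernetes naming rule for secrets (lowercase alphanumerics, '-' and '.', starting and ending with an alphanumeric). The function name valid_secret_name is hypothetical.

```shell
# Succeeds only if the name is at most 53 characters long and consists of
# lowercase alphanumerics, '-' or '.', starting and ending with an alphanumeric.
# Usage: valid_secret_name <name>
valid_secret_name() (
  name=$1
  [ "${#name}" -le 53 ] || exit 1
  printf '%s\n' "$name" | grep -Eq '^[a-z0-9]([-a-z0-9.]*[a-z0-9])?$'
)

# Typical use:
#   valid_secret_name acd-manager-tls && echo "name is acceptable"
```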

Step 4: Deploy the Manager Helm Chart

Deploy the CDN Manager application:

helm install acd-manager /mnt/esb3027/charts/acd-manager --values ~/values.yaml

Real-time output: By default, helm install runs silently until completion. To see real-time output during deployment, add the --debug flag:

helm install acd-manager /mnt/esb3027/charts/acd-manager --values ~/values.yaml --debug

Monitor deployment:

kubectl get pods --watch

Wait for all pods to show Running status before proceeding.

Timeout handling: The default Helm timeout is 5 minutes. If the installation fails due to a rollout timeout, retry with a larger timeout value:

helm install acd-manager /mnt/esb3027/charts/acd-manager --values ~/values.yaml --timeout 10m

Retry failed installation: If a previous installation attempt failed and you receive an error that the release name is already in use, uninstall the previous release before retrying:

helm uninstall acd-manager
helm install acd-manager /mnt/esb3027/charts/acd-manager --values ~/values.yaml

Step 5: Verify Deployment

Verify all application pods are running:

kubectl get pods

Expected Output: Single-Node

NAME                                              READY   STATUS      RESTARTS   AGE
acd-manager-5b98d569d9-abc12                      1/1     Running     0          3m
acd-manager-confd-6fb78548c4-xnrh4                1/1     Running     0          3m
acd-manager-gateway-8bc8446fc-chs26               1/1     Running     0          3m
acd-manager-kafka-controller-0                    2/2     Running     0          3m
acd-manager-metrics-aggregator-76d96c4964-lwdcj   1/1     Running     0          3m
acd-manager-mib-frontend-7bdb69684b-6qxn8         1/1     Running     0          3m
acd-manager-postgresql-0                          1/1     Running     0          3m
acd-manager-redis-master-0                        2/2     Running     0          3m
acd-manager-redis-replicas-0                      2/2     Running     0          3m
acd-manager-selection-input-5fb694b857-qxt67      1/1     Running     0          3m
acd-manager-zitadel-8448b4c4fc-2pkd8              1/1     Running     0          3m
acd-manager-zitadel-init-hh6j7                    0/1     Completed   0          4m
acd-manager-zitadel-setup-nwp8k                   0/2     Completed   0          4m
alertmanager-0                                    1/1     Running     0          3m
grafana-6d948cfdc6-77ggk                          1/1     Running     0          3m
victoria-metrics-agent-dc87df588-tn8wv            1/1     Running     0          3m
victoria-metrics-alert-757c44c58f-kk9lp           1/1     Running     0          3m
victoria-metrics-longterm-server-0                1/1     Running     0          3m
victoria-metrics-server-0                         1/1     Running     0          3m

Expected Output: Multi-Node

NAME                                              READY   STATUS      RESTARTS        AGE
acd-cluster-postgresql-1                          1/1     Running     0               11m
acd-cluster-postgresql-2                          1/1     Running     0               11m
acd-cluster-postgresql-3                          1/1     Running     0               10m
acd-manager-5b98d569d9-2pbph                      1/1     Running     0               3m
acd-manager-5b98d569d9-m54f9                      1/1     Running     0               3m
acd-manager-5b98d569d9-pq26f                      1/1     Running     0               3m
acd-manager-confd-6fb78548c4-xnrh4                1/1     Running     0               3m
acd-manager-gateway-8bc8446fc-chs26               1/1     Running     0               3m
acd-manager-gateway-8bc8446fc-wzrml               1/1     Running     0               3m
acd-manager-kafka-controller-0                    2/2     Running     0               3m
acd-manager-kafka-controller-1                    2/2     Running     0               3m
acd-manager-kafka-controller-2                    2/2     Running     0               3m
acd-manager-metrics-aggregator-76d96c4964-lwdcj   1/1     Running     2               3m
acd-manager-mib-frontend-7bdb69684b-6qxn8         1/1     Running     0               3m
acd-manager-mib-frontend-7bdb69684b-pkjrw         1/1     Running     0               3m
acd-manager-redis-master-0                        2/2     Running     0               3m
acd-manager-redis-replicas-0                      2/2     Running     0               3m
acd-manager-selection-input-5fb694b857-qxt67      1/1     Running     2               3m
acd-manager-zitadel-8448b4c4fc-2pkd8              1/1     Running     0               3m
acd-manager-zitadel-8448b4c4fc-vchp9              1/1     Running     0               3m
acd-manager-zitadel-init-hh6j7                    0/1     Completed   0               4m
acd-manager-zitadel-setup-nwp8k                   0/2     Completed   0               4m
alertmanager-0                                    1/1     Running     0               3m
grafana-6d948cfdc6-77ggk                          1/1     Running     0               3m
telegraf-54779f5f46-2jfj5                         1/1     Running     0               3m
victoria-metrics-agent-dc87df588-tn8wv            1/1     Running     0               3m
victoria-metrics-alert-757c44c58f-kk9lp           1/1     Running     0               3m
victoria-metrics-longterm-server-0                1/1     Running     0               3m
victoria-metrics-server-0                         1/1     Running     0               3m

Pod Distribution Verification

Verify pods are distributed across nodes:

kubectl get pods -o wide

Expected Behavior

Init pods (such as zitadel-init and zitadel-setup) will show Completed status after successful initialization. This is expected behavior.
Multi-node deployments: Some pods may enter CrashLoopBackOff state during initial deployment, depending on the order in which other containers start. This is expected behavior, as some services wait for dependencies (such as databases or Kafka) to become available. The deployment should stabilize automatically after a few minutes.
Restart counts: Some pods may show restart counts as they wait for dependencies to become available. This is normal during initial deployment.
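While the deployment settles, a per-status count is often easier to read than the full pod list. The following sketch summarizes the STATUS column of kubectl's plain-text output; the function name pod_status_summary is hypothetical.

```shell
# Print "<status> <count>" for each distinct pod status, sorted by status.
# Input: output of `kubectl get pods --no-headers` on stdin.
pod_status_summary() {
  awk '{ count[$3]++ } END { for (s in count) print s, count[s] }' | sort
}

# Typical use, repeated until only Running and Completed remain:
#   kubectl get pods --no-headers | pod_status_summary
```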

Next Steps

After successful deployment:

Next Steps Guide - Post-installation configuration
Getting Started Guide - Accessing the system
Configuration Guide - System configuration
Operations Guide - Day-to-day operations

5.6 - Upgrade Guide

Upgrading the CDN Manager to a newer version

Overview

This guide describes the procedure for upgrading the AgileTV CDN Manager (ESB3027) to a newer version. The upgrade process involves updating the Kubernetes cluster components and redeploying the Helm chart with the new version.

Prerequisites

Backup Requirements

Before beginning any upgrade, ensure you have:

PostgreSQL Backup: Verify recent backups are available via the CloudNativePG operator
Configuration Backup: Save your current values.yaml file(s)
TLS Certificates: Ensure certificate files are backed up
MaxMind Volumes: Note the current volume names if using GeoIP databases
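For the configuration backup, a timestamped copy avoids overwriting an earlier backup. This is a minimal sketch; the function name backup_values and the destination directory are illustrative, not part of the product tooling.

```shell
# Copy each given file into <dest-dir> with a timestamp suffix.
# Usage: backup_values <dest-dir> <file> [<file> ...]
backup_values() (
  dest=$1
  shift
  stamp=$(date +%Y%m%d-%H%M%S)
  mkdir -p "$dest"
  for f in "$@"; do
    # -p preserves mode and modification time of the original file
    cp -p "$f" "$dest/$(basename "$f").$stamp"
  done
)

# Typical use before an upgrade:
#   backup_values ~/acd-manager-backups ~/values.yaml ~/values-tls.yaml
```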

Version Compatibility

Review the Release Notes for the target version to check for:

Breaking changes requiring manual intervention
Required intermediate upgrade steps
New configuration options that should be set

Cluster Health

Verify the cluster is healthy before upgrading:

kubectl get nodes
kubectl get pods
kubectl get pvc

All nodes should show Ready status and all pods should be Running (or Completed for job pods).
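The storage part of the health check can be filtered in the same way. The sketch below lists PersistentVolumeClaims that are not bound, based on the STATUS column of kubectl's plain-text output; the function name unbound_pvcs is hypothetical.

```shell
# Print "<pvc> <status>" for claims whose STATUS column is not "Bound".
# Input: output of `kubectl get pvc --no-headers` on stdin.
unbound_pvcs() {
  awk '$2 != "Bound" { print $1, $2 }'
}

# Typical use (no output means every claim is bound):
#   kubectl get pvc --no-headers | unbound_pvcs
```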

Upgrade Methods

There are three upgrade methods available. Choose the one that best fits your situation:

Method	Downtime	Use Case
Rolling Upgrade	Minimal	Patch releases; minor version upgrades; configuration updates
Clean Upgrade	Brief	Major version upgrades; component changes; troubleshooting
Full Reinstall	Extended	Cluster rebuilds; troubleshooting persistent issues; ensuring clean state

Method Selection Guidance:

Rolling Upgrade (Method 1) is the default choice for most upgrades. Use this for patch releases (e.g., 1.6.0 → 1.6.1) and for minor version upgrades (e.g., 1.4.0 → 1.6.0) where no breaking changes are documented. This method preserves all existing resources and performs an in-place update. Note: If the upgrade fails, the release can be reverted to the previous revision with helm rollback, allowing quick recovery to the previous state.
Clean Upgrade (Method 2) is recommended for major version upgrades (e.g., 1.x → 2.x) or when the release notes indicate significant component changes. This method ensures all resources are recreated with the new version, avoiding potential issues with stale configurations. Also use this method when troubleshooting upgrade failures from Method 1.
Full Reinstall (Method 3) should only be used when a completely clean cluster state is required. This includes troubleshooting persistent cluster-level issues, recovering from failed upgrades that cannot be rolled back, or when migrating between significantly different deployment configurations. This method requires verified backups and should be planned for extended downtime.
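The version-based part of this guidance can be expressed as a simple rule. The sketch below encodes only that part: it compares major version numbers and does not know about breaking changes, so the release notes still take precedence. The function name suggest_upgrade_method is hypothetical.

```shell
# Suggest an upgrade method from the current and target version strings.
# A change in the major version suggests Method 2; otherwise Method 1.
# Usage: suggest_upgrade_method <current X.Y.Z> <target X.Y.Z>
suggest_upgrade_method() (
  current_major=${1%%.*}   # text before the first dot
  target_major=${2%%.*}
  if [ "$current_major" != "$target_major" ]; then
    echo "clean upgrade (Method 2)"
  else
    echo "rolling upgrade (Method 1)"
  fi
)

# Typical use:
#   suggest_upgrade_method 1.6.0 1.6.1
```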

Upgrade Steps

Method 1: Rolling Upgrade (Recommended)

This method performs an in-place rolling upgrade with minimal downtime. All upgrade commands are executed from the primary server node.

Step 1: Obtain the New Installation ISO

Unmount the old ISO (if mounted) and mount the new installation ISO:

umount /mnt/esb3027 2>/dev/null || true
mount -o loop,ro esb3027-acd-manager-X.Y.Z.iso /mnt/esb3027

Replace X.Y.Z with the target version number.

Step 2: Update Containers and Cluster Software

Run the installation script to update the container images and cluster software:

/mnt/esb3027/install

Wait for the script to complete.

Step 3: Air-Gapped Environments (If Applicable)

If deploying in an air-gapped environment, also mount and load the extras ISO:

# Mount the Extras ISO
mkdir -p /mnt/esb3027-extras
mount -o loop,ro esb3027-acd-manager-extras-X.Y.Z.iso /mnt/esb3027-extras

# Load container images from the extras ISO
/mnt/esb3027-extras/load-images

Replace X.Y.Z with the target version number.

Step 4: Review and Update Configuration

Compare the default values.yaml from the new ISO with your current configuration:

diff /mnt/esb3027/values.yaml ~/values.yaml

Update your configuration file to include any new required settings. Common updates include:

# ~/values.yaml
global:
  hosts:
    manager:
      - host: manager.example.com
    routers:
      - name: director-1
        address: 192.0.2.1

zitadel:
  zitadel:
    configmapConfig:
      ExternalDomain: manager.example.com

# Add any new required settings for the target version

Important: Do not modify settings unrelated to the upgrade unless specifically documented in the release notes.

Step 5: Update MaxMind GeoIP Volumes (If Applicable)

If you use MaxMind GeoIP databases, use the utility from the new ISO to create an updated volume:

/mnt/esb3027/generate-maxmind-volume

Update your values.yaml to reference the new volume name:

manager:
  maxmindDbVolume: maxmind-geoip-2026-04

Tip: Using dated or versioned volume names (e.g., maxmind-geoip-2026-04) allows you to create new volumes during upgrades and delete old ones after verification.

Step 6: Update TLS Certificates (If Needed)

If your TLS certificates need renewal or the new version requires certificate updates, create or update the secret:

kubectl create secret tls acd-manager-tls --cert=tls.crt --key=tls.key --dry-run=client -o yaml | kubectl apply -f -

Step 7: Upgrade the Helm Release

Perform a Helm upgrade with the new chart:

helm upgrade acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml

Note: The upgrade performs a rolling update of each deployment in the chart. Deployments are upgraded one at a time, with pods being terminated and recreated sequentially. StatefulSets (PostgreSQL, Kafka, Redis) roll out one pod at a time to maintain data availability.

Monitor the upgrade progress:

kubectl get pods --watch

Wait for all pods to stabilize and show Running status before considering the upgrade complete. Some pods may temporarily enter CrashLoopBackOff during the transition as they wait for dependencies to become available.

Step 8: Verify the Upgrade

Check the deployed version:

helm list
kubectl get deployments -o wide

Verify application functionality:

Access the MIB Frontend and confirm it loads
Test API connectivity
Verify Grafana dashboards are accessible
Check that Zitadel authentication is working

Step 9: Clean Up

After confirming the upgrade is successful:

Unmount the installation ISO if it is no longer needed (the previous version's ISO was already unmounted in Step 1):
```
umount /mnt/esb3027
```

Delete old MaxMind volumes (if replaced):

kubectl get pvc
kubectl delete pvc <old-volume-name>

Remove old configuration files if no longer needed.

Method 2: Clean Upgrade (Helm Uninstall/Install)

This method removes the existing Helm release before installing the new version. This is useful for major version upgrades or when troubleshooting upgrade issues. All upgrade commands are executed from the primary server node.

Warning: This method causes brief downtime as all resources are deleted before reinstallation.

Step 1: Obtain the New Installation ISO

Mount the new installation ISO:

umount /mnt/esb3027 2>/dev/null || true
mount -o loop,ro esb3027-acd-manager-X.Y.Z.iso /mnt/esb3027

Step 2: Backup Configuration

Save your current Helm values:

helm get values acd-manager -o yaml > ~/values-backup.yaml

Step 3: Uninstall the Existing Release

Remove the existing Helm release:

helm uninstall acd-manager

Wait for pods to terminate:

kubectl get pods --watch

Note: Helm uninstall does not remove PersistentVolumes (PVs) or PersistentVolumeClaims (PVCs). All data stored in PostgreSQL, Kafka, Redis, and Longhorn volumes is preserved during the uninstall process. When the new version is installed, it will reattach to the existing PVCs and restore data automatically.

Step 4: Review and Update Configuration

Compare the default values.yaml from the new ISO with your configuration:

diff /mnt/esb3027/values.yaml ~/values.yaml

Update your configuration file as needed.

Step 5: Install the New Release

Install the new version:

helm install acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml

Monitor the deployment:

kubectl get pods --watch

Wait for all pods to stabilize before proceeding.

Step 6: Verify the Upgrade

Verify the upgrade as described in Method 1, Step 8.

Method 3: Full Reinstall (Cluster Rebuild)

This method completely removes Kubernetes and reinstalls from scratch. Use only for cluster rebuilds or when other upgrade methods fail.

Warning: This method causes extended downtime and permanent data loss. The K3s uninstall process destroys all Longhorn PersistentVolumes (PVs) and PersistentVolumeClaims (PVCs), so all data stored in PostgreSQL, Kafka, Redis, and application volumes is permanently lost. Use this method only when necessary, and confirm that you have verified backups before proceeding.

Step 1: Stop Kubernetes Services

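On all nodes (server and agent), stop the K3s service. On agent nodes installed with the standard K3s install script, the unit is typically named k3s-agent rather than k3s: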

systemctl stop k3s

Step 2: Uninstall K3s (Server Nodes Only)

On the primary server node first, then each additional server node:

/usr/local/bin/k3s-uninstall.sh

Step 3: Clean Up Residual State (All Nodes)

On all nodes, remove residual state:

/usr/local/bin/k3s-killall.sh
rm -rf /var/lib/rancher/k3s/*

Warning: This removes all cluster data including Longhorn PersistentVolumes (PVs) and PersistentVolumeClaims (PVCs). All data stored in PostgreSQL, Kafka, Redis, and application volumes will be permanently lost. Ensure verified backups are available before proceeding.

Step 4: Reinstall K3s Cluster and Deploy Manager

Follow the installation procedure in the Installation Guide to reinstall the cluster and deploy the Helm chart. At this point, you are in the same state as a fresh installation:

Primary server installation
Additional server joins (if applicable)
Agent joins (if applicable)
Helm chart deployment

Note: The K3s node token is regenerated during reinstallation. Retrieve the new token from /var/lib/rancher/k3s/server/node-token on the primary server after installation if you need to join additional nodes.

Rollback Procedure

Rollback procedures vary by upgrade method:

Method 1 (Rolling Upgrade)

Use Helm’s built-in rollback command:

helm rollback acd-manager

This reverts to the previous Helm release revision automatically.

Alternatively, redeploy the previous version manually from the old ISO (mounted at /mnt/esb3027-old in this example):

helm upgrade acd-manager /mnt/esb3027-old/helm/charts/acd-manager \
  --values ~/values.yaml

Note: If you use multiple --values files for organization, ensure they are specified in the same order as the original installation.

Method 2 (Clean Upgrade)

Reinstall the previous version:

helm uninstall acd-manager
helm install acd-manager /mnt/esb3027-old/helm/charts/acd-manager \
  --values ~/values-backup.yaml

Method 3 (Full Reinstall)

Rollback requires repeating the full cluster reinstall procedure using the old installation ISO. Follow Method 3 steps with the previous version’s ISO. Ensure verified backups are available before attempting.

Troubleshooting

Pods Fail to Start

Check pod status and events:

kubectl describe pod <pod-name>
kubectl get events --sort-by='.lastTimestamp'

Review pod logs:

kubectl logs <pod-name>
kubectl logs <pod-name> -p  # Previous instance logs

Database Migration Issues

If PostgreSQL migrations fail:

Check CloudNativePG cluster status:

kubectl get clusters
kubectl describe cluster <cluster-name>

Review migration job logs:

kubectl get jobs
kubectl logs job/<migration-job-name>

Helm Upgrade Fails

If helm upgrade fails:

Check Helm release status:

helm status acd-manager
helm history acd-manager

Review the error message for specific failures
Attempt rollback if necessary

Post-Upgrade

After a successful upgrade:

Review the Release Notes for any post-upgrade tasks
Update monitoring dashboards if new metrics are available
Test all critical functionality
Document the upgrade in your change management system

Next Steps

After completing the upgrade:

Next Steps Guide - Review post-installation tasks
Operations Guide - Day-to-day operational procedures
Release Notes - Review new features and changes

5.7 - Next Steps

Post-installation configuration tasks

Overview

After completing the installation of the AgileTV CDN Manager (ESB3027), several post-installation configuration tasks must be performed before the system is ready for production use. This guide walks you through the essential next steps.

Prerequisites

Before proceeding, ensure:

The CDN Manager Helm chart is successfully deployed
All pods are in Running status
You have network access to the cluster hostname or IP
You have the default credentials available

Step 1: Access Zitadel Console

The first step is to configure user authentication through Zitadel Identity and Access Management (IAM).

Navigate to the Zitadel Console:
```
https://<manager-host>/ui/console
```
Replace <manager-host> with your configured hostname (e.g., manager.local or manager.example.com).
Important: The <manager-host> must match the first entry in global.hosts.manager from your Helm values exactly. Zitadel uses name-based virtual hosting and CORS validation. If the hostname does not match, authentication will fail.
Log in with the default administrator credentials (also listed in the Glossary):
- Username: admin@agiletv.dev
- Password: Password1!
Important: If prompted to configure Multi-Factor Authentication (MFA), you must skip this step for now. MFA is not currently supported. Attempting to configure MFA may lock you out of the administrator account.
Security Recommendation: After logging in, create a new administrator account with proper roles. Once verified, disable or delete the default admin@agiletv.dev account. For details on required roles and administrator permissions, see Zitadel’s Administrator Documentation.

Step 2: Configure SMTP Settings (Recommended)

Zitadel requires an SMTP server to send email notifications and perform email validations.

In the Zitadel Console, navigate to Settings > Default Settings
Configure the SMTP settings:
- SMTP Host: Your mail server hostname
- SMTP Port: Typically 587 (STARTTLS) or 465 (implicit TLS)
- SMTP Username: Mail account username
- SMTP Password: Mail account password
- Sender Address: Email address for outgoing mail (e.g., noreply@example.com)
Save the configuration

Note: Without SMTP configuration, email-based user validation and password recovery features will not function.

Step 3: Create Additional User Accounts

Create user accounts for operators and administrators:

Tip: For detailed guidance on managing users, roles, and permissions in the Zitadel Console, see Zitadel’s User Management Documentation.

In the Zitadel Console, navigate to Users > Add User
Fill in the user details:
- Username: Unique username
- First Name: User’s first name
- Last Name: User’s last name
- Email: User’s email address (this is their login username)
Known Issue: Due to a limitation in this release of Zitadel, the username must match the local part (the portion before the @) of the email address. For example, if the email is foo@example.com, the username must be foo.
If these do not match, Zitadel may allow login with the mismatched local part while blocking the full email address. For instance, if username is foo but email is foo.bar@example.com, login with foo@example.com may succeed while foo.bar@example.com is blocked.
Workaround: Always ensure the username matches the email local part exactly.
Important: If SMTP is not configured, the following options must be set, otherwise the user account will not be enabled:
- Email Verified: Check this box to skip email verification
- Set Initial Password: Enter a temporary password for the user
Note: If you configured SMTP settings in Step 2, the user instead receives an email asking them to verify their address and set their initial password.
Click Create User
Provide the user with:
- Their username
- The temporary password (if set manually)
- The Zitadel Console URL
Instruct the user to change their password on first login
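
The username rule from the Known Issue in this step can be checked before an account is created. The following is a minimal sketch; the function name username_matches_email is hypothetical and is not part of Zitadel or the CDN Manager.

```shell
# Hypothetical helper: succeed only if the username equals the local part
# (the text before the first "@") of the email address.
username_matches_email() {
  username=$1
  email=$2
  local_part=${email%%@*}   # strip everything from the first "@" onward
  [ "$username" = "$local_part" ]
}

# Example: foo / foo@example.com passes; foo / foo.bar@example.com does not.
if username_matches_email "foo" "foo@example.com"; then
  echo "ok: username matches email local part"
fi
if ! username_matches_email "foo" "foo.bar@example.com"; then
  echo "mismatch: username should be foo.bar"
fi
```

Running the check on a list of planned accounts before entering them in the console avoids the login behavior described in the Known Issue.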

Step 4: Configure User Roles and Permissions

Zitadel manages roles and permissions for accessing the CDN Manager:

In the Zitadel Console, navigate to Roles
Assign appropriate roles to users:
- Admin: Full administrative access
- Operator: Operational access without administrative functions
- Viewer: Read-only access
To assign a role:
- Select the user
- Click Add Role
- Select the appropriate role
- Save the assignment

Step 5: Access the MIB Frontend

The MIB Frontend is the web-based configuration GUI for CDN operators:

Navigate to the MIB Frontend:
```
https://<manager-host>/gui
```
Log in using your Zitadel credentials
Verify you can access the configuration interface

Step 6: Verify API Access

Test API connectivity to ensure the system is functioning:

curl -k https://<manager-host>/api/v1/health/ready

Expected response:

{
  "status": "ready"
}

See the API Guide for detailed API documentation.
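
For scripted checks, the response body can be tested instead of read by eye. This is a minimal sketch that assumes the response has the shape shown in this step; is_ready is a hypothetical helper name, and the pattern match is deliberately simple rather than a full JSON parse.

```shell
# Hypothetical helper: read a health response on stdin and succeed only if
# it contains "status": "ready" (whitespace around the colon is tolerated).
is_ready() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"ready"'
}

# Offline demonstration with the documented response body:
if printf '{\n  "status": "ready"\n}\n' | is_ready; then
  echo "manager is ready"
fi

# Against a live system (replace <manager-host>):
#   curl -ks https://<manager-host>/api/v1/health/ready | is_ready
```

The helper's exit status can be used directly in a wait loop or a deployment script.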

Step 7: Configure TLS Certificates (If Not Done During Installation)

For production deployments, a valid TLS certificate from a trusted Certificate Authority should be configured. If you did not configure TLS certificates during installation, refer to Step 12: Configure TLS Certificates in the Installation Guide.

Step 8: Set Up Monitoring and Alerting

Configure monitoring dashboards and alerting:

Access Grafana:
- Navigate to https://<manager-host>/grafana
- Log in with default credentials (also listed in the Glossary):
  - Username: admin
  - Password: edgeware
Review Pre-built Dashboards:
- System health dashboards are included by default
- CDN metrics dashboards show routing and usage statistics
Note: CDN Director instances automatically have DNS names configured for use in Grafana dashboards. The DNS name is derived from the name field in global.hosts.routers with .external appended. For example, a router named my-router-1 will have the DNS name my-router-1.external in Grafana configuration.

Step 9: Verify Kafka and PostgreSQL Health

Ensure the data layer components are healthy:

kubectl get pods

Verify the following pods are running:

Component	Pod Name Pattern	Expected Status
Kafka	`acd-manager-kafka-controller-*`	Running (3 pods for production)
PostgreSQL	`acd-cluster-postgresql-0`, `acd-cluster-postgresql-1`, `acd-cluster-postgresql-2`	Running (3-node HA cluster)
Redis	`acd-manager-redis-master-*`	Running

All pods should show Running status with no restarts.
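
The check above can be automated by filtering the pod listing. This is a sketch only: unhealthy_pods is a hypothetical helper, the sample pod names are illustrative, and the column layout (NAME READY STATUS RESTARTS AGE) is assumed to match the default kubectl output. Pods created by finished Jobs report Completed and would also be listed.

```shell
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin
# and print the name of every pod that is not Running or has restarted.
# Columns assumed: NAME READY STATUS RESTARTS AGE
unhealthy_pods() {
  awk '$3 != "Running" || ($4 + 0) > 0 { print $1 }'
}

# Offline demonstration with sample output (pod names are illustrative):
unhealthy_pods <<'EOF'
acd-manager-kafka-controller-0   1/1   Running            0             2d
acd-cluster-postgresql-0         1/1   Running            0             2d
acd-manager-redis-master-0       0/1   CrashLoopBackOff   7 (2m ago)    2d
EOF

# Against a live cluster:
#   kubectl get pods --no-headers | unhealthy_pods
```

An empty result means every listed pod is Running with zero restarts.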

Step 10: Configure Availability Zones (Optional)

For improved network performance, configure availability zones to enable Topology Aware Hints. This optimizes service-to-pod routing by keeping traffic within the same zone when possible.

See the Performance Tuning Guide for detailed instructions on:

Labeling nodes with zone and region topology
Verifying topology configuration
Requirements for Topology Aware Hints to activate
Integration with pod anti-affinity rules

Note: This step is optional. If zone labels are not configured, the system will fall back to random load-balancing.

Step 11: Review System Configuration

Verify the initial configuration:

Review Helm Values:
```
helm get values acd-manager -o yaml
```
Check Ingress Configuration:
```
kubectl get ingress
```
Verify Service Endpoints:
```
kubectl get endpoints
```

Step 12: Document Your Deployment

Maintain documentation for your deployment:

Cluster hostname and IP addresses
Configuration file locations
User accounts and roles created
TLS certificate expiration dates
Backup procedures and schedules
Monitoring and alerting contacts

Next Steps

After completing post-installation configuration:

Configuration Guide - Detailed system configuration options
Operations Guide - Day-to-day operational procedures
Metrics & Monitoring Guide - Comprehensive monitoring setup
API Guide - REST API reference and integration examples

Troubleshooting

Cannot Access Zitadel Console

Verify DNS resolution or hosts file configuration
Check that Traefik ingress is running: kubectl get pods -n kube-system | grep traefik
Review Traefik logs: kubectl logs -n kube-system -l app.kubernetes.io/name=traefik

Authentication Failures

Verify Zitadel pods are healthy: kubectl get pods | grep zitadel
Check Zitadel logs: kubectl logs <zitadel-pod-name>
Ensure the external domain matches your hostname in Zitadel configuration

MIB Frontend Not Loading

Verify MIB Frontend pods are running: kubectl get pods | grep mib-frontend
Check for connectivity issues to Confd and API services
Review browser console for JavaScript errors

API Returns 401 Unauthorized

Verify you have a valid bearer token
Check token expiration
Ensure Zitadel authentication is functioning

For additional troubleshooting assistance, refer to the Troubleshooting Guide.

6 - Configuration Guide

Helm chart configuration reference

Overview

The CDN Manager is deployed via Helm chart with configuration supplied through values.yaml files. This guide explains the configuration structure, how to apply changes, and provides a reference for all configurable options.

Configuration Files

The installation ISO provides three configuration files at /mnt/esb3027/:

File	Purpose
`values-lab.yaml`	Recommended starting point for single-node lab deployments
`values-production.yaml`	Recommended starting point for multi-node production deployments
`values.yaml`	Complete reference of all configurable options with their defaults

You only need to specify fields that differ from the defaults. Helm applies configuration hierarchically — values from your file override the chart’s built-in defaults, and any key you omit retains its default value.

Lab Configuration (`values-lab.yaml`)

values-lab.yaml is the recommended starting point for single-node lab, acceptance testing, and demonstration deployments. It pre-configures settings appropriate for a constrained single-node environment:

Single Kafka controller replica (the default 3 replicas require 3 separate nodes to satisfy pod anti-affinity rules)
Single Zitadel replica
Self-signed TLS by default, with real certificate configuration commented out for reference
Minimal resource requests suited to a single node

Copy the file to a writable location and edit it before deploying:

cp /mnt/esb3027/values-lab.yaml ~/values.yaml

The minimum required changes are:

Set global.hosts.manager[0].host to your node’s hostname or IP address
Set zitadel.zitadel.configmapConfig.ExternalDomain to the same value

These two values must match exactly or authentication will fail due to CORS policy violations. See Global Settings for details.

Production Configuration (`values-production.yaml`)

values-production.yaml is the recommended starting point for multi-node production deployments across a minimum three-node cluster. It pre-configures settings appropriate for a high-availability environment:

Three Zitadel replicas spread across nodes for HA
Production-grade resource requests and limits for all major components
Kafka with a dedicated single-replica StorageClass (avoiding unnecessary triple-redundancy on top of Kafka’s own quorum)
Manager HPA configured to scale between 3 and 8 replicas
TLS certificate configuration with clearly marked placeholders

Copy the file to a writable location and edit it before deploying:

cp /mnt/esb3027/values-production.yaml ~/values.yaml

The minimum required changes before deploying are:

Set global.hosts.manager[0].host to your primary manager hostname
Set zitadel.zitadel.configmapConfig.ExternalDomain to the same hostname
Replace the placeholder TLS certificate and key in the ingress.secrets section, and update the secretName values in mibFrontend.ingress.extraTls and zitadel.ingress.tls to match
Update global.hosts.routers with your CDN Director instances

See TLS Configuration and Global Settings for full details.

Hardware requirements: For per-node hardware specifications, refer to the System Requirements Guide. The System Requirements Guide is the authoritative source — the hardware comments in the header of values-production.yaml may not reflect the current requirements.

Complete Reference (`values.yaml`)

The full default values file at /mnt/esb3027/values.yaml documents every configurable option with its default value and inline comments. Use this as a reference when looking up available settings or understanding what the environment-specific files override.

Note: values.yaml is not intended to be used directly as your deployment configuration. Use values-lab.yaml or values-production.yaml as your starting point instead.

Configuration Merging

Helm merges configuration files from left to right, with later files overriding earlier values. This allows you to split your configuration into multiple files — for example, keeping TLS certificates separate from the main configuration:

# Multiple files merged left-to-right
helm install acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml \
  --values ~/values-tls.yaml

Individual Value Overrides

For temporary changes, you can override individual values with --set:

helm upgrade acd-manager /mnt/esb3027/helm/charts/acd-manager \
  --values ~/values.yaml \
  --set manager.logLevel=debug

Note: Using --set is discouraged for permanent changes, as the same arguments must be specified for every Helm operation.

Applying Configuration

Initial Installation

helm install acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml

Updating Configuration

helm upgrade acd-manager /mnt/esb3027/helm/charts/acd-manager \
  --values ~/values.yaml

Dry Run

Before applying changes, validate the configuration with a dry run:

helm upgrade acd-manager /mnt/esb3027/helm/charts/acd-manager \
  --values ~/values.yaml \
  --dry-run

Rollback

If an upgrade fails, rollback to the previous revision:

# View revision history
helm history acd-manager

# Rollback to previous revision
helm rollback acd-manager

# Rollback to specific revision
helm rollback acd-manager <revision_number>

Note: Rollback reverts the Helm release but does not modify your values.yaml file. You must manually revert configuration file changes.

Force Reinstall

If an upgrade fails and rollback is not sufficient, you can perform a clean reinstall:

helm uninstall acd-manager
helm install acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml

Warning: This is service-affecting as all pods will be destroyed and recreated.

Configuration Reference

Global Settings

The global section contains cluster-wide settings. The most critical configuration is global.hosts.

global:
  hosts:
    manager:
      - host: manager.local
    routers:
      - name: default
        address: 127.0.0.1
    edns_proxy: []
    geoip: []

Key	Type	Description
`global.hosts.manager`	Array	External IP addresses or DNS hostnames for all Manager cluster nodes
`global.hosts.routers`	Array	CDN Director (ESB3024) instances
`global.hosts.edns_proxy`	Array	EDNS Proxy addresses (currently unused)
`global.hosts.geoip`	Array	GeoIP Proxy addresses for Frontend GUI

Important: The first entry in global.hosts.manager must match zitadel.zitadel.configmapConfig.ExternalDomain exactly. Zitadel enforces CORS protection, and authentication will fail if these do not match.

Manager Configuration

Core Manager API server settings:

Key	Type	Default	Description
`manager.image.registry`	String	`ghcr.io`	Container image registry
`manager.image.repository`	String	`edgeware/acd-manager`	Container image repository
`manager.image.tag`	String		Image tag override (uses latest if empty)
`manager.logLevel`	String	`info`	Log level (`trace`, `debug`, `info`, `warn`, `error`)
`manager.replicaCount`	Number	`1`	Number of replicas (HPA manages this when enabled)
`manager.containerPorts.http`	Number	`80`	HTTP container port
`manager.maxmindDbVolume`	String		Name of PVC containing MaxMind GeoIP databases

Manager Resources

The chart supports both resource presets and explicit resource specifications:

Key	Type	Default	Description
`manager.resourcesPreset`	String	`` (empty)	Resource preset (see Resource Presets table). Ignored if `manager.resources` is set.
`manager.resources.requests.cpu`	String	`300m`	CPU request
`manager.resources.requests.memory`	String	`512Mi`	Memory request
`manager.resources.limits.cpu`	String	`1`	CPU limit
`manager.resources.limits.memory`	String	`4Gi`	Memory limit

Note: For production workloads, explicitly set manager.resources rather than using presets.

Manager Datastore

manager:
  datastore:
    type: redis
    namespace: "cdn_manager_ds"
    default_ttl: ""
    compression: zstd

Key	Type	Default	Description
`manager.datastore.type`	String	`redis`	Datastore backend type
`manager.datastore.namespace`	String	`cdn_manager_ds`	Redis namespace for manager data
`manager.datastore.default_ttl`	String	`` (empty)	Default TTL for entries
`manager.datastore.compression`	String	`zstd`	Compression algorithm (`none`, `zstd`, etc.)

Manager Discovery

manager:
  discovery: []
  # Example:
  # - namespace: "other"
  #   hosts:
  #     - other-host1
  #     - other-host2
  #   pattern: "other-.*"

Key	Type	Description
`manager.discovery`	Array	Array of discovery host configurations. Each entry can specify `hosts` (list of hostnames), `pattern` (regex pattern), or both

Manager Tuning

manager:
  tuning:
    enable_cache_control: true
    cache_control_max_age: "5m"
    cache_control_miss_max_age: ""

Key	Type	Default	Description
`manager.tuning.enable_cache_control`	Boolean	`true`	Enable cache control headers in responses
`manager.tuning.cache_control_max_age`	String	`5m`	Maximum age for cache control headers
`manager.tuning.cache_control_miss_max_age`	String	`` (empty)	Maximum age for cache control headers on cache misses

Manager Container Arguments

manager:
  args:
    - --config-file=/etc/manager/config.toml
    - http-server

Gateway Configuration

NGINX Gateway settings for external Director communication:

Key	Type	Default	Description
`gateway.replicaCount`	Number	`1`	Number of gateway replicas
`gateway.resources.requests.cpu`	String	`100m`	CPU request
`gateway.resources.requests.memory`	String	`128Mi`	Memory request
`gateway.resources.limits.cpu`	String	`150m`	CPU limit
`gateway.resources.limits.memory`	String	`192Mi`	Memory limit
`gateway.service.type`	String	`ClusterIP`	Service type

MIB Frontend Configuration

Web-based configuration GUI settings:

Key	Type	Default	Description
`mib-frontend.enabled`	Boolean	`true`	Enable the frontend GUI
`mib-frontend.frontend.resourcePreset`	String	`nano`	Resource preset
`mib-frontend.frontend.autoscaling.hpa.enabled`	Boolean	`true`	Enable HPA
`mib-frontend.frontend.autoscaling.hpa.minReplicas`	Number	`2`	Minimum replicas
`mib-frontend.frontend.autoscaling.hpa.maxReplicas`	Number	`4`	Maximum replicas

Confd Configuration

Confd settings for configuration management:

Key	Type	Default	Description
`confd.enabled`	Boolean	`true`	Enable Confd
`confd.service.ports.internal`	Number	`15000`	Internal service port

VictoriaMetrics Configuration

Time-series database for metrics:

Key	Type	Default	Description
`acd-metrics.enabled`	Boolean	`true`	Enable metrics components
`acd-metrics.victoria-metrics-single.enabled`	Boolean	`true`	Enable VictoriaMetrics
`acd-metrics.grafana.enabled`	Boolean	`true`	Enable Grafana
`acd-metrics.telegraf.enabled`	Boolean	`true`	Enable Telegraf
`acd-metrics.prometheus.enabled`	Boolean	`true`	Enable Prometheus metrics

Ingress Configuration

Traffic exposure settings:

Key	Type	Default	Description
`ingress.enabled`	Boolean	`true`	Enable ingress record generation
`ingress.pathType`	String	`Prefix`	Ingress path type
`ingress.hostname`	String	`` (empty)	Primary hostname (defaults to manager.local via global.hosts)
`ingress.path`	String	`/api`	Default path for ingress
`ingress.tls`	Boolean	`false`	Enable TLS configuration
`ingress.selfSigned`	Boolean	`false`	Generate self-signed certificate via Helm
`ingress.secrets`	Array		Custom TLS certificate secrets

Ingress Extra Paths

The chart includes default extra paths for Confd and GeoIP:

ingress:
  extraPaths:
    - path: /confd
      pathType: Prefix
      backend:
        service:
          name: acd-manager-gateway
          port:
            name: http
    - path: /geoip
      pathType: Prefix
      backend:
        service:
          name: acd-manager-gateway
          port:
            name: http

TLS Certificate Secrets

For production TLS certificates:

ingress:
  secrets:
    - name: manager.local-tls
      key: |-
        -----BEGIN RSA PRIVATE KEY-----
        ...
        -----END RSA PRIVATE KEY-----
      certificate: |-
        -----BEGIN CERTIFICATE-----
        ...
        -----END CERTIFICATE-----
  tls: true

Resource Configuration

Resource Presets

Predefined resource configurations for common deployment sizes:

Preset	Request CPU	Request Memory	Limit CPU	Limit Memory	Ephemeral Storage Limit
`nano`	100m	128Mi	150m	192Mi	2Gi
`micro`	250m	256Mi	375m	384Mi	2Gi
`small`	500m	512Mi	750m	768Mi	2Gi
`medium`	500m	1024Mi	750m	1536Mi	2Gi
`large`	1000m	2048Mi	1500m	3072Mi	2Gi
`xlarge`	1000m	3072Mi	3000m	6144Mi	2Gi
`2xlarge`	1000m	3072Mi	6000m	12288Mi	2Gi

Note: Limits are calculated as requests plus 50% (except for xlarge/2xlarge and ephemeral-storage).
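
The "requests plus 50%" rule can be verified against the table rows from nano through large. The sketch below uses integer arithmetic on the numeric part of each value; preset_limit is a hypothetical helper name, and the rule does not apply to the xlarge and 2xlarge rows.

```shell
# Limit = request + 50%, using integer arithmetic (inputs in m or Mi).
preset_limit() {
  echo $(( $1 * 3 / 2 ))
}

echo "micro CPU limit:    $(preset_limit 250)m"    # 375m
echo "micro memory limit: $(preset_limit 256)Mi"   # 384Mi
```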

Custom Resources

Override preset with custom values:

manager:
  resources:
    requests:
      cpu: "300m"
      memory: "512Mi"
    limits:
      cpu: "1"
      memory: "1Gi"

Note:

CPU values use millicores (1000m = 1 core)
Memory values use binary (IEC) units (1024Mi = 1Gi)
Requests represent minimum guaranteed resources
Limits represent maximum consumable resources

Capacity Planning

When sizing resources:

Requests determine scheduling (node must have available capacity)
Limits prevent resource starvation
Maintain 20-30% cluster headroom for scaling
Total capacity = sum over all components of (per-pod request × replica count), plus headroom
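
As a worked example of the sizing arithmetic, the sketch below totals CPU requests for two components using defaults listed in this guide (Manager at 300m with 3 replicas, gateway at 100m with 1 replica) and adds 25% headroom. The helper name cpu_capacity_needed is hypothetical, and a real deployment includes many more components than these two.

```shell
# Hypothetical helper: sum "request_millicores replica_count" pairs read
# from stdin, then add a headroom percentage given as the first argument.
cpu_capacity_needed() {
  awk -v headroom="$1" '
    { total += $1 * $2 }
    END { printf "%d\n", total * (100 + headroom) / 100 }
  '
}

# Manager: 300m x 3 replicas; gateway: 100m x 1 replica; 25% headroom.
cpu_capacity_needed 25 <<'EOF'
300 3
100 1
EOF
# prints 1250 (millicores): (900 + 100) * 1.25
```

The same calculation applies to memory by substituting Mi values for millicores.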

Security Contexts

Pod Security Context

manager:
  podSecurityContext:
    enabled: true
    fsGroup: 1001
    fsGroupChangePolicy: Always
    sysctls: []
    supplementalGroups: []

Container Security Context

manager:
  containerSecurityContext:
    enabled: true
    runAsUser: 1001
    runAsGroup: 1001
    runAsNonRoot: true
    readOnlyRootFilesystem: true
    privileged: false
    allowPrivilegeEscalation: false
    capabilities:
      drop: ["ALL"]
    seccompProfile:
      type: "RuntimeDefault"

Health Probes

Probe Types

Probe	Purpose	Failure Action
`startupProbe`	Initial startup verification	Container restart
`readinessProbe`	Traffic readiness check	Remove from load balancer
`livenessProbe`	Health monitoring	Container restart

Default Probe Configuration

Liveness Probe

manager:
  livenessProbe:
    enabled: true
    initialDelaySeconds: 5
    periodSeconds: 30
    timeoutSeconds: 10
    failureThreshold: 5
    successThreshold: 1
    httpGet:
      path: /api/v1/health/alive
      port: http

Readiness Probe

manager:
  readinessProbe:
    enabled: true
    initialDelaySeconds: 5
    periodSeconds: 10
    timeoutSeconds: 7
    failureThreshold: 3
    successThreshold: 1
    httpGet:
      path: /api/v1/health/ready
      port: http

Startup Probe

manager:
  startupProbe:
    enabled: true
    initialDelaySeconds: 0
    periodSeconds: 5
    timeoutSeconds: 3
    failureThreshold: 10
    successThreshold: 1
    httpGet:
      path: /api/v1/health/alive
      port: http
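
The probe defaults translate into approximate time windows. As a rough guide, a probe's failure action is taken after about failureThreshold × periodSeconds of consecutive failures. The sketch below applies that approximation to the defaults above; it ignores initialDelaySeconds and per-probe timeouts, and probe_window is a hypothetical helper name.

```shell
# Approximate window, in seconds, of consecutive failures before a probe's
# failure action is taken: periodSeconds * failureThreshold.
probe_window() {
  period=$1
  failure_threshold=$2
  echo $(( period * failure_threshold ))
}

echo "startup:   $(probe_window 5 10)s"   # 50s allowed for the container to start
echo "readiness: $(probe_window 10 3)s"   # 30s before removal from load balancing
echo "liveness:  $(probe_window 30 5)s"   # 150s before a container restart
```

If the Manager needs longer than the startup window on slow storage or constrained nodes, raising startupProbe.failureThreshold extends it without affecting steady-state liveness checks.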

Autoscaling Configuration

Horizontal Pod Autoscaler (HPA)

manager:
  autoscaling:
    hpa:
      enabled: true
      minReplicas: 3
      maxReplicas: 8
      targetCPU: 50
      targetMemory: 80

Key	Type	Default	Description
`manager.autoscaling.hpa.enabled`	Boolean	`true`	Enable HPA
`manager.autoscaling.hpa.minReplicas`	Number	`3`	Minimum number of replicas
`manager.autoscaling.hpa.maxReplicas`	Number	`8`	Maximum number of replicas
`manager.autoscaling.hpa.targetCPU`	Number	`50`	Target CPU utilization percentage
`manager.autoscaling.hpa.targetMemory`	Number	`80`	Target Memory utilization percentage
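
To see how these settings interact, the standard Kubernetes HPA calculation is desired replicas = ceil(current replicas × current utilization / target utilization), clamped to the configured minimum and maximum. The sketch below reproduces that arithmetic for a single metric; hpa_desired_replicas is a hypothetical helper name, and the sketch ignores the controller's tolerance band and stabilization windows.

```shell
# desired = ceil(current_replicas * current_utilization / target_utilization),
# clamped to [min_replicas, max_replicas]. Integer arithmetic only.
hpa_desired_replicas() {
  current=$1
  utilization=$2
  target=$3
  min=$4
  max=$5
  desired=$(( (current * utilization + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
}

# 3 replicas averaging 90% CPU against targetCPU 50, bounds 3..8:
hpa_desired_replicas 3 90 50 3 8   # ceil(5.4) = 6
```

When both CPU and memory targets are set, the autoscaler evaluates each metric and uses the larger result.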

Network Policy

networkPolicy:
  enabled: true
  allowExternal: true
  allowExternalEgress: true
  addExternalClientAccess: true

Key	Type	Default	Description
`networkPolicy.enabled`	Boolean	`true`	Enable NetworkPolicy
`networkPolicy.allowExternal`	Boolean	`true`	Allow connections from any source (don’t require pod label)
`networkPolicy.allowExternalEgress`	Boolean	`true`	Allow pods to reach any port and destination
`networkPolicy.addExternalClientAccess`	Boolean	`true`	Allow access from pods with client label set to “true”

Pod Affinity and Anti-Affinity

manager:
  podAffinityPreset: ""
  podAntiAffinityPreset: soft
  nodeAffinityPreset:
    type: ""
    key: ""
    values: []
  affinity: {}

Key	Type	Default	Description
`manager.podAffinityPreset`	String	`` (empty)	Pod affinity preset (`soft` or `hard`). Ignored if `affinity` is set
`manager.podAntiAffinityPreset`	String	`soft`	Pod anti-affinity preset (`soft` or `hard`). Ignored if `affinity` is set
`manager.nodeAffinityPreset.type`	String	`` (empty)	Node affinity preset type (`soft` or `hard`)
`manager.affinity`	Object	`{}`	Custom affinity rules (overrides presets)

Service Configuration

service:
  type: ClusterIP
  ports:
    http: 80
  annotations:
    service.kubernetes.io/topology-mode: Auto
  externalTrafficPolicy: Cluster
  sessionAffinity: None

Key	Type	Default	Description
`service.type`	String	`ClusterIP`	Service type
`service.ports.http`	Number	`80`	HTTP service port
`service.annotations`	Object	`service.kubernetes.io/topology-mode: Auto`	Service annotations
`service.externalTrafficPolicy`	String	`Cluster`	External traffic policy

Persistence Configuration

persistence:
  enabled: false
  mountPath: /agiletv/manager/data
  storageClass: ""
  accessModes:
    - ReadWriteOnce
  size: 8Gi

Key	Type	Default	Description
`persistence.enabled`	Boolean	`false`	Enable persistence using PVC
`persistence.mountPath`	String	`/agiletv/manager/data`	Mount path
`persistence.storageClass`	String	`` (empty)	Storage class (uses cluster default if empty)
`persistence.size`	String	`8Gi`	Size of data volume

RBAC and Service Account

rbac:
  create: false
  rules: []

serviceAccount:
  create: true
  name: ""
  automountServiceAccountToken: true
  annotations: {}

Metrics

metrics:
  enabled: false
  serviceMonitor:
    enabled: false
    namespace: ""
    annotations: {}
    labels: {}
    interval: ""
    scrapeTimeout: ""

Key	Type	Default	Description
`metrics.enabled`	Boolean	`false`	Enable Prometheus metrics export
`metrics.serviceMonitor.enabled`	Boolean	`false`	Create Prometheus Operator ServiceMonitor

Next Steps

After configuration:

Installation Guide - Deploy with your configuration
Operations Guide - Day-to-day management
Performance Tuning Guide - Optimize system performance
Architecture Guide - Understand component relationships

7 - Performance Tuning Guide

Optimization tips for improving CDN Manager performance

Overview

This guide provides performance tuning recommendations for the AgileTV CDN Manager (ESB3027). While the default configuration is suitable for most deployments, certain environments may benefit from additional optimizations.

Network Topology Optimization

Topology Aware Hints

The CDN Manager uses Kubernetes Topology Aware Hints to prefer routing traffic to pods in the same zone as the traffic's source. This reduces cross-zone latency and improves overall system responsiveness.

How It Works

When nodes are labeled with topology zones, Kubernetes automatically routes traffic to pods in the same zone when possible. This is particularly beneficial for:

Low-latency requirements: Keeps traffic local to reduce round-trip time
Cost optimization: Reduces cross-zone data transfer costs in cloud environments
Load distribution: Prevents hotspots by distributing load across zones

Configuring Availability Zones

Each node must have zone and region labels applied for Topology Aware Hints to function:

# Label a node with a zone
kubectl label nodes <node-name> topology.kubernetes.io/zone=us-east-1a

# Label a node with a region
kubectl label nodes <node-name> topology.kubernetes.io/region=us-east-1

Replace <node-name> with your actual node names and adjust the zone/region values to match your deployment geography.

Note: Labels applied via kubectl label are automatically persistent and will survive node restarts.
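
With more than a few nodes, the labeling step can be scripted. The following is a minimal sketch, not part of the product tooling: zone_label_commands is a hypothetical helper that only prints the kubectl label commands for review, and the node names, zones, and regions in the example are placeholders.

```shell
#!/bin/sh
# Print (do not run) one "kubectl label" command per "node zone region" line
# read from stdin. zone_label_commands is a hypothetical helper name.
zone_label_commands() {
    while read -r node zone region; do
        # Skip blank lines and comment lines in the mapping
        case "$node" in ''|'#'*) continue ;; esac
        printf 'kubectl label nodes %s topology.kubernetes.io/zone=%s topology.kubernetes.io/region=%s --overwrite\n' \
            "$node" "$zone" "$region"
    done
}

# Example mapping (placeholder names); review the output, then pipe it to "sh":
# zone_label_commands <<'EOF'
# node-1 us-east-1a us-east-1
# node-2 us-east-1b us-east-1
# EOF
```

Printing the commands first keeps the step reviewable; --overwrite lets the same mapping be re-applied after a correction.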

Verify Topology Configuration

Verify labels are applied:

kubectl get nodes --show-labels | grep topology.kubernetes.io

Verify EndpointSlices are being generated with hints:

kubectl get endpointslices -o yaml | grep -A 3 'hints:'

Requirements for Topology Aware Hints

For Topology Aware Hints to activate:

Minimum Nodes: At least one node must be labeled with each zone referenced by endpoints
Symmetry: The control plane checks for sufficient CPU capacity across zones to balance traffic
Zone Coverage: All zones with endpoints should have at least one ready node

Integration with Pod Anti-Affinity

Topology labels complement the pod anti-affinity rules already configured in the Helm chart:

Pod Anti-Affinity: Handles pod-to-node placement to ensure high availability
Topology Aware Hints: Handles service-to-pod traffic routing to keep requests within the same zone

Together, these features optimize both placement and routing for improved performance.

Fallback Behavior

If zone labels are not configured, the system falls back to random load-balancing across all available pods. This is functionally correct but may result in:

Increased cross-zone traffic
Higher latency for some requests
Less predictable performance characteristics

Kernel Network Tuning (sysctl)

For high-throughput deployments, tuning Linux kernel network parameters can significantly improve connection handling and overall system performance. These settings are particularly beneficial for environments with high connection rates or large numbers of concurrent connections.

Recommended sysctl Settings

Apply the following settings to optimize network performance:

# Networking
net.core.somaxconn = 1024
net.core.netdev_max_backlog = 2048
net.ipv4.tcp_max_syn_backlog = 2048

# Connection Tracking
net.netfilter.nf_conntrack_max = 131072
net.netfilter.nf_conntrack_tcp_timeout_established = 1200

# Port Reuse
net.ipv4.ip_local_port_range = 10240 65535
net.ipv4.tcp_tw_reuse = 1

# Memory Buffers
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608

Setting Descriptions

Parameter	Recommended Value	Purpose
`net.core.somaxconn`	1024	Maximum socket listen backlog. Increases pending connection queue size.
`net.core.netdev_max_backlog`	2048	Maximum packets queued at network device level. Helps handle burst traffic.
`net.ipv4.tcp_max_syn_backlog`	2048	Maximum SYN requests queued. Improves handling of connection floods.
`net.netfilter.nf_conntrack_max`	131072	Maximum tracked connections. Prevents connection tracking table exhaustion.
`net.netfilter.nf_conntrack_tcp_timeout_established`	1200	Timeout for established connections (seconds). Reduces stale entry buildup.
`net.ipv4.ip_local_port_range`	10240 65535	Range of local ports for outbound connections. Expands available ephemeral ports.
`net.ipv4.tcp_tw_reuse`	1	Allows reusing TIME_WAIT sockets. Reduces port exhaustion under high load.
`net.core.rmem_max`	8388608	Maximum receive socket buffer size (8MB). Improves high-bandwidth transfers.
`net.core.wmem_max`	8388608	Maximum send socket buffer size (8MB). Improves high-bandwidth transfers.

Applying Settings

Temporary (Until Reboot)

Apply settings immediately; they are lost on reboot:

sudo sysctl -w net.core.somaxconn=1024
sudo sysctl -w net.core.netdev_max_backlog=2048
# ... repeat for each parameter

Persistent (Across Reboots)

Add settings to /etc/sysctl.conf or a file in /etc/sysctl.d/:

# Create a dedicated config file
cat <<EOF | sudo tee /etc/sysctl.d/99-cdn-manager.conf
# CDN Manager Network Tuning
net.core.somaxconn = 1024
net.core.netdev_max_backlog = 2048
net.ipv4.tcp_max_syn_backlog = 2048
net.netfilter.nf_conntrack_max = 131072
net.netfilter.nf_conntrack_tcp_timeout_established = 1200
net.ipv4.ip_local_port_range = 10240 65535
net.ipv4.tcp_tw_reuse = 1
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
EOF

# Apply all settings
sudo sysctl -p /etc/sysctl.d/99-cdn-manager.conf
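
To confirm that the running kernel matches the file, the live values under /proc/sys can be compared against it. This is a sketch under stated assumptions: check_sysctl_file is a hypothetical helper, it handles only simple "key = value" lines like those above, and its optional second argument exists only so the function can be pointed at a test directory.

```shell
#!/bin/sh
# Compare live kernel values with a "key = value" file such as
# /etc/sysctl.d/99-cdn-manager.conf. Returns non-zero if any value differs.
# check_sysctl_file is a hypothetical helper name.
check_sysctl_file() {
    conf_file=$1
    proc_root=${2:-/proc/sys}
    mismatches=0
    while IFS= read -r line; do
        # Skip blank lines and comments
        case "$line" in ''|'#'*) continue ;; esac
        key=$(printf '%s' "${line%%=*}" | xargs)
        want=$(printf '%s' "${line#*=}" | xargs)
        # net.core.somaxconn -> <proc_root>/net/core/somaxconn
        path="$proc_root/$(printf '%s' "$key" | tr . /)"
        if [ -r "$path" ]; then
            have=$(xargs < "$path")   # xargs collapses tabs/spaces to single spaces
        else
            have="<missing>"
        fi
        if [ "$have" != "$want" ]; then
            printf 'MISMATCH %s: want "%s", have "%s"\n' "$key" "$want" "$have"
            mismatches=$((mismatches + 1))
        fi
    done < "$conf_file"
    [ "$mismatches" -eq 0 ]
}

# Example:
# check_sysctl_file /etc/sysctl.d/99-cdn-manager.conf && echo "all settings applied"
```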

Kubernetes Considerations

For Kubernetes deployments, these sysctl settings can be applied via:

Node-level configuration: Use DaemonSets or node provisioning scripts
Pod-level safe sysctls: Some sysctls can be set per-pod via securityContext.sysctls
Container runtime configuration: Configure via container runtime options

Note that some sysctls require privileged containers or node-level configuration.
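
Of the parameters listed above, net.ipv4.ip_local_port_range is in the set that Kubernetes documents as safe, so it can be set per pod without extra kubelet configuration. The sketch below prints an illustrative Pod manifest; the pod name, container name, and image are placeholders, and safe_sysctl_pod_manifest is a hypothetical helper, not something shipped with the Helm chart.

```shell
#!/bin/sh
# Print a Pod manifest that sets one safe sysctl via securityContext.sysctls.
# Pod name, container name, and image are placeholders for illustration.
safe_sysctl_pod_manifest() {
    cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: sysctl-demo
spec:
  securityContext:
    sysctls:
      - name: net.ipv4.ip_local_port_range
        value: "10240 65535"
  containers:
    - name: demo
      image: busybox
      command: ["sleep", "3600"]
EOF
}

# Validate client-side without creating anything:
# safe_sysctl_pod_manifest | kubectl apply --dry-run=client -f -
```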

Monitoring Impact

After applying these settings, monitor:

Connection establishment rates
TIME_WAIT socket count: ss -Htan state time-wait | wc -l
Connection tracking table usage: cat /proc/sys/net/netfilter/nf_conntrack_count
Network buffer utilization via Grafana dashboards
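
The raw conntrack count is easier to judge as a share of nf_conntrack_max. A minimal sketch, assuming integer rounding is acceptable; conntrack_usage_percent is a hypothetical helper name.

```shell
#!/bin/sh
# Integer percentage of the connection tracking table in use.
# $1 = tracked entries (nf_conntrack_count), $2 = table limit (nf_conntrack_max)
conntrack_usage_percent() {
    echo $(( $1 * 100 / $2 ))
}

# On a node:
# conntrack_usage_percent \
#   "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" \
#   "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"
```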

Resource Configuration

Horizontal Pod Autoscaler (HPA)

The default HPA configuration is tuned for production workloads. For environments with variable load, consider adjusting the scale metrics:

Component	Default Scale Metrics	Tuning Consideration
Core Manager	CPU 50%, Memory 80%	Lower CPU threshold for faster scale-out
NGINX Gateway	CPU 75%, Memory 80%	Increase for cost optimization
MIB Frontend	CPU 75%, Memory 90%	Adjust based on operator concurrency

For detailed HPA configuration, see the Architecture Guide.
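
To see what a threshold change does, it helps to apply the scaling rule from the Kubernetes HPA documentation: desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization). The sketch below is a simplified illustration of that rule only; it ignores the tolerance band, stabilization windows, and min/max replica bounds, and hpa_desired_replicas is a hypothetical helper name.

```shell
#!/bin/sh
# Simplified HPA rule: ceil(current_replicas * current_util / target_util).
# $1 = current replicas, $2 = observed utilization (%), $3 = target utilization (%)
hpa_desired_replicas() {
    # Integer ceiling division: (a + b - 1) / b
    echo $(( ($1 * $2 + $3 - 1) / $3 ))
}

# Example: 3 Core Manager replicas at 80% CPU with the default 50% target
# hpa_desired_replicas 3 80 50
```

Lowering the target from 75% to 50% at the same observed load therefore asks for proportionally more replicas, which is the cost side of faster scale-out.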

Resource Requests and Limits

Ensure resource requests and limits are appropriately sized for your workload. Under-provisioned resources can cause:

Pod evictions during high load
Increased latency due to CPU throttling
Slow scaling responses

Refer to the Configuration Guide for preset configurations and planning guidance.

Database Optimization

PostgreSQL

The PostgreSQL cluster is managed by the CloudNativePG operator. For improved performance:

Connection Pooling: The application uses connection pooling by default
Replica Usage: Read queries can be offloaded to replicas for read-heavy workloads
Backup Scheduling: Schedule backups during low-traffic periods to minimize I/O impact

Redis

Redis provides in-memory caching for sessions and ephemeral state:

Memory Allocation: Ensure sufficient memory for cache hit rates
Persistence: RDB snapshots are enabled; adjust frequency based on durability needs

Kafka

Kafka handles event streaming for selection input and metrics:

Partition Count: Default partitions are sized for typical workloads
Replication Factor: Production deployments use 3 replicas for fault tolerance
Consumer Groups: The Selection Input Worker is limited to one consumer per partition
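
Because each partition is read by at most one consumer in a group, adding worker replicas beyond the partition count does not add throughput. A small sketch of that arithmetic, using hypothetical helper names:

```shell
#!/bin/sh
# Consumers that actually receive partitions: min(replicas, partitions).
# $1 = worker replicas, $2 = topic partitions
active_consumers() {
    if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi
}

# Replicas left without a partition assignment
idle_consumers() {
    echo $(( $1 - $(active_consumers "$1" "$2") ))
}

# Example: 5 worker replicas reading a 3-partition topic
# active_consumers 5 3
# idle_consumers 5 3
```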

Monitoring Performance

Key Metrics to Watch

Monitor the following metrics for performance insights:

API Response Time: Track via Grafana dashboards
Pod CPU/Memory Usage: Identify resource bottlenecks
Kafka Lag: Monitor consumer lag for selection input processing
Database Connections: Watch for connection pool exhaustion

Grafana Dashboards

Pre-built dashboards are available at https://<manager-host>/grafana:

System Health: Overall cluster and application health
CDN Metrics: Routing and usage statistics
Resource Utilization: CPU, memory, and network usage per component

Troubleshooting Performance Issues

High Latency

Check pod distribution across nodes: kubectl get pods -o wide
Verify topology labels are applied: kubectl get nodes --show-labels
Review network latency between nodes
Check for resource contention: kubectl top pods

Slow Scaling

Verify HPA is enabled: kubectl get hpa
Check cluster capacity for scheduling new pods
Review HPA metrics: kubectl describe hpa acd-manager

Database Performance

Check PostgreSQL cluster status: kubectl get pods -l app.kubernetes.io/name=postgresql
Review slow query logs (if enabled)
Monitor connection pool usage

Next Steps

After reviewing performance tuning:

Architecture Guide - Understand component interactions
Configuration Guide - Detailed configuration options
Metrics & Monitoring Guide - Comprehensive monitoring setup
Troubleshooting Guide - Resolve performance issues

8 - Operations Guide

Day-to-day operational procedures and maintenance tasks

Overview

This guide covers day-to-day operational procedures for managing the AgileTV CDN Manager (ESB3027). Topics include routine maintenance, backup procedures, log management, and common operational tasks.

Prerequisites

Before performing operations, ensure you have:

kubectl access to the cluster
helm CLI installed
Access to the node where values.yaml is stored
Appropriate RBAC permissions for administrative tasks

Cluster Access

There are two supported methods for accessing the Kubernetes cluster:

SSH to a Server Node (Recommended for operations staff) - SSH into any Server node and run kubectl commands directly
Remote kubectl - Install kubectl on your local machine and configure it to connect to the cluster remotely

Method 1: SSH to Server Node (Recommended)

The kubectl command-line tool is pre-configured on all Server nodes and can be used directly without additional setup:

# SSH to any Server node
ssh root@<server-ip>

# Run kubectl commands directly
kubectl get nodes
kubectl get pods

This method is recommended for day-to-day operations as it requires no local configuration and provides direct access to the cluster.

Method 2: Remote kubectl from Local Machine

To use kubectl from your local workstation or laptop:

Step 1: Install kubectl

Download and install kubectl for your operating system:

Official Documentation: Install kubectl
macOS (Homebrew): brew install kubectl
Linux: Download from the official Kubernetes release page
Windows: Download from the official Kubernetes release page

Step 2: Copy kubeconfig from Server Node

# Copy kubeconfig from any Server node (overwrites any existing ~/.kube/config)
mkdir -p ~/.kube
scp root@<server-ip>:/etc/rancher/k3s/k3s.yaml ~/.kube/config
chmod 600 ~/.kube/config

Step 3: Update kubeconfig

Edit the kubeconfig file to point to the correct server address:

# Replace localhost with the actual server IP (run only the line for your OS)
sed -i '' 's/127\.0\.0\.1/<server-ip>/g' ~/.kube/config  # macOS (BSD sed)
sed -i 's/127\.0\.0\.1/<server-ip>/g' ~/.kube/config     # Linux (GNU sed)

# Or manually edit ~/.kube/config and change:
# server: https://127.0.0.1:6443
# to:
# server: https://<server-ip>:6443

Step 4: Verify connectivity

kubectl get nodes

Managing Multiple Clusters

If you manage multiple Kubernetes clusters from the same machine, you can maintain multiple kubeconfig files:

# Set KUBECONFIG environment variable to include multiple config files
export KUBECONFIG=~/.kube/config-prod:~/.kube/config-lab

# View all contexts
kubectl config get-contexts

# Switch between clusters
kubectl config use-context <context-name>

# View current context
kubectl config current-context

For more information, see the official Kubernetes documentation: Organizing Cluster Access

Helm Commands

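Helm releases are scoped to a namespace; the commands below act on the current namespace (add -n <namespace>, or -A with helm list, to look elsewhere):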

# List all releases
helm list

# View release history
helm history acd-manager

# Get deployed values
helm get values acd-manager -o yaml

# Get deployed manifest
helm get manifest acd-manager

Note: If using remote kubectl, ensure helm is installed on your local machine. See Helm Installation for instructions.

Backup Procedures

PostgreSQL Backup

PostgreSQL is managed by the CloudNativePG operator, which provides continuous backup capabilities.

# Check backup status
kubectl get backup

# Create manual backup
kubectl apply -f - <<EOF
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: manual-backup-$(date +%Y%m%d-%H%M%S)
spec:
  cluster:
    name: acd-cluster-postgresql
EOF

# List available backups
kubectl get backup -o wide

# Restore from backup (requires downtime)
# See Upgrade Guide for restore procedures

Longhorn Volume Backups

Longhorn provides snapshot and backup capabilities for persistent volumes:

# List all volumes
kubectl get volumes -n longhorn-system

# Create snapshot via Longhorn UI
# Port-forward to Longhorn UI (do not expose via ingress)
kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80

# Access: http://localhost:8080
# WARNING: Longhorn UI grants access to sensitive storage information
# and should never be exposed through the ingress controller

Accessing Internal Services

For debugging and troubleshooting, you may need direct access to internal services.

PostgreSQL

PostgreSQL is managed by the CloudNativePG operator. Connection details are stored in the acd-cluster-postgresql-app Secret:

# View connection details
kubectl describe secret acd-cluster-postgresql-app

# Extract individual fields
PG_HOST=$(kubectl get secret acd-cluster-postgresql-app -o jsonpath='{.data.host}' | base64 -d)
PG_USER=$(kubectl get secret acd-cluster-postgresql-app -o jsonpath='{.data.username}' | base64 -d)
PG_PASS=$(kubectl get secret acd-cluster-postgresql-app -o jsonpath='{.data.password}' | base64 -d)
PG_DB=$(kubectl get secret acd-cluster-postgresql-app -o jsonpath='{.data.dbname}' | base64 -d)

# Connect via psql
kubectl exec -it acd-cluster-postgresql-0 -- psql -U $PG_USER -d $PG_DB

Secret fields: The CNPG operator populates the following fields: username, password, host, port, dbname, uri, jdbc-uri, fqdn-uri, fqdn-jdbc-uri, pgpass.

Redis

Redis runs on port 6379 with no authentication:

# Connect via redis-cli
kubectl exec -it acd-manager-redis-master-0 -- redis-cli

# Or connect from another pod
kubectl run redis-test --rm -it --image=redis -- redis-cli -h acd-manager-redis-master

Kafka

kafka-topics.sh --bootstrap-server :9095 --list

The selection_input topic is pre-configured for selection input events.

Kubernetes Port Forwarding

To access internal Kubernetes services that are not exposed via ingress or an externally reachable Service, use kubectl port-forward to create a secure tunnel from your local machine to the service.

Basic Port Forwarding

# Forward local port to a service
kubectl port-forward -n <namespace> svc/<service-name> <local-port>:<service-port>

# Example: Forward local port 8080 to Grafana (port 3000)
kubectl port-forward -n default svc/acd-manager-grafana 8080:3000

Note: “Local” refers to the machine where you run kubectl. This can be:

A Server node in the cluster (common for administrative tasks)
A remote machine with kubectl configured to access the cluster

Accessing the Forwarded Service

Once the port-forward is established, access the service at http://localhost:<local-port> from the machine where you ran kubectl port-forward.

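If running on a Server node: To access the forwarded port from your local workstation:

Start the port-forward with --address 0.0.0.0 (by default, kubectl port-forward listens only on 127.0.0.1, so connections from other hosts are refused)
Ensure the firewall on the Server node allows traffic on the forwarded port from your network
Use the Server node’s IP address instead of localhost from your workstation

# On the Server node: listen on all interfaces, not only 127.0.0.1
kubectl port-forward --address 0.0.0.0 -n <namespace> svc/<service-name> <local-port>:<service-port>
# From your workstation (if firewall allows)
curl http://<server-node-ip>:<local-port>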

For simplicity, consider running port-forward from your local machine (if kubectl is configured for remote cluster access) rather than from a Server node.

Background Port Forwarding

To run port-forward in the background:

kubectl port-forward -n <namespace> svc/<service-name> <local-port>:<service-port> &
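
A backgrounded tunnel keeps running until its process is stopped, so it is easy to leave one behind. The following sketch wraps a single command in a temporary tunnel and closes it afterwards. with_port_forward and the PF_WAIT variable are hypothetical names, and the fixed wait is a crude stand-in for a real readiness check.

```shell
#!/bin/sh
# Run a command while a background port-forward is active, then close the tunnel.
# $1 = namespace, $2 = service name, $3 = local:remote ports, rest = command to run
with_port_forward() {
    pf_ns=$1; pf_svc=$2; pf_ports=$3; shift 3
    kubectl port-forward -n "$pf_ns" "svc/$pf_svc" "$pf_ports" >/dev/null 2>&1 &
    pf_pid=$!
    sleep "${PF_WAIT:-2}"            # crude wait for the tunnel to come up
    "$@" && pf_status=0 || pf_status=$?
    kill "$pf_pid" 2>/dev/null || true
    wait "$pf_pid" 2>/dev/null || true
    return "$pf_status"
}

# Example:
# with_port_forward longhorn-system longhorn-frontend 8080:80 \
#   curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080/
```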

Security Considerations

Port forwarding is recommended for:

Administrative interfaces (e.g., Longhorn UI) that should not be publicly exposed
Debugging and troubleshooting internal services
Temporary access to services without modifying ingress configuration

The port-forward tunnel remains active only while the kubectl port-forward command is running. Press Ctrl+C to terminate the tunnel.

Example: The Longhorn storage UI is intentionally not exposed via ingress due to security risks. Access it via port-forward:
kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80
Then navigate to http://localhost:8080 in your browser.

Longhorn Storage

Longhorn is a distributed block storage system for Kubernetes that provides persistent volumes for stateful applications such as PostgreSQL and Kafka.

Architecture

Longhorn deploys controller and replica engines on each node, forming a distributed storage system. When a volume is created, Longhorn replicates data across multiple nodes to ensure durability even in the event of node failures.

Storage Protocols:

iSCSI: Used for standard Read-Write-Once (RWO) volumes
NFS: Used for Read-Write-Many (RWX) volumes that can be mounted by multiple pods simultaneously

Configuration

The CDN Manager deploys Longhorn with a single replica configuration, which differs from the Longhorn default of 3 replicas. This configuration is optimized for the cluster architecture where:

Pod-node affinity is configured to schedule pods on the same node as their persistent volume data
This optimizes I/O performance by reducing network traffic
Data locality is maintained while still providing volume portability

Capacity Planning

Longhorn storage requires an additional 30% capacity headroom for internal operations and scaling. If less than 30% of the total partition capacity is available, Longhorn may mark volumes as “full” and prevent further writes.

For detailed storage requirements and disk partitioning guidance, see the System Requirements Guide.
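
The 30% headroom rule translates into simple arithmetic when sizing a partition. A minimal sketch with hypothetical helper names, using integer math and assuming both arguments are in the same unit (GiB here):

```shell
#!/bin/sh
# Capacity that can be filled while keeping 30% of the partition free.
# $1 = total partition size
longhorn_usable_gib() {
    echo $(( $1 * 70 / 100 ))
}

# Succeeds if at least 30% of the partition is still free.
# $1 = total partition size, $2 = space already used
longhorn_headroom_ok() {
    [ $(( ($1 - $2) * 100 )) -ge $(( $1 * 30 )) ]
}

# Example: a 1000 GiB partition
# longhorn_usable_gib 1000
# longhorn_headroom_ok 1000 650 && echo "headroom OK"
```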

Configuration Backup

Always backup your Helm values before making changes:

# Export the values currently deployed in the release
helm get values acd-manager -o yaml > ~/values-deployed-$(date +%Y%m%d).yaml

# Back up the local values file (separate file name so neither copy overwrites the other)
cp ~/values.yaml ~/values-backup-$(date +%Y%m%d).yaml

Backup Schedule Recommendations

Component	Frequency	Retention
PostgreSQL	Daily	30 days
Longhorn Snapshots	Before changes	7 days
Configuration	Before each change	Indefinite

Updating MaxMind GeoIP Databases

The MaxMind GeoIP databases (GeoIP2-City, GeoLite2-ASN, GeoIP2-Anonymous-IP) are used for GeoIP-based routing and validation features. These databases should be updated periodically to ensure accurate IP geolocation data.

Prerequisites

Updated MaxMind database files (.mmdb format) obtained from MaxMind
Access to the cluster via kubectl
Helm CLI installed

Update Procedure

Step 1: Create New Volume with Updated Databases

Run the volume generation utility with a unique volume name that includes a revision identifier:

# Mount the installation ISO if not already mounted
mkdir -p /mnt/esb3027
mount -o loop,ro esb3027-acd-manager-X.Y.Z.iso /mnt/esb3027

# Generate new volume with updated databases
/mnt/esb3027/generate-maxmind-volume

When prompted:

Provide the paths to the three database files:
- GeoIP2-City.mmdb
- GeoLite2-ASN.mmdb
- GeoIP2-Anonymous-IP.mmdb
Enter a unique volume name with a revision number or date, for example:
- maxmind-geoip-2026-04
- maxmind-geoip-v2

Tip: Using a revision-based naming convention simplifies rollback if needed.

Step 2: Update Helm Configuration

Edit your values.yaml file to reference the new volume:

manager:
  maxmindDbVolume: maxmind-geoip-2026-04

Replace maxmind-geoip-2026-04 with the volume name you specified in Step 1.

Step 3: Apply Configuration Update

Upgrade the Helm release with the updated configuration:

helm upgrade acd-manager /mnt/esb3027/charts/acd-manager --values ~/values.yaml

Step 4: Rolling Restart (Optional)

To ensure all pods immediately use the new database files, perform a rolling restart of the manager deployment:

kubectl rollout restart deployment acd-manager

Monitor the rollout status:

kubectl rollout status deployment acd-manager

Step 5: Verify Update

Verify the pods are running with the new volume:

kubectl get pods
kubectl describe pod -l app.kubernetes.io/component=manager | grep -A 5 "Volumes"

Step 6: Clean Up Old Volume (Optional)

After verifying the new databases are working correctly, you can delete the old persistent volume:

# List persistent volumes to find the old one
kubectl get pv

# Delete the old volume
kubectl delete pv <old-volume-name>

Caution: Ensure the new volume is functioning correctly before deleting the old volume. Keep the old volume for at least 24-48 hours as a rollback option.

Rollback Procedure

If issues occur after updating the databases:

Revert the maxmindDbVolume value in your values.yaml to the previous volume name
Run helm upgrade with the reverted configuration
Optionally restart the deployment: kubectl rollout restart deployment acd-manager

Update Frequency Recommendations

Database	Recommended Update Frequency
GeoIP2-City	Weekly or monthly
GeoLite2-ASN	Monthly
GeoIP2-Anonymous-IP	Weekly or monthly

MaxMind releases database updates on a regular schedule. Subscribe to MaxMind notifications to stay informed of new releases.

Log Management

Application Logs

# View manager logs
kubectl logs -l app.kubernetes.io/component=manager

# Follow logs in real-time
kubectl logs -l app.kubernetes.io/component=manager -f

# View logs from specific pod
kubectl logs <pod-name>

# View previous instance logs (after crash)
kubectl logs <pod-name> -p

# View logs with timestamps
kubectl logs <pod-name> --timestamps

# View logs from all containers in pod
kubectl logs <pod-name> --all-containers

Component-Specific Logs

# Zitadel logs
kubectl logs -l app.kubernetes.io/name=zitadel

# Gateway logs
kubectl logs -l app.kubernetes.io/component=gateway

# Confd logs
kubectl logs -l app.kubernetes.io/component=confd

# MIB Frontend logs
kubectl logs -l app.kubernetes.io/component=mib-frontend

# PostgreSQL logs
kubectl logs -l app.kubernetes.io/name=postgresql

# Kafka logs
kubectl logs -l app.kubernetes.io/name=kafka

# Redis logs
kubectl logs -l app.kubernetes.io/name=redis

Log Aggregation

Logs are collected by Telegraf and sent to VictoriaMetrics:

# Access Grafana for log visualization
# https://<manager-host>/grafana

# Query logs via Grafana Explore
# Select VictoriaMetrics datasource and use log queries

Log Rotation

Container logs are automatically rotated by Kubernetes:

Default max size: 10MB per container
Default max files: 5 rotated files
Total per container: ~50MB maximum

Scaling Operations

Manual Scaling

Note: If HPA (Horizontal Pod Autoscaler) is enabled for a deployment, manual scaling changes will be overridden by the HPA. To manually scale, you must first disable the HPA.

# Check if HPA is enabled
kubectl get hpa

# Remove the HPA before manual scaling. Patching minReplicas/maxReplicas to
# null does not disable it: spec.maxReplicas is required, so the patch is rejected.
# A later helm upgrade may recreate the HPA if autoscaling is enabled in values.yaml.
kubectl delete hpa acd-manager

# Scale manager replicas
kubectl scale deployment acd-manager --replicas=3

# Scale gateway replicas
kubectl scale deployment acd-manager-gateway --replicas=2

# Scale MIB frontend replicas
kubectl scale deployment acd-manager-mib-frontend --replicas=2

HPA Configuration

# View HPA status
kubectl get hpa

# Describe HPA details
kubectl describe hpa acd-manager

# Edit HPA configuration
kubectl edit hpa acd-manager

Configuration Updates

Updating Helm Values

# Edit values file
vi ~/values.yaml

# Validate with dry-run
helm upgrade acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml \
  --dry-run

# Apply changes
helm upgrade acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml

# Verify rollout
kubectl rollout status deployment/acd-manager

Rolling Back Changes

# View revision history
helm history acd-manager

# Rollback to previous revision
helm rollback acd-manager

# Rollback to specific revision
helm rollback acd-manager <revision>

# Verify rollback
helm history acd-manager

Certificate Management

Checking Certificate Expiration

# Check TLS secret expiration
kubectl get secret acd-manager-tls -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates

# Check via Grafana dashboard
# Certificate expiration metrics are available in Grafana
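
For a weekly check it is often more convenient to see the remaining days than the raw dates. The sketch below converts the notAfter value printed by openssl into a day count. It assumes GNU date (as shipped with RHEL 9); days_until is a hypothetical helper name.

```shell
#!/bin/sh
# Whole days from "now" until an expiry date.
# $1 = date string as printed by "openssl x509 -enddate" after "notAfter="
# $2 = current time in epoch seconds (optional, defaults to the system clock)
days_until() {
    expiry_epoch=$(date -u -d "$1" +%s) || return 1
    now_epoch=${2:-$(date -u +%s)}
    echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Example:
# end=$(kubectl get secret acd-manager-tls -o jsonpath='{.data.tls\.crt}' \
#   | base64 -d | openssl x509 -noout -enddate | cut -d= -f2)
# days_until "$end"
```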

Renewing Certificates

# For Helm-managed self-signed certificates
helm upgrade acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml \
  --set ingress.selfSigned=true

# For manual certificates, update the secret
kubectl create secret tls acd-manager-tls \
  --cert=new-tls.crt \
  --key=new-tls.key \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart pods to pick up new certificate
kubectl rollout restart deployment acd-manager

Health Checks

Component Health

# Check all pods
kubectl get pods

# Check specific component
kubectl get pods -l app.kubernetes.io/component=manager

# Check persistent volumes
kubectl get pvc

# Check cluster status
kubectl get nodes

# Check ingress
kubectl get ingress

API Health Endpoints

# Liveness check
curl -k https://<manager-host>/api/v1/health/alive

# Readiness check
curl -k https://<manager-host>/api/v1/health/ready
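
After a restart or upgrade the readiness endpoint may take a while to respond successfully. A generic retry loop can wait for it; this is a sketch, and retry is a hypothetical helper rather than an existing tool.

```shell
#!/bin/sh
# Run a command until it succeeds, up to a maximum number of attempts.
# $1 = max attempts, $2 = delay in seconds between attempts, rest = command
retry() {
    attempts=$1; delay=$2; shift 2
    n=1
    while ! "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
    return 0
}

# Example: wait up to 30 x 5 seconds for the readiness endpoint
# (-f makes curl return non-zero on HTTP error status codes)
# retry 30 5 curl -ksf -o /dev/null https://<manager-host>/api/v1/health/ready
```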

Database Health

# PostgreSQL cluster status
kubectl get clusters -n default

# Check PostgreSQL pods
kubectl get pods -l app.kubernetes.io/name=postgresql

# Kafka cluster status
kubectl get pods -l app.kubernetes.io/name=kafka

# Redis status
kubectl get pods -l app.kubernetes.io/name=redis

Maintenance Windows

Planned Maintenance

Before performing maintenance:

Notify users of potential service impact
Verify backups are current
Document the maintenance procedure
Prepare rollback plan

Node Maintenance

# Cordon node to prevent new pods
kubectl cordon <node-name>

# Drain node (evict pods)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Perform maintenance

# Uncordon node
kubectl uncordon <node-name>
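
The same sequence can be wrapped in a function so the uncordon step is not forgotten. This is a sketch with a hypothetical helper name, node_maintenance. It deliberately leaves the node cordoned if the drain or the maintenance command fails, so the node can be inspected before it takes workloads again.

```shell
#!/bin/sh
# Cordon and drain a node, run a maintenance command, then uncordon.
# $1 = node name, rest = maintenance command to run once the node is drained
node_maintenance() {
    node=$1; shift
    kubectl cordon "$node" || return 1
    kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data || return 1
    # On failure, stop here and leave the node cordoned for inspection
    "$@" || return 1
    kubectl uncordon "$node"
}

# Example (the maintenance command is a placeholder):
# node_maintenance <node-name> ssh root@<node-name> 'dnf -y update'
```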

Cluster Upgrades

See the Upgrade Guide for cluster upgrade procedures.

Troubleshooting Quick Reference

Common Commands

# Describe problematic pod
kubectl describe pod <pod-name>

# View pod events
kubectl get events --sort-by='.lastTimestamp'

# Check resource usage
kubectl top pods
kubectl top nodes

# Exec into container
kubectl exec -it <pod-name> -- /bin/sh

# Check network policies
kubectl get networkpolicies

# Check service endpoints
kubectl get endpoints

Restarting Components

# Restart deployment
kubectl rollout restart deployment/<deployment-name>

# Restart statefulset
kubectl rollout restart statefulset/<statefulset-name>

# Delete pod (auto-recreated)
kubectl delete pod <pod-name>

Security Operations

Rotating Service Account Tokens

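# Delete a legacy service account token Secret to invalidate its token
kubectl delete secret <service-account-token-secret>

# Kubernetes 1.24 and later no longer auto-create these Secrets; pods there use
# short-lived projected tokens that the kubelet rotates automatically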

Updating RBAC Permissions

# View current roles
kubectl get roles
kubectl get clusterroles

# View role bindings
kubectl get rolebindings
kubectl get clusterrolebindings

# Edit role
kubectl edit role <role-name>

Audit Log Access

# K3s audit logs location
/var/lib/rancher/k3s/server/logs/audit.log

# View recent audit events
tail -f /var/lib/rancher/k3s/server/logs/audit.log

Disaster Recovery

Pod Recovery

Pods are automatically recreated if they fail:

# Check pod status
kubectl get pods

# If pod is stuck in Terminating
kubectl delete pod <pod-name> --force --grace-period=0

# If pod is stuck in Pending, check resources
kubectl describe pod <pod-name>
kubectl get events --sort-by='.lastTimestamp'

Node Failure Recovery

When a node fails:

Automatic: Pods are rescheduled on healthy nodes (after timeout)
Manual: Force delete stuck pods

# Force delete pods on failed node (current namespace)
kubectl delete pod --force --grace-period=0 \
  --field-selector spec.nodeName=<failed-node>

Data Recovery

For data recovery scenarios, refer to:

PostgreSQL: CloudNativePG backup/restore procedures
Longhorn: Volume snapshot restoration
Kafka: Partition replication handles node failures

Routine Maintenance Checklist

Daily

Review Grafana dashboards for anomalies
Check alert notifications
Verify backup completion

Weekly

Review pod restart counts
Check certificate expiration dates
Review log storage usage
Verify HPA is functioning correctly

Monthly

Test backup restoration procedure
Review and rotate credentials if needed
Update documentation if configuration changed
Review resource utilization trends

Next Steps

After mastering operations:

Troubleshooting Guide - Deep dive into problem resolution
Performance Tuning Guide - Optimize system performance
Metrics & Monitoring Guide - Comprehensive monitoring setup
API Guide - REST API reference and automation

9 - Metrics & Monitoring Guide

Monitoring architecture and metrics collection

Overview

The CDN Manager includes a comprehensive monitoring stack based on VictoriaMetrics for time-series data storage, Telegraf for metrics collection, and Grafana for visualization. This guide describes the monitoring architecture and how to access and use the monitoring capabilities.

Quick Links

Guide	Description
Grafana Dashboards	Using and customising the built-in and advanced Grafana dashboards
Grafana Authentication & Roles	Configuring Grafana authentication, roles, and permissions
Alerts & Alarms	Configuring and managing alerts and alarms

Architecture

Components

Component	Purpose
Telegraf	Metrics collector running on each node, gathering system and application metrics
VictoriaMetrics Agent	Metrics scraper and forwarder; scrapes Prometheus endpoints and forwards to VictoriaMetrics
VictoriaMetrics (Short-term)	Time-series database for operational dashboards (30-90 day retention)
VictoriaMetrics (Long-term)	Time-series database for billing and compliance (1+ year retention)
Grafana	Visualization and dashboard platform; deployed as two replicas for high availability
Alertmanager	Alert routing and notification management

Metrics Flow

The following diagram illustrates how metrics flow through the monitoring stack:

flowchart TB
    subgraph External["External Sources"]
        Streamers[Streamers/External Clients]
    end

    subgraph Cluster["Kubernetes Cluster"]
        Telegraf[Telegraf DaemonSet]

        subgraph Applications["Application Components"]
            Director[CDN Director]
            Kafka[Kafka]
            Redis[Redis]
            Manager[ACD Manager]
            Alertmanager[Alertmanager]
        end

        VMAgent[VictoriaMetrics Agent]

        subgraph Storage["Storage"]
            VMShort[VictoriaMetrics<br/>Short-term]
            VMLong[VictoriaMetrics<br/>Long-term]
        end

        Grafana[Grafana<br/>2 replicas, HA]
        PostgreSQL[(PostgreSQL)]
        Zitadel[Zitadel]
    end

    Streamers -->|Push metrics| Telegraf
    Telegraf -->|remote_write| VMShort
    Telegraf -->|remote_write| VMLong

    Director -->|Scrape| VMAgent
    Kafka -->|Scrape| VMAgent
    Redis -->|Scrape| VMAgent
    Manager -->|Scrape| VMAgent
    Alertmanager -->|Scrape| VMAgent

    VMAgent -->|remote_write| VMShort
    VMAgent -->|remote_write| VMLong

    VMShort -->|Query| Grafana
    VMLong -->|Query| Grafana

    Grafana <-->|Shared state| PostgreSQL
    Grafana -->|OAuth2 / OIDC| Zitadel

Metrics Flow Summary:

External metrics ingestion:
- External clients (streamers) push metrics to Telegraf
- Telegraf forwards metrics via remote_write to both VictoriaMetrics instances
Internal metrics scraping:
- VictoriaMetrics Agent scrapes Prometheus endpoints from:
  - CDN Director instances
  - Kafka cluster
  - Redis
  - ACD Manager components
  - Alertmanager
- VMAgent forwards scraped metrics via remote_write to both VictoriaMetrics instances
Data visualization:
- Grafana queries both VictoriaMetrics databases depending on the dashboard requirements
- Operational dashboards use short-term storage
- Billing and compliance dashboards use long-term storage

Metrics Collection

Application Metrics

Applications expose metrics on Prometheus-compatible endpoints. VictoriaMetrics Agent (VMAgent) scrapes these endpoints and forwards metrics to VictoriaMetrics via remote_write.

System Metrics

Telegraf collects system-level metrics including:

CPU usage
Memory utilization
Disk I/O
Network statistics
Process metrics

Kubernetes Metrics

Cluster metrics are collected including:

Pod resource usage
Node status
Deployment status
Persistent volume usage

Metrics Retention

VictoriaMetrics is configured with default retention policies. For custom retention settings, modify the VictoriaMetrics configuration in your values.yaml:

acd-metrics:
  victoria-metrics-single:
    retentionPeriod: "3"  # Retention period in months

Troubleshooting

Metrics Not Appearing

If metrics are not appearing in Grafana:

Check Telegraf pods:

kubectl get pods -l app.kubernetes.io/component=telegraf

Check Telegraf logs:

kubectl logs -l app.kubernetes.io/component=telegraf

Verify VictoriaMetrics is running:

kubectl get pods -l app.kubernetes.io/component=victoria-metrics

Check application metrics endpoints:

kubectl exec <pod-name> -- curl localhost:8080/metrics
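When the endpoint does respond, it can help to confirm that the body is valid Prometheus text exposition rather than an error page. The following is a minimal Python sketch, not part of the product: it parses output saved from the curl command above, and the metric names in the sample are hypothetical.

```python
def parse_metrics(text: str) -> dict:
    """Parse Prometheus text exposition lines into {series: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        # The series (name plus optional labels) ends at the last '}' when
        # labels are present, otherwise at the first whitespace.
        if "}" in line:
            series, rest = line.rsplit("}", 1)
            series += "}"
        else:
            series, rest = line.split(None, 1)
        # The value is the first field after the series; an optional
        # timestamp may follow and is ignored here.
        samples[series] = float(rest.split()[0])
    return samples


# Hypothetical sample of what a /metrics endpoint might return.
sample = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{code="200"} 1027
http_requests_total{code="500"} 3
process_open_fds 42
"""
metrics = parse_metrics(sample)
print(len(metrics), "series found")
```

An empty result, or a parsing error on the first line, suggests the pod is answering on that port with something other than metrics.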

For dashboard and authentication issues, see the Grafana Dashboards and Grafana Authentication & Roles guides.

Next Steps

After setting up monitoring:

Grafana Authentication & Roles - Configure SSO and permissions before accessing Grafana
Grafana Dashboards - Explore and customise dashboards
Alerts & Alarms - Set up alerting and notifications
Operations Guide - Day-to-day operational procedures
Troubleshooting Guide - Resolve monitoring issues
API Guide - Access metrics via API

9.1 - Grafana Authentication & Roles

Configuring Grafana authentication, roles, and permissions via Zitadel

Overview

Grafana authentication is delegated entirely to Zitadel via OAuth2/OIDC. Local username/password login is not available to end users. When a user logs into Grafana, they are redirected to Zitadel to authenticate, and their Grafana role is automatically determined by the Zitadel project roles assigned to their account.

The OIDC integration between Grafana and Zitadel is configured automatically at install time — no manual Zitadel application registration is required.

How It Works

During installation, an init container runs before Grafana starts and:

Authenticates with Zitadel using a machine-account service key.
Registers a Grafana OIDC application in the Zitadel project (or re-uses an existing one if already registered).
Writes the resulting client_id and client_secret into a Kubernetes Secret, which Grafana picks up on startup.

This means the Grafana OIDC application in Zitadel is managed automatically and does not need to be created or modified manually.

Role Mapping

Grafana roles are mapped from Zitadel project roles using the following rule:

Zitadel Project Role	Grafana Role
`grafana_admin`	Admin — full access, can manage users, datasources, and dashboards
(any other role, or no role)	Viewer — read-only access to dashboards

Note: There is no Grafana Editor role mapped by default. All authenticated users who are not explicitly granted grafana_admin receive Viewer access. If you need an Editor tier, see Customising the Role Mapping.

The mapping is enforced on every login. If a user’s Zitadel role changes, the change takes effect the next time they log into Grafana.

Prerequisites

Zitadel is configured and accessible at https://<manager-host>
At least one Zitadel user account exists (see Next Steps — Create User Accounts)
Grafana is accessed via the correct DNS hostname (see Accessing Grafana)

Accessing Grafana

Grafana is accessible at:

https://<manager-host>/grafana

Important: Grafana must be accessed using the DNS name specified in the first entry of global.hosts.manager in your configuration. Accessing Grafana via an IP address or an alternative hostname will cause OAuth2 redirect URI mismatches and CORS errors, preventing login from completing successfully.

To log in:

Navigate to https://<manager-host>/grafana
Click “Login with Zitadel”
Authenticate with your Zitadel account credentials

Granting Admin Access

By default, all Zitadel users who log into Grafana receive Viewer access. To grant a user Admin access, assign them the grafana_admin project role in Zitadel.

Step 1: Ensure the `grafana_admin` Role Exists

Log into the Zitadel Console at https://<manager-host>/ui/console
Navigate to Projects and open the ZITADEL project
Click the Roles tab
Check whether a role named grafana_admin already exists
If it does not exist, click New Role and create it:
- Key: grafana_admin
- Display Name: Grafana Admin (or any label you prefer)
- Click Save

Step 2: Assign the Role to a User

In the Zitadel Console, navigate to Users and open the user you want to grant admin access to
Click the Authorizations tab
Click New Authorization
Select the ZITADEL project
Select the grafana_admin role
Click Save

The user will have Grafana Admin access the next time they log in.

Revoking Admin Access

To demote a user back to Viewer, remove the grafana_admin authorization from their account:

In the Zitadel Console, open the user’s Authorizations tab
Find the grafana_admin authorization on the ZITADEL project
Click the delete icon to remove it

The change takes effect on their next Grafana login.

Customising the Role Mapping

The role mapping expression is configured in values.yaml under grafana."grafana.ini".auth.generic_oauth.role_attribute_path. It uses JMESPath syntax evaluated against the OIDC token’s role claims.

The default expression is:

grafana:
  "grafana.ini":
    auth.generic_oauth:
      role_attribute_path: >-
        contains(keys("urn:zitadel:iam:org:project:roles"), 'grafana_admin') && 'Admin' || 'Viewer'

Example: Adding an Editor Tier

To map a grafana_editor Zitadel role to Grafana’s Editor role, create the grafana_editor role in Zitadel (following the same steps as above) and extend the expression:

grafana:
  "grafana.ini":
    auth.generic_oauth:
      role_attribute_path: >-
        contains(keys("urn:zitadel:iam:org:project:roles"), 'grafana_admin') && 'Admin'
        || contains(keys("urn:zitadel:iam:org:project:roles"), 'grafana_editor') && 'Editor'
        || 'Viewer'

Apply the change using the standard upgrade procedure in the Configuration Guide.
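The evaluation order of the extended expression can be sketched in plain Python. This is an illustration only: Grafana evaluates the JMESPath expression itself, and the payloads below (organisation ID, domain, subject) are invented values that assume the roles claim is an object keyed by role name, as the keys(...) call in the expression implies.

```python
ROLES_CLAIM = "urn:zitadel:iam:org:project:roles"


def grafana_role(userinfo: dict) -> str:
    """Mirror the Admin / Editor / Viewer expression in plain Python."""
    # keys(...) in the expression yields the role names held by the user.
    # A missing claim is treated here as "no roles".
    roles = set(userinfo.get(ROLES_CLAIM) or {})
    if "grafana_admin" in roles:   # first operand of the || chain wins
        return "Admin"
    if "grafana_editor" in roles:
        return "Editor"
    return "Viewer"                # final fallback


# Hypothetical userinfo payloads (IDs and domains are invented).
admin_and_editor = {ROLES_CLAIM: {
    "grafana_editor": {"1234567890": "example.com"},
    "grafana_admin": {"1234567890": "example.com"},
}}
editor_only = {ROLES_CLAIM: {"grafana_editor": {"1234567890": "example.com"}}}
no_claim = {"sub": "user-1"}

for payload in (admin_and_editor, editor_only, no_claim):
    print(grafana_role(payload))
```

A user holding both roles resolves to Admin because the grafana_admin test comes first in the expression.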

Blocking Unauthenticated Access

By default, role_attribute_strict is set to false, which means any authenticated Zitadel user can log into Grafana as a Viewer even if they have no explicit Grafana role assigned. To restrict Grafana access to only users who have been explicitly granted a role, set this to true:

grafana:
  "grafana.ini":
    auth.generic_oauth:
      role_attribute_strict: true

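With role_attribute_strict: true, users for whom the role_attribute_path expression returns no valid role are denied access entirely. Note that the default expression ends in || 'Viewer' and therefore always returns a role; for strict mode to deny anyone, that fallback must also be removed from the expression.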
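How the strict flag interacts with the expression can be sketched as follows. This is a simplified Python model rather than Grafana's implementation; the function names are hypothetical, and the non-strict fallback to Viewer reflects the default behaviour described above.

```python
from typing import Optional

VALID_ROLES = {"Admin", "Editor", "Viewer"}


def login_outcome(mapped_role: Optional[str], strict: bool) -> str:
    """Return the role granted at login, or 'denied'."""
    if mapped_role in VALID_ROLES:
        return mapped_role
    # The expression produced no usable role.
    return "denied" if strict else "Viewer"


def default_expression(roles: set) -> Optional[str]:
    # Models: contains(..., 'grafana_admin') && 'Admin' || 'Viewer'
    return "Admin" if "grafana_admin" in roles else "Viewer"


def expression_without_fallback(roles: set) -> Optional[str]:
    # Models the same expression with the trailing "|| 'Viewer'" removed.
    return "Admin" if "grafana_admin" in roles else None


no_roles: set = set()
print(login_outcome(default_expression(no_roles), strict=True))
print(login_outcome(expression_without_fallback(no_roles), strict=True))
print(login_outcome(expression_without_fallback(no_roles), strict=False))
```

In this model, strict mode only denies a user when the expression yields no role, so an expression with a catch-all fallback admits every authenticated user regardless of the flag.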

Managing Users in Grafana

User accounts in Grafana are created automatically on first login via Zitadel. There is no need to pre-create users in the Grafana UI.

To view and manage users who have logged in:

Log into Grafana as an Admin
Navigate to Administration > Users and access > Users

From here you can see each user’s current role, last login time, and authentication provider. Role changes should always be made via Zitadel (as described above) rather than directly in Grafana, because changes made in Grafana are overwritten on the user’s next login.

Break-Glass Admin Access

A local Grafana admin account is available as a break-glass fallback for situations where Zitadel is unavailable. This account is not accessible via the standard login page (which only shows the Zitadel SSO button).

To use the local admin account, navigate directly to:

https://<manager-host>/grafana/login

The default credentials are listed in the Glossary. Change the default password immediately after installation.

Security recommendation: The break-glass account should be used only for emergency access. Do not use it for routine administration.

Troubleshooting

OAuth2 Redirect URI Mismatch / CORS Errors

Grafana is registered in Zitadel with the redirect URI https://<manager-host>/grafana/login/generic_oauth, derived from the first entry of global.hosts.manager. Accessing Grafana via a different hostname or IP address will not match this URI and will cause the login to fail.

Resolution: Always access Grafana via the configured hostname. If the hostname has changed, re-run the helm upgrade to re-register the application with the updated URI.
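The mismatch can be illustrated with a short Python sketch. It assumes global.hosts.manager is available as a plain list of hostnames (the exact structure of that setting is not shown here) and uses invented hostnames. It compares only scheme and host, which is a simplification of the check the identity provider performs.

```python
from urllib.parse import urlsplit


def redirect_uri(manager_hosts: list) -> str:
    """Build the redirect URI from the first configured manager host."""
    return f"https://{manager_hosts[0]}/grafana/login/generic_oauth"


def matches_registered_host(browser_url: str, manager_hosts: list) -> bool:
    """True if the URL in the browser uses the registered scheme and host."""
    registered = urlsplit(redirect_uri(manager_hosts))
    visited = urlsplit(browser_url)
    return (visited.scheme, visited.hostname) == (registered.scheme, registered.hostname)


# Hypothetical configuration: only the first entry is registered.
hosts = ["manager.example.com", "manager-alt.example.com"]

print(redirect_uri(hosts))
for url in ("https://manager.example.com/grafana",
            "https://manager-alt.example.com/grafana",
            "https://192.0.2.10/grafana"):
    print(url, matches_registered_host(url, hosts))
```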

User Receives Viewer Instead of Admin

The grafana_admin role is not included in the user’s Zitadel token.

Resolution:

Confirm the grafana_admin role exists on the ZITADEL project in the Zitadel Console
Confirm the role is assigned to the user under their Authorizations tab
Ask the user to log out of Grafana and log back in — role changes are applied on the next login, not the current session

If the user is denied access entirely rather than receiving Viewer, role_attribute_strict may be set to true while the user has no matching Zitadel role.

Resolution: Either assign the user an appropriate Zitadel project role, or set role_attribute_strict: false in values.yaml to allow all authenticated users Viewer access.

Admin Role Assigned in Zitadel but User Still Gets Viewer

The grafana_admin role is correctly assigned to the user in Zitadel, but Grafana still grants them Viewer access. This indicates that role claims are not being included in the Zitadel userinfo response.

Grafana determines roles by calling the Zitadel userinfo endpoint (/oidc/v1/userinfo) and evaluating the urn:zitadel:iam:org:project:roles claim. Zitadel only includes this claim when the Grafana OIDC application has Access Token Role Assertions enabled. If the claim is absent, the role_attribute_path expression always falls through to 'Viewer'.

To verify and fix:

Log into the Zitadel Console at https://<manager-host>/ui/console
Navigate to Projects > ZITADEL > Applications > Grafana
Open the Token Settings tab
Ensure Access Token Role Assertions is enabled
Save the change

The fix takes effect on the user’s next login — no Grafana or Helm changes are required.

Grafana OIDC App Not Registered in Zitadel

If the init container failed during installation, the Grafana OIDC application may not have been created in Zitadel.

Resolution: Check the init container logs for errors:

kubectl logs -l app.kubernetes.io/component=grafana --previous -c zitadel-oauth-setup

Common causes are Zitadel not being ready when the init container ran, or a machine-key permission issue. Re-running the helm upgrade will re-trigger the init container and attempt registration again.

Next Steps

Grafana Dashboards - Using and customising dashboards
Alerts & Alarms - Configure alerting and notifications
Metrics & Monitoring Overview - Return to the monitoring overview

9.2 - Grafana Dashboards

Using and customising Grafana dashboards

Overview

Grafana is the primary visualization platform for the CDN Manager monitoring stack. It provides pre-built dashboards for cluster health, application performance, and billing analytics, and is accessible via the manager ingress.

Prerequisites

Grafana is deployed and running (verify with kubectl get pods -l app.kubernetes.io/component=grafana)
A Zitadel user account is available for login
Grafana is accessed via the correct DNS hostname (see Grafana Authentication & Roles)

Accessing Grafana

Grafana is accessible via the manager ingress:

URL: https://<manager-host>/grafana

To log in:

Navigate to https://<manager-host>/grafana
Click the “Login with Zitadel” button
Authenticate with your Zitadel account credentials

Important: Grafana must be accessed using the DNS name specified in the first entry of global.hosts.manager in your configuration. Accessing Grafana via an IP address or an alternative hostname will cause OAuth2 redirect URI mismatches and CORS errors, preventing login from completing successfully.

For details on authentication and role configuration, see Grafana Authentication & Roles.

Standard Dashboards

Accessing Dashboards

After logging into Grafana:

Navigate to Dashboards in the left menu
Browse the folder structure to find the dashboard you need
Click on a dashboard to open it

Dashboards are organised into the following folders:

Alerting — alert state history and alerting system health
Billing — redirect counts for billing analytics
CDN Manager — ACD Manager API performance
Hardware — host-level CPU, memory, disk, and network telemetry
Infrastructure — Kubernetes cluster, Kafka, Longhorn, and Redis health
Streaming — CDN routing, streamer performance, and QoE
Internal Debugging — low-level ACD Director diagnostics

Alerting

Active Alarms

A live view of all currently firing alerts. Shows the alert name, severity, affected host, and description. Use this as the first stop when investigating an active incident.

Alert Statistics

Historical view of alert firing activity over time. Shows which alert groups and individual rules have been firing, with timelines and trend charts. Useful for identifying recurring or flapping alerts.

vmalert

Operational health dashboard for the vmalert component itself. Covers evaluation rate, evaluation errors, alerting and recording rule counts, remote write throughput, and resource usage. Use this to verify the alerting pipeline is functioning correctly.

Billing

Billing Dashboard

Tracks redirect volumes for billing and usage analytics. Shows initial managed and unmanaged redirects, segment redirects, and endpoint redirects — both as totals and as ratios over time. Data is sourced from long-term VictoriaMetrics storage to support historical reporting.

CDN Manager

CDN Manager API

Health and performance dashboard for the ACD Manager REST API. Covers:

Overview: API health status, active pod count, total request volume, 5xx error rate, and average latency
Traffic: Request rate by pod, distribution across API endpoints
Errors: 5xx errors per endpoint, response code breakdown per endpoint, error rate by pod
Latency: P99 and average latency by endpoint, overall API response latency
Resources & Auth: Route validation API activity

Hardware

HW Metrics

Condensed host hardware overview covering CPU usage and load averages, memory utilisation, network interface throughput, swap usage, and root filesystem disk space. Suitable for day-to-day health checks across all cluster nodes.

An expanded HW Metrics (Advanced) dashboard is available as part of the Advanced Dashboards licence.

Infrastructure

k3s Cluster Infrastructure

Kubernetes cluster health overview using node-exporter and kube-state-metrics. Covers:

Cluster Overview: Node count, running pod count, OOMKilled containers, and overall cluster health status
Compute: CPU usage, memory usage, and load average per node
Network: Inbound and outbound bytes per node
Disk: Read/write throughput and I/O pressure per node
Longhorn PVC Disk Usage: Usage percentage per persistent volume
Workload Health: Pod restart counts and OOMKill occurrences

Kafka

Kafka broker health using JMX exporter metrics. Covers:

Cluster Health: Active controller, broker state, topic and partition counts, offline and under-replicated partitions, active and fenced broker counts, metadata log lag
Throughput: Bytes in/out and messages in by topic, replication bytes in/out
Internals: Request handler idle percentage, network processor idle percentage

Longhorn Storage

Persistent storage health for the Longhorn distributed block storage layer. Covers:

Overview: Total, healthy, degraded, and faulted volume counts; nodes down
Capacity: Total cluster capacity, used, and available storage
Volume Detail: Usage percentage per volume, actual size per volume, volume robustness state, volumes approaching capacity (>85%)
Node & Disk: Disk usage percentage and available bytes per node, node condition checks

Redis

Redis instance health using redis-exporter metrics. Covers:

Instance Health: Status, uptime, connected and blocked clients, slow log length, rejected connections
Memory: Usage and fragmentation ratio
Throughput & Keyspace: Commands processed, network I/O, keyspace hit rate, keys per database
Evictions & Persistence: Evictions, expirations, RDB unsaved changes
CPU & Connections: CPU usage and connection metrics
Command Analysis: Per-command breakdown

Streaming

Extended Monitoring

The primary operational dashboard for CDN routing activity. This is the home dashboard displayed on Grafana login. Covers:

Latency Statistics: ACD router latency and CDN latency over time
Redirects: Total redirect volume, status code breakdown, managed vs unmanaged ratio
Content Popularity: Top 10 requested content and top 10 most rapidly increasing popularity scores
CDN Selection: Redirect distribution across CDN endpoints, current and historical ratios
CDN Failovers and Retries: Failover events and retry rates by CDN
Host Selection: Endpoint request distribution
Session Statistics: Active session counts and session type breakdown
Client Responses: Client-facing HTTP status code distribution
Incoming Requests: Raw request volume
HTTPS Certificate Statistics: Certificate validity and expiry indicators
Warnings & Errors: Application-level warnings and errors over time
LUA Statistics: Lua exception counts and execution time
Configuration Change History: Timeline of routing configuration changes

Router Monitoring

External-facing view of ACD Director routing activity. Shows the number of initial routing decisions made, HTTP status code distribution, incoming HTTP/HTTPS request volumes, and selection input metrics. Useful for a high-level view of traffic hitting the directors.

QoE Monitoring

Quality of Experience scoring dashboard. Shows average QoE scores per host, per session group, per CDN, and per agent, as well as the initial CDN selection rate. Use this to identify CDN providers or content hosts that are delivering a degraded experience.

Streamer Statistics

Condensed view of streamer node performance, covering network ingress/egress throughput, TCP and HTTP connection counts, active session counts, HTTP request rates and response codes, response times (ingress and egress), and storage/memory/CPU. Suitable for routine streamer health monitoring.

An expanded Streamer Statistics (Advanced) dashboard is available as part of the Advanced Dashboards licence.

Internal Debugging

These dashboards expose low-level ACD Director internals and are primarily intended for advanced diagnostics and support investigations.

Debugging Information

Lua runtime statistics from the ACD Director: exception counts, active Lua context count, time spent in Lua execution, and router latency. Use when investigating unexpected Director behaviour or Lua errors.

ACD: Incoming Internet Connections

SSL-level connection statistics at the Director: SSL warnings and errors, valid and invalid HTTP and HTTPS request counts from external clients. Use when investigating TLS handshake failures or unexpected rejection rates.

Performance Metrics

ACD Director process-level resource usage: router CPU utilisation, router memory usage, and Lua memory consumption. Useful for identifying resource pressure on the Director process itself rather than the host.

Prometheus: ACD

ACD application metrics exposed via Prometheus: active and total session counts, session type breakdown, managed and unmanaged redirect counts, QoE corrections, manifest parse failures, initial endpoint request counts, HTTP request rates, and logged warnings and errors.

CDN Failures

CDN-level failure tracking: response code distribution from CDN backends, CDN-level failover events, host-level failovers, and host retry counts. Use when investigating CDN reliability issues or failover behaviour.

ACD: CDN Latencies Detail

Detailed CDN latency analysis with configurable percentile plots and a full latency histogram. Use when investigating tail latency issues on specific CDN backends.

ACD: Router Latencies

ACD Director routing latency distributions for both 2xx (successful) and 3xx (redirect) responses, visualised as heatmap buckets over time. Use alongside CDN Latencies Detail to separate router processing time from CDN response time.

Prometheus/ACD: SubRunners

Internal async processing queue depth and throughput metrics for the ACD Director subrunner system: client connection counts, low/medium/high priority queue depths (current and max), send/receive data block usage, wakeup counts, overload events, and autopause activations. Use when investigating Director throughput bottlenecks or queue backpressure.

Advanced Dashboards

Advanced dashboards are a paid add-on that unlocks more detailed variants of two standard dashboards, providing deeper visibility for performance investigation and capacity planning.

Licensing: Advanced dashboards require a separate licence key. To obtain a key for your deployment, contact your AgileTV account representative.

Enabling Advanced Dashboards

Once you have your licence key, add the following to your values.yaml:

dashboards:
  advanced:
    licenceKey: "<your-licence-key>"

Then apply the change by following the upgrade procedure in the Configuration Guide. The advanced dashboards will become available in Grafana automatically once the upgrade completes.

HW Metrics (Advanced)

Expanded hardware telemetry that extends the standard HW Metrics dashboard with greater depth and additional sections: kernel metrics, TCP and UDP network stack statistics, per-interface error counters, disk IOPS, and metrics collection velocity. Use this when investigating performance anomalies surfaced by the standard dashboard or by alerts.

Streamer Statistics (Advanced)

Full streamer telemetry that supplements the standard Streamer Statistics dashboard. Includes all standard metrics plus:

OTT JCQ: Server group and node request rates and ratios, circuit breaker state (closed/open backends), global backend stats, pending requests per backend, and pop-out per backend
Account Records: Session counts, total traffic in/out, HTTP request rates (ingress/egress), cache hit ratio, backend request rates, and response times
Detailed network error and drop counters from /proc/net/dev

CDN Director Metrics

Director DNS Names in Grafana

CDN Director instances are identified in Grafana by their DNS name, which is derived from the name field in global.hosts.routers:

global:
  hosts:
    routers:
      - name: my-router-1
        address: 192.0.2.1

The DNS name used in Grafana dashboards will be: my-router-1.external

This naming convention is automatically applied for all configured directors.
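As a quick way to predict the names that will appear in dashboard selectors, the convention can be expressed in a few lines of Python. This is a sketch that assumes the values file has already been loaded into a dictionary; the function name is hypothetical.

```python
def director_dns_names(values: dict) -> list:
    """Derive the Grafana-facing DNS names from global.hosts.routers."""
    routers = values.get("global", {}).get("hosts", {}).get("routers", [])
    # Each director is shown as "<name>.external".
    return [f"{router['name']}.external" for router in routers]


# Mirrors the values.yaml fragment above.
values = {
    "global": {
        "hosts": {
            "routers": [
                {"name": "my-router-1", "address": "192.0.2.1"},
            ]
        }
    }
}
print(director_dns_names(values))
```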

Customising Dashboards

Permissions: Creating and importing dashboards requires Grafana Admin access. See Grafana Authentication & Roles for details on granting admin rights.

The pre-provisioned dashboards are read-only and managed by the Helm chart — changes made to them in the Grafana UI will not persist across upgrades. To create persistent custom dashboards:

In Grafana, navigate to Dashboards > New > New Dashboard
Add panels using the VictoriaMetrics or Prometheus datasource
Save the dashboard to a folder of your choice

Custom dashboards saved this way are stored in the Grafana PostgreSQL database and are unaffected by Helm upgrades.

Note: Do not save custom dashboards into the provisioned folders (Alerting, Billing, CDN Manager, etc.). Grafana marks these folders as provisioned and may behave unexpectedly if user dashboards are mixed in.

Customising a Pre-provisioned Dashboard

If you want a modified version of one of the built-in dashboards as a starting point:

Open the dashboard you want to customise
Click the Share icon (top toolbar) > Export > Save to file to download the dashboard JSON
In Grafana, navigate to Dashboards > New > Import
Upload the downloaded JSON file
On the import screen, give the dashboard a new name to distinguish it from the original, and choose a destination folder outside the provisioned set
Click Import

You now have an independently editable copy. The original provisioned dashboard remains unchanged and will continue to be updated by future Helm upgrades. Your copy is stored in PostgreSQL and persists across upgrades independently.
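If the exported JSON is prepared by script before import, the steps above amount to giving the copy a new title and letting Grafana assign fresh identifiers. The following is a minimal Python sketch, assuming the export carries the usual top-level id, uid, and title fields and that Grafana generates a new uid when none is supplied. The sample dashboard content is invented.

```python
import json


def make_independent_copy(exported_json: str, new_title: str) -> str:
    """Return dashboard JSON that imports as a new, separate dashboard."""
    dashboard = json.loads(exported_json)
    dashboard["title"] = new_title
    # Clearing id and uid avoids colliding with the provisioned original.
    dashboard["id"] = None
    dashboard["uid"] = None
    return json.dumps(dashboard, indent=2)


# Hypothetical export of a provisioned dashboard.
exported = json.dumps(
    {"id": 12, "uid": "abc123", "title": "HW Metrics", "panels": []}
)
copy = json.loads(make_independent_copy(exported, "HW Metrics (custom)"))
print(copy["title"], copy["uid"])
```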

Troubleshooting

Dashboard Loading Issues

If dashboards fail to load:

Check Grafana pods:

kubectl get pods -l app.kubernetes.io/component=grafana

Review Grafana logs:

kubectl logs -l app.kubernetes.io/component=grafana

Verify datasource configuration in Grafana UI

For login and authentication issues, see Grafana Authentication & Roles.

Next Steps

Alerts & Alarms - Set up alerting and notifications
Operations Guide - Day-to-day operational procedures
Metrics & Monitoring Overview - Return to the monitoring overview

9.3 - Alerts & Alarms

Configuring and managing alerts and alarms

Overview

The CDN Manager ships a set of pre-configured alerting rules evaluated by vmalert against VictoriaMetrics. When a rule fires, the alert is routed to Alertmanager, which handles deduplication, grouping, silencing, and delivery to configured notification channels.

This page documents every built-in alert rule, what it means, its severity, and the recommended operator action.

Alert Severity Levels

Severity	Meaning
critical	Immediate action required. The condition poses a risk to data integrity, service availability, or active traffic.
warning	Investigate soon. The condition is not immediately harmful but will degrade into a critical state if left unattended.

Alert Groups

Alerts are organised into the following groups, each evaluated on a 15-second interval.

infra-disk — Disk space and I/O
infra-compute — CPU and memory
infra-network — Network errors and traffic anomalies
longhorn — Persistent storage health

infra-disk

Monitors disk space utilisation and I/O latency on cluster nodes.

StorageFillingUp

Property	Value
Severity	warning
Condition	Root filesystem usage exceeds 85%
Must persist for	2 minutes

What it means: A node’s root filesystem is running low on space. If left unchecked this will progress to a full disk, which can cause pod evictions, write failures, and potential data loss.
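The interaction between the 15-second evaluation interval and the Must persist for duration can be sketched as follows. This is a simplified Python model of the usual pending-then-firing behaviour of Prometheus-style alerting rules, not vmalert's code; the function name and sample values are hypothetical.

```python
from typing import Optional, Sequence


def first_firing_index(samples: Sequence[float], threshold: float,
                       for_seconds: int, interval: int = 15) -> Optional[int]:
    """Index of the evaluation at which the alert starts firing, or None."""
    pending_since = None
    for i, value in enumerate(samples):
        if value > threshold:
            if pending_since is None:
                pending_since = i          # condition first seen: pending
            if (i - pending_since) * interval >= for_seconds:
                return i                   # held long enough: firing
        else:
            pending_since = None           # condition cleared: reset
    return None


# Root filesystem usage (%) at successive 15-second evaluations.
sustained = [80, 86] + [90] * 8
interrupted = [90] * 5 + [80] + [90] * 5

print(first_firing_index(sustained, threshold=85, for_seconds=120))
print(first_firing_index(interrupted, threshold=85, for_seconds=120))
```

In this model a single evaluation below the threshold resets the timer, which is why brief spikes do not raise the alert.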

Recommended actions:

Identify the node from the host label in the alert.

Log into the node and check disk usage:

df -h /
du -sh /var/log/* | sort -rh | head -20

Clear old log files, unused container images, or temporary files:

# On the node
journalctl --vacuum-size=500M
crictl rmi --prune

If disk usage is due to application data growth, consider expanding the volume or adjusting retention settings. See Metrics Retention.

HighDiskLatency

Property	Value
Severity	warning
Condition	Average disk write latency exceeds 100 ms
Must persist for	2 minutes

What it means: Disk write operations are taking longer than 100 ms on average. High disk latency can degrade database performance (PostgreSQL, VictoriaMetrics) and cause timeouts in write-heavy components.
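The exact expression behind this rule is not shown here. As an illustration of what an average write latency means, the Python sketch below derives it from two samples of a device's line in /proc/diskstats, dividing the increase in time spent writing by the increase in completed writes. The sample counters are invented.

```python
def avg_write_latency_ms(prev_line: str, curr_line: str) -> float:
    """Average write latency (ms) between two /proc/diskstats samples."""
    prev, curr = prev_line.split(), curr_line.split()
    # 0-based fields: 2 = device, 7 = writes completed, 10 = ms spent writing.
    writes = int(curr[7]) - int(prev[7])
    write_ms = int(curr[10]) - int(prev[10])
    return write_ms / writes if writes else 0.0


# Two hypothetical samples for the same device, taken some seconds apart.
before = "8 0 sda 1000 0 8000 500 2000 0 16000 30000 0 20000 30500"
after = "8 0 sda 1000 0 8000 500 2100 0 16800 42500 0 21000 43000"

latency = avg_write_latency_ms(before, after)
print(f"{latency:.1f} ms, above threshold: {latency > 100}")
```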

Recommended actions:

Identify the affected disk from the name label in the alert.

Check for I/O-intensive processes on the node:

iostat -x 2 5
iotop -o

Check for Longhorn replica rebuilds or rebalancing activity, which can saturate disk I/O.
If the issue persists on a production node, review whether the storage hardware meets the System Requirements.

infra-compute

Monitors CPU and memory utilisation on cluster nodes.

CpuSaturation

Property	Value
Severity	warning
Condition	Total CPU usage exceeds 90%
Must persist for	5 minutes

What it means: A node is running at near-full CPU capacity. Sustained CPU saturation causes request latency increases across all workloads on that node and may result in pod throttling.

Recommended actions:

Identify the saturated node from the host label in the alert.

Check which pods are consuming CPU:

kubectl top pods --sort-by=cpu -A

Check for runaway processes on the node:

top -b -n 1 | head -20

If saturation is caused by a legitimate workload spike (e.g. CDN traffic burst), consider scaling the deployment or redistributing load across nodes.

MemoryCriticallyLow

Property	Value
Severity	critical
Condition	Available RAM falls below 10%
Must persist for	2 minutes

What it means: The node has very little free memory remaining. The Linux OOM killer may begin terminating processes, which can cause abrupt pod restarts, data corruption in in-memory caches, and service unavailability.
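To check the figure by hand on a node, the available percentage can be computed from /proc/meminfo. The Python sketch below assumes the rule compares MemAvailable against MemTotal, which is not confirmed here; the sample values are invented.

```python
def available_percent(meminfo: str) -> float:
    """Percentage of RAM available, from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            fields[key.strip()] = int(rest.split()[0])  # values are in kB
    return 100.0 * fields["MemAvailable"] / fields["MemTotal"]


# Hypothetical excerpt; on a node, read the text from /proc/meminfo instead.
sample = """\
MemTotal:       16384000 kB
MemFree:          512000 kB
MemAvailable:    1228800 kB
"""
pct = available_percent(sample)
print(f"{pct:.1f}% available, below threshold: {pct < 10}")
```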

Recommended actions:

Identify the affected node from the host label in the alert.

Immediately check for memory-leaking or oversized pods:

kubectl top pods --sort-by=memory -A

Identify and restart any pods showing abnormal memory consumption:

kubectl rollout restart deployment/<name>

Check kernel OOM kill log for any processes already killed:

dmesg | grep -i "oom\|killed"

Review memory resource limits and requests for affected deployments and adjust if necessary.

SwapUsageDetected

Property	Value
Severity	warning
Condition	Swap usage exceeds 5%
Must persist for	1 minute

What it means: The node is swapping memory to disk. Swap usage in a Kubernetes cluster is a strong indicator of memory pressure. It degrades performance significantly and may mask an underlying memory shortage that could escalate to a MemoryCriticallyLow event.

Recommended actions:

Treat this as an early warning for the same conditions as MemoryCriticallyLow.
Identify memory-intensive pods and investigate whether resource limits are configured appropriately.
Swap should ideally never be active on a production Kubernetes node. If it persists, escalate to a memory capacity review.

infra-network

Monitors network interface errors and traffic anomalies on cluster nodes.

NetworkInterfaceErrors

Property	Value
Severity	critical
Condition	Any non-zero rate of inbound or outbound packet errors on a network interface
Must persist for	1 minute

What it means: A network interface is dropping or corrupting packets. Even a low error rate can cause TCP retransmissions, increased latency, and connection failures — directly impacting CDN Director communication and external traffic delivery.

Recommended actions:

Identify the affected host and interface from the host and interface labels in the alert.

Check interface error counters on the node:

ip -s link show <interface>
ethtool -S <interface> | grep -i error

Check for duplex/speed mismatches between the node NIC and the upstream switch:

ethtool <interface> | grep -E "Speed|Duplex"

Escalate to network/hardware team if errors are persistent and cannot be attributed to a software configuration issue.

SuddenNetworkEgressDrop

Property	Value
Severity	critical
Condition	Egress throughput drops to less than 50% of the 2-minute baseline, when baseline traffic is above 1 Mbit/s
Must persist for	1 minute

What it means: A significant, sudden reduction in outbound traffic has been detected. This typically indicates an upstream network failure, link fault, or a routing issue. A CDN node that stops transmitting traffic is effectively out of service.
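The condition combines a relative drop with a minimum baseline, so that idle interfaces do not trigger it. A minimal Python sketch of that logic follows. How the 2-minute baseline itself is computed is not shown here, so it is passed in as a plain number; the function name and constants are illustrative.

```python
MIN_BASELINE_BPS = 1_000_000   # 1 Mbit/s
DROP_RATIO = 0.5               # below 50% of baseline


def egress_drop(current_bps: float, baseline_bps: float) -> bool:
    """True when egress has fallen below half of a meaningful baseline."""
    return baseline_bps > MIN_BASELINE_BPS and current_bps < DROP_RATIO * baseline_bps


print(egress_drop(40e6, 100e6))        # large drop on a busy link
print(egress_drop(60e6, 100e6))        # reduction, but above 50%
print(egress_drop(100_000, 500_000))   # baseline under 1 Mbit/s is ignored
```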

Recommended actions:

Identify the affected node and interface from the alert labels.

Verify the node’s network connectivity:

ping <gateway-ip>
traceroute <upstream-endpoint>

Check for interface errors or link-down events:

ip link show
dmesg | grep -i "link\|eth\|nic"

Verify that upstream routing and firewall rules have not changed.
If the node is healthy and the traffic reduction is expected and understood (e.g. a CDN traffic shift), the alert can be silenced.

SuddenNetworkIngressSpike

Property	Value
Severity	warning
Condition	Ingress throughput exceeds twice the 5-minute baseline
Must persist for	1 minute

What it means: A sudden surge of inbound traffic has been detected. This may indicate a legitimate traffic event (e.g. a large stream audience spike), a DDoS attempt, or a misconfigured client sending unexpected volume.

Recommended actions:

Identify the affected node and interface from the alert labels.

Review active connections and top talkers:

ss -s
netstat -an | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head

Correlate with CDN Director metrics in Grafana to determine whether the spike is legitimate CDN traffic.
If the spike is unexpected and sustained, consider rate-limiting or blocking the source at the network edge.
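The netstat pipeline above splits addresses at the first colon, so IPv6 peers are truncated. As an alternative, this Python sketch counts peers from the output of ss -tn, held in a string here. It assumes the peer address is the fifth column, as in current iproute2 output, and the sample connections are invented.

```python
from collections import Counter


def top_talkers(ss_output: str, limit: int = 10) -> list:
    """Count connections per peer host from `ss -tn` output."""
    counts = Counter()
    for line in ss_output.splitlines()[1:]:      # skip the header row
        cols = line.split()
        if len(cols) < 5:
            continue
        peer = cols[4]                           # "host:port" or "[v6]:port"
        host = peer.rsplit(":", 1)[0].strip("[]")
        counts[host] += 1
    return counts.most_common(limit)


# Hypothetical capture of `ss -tn`.
sample = """\
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 0 192.0.2.10:443 198.51.100.7:51234
ESTAB 0 0 192.0.2.10:443 198.51.100.7:51240
ESTAB 0 0 [2001:db8::10]:443 [2001:db8::99]:40000
"""
for host, count in top_talkers(sample):
    print(count, host)
```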

longhorn

Monitors the health of Longhorn distributed block storage, which backs persistent volumes for PostgreSQL, VictoriaMetrics, and other stateful components.

Note: Longhorn alert rules are always present in the alert configuration, but will not fire in environments where Longhorn is not installed (e.g. cloud deployments using external storage).

LonghornVolumeDegraded

Property	Value
Severity	warning
Condition	A Longhorn volume’s robustness state is `Degraded`
Must persist for	2 minutes

What it means: A Longhorn volume has fewer healthy replicas than its configured replication factor. Data is not at immediate risk, but the volume has reduced redundancy. A single additional node or disk failure could result in data loss or volume unavailability.

Recommended actions:

Identify the affected volume from the volume label in the alert.
Open the Longhorn UI and inspect the volume’s replica status.
Check whether a replica is in the process of rebuilding (this is normal after a node restart). Rebuilding may take several minutes depending on volume size.
If a replica has failed and is not rebuilding, attempt to evict and re-schedule it via the Longhorn UI.
Investigate the health of the node that hosted the failed replica:
```
kubectl get nodes
kubectl describe node <node-name>
```

LonghornVolumeFaulted

Property	Value
Severity	critical
Condition	A Longhorn volume’s robustness state is `Faulted`
Must persist for	1 minute

What it means: A Longhorn volume has lost all healthy replicas and is no longer accessible. Any workload that depends on this volume (e.g. PostgreSQL, VictoriaMetrics) will be unable to write and may crash. Data may be at risk.

Recommended actions:

Identify the affected volume from the volume label.

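Immediately check which pods are using the volume. Pod listings do not include volume names, so look up the PersistentVolumeClaim bound to the volume first (dynamically provisioned Longhorn volumes normally share the name of their PersistentVolume), then inspect the claim to see which pods use it:

kubectl get pvc -A | grep -i <volume-name>
kubectl describe pvc -n <namespace> <pvc-name>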

Open the Longhorn UI. Check whether any replicas are still present and whether they can be recovered.
Do not delete faulted replicas without first attempting recovery — they may contain the only copy of the data.
Contact AgileTV support if the volume cannot be recovered, providing Longhorn UI screenshots and node logs.

LonghornNodeDown

Property	Value
Severity	critical
Condition	A Longhorn node reports a non-ready state
Must persist for	2 minutes

What it means: A storage node is unreachable or unhealthy from Longhorn’s perspective. All volumes with replicas on this node are at reduced redundancy. If more than one node goes down simultaneously, faulted volumes and data loss become a risk.

Recommended actions:

Identify the affected node from the node label in the alert.

Check the node’s status in Kubernetes:

kubectl get nodes
kubectl describe node <node-name>

Attempt to SSH to the node and check system health:
```
ssh root@<node-ip>
systemctl status k3s
```
If the node has crashed and cannot be recovered quickly, consider evicting its Longhorn replicas to allow rebuilding on healthy nodes — but only if the remaining healthy nodes have sufficient capacity.

LonghornDiskSpaceLow

Property	Value
Severity	warning
Condition	Available Longhorn disk space on a node falls below 15%
Must persist for	2 minutes

What it means: A node’s Longhorn-managed disk is running low on storage. When Longhorn disk space is exhausted, it cannot schedule new replicas or accommodate volume growth, which can lead to LonghornVolumeDegraded or LonghornVolumeFaulted conditions.

Recommended actions:

Identify the affected node and disk from the node and disk labels in the alert.
Open the Longhorn UI and check which volumes have replicas on this disk.
Check for snapshots or backups that can be cleaned up to reclaim space.
If space cannot be reclaimed, consider adding a disk to the node or expanding the underlying block device.
Review Metrics Retention settings — reducing VictoriaMetrics retention is often the fastest way to reclaim Longhorn disk space in a monitoring-heavy deployment.

Adding Custom Alert Rules

Additional alert rules can be defined by extending the victoria_metrics_alert.server.config.alerts.groups list in your values.yaml. Rules follow the Prometheus alerting rule format.

Example: Adding a Custom Alert

The following example adds an alert group that fires when a Kafka consumer lag exceeds a threshold:

victoria_metrics_alert:
  server:
    config:
      alerts:
        groups:
          # ... existing groups are preserved alongside your additions ...
          - name: kafka
            interval: 15s
            rules:
              - alert: KafkaConsumerLagHigh
                expr: kafka_consumer_group_lag > 10000
                for: 5m
                labels:
                  severity: warning
                annotations:
                  summary: "High consumer lag on {{ $labels.topic }}"
                  description: "Consumer group {{ $labels.group }} is {{ $value }} messages behind on topic {{ $labels.topic }}."

Apply the change using the standard upgrade procedure in the Configuration Guide.

Rule Fields Reference

Field	Required	Description
`alert`	Yes	Alert name. Must be unique within the group.
`expr`	Yes	PromQL expression. The alert fires for each time series the expression returns; an empty result means the alert is inactive.
`for`	No	How long the condition must hold before the alert fires. If omitted, the alert fires on the first evaluation that matches.
`labels.severity`	Recommended	Set to `critical` or `warning` to match the built-in routing rules.
`annotations.summary`	Recommended	Short human-readable description. Supports Go template labels (e.g. `{{ $labels.host }}`).
`annotations.description`	Recommended	Detailed description with context for the on-call operator.
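Before applying a change, a rule can be checked against the field reference above. The sketch below is a hypothetical pre-flight helper, not part of the product: `check_rule` and its return shape are assumptions, and it only inspects field presence and the severity value, not the validity of the expression.

```python
REQUIRED = ("alert", "expr")
RECOMMENDED = ("labels.severity", "annotations.summary", "annotations.description")


def _lookup(rule: dict, dotted: str):
    """Resolve a dotted path such as 'labels.severity' inside a rule dict."""
    node = rule
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def check_rule(rule: dict):
    """Return (errors, warnings) for a single alerting rule."""
    errors = [f"missing required field: {f}" for f in REQUIRED if not _lookup(rule, f)]
    warnings = [f"missing recommended field: {f}" for f in RECOMMENDED if not _lookup(rule, f)]
    severity = _lookup(rule, "labels.severity")
    if severity is not None and severity not in ("critical", "warning"):
        warnings.append(f"severity {severity!r} does not match the built-in routing rules")
    return errors, warnings
```

Running it on the `KafkaConsumerLagHigh` rule from the example yields no errors and no warnings; a rule with only an `alert` name is reported as missing `expr`.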

Tip: Use the Alertmanager UI (https://<manager-host>/alertmanager) to verify that fired alerts are being received and routed correctly after adding new rules.

Configuring Alert Routes

By default, all alerts are routed to the built-in null receiver, which silently discards them. To receive alerts, configure one or more receivers and update the routing rules — all within the alertmanager.config section of your values.yaml.

Route Structure

The top-level route defines the default behaviour. Child routes under routes match alerts by label and direct them to specific receivers:

alertmanager:
  config:
    route:
      receiver: 'null'          # Default: discard unmatched alerts
      group_by: ['alertname']
      group_wait: 10s           # Wait before sending first notification for a new group
      group_interval: 10s       # Wait before sending updated notifications for a group
      repeat_interval: 1h       # Re-notify if an alert is still firing after this period
      routes:
        - matchers:
            - severity="critical"
          receiver: 'slack'
        - matchers:
            - severity="warning"
          receiver: 'email-warning'

Routes are evaluated top-to-bottom. The first matching route wins unless continue: true is set on the route.
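The evaluation order can be illustrated with a small model. This is a simplified sketch, not Alertmanager's implementation: it assumes a single level of child routes, represents each matcher as an exact label/value pair in a hypothetical `match` dict, and ignores grouping and timing.

```python
def matching_receivers(route: dict, labels: dict) -> list:
    """Return the receivers an alert reaches under a flat, equality-only route tree."""
    receivers = []
    for child in route.get("routes", []):
        if all(labels.get(name) == value for name, value in child["match"].items()):
            receivers.append(child["receiver"])
            if not child.get("continue", False):
                break  # first match wins unless continue is set
    # Alerts that match no child route fall back to the top-level receiver.
    return receivers or [route["receiver"]]


route = {
    "receiver": "null",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "slack"},
        {"match": {"severity": "warning"}, "receiver": "email-warning"},
    ],
}
print(matching_receivers(route, {"severity": "critical"}))
print(matching_receivers(route, {"severity": "info"}))
```

In this model a critical alert reaches only `slack`, and an alert with any other severity falls through to the `null` receiver.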

Notification Channels

Email

Email requires an SMTP server to be configured globally. Separate receivers for critical and warning alerts can be defined independently.

alertmanager:
  config:
    global:
      smtp_smarthost: 'smtp.example.com:587'
      smtp_from: 'alertmanager@example.com'
      smtp_require_tls: true
    route:
      routes:
        - matchers:
            - severity="critical"
          receiver: 'email-critical'
        - matchers:
            - severity="warning"
          receiver: 'email-warning'
    receivers:
      - name: 'null'
      - name: 'email-critical'
        email_configs:
          - to: 'oncall@example.com'
            send_resolved: true
      - name: 'email-warning'
        email_configs:
          - to: 'alerts@example.com'
            send_resolved: true

Slack

Requires an incoming webhook URL created in your Slack workspace.

alertmanager:
  config:
    route:
      routes:
        - matchers:
            - severity="critical"
          receiver: 'slack'
    receivers:
      - name: 'null'
      - name: 'slack'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
            channel: '#alerts'
            send_resolved: true
            title: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
            text: |
              *Severity:* {{ .CommonLabels.severity }}
              *Host:* {{ .CommonLabels.host }}
              {{ range .Alerts }}{{ .Annotations.description }}{{ end }}

Telegram

Requires a Telegram bot token and a channel/group chat ID. Create a bot via @BotFather and add it to your alert channel before configuring.

alertmanager:
  config:
    route:
      routes:
        - matchers:
            - severity="critical"
          receiver: 'telegram'
    receivers:
      - name: 'null'
      - name: 'telegram'
        telegram_configs:
          - bot_token: 'your-bot-token'
            chat_id: -1234567890
            parse_mode: 'Markdown'
            send_resolved: true
            message: |
              *Alert:* {{ .CommonLabels.alertname }}
              *Severity:* {{ .CommonLabels.severity }}
              *Host:* {{ .CommonLabels.host }}
              {{ range .Alerts }}
                {{ .Annotations.description }}
              {{ end }}

Finding your chat ID: Add your bot to the channel or group, send a message, then call https://api.telegram.org/bot<token>/getUpdates and read the chat.id from the response. Note that group and channel chat IDs are negative numbers.
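Reading the chat ID out of the `getUpdates` response can be scripted. The sketch below works on a response body that has already been fetched; the helper name `chat_ids` is hypothetical, and the sample payload is a trimmed illustration rather than real output.

```python
import json


def chat_ids(get_updates_body: str) -> list:
    """Collect the distinct chat IDs found in a getUpdates JSON response."""
    payload = json.loads(get_updates_body)
    seen = []
    for update in payload.get("result", []):
        # Group messages arrive under "message", channel posts under "channel_post".
        for key in ("message", "channel_post"):
            chat = update.get(key, {}).get("chat")
            if chat and chat["id"] not in seen:
                seen.append(chat["id"])
    return seen


sample = (
    '{"ok": true, "result": [{"update_id": 1, "message": '
    '{"message_id": 7, "chat": {"id": -1234567890, "type": "group"}, "text": "hi"}}]}'
)
print(chat_ids(sample))
```

The value returned for the group is the number to place in `chat_id`.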

Combining Multiple Receivers

Routes and receivers can be combined to send different alert severities to different channels simultaneously. For example, critical alerts to both Slack and email, warnings to email only:

alertmanager:
  config:
    route:
      receiver: 'null'
      routes:
        - matchers:
            - severity="critical"
          receiver: 'slack'
          continue: true        # Continue matching so the next route also fires
        - matchers:
            - severity="critical"
          receiver: 'email-critical'
        - matchers:
            - severity="warning"
          receiver: 'email-warning'
    receivers:
      - name: 'null'
      - name: 'slack'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
            channel: '#critical-alerts'
            send_resolved: true
      - name: 'email-critical'
        email_configs:
          - to: 'oncall@example.com'
            send_resolved: true
      - name: 'email-warning'
        email_configs:
          - to: 'alerts@example.com'
            send_resolved: true

Apply any receiver or routing changes using the standard upgrade procedure in the Configuration Guide.

Silencing Alerts

Silences suppress alert notifications for a defined time window without disabling the underlying alert rule. They are useful during planned maintenance, known incidents, or when investigating a non-urgent condition.

Silences are managed via the Alertmanager UI, accessible at:

https://<manager-host>/alertmanager

Creating a Silence

Navigate to the Alertmanager UI and click Silences in the top navigation.
Click Create Silence.
Set the Start and End times for the silence window.
Add one or more matchers to scope which alerts are suppressed. For example:
- alertname = StorageFillingUp — silence a specific alert
- severity = warning — silence all warnings
- host = node-01 — silence all alerts from a specific host
Add a Comment describing the reason for the silence (e.g. Planned disk expansion on node-01).
Click Create. The silence takes effect immediately.

Expiring a Silence

Silences expire automatically at the configured end time. To remove a silence early, navigate to Silences in the Alertmanager UI, locate the silence, and click Expire.

Next Steps

Operations Guide - Day-to-day operational procedures
Troubleshooting Guide - Resolve underlying issues surfaced by alerts
Metrics & Monitoring Overview - Return to the monitoring overview

10 - API Guide

REST API reference and integration examples

Overview

The CDN Manager exposes versioned HTTP APIs under /api (v1 and v2), using JSON payloads by default. When sending request bodies, set Content-Type: application/json. Error responses carry { "message": "..." } where available; otherwise the body is empty and the status code alone indicates the failure.

Authentication uses a two-step flow:

Create a session
Exchange that session for an access token with grant_type=session

Use the access token in Authorization: Bearer <token> when calling bearer-protected routes. CORS preflight (OPTIONS) is supported and wildcard origins are accepted by default.

Durations such as TTLs use humantime strings (for example, 60s, 5m, 1h).
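Clients that compute TTLs may find it useful to convert these strings to seconds. The sketch below is deliberately narrow: it handles only the single-unit forms shown above (`s`, `m`, `h`), and the function name is hypothetical. The humantime format itself accepts more units and compound values, which this helper rejects.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def duration_seconds(text: str) -> int:
    """Convert a single-unit duration such as '60s', '5m' or '1h' to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit]
```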

Base URL

All API endpoints are relative to:

https://<manager-host>/api

API Reference Guides

The API documentation is organized by functional area:

Guide	Description
Authentication API	Login, token exchange, logout, and session management
Health API	Liveness and readiness probes
Selection Input API	Key-value and list storage with search capabilities
Data Store API	Generic JSON key/value storage
Subnets API	CIDR-to-value mappings for routing decisions
Routing API	GeoIP lookups and IP validation
Discovery API	Host and namespace discovery
Metrics API	Metrics submission and aggregation
Configuration API	Configuration document management
Operator UI API	Blocked tokens, user agents, and referrers
OpenAPI Specification	Complete OpenAPI 3.0 specification

Authentication Flow

All authenticated API calls follow the same authentication flow. For detailed instructions, see the Authentication API Guide.

Quick Start:

# Step 1: Login to get session
curl -s -X POST "https://cdn-manager/api/v1/auth/login" \
  -H "Content-Type: application/json" \
  -d '{
    "email": "user@example.com",
    "password": "Password1!"
  }' | tee /tmp/session.json

SESSION_ID=$(jq -r '.session_id' /tmp/session.json)
SESSION_TOKEN=$(jq -r '.session_token' /tmp/session.json)

# Step 2: Exchange session for access token
curl -s -X POST "https://cdn-manager/api/v1/auth/token" \
  -H "Content-Type: application/json" \
  -d "$(jq -nc --arg sid "$SESSION_ID" --arg st "$SESSION_TOKEN" \
    '{session_id:$sid,session_token:$st,grant_type:"session",scope:"openid"}')" \
  | tee /tmp/token.json

ACCESS_TOKEN=$(jq -r '.access_token' /tmp/token.json)

# Step 3: Call a protected endpoint
curl -s "https://cdn-manager/api/v1/metrics" \
  -H "Authorization: Bearer ${ACCESS_TOKEN}"

Error Responses

The API uses standard HTTP response codes to indicate the success or failure of an API request.

Most errors return an empty response body with the relevant HTTP status code (e.g., 404 Not Found or 409 Conflict).

In some cases, the server may return a JSON body containing a user-facing error message:

{
  "message": "Human-readable error message"
}

Next Steps

After learning the API:

Operations Guide - Day-to-day operational procedures
Troubleshooting Guide - Resolve API issues
Configuration Guide - Full configuration reference

10.1 - Authentication API

Authentication and session management

Overview

The Authentication API provides endpoints for user authentication, session management, and token exchange. All authenticated API calls require a valid access token obtained through the authentication flow.

Base URL

https://<manager-host>/api/v1/auth

Endpoints

POST /api/v1/auth/login

Create a session from email/password credentials.

Request:

POST /api/v1/auth/login
Content-Type: application/json

{
  "email": "user@example.com",
  "password": "Password1!"
}

Success Response (200):

{
  "session_id": "session-1",
  "session_token": "token-1",
  "verified_at": "2024-01-01T00:00:00Z",
  "expires_at": "2024-01-01T01:00:00Z"
}

Errors:

401 - Authentication failure (invalid credentials)
500 - Backend/state errors

POST /api/v1/auth/token

Exchange a session for an access token (required for bearer auth).

Request:

POST /api/v1/auth/token
Content-Type: application/json

{
  "session_id": "session-1",
  "session_token": "token-1",
  "grant_type": "session",
  "scope": "openid profile"
}

Success Response (200):

{
  "access_token": "<token>",
  "scope": "openid profile",
  "expires_in": 3600,
  "token_type": "bearer"
}

Token Scopes

The scope parameter in the token exchange request is a space-separated string of permissions requested for the access token.

Scope Resolution: When a token is requested, the backend filters the requested scopes against the user’s actual permissions. The resulting access token contains only the subset of requested scopes that the user is authorized to hold.

Naming and Design: Scope names are defined by the applications that consume the tokens, not by the central IAM system. To prevent collisions between applications or modules, application developers should use URN-style prefixes for scope names (e.g., urn:acd:manager:config:read).
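Scope resolution amounts to intersecting the requested scopes with the user's permissions. The following is a minimal sketch of that idea, not the backend's code: the function name is hypothetical, and preserving the requested order in the result is an assumption.

```python
def resolve_scopes(requested: str, permitted: set) -> str:
    """Keep only the requested scopes that the user is permitted to hold."""
    granted = [scope for scope in requested.split() if scope in permitted]
    return " ".join(granted)


print(resolve_scopes("openid profile urn:acd:manager:config:read", {"openid", "profile"}))
```

A client should therefore compare the `scope` field of the token response with what it asked for, rather than assume every requested scope was granted.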

Errors:

401 - Authentication failure (invalid session)
500 - Backend/state errors

POST /api/v1/auth/logout

Revoke a session. Note: This does not revoke issued access tokens; they remain valid until expiration.

Request:

POST /api/v1/auth/logout
Content-Type: application/json

{
  "session_id": "session-1",
  "session_token": "token-1"
}

Success Response (200):

{
  "status": "Ok"
}

Errors:

400 - Invalid session parameters
500 - Backend/state errors

Complete Authentication Flow Example

# Step 1: Login to get session
curl -s -X POST "https://cdn-manager/api/v1/auth/login" \
  -H "Content-Type: application/json" \
  -d '{
    "email": "user@example.com",
    "password": "Password1!"
  }' | tee /tmp/session.json

SESSION_ID=$(jq -r '.session_id' /tmp/session.json)
SESSION_TOKEN=$(jq -r '.session_token' /tmp/session.json)

# Step 2: Exchange session for access token
curl -s -X POST "https://cdn-manager/api/v1/auth/token" \
  -H "Content-Type: application/json" \
  -d "$(jq -nc --arg sid "$SESSION_ID" --arg st "$SESSION_TOKEN" \
    '{session_id:$sid,session_token:$st,grant_type:"session",scope:"openid"}')" \
  | tee /tmp/token.json

ACCESS_TOKEN=$(jq -r '.access_token' /tmp/token.json)

# Step 3: Call a protected endpoint
curl -s "https://cdn-manager/api/v1/metrics" \
  -H "Authorization: Bearer ${ACCESS_TOKEN}"
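The same two-step flow can be written in Python with only the standard library. This is a sketch under stated assumptions: `cdn-manager` is the placeholder host from the curl example, the helper names are hypothetical, and the `post` parameter exists only so the flow can be exercised without a live server.

```python
import json
import urllib.request
from typing import Callable

BASE_URL = "https://cdn-manager/api/v1"  # placeholder host, as in the curl example


def http_post(url: str, payload: dict) -> dict:
    """POST a JSON body and decode the JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def obtain_access_token(email: str, password: str,
                        post: Callable[[str, dict], dict] = http_post) -> str:
    """Run login followed by token exchange and return the access token."""
    session = post(f"{BASE_URL}/auth/login", {"email": email, "password": password})
    token = post(f"{BASE_URL}/auth/token", {
        "session_id": session["session_id"],
        "session_token": session["session_token"],
        "grant_type": "session",
        "scope": "openid",
    })
    return token["access_token"]
```

The returned string is what goes into the `Authorization: Bearer` header. Failed requests raise `urllib.error.HTTPError`, which a caller would typically catch to distinguish 401 from 500.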

Using the Access Token

Once you have obtained an access token, include it in the Authorization header of all API requests:

Authorization: Bearer <access_token>

Example:

curl -s "https://cdn-manager/api/v1/configuration" \
  -H "Authorization: Bearer ${ACCESS_TOKEN}"

Token Expiration

Access tokens expire after the duration specified in expires_in (typically 3600 seconds / 1 hour). When a token expires, you must re-authenticate to obtain a new token.
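A long-running client can track expiry locally instead of waiting for a 401. The sketch below is one hypothetical way to do this; the class name and the 60-second safety margin are assumptions, not values taken from the product.

```python
import time
from typing import Optional


class TokenState:
    """Remember an access token together with the moment it stops being usable."""

    def __init__(self, access_token: str, expires_in: int, issued_at: Optional[float] = None):
        self.access_token = access_token
        start = time.time() if issued_at is None else issued_at
        self.expires_at = start + expires_in

    def needs_refresh(self, now: Optional[float] = None, margin: float = 60.0) -> bool:
        """True once the token is within `margin` seconds of its expiry."""
        current = time.time() if now is None else now
        return current >= self.expires_at - margin
```

Construct the object from the `access_token` and `expires_in` fields of the token response, and re-run the authentication flow whenever `needs_refresh()` returns True.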

Next Steps

Health API - Liveness and readiness probes
Selection Input API - Key-value and list storage
OpenAPI Specification - Complete API specification

10.2 - Health API

Liveness and readiness probe endpoints

Overview

The Health API provides endpoints for Kubernetes health probes and service health checking.

Base URL

https://<manager-host>/api/v1/health

Endpoints

GET /api/v1/health/alive

Liveness probe that indicates whether the service is running. Always returns 200 OK.

Request:

GET /api/v1/health/alive

Response (200):

{
  "status": "Ok"
}

Use Case: Kubernetes liveness probe to determine if the pod should be restarted.

GET /api/v1/health/ready

Readiness probe that checks service readiness including downstream dependencies.

Request:

GET /api/v1/health/ready

Success Response (200):

{
  "status": "Ok"
}

Failure Response (503):

{
  "status": "Fail"
}

Use Case: Kubernetes readiness probe to determine if the pod should receive traffic. Returns 503 if any downstream dependencies (database, Kafka, Redis) are unavailable.

Kubernetes Configuration

Example Kubernetes probe configuration:

livenessProbe:
  httpGet:
    path: /api/v1/health/alive
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /api/v1/health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5

Next Steps

Authentication API - User authentication
Selection Input API - Key-value and list storage
OpenAPI Specification - Complete API specification

10.3 - Selection Input API

Key-value and list storage with search capabilities

Overview

The Selection Input API provides JSON key/value storage with search capabilities. It supports two API versions (v1 and v2) with different operation models.

Base URL

https://<manager-host>/api/v1/selection_input
https://<manager-host>/api/v2/selection_input

Version Comparison

Feature	v1 `/api/v1/selection_input`	v2 `/api/v2/selection_input`
Primary operation	Merge/UPSERT (POST)	Insert/Replace (PUT)
List append	N/A	POST to push to list
Search syntax	Wildcard prefix (`foo*` implicit)	Full wildcard (`foo*` explicit)
Query params	`search`, `sort`, `limit`, `ttl`	`search`, `ttl`, `correlation_id`
Sort support	Yes (`asc`/`desc`)	No
Limit support	Yes	No
Use case	Simple key-value with optional search	List-like operations, full wildcard

When to Use Each Version

Scenario	Recommended Version
Simple key-value storage	v1
List/queue operations (append to array)	v2 POST
Full wildcard pattern matching	v2
Need to sort or paginate results	v1

v1 Endpoints

GET /api/v1/selection_input/{path}

Fetch stored JSON. If the stored value is an object, the optional search, sort, and limit parameters apply to its keys.

Query Parameters:

search - Wildcard prefix search (adds * implicitly)
sort - Sort order (asc or desc)
limit - Maximum results (must be > 0)

Success Response (200):

{
  "foo": 1,
  "foobar": 2
}

Errors:

404 - Path does not exist
400 - Invalid search/sort/limit parameters
500 - Backend failure

Example:

curl -s "https://cdn-manager/api/v1/selection_input/config?search=foo&limit=2"
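The effect of the v1 query parameters on an object's keys can be modelled locally. This is a sketch, not the server's implementation: the function name is hypothetical, and applying the steps in the order search, sort, limit is an assumption.

```python
from typing import Optional


def v1_select(stored: dict, search: Optional[str] = None,
              sort: Optional[str] = None, limit: Optional[int] = None) -> dict:
    """Apply v1-style search, sort and limit to the keys of a stored object."""
    if sort not in (None, "asc", "desc"):
        raise ValueError("sort must be 'asc' or 'desc'")
    if limit is not None and limit <= 0:
        raise ValueError("limit must be > 0")
    # v1 search is an implicit prefix match: "foo" behaves like "foo*".
    keys = [key for key in stored if search is None or key.startswith(search)]
    if sort is not None:
        keys.sort(reverse=(sort == "desc"))
    if limit is not None:
        keys = keys[:limit]
    return {key: stored[key] for key in keys}


stored = {"foo": 1, "foobar": 2, "bar": 3}
print(v1_select(stored, search="foo", limit=2))
```

For the sample object, `search=foo&limit=2` selects the `foo` and `foobar` entries, matching the response shown above.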

POST /api/v1/selection_input/{path}

Upsert (merge) JSON at path. Nested objects are merged recursively.

Query Parameters:

ttl - Expiry time as humantime string (e.g., 10m, 1h)

Request:

{
  "feature_flag": true,
  "ratio": 0.5
}

Success: 201 Created, echoing the payload

Errors:

500 / 503 - Backend failure

Example:

curl -s -X POST "https://cdn-manager/api/v1/selection_input/config?ttl=10m" \
  -H "Content-Type: application/json" \
  -d '{
    "feature_flag": true,
    "ratio": 0.5
  }'
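The recursive merge performed by the v1 POST can be pictured as follows. This is an illustrative sketch only: the function name is hypothetical, and replacing scalars and arrays outright (rather than combining them) is an assumption about the server's behaviour.

```python
def merge(existing, incoming):
    """Recursively merge `incoming` into `existing`, returning a new value."""
    if isinstance(existing, dict) and isinstance(incoming, dict):
        merged = dict(existing)
        for key, value in incoming.items():
            merged[key] = merge(existing[key], value) if key in existing else value
        return merged
    # Assumption: anything that is not an object on both sides is replaced.
    return incoming


before = {"limits": {"cpu": 1, "mem": 2}, "ratio": 0.1}
update = {"limits": {"mem": 4}, "feature_flag": True, "ratio": 0.5}
print(merge(before, update))
```

Keys absent from the request body are left untouched, which is the main difference from the v2 PUT, where the old value is discarded.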

DELETE /api/v1/selection_input/{path}

Delete stored value.

Success: 204 No Content

Errors: 503 - Backend failure

v2 Endpoints

GET /api/v2/selection_input/{path}

Fetch stored JSON with optional wildcard filtering.

Query Parameters:

search - Full wildcard pattern (e.g., foo*, *bar*)
correlation_id - Accepted for logging purposes only; it does not affect the result

Success Response (200):

{
  "foo": 1,
  "foobar": 2
}

Errors:

400 - Invalid search pattern
404 - Path does not exist
500 - Backend failure

Example:

curl -s "https://cdn-manager/api/v2/selection_input/config?search=foo*"
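Unlike v1, a v2 search pattern matches only where a wildcard is written explicitly. The sketch below approximates this with Python's `fnmatch` module; that choice is an assumption, since `fnmatch` also gives special meaning to `?` and `[...]`, which the server may not.

```python
from fnmatch import fnmatchcase


def v2_select(stored: dict, search: str) -> dict:
    """Keep the entries whose key matches an explicit wildcard pattern."""
    return {key: value for key, value in stored.items() if fnmatchcase(key, search)}


stored_v2 = {"foo": 1, "foobar": 2, "rebar": 3}
print(v2_select(stored_v2, "foo*"))
print(v2_select(stored_v2, "*bar*"))
print(v2_select(stored_v2, "foo"))
```

Without a trailing `*`, the pattern `foo` selects only the exact key, whereas the v1 endpoint would treat the same input as a prefix.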

PUT /api/v2/selection_input/{path}

Insert/replace value. Old value is discarded.

Query Parameters:

ttl - Expiry time as humantime string

Request:

{
  "items": ["a", "b", "c"]
}

Success: 200 OK

Example:

curl -s -X PUT "https://cdn-manager/api/v2/selection_input/catalog" \
  -H "Content-Type: application/json" \
  -d '{
    "items": ["a", "b", "c"]
  }'

POST /api/v2/selection_input/{path}

Push a value to the back of a list-like entry (append to array).

Query Parameters:

ttl - Expiry time as humantime string

Request (any JSON value):

{
  "item": 42
}

Or a simple string:

"ready-for-publish"

Success: 200 OK

Example:

curl -s -X POST "https://cdn-manager/api/v2/selection_input/queue" \
  -H "Content-Type: application/json" \
  -d '"ready-for-publish"'

DELETE /api/v2/selection_input/{path}

Delete stored value.

Success: 204 No Content

Next Steps

Data Store API - Generic key/value storage
Subnets API - CIDR-to-value mappings
OpenAPI Specification - Complete API specification

10.4 - Data Store API

Generic JSON key/value storage

Overview

The Data Store API provides generic JSON key/value storage for short-lived or simple structured data.

Base URL

https://<manager-host>/api/v1/datastore

Endpoints

GET /api/v1/datastore

List all known keys.

Query Parameters:

show_hidden - Boolean (default false). When true, includes internal keys starting with _.

Success Response (200):

["user:123", "config:settings", "session:abc"]

Hidden Keys: Keys starting with _ are reserved for internal use (e.g., subnet service). Writing to hidden keys via the datastore API returns 400 Bad Request.

GET /api/v1/datastore/{key}

Retrieve the JSON value for a specific key.

Success Response (200): The stored JSON value

Errors:

404 - Key does not exist
500 - Backend failure

Example:

curl -s "https://cdn-manager/api/v1/datastore/user:123"

POST /api/v1/datastore/{key}

Create a new JSON value at the specified key. Fails if the key already exists.

Query Parameters:

ttl - Expiry time as humantime string (e.g., 60s, 1h)

Request:

{
  "id": 123,
  "name": "alice"
}

Success: 201 Created

Errors:

409 Conflict - Key already exists
500 - Backend failure

Example:

curl -s -X POST "https://cdn-manager/api/v1/datastore/user:123?ttl=1h" \
  -H "Content-Type: application/json" \
  -d '{"id":123,"name":"alice"}'

PUT /api/v1/datastore/{key}

Update or replace the JSON value at an existing key.

Query Parameters:

ttl - Expiry time as humantime string

Success: 200 OK

Errors:

404 - Key does not exist
500 - Backend failure

Example:

curl -s -X PUT "https://cdn-manager/api/v1/datastore/user:123" \
  -H "Content-Type: application/json" \
  -d '{"id":123,"name":"alice-updated"}'

DELETE /api/v1/datastore/{key}

Delete the value at the specified key. Idempotent operation.

Success: 204 No Content

Errors: 500 - Backend failure

Example:

curl -s -X DELETE "https://cdn-manager/api/v1/datastore/user:123"
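The status codes documented for the data store endpoints can be summarised in a small in-memory model. This is a sketch for reasoning about client logic, not the server's code: the class is hypothetical, TTL handling is omitted, and the treatment of hidden keys on DELETE is not modelled.

```python
class DataStoreModel:
    """In-memory model of the documented data store status codes."""

    def __init__(self):
        self._items = {}

    def post(self, key: str, value) -> int:
        if key.startswith("_"):
            return 400          # hidden keys are reserved for internal use
        if key in self._items:
            return 409          # POST only creates
        self._items[key] = value
        return 201

    def put(self, key: str, value) -> int:
        if key.startswith("_"):
            return 400
        if key not in self._items:
            return 404          # PUT only updates an existing key
        self._items[key] = value
        return 200

    def delete(self, key: str) -> int:
        self._items.pop(key, None)
        return 204              # idempotent: same result whether or not the key existed
```

A client that wants create-or-update behaviour would therefore try POST first and fall back to PUT on 409, or the reverse on 404.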

Next Steps

Subnets API - CIDR-to-value mappings
Routing API - GeoIP lookups
OpenAPI Specification - Complete API specification

10.5 - Subnets API

CIDR-to-value mappings for routing decisions

Overview

The Subnets API manages CIDR-to-value mappings used for routing decisions. This allows classification of IP ranges for routing purposes.

Base URL

https://<manager-host>/api/v1/subnets

Endpoints

PUT /api/v1/subnets

Create or update subnet mappings.

Request:

{
  "192.168.1.0/24": "office",
  "10.0.0.0/8": "internal",
  "203.0.113.0/24": "external"
}

Success: 200 OK

Errors:

400 - Invalid CIDR format
500 - Backend failure

Example:

curl -s -X PUT "https://cdn-manager/api/v1/subnets" \
  -H "Content-Type: application/json" \
  -d '{
    "192.168.1.0/24": "office",
    "10.0.0.0/8": "internal"
  }'

GET /api/v1/subnets

List all subnet mappings.

Success Response (200): JSON object of CIDR-to-value mappings

Example:

curl -s "https://cdn-manager/api/v1/subnets" | jq '.'

DELETE /api/v1/subnets

Delete all subnet mappings.

Success: 204 No Content

GET /api/v1/subnets/byKey/{subnet}

Retrieve subnet mappings whose CIDR begins with the given prefix.

Example:

curl -s "https://cdn-manager/api/v1/subnets/byKey/192.168" | jq '.'

GET /api/v1/subnets/byValue/{value}

Retrieve subnet mappings with the given classification value.

Example:

curl -s "https://cdn-manager/api/v1/subnets/byValue/office" | jq '.'

DELETE /api/v1/subnets/byKey/{subnet}

Delete subnet mappings whose CIDR begins with the given prefix.

DELETE /api/v1/subnets/byValue/{value}

Delete subnet mappings with the given classification value.
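The byKey and byValue selections can be reproduced client-side, for example to preview what a DELETE would remove. The sketch below is illustrative: the function names are hypothetical, the prefix match is assumed to be a plain string comparison on the CIDR key, and the server's own CIDR validation may be stricter than Python's `ipaddress` module.

```python
import ipaddress


def valid_cidr(text: str) -> bool:
    """Return True when the text parses as an IPv4 or IPv6 network."""
    try:
        ipaddress.ip_network(text, strict=False)
        return True
    except ValueError:
        return False


def by_key(mappings: dict, prefix: str) -> dict:
    """Mappings whose CIDR string begins with the given prefix."""
    return {cidr: value for cidr, value in mappings.items() if cidr.startswith(prefix)}


def by_value(mappings: dict, value: str) -> dict:
    """Mappings that carry the given classification value."""
    return {cidr: v for cidr, v in mappings.items() if v == value}


subnets = {"192.168.1.0/24": "office", "10.0.0.0/8": "internal", "203.0.113.0/24": "external"}
print(by_key(subnets, "192.168"))
print(by_value(subnets, "internal"))
```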

Next Steps

Routing API - GeoIP lookups and IP validation
Discovery API - Host and namespace discovery
OpenAPI Specification - Complete API specification

10.6 - Routing API

GeoIP lookups and IP validation

Overview

The Routing API provides GeoIP information lookup and IP address validation for routing decisions.

Base URL

https://<manager-host>/api/v1/routing

Endpoints

GET /api/v1/routing/geoip

Look up GeoIP information for an IP address.

Query Parameters:

ip - IP address to look up

Success Response (200):

{
  "city": {
    "name": "Washington"
  },
  "asn": 64512
}

Errors:

400 - Invalid IP format
500 - Backend failure

Caching: Cache-Control: public, max-age=86400 (24 hours)

Example:

curl -s "https://cdn-manager/api/v1/routing/geoip?ip=149.101.100.0"

GET /api/v1/routing/validate

Validate whether an IP address is allowed (not blocked).

Query Parameters:

ip - IP address to validate

Success Response (200): Empty body (IP is allowed)

Forbidden Response (403):

Access Denied

Errors:

400 - Invalid IP format
500 - Backend failure

Caching: Cache-Control headers included (default: max-age=300, configurable via [tuning] section)

Example:

curl -i "https://cdn-manager/api/v1/routing/validate?ip=149.101.100.0"
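A client that caches routing responses needs the `max-age` value from the `Cache-Control` header. The following is a minimal sketch of extracting it; the function name is hypothetical and only the `max-age` directive is considered.

```python
import re
from typing import Optional


def max_age_seconds(cache_control: str) -> Optional[int]:
    """Return the max-age directive in seconds, or None when it is absent."""
    match = re.search(r"(?:^|,)\s*max-age=(\d+)", cache_control, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None


print(max_age_seconds("public, max-age=86400"))
print(max_age_seconds("max-age=300"))
```

The first call corresponds to the GeoIP endpoint's 24-hour lifetime and the second to the default validation lifetime.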

Use Cases

GeoIP-Based Routing

Use the /geoip endpoint to determine the geographic location and ASN of an IP address for routing decisions:

# Get location data for routing
IP_INFO=$(curl -s "https://cdn-manager/api/v1/routing/geoip?ip=203.0.113.50")
CITY=$(echo "$IP_INFO" | jq -r '.city.name')
ASN=$(echo "$IP_INFO" | jq -r '.asn')

echo "Routing based on city: $CITY, ASN: $ASN"

IP Validation

Use the /validate endpoint to check if an IP is allowed before processing requests:

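# Check if IP is allowed
RESPONSE=$(curl -s -o /dev/null -w "%{http_code}" \
  "https://cdn-manager/api/v1/routing/validate?ip=203.0.113.50")

if [ "$RESPONSE" = "200" ]; then
  echo "IP is allowed"
elif [ "$RESPONSE" = "403" ]; then
  echo "IP is blocked"
else
  # 400 (invalid IP), 500 (backend failure), or 000 if curl could not connect
  echo "Validation failed (HTTP $RESPONSE)" >&2
fi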

Next Steps

Discovery API - Host and namespace discovery
Metrics API - Metrics submission and aggregation
OpenAPI Specification - Complete API specification

10.7 - Discovery API

Host and namespace discovery

Overview

The Discovery API provides information about discovered hosts and namespaces. Discovery is configured via the Helm chart values.yaml file. Each entry defines a namespace with a list of hostnames.

Base URL

https://<manager-host>/api/v1/discovery

Endpoints

GET /api/v1/discovery/hosts

Return discovered hosts grouped by namespace.

Success Response (200):

{
  "directors": [
    { "name": "director-1.example.com" }
  ],
  "edge-servers": [
    { "name": "cdn1.example.com" },
    { "name": "cdn2.example.com" }
  ]
}

Example:

curl -s "https://cdn-manager/api/v1/discovery/hosts"
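Because the response is grouped by namespace, looking up the namespace of a given host requires inverting it. The sketch below shows one way to do that on an already-decoded response; the function name is hypothetical.

```python
def host_namespaces(hosts_response: dict) -> dict:
    """Map each discovered host name to the namespace it is listed under."""
    return {
        host["name"]: namespace
        for namespace, hosts in hosts_response.items()
        for host in hosts
    }


response = {
    "directors": [{"name": "director-1.example.com"}],
    "edge-servers": [{"name": "cdn1.example.com"}, {"name": "cdn2.example.com"}],
}
print(host_namespaces(response))
```

If the same host appears in more than one namespace, this simple inversion keeps only the last one encountered.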

GET /api/v1/discovery/namespaces

Return discovery namespaces with their corresponding Confd URIs.

Success Response (200):

[
  {
    "namespace": "edge-servers",
    "confd_uri": "/api/v1/confd/edge-servers"
  },
  {
    "namespace": "directors",
    "confd_uri": "/api/v1/confd/directors"
  }
]

Example:

curl -s "https://cdn-manager/api/v1/discovery/namespaces"

Configuration

Discovery is configured via the Helm chart values.yaml file under manager.discovery:

manager:
  discovery:
    - namespace: "directors"
      hosts:
        - director-1.example.com
        - director-2.example.com
    - namespace: "edge-servers"
      hosts:
        - cdn1.example.com
        - cdn2.example.com

Each entry defines a namespace with a list of hostnames. Optionally, a pattern field can be specified for regex-based host matching.

Next Steps

Metrics API - Metrics submission and aggregation
Configuration API - Configuration document management
OpenAPI Specification - Complete API specification

10.8 - Metrics API

Metrics submission and aggregation

Overview

The Metrics API allows submission and retrieval of metrics data from CDN components.

Base URL

https://<manager-host>/api/v1/metrics

Endpoints

POST /api/v1/metrics

Submit metrics data.

Request:

{
  "example.com": {
    "metric1": 100,
    "metric2": 200
  }
}

Success: 200 OK

Errors: 500 - Validation/backend errors

Example:

curl -s -X POST "https://cdn-manager/api/v1/metrics" \
  -H "Content-Type: application/json" \
  -d '{
    "example.com": {
      "metric1": 100,
      "metric2": 200
    }
  }'

GET /api/v1/metrics

Return aggregated metrics per host.

Response: JSON object with aggregated metrics per host

Note: Metrics are stored per host for up to 5 minutes. Hosts that stop reporting disappear from aggregation after that window. When no metrics are being reported, the endpoint returns an empty object {}.

Example:

curl -s "https://cdn-manager/api/v1/metrics"

Metrics Retention

Metrics are stored for up to 5 minutes in the aggregation layer
For long-term metrics storage, data is forwarded to VictoriaMetrics
Query historical metrics via Grafana dashboards at /grafana

Next Steps

Configuration API - Configuration document management
Operator UI API - Blocked tokens, user agents, and referrers
OpenAPI Specification - Complete API specification

10.9 - Configuration API

Configuration document management

Overview

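The Configuration API provides endpoints for managing the system configuration document. GET responses carry an ETag header; send that value back in If-None-Match to make a later GET conditional, in which case the server returns 304 Not Modified when the document is unchanged.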

Operational Note: This API is intended for internal verification only. Behavior is undefined in multi-replica clusters because pods do not coordinate config writes.

Base URL

https://<manager-host>/api/v1/configuration

Endpoints

GET /api/v1/configuration

Retrieve the configuration document.

Success: 200 OK with configuration JSON

Conditional GET: Returns 304 Not Modified if If-None-Match header matches current ETag

Example:

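# Get the ETag from the response headers (header names are case-insensitive; strip the trailing CR)
etag=$(curl -s -D- -o /dev/null "https://cdn-manager/api/v1/configuration" | awk 'tolower($1)=="etag:"{print $2}' | tr -d '\r')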

# Conditional GET - returns 304 if config unchanged
curl -s -H "If-None-Match: $etag" "https://cdn-manager/api/v1/configuration" -o /tmp/cfg.json -w "%{http_code}\n"
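
The same conditional GET can be sketched in Python with the standard library. The function names and the opener parameter are illustrative. The sketch relies on the fact that urllib raises HTTPError for a 304 response, and it keeps the previously cached document in that case.

```python
import json
import urllib.error
import urllib.request


def build_conditional_get(url, etag=None):
    """Build a GET request, adding If-None-Match when an ETag is cached."""
    req = urllib.request.Request(url, method="GET")
    if etag:
        req.add_header("If-None-Match", etag)
    return req


def fetch_configuration(url, cached=None, etag=None, opener=urllib.request.urlopen):
    """Return (config, etag), reusing `cached` when the server answers 304."""
    req = build_conditional_get(url, etag)
    try:
        with opener(req) as resp:
            return json.load(resp), resp.headers.get("ETag")
    except urllib.error.HTTPError as err:
        if err.code == 304:  # unchanged: keep the cached copy and its ETag
            return cached, etag
        raise
```

A caller would store the returned pair and pass both values to the next call, for example fetch_configuration(url, cached=config, etag=etag).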

PUT /api/v1/configuration

Replace the configuration document.

Request:

{
  "feature_flag": false,
  "ratio": 0.25
}

Success: 200 OK

Errors:

400 - Invalid configuration format
500 - Backend failure

DELETE /api/v1/configuration

Delete the configuration document.

Success: 200 OK

ETag Usage

The configuration API supports ETags for optimistic concurrency control:

# 1. Get current config and ETag
response=$(curl -s -D headers.txt "https://cdn-manager/api/v1/configuration")
etag=$(grep -i ETag headers.txt | cut -d' ' -f2 | tr -d '\r')

# 2. Modify the config as needed
modified_config=$(echo "$response" | jq '.feature_flag = true')

# 3. Update with ETag to prevent overwriting concurrent changes
curl -s -X PUT "https://cdn-manager/api/v1/configuration" \
  -H "Content-Type: application/json" \
  -H "If-Match: $etag" \
  -d "$modified_config"
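
For clients written in Python, the update step might look like the sketch below. It only builds the request; the helper name build_guarded_put and the ETag value are placeholders, and how the server answers a stale ETag is not described in this guide.

```python
import json
import urllib.request


def build_guarded_put(url, config, etag):
    """Build a PUT request that sends the previously read ETag in If-Match."""
    return urllib.request.Request(
        url,
        data=json.dumps(config).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json", "If-Match": etag},
    )


# Start from the document read earlier, then change one field
config = {"feature_flag": False, "ratio": 0.25}
config["feature_flag"] = True

# '"abc123"' is a placeholder for the ETag returned by the earlier GET
req = build_guarded_put("https://cdn-manager/api/v1/configuration", config, '"abc123"')
print(req.get_method(), req.full_url)
```

Passing req to urllib.request.urlopen sends the update.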

Next Steps

Operator UI API - Blocked tokens, user agents, and referrers
OpenAPI Specification - Complete API specification

10.10 - Operator UI API

Blocked tokens, user agents, and referrers

Overview

The Operator UI API provides read-only helpers exposing curated selection input content for the operator interface.

Query Parameters: search, sort, limit (same as selection input v1)

Note: Stored keys for user agents/referrers are URL-safe base64; responses decode them to human-readable values.

Base URL

https://<manager-host>/api/v1/operator_ui

Endpoints

Blocked Household Tokens

GET /api/v1/operator_ui/modules/blocked_tokens

List all blocked household tokens.

Success Response (200):

[
  {
    "household_token": "house-001_token-abc",
    "expire_time": 1625247600
  }
]

GET /api/v1/operator_ui/modules/blocked_tokens/{token}

Get details for a specific blocked token.

Success Response (200):

{
  "household_token": "house-001_token-abc",
  "expire_time": 1625247600
}
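
The expire_time value in these responses looks like a Unix timestamp in seconds, although this guide does not say so explicitly. Under that assumption, the following sketch converts it to a UTC datetime and checks whether a block has lapsed. The helper names are hypothetical.

```python
from datetime import datetime, timezone


def expiry_utc(entry):
    """Interpret expire_time as Unix epoch seconds (assumption) in UTC."""
    return datetime.fromtimestamp(entry["expire_time"], tz=timezone.utc)


def is_expired(entry, now=None):
    """Return True when the block's expiry is at or before `now`."""
    now = now or datetime.now(timezone.utc)
    return expiry_utc(entry) <= now


token = {"household_token": "house-001_token-abc", "expire_time": 1625247600}
print(expiry_utc(token).isoformat())  # 2021-07-02T17:40:00+00:00
```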

Blocked User Agents

GET /api/v1/operator_ui/modules/blocked_user_agents

List all blocked user agents.

Success Response (200):

[
  {
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
  },
  {
    "user_agent": "curl/7.68.0"
  }
]

GET /api/v1/operator_ui/modules/blocked_user_agents/{encoded}

Get details for a specific blocked user agent. The path variable is URL-safe base64 encoded.

Example:

# Encode the user agent
ENC=$(python3 -c "import base64; print(base64.urlsafe_b64encode(b'curl/7.68.0').decode().rstrip('='))")

# Get details
curl -s "https://cdn-manager/api/v1/operator_ui/modules/blocked_user_agents/$ENC"

Blocked Referrers

GET /api/v1/operator_ui/modules/blocked_referrers

List all blocked referrers.

Success Response (200):

[
  {
    "referrer": "https://spam-example.com"
  }
]

GET /api/v1/operator_ui/modules/blocked_referrers/{encoded}

Get details for a specific blocked referrer. The path variable is URL-safe base64 encoded.

Example:

# Encode the referrer
ENC=$(python3 -c "import base64; print(base64.urlsafe_b64encode(b'https://spam-example.com').decode().rstrip('='))")

# Get details
curl -s "https://cdn-manager/api/v1/operator_ui/modules/blocked_referrers/$ENC"

URL-Safe Base64 Encoding

The Operator UI API uses URL-safe base64 encoding for path parameters. To encode values:

Python:

import base64

# Encode
encoded = base64.urlsafe_b64encode(b'value').decode().rstrip('=')

# Decode
decoded = base64.urlsafe_b64decode(encoded + '=' * (-len(encoded) % 4)).decode()

Bash (with openssl):

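# Encode (openssl has no URL-safe mode, so translate the alphabet and strip the padding)
printf '%s' "value" | openssl base64 -A | tr '+/' '-_' | tr -d '='

# Decode (restore the padding and the standard alphabet first)
enc="dmFsdWU"
pad=$(( (4 - ${#enc} % 4) % 4 ))
printf '%s%s\n' "$enc" "$(printf '%*s' "$pad" '' | tr ' ' '=')" | tr '_-' '/+' | openssl base64 -d -A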

Next Steps

OpenAPI Specification - Complete API specification

10.11 - OpenAPI Specification

Complete OpenAPI 3.0 specification

Overview

The CDN Manager API is documented using the OpenAPI 3.0 specification. This appendix provides the complete specification for reference and for generating API clients.

OpenAPI Specification (YAML)

openapi: 3.0.3
info:
  title: AgileTV CDN Manager API
  version: "1.0"
servers:
  - url: https://<manager-host>/api
    description: CDN Manager API server
paths:
  /v1/auth/login:
    post:
      summary: Login and create session
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/LoginRequest'
      responses:
        '200':
          description: Session created
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/LoginResponse'
        '401': { description: Unauthorized, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
        '500': { description: Internal error, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
  /v1/auth/token:
    post:
      summary: Exchange session for access token
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/TokenRequest'
      responses:
        '200':
          description: Access token
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/TokenResponse'
        '401': { description: Unauthorized, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
        '500': { description: Internal error, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
  /v1/auth/logout:
    post:
      summary: Revoke session
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/LogoutRequest'
      responses:
        '200': { description: Revoked, content: { application/json: { schema: { $ref: '#/components/schemas/LogoutResponse' } } } }
        '401': { description: Unauthorized, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
        '500': { description: Internal error, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
  /v1/selection_input{tail}:
    get:
      summary: Read selection input
      parameters:
        - $ref: '#/components/parameters/Tail'
        - $ref: '#/components/parameters/Search'
        - $ref: '#/components/parameters/Sort'
        - $ref: '#/components/parameters/Limit'
      responses:
        '200': { description: JSON value }
        '400': { description: Bad request, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
        '404': { description: Not found }
        '500': { description: Backend failure }
    post:
      summary: Merge selection input
      parameters:
        - $ref: '#/components/parameters/Tail'
        - $ref: '#/components/parameters/Ttl'
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/AnyJson'
      responses:
        '201': { description: Created, content: { application/json: { schema: { $ref: '#/components/schemas/AnyJson' } } } }
        '500': { description: Backend failure }
        '503': { description: Service unavailable }
    delete:
      summary: Delete selection input
      parameters:
        - $ref: '#/components/parameters/Tail'
      responses:
        '204': { description: Deleted }
        '503': { description: Service unavailable }
  /v2/selection_input{tail}:
    get:
      summary: Read selection input v2
      parameters:
        - $ref: '#/components/parameters/TailV2'
        - $ref: '#/components/parameters/Search'
      responses:
        '200': { description: JSON value }
        '400': { description: Invalid search pattern }
        '404': { description: Not found }
        '500': { description: Backend failure }
    put:
      summary: Replace selection input v2
      parameters:
        - $ref: '#/components/parameters/TailV2'
        - $ref: '#/components/parameters/Ttl'
        - $ref: '#/components/parameters/CorrelationId'
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/AnyJson'
      responses:
        '200': { description: Updated }
        '500': { description: Backend failure }
    post:
      summary: Push to selection input v2
      parameters:
        - $ref: '#/components/parameters/TailV2'
        - $ref: '#/components/parameters/Ttl'
        - $ref: '#/components/parameters/CorrelationId'
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/AnyJson'
      responses:
        '200': { description: Pushed }
        '500': { description: Backend failure }
    delete:
      summary: Delete selection input v2
      parameters:
        - $ref: '#/components/parameters/TailV2'
      responses:
        '204': { description: Deleted }
        '500': { description: Backend failure }
  /v1/configuration:
    get:
      summary: Read configuration
      responses:
        '200': { description: Configuration, content: { application/json: { schema: { $ref: '#/components/schemas/AnyJson' } } }, headers: { ETag: { schema: { type: string } } } }
        '304': { description: Not modified }
        '500': { description: Backend failure }
    put:
      summary: Replace configuration
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/AnyJson'
      responses:
        '200': { description: Replaced }
        '500': { description: Backend failure }
    delete:
      summary: Delete configuration
      responses:
        '200': { description: Deleted }
        '500': { description: Backend failure }
  /v1/routing/geoip:
    get:
      summary: GeoIP lookup
      parameters:
        - name: ip
          in: query
          required: true
          schema: { type: string }
      responses:
        '200': { description: GeoIP data, content: { application/json: { schema: { $ref: '#/components/schemas/GeoIpResponse' } } } }
        '400': { description: Invalid IP }
        '500': { description: Backend failure }
  /v1/routing/validate:
    get:
      summary: Validate routing
      parameters:
        - name: ip
          in: query
          required: true
          schema: { type: string }
      responses:
        '200': { description: Allowed }
        '403': { description: Access Denied }
        '400': { description: Invalid IP }
        '500': { description: Backend failure }
  /v1/metrics:
    post:
      summary: Ingest metrics
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/MetricsIngress'
      responses:
        '200': { description: Stored }
        '500': { description: Validation/back-end error }
    get:
      summary: Aggregate metrics
      responses:
        '200': { description: Aggregated metrics, content: { application/json: { schema: { $ref: '#/components/schemas/AnyJson' } } } }
        '500': { description: Backend failure }
  /v1/discovery/hosts:
    get:
      summary: List discovered hosts by namespace
      responses:
        '200':
          description: Discovered hosts keyed by namespace
          content:
            application/json:
              schema:
                type: object
                additionalProperties:
                  type: array
                  items:
                    $ref: '#/components/schemas/DiscoveryHost'
        '500': { description: Backend failure }
  /v1/discovery/namespaces:
    get:
      summary: List discovery namespaces with Confd URIs
      responses:
        '200':
          description: Namespaces with Confd links
          content:
            application/json:
              schema:
                type: array
                items:
                  $ref: '#/components/schemas/DiscoveryNamespace'
        '500': { description: Backend failure }
  /v1/datastore:
    get:
      summary: List datastore keys
      responses:
        '200': { description: Keys list, content: { application/json: { schema: { type: array, items: { type: string } } } } }
        '500': { description: Backend failure }
  /v1/datastore/{key}:
    get:
      summary: Get a JSON value by key
      parameters:
        - name: key
          in: path
          required: true
          schema: { type: string }
      responses:
        '200': { description: JSON value, content: { application/json: { schema: { $ref: '#/components/schemas/AnyJson' } } } }
        '404': { description: Not found }
        '500': { description: Backend failure }
    post:
      summary: Create a JSON value at key
      parameters:
        - name: key
          in: path
          required: true
          schema: { type: string }
        - $ref: '#/components/parameters/Ttl'
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/AnyJson'
      responses:
        '201': { description: Created }
        '409': { description: Conflict (already exists) }
        '500': { description: Backend failure }
    put:
      summary: Update/replace a JSON value at key
      parameters:
        - name: key
          in: path
          required: true
          schema: { type: string }
        - $ref: '#/components/parameters/Ttl'
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/AnyJson'
      responses:
        '200': { description: Updated }
        '404': { description: Not found }
        '500': { description: Backend failure }
    delete:
      summary: Delete a datastore key
      parameters:
        - name: key
          in: path
          required: true
          schema: { type: string }
      responses:
        '204': { description: Deleted }
        '500': { description: Backend failure }
  /v1/subnets:
    get:
      summary: List all subnet mappings
      responses:
        '200': { description: Subnet mappings, content: { application/json: { schema: { type: object, additionalProperties: { type: string } } } } }
        '500': { description: Backend failure }
    put:
      summary: Create or update subnet mappings
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              additionalProperties:
                type: string
              description: Map of CIDR strings to classification values
      responses:
        '200': { description: Created }
        '400': { description: Invalid CIDR format }
        '500': { description: Backend failure }
    delete:
      summary: Delete all subnet mappings
      responses:
        '204': { description: Deleted }
        '500': { description: Backend failure }
  /v1/subnets/byKey/{subnet}:
    get:
      summary: Get subnet mappings by CIDR prefix
      parameters:
        - name: subnet
          in: path
          required: true
          schema: { type: string }
      responses:
        '200': { description: Subnet mappings, content: { application/json: { schema: { type: object, additionalProperties: { type: string } } } } }
        '500': { description: Backend failure }
    delete:
      summary: Delete subnet mappings by CIDR prefix
      parameters:
        - name: subnet
          in: path
          required: true
          schema: { type: string }
      responses:
        '204': { description: Deleted }
        '500': { description: Backend failure }
  /v1/subnets/byValue/{value}:
    get:
      summary: Get subnet mappings by value
      parameters:
        - name: value
          in: path
          required: true
          schema: { type: string }
      responses:
        '200': { description: Subnet mappings, content: { application/json: { schema: { type: object, additionalProperties: { type: string } } } } }
        '500': { description: Backend failure }
    delete:
      summary: Delete subnet mappings by value
      parameters:
        - name: value
          in: path
          required: true
          schema: { type: string }
      responses:
        '204': { description: Deleted }
        '500': { description: Backend failure }
  /v1/operator_ui/modules/blocked_tokens:
    get:
      summary: List blocked tokens
      parameters:
        - $ref: '#/components/parameters/Search'
        - $ref: '#/components/parameters/Sort'
        - $ref: '#/components/parameters/Limit'
      responses:
        '200': { description: Blocked tokens, content: { application/json: { schema: { type: array, items: { $ref: '#/components/schemas/BlockedToken' } } } } }
        '400': { description: Parse error, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
  /v1/operator_ui/modules/blocked_tokens/{token}:
    get:
      summary: Get blocked token
      parameters:
        - name: token
          in: path
          required: true
          schema: { type: string }
      responses:
        '200': { description: Blocked token, content: { application/json: { schema: { $ref: '#/components/schemas/BlockedToken' } } } }
        '404': { description: Not found }
        '400': { description: Parse error, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
  /v1/operator_ui/modules/blocked_user_agents:
    get:
      summary: List blocked user agents
      parameters:
        - $ref: '#/components/parameters/Search'
        - $ref: '#/components/parameters/Sort'
        - $ref: '#/components/parameters/Limit'
      responses:
        '200': { description: Blocked user agents, content: { application/json: { schema: { type: array, items: { $ref: '#/components/schemas/BlockedUserAgent' } } } } }
        '400': { description: Parse error, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
  /v1/operator_ui/modules/blocked_user_agents/{encoded}:
    get:
      summary: Get blocked user agent
      parameters:
        - name: encoded
          in: path
          required: true
          schema: { type: string }
      responses:
        '200': { description: Blocked user agent, content: { application/json: { schema: { $ref: '#/components/schemas/BlockedUserAgent' } } } }
        '400': { description: Parse error, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
  /v1/operator_ui/modules/blocked_referrers:
    get:
      summary: List blocked referrers
      parameters:
        - $ref: '#/components/parameters/Search'
        - $ref: '#/components/parameters/Sort'
        - $ref: '#/components/parameters/Limit'
      responses:
        '200': { description: Blocked referrers, content: { application/json: { schema: { type: array, items: { $ref: '#/components/schemas/BlockedReferrer' } } } } }
        '400': { description: Parse error, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
  /v1/operator_ui/modules/blocked_referrers/{encoded}:
    get:
      summary: Get blocked referrer
      parameters:
        - name: encoded
          in: path
          required: true
          schema: { type: string }
      responses:
        '200': { description: Blocked referrer, content: { application/json: { schema: { $ref: '#/components/schemas/BlockedReferrer' } } } }
        '400': { description: Parse error, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
  /v1/health/alive:
    get:
      summary: Liveness check
      responses:
        '200': { description: Alive, content: { application/json: { schema: { $ref: '#/components/schemas/HealthStatus' } } } }
  /v1/health/ready:
    get:
      summary: Readiness check
      responses:
        '200': { description: Ready, content: { application/json: { schema: { $ref: '#/components/schemas/HealthStatus' } } } }
        '503': { description: Unready, content: { application/json: { schema: { $ref: '#/components/schemas/HealthStatus' } } } }
components:
  parameters:
    Tail:
      name: tail
      in: path
      required: true
      schema: { type: string }
    TailV2:
      name: tail
      in: path
      required: true
      schema: { type: string }
    Search:
      name: search
      in: query
      required: false
      schema: { type: string }
    Sort:
      name: sort
      in: query
      required: false
      schema: { type: string, enum: [asc, desc] }
    Limit:
      name: limit
      in: query
      required: false
      schema: { type: integer, minimum: 1 }
    Ttl:
      name: ttl
      in: query
      required: false
      schema: { type: string, description: Humantime duration }
    CorrelationId:
      name: correlation_id
      in: query
      required: false
      schema: { type: string }
  schemas:
    LoginRequest:
      type: object
      required: [email, password]
      properties:
        email: { type: string, format: email }
        password: { type: string, format: password }
    LoginResponse:
      type: object
      properties:
        session_id: { type: string }
        session_token: { type: string }
        verified_at: { type: string, format: date-time }
        expires_at: { type: string, format: date-time }
    LogoutRequest:
      type: object
      required: [session_id]
      properties:
        session_id: { type: string }
        session_token: { type: string }
    LogoutResponse:
      type: object
      properties:
        status: { $ref: '#/components/schemas/StatusValue' }
    TokenRequest:
      type: object
      required: [session_id, session_token, grant_type]
      properties:
        session_id: { type: string }
        session_token: { type: string }
        scope: { type: string }
        grant_type: { type: string, enum: [session] }
    TokenResponse:
      type: object
      required: [access_token, scope, expires_in, token_type]
      properties:
        access_token: { type: string }
        scope: { type: string }
        expires_in: { type: integer, format: int64 }
        token_type: { type: string, enum: [bearer] }
    ErrorResponse:
      type: object
      properties:
        message: { type: string }
    AnyJson:
      description: Arbitrary JSON value
    MetricsIngress:
      type: object
      additionalProperties:
        type: object
        additionalProperties: { type: number }
    GeoIpResponse:
      type: object
      properties:
        city:
          type: object
          properties:
            name: { type: string }
        asn: { type: integer }
        is_anonymous: { type: boolean }
    BlockedToken:
      type: object
      properties:
        household_token: { type: string }
        expire_time: { type: integer, format: int64 }
    BlockedUserAgent:
      type: object
      properties:
        user_agent: { type: string }
    BlockedReferrer:
      type: object
      properties:
        referrer: { type: string }
    DiscoveryHost:
      type: object
      properties:
        name: { type: string }
    DiscoveryNamespace:
      type: object
      properties:
        namespace: { type: string }
        confd_uri: { type: string }
    HealthStatus:
      type: object
      properties:
        status: { $ref: '#/components/schemas/StatusValue' }
    StatusValue:
      type: string
      enum: [Ok, Fail]
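
The LoginResponse and TokenRequest schemas share the session_id and session_token fields, so a client can build the token exchange body directly from the login result. The Python sketch below illustrates this. The function name, the sample values, and the scope string are placeholders rather than values taken from the product.

```python
def build_token_request(login_response, scope=None):
    """Build a TokenRequest body from a LoginResponse, per the schemas above."""
    body = {
        "session_id": login_response["session_id"],
        "session_token": login_response["session_token"],
        "grant_type": "session",  # the only value allowed by the schema enum
    }
    if scope is not None:
        body["scope"] = scope  # optional field in TokenRequest
    return body


# Placeholder values standing in for a real POST /v1/auth/login response
login = {"session_id": "sid-123", "session_token": "tok-456"}
print(build_token_request(login))
```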

Using the OpenAPI Specification

Generating API Clients

The OpenAPI specification can be used to generate client libraries in multiple languages:

Using openapi-generator:

# Generate Python client
openapi-generator generate -i openapi.yaml -g python -o ./python-client

# Generate TypeScript client
openapi-generator generate -i openapi.yaml -g typescript-axios -o ./typescript-client

# Generate Go client
openapi-generator generate -i openapi.yaml -g go -o ./go-client

Using swagger-codegen:

swagger-codegen generate -i openapi.yaml -l python -o ./python-client

Validating the Specification

To validate the OpenAPI specification:

# Using swagger-cli
swagger-cli validate openapi.yaml

# Using spectral
spectral lint openapi.yaml
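
Before generating a client, it can be useful to list the operations a specification defines. The sketch below walks an already parsed specification; loading openapi.yaml into a dictionary (for example with a YAML parser) is assumed and not shown. The function name list_operations is illustrative, and the excerpt contains only a few paths from the specification above.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def list_operations(spec):
    """Return (METHOD, path, summary) tuples sorted by path, then method."""
    operations = []
    for path, item in spec.get("paths", {}).items():
        for method, operation in item.items():
            if method in HTTP_METHODS:  # skip non-operation keys such as "parameters"
                operations.append((method.upper(), path, operation.get("summary", "")))
    return sorted(operations, key=lambda op: (op[1], op[0]))


# Small excerpt of the specification, already parsed into a dict
excerpt = {
    "paths": {
        "/v1/health/alive": {"get": {"summary": "Liveness check"}},
        "/v1/configuration": {
            "get": {"summary": "Read configuration"},
            "put": {"summary": "Replace configuration"},
            "delete": {"summary": "Delete configuration"},
        },
    }
}

for method, path, summary in list_operations(excerpt):
    print(f"{method:6} {path}  {summary}")
```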

Next Steps

Authentication API - Detailed authentication flow
API Guide Index - Browse all API documentation
Operations Guide - Day-to-day operational procedures

11 - Troubleshooting Guide

Common issues and resolution procedures

Overview

This guide provides troubleshooting procedures for common issues encountered when operating the AgileTV CDN Manager (ESB3027). Use the diagnostic commands and resolution steps to identify and resolve problems.

Diagnostic Tools

Cluster Status

# Check node status
kubectl get nodes

# Check all pods
kubectl get pods -A

# Check events sorted by time
kubectl get events --sort-by='.lastTimestamp'

# Check resource usage
kubectl top nodes
kubectl top pods

Component Status

# Check deployments
kubectl get deployments

# Check statefulsets
kubectl get statefulsets

# Check persistent volumes
kubectl get pvc
kubectl get pv

# Check services
kubectl get services

# Check ingress
kubectl get ingress
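
When many pods are involved, the JSON output of kubectl get pods -A -o json can be filtered with a short script. The Python sketch below is one way to do that: it reports pods whose phase is neither Running nor Succeeded, and pods with a container in a waiting state such as CrashLoopBackOff or ContainerCreating. The function name and the sample pod names are hypothetical.

```python
def unhealthy_pods(pod_list):
    """Return (namespace, name, reason) for pods that need attention."""
    findings = []
    for pod in pod_list.get("items", []):
        meta = pod["metadata"]
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        # Waiting reasons (e.g. CrashLoopBackOff) live on the container statuses
        waiting = [
            cs["state"]["waiting"].get("reason", "Waiting")
            for cs in status.get("containerStatuses", [])
            if "waiting" in cs.get("state", {})
        ]
        namespace = meta.get("namespace", "default")
        if waiting:
            findings.append((namespace, meta["name"], waiting[0]))
        elif phase not in ("Running", "Succeeded"):
            findings.append((namespace, meta["name"], phase))
    return findings


# Trimmed sample of the structure returned by `kubectl get pods -A -o json`
sample = {
    "items": [
        {"metadata": {"namespace": "default", "name": "manager-0"},
         "status": {"phase": "Running",
                    "containerStatuses": [{"state": {"running": {}}}]}},
        {"metadata": {"namespace": "default", "name": "manager-1"},
         "status": {"phase": "Running",
                    "containerStatuses": [
                        {"state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}},
        {"metadata": {"namespace": "default", "name": "manager-2"},
         "status": {"phase": "Pending"}},
    ]
}

for namespace, name, reason in unhealthy_pods(sample):
    print(f"{namespace}/{name}: {reason}")
```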

Common Issues

Pods Stuck in Pending State

Symptoms: Pods remain in Pending state indefinitely.

Causes:

Insufficient cluster resources (CPU/memory)
No nodes match scheduling constraints
PersistentVolume not available

Diagnosis:

# Describe the pending pod
kubectl describe pod <pod-name>

# Check events for scheduling failures
kubectl get events --field-selector reason=FailedScheduling

# Check node capacity
kubectl describe nodes | grep -A 5 "Allocated resources"

# Check available PVs
kubectl get pv

Resolution:

# Free up resources by scaling down non-critical workloads
kubectl scale deployment <deployment> --replicas=0

# Or add additional nodes to the cluster

# If PV is stuck, delete and recreate
kubectl delete pvc <pvc-name>
kubectl delete pod <pod-name>

Pods Stuck in ContainerCreating

Symptoms: Pods remain in ContainerCreating state.

Causes:

Image pull failures
Volume mount issues
Network configuration problems

Diagnosis:

kubectl describe pod <pod-name>

# Check for image pull errors
kubectl get events | grep -i "failed to pull"

# Check volume mount status
kubectl get events | grep -i "mount"

Resolution:

# For image pull issues, verify image exists and credentials
kubectl get secret <pull-secret-name> -o yaml

# For volume issues, check Longhorn volume status
kubectl get volumes -n longhorn-system

# Delete stuck pod to trigger recreation
kubectl delete pod <pod-name> --force --grace-period=0

Persistent Volume Mount Failures

Symptoms: Pod fails to start with error “AttachVolume.Attach failed for volume… is not ready for workloads” or similar volume attachment errors.

Causes:

Longhorn volume created but unable to be successfully mounted
Network connectivity issues between nodes (Longhorn requires iSCSI and NFS traffic)
Longhorn service unhealthy
Incorrect storage class configuration

Diagnosis:

# Describe the failing pod to see the error
kubectl describe pod <pod-name>

# Check Longhorn volumes status
kubectl get volumes -n longhorn-system

# Check Longhorn UI for detailed volume status
kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80
# Access: http://localhost:8080

Resolution:

# Verify firewall allows Longhorn traffic between nodes
# Ports 9500 and 8500 must be open (see Networking Guide)

# Check Longhorn is healthy
kubectl get pods -n longhorn-system

# If volume is stuck, delete PVC and pod to trigger recreation
kubectl delete pvc <pvc-name>
kubectl delete pod <pod-name>

Pods in CrashLoopBackOff

Symptoms: Pods repeatedly crash and restart.

Causes:

Application configuration errors
Missing dependencies (database not ready)
Resource limits too low
Liveness probe failures

Diagnosis:

# View current logs
kubectl logs <pod-name>

# View previous instance logs
kubectl logs <pod-name> -p

# Describe pod for restart reasons
kubectl describe pod <pod-name>

# Check if dependencies are healthy
kubectl get pods | grep -E "(postgres|kafka|redis)"

Resolution:

# For dependency issues, wait for dependencies to be ready
kubectl wait --for=condition=Ready pod/<dependency-pod> --timeout=300s

# For resource issues, increase limits
kubectl edit deployment <deployment-name>

# For configuration issues, check ConfigMaps and Secrets
kubectl get configmap <configmap-name> -o yaml
kubectl get secret <secret-name> -o yaml

# Restart the deployment
kubectl rollout restart deployment/<deployment-name>

Pods in Terminating State

Symptoms: Pods stuck in Terminating state indefinitely.

Causes:

Volume detachment issues
Node communication problems
Finalizer blocking deletion

Diagnosis:

kubectl describe pod <pod-name>

# Check if node is reachable
kubectl get nodes

# Check finalizers
kubectl get pod <pod-name> -o jsonpath='{.metadata.finalizers}'

Resolution:

# Force delete the pod
kubectl delete pod <pod-name> --force --grace-period=0

# If node is unreachable, drain and remove from cluster
kubectl drain <node-name> --ignore-daemonsets --force
kubectl delete node <node-name>

Service Unreachable

Symptoms: Service endpoints not accessible.

Causes:

No ready pods backing the service
Network policy blocking traffic
Service port mismatch

Diagnosis:

# Check service endpoints
kubectl get endpoints <service-name>

# Check if pods are ready
kubectl get pods -l app=<label>

# Check network policies
kubectl get networkpolicies

# Test connectivity from within cluster
kubectl run test --rm -it --image=busybox -- wget -O- <service-name>:<port>

Resolution:

# Ensure pods are ready and matching service selector
kubectl get pods --show-labels

# Check service selector matches pod labels
kubectl get service <service-name> -o jsonpath='{.spec.selector}'

# Temporarily disable network policy for testing
kubectl edit networkpolicy <policy-name>
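
A service only gets endpoints for pods whose labels contain every key/value pair in its selector. The following Python sketch expresses that rule so the outputs of the two commands above can be compared programmatically. The function name and the label values are illustrative.

```python
def selector_matches(selector, pod_labels):
    """Return True when every selector key/value pair is present in the pod labels."""
    # A service without a selector does not get endpoints created automatically
    if not selector:
        return False
    return all(pod_labels.get(key) == value for key, value in selector.items())


selector = {"app": "manager"}
print(selector_matches(selector, {"app": "manager", "tier": "api"}))  # True
print(selector_matches(selector, {"app": "manager-old"}))             # False
```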

Ingress Not Working

Symptoms: External access via ingress fails.

Causes:

Traefik ingress controller not running
Ingress configuration errors
TLS certificate issues
DNS resolution problems

Diagnosis:

# Check Traefik pods
kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik

# Check ingress resources
kubectl get ingress

# Describe ingress for errors
kubectl describe ingress <ingress-name>

# Check Traefik logs
kubectl logs -n kube-system -l app.kubernetes.io/name=traefik

# Test DNS resolution
nslookup <hostname>

Resolution:

# Restart Traefik
kubectl rollout restart deployment -n kube-system traefik

# Fix ingress configuration
kubectl edit ingress <ingress-name>

# Renew or recreate TLS secret
kubectl create secret tls <secret-name> --cert=tls.crt --key=tls.key \
  --dry-run=client -o yaml | kubectl apply -f -

# Verify hostname matches certificate
openssl x509 -in tls.crt -noout -subject -issuer

Database Connection Failures

Symptoms: Application cannot connect to PostgreSQL.

Causes:

PostgreSQL cluster not ready
Connection pool exhausted
Network connectivity issues
Authentication failures

Diagnosis:

# Check PostgreSQL cluster status
kubectl get clusters

# Check PostgreSQL pods
kubectl get pods -l app.kubernetes.io/name=postgresql

# Check PostgreSQL logs
kubectl logs -l app.kubernetes.io/name=postgresql

# Test connectivity
kubectl exec -it <app-pod> -- psql -h <postgres-service> -U <user> -d <database>

Resolution:

# Wait for PostgreSQL to be ready
kubectl wait --for=condition=Ready pod -l app.kubernetes.io/name=postgresql --timeout=300s

# Check connection string in application config (decode one key at a time)
kubectl get secret <secret-name> -o jsonpath='{.data.<key>}' | base64 -d

# Restart application pods
kubectl rollout restart deployment/<deployment-name>

Kafka Connection Issues

Symptoms: Application cannot connect to Kafka.

Causes:

Kafka controllers not ready
Topic not created
Network connectivity issues

Diagnosis:

# Check Kafka pods
kubectl get pods -l app.kubernetes.io/name=kafka

# Check Kafka logs
kubectl logs -l app.kubernetes.io/name=kafka

# List topics
kubectl exec -it <kafka-pod> -- kafka-topics.sh --bootstrap-server localhost:9092 --list

Resolution:

# Wait for Kafka controllers to be ready
kubectl wait --for=condition=Ready pod -l app.kubernetes.io/name=kafka --timeout=300s

# Create missing topic
kubectl exec -it <kafka-pod> -- kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic <topic-name> --partitions 3 --replication-factor 3

# Restart application to reconnect
kubectl rollout restart deployment/<deployment-name>
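
The list-then-create steps above can be combined so that the topic is created only when it is missing. This is a minimal sketch under stated assumptions: the helper name `ensure_kafka_topic` is hypothetical, the broker listens on `localhost:9092` inside the pod as in the commands above, and `-it` is dropped because a script usually has no TTY.

```shell
# ensure_kafka_topic <kafka-pod> <topic> [partitions] [replication-factor]
# Creates the topic only if `--list` does not already show it.
# Hypothetical helper; defaults mirror the example above (3 and 3).
ensure_kafka_topic() {
  pod="$1"; topic="$2"; parts="${3:-3}"; repl="${4:-3}"
  if kubectl exec "$pod" -- kafka-topics.sh \
       --bootstrap-server localhost:9092 --list | grep -qxF "$topic"; then
    echo "topic $topic already exists"
  else
    kubectl exec "$pod" -- kafka-topics.sh \
      --bootstrap-server localhost:9092 \
      --create --topic "$topic" \
      --partitions "$parts" --replication-factor "$repl"
  fi
}
```

The replication factor cannot exceed the number of running brokers, so if only one broker is running, pass `1` as the fourth argument.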

Redis Connection Issues

Symptoms: Application cannot connect to Redis.

Diagnosis:

# Check Redis pods
kubectl get pods -l app.kubernetes.io/name=redis

# Check Redis logs
kubectl logs -l app.kubernetes.io/name=redis

# Test connectivity
kubectl exec -it <redis-pod> -- redis-cli ping

Resolution:

# Wait for Redis to be ready
kubectl wait --for=condition=Ready pod -l app.kubernetes.io/name=redis --timeout=300s

# Restart application
kubectl rollout restart deployment/<deployment-name>
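
A pod can report Ready slightly before Redis answers commands, so a short polling loop around the `redis-cli ping` test is sometimes useful before restarting the application. The sketch below assumes an unauthenticated `redis-cli` inside the pod; the helper name `wait_for_redis` and its defaults are illustrative.

```shell
# wait_for_redis <redis-pod> [attempts] [delay-seconds]
# Polls `redis-cli ping` until it answers PONG or the attempts run out.
# Hypothetical helper; returns 0 on success, 1 on timeout.
wait_for_redis() {
  pod="$1"; attempts="${2:-10}"; delay="${3:-3}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    reply="$(kubectl exec "$pod" -- redis-cli ping 2>/dev/null)" || reply=""
    if [ "$reply" = "PONG" ]; then
      echo "redis ready after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "redis not ready after $attempts attempts" >&2
  return 1
}
```

For example, `wait_for_redis <redis-pod> 20 5` polls for up to roughly 100 seconds.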

High Memory Usage

Symptoms: Pods approaching or hitting memory limits.

Diagnosis:

# Check memory usage
kubectl top pods

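# Check for OOMKilled containers (last termination reason per pod)
kubectl get pods -o custom-columns='NAME:.metadata.name,LAST_REASON:.status.containerStatuses[*].lastState.terminated.reason'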

# Check for memory leaks in logs
kubectl logs <pod-name> | grep -i "memory\|oom"

Resolution:

# Temporarily increase memory limit
kubectl edit deployment <deployment-name>

# Or scale horizontally (an active HPA overrides a manual replica count)
kubectl scale deployment <deployment-name> --replicas=<n>

# Long-term: Update values.yaml and perform helm upgrade
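
To spot pods approaching their limits, the `kubectl top pods` output can be filtered against a threshold. This is a sketch, assuming memory is reported in Mi (the unit `kubectl top` normally prints) and that the metrics API is available; the helper name `pods_over_memory` is hypothetical.

```shell
# pods_over_memory <threshold-Mi>
# Prints "name usage" for each pod whose memory usage exceeds the threshold.
# Hypothetical helper; parses the third column of `kubectl top pods`.
pods_over_memory() {
  kubectl top pods --no-headers |
    awk -v limit="$1" '{ mem = $3; sub(/Mi$/, "", mem);
                         if (mem + 0 > limit + 0) print $1, $3 }'
}
```

For example, `pods_over_memory 512` lists pods using more than 512 Mi; pick a threshold relative to the limits configured in values.yaml.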

High CPU Usage

Symptoms: Pods consistently using high CPU.

Diagnosis:

# Check CPU usage
kubectl top pods

# Sort by CPU to find the heaviest consumers
kubectl top pods --sort-by=cpu

Resolution:

# Scale horizontally (an active HPA overrides a manual replica count)
kubectl scale deployment <deployment-name> --replicas=<n>

# Or increase CPU limits
kubectl edit deployment <deployment-name>

Persistent Volume Issues

Symptoms: PVC not binding or volume errors.

Diagnosis:

# Check PVC status
kubectl get pvc

# Check PV status
kubectl get pv

# Check Longhorn volumes (fully qualified resource name avoids ambiguity)
kubectl get volumes.longhorn.io -n longhorn-system

# Check Longhorn UI for details
kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80

Resolution:

# For a stuck PVC, delete and recreate it
# (deleting a PVC can also delete its volume and data, depending on the reclaim policy)
kubectl delete pvc <pvc-name>
kubectl delete pod <pod-name>

# For Longhorn issues, check Longhorn UI
# Access via http://localhost:8080

# Recreate Longhorn volume if necessary
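
When many claims exist, filtering the `kubectl get pvc` output shows only those that have not bound. A minimal sketch, assuming the default column layout (NAME, STATUS, ...); the helper name `list_unbound_pvcs` is illustrative.

```shell
# list_unbound_pvcs
# Prints "name status" for each PVC whose STATUS column is not "Bound".
# Hypothetical helper; operates on the current namespace.
list_unbound_pvcs() {
  kubectl get pvc --no-headers | awk '$2 != "Bound" { print $1, $2 }'
}
```

Running `kubectl describe pvc <pvc-name>` on each result shows the provisioning events that explain why the claim is still pending.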

Zitadel Authentication Failures

Symptoms: Users cannot authenticate via Zitadel.

Causes:

CORS configuration mismatch
External domain misconfigured
Zitadel pods not healthy

Diagnosis:

# Check Zitadel pods
kubectl get pods -l app.kubernetes.io/name=zitadel

# Check Zitadel logs
kubectl logs -l app.kubernetes.io/name=zitadel

# Verify external domain configuration
helm get values acd-manager -o yaml | grep -A 5 zitadel

Resolution:

# Ensure global.hosts.manager[0].host matches zitadel.zitadel.ExternalDomain
# Update values.yaml if needed

helm upgrade acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml

# Restart Zitadel
kubectl rollout restart deployment -l app.kubernetes.io/name=zitadel

Certificate Errors

Symptoms: TLS/SSL errors in browser or API calls.

Diagnosis:

# Check certificate expiration
kubectl get secret <tls-secret> -o jsonpath='{.data.tls\.crt}' | base64 -d | \
  openssl x509 -noout -dates

# Check certificate subject
kubectl get secret <tls-secret> -o jsonpath='{.data.tls\.crt}' | base64 -d | \
  openssl x509 -noout -subject -issuer

Resolution:

# Renew self-signed certificate
helm upgrade acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml \
  --set ingress.selfSigned=true

# Or update manual certificate
kubectl create secret tls <secret-name> \
  --cert=new-cert.crt --key=new-key.key \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart pods to pick up new certificate
kubectl rollout restart deployment <deployment-name>
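
The expiration check above can also be scripted with the `-checkend` option of `openssl x509`, which exits non-zero when the certificate expires within the given number of seconds. This is a sketch only; the helper name `cert_expires_within` is an assumption, and it expects the certificate under the `tls.crt` key as in the commands above.

```shell
# cert_expires_within <tls-secret> <days>
# Returns 0 and prints a warning if the certificate in the secret expires
# within the given number of days (or cannot be read); returns 1 otherwise.
cert_expires_within() {
  secs=$(( $2 * 86400 ))
  if kubectl get secret "$1" -o jsonpath='{.data.tls\.crt}' | base64 -d |
       openssl x509 -noout -checkend "$secs" >/dev/null; then
    return 1
  fi
  echo "certificate in $1 expires within $2 days (or could not be read)"
}
```

For example, `cert_expires_within <tls-secret> 30` can run from a scheduled job to flag certificates that need renewal.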

Log Collection

Collecting Logs for Support

# Capture timestamp once to ensure consistency
TS=$(date +%Y%m%d-%H%M%S)

# Create log collection directory
mkdir -p ~/cdn-logs-$TS
cd ~/cdn-logs-$TS

# Collect pod logs (current and previous container instances)
for pod in $(kubectl get pods -o name); do
  kubectl logs "$pod" > "${pod#pod/}.log" 2>&1
  kubectl logs "$pod" -p > "${pod#pod/}.previous.log" 2>&1 || true
done

# Collect cluster events
kubectl get events --sort-by='.lastTimestamp' > events.log

# Collect pod descriptions
for pod in $(kubectl get pods -o name); do
  kubectl describe "$pod" > "${pod#pod/}.describe.txt"
done

# Compress for transfer
tar czf "cdn-logs-$TS.tar.gz" *.log *.txt

Emergency Procedures

Complete Cluster Recovery

If the cluster is completely down:

Assess node status:
```
kubectl get nodes
```
Restart K3s on nodes:
```
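# On each server node
systemctl restart k3s
# On each agent node (the standard K3s install names the agent service k3s-agent)
systemctl restart k3s-agent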
```
If primary server failed:
- Promote another server node
- Update load balancer/DNS to point to new primary
Restore from backup if necessary:
- See Upgrade Guide for restore procedures
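
Once the API server responds again, the "Assess node status" step can be narrowed to the nodes that still need attention. A minimal sketch, assuming the default `kubectl get nodes` columns (NAME, STATUS, ...); the helper name `list_unready_nodes` is hypothetical.

```shell
# list_unready_nodes
# Prints "name status" for each node whose STATUS does not start with Ready.
# Cordoned nodes (Ready,SchedulingDisabled) are treated as ready.
list_unready_nodes() {
  kubectl get nodes --no-headers | awk '$2 !~ /^Ready/ { print $1, $2 }'
}
```

An empty result means every node reports Ready; otherwise restart K3s on the listed nodes first.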

Data Recovery

For data recovery scenarios:

PostgreSQL: Use Cloudnative PG backup/restore
Longhorn: Restore from volume snapshots
Kafka: Replication handles most failures

Getting Help

If issues persist:

Collect logs using the procedure above
Check release notes for known issues
Contact support with log bundle and issue description

Next Steps

After resolving issues:

Operations Guide - Preventive maintenance procedures
Configuration Guide - Verify configuration is correct
Architecture Guide - Understand component dependencies

12 - Glossary

Terminology and definitions

Overview

This glossary defines key terms and acronyms used throughout the AgileTV CDN Manager (ESB3027) documentation.

A

ACD (Agile Content Delivery)

The overall CDN solution comprising the Manager (ESB3027) and Director (ESB3024) components.

Agent Node

A Kubernetes node that runs workloads but does not participate in the control plane. Agent nodes provide additional capacity for running application pods.

API Gateway

See NGinx Gateway.

ASN (Autonomous System Number)

A unique identifier for a network on the internet. Used in GeoIP-based routing decisions.

C

CDN Director

The Edge Server Business (ESB3024) component that handles actual content routing and delivery. Multiple Directors can be managed by a single CDN Manager.

Cloudnative PG (CNPG)

A Kubernetes operator that manages PostgreSQL clusters. Provides high availability, automatic failover, and backup capabilities for the Manager’s database layer.

Confd

Configuration daemon that synchronizes configuration from the Manager to CDN Directors. Runs as a sidecar or separate deployment.

CORS (Cross-Origin Resource Sharing)

A security mechanism that allows web applications to make requests to a different domain. Zitadel enforces CORS policies requiring the external domain to match the configured hostname.

CrashLoopBackOff

A Kubernetes pod state indicating the container is repeatedly crashing and being restarted. Typically indicates a configuration or dependency issue.

D

Datastore

The internal key-value storage system used by the Manager for short-lived or simple structured data. Backed by Redis.

Descheduler

A Kubernetes component that periodically analyzes pod distribution and evicts pods from overutilized nodes to optimize cluster balance.

Director

See CDN Director.

E

EDB (EnterpriseDB)

A company that provides PostgreSQL-related software and services. The Cloudnative PG operator was originally developed by EDB.

Ephemeral Storage

Temporary storage available to pods. Used for temporary files and caches. Not persistent across pod restarts.

ESB (Edge Server Business)

The product family designation for CDN components. ESB3027 is the Manager, ESB3024 is the Director.

etcd

A distributed key-value store used by Kubernetes for cluster state management. Runs on Server nodes as part of the control plane.

F

FailedScheduling

A Kubernetes event indicating a pod could not be scheduled due to insufficient resources or scheduling constraints.

Flannel

A network overlay solution for Kubernetes. Provides VXLAN-based networking for pod-to-pod communication.

Frontend GUI

See MIB Frontend.

G

GeoIP

Geographic IP lookup service using MaxMind databases. Used for location-based routing decisions.

Grafana

A visualization and dashboard platform for time-series data. Used to display metrics collected by Telegraf and stored in VictoriaMetrics.

H

Helm Chart

A package of pre-configured Kubernetes resources. The CDN Manager is deployed via a Helm chart that handles all component installation.

HPA (Horizontal Pod Autoscaler)

A Kubernetes feature that automatically scales the number of pods based on CPU/memory utilization or custom metrics.

HTTP Server

The main API server component of the Manager, built with Actix Web (Rust framework).

I

Ingress

A Kubernetes resource that exposes HTTP/HTTPS routes from outside the cluster to services within. The CDN Manager uses Traefik as the ingress controller.

Ingress Controller

A component that implements ingress rules. The CDN Manager uses Traefik for primary ingress and NGinx for external Director communication.

K

Kafka

A distributed event streaming platform used by the Manager for asynchronous communication and event processing.

K3s

A lightweight Kubernetes distribution optimized for edge and production deployments. Used as the underlying cluster technology.

Kubernetes (K8s)

An open-source container orchestration platform. The CDN Manager runs on a K3s-based Kubernetes cluster.

L

Longhorn

A distributed block storage system for Kubernetes. Provides persistent volumes for stateful components like PostgreSQL and Kafka.

Liveness Probe

A Kubernetes health check that determines if a container is running properly. Failed liveness probes trigger container restart.

M

Manager

The central management component (ESB3027) for configuring and monitoring CDN Directors.

MaxMind

A provider of IP intelligence databases including GeoIP City, GeoLite2 ASN, and Anonymous IP databases used by the Manager.

MIB Frontend

The web-based configuration GUI for CDN operators. Provides a user interface for managing streams, routers, and other configuration.

Multi-Factor Authentication (MFA)

An authentication method requiring multiple forms of verification. Note: MFA is not currently supported in the CDN Manager and should be skipped during setup.

N

Name-based Virtual Hosting

A technique where multiple hostnames are served from the same IP address. Zitadel uses this for CORS validation.

Namespace

A Kubernetes abstraction for organizing cluster resources. The CDN Manager uses namespaces to group related components.

NGinx Gateway

An NGinx-based gateway that handles external communication with CDN Directors.

Node Token

A secret token used to authenticate new nodes joining a K3s cluster. Located at /var/lib/rancher/k3s/server/node-token on Server nodes.

O

Operator

A method of packaging, deploying, and managing a Kubernetes application. Cloudnative PG is an operator for PostgreSQL.

OOMKilled

A Kubernetes pod state indicating the container was terminated due to exceeding memory limits.

P

PDB (Pod Disruption Budget)

A Kubernetes feature that ensures a minimum number of pods remain available during voluntary disruptions like maintenance.

PersistentVolume (PV)

A piece of storage in the Kubernetes cluster. Created dynamically by Longhorn for stateful components.

PersistentVolumeClaim (PVC)

A request for storage by a pod. Bound to a PersistentVolume.

Pod

The smallest deployable unit in Kubernetes. Contains one or more containers.

PostgreSQL

An open-source relational database. Used by the Manager for persistent data storage, managed by Cloudnative PG.

Probe

A Kubernetes health check mechanism. Types include liveness, readiness, and startup probes.

Prometheus

An open-source monitoring and alerting toolkit. Telegraf exports metrics in Prometheus format.

R

RBAC (Role-Based Access Control)

A method of regulating access to resources based on user roles. Used by Kubernetes for authorization.

Readiness Probe

A Kubernetes health check that determines if a container is ready to receive traffic. Failed readiness probes remove the pod from service load balancing.

Redis

An in-memory data structure store used for caching and as the datastore backend for the Manager.

Replica

A copy of a pod. Multiple replicas provide high availability and load distribution.

Resource Preset

Predefined resource configurations (nano, micro, small, medium, large, xlarge, 2xlarge) for common deployment sizes.

Rolling Update

A deployment strategy that updates pods one at a time to maintain availability during upgrades.

S

Selection Input

A key-value storage mechanism used for configuration data that can be queried with wildcard patterns. Available in v1 and v2 APIs with different semantics.

Server Node

A Kubernetes node that participates in the control plane (etcd, API server). Can also run workloads unless tainted.

Service

A Kubernetes abstraction that defines a logical set of pods and a policy for accessing them. Provides stable networking endpoints.

ServiceAccount

A Kubernetes identity for processes running in pods. Used for authentication between Kubernetes components.

StatefulSet

A Kubernetes workload API object for managing stateful applications. Used for PostgreSQL and Kafka deployments.

Startup Probe

A Kubernetes health check that determines if a container application has started. Disables liveness and readiness checks until it succeeds.

Stream

A content stream configuration defining source and routing parameters.

T

Telegraf

An agent for collecting, processing, aggregating, and writing metrics. Runs on each node to gather system and application metrics.

TLS (Transport Layer Security)

A cryptographic protocol for secure communication. The CDN Manager uses TLS for all external HTTPS connections.

Topology Aware Hints

A Kubernetes feature that prefers routing traffic to pods in the same zone as the source. Reduces latency by keeping traffic local.

Traefik

A modern HTTP reverse proxy and ingress controller. Used as the primary ingress controller for the CDN Manager.

TTL (Time To Live)

The duration after which data expires. Used in the datastore and selection input APIs.

V

Values.yaml

The Helm chart configuration file. Contains all configurable parameters for the CDN Manager deployment.

VictoriaMetrics

A time-series database used for storing metrics data. Provides long-term storage and querying capabilities.

VXLAN

Virtual Extensible LAN. A network virtualization technology used by Flannel for pod networking.

Z

Zitadel

An identity and access management (IAM) platform used for authentication and authorization in the CDN Manager. Provides OAuth2/OIDC capabilities.

Default Credentials

The following table lists all default credentials used by the CDN Manager. Change these defaults before deploying to production.

Service	Username	Password	Notes
Zitadel Console	`admin@agiletv.dev`	`Password1!`	Primary identity management; accessed at `/ui/console`

Security Warning: Use the default admin@agiletv.dev account only to create a new administrator account with proper roles. After verifying the new account works, disable or delete the default admin account before exposing the system to users. For details on required roles and administrator permissions, see Zitadel’s Administrator Documentation. See the Next Steps Guide for initial configuration procedures.

Common Abbreviations

Abbreviation	Meaning
API	Application Programming Interface
ASN	Autonomous System Number
CORS	Cross-Origin Resource Sharing
CPU	Central Processing Unit
DNS	Domain Name System
EDB	EnterpriseDB
ESB	Edge Server Business
GUI	Graphical User Interface
HA	High Availability
Helm	Helm Package Manager
HPA	Horizontal Pod Autoscaler
HTTP	Hypertext Transfer Protocol
HTTPS	HTTP Secure
IAM	Identity and Access Management
IP	Internet Protocol
JSON	JavaScript Object Notation
K8s	Kubernetes
MFA	Multi-Factor Authentication
MIB	Management Information Base
NIC	Network Interface Card
OAuth	Open Authorization
OIDC	OpenID Connect
PVC	PersistentVolumeClaim
PV	PersistentVolume
RBAC	Role-Based Access Control
SSL	Secure Sockets Layer
TCP	Transmission Control Protocol
TLS	Transport Layer Security
TTL	Time To Live
UDP	User Datagram Protocol
UI	User Interface
VPA	Vertical Pod Autoscaler
VXLAN	Virtual Extensible LAN

Next Steps

After reviewing terminology:

Architecture Guide - Understand component relationships
Configuration Guide - Full configuration reference
Operations Guide - Day-to-day operational procedures
