1 - Getting Started

Introduction to AgileTV CDN Manager

Overview

The AgileTV CDN Manager (product code ESB3027) is a cloud-native control plane for managing CDN deployments. It provides centralized orchestration for authentication, configuration, routing, and metrics collection across CDN infrastructure.

Before You Start:

  • Deployment type: Lab (single-node) or Production (multi-node)? See Installation Guide
  • Hardware: Nodes meeting specifications for your deployment type
  • OS: RHEL 9 or compatible clone (Oracle Linux, AlmaLinux, Rocky Linux)
  • Software: Installation ISO from AgileTV customer portal; Extras ISO for air-gapped
  • Network: Firewall ports configured per Networking Guide

Deployment Models

Deployment Model | Description | Typical Use Case
Self-Hosted | K3s Kubernetes cluster on customer premises | Production deployments
Lab/Single-Node | Minimal single-node installation | Acceptance testing, demonstrations, development

Functionality remains consistent across deployment models.

Prerequisites

  • Installation ISO: Obtain esb3027-acd-manager-X.Y.Z.iso from AgileTV customer portal
  • Extras ISO (air-gapped): Obtain esb3027-acd-manager-extras-X.Y.Z.iso for offline installations
  • OS: RHEL 9 or compatible clone (Oracle Linux, AlmaLinux, Rocky Linux)
  • Kubernetes familiarity: Basic understanding of pods, deployments, and Helm charts

For detailed hardware, network, and operating system requirements, see the System Requirements Guide.

Installation

Ready to install? The Installation Guide provides step-by-step procedures for both lab and production deployments:

  • Lab/Single-Node: Quick deployment for testing and demonstrations
  • Production/Multi-Node: High-availability cluster with 3+ nodes

See the Installation Guide to get started.

Accessing the System

After a successful deployment, the following interfaces are available:

Service | URL Path | Authentication
MIB Frontend | /gui | Zitadel SSO
API Gateway | /api | Bearer token
Zitadel Console | /ui/console | See Glossary
Grafana | /grafana | See Glossary

All services are accessed via https://<cluster-host><path>.

Note: A self-signed SSL certificate is deployed by default. When accessing services through a browser, you will need to accept the self-signed certificate warning. For production deployments, configure a valid SSL certificate before exposing the system to users.

Initial user configuration is performed through Zitadel. Refer to the Configuration Guide for authentication setup procedures. For detailed guidance on managing users, roles, and permissions in the Zitadel Console, see Zitadel’s User Management Documentation.

Documentation Navigation

The following guides provide detailed information for specific operational tasks:

Guide | Description
System Requirements | Hardware, operating system, and network specifications
Architecture | Detailed system architecture and scaling guidance
Installation | Step-by-step installation and upgrade procedures
Configuration | System configuration and customization
Performance Tuning | Optimization tips for improved performance
API Guide | REST API reference and integration examples
Operations | Day-to-day operational procedures
Metrics & Monitoring | Monitoring dashboards and alerting configuration
Troubleshooting | Common issues and resolution procedures
Glossary | Definitions of technical terms
Release Notes | Version-specific changes and known issues

2 - System Requirements Guide

Hardware, operating system, and networking requirements

Overview

This document specifies the hardware, operating system, and networking requirements for deploying the AgileTV CDN Manager (ESB3027). Requirements vary based on deployment type and node role within the cluster.

Cluster Sizing

Production Deployments

Production deployments require a minimum of three nodes to achieve high availability. The cluster architecture employs distinct node roles:

Role | Description
Server Node (Control Plane Only) | Runs control plane components (etcd, Kubernetes API server) only; does not host application workloads; requires separate Agent nodes
Server Node (Combined) | Runs control plane components and hosts application workloads; default configuration
Agent Node | Executes application workloads only; does not participate in cluster quorum

Server nodes can be deployed in either Control Plane Only or Combined role configurations. The choice depends on your deployment requirements:

  • Control Plane Only: Dedicated control plane nodes with lower resource requirements; requires separate Agent nodes for workloads
  • Combined: Server nodes run both control plane and workloads; minimum 3 nodes required for HA

Why Use Control Plane Only Nodes?

Dedicated Control Plane Only nodes provide several benefits for larger deployments:

  • Resource Isolation: Control plane components (etcd, API server, scheduler) run on dedicated hardware without competing with application workloads for CPU and memory
  • Stability: Application workload spikes or misbehaving pods cannot impact control plane performance
  • Security: Smaller attack surface on control plane nodes; fewer containers and services running
  • Predictable Performance: Control plane responsiveness remains consistent regardless of application load
  • Flexible Sizing: Control Plane Only nodes can use lower-specification hardware (2 cores, 4 GiB) since they don’t run application workloads

For most small to medium deployments, Combined role servers are simpler and more cost-effective. Control Plane Only nodes are recommended for larger deployments with significant workload requirements or where control plane stability is critical.

High Availability Considerations

Production deployments require 3 nodes running control plane (etcd) and 3 nodes capable of running workloads. These can be the same nodes (Combined role) or separate nodes (CP-Only + Agent).

Node Role Combinations:

Configuration | Control Plane Nodes | Workload Nodes | Total Nodes
All Combined | 3 Combined servers | 3 Combined servers | 3
Separated | 3 CP-Only servers | 3 Agent nodes | 6
Hybrid | 2 CP-Only + 1 Combined | 1 Combined + 2 Agent | 5

Any combination works as long as you have 3 control plane nodes and 3 workload-capable nodes.

Note: Regardless of the deployment configuration, a minimum of 3 nodes capable of running workloads is required for production deployments. This ensures both high availability and sufficient capacity for application pods.

For detailed fault tolerance information and data replication strategies, see the Architecture Guide.

Hardware Requirements

Single-Node Lab Deployment

Lab deployments are intended for acceptance testing, demonstrations, and development only. These configurations are not suitable for production workloads.

Resource | Minimum | Recommended
CPU | 8 cores | 12 cores
Memory | 16 GiB | 24 GiB
Disk* | 128 GiB | 128 GiB

Production Cluster - Server Node (Control Plane Only)

Server nodes dedicated to control plane functions have modest resource requirements:

Resource | Minimum | Recommended
CPU | 2 cores | 4 cores
Memory | 4 GiB | 8 GiB
Disk* | 64 GiB | 128 GiB

These nodes run only control plane components and require separate Agent nodes to run application workloads.

Production Cluster - Server Node (Control Plane + Workloads)

Combined role nodes require resources for both control plane and application workloads:

Resource | Minimum | Recommended
CPU | 16 cores | 24 cores
Memory | 32 GiB | 48 GiB
Disk* | 256 GiB | 256 GiB

Production Cluster - Agent Node

Agent nodes execute application workloads and require the following resources:

Resource | Minimum | Recommended
CPU | 4 cores | 8 cores
Memory | 6 GiB | 16 GiB
Disk* | 64 GiB | 128 GiB

Storage Notes

* Disk Space: All disk space values must be available in the /var/lib/longhorn partition. It is recommended that /var/lib/longhorn be a separate partition on a fast SSD for optimal performance, though SSD is not strictly required.

Longhorn Capacity: Longhorn storage requires an additional 30% capacity headroom for internal operations and scaling. If less than 30% of the total partition capacity is available, Longhorn may mark volumes as “full” and prevent further writes. Plan disk capacity accordingly.
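The 30% headroom rule translates directly into partition sizing arithmetic. The sketch below computes usable capacity for a hypothetical 256 GiB /var/lib/longhorn partition (the partition size is an example value, not a requirement):

```shell
# Sketch: usable Longhorn capacity after reserving 30% headroom.
# partition_gib is a hypothetical example; substitute your partition size.
partition_gib=256
headroom_pct=30
usable_gib=$(( partition_gib * (100 - headroom_pct) / 100 ))
echo "Usable: ${usable_gib} GiB (keep $(( partition_gib - usable_gib )) GiB free for Longhorn)"
```

If planned volume capacity exceeds the usable figure, increase the partition size rather than relying on the headroom.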

Storage Performance

For optimal performance, the following storage characteristics are recommended:

  • Disk Type: SSD or NVMe storage for Longhorn volumes
  • Filesystem: XFS or ext4 with default mount options
  • Partition Layout: Dedicated /var/lib/longhorn partition for persistent storage

Virtual machines and bare-metal hardware are both supported. Nested virtualization (running multiple nodes under a single hypervisor) may impact performance and is not recommended for production deployments.

Operating System Requirements

Supported Operating Systems

The CDN Manager supports Red Hat Enterprise Linux and compatible distributions:

Operating System | Status
Red Hat Enterprise Linux 9 | Supported
Red Hat Enterprise Linux 10 | Untested
Red Hat Enterprise Linux 8 | Not supported

Compatible Clones

The following RHEL-compatible distributions are supported when major version requirements are satisfied:

  • Oracle Linux 9
  • AlmaLinux 9
  • Rocky Linux 9

Air-Gapped Deployments

Important: For air-gapped deployments (no internet access), the OS installation ISO must be mounted on all nodes before running the installer or join commands. The installer needs to install one or more packages from the distribution’s repository.

Oracle Linux UEK Kernel

Note: For Oracle Linux 9.7 and later using the Unbreakable Enterprise Kernel (UEK), you must install the kernel-uek-modules-extra-netfilter-$(uname -r) package before running the installer:

# Mount OS ISO first (required for air-gapped)
mount -o loop /path/to/oracle-linux-9.iso /mnt/iso

# Install required kernel modules
dnf install kernel-uek-modules-extra-netfilter-$(uname -r)

This package provides netfilter kernel modules required by K3s and Longhorn.

SELinux

SELinux is supported when running in “Enforcing” mode. The installation process configures the appropriate SELinux policies automatically.

Networking Requirements

Network Interface

Each cluster node must have at least one network interface card (NIC) configured with a default route. If the node lacks a pre-configured default route, one must be established prior to installation.

Port Requirements

The cluster requires the following network connectivity:

Category | Ports | Purpose
Inter-Node | 2379-2380, 6443, 8472/UDP, 10250, 5001, 9500, 8500 | etcd, API server, Flannel VXLAN, Kubelet, Spegel, Longhorn
External Access | 80, 443 | HTTP redirect and HTTPS ingress
Application (optional) | 6379, 8125 TCP/UDP, 9093, 9095 | Redis, Telegraf, Alertmanager, Kafka external

Important: Complete port requirements, network ranges, and firewall configuration procedures are provided in the Networking Guide. Do not expose VictoriaMetrics (8428, 8429), Grafana (3000), or PostgreSQL (5432) directly—access these services only through the secure HTTPS ingress (port 443).

Resource Planning

Calculating Cluster Capacity

When planning cluster capacity, consider the following factors:

  1. Base Overhead: Kubernetes system components consume approximately 1-2 cores and 2-4 GiB memory per node
  2. Application Workloads: Refer to individual component resource requirements in the Architecture Guide
  3. Headroom: Maintain 20-30% resource headroom for workload spikes and automatic scaling
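These three factors combine into a rough per-node capacity estimate. The sketch below uses hypothetical inputs (a 16-core, 32 GiB node, the upper-bound overhead of 2 cores / 4 GiB, and 25% headroom); substitute your own node specification:

```shell
# Sketch: estimate per-node capacity left for application workloads.
# All inputs are hypothetical example values.
node_cores=16; node_mem_gib=32     # node size
overhead_cores=2; overhead_mem=4   # Kubernetes base overhead (upper bound)
headroom_pct=25                    # reserved for spikes and autoscaling
usable_cores=$(( (node_cores - overhead_cores) * (100 - headroom_pct) / 100 ))
usable_mem=$(( (node_mem_gib - overhead_mem) * (100 - headroom_pct) / 100 ))
echo "Workload capacity per node: ${usable_cores} cores, ${usable_mem} GiB"
```

Compare the result against the component requirements in the Architecture Guide when deciding node counts.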

Scaling Considerations

The CDN Manager supports horizontal scaling for most components. The Horizontal Pod Autoscaler (HPA) can automatically adjust replica counts based on resource utilization. Detailed scaling guidance is available in the Architecture Guide.

Example Production Deployment

A minimal production deployment with 3 server nodes (combined role) and 2 agent nodes would require:

Node Type | Count | CPU Total | Memory Total | Disk Total
Server (Combined) | 3 | 48 cores | 96 GiB | 768 GiB
Agent | 2 | 8 cores | 12 GiB | 128 GiB
Total | 5 | 56 cores | 108 GiB | 896 GiB

This configuration provides:

  • High availability (survives loss of 1 server node)
  • Capacity for application workloads across all nodes
  • Headroom for horizontal scaling

Next Steps

After verifying system requirements:

  1. Review the Installation Guide for deployment procedures
  2. Consult the Networking Guide for firewall configuration
  3. Examine the Architecture Guide for component resource requirements

3 - Networking Guide

Firewall configuration and network architecture

Overview

This guide describes the network architecture and firewall configuration requirements for the AgileTV CDN Manager (ESB3027). Proper network configuration is essential for cluster communication and external access to services.

Note: The installer script automatically detects if firewalld is enabled. If so, it will verify that the required inter-node ports are open through the firewall in the default zone before proceeding. If any required ports are missing, the installer will report an error and exit. Application service ports (such as Kafka, VictoriaMetrics, and Telegraf) are not checked by the installer as they are configurable.

Network Architecture

Physical Network

Each cluster node must have at least one network interface card (NIC) configured with a default route. If the node lacks a pre-configured default route, one must be established prior to installation.

Overlay Network

Kubernetes creates virtual network interfaces for pods that are typically not associated with any specific firewalld zone. The cluster uses the following network ranges:

Network | CIDR | Purpose
Pod Network | 10.42.0.0/16 | Inter-pod communication
Service Network | 10.43.0.0/16 | Kubernetes service discovery

Firewall rules should target the primary physical interface. The overlay network traffic is handled by Flannel VXLAN.

IP Routing

Proper IP routing is critical for cluster communication. Ensure your network infrastructure allows routing between all subnets used by the cluster.

Port Requirements

Inter-Node Communication

The following ports must be permitted between all cluster nodes for Kubernetes and cluster infrastructure:

Port | Protocol | Source | Destination | Purpose
2379-2380 | TCP | Server nodes | Server nodes | etcd cluster communication
6443 | TCP | All nodes | Server nodes | Kubernetes API server
8472 | UDP | All nodes | All nodes | Flannel VXLAN overlay network
10250 | TCP | All nodes | All nodes | Kubelet metrics and management
5001 | TCP | All nodes | Server nodes | Spegel registry mirror
9500-9503 | TCP | All nodes | All nodes | Longhorn management API
8500-8504 | TCP | All nodes | All nodes | Longhorn agent communication
10000-30000 | TCP | All nodes | All nodes | Longhorn data replication
3260 | TCP | All nodes | All nodes | Longhorn iSCSI
2049 | TCP | All nodes | All nodes | Longhorn RWX (NFS)

Application Services Ports

The following ports must be accessible for application services within the cluster:

Port | Protocol | Service
6379 | TCP | Redis
9092 | TCP | Kafka (internal cluster communication)
9093 | TCP | Kafka (controller)
9094 | TCP | Kafka (internal)
9095 | TCP | Kafka (external client connections)
8428 | TCP | VictoriaMetrics (Analytics)
8880 | TCP | VictoriaMetrics (Alerting)
8429 | TCP | VictoriaMetrics (Billing)
9093 | TCP | Alertmanager
8125 | TCP/UDP | Telegraf (metrics collection)
8080 | TCP | Telegraf (API/Metrics)
8086 | TCP | Telegraf (API/Metrics)

External Access Ports

The following ports must be accessible from external clients to cluster nodes:

Port | Protocol | Service
80 | TCP | HTTP ingress (Optional, redirects to HTTPS)
443 | TCP | HTTPS ingress (Required, all services)
9095 | TCP | Kafka (external client connections)
6379 | TCP | Redis (external client connections)
8125 | TCP/UDP | Telegraf (metrics collection)

Firewall Configuration

firewalld Configuration

For systems using firewalld, it is recommended to use separate zones for internal cluster traffic and external public access. This ensures that sensitive inter-node communication is restricted to the internal network.

  1. Assign Interfaces to Zones: First, assign your network interfaces to the appropriate zones. For example, if eth0 is your public interface and eth1 is your internal cluster interface:

    firewall-cmd --permanent --zone=public --add-interface=eth0
    firewall-cmd --permanent --zone=internal --add-interface=eth1
    
  2. Configure Firewall Rules: The following commands configure the minimum required firewall rules.

    # Inter-node communication (Internal Zone)
    firewall-cmd --permanent --zone=internal --add-port=2379-2380/tcp
    firewall-cmd --permanent --zone=internal --add-port=6443/tcp
    firewall-cmd --permanent --zone=internal --add-port=8472/udp
    firewall-cmd --permanent --zone=internal --add-port=10250/tcp
    firewall-cmd --permanent --zone=internal --add-port=5001/tcp
    firewall-cmd --permanent --zone=internal --add-port=9500-9503/tcp
    firewall-cmd --permanent --zone=internal --add-port=8500-8504/tcp
    firewall-cmd --permanent --zone=internal --add-port=10000-30000/tcp
    firewall-cmd --permanent --zone=internal --add-port=3260/tcp
    firewall-cmd --permanent --zone=internal --add-port=2049/tcp
    
    # Pod and service networks (Internal Zone)
    firewall-cmd --permanent --zone=internal --add-source=10.42.0.0/16
    firewall-cmd --permanent --zone=internal --add-source=10.43.0.0/16
    
    # External access (Public Zone)
    firewall-cmd --permanent --zone=public --add-port=80/tcp
    firewall-cmd --permanent --zone=public --add-port=443/tcp
    firewall-cmd --permanent --zone=public --add-port=9095/tcp
    firewall-cmd --permanent --zone=public --add-port=6379/tcp
    firewall-cmd --permanent --zone=public --add-port=8125/tcp
    firewall-cmd --permanent --zone=public --add-port=8125/udp
    
    # Apply changes
    firewall-cmd --reload
    

    For more restrictive configurations, you can scope rules to specific source subnets using --add-source=<subnet> within the internal zone.
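The per-port commands for the internal zone can also be generated from the port list above, which reduces typing errors. This is a convenience sketch: review the output, run it as root, then apply with firewall-cmd --reload.

```shell
# Convenience sketch: generate the internal-zone firewall-cmd commands from
# the inter-node port list above, instead of typing each rule by hand.
tcp_ports="2379-2380 6443 10250 5001 9500-9503 8500-8504 10000-30000 3260 2049"
cmds=""
for p in $tcp_ports; do
  cmds="${cmds}firewall-cmd --permanent --zone=internal --add-port=${p}/tcp
"
done
# Flannel VXLAN is the only UDP inter-node port
cmds="${cmds}firewall-cmd --permanent --zone=internal --add-port=8472/udp"
printf '%s\n' "$cmds"
```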

Internal Application Ports (Optional)

For internal cluster communication, the following port may be opened if direct application access is required:

firewall-cmd --permanent --zone=internal --add-port=9092/tcp
firewall-cmd --reload

Note: This port is used for internal Kafka cluster communication only.

Security Warning: Do not expose VictoriaMetrics (8428, 8429), or PostgreSQL (5432) directly. These services require authentication and their direct ports do not use TLS connections, creating a security risk. Always access these services through the secure HTTPS ingress (port 443).

Externally Accessible Application Ports: The following application ports are safe for external access and are already configured in the External Access section:

Port | Service | Notes
9095 | Kafka | External client connections
6379 | Redis | External client connections
8125 | Telegraf | Metrics collection

Verification

Verify firewall rules are applied:

firewall-cmd --list-all

Verify ports are accessible between nodes:

# From one node, test connectivity to another
nc -zv <node-ip> 6443
nc -zv <node-ip> 8472

Kubernetes Port Forwarding

For accessing internal Kubernetes services that are not exposed via ingress or services, use kubectl port-forward to create a secure tunnel from your local machine to the service.

Basic Port Forwarding

# Forward local port to a service
kubectl port-forward -n <namespace> svc/<service-name> <local-port>:<service-port>

# Example: Forward local port 8080 to Grafana (port 3000)
kubectl port-forward -n default svc/acd-manager-grafana 8080:3000

Note: “Local” refers to the machine where you run kubectl. This can be:

  • A Server node in the cluster (common for administrative tasks)
  • A remote machine with kubectl configured to access the cluster

Accessing the Forwarded Service

Once the port-forward is established, access the service at http://localhost:<local-port> from the machine where you ran kubectl port-forward.

If running on a Server node: To access the forwarded port from your local workstation, you need to:

  1. Ensure the firewall on the Server node allows traffic on the forwarded port from your network
  2. Use the Server node’s IP address instead of localhost from your workstation
# From your workstation (if firewall allows)
curl http://<server-node-ip>:<local-port>

For simplicity, consider running port-forward from your local machine (if kubectl is configured for remote cluster access) rather than from a Server node.

Background Port Forwarding

To run port-forward in the background:

kubectl port-forward -n <namespace> svc/<service-name> <local-port>:<service-port> &

Security Considerations

Port forwarding is recommended for:

  • Administrative interfaces (e.g., Longhorn UI) that should not be publicly exposed
  • Debugging and troubleshooting internal services
  • Temporary access to services without modifying ingress configuration

The port-forward tunnel remains active only while the kubectl port-forward command is running. Press Ctrl+C to terminate the tunnel.

Example: The Longhorn storage UI is intentionally not exposed via ingress due to security risks. Access it via port-forward:

kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80

Then navigate to http://localhost:8080 in your browser.

Network Security Considerations

Network Segmentation

For production deployments, consider network segmentation:

  • Management Network: Dedicated network for Kubernetes control plane traffic
  • Application Network: Separate network for application service traffic
  • External Network: Public-facing network for ingress traffic

Traffic Encryption

  • All external traffic uses HTTPS (TLS 1.2 or higher)
  • Internal cluster traffic traverses the Flannel VXLAN overlay (encrypted only if an encrypting backend is enabled)
  • Database connections (PostgreSQL, Redis) are internal to the cluster

Access Control

  • External access is limited to ports 80 and 443 by default
  • Application service ports should not be exposed externally
  • Use Kubernetes NetworkPolicies for fine-grained pod-to-pod traffic control

Troubleshooting

Nodes Cannot Communicate

  1. Verify firewall rules allow inter-node traffic:

    firewall-cmd --list-all
    
  2. Test connectivity between nodes:

    ping <node-ip>
    nc -zv <node-ip> 6443
    
  3. Check network routing:

    ip route
    

Pods Cannot Reach Services

  1. Verify Flannel is running:

    kubectl get pods -n kube-system | grep flannel
    
  2. Check VXLAN interface:

    ip link show flannel.1
    
  3. Verify pod network routes:

    ip route | grep 10.42
    

External Access Fails

  1. Verify ingress controller is running:

    kubectl get pods -n kube-system | grep traefik
    
  2. Check ingress configuration:

    kubectl get ingress
    
  3. Verify external firewall allows ports 80 and 443

Next Steps

After configuring networking:

  1. Installation Guide - Proceed with cluster installation
  2. System Requirements - Review hardware and OS requirements
  3. Architecture Guide - Understand component communication patterns

4 - Architecture Guide

Detailed system architecture and component overview

Overview

The AgileTV CDN Manager (ESB3027) is a cloud-native Kubernetes application designed for managing CDN operations. This guide provides a detailed description of the system architecture, component interactions, and scaling considerations.

High-Level Architecture

The CDN Manager follows a microservices architecture deployed on Kubernetes. The system is organized into logical layers:

graph LR
    Clients[API Clients] --> Ingress[Ingress Controller]
    Ingress --> Manager[Core Manager]
    Ingress --> Frontend[MIB Frontend]
    Ingress --> Grafana[Grafana]
    Manager --> Redis[(Redis)]
    Manager --> Kafka[(Kafka)]
    Manager --> PostgreSQL[(PostgreSQL)]
    Manager --> Zitadel[Zitadel IAM]
    Manager --> Confd[Configuration Service]
    Grafana --> VM[(VictoriaMetrics)]
    Confd -.-> Gateway[NGinx Gateway]
    Gateway --> Director[CDN Director]

Component Architecture

Ingress Layer

The ingress layer manages all incoming traffic to the cluster:

Component | Role
Ingress Controller | Primary ingress for all cluster traffic; routes requests to internal services based on path
NGinx Gateway | Reverse proxy for routing traffic to external CDN Directors; used by MIB Frontend to communicate with remote Confd instances on CDN Director nodes

Traffic flow:

  • API clients and Operator UI connect via the Ingress Controller at /api and /gui paths respectively
  • Grafana dashboards are accessed via the Ingress Controller at /grafana
  • Zitadel authentication console is accessed via the Ingress Controller at /ui/console
  • MIB Frontend uses NGinx Gateway when communicating with external Confd instances on CDN Director nodes

Application Services

The application layer contains the core CDN Manager services:

Component | Role | Scaling
Core Manager | Main REST API server (v1/v2 endpoints); handles authentication, configuration, routing, and discovery | Horizontally scalable via HPA
MIB Frontend | Web-based configuration GUI for operators | Horizontally scalable via HPA
Confd | Configuration service for routing configuration; synchronizes with Core Manager application | Single instance
Grafana | Monitoring and visualization dashboards | Single instance
Selection Input Worker | Consumes selection input events from Kafka and updates configuration | Single instance
Metrics Aggregator | Collects and aggregates metrics from CDN components | Single instance
Telegraf | System-level metrics collection from cluster nodes | DaemonSet (one per node)
Alertmanager | Alert routing and notification management | Single instance
AlertmanagerAlert routing and notification managementSingle instance

Data Layer

The data layer provides persistent and ephemeral storage:

Component | Role | Scaling
Redis | In-memory caching, session storage, and ephemeral state | Master + replicas (read-only)
Kafka | Event streaming for selection input and metrics; provides durable message queue | Controller cluster (odd count)
PostgreSQL | Persistent configuration and state storage | 3-node cluster with HA
VictoriaMetrics (Analytics) | Real-time and short-term metrics for operational dashboards | Single instance
VictoriaMetrics (Billing) | Long-term metrics retention (1+ years) for billing and license compliance | Single instance

External Integrations

Component | Role
Zitadel IAM | Identity and access management; provides OAuth2/OIDC authentication
CDN Director (ESB3024) | Edge routing infrastructure; receives configuration from Confd

Detailed Component Descriptions

Core Manager

The Core Manager is the central application server that exposes the REST API. It is implemented in Rust using the Actix-web framework.

Key Responsibilities:

  • Authentication and session management via Zitadel
  • Configuration document storage and retrieval
  • Selection input CRUD operations
  • Routing rule evaluation and GeoIP lookups
  • Service discovery for CDN Directors and edge servers
  • Operator UI helper endpoints

API Endpoints:

  • /api/v1/auth/* - Authentication (login, token, logout)
  • /api/v1/configuration - Configuration management
  • /api/v1/selection_input/* - Selection input operations
  • /api/v2/selection_input/* - Enhanced selection input with list operations
  • /api/v1/routing/* - Routing evaluation and validation
  • /api/v1/discovery/* - Host and namespace discovery
  • /api/v1/metrics - System metrics
  • /api/v1/health/* - Liveness and readiness probes
  • /api/v1/operator_ui/* - Operator helper endpoints

Runtime Modes: The Core Manager supports multiple runtime modes, each deployed as a separate container:

  • http-server - Primary HTTP API server (default)
  • metrics-aggregator - Background worker for metrics collection
  • selection-input - Background worker for Kafka selection input consumption

MIB Frontend

The MIB Frontend provides a web-based GUI for configuration management.

Key Features:

  • Intuitive web interface for CDN configuration
  • Real-time configuration validation
  • Integration with Zitadel for SSO authentication
  • Uses NGinx Gateway for external Director communication

Confd (Configuration Service)

Confd provides routing configuration services and synchronizes with the Core Manager application.

Key Responsibilities:

  • Hosts the service configuration for routing decisions
  • Provides API and CLI for configuration management
  • Synchronizes routing configuration with Core Manager
  • Maintains configuration state in PostgreSQL

Selection Input Worker

The Selection Input Worker processes selection input events from the Kafka stream.

Key Responsibilities:

  • Consumes messages from the selection_input Kafka topic
  • Validates and transforms input data
  • Updates configuration in the data store
  • Maintains message ordering within partitions

Scaling Limitation: The Selection Input Worker cannot be scaled beyond a single consumer per Kafka partition, as message ordering must be preserved.

Metrics Aggregator

The Metrics Aggregator collects and processes metrics from CDN components.

Key Responsibilities:

  • Polls metrics from Director instances
  • Aggregates usage statistics
  • Writes data to VictoriaMetrics (Analytics) for dashboards
  • Writes long-term data to VictoriaMetrics (Billing) for compliance

Telegraf

Telegraf is deployed as a DaemonSet to collect host-level metrics.

Key Responsibilities:

  • CPU, memory, disk, and network metrics from each node
  • Container-level resource usage
  • Kubernetes cluster metrics
  • Forwards metrics to VictoriaMetrics

Grafana

Grafana provides visualization and dashboard capabilities.

Features:

  • Pre-built dashboards for CDN monitoring
  • Custom dashboard support
  • VictoriaMetrics as data source
  • Alerting integration with Alertmanager

Access: https://<host>/grafana

Alertmanager

Alertmanager handles alert routing and notifications.

Key Responsibilities:

  • Receives alerts from Grafana and other sources
  • Deduplicates and groups alerts
  • Routes to notification channels (email, webhook, etc.)
  • Manages alert silencing and inhibition

Data Storage

Redis

Redis provides in-memory storage for:

  • User sessions and authentication tokens
  • Ephemeral configuration cache
  • Real-time state synchronization

Deployment: Master + read replicas for high availability

Kafka

Kafka provides durable event streaming for:

  • Selection input events
  • Metrics data streams
  • Inter-service communication

Deployment: Controller cluster with 3 replicas for production, 1 replica for lab deployments

Node Affinity: Kafka replicas must be scheduled on separate nodes to ensure high availability. The Helm chart configures pod anti-affinity rules to enforce this distribution.

Topics:

  • selection_input - Selection input events
  • metrics - Metrics data streams

Note: For lab/single-node deployments, the Kafka replica count must be set to 1 in the Helm values. Production deployments require 3 replicas for fault tolerance.
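In Helm values, this typically looks like the fragment below. The exact key path depends on the chart version, so treat the names as an assumption and verify against your chart’s values.yaml:

```yaml
# Assumed values structure -- verify key names against your chart's values.yaml.
kafka:
  replicaCount: 1   # lab/single-node; use 3 for production fault tolerance
```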

PostgreSQL

PostgreSQL provides persistent storage for:

  • Configuration documents
  • User and permission data
  • System state

Deployment: 3-node cluster managed by the CloudNativePG (CNPG) operator

High Availability: The CNPG operator manages automatic failover and ensures high availability:

  • One primary node handles read/write operations
  • Two replica nodes provide redundancy and can be promoted to primary on failure
  • Automatic failover occurs within seconds of primary node failure
  • Synchronous replication ensures data consistency

Note: The PostgreSQL cluster is deployed and managed automatically by the CNPG operator. Manual intervention is typically not required for normal operations.

VictoriaMetrics

Two VictoriaMetrics instances serve different purposes:

VictoriaMetrics (Analytics):

  • Real-time and short-term metrics storage
  • Supports Grafana dashboards
  • Retention: Configurable (typically 30-90 days)

VictoriaMetrics (Billing):

  • Long-term metrics retention
  • Billing and license compliance data
  • Retention: Minimum 1 year

Authentication and Authorization

Zitadel Integration

Zitadel provides identity and access management:

Authentication Flow:

  1. User accesses MIB Frontend or API
  2. Redirected to Zitadel for authentication
  3. Zitadel validates credentials and issues session token
  4. Session token exchanged for access token
  5. Access token included in API requests (Bearer authentication)

Default Credentials: See the Glossary for default login credentials.

Access Paths:

  • Zitadel Console: /ui/console
  • API authentication: /api/v1/auth/*

CORS Configuration

Zitadel enforces Cross-Origin Resource Sharing (CORS) policies. The external hostname configured in Zitadel must match the first entry in global.hosts.manager in the Helm values.
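As a sketch of a consistent pairing (hostnames here are examples), the Zitadel external domain repeats the first manager host:

```yaml
# Example hostnames only. Zitadel's ExternalDomain must equal the FIRST
# entry under global.hosts.manager, or logins fail CORS checks.
global:
  hosts:
    manager:
      - host: manager.example.com        # first entry: used by Zitadel
      - host: manager-backup.example.com # additional aliases

zitadel:
  zitadel:
    ExternalDomain: manager.example.com  # matches the first host above
```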

Network Architecture

Traffic Flow

graph TB
    External[External Clients] --> Ingress[Ingress Controller]
    External --> Redis[(Redis)]
    External --> Kafka[(Kafka)]
    External --> Telegraf[Telegraf]
    Ingress --> Manager[Core Manager]
    Ingress --> Frontend[MIB Frontend]
    Ingress --> Grafana[Grafana]
    Ingress --> Zitadel[Zitadel]

Note: Certain services (Redis, Kafka, Telegraf) can be accessed directly by external clients without traversing the ingress controller. This is typically used for metrics collection, event streaming, and direct data access scenarios.

Internal Communication

All internal services communicate over the Kubernetes overlay network (Flannel VXLAN). Services discover each other via Kubernetes DNS.

External Communication

  • CDN Directors: Accessed via the NGINX Gateway for simplified routing
  • MaxMind GeoIP: Local database files (no external calls)

Scaling

Horizontal Pod Autoscaler (HPA)

The following components support automatic horizontal scaling via HPA:

| Component     | Minimum | Maximum | Scale Metrics           |
|---------------|---------|---------|-------------------------|
| Core Manager  | 3       | 8       | CPU (50%), Memory (80%) |
| NGINX Gateway | 2       | 4       | CPU (75%), Memory (80%) |
| MIB Frontend  | 2       | 4       | CPU (75%), Memory (90%) |

Note: HPA is enabled by default in the Helm chart. The default configuration is tuned for production deployments. Adjust min/max values based on expected load and available cluster capacity.

Manual Scaling

Components can also be scaled manually by setting replica counts in the Helm values:

manager:
  replicaCount: 3
mib-frontend:
  replicaCount: 2

Important: When manually setting replica counts, you must disable the Horizontal Pod Autoscaler (HPA) for the corresponding component. If HPA remains enabled, it will override manual replica settings. To disable HPA, set autoscaling.hpa.enabled: false for the component in your Helm values.
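For example, to pin the Core Manager at three replicas, a values fragment along these lines disables its autoscaler in the same change (a sketch; the exact key nesting may vary by chart version):

```yaml
manager:
  replicaCount: 3      # fixed replica count
  autoscaling:
    hpa:
      enabled: false   # stop HPA from overriding the manual count
```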

Components That Do Not Scale

The following components do not support horizontal scaling:

| Component | Reason |
|-----------|--------|
| Confd | Single instance required for configuration consistency |
| PostgreSQL | CloudNativePG cluster; scaled by adding replicas via operator configuration |
| Kafka | Scaled by adding controllers, not via replica count |
| VictoriaMetrics | Stateful; single instance per role |
| Redis | Master is single; replicas are read-only |
| Grafana | Single instance sufficient for dashboard access |
| Alertmanager | Single instance for alert routing |
| Selection Input Worker | Kafka message ordering requires single consumer |
| Metrics Aggregator | Single instance for consistent metrics aggregation |

Node Scaling

Additional Agent nodes can be added to the cluster at any time to increase workload capacity. Kubernetes automatically schedules pods to nodes with available resources.

Cluster Balancing

The CDN Manager deployment includes the Kubernetes Descheduler to maintain balanced resource utilization across cluster nodes:

  • Automatic Rebalancing: The descheduler periodically analyzes pod distribution and evicts pods from overutilized nodes
  • Node Balance: Helps prevent resource hotspots by redistributing workloads across available nodes
  • Integration with HPA: Works in conjunction with Horizontal Pod Autoscaler to optimize both pod count and placement

The descheduler runs as a background process and does not require manual intervention under normal operating conditions.

Resource Configuration

For detailed resource preset configurations and planning guidance, see the Configuration Guide.

High Availability

Server Node Redundancy

Production deployments require a minimum of 3 Server nodes:

  • Survives loss of 1 server node
  • Maintains quorum for etcd and Kafka

For enhanced availability, use 5 Server nodes:

  • Survives loss of 2 server nodes
  • Recommended for critical production environments

For large-scale deployments, 7 or more Server nodes can be used:

  • Survives loss of 3+ server nodes
  • Suitable for high-capacity production environments
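These figures follow standard quorum arithmetic: a cluster of N servers keeps quorum while a majority survives, so it tolerates floor((N-1)/2) failures. A quick sketch:

```shell
# Quorum fault tolerance: N servers keep quorum while a majority
# survives, so the cluster tolerates (N - 1) / 2 failed servers.
for n in 3 5 7; do
  echo "$n servers tolerate $(( (n - 1) / 2 )) failed server(s)"
done
# prints:
#   3 servers tolerate 1 failed server(s)
#   5 servers tolerate 2 failed server(s)
#   7 servers tolerate 3 failed server(s)
```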

Pod Distribution

Kubernetes automatically distributes pods across nodes to maximize availability:

  • Pods belonging to the same deployment are scheduled on different nodes when possible
  • Pod Disruption Budgets (PDB) ensure minimum availability during maintenance

Data Replication

| Component | Replication Strategy |
|-----------|----------------------|
| Redis | Single instance (backup via Longhorn snapshots) |
| Kafka | Replicated partitions (default: 3) |
| PostgreSQL | 3-node cluster via CloudNativePG |
| VictoriaMetrics | Single instance (backup via snapshots) |
| Longhorn | Single replica with pod-node affinity |

Longhorn Storage: Longhorn volumes are configured with a single replica by default. Pod scheduling is configured with node affinity to prefer scheduling pods on the same node as their persistent volume data. This approach optimizes I/O performance while maintaining data locality.

Next Steps

After understanding the architecture:

  1. Installation Guide - Deploy the CDN Manager
  2. Configuration Guide - Configure components for your environment
  3. Operations Guide - Day-to-day operational procedures
  4. Performance Tuning Guide - Optimize system performance
  5. Metrics & Monitoring - Set up monitoring and alerting

5 - Installation Guide

Step-by-step installation and upgrade procedures

Overview

This guide provides detailed instructions for installing the AgileTV CDN Manager (ESB3027) in various deployment scenarios. The installation process varies depending on the target environment and desired configuration.

Estimated Installation Time:

| Deployment Type | Time |
|-----------------|------|
| Single-Node (Lab) | ~15 minutes |
| Multi-Node (3 servers) | ~30 minutes |

Actual installation time may vary depending on hardware performance, network speed, and whether air-gapped procedures are required.

Note: These estimates assume the operating system is already installed on all nodes. OS installation is outside the scope of this guide.

Installation Types

| Installation Type | Description | Use Case |
|-------------------|-------------|----------|
| Single-Node (Lab) | Minimal installation on a single host | Acceptance testing, demonstrations, development |
| Multi-Node (Production) | Full high-availability cluster with 3+ server nodes | Production deployments |

Installation Process Summary

The installation follows a sequential process:

  1. Prepare the host system - Verify requirements and mount the installation ISO
  2. Install the Kubernetes cluster - Deploy K3s, Longhorn storage, and PostgreSQL
  3. Join additional nodes (production only) - Expand the cluster for HA or capacity
  4. Deploy the Manager application - Install the CDN Manager Helm chart
  5. Post-installation configuration - Configure authentication, networking, and users

| Guide | Description |
|-------|-------------|
| Installation Checklist | Step-by-step checklist to track progress |
| Single-Node Installation | Lab and acceptance testing deployment |
| Multi-Node Installation | Production high-availability deployment |
| Air-Gapped Deployment | Air-gapped environment installation |
| Upgrade Guide | Upgrading from previous versions |
| Next Steps | Post-installation configuration tasks |

Prerequisites

Before beginning installation, ensure the following requirements are met:

  • Hardware: Nodes meeting the System Requirements including CPU, memory, and disk specifications
  • Operating System: RHEL 9 or compatible clone (details); air-gapped deployments require the OS ISO mounted on all nodes
  • Network: Proper firewall configuration between nodes (port requirements, firewall configuration)
  • Software: Installation ISO obtained from AgileTV; air-gapped deployments also require the Extras ISO
  • Kernel Tuning: For production deployments, apply recommended sysctl settings (Performance Tuning Guide)

We recommend using the Installation Checklist to track your progress through the installation process.

Getting Help

If you encounter issues during installation, see the Troubleshooting Guide.

5.1 - Installation Checklist

Step-by-step checklist to track installation progress

Overview

Use this checklist to track your installation progress. Print this page or keep it open during your installation to ensure all steps are completed correctly.

Pre-Installation

Hardware and Software

  • Verify hardware meets System Requirements
  • Confirm operating system is supported (RHEL 9 or compatible clone)
  • Configure firewall rules between nodes (details)
  • Apply recommended sysctl settings (details)
  • Obtain installation ISO (esb3027-acd-manager-X.Y.Z.iso)

Air-Gapped Deployments

  • Obtain Extras ISO (esb3027-acd-manager-extras-X.Y.Z.iso)
  • Mount OS ISO on all nodes before installation
  • Verify OS packages are accessible from mounted ISO

Special Requirements

  • Oracle Linux UEK: Install kernel-uek-modules-extra-netfilter-$(uname -r) package
  • Control Plane Only nodes: Set SKIP_REQUIREMENTS_CHECK=1 if below lab minimums
  • SELinux: Set to “Enforcing” mode before running installer (cannot enable after)

Cluster Installation

Single-Node Deployment

Follow the Single-Node Installation Guide.

  • Mount installation ISO (Step 1)
  • Install the base cluster (Step 2)
  • Verify cluster status (Step 3)
  • Air-gapped only: Load container images (Step 4)
  • Create configuration file (Step 5)
  • Optional: Load MaxMind GeoIP databases (Step 6)
  • Deploy the Manager Helm chart (Step 7)
  • Verify deployment (Step 8)

Multi-Node Deployment

Follow the Multi-Node Installation Guide.

Primary Server Node

  • Mount installation ISO (Step 1)
  • Install the base cluster (Step 2)
  • Verify system pods are running (Step 2)
  • Retrieve the node token (Step 3)

Additional Server Nodes

  • Mount installation ISO (Step 5)
  • Join the cluster (Step 5)
  • Verify each node joins (Step 5)
  • Optional: Taint Control Plane Only nodes (Step 5b)

Agent Nodes (Optional)

  • Mount installation ISO (Step 6)
  • Join the cluster as an agent (Step 6)
  • Verify each agent joins (Step 6)

Cluster Verification

  • Verify all nodes are ready (Step 7)
  • Verify system pods running on all nodes (Step 7)
  • Air-gapped only: Load container images on each node (Step 9)

Application Deployment

  • Create configuration file (Step 10)
  • Optional: Load MaxMind GeoIP databases (Step 11)
  • Optional: Configure TLS certificates from trusted CA (Step 12)
  • Deploy the Manager Helm chart (Step 13)
  • Verify all pods are running and distributed (Step 14)
  • Configure DNS records for manager hostname (Step 15)

Post-Installation

Initial Access

  • Access the system via HTTPS
  • Accept self-signed certificate warning (if using default certificate)
  • Log in with default credentials (see Glossary)

Security Configuration

  • Create new administrator account in Zitadel
  • Delete or secure the default admin account
  • Configure additional users and permissions
  • Review Zitadel Administrator Documentation for role assignments

Monitoring and Operations

  • Access Grafana dashboards at /grafana
  • Review pre-built monitoring dashboards
  • Configure alerting rules (optional)
  • Set up notification channels (optional)

Next Steps

  • Review Next Steps Guide for additional configuration
  • Configure CDN routing rules
  • Set up GeoIP-based routing (if using MaxMind databases)
  • Review Operations Guide for day-to-day procedures

Troubleshooting

If you encounter issues during installation:

  1. Check pod status: kubectl describe pod <pod-name>
  2. Review logs: kubectl logs <pod-name>
  3. Check cluster events: kubectl get events --sort-by='.lastTimestamp'
  4. Review the Troubleshooting Guide for common issues

5.2 - Single-Node Installation

Lab and acceptance testing deployment

Warning: Single-node deployments are for lab environments, acceptance testing, and demonstrations only. This configuration is not suitable for production workloads. For production deployments, see the Multi-Node Installation Guide, which requires a minimum of 3 server nodes for high availability.

Air-Gapped Deployment? This guide assumes internet connectivity. For air-gapped deployments, see the Air-Gapped Deployment Guide for additional requirements and procedures.

Overview

This guide describes the installation of the AgileTV CDN Manager on a single node. This configuration is intended for lab environments, acceptance testing, and demonstrations only. It is not suitable for production workloads.

Prerequisites

Hardware Requirements

Refer to the System Requirements Guide for hardware specifications. Single-node deployments require the “Single-Node (Lab)” configuration.

Operating System

Refer to the System Requirements Guide for supported operating systems.

Software Access

  • Installation ISO: esb3027-acd-manager-X.Y.Z.iso
  • Extras ISO (air-gapped only): esb3027-acd-manager-extras-X.Y.Z.iso

Network Configuration

Ensure that required firewall ports are configured before installation. See the Networking Guide for complete firewall configuration requirements.

SELinux

If SELinux is to be used, it must be set to “Enforcing” mode before running the installer script. The installer will configure appropriate SELinux policies automatically. SELinux cannot be enabled after installation.

Installation Steps

Step 1: Mount the ISO

Create a mount point and mount the installation ISO:

mkdir -p /mnt/esb3027
mount -o loop,ro esb3027-acd-manager-X.Y.Z.iso /mnt/esb3027

Replace X.Y.Z with the actual version number.

Step 2: Install the Base Cluster

Run the installer to set up the K3s Kubernetes cluster:

/mnt/esb3027/install

This installs:

  • K3s Kubernetes distribution
  • Longhorn distributed storage
  • CloudNativePG operator for PostgreSQL
  • Base system dependencies

The installer will configure the node as both a server and agent node.

Step 3: Verify Cluster Status

After the installer completes, verify that all components are operational before proceeding. This verification serves as an important checkpoint to confirm the installation is progressing correctly.

1. Verify the node is ready:

kubectl get nodes

Expected output:

NAME         STATUS   ROLES                       AGE   VERSION
k3s-server   Ready    control-plane,etcd,master   2m    v1.33.4+k3s1

2. Verify system pods in both namespaces are running:

# Check kube-system namespace (Kubernetes core components)
kubectl get pods -n kube-system

# Check longhorn-system namespace (distributed storage)
kubectl get pods -n longhorn-system

All pods should show Running status. If any pods are still Pending or ContainerCreating, wait until they are ready. Proceeding with incomplete system pods can cause subsequent steps to fail in unpredictable ways.

This verification confirms:

  • K3s cluster is operational
  • Longhorn distributed storage is running
  • CloudNativePG operator is deployed
  • All core components are healthy before continuing

Step 4: Air-Gapped Deployments (If Applicable)

If deploying in an air-gapped environment, load container images from the extras ISO:

mkdir -p /mnt/esb3027-extras
mount -o loop,ro esb3027-acd-manager-extras-X.Y.Z.iso /mnt/esb3027-extras
/mnt/esb3027-extras/load-images

Step 5: Create Configuration File

Create a Helm values file for your deployment. At minimum, configure the manager hostname and at least one router:

# ~/values.yaml
global:
  hosts:
    manager:
      - host: manager.local
    routers:
      - name: default
        address: 127.0.0.1

The routers configuration specifies CDN Director instances. For lab deployments, a placeholder entry is sufficient. For production, specify the actual Director hostnames or IP addresses.

For single-node deployments, you must also disable Kafka replication:

kafka:
  replicaCount: 1
  controller:
    replicaCount: 1

Step 6: Load MaxMind GeoIP Databases (Optional)

If you plan to use GeoIP-based routing or validation features, load the MaxMind GeoIP databases. The following databases are used by the manager:

  • GeoIP2-City.mmdb - The City Database
  • GeoLite2-ASN.mmdb - The ASN Database
  • GeoIP2-Anonymous-IP.mmdb - The VPN and Anonymous IP Database

A helper utility is provided on the ISO to create the Kubernetes volume:

/mnt/esb3027/generate-maxmind-volume

The utility will prompt for the locations of the three database files and the name of the volume. After running this command, reference the volume in your configuration file:

manager:
  maxmindDbVolume: maxmind-db-volume

Replace maxmind-db-volume with the volume name you specified when running the utility.

Tip: When naming the volume, include a revision number or date (e.g., maxmind-db-volume-2026-04 or maxmind-db-volume-v2). This simplifies future updates: create a new volume with an updated name, update the values.yaml to reference the new volume, and delete the old volume after verification.

Step 7: Deploy the Manager Helm Chart

Deploy the CDN Manager application:

helm install acd-manager /mnt/esb3027/charts/acd-manager --values ~/values.yaml

Monitor the deployment progress:

kubectl get pods

Wait for all pods to show Running status before proceeding.

Note: The default Helm timeout is 5 minutes. If the installation fails due to a rollout timeout, retry with a larger timeout value:

helm install acd-manager /mnt/esb3027/charts/acd-manager --values ~/values.yaml --timeout 10m

If a previous installation attempt failed and you receive an error that the release name is already in use, uninstall the previous release before retrying:

helm uninstall acd-manager
helm install acd-manager /mnt/esb3027/charts/acd-manager --values ~/values.yaml

Step 8: Verify Deployment

Verify all application pods are running:

kubectl get pods

Expected output for a single-node deployment (pod names will vary):

NAME                                              READY   STATUS      RESTARTS   AGE
acd-manager-5b98d569d9-abc12                      1/1     Running     0          3m
acd-manager-confd-6fb78548c4-xnrh4                1/1     Running     0          3m
acd-manager-gateway-8bc8446fc-chs26               1/1     Running     0          3m
acd-manager-kafka-controller-0                    2/2     Running     0          3m
acd-manager-metrics-aggregator-76d96c4964-lwdcj   1/1     Running     0          3m
acd-manager-mib-frontend-7bdb69684b-6qxn8         1/1     Running     0          3m
acd-manager-postgresql-0                          1/1     Running     0          3m
acd-manager-redis-master-0                        2/2     Running     0          3m
acd-manager-redis-replicas-0                      2/2     Running     0          3m
acd-manager-selection-input-5fb694b857-qxt67      1/1     Running     0          3m
acd-manager-zitadel-8448b4c4fc-2pkd8              1/1     Running     0          3m
acd-manager-zitadel-init-hh6j7                    0/1     Completed   0          4m
acd-manager-zitadel-setup-nwp8k                   0/2     Completed   0          4m
alertmanager-0                                    1/1     Running     0          3m
grafana-6d948cfdc6-77ggk                          1/1     Running     0          3m
victoria-metrics-agent-dc87df588-tn8wv            1/1     Running     0          3m
victoria-metrics-alert-757c44c58f-kk9lp           1/1     Running     0          3m
victoria-metrics-longterm-server-0                1/1     Running     0          3m
victoria-metrics-server-0                         1/1     Running     0          3m

Note: Init pods (such as zitadel-init and zitadel-setup) will show Completed status after successful initialization. This is expected behavior.

Post-Installation

After installation completes, proceed to the Next Steps guide for:

  • Initial user configuration
  • Accessing the web interfaces
  • Configuring authentication
  • Setting up monitoring

Accessing the System

Refer to the Accessing the System section in the Getting Started guide for service URLs and default credentials.

Note: A self-signed SSL certificate is deployed by default. You will need to accept the certificate warning in your browser.

Troubleshooting

If pods fail to start:

  1. Check pod status: kubectl describe pod <pod-name>
  2. Review logs: kubectl logs <pod-name>
  3. Verify resources: kubectl top pods

See the Troubleshooting Guide for additional assistance.

Next Steps

After successful installation:

  1. Next Steps Guide - Post-installation configuration
  2. Configuration Guide - System configuration
  3. Operations Guide - Day-to-day operations

Appendix: Example Configuration

The following values.yaml provides a minimal working configuration for lab deployments:

# Minimal lab configuration for single-node deployment
global:
  hosts:
    manager:
      - host: manager.local
    routers:
      - name: default
        address: 127.0.0.1

# Single-node: Disable Kafka replication
kafka:
  replicaCount: 1
  controller:
    replicaCount: 1

Customization notes:

  • Replace manager.local with your desired hostname
  • The routers entry specifies CDN Director instances. The placeholder 127.0.0.1 may be used if a Director instance isn’t available, or specify actual Director hostnames for production testing
  • For air-gapped deployments, see Step 4: Air-Gapped Deployments

5.3 - Multi-Node Installation

Production high-availability deployment

Overview

This guide describes the installation of the AgileTV CDN Manager across multiple nodes for production deployments. This configuration provides high availability and horizontal scaling capabilities.

Air-Gapped Deployment? This guide assumes internet connectivity. For air-gapped deployments, see the Air-Gapped Deployment Guide for additional requirements and procedures.

Prerequisites

Hardware Requirements

Refer to the System Requirements Guide for hardware specifications. Production deployments require:

  • Minimum 3 Server nodes (Control Plane Only or Combined role)
  • Optional Agent nodes for additional workload capacity

Operating System

Refer to the System Requirements Guide for supported operating systems.

Software Access

  • Installation ISO: esb3027-acd-manager-X.Y.Z.iso (for each node)
  • Extras ISO (air-gapped only): esb3027-acd-manager-extras-X.Y.Z.iso

Network Configuration

Ensure that required firewall ports are configured between all nodes before installation. See the Networking Guide for complete firewall configuration requirements.

Multiple Network Interfaces

If your nodes have multiple network interfaces and you want to use a separate interface for cluster traffic (not the default route interface), configure the INSTALL_K3S_EXEC environment variable before installing the cluster or joining nodes.

For example, if bond0 has the default route but you want cluster traffic on bond1:

# For server nodes
export INSTALL_K3S_EXEC="server --node-ip 10.0.0.10 --flannel-iface=bond1"

# For agent nodes
export INSTALL_K3S_EXEC="agent --node-ip 10.0.0.20 --flannel-iface=bond1"

Where:

  • Mode: Use server for the primary node establishing the cluster, or for additional server nodes. Use agent for agent nodes joining the cluster.
  • --node-ip: The IP address of the interface to use for cluster traffic
  • --flannel-iface: The network interface name for Flannel VXLAN overlay traffic

Set this variable on each node before running the install or join scripts.

SELinux

If SELinux is to be used, it must be set to “Enforcing” mode before running the installer script. The installer will configure appropriate SELinux policies automatically. SELinux cannot be enabled after installation.

Installation Steps

Step 1: Prepare the Primary Server Node

Mount the installation ISO on the primary server node:

mkdir -p /mnt/esb3027
mount -o loop,ro esb3027-acd-manager-X.Y.Z.iso /mnt/esb3027

Replace X.Y.Z with the actual version number.

Step 2: Install the Base Cluster on Primary Server

If your node has multiple network interfaces and you need to specify a separate interface for cluster traffic, set the INSTALL_K3S_EXEC environment variable before running the installer (see Multiple Network Interfaces):

export INSTALL_K3S_EXEC="server --node-ip <node-ip> --flannel-iface=<interface>"

Run the installer to set up the K3s Kubernetes cluster:

/mnt/esb3027/install

This installs:

  • K3s Kubernetes distribution
  • Longhorn distributed storage
  • CloudNativePG operator for PostgreSQL
  • Base system dependencies

Important: After the installer completes, verify that all system pods in both namespaces are in the Running state before proceeding:

# Check kube-system namespace (Kubernetes core components)
kubectl get pods -n kube-system

# Check longhorn-system namespace (distributed storage)
kubectl get pods -n longhorn-system

All pods should show Running status. If any pods are still Pending or ContainerCreating, wait until they are ready. Proceeding with incomplete system pods can cause subsequent steps to fail in unpredictable ways.

This verification confirms:

  • K3s cluster is operational
  • Longhorn distributed storage is running
  • CloudNativePG operator is deployed
  • All core components are healthy before continuing

Step 3: Retrieve the Node Token

Retrieve the node token for joining additional nodes:

cat /var/lib/rancher/k3s/server/node-token

Save this token for use on additional nodes. Also note the IP address of the primary server node.

Step 4: Server vs Agent Node Roles

Before joining additional nodes, determine which nodes will serve as Server nodes vs Agent nodes:

| Role | Control Plane | Workloads | HA Quorum | Use Case |
|------|---------------|-----------|-----------|----------|
| Server Node (Combined) | Yes (etcd, API server) | Yes | Participates | Default production role; minimum 3 nodes |
| Server Node (Control Plane Only) | Yes (etcd, API server) | No | Participates | Dedicated control plane; requires separate Agent nodes |
| Agent Node | No | Yes | No | Additional workload capacity only |

Guidance:

  • Combined role (default): Server nodes run both control plane and workloads; minimum 3 nodes required for HA
  • Control Plane Only: Dedicate nodes to control plane functions; requires at least 3 Server nodes plus 3+ Agent nodes for workloads
  • Agent nodes are required if using Control Plane Only servers; optional if using Combined role servers
  • For most deployments, 3 Server nodes (Combined role) with no Agent nodes is sufficient
  • Add Agent nodes to scale workload capacity without affecting control plane quorum

Proceed to Step 5 to join Server nodes. Agent nodes are joined after all Server nodes are ready.

Step 5: Join Additional Server Nodes

On each additional server node:

  1. Mount the ISO:

    mkdir -p /mnt/esb3027
    mount -o loop,ro esb3027-acd-manager-X.Y.Z.iso /mnt/esb3027
    
  2. Join the cluster:

If your node has multiple network interfaces, set the INSTALL_K3S_EXEC environment variable with the server mode before running the join script (see Multiple Network Interfaces):

export INSTALL_K3S_EXEC="server --node-ip <node-ip> --flannel-iface=<interface>"

Run the join script:

/mnt/esb3027/join-server https://<primary-server-ip>:6443 <node-token>

Replace <primary-server-ip> with the IP address of the primary server and <node-token> with the token retrieved in Step 3.

  3. Verify the node joined successfully:
    kubectl get nodes
    

Repeat for each server node. A minimum of 3 server nodes is required for high availability.

Step 5b: Taint Control Plane Only Nodes (Optional)

If you are using dedicated Control Plane Only nodes (not Combined role), apply taints to prevent workload scheduling:

kubectl taint nodes <node-name> CriticalAddonsOnly=true:NoSchedule

Apply this taint to each Control Plane Only node. Verify taints are applied:

kubectl describe nodes | grep -A 5 "Taints"

Note: This step is only required if you want dedicated control plane nodes. For Combined role deployments, do not apply taints.

Important: Control Plane Only Server nodes can be deployed with lower hardware specifications (2 cores, 4 GiB, 64 GiB) than the installer’s default minimum requirements. If your Control Plane Only Server nodes do not meet the Single-Node Lab configuration minimums (8 cores, 16 GiB, 128 GiB), you must set the SKIP_REQUIREMENTS_CHECK environment variable before running the installer or join command:

# For the primary server node
export SKIP_REQUIREMENTS_CHECK=1
/mnt/esb3027/install

# For additional Control Plane Only Server nodes
export SKIP_REQUIREMENTS_CHECK=1
/mnt/esb3027/join-server https://<primary-server-ip>:6443 <node-token>

Note: This applies to Server nodes only. Agent nodes have separate minimum requirements.

Step 6: Join Agent Nodes (Optional)

On each agent node:

  1. Mount the ISO:

    mkdir -p /mnt/esb3027
    mount -o loop,ro esb3027-acd-manager-X.Y.Z.iso /mnt/esb3027
    
  2. Join the cluster as an agent:

If your node has multiple network interfaces, set the INSTALL_K3S_EXEC environment variable with the agent mode before running the join script (see Multiple Network Interfaces):

export INSTALL_K3S_EXEC="agent --node-ip <node-ip> --flannel-iface=<interface>"

Run the join script:

/mnt/esb3027/join-agent https://<primary-server-ip>:6443 <node-token>

  3. Verify the node joined successfully from an existing server node:
    kubectl get nodes
    

Agent nodes provide additional workload capacity but do not participate in the control plane quorum.

Step 7: Verify Cluster Status

After all nodes are joined, verify the cluster is operational:

1. Verify all nodes are ready:

kubectl get nodes

Expected output:

NAME                 STATUS   ROLES                       AGE   VERSION
k3s-server-0         Ready    control-plane,etcd,master   5m    v1.33.4+k3s1
k3s-server-1         Ready    control-plane,etcd,master   3m    v1.33.4+k3s1
k3s-server-2         Ready    control-plane,etcd,master   2m    v1.33.4+k3s1
k3s-agent-1          Ready    <none>                      1m    v1.33.4+k3s1
k3s-agent-2          Ready    <none>                      1m    v1.33.4+k3s1

2. Verify system pods in both namespaces are running:

# Check kube-system namespace (Kubernetes core components)
kubectl get pods -n kube-system

# Check longhorn-system namespace (distributed storage)
kubectl get pods -n longhorn-system

All pods should show Running status. If any pods are still Pending or ContainerCreating, wait until they are ready.

This verification confirms:

  • K3s cluster is operational across all nodes
  • Longhorn distributed storage is running
  • CloudNativePG operator is deployed
  • All core components are healthy before proceeding to application deployment

Step 9: Air-Gapped Deployments (If Applicable)

If deploying in an air-gapped environment, on each node:

mkdir -p /mnt/esb3027-extras
mount -o loop,ro esb3027-acd-manager-extras-X.Y.Z.iso /mnt/esb3027-extras
/mnt/esb3027-extras/load-images

Step 10: Create Configuration File

Create a Helm values file for your deployment. At minimum, configure the manager hostnames, Zitadel external domain, and at least one router:

# ~/values.yaml
global:
  hosts:
    manager:
      - host: manager.example.com
      - host: manager-backup.example.com
    routers:
      - name: director-1
        address: 192.0.2.1
      - name: director-2
        address: 192.0.2.2

zitadel:
  zitadel:
    ExternalDomain: manager.example.com

Tip: A complete default values.yaml file is available on the installation ISO at /mnt/esb3027/values.yaml. Copy this file to use as a starting point for your configuration.

Important: The zitadel.zitadel.ExternalDomain must match the first entry in global.hosts.manager or authentication will fail due to CORS policy violations.

Important: For multi-node deployments, Kafka replication is enabled by default with 3 replicas. Do not modify the kafka.replicaCount or kafka.controller.replicaCount settings unless you understand the implications for data durability.
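For reference only, the multi-node defaults correspond to the fragment below. Leave these keys unset in production rather than restating them; the fragment is shown for comparison with the single-node override:

```yaml
# Multi-node default (for reference; do not lower in production)
kafka:
  replicaCount: 3
  controller:
    replicaCount: 3
```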

Step 11: Load MaxMind GeoIP Databases (Optional)

If you plan to use GeoIP-based routing or validation features, load the MaxMind GeoIP databases. The following databases are used by the manager:

  • GeoIP2-City.mmdb - The City Database
  • GeoLite2-ASN.mmdb - The ASN Database
  • GeoIP2-Anonymous-IP.mmdb - The VPN and Anonymous IP Database

A helper utility is provided on the ISO to create the Kubernetes volume:

/mnt/esb3027/generate-maxmind-volume

The utility will prompt for the locations of the three database files and the name of the volume. After running this command, reference the volume in your configuration file:

manager:
  maxmindDbVolume: maxmind-db-volume

Replace maxmind-db-volume with the volume name you specified when running the utility.

Tip: When naming the volume, include a revision number or date (e.g., maxmind-db-volume-2026-04 or maxmind-db-volume-v2). This simplifies future updates: create a new volume with an updated name, update the values.yaml to reference the new volume, and delete the old volume after verification.

Step 12: Configure TLS Certificates (Optional)

For production deployments, configure a valid TLS certificate from a trusted Certificate Authority (CA). A self-signed certificate is deployed by default if no certificate is provided.

Method 1: Create TLS Secret Manually

Create a Kubernetes TLS secret with your certificate and key:

kubectl create secret tls acd-manager-tls --cert=tls.crt --key=tls.key
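
Before creating the secret, it is worth confirming that the certificate and key actually belong together; a mismatched pair causes TLS handshake failures that are tedious to debug at the ingress. A minimal sketch using openssl (filenames as in the command above):

```shell
# Check that tls.crt and tls.key form a matching pair by comparing the
# certificate's embedded public key with the public key derived from the
# private key. Works for both RSA and EC keys.
match_cert_key() {
  [ -f "$1" ] && [ -f "$2" ] || return 1
  [ "$(openssl x509 -noout -pubkey -in "$1" 2>/dev/null)" = \
    "$(openssl pkey -pubout -in "$2" 2>/dev/null)" ]
}

if match_cert_key tls.crt tls.key; then
  echo "certificate and key match"
else
  echo "MISMATCH or unreadable files; do not create the secret" >&2
fi
```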

Method 2: Helm-Managed Secret

Add the certificate directly to your values.yaml:

ingress:
  secrets:
    acd-manager-tls: |
      -----BEGIN CERTIFICATE-----
      ...
      -----END CERTIFICATE-----
  tls:
    - hosts:
        - manager.example.com
      secretName: acd-manager-tls

Configuring All Ingress Controllers

All ingress controllers must be configured with the same certificate secret and hostname:

ingress:
  hostname: manager.example.com
  tls: true
  secretName: acd-manager-tls

zitadel:
  ingress:
    tls:
      - hosts:
          - manager.example.com
        secretName: acd-manager-tls

confd:
  ingress:
    hostname: manager.example.com
    tls: true
    secretName: acd-manager-tls

mib-frontend:
  ingress:
    hostname: manager.example.com
    tls: true
    secretName: acd-manager-tls

Important: The hostname must match the first entry in global.hosts.manager for Zitadel CORS compatibility. The secret name has a maximum length of 53 characters.
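
The 53-character limit can be checked up front. This quick shell test is a convenience sketch, not part of the product tooling:

```shell
# The chart rejects TLS secret names longer than 53 characters (see the
# note above); fail fast before writing the name into values.yaml.
secret_name="acd-manager-tls"
if [ "${#secret_name}" -le 53 ]; then
  echo "ok: ${#secret_name} characters"
else
  echo "too long: ${#secret_name} characters (max 53)" >&2
fi
```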

Step 13: Deploy the Manager Helm Chart

Deploy the CDN Manager application:

helm install acd-manager /mnt/esb3027/charts/acd-manager --values ~/values.yaml

Note: By default, helm install runs silently until completion. To see real-time output during deployment, add the --debug flag:

helm install acd-manager /mnt/esb3027/charts/acd-manager --values ~/values.yaml --debug

Tip: For better organization, split your configuration into multiple files and specify them with repeated --values flags:

helm install acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values-base.yaml \
  --values ~/values-tls.yaml \
  --values ~/values-autoscaling.yaml

Later files override earlier files, allowing you to maintain a base configuration with environment-specific overrides.
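
To confirm which values win after the files are merged, the chart can be rendered locally before installing; helm template does not contact the cluster. A sketch, reusing the file names from the example above:

```shell
# Render the chart with the merged values and write the manifests to a
# file for review. Later --values files override earlier ones, exactly
# as they will at install time.
preview_release() {
  helm template acd-manager /mnt/esb3027/charts/acd-manager \
    --values ~/values-base.yaml \
    --values ~/values-tls.yaml \
    --values ~/values-autoscaling.yaml > /tmp/rendered-manifests.yaml
}

if command -v helm >/dev/null 2>&1; then
  preview_release || true
fi
```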

Monitor the deployment progress:

kubectl get pods

Wait for all pods to show Running status before proceeding.
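
Instead of polling by hand, kubectl wait can block until the pods are Ready; the 10-minute timeout below is an assumption to tune for your hardware:

```shell
# Wait for every pod to report Ready. One-shot jobs (e.g. zitadel-init)
# finish in the Succeeded phase and never become Ready, so filter them out.
wait_for_pods() {
  kubectl wait pod --all \
    --for=condition=Ready \
    --field-selector=status.phase!=Succeeded \
    --timeout=600s
}

if command -v kubectl >/dev/null 2>&1; then
  wait_for_pods || true
fi
```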

Note: The default Helm timeout is 5 minutes. If the installation fails due to a rollout timeout, retry with a larger timeout value:

helm install acd-manager /mnt/esb3027/charts/acd-manager --values ~/values.yaml --timeout 10m

If a previous installation attempt failed and you receive an error that the release name is already in use, uninstall the previous release before retrying:

helm uninstall acd-manager
helm install acd-manager /mnt/esb3027/charts/acd-manager --values ~/values.yaml

Step 14: Verify Deployment

Verify all application pods are running:

kubectl get pods

Note: During the initial deployment, several pods may enter a CrashLoopBackOff state depending on the timing of other containers starting up. This is expected behavior, as some services wait for dependencies (such as databases or Kafka) to become available. The deployment should stabilize automatically after a few minutes.

Verify pods are distributed across nodes:

kubectl get pods -o wide

Expected output for a 3-node cluster (pod names will vary):

NAME                                              READY   STATUS      RESTARTS        AGE
acd-cluster-postgresql-1                          1/1     Running     0               11m
acd-cluster-postgresql-2                          1/1     Running     0               11m
acd-cluster-postgresql-3                          1/1     Running     0               10m
acd-manager-5b98d569d9-2pbph                      1/1     Running     0               3m
acd-manager-5b98d569d9-m54f9                      1/1     Running     0               3m
acd-manager-5b98d569d9-pq26f                      1/1     Running     0               3m
acd-manager-confd-6fb78548c4-xnrh4                1/1     Running     0               3m
acd-manager-gateway-8bc8446fc-chs26               1/1     Running     0               3m
acd-manager-gateway-8bc8446fc-wzrml               1/1     Running     0               3m
acd-manager-kafka-controller-0                    2/2     Running     0               3m
acd-manager-kafka-controller-1                    2/2     Running     0               3m
acd-manager-kafka-controller-2                    2/2     Running     0               3m
acd-manager-metrics-aggregator-76d96c4964-lwdcj   1/1     Running     2               3m
acd-manager-mib-frontend-7bdb69684b-6qxn8         1/1     Running     0               3m
acd-manager-mib-frontend-7bdb69684b-pkjrw         1/1     Running     0               3m
acd-manager-redis-master-0                        2/2     Running     0               3m
acd-manager-redis-replicas-0                      2/2     Running     0               3m
acd-manager-selection-input-5fb694b857-qxt67      1/1     Running     2               3m
acd-manager-zitadel-8448b4c4fc-2pkd8              1/1     Running     0               3m
acd-manager-zitadel-8448b4c4fc-vchp9              1/1     Running     0               3m
acd-manager-zitadel-init-hh6j7                    0/1     Completed   0               4m
acd-manager-zitadel-setup-nwp8k                   0/2     Completed   0               4m
alertmanager-0                                    1/1     Running     0               3m
grafana-6d948cfdc6-77ggk                          1/1     Running     0               3m
telegraf-54779f5f46-2jfj5                         1/1     Running     0               3m
victoria-metrics-agent-dc87df588-tn8wv            1/1     Running     0               3m
victoria-metrics-alert-757c44c58f-kk9lp           1/1     Running     0               3m
victoria-metrics-longterm-server-0                1/1     Running     0               3m
victoria-metrics-server-0                         1/1     Running     0               3m

Note: Init pods (such as zitadel-init and zitadel-setup) will show Completed status after successful initialization. This is expected behavior. Some pods may show restart counts as they wait for dependencies to become available.

Step 15: Configure DNS (Optional)

Add DNS records for the manager hostname. For high availability, configure multiple A records pointing to different server nodes:

manager.example.com.  IN  A  <server-1-ip>
manager.example.com.  IN  A  <server-2-ip>
manager.example.com.  IN  A  <server-3-ip>

Alternatively, configure a load balancer to distribute traffic across nodes.
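
Once the records are in place, resolution can be spot-checked from a client machine. The hostname below is the documentation example; substitute your own:

```shell
# List the unique IPv4 addresses the system resolver returns for the
# manager hostname; with round-robin A records, all node IPs should appear.
manager_ips() {
  getent ahostsv4 manager.example.com | awk '{print $1}' | sort -u
}

manager_ips || true
```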

Post-Installation

After installation completes, proceed to the Next Steps guide for:

  • Initial user configuration
  • Accessing the web interfaces
  • Configuring authentication
  • Setting up monitoring

Accessing the System

Refer to the Accessing the System section in the Getting Started guide for service URLs and default credentials.

Note: A self-signed SSL certificate is deployed by default. For production deployments, configure a valid SSL certificate before exposing the system to users.

High Availability Considerations

Pod Distribution

The Helm chart configures pod anti-affinity rules to ensure:

  • Kafka controllers are scheduled on separate nodes
  • PostgreSQL cluster members are distributed across nodes
  • Application pods are spread across available nodes

Data Replication and Failure Tolerance

For detailed information on data replication strategies and failure scenario tolerance, refer to the Architecture Guide and System Requirements Guide.

Troubleshooting

If pods fail to start or nodes fail to join:

  1. Check node status: kubectl get nodes
  2. Describe problematic pods: kubectl describe pod <pod-name>
  3. Review logs: kubectl logs <pod-name>
  4. Check cluster events: kubectl get events --sort-by='.lastTimestamp'

See the Troubleshooting Guide for additional assistance.

Next Steps

After successful installation:

  1. Next Steps Guide - Post-installation configuration
  2. Configuration Guide - System configuration
  3. Operations Guide - Day-to-day operations

5.4 - Air-Gapped Deployment Guide

Installation procedures for air-gapped environments

Overview

This guide describes the installation of the AgileTV CDN Manager in air-gapped environments (no internet access). Air-gapped deployments require additional preparation compared to connected deployments.

Key differences from connected deployments:

  • Both Installation ISO and Extras ISO are required
  • OS installation ISO must be mounted on all nodes
  • Container images must be loaded from the Extras ISO on each node
  • Additional firewall considerations for OS package repositories

Prerequisites

Required ISOs

Before beginning installation, obtain the following:

ISO                   Filename                                Purpose
Installation ISO      esb3027-acd-manager-X.Y.Z.iso           Kubernetes cluster and Manager application
Extras ISO            esb3027-acd-manager-extras-X.Y.Z.iso    Container images for air-gapped environments
OS Installation ISO   RHEL 9 or compatible clone              Operating system packages (required on all nodes)

Single-Node vs Multi-Node

The air-gapped procedures in this guide apply to both single-node and multi-node deployments.

Network Configuration

Air-gapped environments may have internal network mirrors for OS packages. If no internal mirror exists, the OS installation ISO must be mounted on each node to provide packages during installation.

Air-Gapped Installation Steps

Step 1: Prepare All Nodes

On each node (primary server, additional servers, and agents):

  1. Mount the OS installation ISO:

    mkdir -p /mnt/os
    mount -o loop,ro /path/to/rhel-9.iso /mnt/os
    
  2. Configure local repository (if no internal mirror):

    cat > /etc/yum.repos.d/local.repo <<EOF
    [local]
    name=Local OS Repository
    baseurl=file:///mnt/os/BaseOS
    enabled=1
    gpgcheck=0
    EOF
    
  3. Verify repository is accessible:

    dnf repolist
    
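Note that the repo file above only exposes the BaseOS tree. RHEL 9 media also ship an AppStream tree, and some dependencies are published there; if that applies to your ISO (an assumption about its layout), a second section can be appended:

```shell
# Append an AppStream section alongside the BaseOS repo configured above.
# The /mnt/os path matches the mount point used in step 1.
if [ -d /mnt/os/AppStream ]; then
cat >> /etc/yum.repos.d/local.repo <<'EOF'
[local-appstream]
name=Local OS AppStream Repository
baseurl=file:///mnt/os/AppStream
enabled=1
gpgcheck=0
EOF
dnf repolist
fi
```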

Step 2: Mount Installation ISOs

On the primary server node first, then each additional node:

# Mount Installation ISO
mkdir -p /mnt/esb3027
mount -o loop,ro esb3027-acd-manager-X.Y.Z.iso /mnt/esb3027

# Mount Extras ISO
mkdir -p /mnt/esb3027-extras
mount -o loop,ro esb3027-acd-manager-extras-X.Y.Z.iso /mnt/esb3027-extras

Step 3: Install Kubernetes Cluster

Primary Server Node

/mnt/esb3027/install

Wait for the installer to complete and verify system pods are running:

kubectl get nodes
kubectl get pods -n kube-system
kubectl get pods -n longhorn-system

Additional Server Nodes (Multi-Node Only)

On each additional server node:

/mnt/esb3027/join-server https://<primary-server-ip>:6443 <node-token>

Agent Nodes (Optional)

On each agent node:

/mnt/esb3027/join-agent https://<primary-server-ip>:6443 <node-token>

Step 4: Load Container Images

On each node in the cluster:

/mnt/esb3027-extras/load-images

This script loads all container images from the Extras ISO into the local container runtime.

Important: This step must be performed on every node (primary server, additional servers, and agents) before deploying the Manager application.

Step 5: Create Configuration File

Create a Helm values file for your deployment. At minimum, configure the manager hostname and router addresses:

# ~/values.yaml
global:
  hosts:
    manager:
      - host: manager.local
    routers:
      - name: default
        address: 127.0.0.1

# Single-node: Disable Kafka replication
kafka:
  replicaCount: 1
  controller:
    replicaCount: 1

For multi-node deployments, see the Multi-Node Installation Guide for complete configuration requirements.

Step 6: Deploy the Manager

Deploy the CDN Manager Helm chart:

helm install acd-manager /mnt/esb3027/charts/acd-manager --values ~/values.yaml

Monitor the deployment progress:

kubectl get pods --watch

Wait for all pods to show Running status before proceeding.

Step 7: Verify Deployment

Verify all application pods are running:

kubectl get pods

All pods should show Running status (except init pods which show Completed).

Post-Installation

After installation completes:

  1. Access the system via HTTPS at https://<manager-host>
  2. Configure authentication via Zitadel at https://<manager-host>/ui/console
  3. Set up monitoring via Grafana at https://<manager-host>/grafana

See the Next Steps Guide for detailed post-installation configuration.

Loading MaxMind GeoIP Databases

If using GeoIP-based routing, load the MaxMind databases:

/mnt/esb3027/generate-maxmind-volume

The utility will prompt for the database file locations and volume name. Reference the volume in your values.yaml:

manager:
  maxmindDbVolume: maxmind-geoip-2026-04

See the Operations Guide for database update procedures.

Troubleshooting

Image Pull Errors

If pods fail with image pull errors:

  1. Verify the load-images script completed successfully on all nodes
  2. Check container runtime image list:
    crictl images | grep <image-name>
    
  3. Ensure image tags in Helm chart match tags on the Extras ISO

OS Package Errors

If the installer reports missing OS packages:

  1. Verify OS ISO is mounted on the affected node
  2. Check repository configuration:
    dnf repolist
    dnf info <package-name>
    
  3. Ensure the ISO matches the installed OS version

Longhorn Volume Issues

If Longhorn volumes fail to mount:

  1. Verify the load-images script completed successfully on all nodes
  2. Check Longhorn system pods:
    kubectl get pods -n longhorn-system
    
  3. Review Longhorn UI via port-forward:
    kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80
    

Next Steps

After successful installation:

  1. Next Steps Guide - Post-installation configuration
  2. Operations Guide - Day-to-day operational procedures
  3. Troubleshooting Guide - Common issues and resolution

5.5 - Upgrade Guide

Upgrading the CDN Manager to a newer version

Overview

This guide describes the procedure for upgrading the AgileTV CDN Manager (ESB3027) to a newer version. The upgrade process involves updating the Kubernetes cluster components and redeploying the Helm chart with the new version.

Prerequisites

Backup Requirements

Before beginning any upgrade, ensure you have:

  • PostgreSQL Backup: Verify recent backups are available via the CloudNativePG operator
  • Configuration Backup: Save your current values.yaml file(s)
  • TLS Certificates: Ensure certificate files are backed up
  • MaxMind Volumes: Note the current volume names if using GeoIP databases

Version Compatibility

Review the Release Notes for the target version to check for:

  • Breaking changes requiring manual intervention
  • Required intermediate upgrade steps
  • New configuration options that should be set

Cluster Health

Verify the cluster is healthy before upgrading:

kubectl get nodes
kubectl get pods
kubectl get pvc

All nodes should show Ready status and all pods should be Running (or Completed for job pods).
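
The same health check can be reduced to a single filter that prints only problem pods; empty output means the cluster is ready to upgrade:

```shell
# Print any pod whose STATUS is neither Running nor Completed, across all
# namespaces (columns: NAMESPACE NAME READY STATUS RESTARTS AGE; STATUS is $4).
unhealthy_pods() {
  kubectl get pods --all-namespaces --no-headers 2>/dev/null \
    | awk '$4 != "Running" && $4 != "Completed" {print $1 "/" $2 ": " $4}'
}

if command -v kubectl >/dev/null 2>&1; then
  unhealthy_pods || true
fi
```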

Upgrade Methods

There are three upgrade methods available. Choose the one that best fits your situation:

Method            Downtime   Use Case
Rolling Upgrade   Minimal    Patch releases; minor version upgrades; configuration updates
Clean Upgrade     Brief      Major version upgrades; component changes; troubleshooting
Full Reinstall    Extended   Cluster rebuilds; troubleshooting persistent issues; ensuring clean state

Method Selection Guidance:

  • Rolling Upgrade (Method 1) is the default choice for most upgrades. Use this for patch releases (e.g., 1.6.0 → 1.6.1) and even minor version upgrades (e.g., 1.4.0 → 1.6.0) where no breaking changes are documented. This method preserves all existing resources and performs an in-place update. Note: This method supports Helm’s automatic rollback (helm rollback) if the upgrade fails, allowing quick recovery to the previous state.

  • Clean Upgrade (Method 2) is recommended for major version upgrades (e.g., 1.x → 2.x) or when the release notes indicate significant component changes. This method ensures all resources are recreated with the new version, avoiding potential issues with stale configurations. Also use this method when troubleshooting upgrade failures from Method 1.

  • Full Reinstall (Method 3) should only be used when a completely clean cluster state is required. This includes troubleshooting persistent cluster-level issues, recovering from failed upgrades that cannot be rolled back, or when migrating between significantly different deployment configurations. This method requires verified backups and should be planned for extended downtime.

Method 1: Rolling Upgrade (Helm Upgrade)

This method performs an in-place rolling upgrade with minimal downtime. All upgrade commands are executed from the primary server node.

Step 1: Obtain the New Installation ISO

Unmount the old ISO (if mounted) and mount the new installation ISO:

umount /mnt/esb3027 2>/dev/null || true
mount -o loop,ro esb3027-acd-manager-X.Y.Z.iso /mnt/esb3027

Replace X.Y.Z with the target version number.

Step 2: Update Containers and Cluster Software

Run the installation script to update the container images and cluster software:

/mnt/esb3027/install

Wait for the script to complete.

Step 3: Air-Gapped Environments (If Applicable)

If upgrading in an air-gapped environment, also mount and load the extras ISO:

# Mount the Extras ISO
mkdir -p /mnt/esb3027-extras
mount -o loop,ro esb3027-acd-manager-extras-X.Y.Z.iso /mnt/esb3027-extras

# Load container images from the extras ISO
/mnt/esb3027-extras/load-images

Replace X.Y.Z with the target version number.

Step 4: Review and Update Configuration

Compare the default values.yaml from the new ISO with your current configuration:

diff /mnt/esb3027/values.yaml ~/values.yaml

Update your configuration file to include any new required settings. Common updates include:

# ~/values.yaml
global:
  hosts:
    manager:
      - host: manager.example.com
    routers:
      - name: director-1
        address: 192.0.2.1

zitadel:
  zitadel:
    ExternalDomain: manager.example.com

# Add any new required settings for the target version

Important: Do not modify settings unrelated to the upgrade unless specifically documented in the release notes.

Step 5: Update MaxMind GeoIP Volumes (If Applicable)

If you use MaxMind GeoIP databases, use the utility from the new ISO to create an updated volume:

/mnt/esb3027/generate-maxmind-volume

Update your values.yaml to reference the new volume name:

manager:
  maxmindDbVolume: maxmind-geoip-2026-04

Tip: Using dated or versioned volume names (e.g., maxmind-geoip-2026-04) allows you to create new volumes during upgrades and delete old ones after verification.

Step 6: Update TLS Certificates (If Needed)

If your TLS certificates need renewal or the new version requires certificate updates, create or update the secret:

kubectl create secret tls acd-manager-tls --cert=tls.crt --key=tls.key --dry-run=client -o yaml | kubectl apply -f -

Step 7: Upgrade the Helm Release

Perform a Helm upgrade with the new chart:

helm upgrade acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml

Note: The upgrade performs a rolling update of each deployment in the chart. Deployments are upgraded one at a time, with pods being terminated and recreated sequentially. StatefulSets (PostgreSQL, Kafka, Redis) roll out one pod at a time to maintain data availability.

Monitor the upgrade progress:

kubectl get pods --watch

Wait for all pods to stabilize and show Running status before considering the upgrade complete. Some pods may temporarily enter CrashLoopBackOff during the transition as they wait for dependencies to become available.
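
To follow individual Deployments rather than the raw pod list, kubectl rollout status blocks until each rollout finishes or times out. The deployment names below are taken from the pod listing earlier in this guide and may differ in your release:

```shell
# Track each Deployment's rolling update to completion; a non-zero exit
# means the rollout stalled and needs investigation.
watch_rollouts() {
  for d in acd-manager acd-manager-gateway acd-manager-mib-frontend; do
    kubectl rollout status deployment/"$d" --timeout=600s || return 1
  done
}

if command -v kubectl >/dev/null 2>&1; then
  watch_rollouts || true
fi
```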

Step 8: Verify the Upgrade

Check the deployed version:

helm list
kubectl get deployments -o wide

Verify application functionality:

  • Access the MIB Frontend and confirm it loads
  • Test API connectivity
  • Verify Grafana dashboards are accessible
  • Check that Zitadel authentication is working
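
The API connectivity check can be scripted against the gateway's readiness endpoint (the same endpoint used in the Next Steps guide). Replace the example hostname, and drop -k once a trusted certificate is installed:

```shell
# Expect {"status": "ready"} from a healthy gateway. -s silences progress
# output; -k is only needed while the self-signed certificate is in place.
check_api() {
  curl -ks "https://manager.example.com/api/v1/health/ready"
}

if command -v curl >/dev/null 2>&1; then
  check_api || true
  echo
fi
```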

Step 9: Clean Up

After confirming the upgrade is successful:

  1. Unmount the old ISO (if still mounted):

    umount /mnt/esb3027
    
  2. Delete old MaxMind volumes (if replaced):

    kubectl get pvc
    kubectl delete pvc <old-volume-name>
    
  3. Remove old configuration files if no longer needed.


Method 2: Clean Upgrade (Helm Uninstall/Install)

This method removes the existing Helm release before installing the new version. This is useful for major version upgrades or when troubleshooting upgrade issues. All upgrade commands are executed from the primary server node.

Warning: This method causes brief downtime as all resources are deleted before reinstallation.

Step 1: Obtain the New Installation ISO

Mount the new installation ISO:

umount /mnt/esb3027 2>/dev/null || true
mount -o loop,ro esb3027-acd-manager-X.Y.Z.iso /mnt/esb3027

Step 2: Backup Configuration

Save your current Helm values:

helm get values acd-manager -o yaml > ~/values-backup.yaml

Step 3: Uninstall the Existing Release

Remove the existing Helm release:

helm uninstall acd-manager

Wait for pods to terminate:

kubectl get pods --watch

Note: Helm uninstall does not remove PersistentVolumes (PVs) or PersistentVolumeClaims (PVCs). All data stored in PostgreSQL, Kafka, Redis, and Longhorn volumes is preserved during the uninstall process. When the new version is installed, it will reattach to the existing PVCs and restore data automatically.

Step 4: Review and Update Configuration

Compare the default values.yaml from the new ISO with your configuration:

diff /mnt/esb3027/values.yaml ~/values.yaml

Update your configuration file as needed.

Step 5: Install the New Release

Install the new version:

helm install acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml

Monitor the deployment:

kubectl get pods --watch

Wait for all pods to stabilize before proceeding.

Step 6: Verify the Upgrade

Verify the upgrade as described in Method 1, Step 8.

Method 3: Full Reinstall (Cluster Rebuild)

This method completely removes Kubernetes and reinstalls from scratch. Use only for cluster rebuilds or when other upgrade methods fail.

Warning: This method causes extended downtime and permanent data loss. The K3s uninstall process destroys all Longhorn PersistentVolumes (PVs) and PersistentVolumeClaims (PVCs). All data stored in PostgreSQL, Kafka, Redis, and application volumes will be permanently lost. Verified backups are required before proceeding.

Warning: This method should only be used when necessary. Ensure you have verified backups before proceeding.

Step 1: Stop Kubernetes Services

On all nodes, stop the K3s service. The service name differs between server and agent nodes:

systemctl stop k3s        # server nodes
systemctl stop k3s-agent  # agent nodes

Step 2: Uninstall K3s

On the primary server node first, then each additional server node:

/usr/local/bin/k3s-uninstall.sh

On each agent node:

/usr/local/bin/k3s-agent-uninstall.sh

Step 3: Clean Up Residual State (All Nodes)

On all nodes, remove any residual state. The uninstall scripts invoke the kill-all step themselves; run it manually only if it is still present:

/usr/local/bin/k3s-killall.sh
rm -rf /var/lib/rancher/k3s/*

Warning: This removes all cluster data including Longhorn PersistentVolumes (PVs) and PersistentVolumeClaims (PVCs). All data stored in PostgreSQL, Kafka, Redis, and application volumes will be permanently lost. Ensure verified backups are available before proceeding.

Step 4: Reinstall K3s Cluster and Deploy Manager

Follow the installation procedure in the Installation Guide to reinstall the cluster and deploy the Helm chart. At this point, you are in the same state as a fresh installation:

  • Primary server installation
  • Additional server joins (if applicable)
  • Agent joins (if applicable)
  • Helm chart deployment

Note: The K3s node token is regenerated during reinstallation. Retrieve the new token from /var/lib/rancher/k3s/server/node-token on the primary server after installation if you need to join additional nodes.


Rollback Procedure

Rollback procedures vary by upgrade method:

Method 1 (Rolling Upgrade)

Use Helm’s built-in rollback command:

helm rollback acd-manager

This reverts to the previous Helm release revision automatically.
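
If the state you want is older than the immediately previous revision, the target can be chosen explicitly. The revision number is read from helm history; the one shown here is a placeholder:

```shell
# List release revisions, then roll back to a specific one.
rollback_to_revision() {
  helm history acd-manager
  helm rollback acd-manager "$1"
}

# Example (revision 3 is a placeholder):
# rollback_to_revision 3
```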

Or manually redeploy the previous version:

helm upgrade acd-manager /mnt/esb3027-old/charts/acd-manager \
  --values ~/values.yaml

Note: If you use multiple --values files for organization, ensure they are specified in the same order as the original installation.

Method 2 (Clean Upgrade)

Reinstall the previous version:

helm uninstall acd-manager
helm install acd-manager /mnt/esb3027-old/charts/acd-manager \
  --values ~/values-backup.yaml

Method 3 (Full Reinstall)

Rollback requires repeating the full cluster reinstall procedure using the old installation ISO. Follow Method 3 steps with the previous version’s ISO. Ensure verified backups are available before attempting.

Troubleshooting

Pods Fail to Start

  1. Check pod status and events:

    kubectl describe pod <pod-name>
    kubectl get events --sort-by='.lastTimestamp'
    
  2. Review pod logs:

    kubectl logs <pod-name>
    kubectl logs <pod-name> -p  # Previous instance logs
    

Database Migration Issues

If PostgreSQL migrations fail:

  1. Check CloudNativePG cluster status:

    kubectl get clusters
    kubectl describe cluster <cluster-name>
    
  2. Review migration job logs:

    kubectl get jobs
    kubectl logs job/<migration-job-name>
    

Helm Upgrade Fails

If helm upgrade fails:

  1. Check Helm release status:

    helm status acd-manager
    helm history acd-manager
    
  2. Review the error message for specific failures

  3. Attempt rollback if necessary

Post-Upgrade

After a successful upgrade:

  1. Review the Release Notes for any post-upgrade tasks
  2. Update monitoring dashboards if new metrics are available
  3. Test all critical functionality
  4. Document the upgrade in your change management system

Next Steps

After completing the upgrade:

  1. Next Steps Guide - Review post-installation tasks
  2. Operations Guide - Day-to-day operational procedures
  3. Release Notes - Review new features and changes

5.6 - Next Steps

Post-installation configuration tasks

Overview

After completing the installation of the AgileTV CDN Manager (ESB3027), several post-installation configuration tasks must be performed before the system is ready for production use. This guide walks you through the essential next steps.

Prerequisites

Before proceeding, ensure:

  • The CDN Manager Helm chart is successfully deployed
  • All pods are in Running status
  • You have network access to the cluster hostname or IP
  • You have the default credentials available

Step 1: Access Zitadel Console

The first step is to configure user authentication through Zitadel Identity and Access Management (IAM).

  1. Navigate to the Zitadel Console:

    https://<manager-host>/ui/console
    

    Replace <manager-host> with your configured hostname (e.g., manager.local or manager.example.com).

    Important: The <manager-host> must match the first entry in global.hosts.manager from your Helm values exactly. Zitadel uses name-based virtual hosting and CORS validation. If the hostname does not match, authentication will fail.

  2. Log in with the default administrator credentials (also listed in the Glossary):

    • Username: admin@agiletv.dev
    • Password: Password1!
  3. Important: If prompted to configure Multi-Factor Authentication (MFA), you must skip this step for now. MFA is not currently supported. Attempting to configure MFA may lock you out of the administrator account.

  4. Security Recommendation: After logging in, create a new administrator account with proper roles. Once verified, disable or delete the default admin@agiletv.dev account. For details on required roles and administrator permissions, see Zitadel’s Administrator Documentation.

Step 2: Configure SMTP Settings

Zitadel requires an SMTP server to send email notifications and perform email validations.

  1. In the Zitadel Console, navigate to Settings > Default Settings

  2. Configure the SMTP settings:

    • SMTP Host: Your mail server hostname
    • SMTP Port: Typically 587 (TLS) or 465 (SSL)
    • SMTP Username: Mail account username
    • SMTP Password: Mail account password
    • Sender Address: Email address for outgoing mail (e.g., noreply@example.com)
  3. Save the configuration

Note: Without SMTP configuration, email-based user validation and password recovery features will not function.

Step 3: Create Additional User Accounts

Create user accounts for operators and administrators:

Tip: For detailed guidance on managing users, roles, and permissions in the Zitadel Console, see Zitadel’s User Management Documentation.

  1. In the Zitadel Console, navigate to Users > Add User

  2. Fill in the user details:

    • Username: Unique username
    • First Name: User’s first name
    • Last Name: User’s last name
    • Email: User’s email address (this is their login username)

    Known Issue: Due to a limitation in this release of Zitadel, the username must match the local part (the portion before the @) of the email address. For example, if the email is foo@example.com, the username must be foo.

    If these do not match, Zitadel may allow login with the mismatched local part while blocking the full email address. For instance, if username is foo but email is foo.bar@example.com, login with foo@example.com may succeed while foo.bar@example.com is blocked.

    Workaround: Always ensure the username matches the email local part exactly.

  3. Important: The following options must be configured:

    • Email Verified: Check this box to skip email verification
    • Set Initial Password: Enter a temporary password for the user

    Note: If you configured SMTP settings in Step 2, the user will receive an email asking to verify their address and set their initial password. If SMTP is not configured, you must check the “Email Verified” box and set an initial password manually, otherwise the user account will not be enabled.

  4. Click Create User

  5. Provide the user with:

    • Their username
    • The temporary password (if set manually)
    • The Zitadel Console URL
  6. Instruct the user to change their password on first login

Step 4: Configure User Roles and Permissions

Zitadel manages roles and permissions for accessing the CDN Manager:

  1. In the Zitadel Console, navigate to Roles

  2. Assign appropriate roles to users:

    • Admin: Full administrative access
    • Operator: Operational access without administrative functions
    • Viewer: Read-only access
  3. To assign a role:

    • Select the user
    • Click Add Role
    • Select the appropriate role
    • Save the assignment

Step 5: Access the MIB Frontend

The MIB Frontend is the web-based configuration GUI for CDN operators:

  1. Navigate to the MIB Frontend:

    https://<manager-host>/gui
    
  2. Log in using your Zitadel credentials

  3. Verify you can access the configuration interface

Step 6: Verify API Access

Test API connectivity to ensure the system is functioning:

curl -k https://<manager-host>/api/v1/health/ready

Expected response:

{
  "status": "ready"
}

See the API Guide for detailed API documentation.

Step 7: Configure TLS Certificates (If Not Done During Installation)

For production deployments, a valid TLS certificate from a trusted Certificate Authority should be configured. If you did not configure TLS certificates during installation, refer to Step 12: Configure TLS Certificates in the Installation Guide.

Step 8: Set Up Monitoring and Alerting

Configure monitoring dashboards and alerting:

  1. Access Grafana:

    • Navigate to https://<manager-host>/grafana
    • Log in with default credentials (also listed in the Glossary):
      • Username: admin
      • Password: edgeware
  2. Review Pre-built Dashboards:

    • System health dashboards are included by default
    • CDN metrics dashboards show routing and usage statistics

    Note: CDN Director instances automatically have DNS names configured for use in Grafana dashboards. The DNS name is derived from the name field in global.hosts.routers with .external appended. For example, a router named my-router-1 will have the DNS name my-router-1.external in Grafana configuration.

Step 9: Verify Kafka and PostgreSQL Health

Ensure the data layer components are healthy:

kubectl get pods

Verify the following pods are running:

| Component | Pod Name Pattern | Expected Status |
|---|---|---|
| Kafka | acd-manager-kafka-controller-* | Running (3 pods for production) |
| PostgreSQL | acd-cluster-postgresql-0, acd-cluster-postgresql-1, acd-cluster-postgresql-2 | Running (3-node HA cluster) |
| Redis | acd-manager-redis-master-* | Running |

All pods should show Running status with no restarts.
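As a quick check, restart counts can be flagged directly from the pod listing. This is a sketch that assumes the default `kubectl get pods` column layout, where RESTARTS is the fourth column:

```shell
# Flag any pod with a non-zero RESTARTS count (column 4 in default output)
kubectl get pods --no-headers | awk '$4 > 0 { print $1, "restarts:", $4 }'
```

An empty result means no pod has restarted since deployment.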

Step 10: Configure Availability Zones (Optional)

For improved network performance, configure availability zones to enable Topology Aware Hints. This optimizes service-to-pod routing by keeping traffic within the same zone when possible.

See the Performance Tuning Guide for detailed instructions on:

  • Labeling nodes with zone and region topology
  • Verifying topology configuration
  • Requirements for Topology Aware Hints to activate
  • Integration with pod anti-affinity rules

Note: This step is optional. If zone labels are not configured, the system will fall back to random load-balancing.

Step 11: Review System Configuration

Verify the initial configuration:

  1. Review Helm Values:

    helm get values acd-manager -o yaml
    
  2. Check Ingress Configuration:

    kubectl get ingress
    
  3. Verify Service Endpoints:

    kubectl get endpoints
    

Step 12: Document Your Deployment

Maintain documentation for your deployment:

  • Cluster hostname and IP addresses
  • Configuration file locations
  • User accounts and roles created
  • TLS certificate expiration dates
  • Backup procedures and schedules
  • Monitoring and alerting contacts

Next Steps

After completing post-installation configuration:

  1. Configuration Guide - Detailed system configuration options
  2. Operations Guide - Day-to-day operational procedures
  3. Metrics & Monitoring Guide - Comprehensive monitoring setup
  4. API Guide - REST API reference and integration examples

Troubleshooting

Cannot Access Zitadel Console

  • Verify DNS resolution or hosts file configuration
  • Check that Traefik ingress is running: kubectl get pods -n kube-system | grep traefik
  • Review Traefik logs: kubectl logs -n kube-system -l app.kubernetes.io/name=traefik

Authentication Failures

  • Verify Zitadel pods are healthy: kubectl get pods | grep zitadel
  • Check Zitadel logs: kubectl logs <zitadel-pod-name>
  • Ensure the external domain matches your hostname in Zitadel configuration

MIB Frontend Not Loading

  • Verify MIB Frontend pods are running: kubectl get pods | grep mib-frontend
  • Check for connectivity issues to Confd and API services
  • Review browser console for JavaScript errors

API Returns 401 Unauthorized

  • Verify you have a valid bearer token
  • Check token expiration
  • Ensure Zitadel authentication is functioning

For additional troubleshooting assistance, refer to the Troubleshooting Guide.

6 - Configuration Guide

Helm chart configuration reference

Overview

The CDN Manager is deployed via Helm chart with configuration supplied through values.yaml files. This guide explains the configuration structure, how to apply changes, and provides a reference for all configurable options.

Configuration Files

Default Configuration

The default values.yaml file is located on the installation ISO at /mnt/esb3027/values.yaml. This file contains all default values and should be copied to a writable location for modification:

cp /mnt/esb3027/values.yaml ~/values.yaml

Important: You only need to specify fields in your custom values.yaml that differ from the default. Helm applies configuration hierarchically:

  1. Default values from the Helm chart itself
  2. Values from the default values.yaml on the ISO
  3. Values from your custom values.yaml file(s)

For example, if you only need to change the manager hostname and router addresses, your custom values.yaml might contain only:

global:
  hosts:
    manager:
      - host: manager.example.com
    routers:
      - name: default
        address: 192.0.2.1

All other configuration values will be inherited from the default values.yaml on the ISO. This approach simplifies upgrades, as you only maintain your customizations.

Configuration Merging

Helm merges configuration files from left to right, with later files overriding earlier values. This allows you to:

  • Maintain a base configuration with common settings
  • Create environment-specific override files
  • Keep the default chart values for unchanged settings
# Multiple files merged left-to-right
helm install acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values-base.yaml \
  --values ~/values-production.yaml \
  --values ~/values-tls.yaml

Individual Value Overrides

For temporary changes, you can override individual values with --set:

helm upgrade acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml \
  --set manager.logLevel=debug

Note: Using --set is discouraged for permanent changes, as the same arguments must be specified for every Helm operation.

Applying Configuration

Initial Installation

helm install acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml

Updating Configuration

helm upgrade acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml

Dry Run

Before applying changes, validate the configuration with a dry run:

helm upgrade acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml \
  --dry-run

Rollback

If an upgrade fails, rollback to the previous revision:

# View revision history
helm history acd-manager

# Rollback to previous revision
helm rollback acd-manager

# Rollback to specific revision
helm rollback acd-manager <revision_number>

Note: Rollback reverts the Helm release but does not modify your values.yaml file. You must manually revert configuration file changes.

Force Reinstall

If an upgrade fails and rollback is not sufficient, you can perform a clean reinstall:

helm uninstall acd-manager
helm install acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml

Warning: This is service-affecting as all pods will be destroyed and recreated.

Configuration Reference

Global Settings

The global section contains cluster-wide settings. The most critical configuration is global.hosts.

global:
  hosts:
    manager:
      - host: manager.local
    routers:
      - name: default
        address: 127.0.0.1
    edns_proxy: []
    geoip: []

| Key | Type | Description |
|---|---|---|
| global.hosts.manager | Array | External IP addresses or DNS hostnames for all Manager cluster nodes |
| global.hosts.routers | Array | CDN Director (ESB3024) instances |
| global.hosts.edns_proxy | Array | EDNS Proxy addresses (currently unused) |
| global.hosts.geoip | Array | GeoIP Proxy addresses for Frontend GUI |

Important: The first entry in global.hosts.manager must match zitadel.zitadel.ExternalDomain exactly. Zitadel enforces CORS protection, and authentication will fail if these do not match.
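For example, a values.yaml where the two settings agree might look like the following sketch (manager.example.com is a placeholder hostname; the zitadel key path follows the note above):

```yaml
global:
  hosts:
    manager:
      - host: manager.example.com   # first entry in global.hosts.manager
zitadel:
  zitadel:
    ExternalDomain: manager.example.com   # must match the first manager host exactly
```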

Manager Configuration

Core Manager API server settings:

| Key | Type | Default | Description |
|---|---|---|---|
| manager.image.registry | String | ghcr.io | Container image registry |
| manager.image.repository | String | edgeware/acd-manager | Container image repository |
| manager.image.tag | String | (empty) | Image tag override (uses latest if empty) |
| manager.logLevel | String | info | Log level (trace, debug, info, warn, error) |
| manager.replicaCount | Number | 1 | Number of replicas (HPA manages this when enabled) |
| manager.containerPorts.http | Number | 80 | HTTP container port |
| manager.maxmindDbVolume | String | (empty) | Name of PVC containing MaxMind GeoIP databases |

Manager Resources

The chart supports both resource presets and explicit resource specifications:

| Key | Type | Default | Description |
|---|---|---|---|
| manager.resourcesPreset | String | (empty) | Resource preset (see Resource Presets table). Ignored if manager.resources is set. |
| manager.resources.requests.cpu | String | 300m | CPU request |
| manager.resources.requests.memory | String | 512Mi | Memory request |
| manager.resources.limits.cpu | String | 1 | CPU limit |
| manager.resources.limits.memory | String | 1Gi | Memory limit |

Note: For production workloads, explicitly set manager.resources rather than using presets.

Manager Datastore

manager:
  datastore:
    type: redis
    namespace: "cdn_manager_ds"
    default_ttl: ""
    compression: zstd

| Key | Type | Default | Description |
|---|---|---|---|
| manager.datastore.type | String | redis | Datastore backend type |
| manager.datastore.namespace | String | cdn_manager_ds | Redis namespace for manager data |
| manager.datastore.default_ttl | String | (empty) | Default TTL for entries |
| manager.datastore.compression | String | zstd | Compression algorithm (none, zstd, etc.) |

Manager Discovery

manager:
  discovery: []
  # Example:
  # - namespace: "other"
  #   hosts:
  #     - other-host1
  #     - other-host2
  #   pattern: "other-.*"

| Key | Type | Description |
|---|---|---|
| manager.discovery | Array | Array of discovery host configurations. Each entry can specify hosts (list of hostnames), pattern (regex pattern), or both |

Manager Tuning

manager:
  tuning:
    enable_cache_control: true
    cache_control_max_age: "5m"
    cache_control_miss_max_age: ""

| Key | Type | Default | Description |
|---|---|---|---|
| manager.tuning.enable_cache_control | Boolean | true | Enable cache control headers in responses |
| manager.tuning.cache_control_max_age | String | 5m | Maximum age for cache control headers |
| manager.tuning.cache_control_miss_max_age | String | (empty) | Maximum age for cache control headers on cache misses |

Manager Container Arguments

manager:
  args:
    - --config-file=/etc/manager/config.toml
    - http-server

Gateway Configuration

NGINX Gateway settings for external Director communication:

| Key | Type | Default | Description |
|---|---|---|---|
| gateway.replicaCount | Number | 1 | Number of gateway replicas |
| gateway.resources.requests.cpu | String | 100m | CPU request |
| gateway.resources.requests.memory | String | 128Mi | Memory request |
| gateway.resources.limits.cpu | String | 150m | CPU limit |
| gateway.resources.limits.memory | String | 192Mi | Memory limit |
| gateway.service.type | String | ClusterIP | Service type |

MIB Frontend Configuration

Web-based configuration GUI settings:

| Key | Type | Default | Description |
|---|---|---|---|
| mib-frontend.enabled | Boolean | true | Enable the frontend GUI |
| mib-frontend.frontend.resourcePreset | String | nano | Resource preset |
| mib-frontend.frontend.autoscaling.hpa.enabled | Boolean | true | Enable HPA |
| mib-frontend.frontend.autoscaling.hpa.minReplicas | Number | 2 | Minimum replicas |
| mib-frontend.frontend.autoscaling.hpa.maxReplicas | Number | 4 | Maximum replicas |

Confd Configuration

Confd settings for configuration management:

| Key | Type | Default | Description |
|---|---|---|---|
| confd.enabled | Boolean | true | Enable Confd |
| confd.service.ports.internal | Number | 15000 | Internal service port |

VictoriaMetrics Configuration

Time-series database for metrics:

| Key | Type | Default | Description |
|---|---|---|---|
| acd-metrics.enabled | Boolean | true | Enable metrics components |
| acd-metrics.victoria-metrics-single.enabled | Boolean | true | Enable VictoriaMetrics |
| acd-metrics.grafana.enabled | Boolean | true | Enable Grafana |
| acd-metrics.telegraf.enabled | Boolean | true | Enable Telegraf |
| acd-metrics.prometheus.enabled | Boolean | true | Enable Prometheus metrics |

Ingress Configuration

Traffic exposure settings:

| Key | Type | Default | Description |
|---|---|---|---|
| ingress.enabled | Boolean | true | Enable ingress record generation |
| ingress.pathType | String | Prefix | Ingress path type |
| ingress.hostname | String | (empty) | Primary hostname (defaults to manager.local via global.hosts) |
| ingress.path | String | /api | Default path for ingress |
| ingress.tls | Boolean | false | Enable TLS configuration |
| ingress.selfSigned | Boolean | false | Generate self-signed certificate via Helm |
| ingress.secrets | Array | (empty) | Custom TLS certificate secrets |

Ingress Extra Paths

The chart includes default extra paths for Confd and GeoIP:

ingress:
  extraPaths:
    - path: /confd
      pathType: Prefix
      backend:
        service:
          name: acd-manager-gateway
          port:
            name: http
    - path: /geoip
      pathType: Prefix
      backend:
        service:
          name: acd-manager-gateway
          port:
            name: http

TLS Certificate Secrets

For production TLS certificates:

ingress:
  secrets:
    - name: manager.local-tls
      key: |-
        -----BEGIN RSA PRIVATE KEY-----
        ...
        -----END RSA PRIVATE KEY-----
      certificate: |-
        -----BEGIN CERTIFICATE-----
        ...
        -----END CERTIFICATE-----
  tls: true

Resource Configuration

Resource Presets

Predefined resource configurations for common deployment sizes:

| Preset | Request CPU | Request Memory | Limit CPU | Limit Memory | Ephemeral Storage Limit |
|---|---|---|---|---|---|
| nano | 100m | 128Mi | 150m | 192Mi | 2Gi |
| micro | 250m | 256Mi | 375m | 384Mi | 2Gi |
| small | 500m | 512Mi | 750m | 768Mi | 2Gi |
| medium | 500m | 1024Mi | 750m | 1536Mi | 2Gi |
| large | 1000m | 2048Mi | 1500m | 3072Mi | 2Gi |
| xlarge | 1000m | 3072Mi | 3000m | 6144Mi | 2Gi |
| 2xlarge | 1000m | 3072Mi | 6000m | 12288Mi | 2Gi |

Note: Limits are calculated as requests plus 50% (except for xlarge/2xlarge and ephemeral-storage).
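Selecting a preset is a one-line values change. A sketch, using the small preset from the table above:

```yaml
manager:
  resourcesPreset: small   # 500m CPU / 512Mi memory requests, 750m / 768Mi limits
```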

Custom Resources

Override preset with custom values:

manager:
  resources:
    requests:
      cpu: "300m"
      memory: "512Mi"
    limits:
      cpu: "1"
      memory: "1Gi"

Note:

  • CPU values use millicores (1000m = 1 core)
  • Memory values use binary SI units (1024Mi = 1GiB)
  • Requests represent minimum guaranteed resources
  • Limits represent maximum consumable resources

Capacity Planning

When sizing resources:

  • Requests determine scheduling (node must have available capacity)
  • Limits prevent resource starvation
  • Maintain 20-30% cluster headroom for scaling
  • Total capacity = sum of all requests × replica count + headroom
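As a worked example with assumed figures (3 replicas requesting 300m CPU each, plus 25% headroom):

```shell
# Total CPU capacity needed: (3 replicas x 300m request) x 1.25 headroom
awk 'BEGIN { req = 3 * 300; total = req * 1.25; printf "%.0f millicores required\n", total }'
# prints "1125 millicores required"
```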

Security Contexts

Pod Security Context

manager:
  podSecurityContext:
    enabled: true
    fsGroup: 1001
    fsGroupChangePolicy: Always
    sysctls: []
    supplementalGroups: []

Container Security Context

manager:
  containerSecurityContext:
    enabled: true
    runAsUser: 1001
    runAsGroup: 1001
    runAsNonRoot: true
    readOnlyRootFilesystem: true
    privileged: false
    allowPrivilegeEscalation: false
    capabilities:
      drop: ["ALL"]
    seccompProfile:
      type: "RuntimeDefault"

Health Probes

Probe Types

| Probe | Purpose | Failure Action |
|---|---|---|
| startupProbe | Initial startup verification | Container restart |
| readinessProbe | Traffic readiness check | Remove from load balancer |
| livenessProbe | Health monitoring | Container restart |

Default Probe Configuration

Liveness Probe

manager:
  livenessProbe:
    enabled: true
    initialDelaySeconds: 5
    periodSeconds: 30
    timeoutSeconds: 10
    failureThreshold: 5
    successThreshold: 1
    httpGet:
      path: /api/v1/health/alive
      port: http

Readiness Probe

manager:
  readinessProbe:
    enabled: true
    initialDelaySeconds: 5
    periodSeconds: 10
    timeoutSeconds: 7
    failureThreshold: 3
    successThreshold: 1
    httpGet:
      path: /api/v1/health/ready
      port: http

Startup Probe

manager:
  startupProbe:
    enabled: true
    initialDelaySeconds: 0
    periodSeconds: 5
    timeoutSeconds: 3
    failureThreshold: 10
    successThreshold: 1
    httpGet:
      path: /api/v1/health/alive
      port: http

Autoscaling Configuration

Horizontal Pod Autoscaler (HPA)

manager:
  autoscaling:
    hpa:
      enabled: true
      minReplicas: 3
      maxReplicas: 8
      targetCPU: 50
      targetMemory: 80

| Key | Type | Default | Description |
|---|---|---|---|
| manager.autoscaling.hpa.enabled | Boolean | true | Enable HPA |
| manager.autoscaling.hpa.minReplicas | Number | 3 | Minimum number of replicas |
| manager.autoscaling.hpa.maxReplicas | Number | 8 | Maximum number of replicas |
| manager.autoscaling.hpa.targetCPU | Number | 50 | Target CPU utilization percentage |
| manager.autoscaling.hpa.targetMemory | Number | 80 | Target Memory utilization percentage |

Network Policy

networkPolicy:
  enabled: true
  allowExternal: true
  allowExternalEgress: true
  addExternalClientAccess: true

| Key | Type | Default | Description |
|---|---|---|---|
| networkPolicy.enabled | Boolean | true | Enable NetworkPolicy |
| networkPolicy.allowExternal | Boolean | true | Allow connections from any source (don’t require pod label) |
| networkPolicy.allowExternalEgress | Boolean | true | Allow pod to access any range of ports and destinations |
| networkPolicy.addExternalClientAccess | Boolean | true | Allow access from pods with client label set to “true” |
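With addExternalClientAccess enabled, a pod gains access by carrying the client label. A minimal sketch (pod name and image are placeholders; the exact label key may be prefixed with the release name, so verify against the rendered NetworkPolicy):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: debug-client
  labels:
    client: "true"   # label referenced by addExternalClientAccess
spec:
  containers:
    - name: curl
      image: curlimages/curl
      command: ["sleep", "3600"]
```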

Pod Affinity and Anti-Affinity

manager:
  podAffinityPreset: ""
  podAntiAffinityPreset: soft
  nodeAffinityPreset:
    type: ""
    key: ""
    values: []
  affinity: {}

| Key | Type | Default | Description |
|---|---|---|---|
| manager.podAffinityPreset | String | (empty) | Pod affinity preset (soft or hard). Ignored if affinity is set |
| manager.podAntiAffinityPreset | String | soft | Pod anti-affinity preset (soft or hard). Ignored if affinity is set |
| manager.nodeAffinityPreset.type | String | (empty) | Node affinity preset type (soft or hard) |
| manager.affinity | Object | {} | Custom affinity rules (overrides presets) |

Service Configuration

service:
  type: ClusterIP
  ports:
    http: 80
  annotations:
    service.kubernetes.io/topology-mode: Auto
  externalTrafficPolicy: Cluster
  sessionAffinity: None

| Key | Type | Default | Description |
|---|---|---|---|
| service.type | String | ClusterIP | Service type |
| service.ports.http | Number | 80 | HTTP service port |
| service.annotations | Object | service.kubernetes.io/topology-mode: Auto | Service annotations |
| service.externalTrafficPolicy | String | Cluster | External traffic policy |

Persistence Configuration

persistence:
  enabled: false
  mountPath: /agiletv/manager/data
  storageClass: ""
  accessModes:
    - ReadWriteOnce
  size: 8Gi

| Key | Type | Default | Description |
|---|---|---|---|
| persistence.enabled | Boolean | false | Enable persistence using PVC |
| persistence.mountPath | String | /agiletv/manager/data | Mount path |
| persistence.storageClass | String | (empty) | Storage class (uses cluster default if empty) |
| persistence.size | String | 8Gi | Size of data volume |

RBAC and Service Account

rbac:
  create: false
  rules: []

serviceAccount:
  create: true
  name: ""
  automountServiceAccountToken: true
  annotations: {}

Metrics

metrics:
  enabled: false
  serviceMonitor:
    enabled: false
    namespace: ""
    annotations: {}
    labels: {}
    interval: ""
    scrapeTimeout: ""

| Key | Type | Default | Description |
|---|---|---|---|
| metrics.enabled | Boolean | false | Enable Prometheus metrics export |
| metrics.serviceMonitor.enabled | Boolean | false | Create Prometheus Operator ServiceMonitor |

Next Steps

After configuration:

  1. Installation Guide - Deploy with your configuration
  2. Operations Guide - Day-to-day management
  3. Performance Tuning Guide - Optimize system performance
  4. Architecture Guide - Understand component relationships

7 - Performance Tuning Guide

Optimization tips for improving CDN Manager performance

Overview

This guide provides performance tuning recommendations for the AgileTV CDN Manager (ESB3027). While the default configuration is suitable for most deployments, certain environments may benefit from additional optimizations.

Network Topology Optimization

Topology Aware Hints

The CDN Manager uses Kubernetes Topology Aware Hints to prefer routing pods in the same zone as the source of network traffic. This reduces cross-zone latency and improves overall system responsiveness.

How It Works

When nodes are labeled with topology zones, Kubernetes automatically routes traffic to pods in the same zone when possible. This is particularly beneficial for:

  • Low-latency requirements: Keeps traffic local to reduce round-trip time
  • Cost optimization: Reduces cross-zone data transfer costs in cloud environments
  • Load distribution: Prevents hotspots by distributing load across zones

Configuring Availability Zones

Each node must have zone and region labels applied for Topology Aware Hints to function:

# Label a node with a zone
kubectl label nodes <node-name> topology.kubernetes.io/zone=us-east-1a

# Label a node with a region
kubectl label nodes <node-name> topology.kubernetes.io/region=us-east-1

Replace <node-name> with your actual node names and adjust the zone/region values to match your deployment geography.

Note: Labels applied via kubectl label are automatically persistent and will survive node restarts.

Verify Topology Configuration

Verify labels are applied:

kubectl get nodes --show-labels | grep topology.kubernetes.io

Verify EndpointSlices are being generated with hints:

kubectl get endpointslices
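When Topology Aware Hints are active, zone hints appear in the EndpointSlice objects themselves. One way to look for them (a sketch; the YAML structure shown is the standard discovery.k8s.io/v1 shape):

```shell
# Look for zone hints on EndpointSlices; matching lines indicate hints are populated
kubectl get endpointslices -o yaml | grep -B2 forZones
```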

Requirements for Topology Aware Hints

For Topology Aware Hints to activate:

  • Minimum Nodes: At least one node must be labeled with each zone referenced by endpoints
  • Symmetry: The control plane checks for sufficient CPU capacity across zones to balance traffic
  • Zone Coverage: All zones with endpoints should have at least one ready node

Integration with Pod Anti-Affinity

Topology labels complement the pod anti-affinity rules already configured in the Helm chart:

  • Pod Anti-Affinity: Handles pod-to-node placement to ensure high availability
  • Topology Aware Hints: Handles service-to-pod traffic routing to keep requests within the same zone

Together, these features optimize both placement and routing for improved performance.

Fallback Behavior

If zone labels are not configured, the system falls back to random load-balancing across all available pods. This is functionally correct but may result in:

  • Increased cross-zone traffic
  • Higher latency for some requests
  • Less predictable performance characteristics

Kernel Network Tuning (sysctl)

For high-throughput deployments, tuning Linux kernel network parameters can significantly improve connection handling and overall system performance. These settings are particularly beneficial for environments with high connection rates or large numbers of concurrent connections.

Apply the following settings to optimize network performance:

# Networking
net.core.somaxconn = 1024
net.core.netdev_max_backlog = 2048
net.ipv4.tcp_max_syn_backlog = 2048

# Connection Tracking
net.netfilter.nf_conntrack_max = 131072
net.netfilter.nf_conntrack_tcp_timeout_established = 1200

# Port Reuse
net.ipv4.ip_local_port_range = 10240 65535
net.ipv4.tcp_tw_reuse = 1

# Memory Buffers
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608

Setting Descriptions

| Parameter | Recommended Value | Purpose |
|---|---|---|
| net.core.somaxconn | 1024 | Maximum socket listen backlog. Increases pending connection queue size. |
| net.core.netdev_max_backlog | 2048 | Maximum packets queued at network device level. Helps handle burst traffic. |
| net.ipv4.tcp_max_syn_backlog | 2048 | Maximum SYN requests queued. Improves handling of connection floods. |
| net.netfilter.nf_conntrack_max | 131072 | Maximum tracked connections. Prevents connection tracking table exhaustion. |
| net.netfilter.nf_conntrack_tcp_timeout_established | 1200 | Timeout for established connections (seconds). Reduces stale entry buildup. |
| net.ipv4.ip_local_port_range | 10240 65535 | Range of local ports for outbound connections. Expands available ephemeral ports. |
| net.ipv4.tcp_tw_reuse | 1 | Allows reusing TIME_WAIT sockets. Reduces port exhaustion under high load. |
| net.core.rmem_max | 8388608 | Maximum receive socket buffer size (8MB). Improves high-bandwidth transfers. |
| net.core.wmem_max | 8388608 | Maximum send socket buffer size (8MB). Improves high-bandwidth transfers. |

Applying Settings

Temporary (Until Reboot)

Apply settings immediately but they will be lost on reboot:

sudo sysctl -w net.core.somaxconn=1024
sudo sysctl -w net.core.netdev_max_backlog=2048
# ... repeat for each parameter

Persistent (Across Reboots)

Add settings to /etc/sysctl.conf or a file in /etc/sysctl.d/:

# Create a dedicated config file
cat <<EOF | sudo tee /etc/sysctl.d/99-cdn-manager.conf
# CDN Manager Network Tuning
net.core.somaxconn = 1024
net.core.netdev_max_backlog = 2048
net.ipv4.tcp_max_syn_backlog = 2048
net.netfilter.nf_conntrack_max = 131072
net.netfilter.nf_conntrack_tcp_timeout_established = 1200
net.ipv4.ip_local_port_range = 10240 65535
net.ipv4.tcp_tw_reuse = 1
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
EOF

# Apply all settings
sudo sysctl -p /etc/sysctl.d/99-cdn-manager.conf

Kubernetes Considerations

For Kubernetes deployments, these sysctl settings can be applied via:

  1. Node-level configuration: Use DaemonSets or node provisioning scripts
  2. Pod-level safe sysctls: Some sysctls can be set per-pod via securityContext.sysctls
  3. Container runtime configuration: Configure via container runtime options

Note that some sysctls require privileged containers or node-level configuration.
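For the safe, namespaced sysctls, a per-pod setting looks like the following sketch (the value is illustrative; unsafe sysctls must first be allowed on the kubelet via its allowed-unsafe-sysctls setting):

```yaml
spec:
  securityContext:
    sysctls:
      - name: net.ipv4.ip_local_port_range   # namespaced, in the Kubernetes safe set
        value: "10240 65535"
```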

Monitoring Impact

After applying these settings, monitor:

  • Connection establishment rates
  • TIME_WAIT socket count: netstat -n | grep TIME_WAIT | wc -l
  • Connection tracking table usage: cat /proc/sys/net/netfilter/nf_conntrack_count
  • Network buffer utilization via Grafana dashboards
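The conntrack count is most useful relative to the configured maximum. A small sketch combining the two files above:

```shell
# Report conntrack table usage as a percentage of nf_conntrack_max
count=$(cat /proc/sys/net/netfilter/nf_conntrack_count)
max=$(cat /proc/sys/net/netfilter/nf_conntrack_max)
awk -v c="$count" -v m="$max" 'BEGIN { printf "conntrack: %d/%d (%.1f%%)\n", c, m, 100 * c / m }'
```

Sustained usage above roughly 80% suggests raising nf_conntrack_max.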

Resource Configuration

Horizontal Pod Autoscaler (HPA)

The default HPA configuration is tuned for production workloads. For environments with variable load, consider adjusting the scale metrics:

| Component | Default Scale Metrics | Tuning Consideration |
|---|---|---|
| Core Manager | CPU 50%, Memory 80% | Lower CPU threshold for faster scale-out |
| NGINX Gateway | CPU 75%, Memory 80% | Increase for cost optimization |
| MIB Frontend | CPU 75%, Memory 90% | Adjust based on operator concurrency |

For detailed HPA configuration, see the Architecture Guide.

Resource Requests and Limits

Ensure resource requests and limits are appropriately sized for your workload. Under-provisioned resources can cause:

  • Pod evictions during high load
  • Increased latency due to CPU throttling
  • Slow scaling responses

Refer to the Configuration Guide for preset configurations and planning guidance.

Database Optimization

PostgreSQL

The PostgreSQL cluster is managed by the CloudNativePG operator. For improved performance:

  • Connection Pooling: The application uses connection pooling by default
  • Replica Usage: Read queries can be offloaded to replicas for read-heavy workloads
  • Backup Scheduling: Schedule backups during low-traffic periods to minimize I/O impact

Redis

Redis provides in-memory caching for sessions and ephemeral state:

  • Memory Allocation: Ensure sufficient memory for cache hit rates
  • Persistence: RDB snapshots are enabled; adjust frequency based on durability needs

Kafka

Kafka handles event streaming for selection input and metrics:

  • Partition Count: Default partitions are sized for typical workloads
  • Replication Factor: Production deployments use 3 replicas for fault tolerance
  • Consumer Groups: The Selection Input Worker is limited to one consumer per partition
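Consumer lag can be checked with the standard Kafka tooling shipped in the broker image. A hedged sketch: group names in your deployment may differ, and the awk filter assumes LAG is the sixth column of the describe output:

```shell
# Describe all consumer groups and flag partitions with non-zero lag
kubectl exec acd-manager-kafka-controller-0 -- \
  kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --all-groups \
  | awk 'NR > 1 && $6 > 0 { print $1, $2, "lag:", $6 }'
```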

Monitoring Performance

Key Metrics to Watch

Monitor the following metrics for performance insights:

  • API Response Time: Track via Grafana dashboards
  • Pod CPU/Memory Usage: Identify resource bottlenecks
  • Kafka Lag: Monitor consumer lag for selection input processing
  • Database Connections: Watch for connection pool exhaustion

Grafana Dashboards

Pre-built dashboards are available at https://<manager-host>/grafana:

  • System Health: Overall cluster and application health
  • CDN Metrics: Routing and usage statistics
  • Resource Utilization: CPU, memory, and network usage per component

Troubleshooting Performance Issues

High Latency

  1. Check pod distribution across nodes: kubectl get pods -o wide
  2. Verify topology labels are applied: kubectl get nodes --show-labels
  3. Review network latency between nodes
  4. Check for resource contention: kubectl top pods

Slow Scaling

  1. Verify HPA is enabled: kubectl get hpa
  2. Check cluster capacity for scheduling new pods
  3. Review HPA metrics: kubectl describe hpa acd-manager

Database Performance

  1. Check PostgreSQL cluster status: kubectl get pods -l app=postgresql
  2. Review slow query logs (if enabled)
  3. Monitor connection pool usage

Next Steps

After reviewing performance tuning:

  1. Architecture Guide - Understand component interactions
  2. Configuration Guide - Detailed configuration options
  3. Metrics & Monitoring Guide - Comprehensive monitoring setup
  4. Troubleshooting Guide - Resolve performance issues

8 - Operations Guide

Day-to-day operational procedures and maintenance tasks

Overview

This guide covers day-to-day operational procedures for managing the AgileTV CDN Manager (ESB3027). Topics include routine maintenance, backup procedures, log management, and common operational tasks.

Prerequisites

Before performing operations, ensure you have:

  • kubectl access to the cluster
  • helm CLI installed
  • Access to the node where values.yaml is stored
  • Appropriate RBAC permissions for administrative tasks

Cluster Access

There are two supported methods for accessing the Kubernetes cluster:

  1. SSH to a Server Node (Recommended for operations staff) - SSH into any Server node and run kubectl commands directly
  2. Remote kubectl - Install kubectl on your local machine and configure it to connect to the cluster remotely

Method 1: SSH to a Server Node

The kubectl command-line tool is pre-configured on all Server nodes and can be used directly without additional setup:

# SSH to any Server node
ssh root@<server-ip>

# Run kubectl commands directly
kubectl get nodes
kubectl get pods

This method is recommended for day-to-day operations as it requires no local configuration and provides direct access to the cluster.

Method 2: Remote kubectl from Local Machine

To use kubectl from your local workstation or laptop:

Step 1: Install kubectl

Download and install kubectl for your operating system:

  • Official Documentation: Install kubectl
  • macOS (Homebrew): brew install kubectl
  • Linux: Download from the official Kubernetes release page
  • Windows: Download from the official Kubernetes release page

Step 2: Copy kubeconfig from Server Node

# Copy kubeconfig from any Server node
scp root@<server-ip>:/etc/rancher/k3s/k3s.yaml ~/.kube/config

Step 3: Update kubeconfig

Edit the kubeconfig file to point to the correct server address:

# Replace localhost with the actual server IP
# macOS:
sed -i '' 's/127.0.0.1/<server-ip>/g' ~/.kube/config
# Linux:
sed -i 's/127.0.0.1/<server-ip>/g' ~/.kube/config

# Or manually edit ~/.kube/config and change:
# server: https://127.0.0.1:6443
# to:
# server: https://<server-ip>:6443

Step 4: Verify connectivity

kubectl get nodes

Managing Multiple Clusters

If you manage multiple Kubernetes clusters from the same machine, you can maintain multiple kubeconfig files:

# Set KUBECONFIG environment variable to include multiple config files
export KUBECONFIG=~/.kube/config-prod:~/.kube/config-lab

# View all contexts
kubectl config get-contexts

# Switch between clusters
kubectl config use-context <context-name>

# View current context
kubectl config current-context

For more information, see the official Kubernetes documentation: Organizing Cluster Access

Helm Commands

Helm releases are managed cluster-wide:

# List all releases
helm list

# View release history
helm history acd-manager

# Get deployed values
helm get values acd-manager -o yaml

# Get deployed manifest
helm get manifest acd-manager

Note: If using remote kubectl, ensure helm is installed on your local machine. See Helm Installation for instructions.

Backup Procedures

PostgreSQL Backup

PostgreSQL is managed by the CloudNativePG operator, which provides continuous backup capabilities.

# Check backup status
kubectl get backup

# Create manual backup
kubectl apply -f - <<EOF
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: manual-backup-$(date +%Y%m%d-%H%M%S)
spec:
  cluster:
    name: acd-cluster-postgresql
EOF

# List available backups
kubectl get backup -o wide

# Restore from backup (requires downtime)
# See Upgrade Guide for restore procedures

Longhorn Volume Backups

Longhorn provides snapshot and backup capabilities for persistent volumes:

# List all volumes
kubectl get volumes -n longhorn-system

# Create snapshot via Longhorn UI
# Port-forward to Longhorn UI (do not expose via ingress)
kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80

# Access: http://localhost:8080
# WARNING: Longhorn UI grants access to sensitive storage information
# and should never be exposed through the ingress controller

Accessing Internal Services

For debugging and troubleshooting, you may need direct access to internal services.

PostgreSQL

PostgreSQL is managed by the CloudNativePG operator. Connection details are stored in the acd-cluster-postgresql-app Secret:

# View connection details
kubectl describe secret acd-cluster-postgresql-app

# Extract individual fields
PG_HOST=$(kubectl get secret acd-cluster-postgresql-app -o jsonpath='{.data.host}' | base64 -d)
PG_USER=$(kubectl get secret acd-cluster-postgresql-app -o jsonpath='{.data.username}' | base64 -d)
PG_PASS=$(kubectl get secret acd-cluster-postgresql-app -o jsonpath='{.data.password}' | base64 -d)
PG_DB=$(kubectl get secret acd-cluster-postgresql-app -o jsonpath='{.data.dbname}' | base64 -d)

# Connect via psql
kubectl exec -it acd-cluster-postgresql-0 -- psql -U $PG_USER -d $PG_DB

Secret fields: The CNPG operator populates the following fields: username, password, host, port, dbname, uri, jdbc-uri, fqdn-uri, fqdn-jdbc-uri, pgpass.
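
All of these fields can be decoded in one pass with jq. The pipeline below runs against a mock Secret-shaped payload for illustration; in practice, pipe the output of kubectl get secret acd-cluster-postgresql-app -o json into the same jq filter:

```shell
# Decode every data field of a Secret-shaped JSON document in one pass.
# MOCK stands in for `kubectl get secret acd-cluster-postgresql-app -o json`.
MOCK='{"data":{"username":"YXBw","dbname":"YWNk"}}'
echo "$MOCK" | jq -r '.data | to_entries[] | "\(.key)=\(.value | @base64d)"'
# username=app
# dbname=acd
```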

Redis

Redis runs on port 6379 with no authentication:

# Connect via redis-cli
kubectl exec -it acd-manager-redis-master-0 -- redis-cli

# Or connect from another pod
kubectl run redis-test --rm -it --image=redis -- redis-cli -h acd-manager-redis-master

Kafka

Kafka is accessible on port 9095 from any cluster node:

# Connect from within cluster
kubectl exec -it acd-manager-kafka-controller-0 -- kafka-topics.sh --bootstrap-server localhost:9092 --list

# Connect from external (via any node IP)
kafka-topics.sh --bootstrap-server <node-ip>:9095 --list

The selection_input topic is pre-configured for selection input events.

Longhorn Storage

Longhorn is a distributed block storage system for Kubernetes that provides persistent volumes for stateful applications such as PostgreSQL and Kafka.

Architecture

Longhorn deploys controller and replica engines on each node, forming a distributed storage system. When a volume is created, Longhorn replicates data across multiple nodes to ensure durability even in the event of node failures.

Storage Protocols:

  • iSCSI: Used for standard Read-Write-Once (RWO) volumes
  • NFS: Used for Read-Write-Many (RWX) volumes that can be mounted by multiple pods simultaneously

Configuration

The CDN Manager deploys Longhorn with a single replica configuration, which differs from the Longhorn default of 3 replicas. This configuration is optimized for the cluster architecture where:

  • Pod-node affinity is configured to schedule pods on the same node as their persistent volume data
  • This optimizes I/O performance by reducing network traffic
  • Data locality is maintained while still providing volume portability

Capacity Planning

Longhorn storage requires an additional 30% capacity headroom for internal operations and scaling. If less than 30% of the total partition capacity is available, Longhorn may mark volumes as “full” and prevent further writes.

For detailed storage requirements and disk partitioning guidance, see the System Requirements Guide.
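
As a quick sizing sanity check, the 30% headroom rule can be applied with simple shell arithmetic (the partition size below is an example value):

```shell
# With a 500 GiB Longhorn partition, keep ~30% free for internal operations;
# only about 70% is safely allocatable to volumes.
PARTITION_GIB=500
ALLOCATABLE_GIB=$(( PARTITION_GIB * 70 / 100 ))
echo "Safely allocatable: ${ALLOCATABLE_GIB} GiB"
# Safely allocatable: 350 GiB
```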

Configuration Backup

Always backup your Helm values before making changes:

# Export current values
helm get values acd-manager -o yaml > ~/values-backup-$(date +%Y%m%d).yaml

# Backup custom values files
cp ~/values.yaml ~/values-backup-$(date +%Y%m%d).yaml

Backup Schedule Recommendations

| Component          | Frequency          | Retention  |
|--------------------|--------------------|------------|
| PostgreSQL         | Daily              | 30 days    |
| Longhorn Snapshots | Before changes     | 7 days     |
| Configuration      | Before each change | Indefinite |

Updating MaxMind GeoIP Databases

The MaxMind GeoIP databases (GeoIP2-City, GeoLite2-ASN, GeoIP2-Anonymous-IP) are used for GeoIP-based routing and validation features. These databases should be updated periodically to ensure accurate IP geolocation data.

Prerequisites

  • Updated MaxMind database files (.mmdb format) obtained from MaxMind
  • Access to the cluster via kubectl
  • Helm CLI installed

Update Procedure

Step 1: Create New Volume with Updated Databases

Run the volume generation utility with a unique volume name that includes a revision identifier:

# Mount the installation ISO if not already mounted
mkdir -p /mnt/esb3027
mount -o loop,ro esb3027-acd-manager-X.Y.Z.iso /mnt/esb3027

# Generate new volume with updated databases
/mnt/esb3027/generate-maxmind-volume

When prompted:

  1. Provide the paths to the three database files:
    • GeoIP2-City.mmdb
    • GeoLite2-ASN.mmdb
    • GeoIP2-Anonymous-IP.mmdb
  2. Enter a unique volume name with a revision number or date, for example:
    • maxmind-geoip-2026-04
    • maxmind-geoip-v2

Tip: Using a revision-based naming convention simplifies rollback if needed.

Step 2: Update Helm Configuration

Edit your values.yaml file to reference the new volume:

manager:
  maxmindDbVolume: maxmind-geoip-2026-04

Replace maxmind-geoip-2026-04 with the volume name you specified in Step 1.

Step 3: Apply Configuration Update

Upgrade the Helm release with the updated configuration:

helm upgrade acd-manager /mnt/esb3027/charts/acd-manager --values ~/values.yaml

Step 4: Rolling Restart (Optional)

To ensure all pods immediately use the new database files, perform a rolling restart of the manager deployment:

kubectl rollout restart deployment acd-manager

Monitor the rollout status:

kubectl rollout status deployment acd-manager

Step 5: Verify Update

Verify the pods are running with the new volume:

kubectl get pods
kubectl describe pod -l app.kubernetes.io/component=manager | grep -A 5 "Volumes"

Step 6: Clean Up Old Volume (Optional)

After verifying the new databases are working correctly, you can delete the old persistent volume:

# List persistent volumes to find the old one
kubectl get pv

# Delete the old volume
kubectl delete pv <old-volume-name>

Caution: Ensure the new volume is functioning correctly before deleting the old volume. Keep the old volume for at least 24-48 hours as a rollback option.

Rollback Procedure

If issues occur after updating the databases:

  1. Revert the maxmindDbVolume value in your values.yaml to the previous volume name
  2. Run helm upgrade with the reverted configuration
  3. Optionally restart the deployment: kubectl rollout restart deployment acd-manager

Update Frequency Recommendations

| Database            | Recommended Update Frequency |
|---------------------|------------------------------|
| GeoIP2-City         | Weekly or monthly            |
| GeoLite2-ASN        | Monthly                      |
| GeoIP2-Anonymous-IP | Weekly or monthly            |

MaxMind releases database updates on a regular schedule. Subscribe to MaxMind notifications to stay informed of new releases.

Log Management

Application Logs

# View manager logs
kubectl logs -l app.kubernetes.io/component=manager

# Follow logs in real-time
kubectl logs -l app.kubernetes.io/component=manager -f

# View logs from specific pod
kubectl logs <pod-name>

# View previous instance logs (after crash)
kubectl logs <pod-name> -p

# View logs with timestamps
kubectl logs <pod-name> --timestamps

# View logs from all containers in pod
kubectl logs <pod-name> --all-containers

Component-Specific Logs

# Zitadel logs
kubectl logs -l app.kubernetes.io/name=zitadel

# Gateway logs
kubectl logs -l app.kubernetes.io/component=gateway

# Confd logs
kubectl logs -l app.kubernetes.io/component=confd

# MIB Frontend logs
kubectl logs -l app.kubernetes.io/component=mib-frontend

# PostgreSQL logs
kubectl logs -l app.kubernetes.io/name=postgresql

# Kafka logs
kubectl logs -l app.kubernetes.io/name=kafka

# Redis logs
kubectl logs -l app.kubernetes.io/name=redis

Log Aggregation

Logs are collected by Telegraf and sent to VictoriaMetrics:

# Access Grafana for log visualization
# https://<manager-host>/grafana

# Query logs via Grafana Explore
# Select VictoriaMetrics datasource and use log queries

Log Rotation

Container logs are automatically rotated by Kubernetes:

  • Default max size: 10MB per container
  • Default max files: 5 rotated files
  • Total per container: ~50MB maximum (10MB × 5 files)

Scaling Operations

Manual Scaling

Note: If HPA (Horizontal Pod Autoscaler) is enabled for a deployment, manual scaling changes will be overridden by the HPA. To manually scale, you must first disable the HPA.

# Check if HPA is enabled
kubectl get hpa

# Pin the HPA to a fixed replica count before manual scaling
# (maxReplicas is a required field and cannot be set to null)
kubectl patch hpa acd-manager --type merge -p '{"spec": {"minReplicas": 3, "maxReplicas": 3}}'

# Or delete the HPA entirely
kubectl delete hpa acd-manager

# Scale manager replicas
kubectl scale deployment acd-manager --replicas=3

# Scale gateway replicas
kubectl scale deployment acd-manager-gateway --replicas=2

# Scale MIB frontend replicas
kubectl scale deployment acd-manager-mib-frontend --replicas=2

HPA Configuration

# View HPA status
kubectl get hpa

# Describe HPA details
kubectl describe hpa acd-manager

# Edit HPA configuration
kubectl edit hpa acd-manager

Configuration Updates

Updating Helm Values

# Edit values file
vi ~/values.yaml

# Validate with dry-run
helm upgrade acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml \
  --dry-run

# Apply changes
helm upgrade acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml

# Verify rollout
kubectl rollout status deployment/acd-manager

Rolling Back Changes

# View revision history
helm history acd-manager

# Rollback to previous revision
helm rollback acd-manager

# Rollback to specific revision
helm rollback acd-manager <revision>

# Verify rollback
helm history acd-manager

Certificate Management

Checking Certificate Expiration

# Check TLS secret expiration
kubectl get secret acd-manager-tls -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates

# Check via Grafana dashboard
# Certificate expiration metrics are available in Grafana

Renewing Certificates

# For Helm-managed self-signed certificates
helm upgrade acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml \
  --set ingress.selfSigned=true

# For manual certificates, update the secret
kubectl create secret tls acd-manager-tls \
  --cert=new-tls.crt \
  --key=new-tls.key \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart pods to pick up new certificate
kubectl rollout restart deployment acd-manager

Health Checks

Component Health

# Check all pods
kubectl get pods

# Check specific component
kubectl get pods -l app.kubernetes.io/component=manager

# Check persistent volumes
kubectl get pvc

# Check cluster status
kubectl get nodes

# Check ingress
kubectl get ingress

API Health Endpoints

# Liveness check
curl -k https://<manager-host>/api/v1/health/alive

# Readiness check
curl -k https://<manager-host>/api/v1/health/ready

Database Health

# PostgreSQL cluster status
kubectl get clusters -n default

# Check PostgreSQL pods
kubectl get pods -l app.kubernetes.io/name=postgresql

# Kafka cluster status
kubectl get pods -l app.kubernetes.io/name=kafka

# Redis status
kubectl get pods -l app.kubernetes.io/name=redis

Maintenance Windows

Planned Maintenance

Before performing maintenance:

  1. Notify users of potential service impact
  2. Verify backups are current
  3. Document the maintenance procedure
  4. Prepare rollback plan

Node Maintenance

# Cordon node to prevent new pods
kubectl cordon <node-name>

# Drain node (evict pods)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Perform maintenance

# Uncordon node
kubectl uncordon <node-name>

Cluster Upgrades

See the Upgrade Guide for cluster upgrade procedures.

Troubleshooting Quick Reference

Common Commands

# Describe problematic pod
kubectl describe pod <pod-name>

# View pod events
kubectl get events --sort-by='.lastTimestamp'

# Check resource usage
kubectl top pods
kubectl top nodes

# Exec into container
kubectl exec -it <pod-name> -- /bin/sh

# Check network policies
kubectl get networkpolicies

# Check service endpoints
kubectl get endpoints

Restarting Components

# Restart deployment
kubectl rollout restart deployment/<deployment-name>

# Restart statefulset
kubectl rollout restart statefulset/<statefulset-name>

# Delete pod (auto-recreated)
kubectl delete pod <pod-name>

Security Operations

Rotating Service Account Tokens

# Delete service account secret (auto-regenerated)
kubectl delete secret <service-account-token-secret>

# Tokens are automatically regenerated

Updating RBAC Permissions

# View current roles
kubectl get roles
kubectl get clusterroles

# View role bindings
kubectl get rolebindings
kubectl get clusterrolebindings

# Edit role
kubectl edit role <role-name>

Audit Log Access

# K3s audit logs location
/var/lib/rancher/k3s/server/logs/audit.log

# View recent audit events
tail -f /var/lib/rancher/k3s/server/logs/audit.log

Disaster Recovery

Pod Recovery

Pods are automatically recreated if they fail:

# Check pod status
kubectl get pods

# If pod is stuck in Terminating
kubectl delete pod <pod-name> --force --grace-period=0

# If pod is stuck in Pending, check resources
kubectl describe pod <pod-name>
kubectl get events --sort-by='.lastTimestamp'

Node Failure Recovery

When a node fails:

  1. Automatic: Pods are rescheduled on healthy nodes (after timeout)
  2. Manual: Force delete stuck pods
# Force delete pods on failed node
kubectl delete pod --all --force --grace-period=0 \
  --field-selector spec.nodeName=<failed-node>

Data Recovery

For data recovery scenarios, refer to:

  • PostgreSQL: CloudNativePG backup/restore procedures
  • Longhorn: Volume snapshot restoration
  • Kafka: Partition replication handles node failures

Routine Maintenance Checklist

Daily

  • Review Grafana dashboards for anomalies
  • Check alert notifications
  • Verify backup completion

Weekly

  • Review pod restart counts
  • Check certificate expiration dates
  • Review log storage usage
  • Verify HPA is functioning correctly

Monthly

  • Test backup restoration procedure
  • Review and rotate credentials if needed
  • Update documentation if configuration changed
  • Review resource utilization trends

Next Steps

After mastering operations:

  1. Troubleshooting Guide - Deep dive into problem resolution
  2. Performance Tuning Guide - Optimize system performance
  3. Metrics & Monitoring Guide - Comprehensive monitoring setup
  4. API Guide - REST API reference and automation

9 - Metrics & Monitoring Guide

Monitoring architecture and metrics collection

Overview

The CDN Manager includes a comprehensive monitoring stack based on VictoriaMetrics for time-series data storage, Telegraf for metrics collection, and Grafana for visualization. This guide describes the monitoring architecture and how to access and use the monitoring capabilities.

Architecture

Components

| Component                    | Purpose                                                                                     |
|------------------------------|---------------------------------------------------------------------------------------------|
| Telegraf                     | Metrics collector running on each node, gathering system and application metrics            |
| VictoriaMetrics Agent        | Metrics scraper and forwarder; scrapes Prometheus endpoints and forwards to VictoriaMetrics |
| VictoriaMetrics (Short-term) | Time-series database for operational dashboards (30-90 day retention)                       |
| VictoriaMetrics (Long-term)  | Time-series database for billing and compliance (1+ year retention)                         |
| Grafana                      | Visualization and dashboard platform                                                        |
| Alertmanager                 | Alert routing and notification management                                                   |
Metrics Flow

The following diagram illustrates how metrics flow through the monitoring stack:

flowchart TB
    subgraph External["External Sources"]
        Streamers[Streamers/External Clients]
    end

    subgraph Cluster["Kubernetes Cluster"]
        Telegraf[Telegraf DaemonSet]

        subgraph Applications["Application Components"]
            Director[CDN Director]
            Kafka[Kafka]
            Redis[Redis]
            Manager[ACD Manager]
            Alertmanager[Alertmanager]
        end

        VMAgent[VictoriaMetrics Agent]

        subgraph Storage["Storage"]
            VMShort[VictoriaMetrics<br/>Short-term]
            VMLong[VictoriaMetrics<br/>Long-term]
        end
    end

    Grafana[Grafana]

    Streamers -->|Push metrics| Telegraf
    Telegraf -->|remote_write| VMShort
    Telegraf -->|remote_write| VMLong

    Director -->|Scrape| VMAgent
    Kafka -->|Scrape| VMAgent
    Redis -->|Scrape| VMAgent
    Manager -->|Scrape| VMAgent
    Alertmanager -->|Scrape| VMAgent

    VMAgent -->|remote_write| VMShort
    VMAgent -->|remote_write| VMLong

    VMShort -->|Query| Grafana
    VMLong -->|Query| Grafana

Metrics Flow Summary:

  1. External metrics ingestion:

    • External clients (streamers) push metrics to Telegraf
    • Telegraf forwards metrics via remote_write to both VictoriaMetrics instances
  2. Internal metrics scraping:

    • VictoriaMetrics Agent scrapes Prometheus endpoints from:
      • CDN Director instances
      • Kafka cluster
      • Redis
      • ACD Manager components
      • Alertmanager
    • VMAgent forwards scraped metrics via remote_write to both VictoriaMetrics instances
  3. Data visualization:

    • Grafana queries both VictoriaMetrics databases depending on the dashboard requirements
    • Operational dashboards use short-term storage
    • Billing and compliance dashboards use long-term storage

Accessing Grafana

Grafana is deployed as part of the metrics stack and accessible via the ingress:

URL: https://<manager-host>/grafana

Default credentials are listed in the Glossary.

Important: Change all default passwords after first login.

Metrics Collection

Application Metrics

Applications expose metrics on Prometheus-compatible endpoints. VictoriaMetrics Agent (VMAgent) scrapes these endpoints and forwards metrics to VictoriaMetrics via remote_write.

System Metrics

Telegraf collects system-level metrics including:

  • CPU usage
  • Memory utilization
  • Disk I/O
  • Network statistics
  • Process metrics

Kubernetes Metrics

Cluster metrics are collected including:

  • Pod resource usage
  • Node status
  • Deployment status
  • Persistent volume usage

Grafana Dashboards

Accessing Dashboards

After logging into Grafana:

  1. Navigate to Dashboards in the left menu
  2. Browse available dashboards
  3. Click on a dashboard to view metrics

Dashboard Types

The included dashboards provide visibility into:

  • Cluster Health: Overall cluster resource utilization
  • Application Performance: Request rates, latency, error rates
  • Component Status: Individual component health indicators

CDN Director Metrics

Director DNS Names in Grafana

CDN Director instances are identified in Grafana by their DNS name, which is derived from the name field in global.hosts.routers:

global:
  hosts:
    routers:
      - name: my-router-1
        address: 192.0.2.1

The DNS name used in Grafana dashboards will be: my-router-1.external

This naming convention is automatically applied for all configured directors.

Metrics Retention

VictoriaMetrics is configured with default retention policies. For custom retention settings, modify the VictoriaMetrics configuration in your values.yaml:

acd-metrics:
  victoria-metrics-single:
    retentionPeriod: "3"  # Retention period in months

Troubleshooting

Metrics Not Appearing

If metrics are not appearing in Grafana:

  1. Check Telegraf pods:

    kubectl get pods -l app.kubernetes.io/component=telegraf
    
  2. Check Telegraf logs:

    kubectl logs -l app.kubernetes.io/component=telegraf
    
  3. Verify VictoriaMetrics is running:

    kubectl get pods -l app.kubernetes.io/component=victoria-metrics
    
  4. Check application metrics endpoints:

    kubectl exec <pod-name> -- curl localhost:8080/metrics
    

Dashboard Loading Issues

If dashboards fail to load:

  1. Check Grafana pods:

    kubectl get pods -l app.kubernetes.io/component=grafana
    
  2. Review Grafana logs:

    kubectl logs -l app.kubernetes.io/component=grafana
    
  3. Verify datasource configuration in Grafana UI

Next Steps

After setting up monitoring:

  1. Operations Guide - Day-to-day operational procedures
  2. Troubleshooting Guide - Resolve monitoring issues
  3. API Guide - Access metrics via API

10 - API Guide

REST API reference and integration examples

Overview

The CDN Manager exposes versioned HTTP APIs under /api (v1 and v2), using JSON payloads by default. When sending request bodies, set Content-Type: application/json. Server errors typically respond with { "message": "..." } where available, or an empty body with the relevant status code.

Authentication uses a two-step flow:

  1. Create a session
  2. Exchange that session for an access token with grant_type=session

Use the access token in Authorization: Bearer <token> when calling bearer-protected routes. CORS preflight (OPTIONS) is supported and wildcard origins are accepted by default.

Durations such as TTLs use humantime strings (for example, 60s, 5m, 1h).

Base URL

All API endpoints are relative to:

https://<manager-host>/api

API Reference Guides

The API documentation is organized by functional area:

| Guide                 | Description                                              |
|-----------------------|----------------------------------------------------------|
| Authentication API    | Login, token exchange, logout, and session management    |
| Health API            | Liveness and readiness probes                            |
| Selection Input API   | Key-value and list storage with search capabilities      |
| Data Store API        | Generic JSON key/value storage                           |
| Subnets API           | CIDR-to-value mappings for routing decisions             |
| Routing API           | GeoIP lookups and IP validation                          |
| Discovery API         | Host and namespace discovery                             |
| Metrics API           | Metrics submission and aggregation                       |
| Configuration API     | Configuration document management                        |
| Operator UI API       | Blocked tokens, user agents, and referrers               |
| OpenAPI Specification | Complete OpenAPI 3.0 specification                       |

Authentication Flow

All authenticated API calls follow the same authentication flow. For detailed instructions, see the Authentication API Guide.

Quick Start:

# Step 1: Login to get session
curl -s -X POST "https://cdn-manager/api/v1/auth/login" \
  -H "Content-Type: application/json" \
  -d '{
    "email": "user@example.com",
    "password": "Password1!"
  }' | tee /tmp/session.json

SESSION_ID=$(jq -r '.session_id' /tmp/session.json)
SESSION_TOKEN=$(jq -r '.session_token' /tmp/session.json)

# Step 2: Exchange session for access token
curl -s -X POST "https://cdn-manager/api/v1/auth/token" \
  -H "Content-Type: application/json" \
  -d "$(jq -nc --arg sid "$SESSION_ID" --arg st "$SESSION_TOKEN" \
    '{session_id:$sid,session_token:$st,grant_type:"session",scope:"openid"}')" \
  | tee /tmp/token.json

ACCESS_TOKEN=$(jq -r '.access_token' /tmp/token.json)

# Step 3: Call a protected endpoint
curl -s "https://cdn-manager/api/v1/metrics" \
  -H "Authorization: Bearer ${ACCESS_TOKEN}"

Error Responses

The API uses standard HTTP response codes to indicate the success or failure of an API request.

Most errors return an empty response body with the relevant HTTP status code (e.g., 404 Not Found or 409 Conflict).

In some cases, the server may return a JSON body containing a user-facing error message:

{
  "message": "Human-readable error message"
}

Next Steps

After learning the API:

  1. Operations Guide - Day-to-day operational procedures
  2. Troubleshooting Guide - Resolve API issues
  3. Configuration Guide - Full configuration reference

10.1 - Authentication API

Authentication and session management

Overview

The Authentication API provides endpoints for user authentication, session management, and token exchange. All authenticated API calls require a valid access token obtained through the authentication flow.

Base URL

https://<manager-host>/api/v1/auth

Endpoints

POST /api/v1/auth/login

Create a session from email/password credentials.

Request:

POST /api/v1/auth/login
Content-Type: application/json

{
  "email": "user@example.com",
  "password": "Password1!"
}

Success Response (200):

{
  "session_id": "session-1",
  "session_token": "token-1",
  "verified_at": "2024-01-01T00:00:00Z",
  "expires_at": "2024-01-01T01:00:00Z"
}

Errors:

  • 401 - Authentication failure (invalid credentials)
  • 500 - Backend/state errors

POST /api/v1/auth/token

Exchange a session for an access token (required for bearer auth).

Request:

POST /api/v1/auth/token
Content-Type: application/json

{
  "session_id": "session-1",
  "session_token": "token-1",
  "grant_type": "session",
  "scope": "openid profile"
}

Success Response (200):

{
  "access_token": "<token>",
  "scope": "openid profile",
  "expires_in": 3600,
  "token_type": "bearer"
}

Token Scopes

The scope parameter in the token exchange request is a space-separated string of permissions requested for the access token.

Scope Resolution

When a token is requested, the backend system filters the requested scopes against the user’s actual permissions. The resulting access token will only contain the subset of requested scopes that the user is authorized to possess.

Naming and Design

Scope names are defined by the applications that consume the tokens, not by the central IAM system. To prevent collisions between different applications or modules, it is highly recommended that application developers use URN-style prefixes for scope names (e.g., urn:acd:manager:config:read).
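
The filtering behavior can be pictured as a simple set intersection. The sketch below, with example scope names, mimics what the backend does with a requested list and a granted list:

```shell
# Issued scopes = intersection of requested scopes and the user's granted scopes.
REQUESTED="openid profile urn:acd:manager:config:read"
GRANTED="openid urn:acd:manager:config:read urn:acd:manager:config:write"
ISSUED=""
for s in $REQUESTED; do
  case " $GRANTED " in *" $s "*) ISSUED="$ISSUED $s";; esac
done
echo "issued:$ISSUED"
# issued: openid urn:acd:manager:config:read
```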

Errors:

  • 401 - Authentication failure (invalid session)
  • 500 - Backend/state errors

POST /api/v1/auth/logout

Revoke a session. Note: This does not revoke issued access tokens; they remain valid until expiration.

Request:

POST /api/v1/auth/logout
Content-Type: application/json

{
  "session_id": "session-1",
  "session_token": "token-1"
}

Success Response (200):

{
  "status": "Ok"
}

Errors:

  • 400 - Invalid session parameters
  • 500 - Backend/state errors

Complete Authentication Flow Example

# Step 1: Login to get session
curl -s -X POST "https://cdn-manager/api/v1/auth/login" \
  -H "Content-Type: application/json" \
  -d '{
    "email": "user@example.com",
    "password": "Password1!"
  }' | tee /tmp/session.json

SESSION_ID=$(jq -r '.session_id' /tmp/session.json)
SESSION_TOKEN=$(jq -r '.session_token' /tmp/session.json)

# Step 2: Exchange session for access token
curl -s -X POST "https://cdn-manager/api/v1/auth/token" \
  -H "Content-Type: application/json" \
  -d "$(jq -nc --arg sid "$SESSION_ID" --arg st "$SESSION_TOKEN" \
    '{session_id:$sid,session_token:$st,grant_type:"session",scope:"openid"}')" \
  | tee /tmp/token.json

ACCESS_TOKEN=$(jq -r '.access_token' /tmp/token.json)

# Step 3: Call a protected endpoint
curl -s "https://cdn-manager/api/v1/metrics" \
  -H "Authorization: Bearer ${ACCESS_TOKEN}"

Using the Access Token

Once you have obtained an access token, include it in the Authorization header of all API requests:

Authorization: Bearer <access_token>

Example:

curl -s "https://cdn-manager/api/v1/configuration" \
  -H "Authorization: Bearer ${ACCESS_TOKEN}"

Token Expiration

Access tokens expire after the duration specified in expires_in (typically 3600 seconds / 1 hour). When a token expires, you must re-authenticate to obtain a new token.
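
For scripted clients, it helps to record when the token will lapse so the script can re-authenticate proactively instead of waiting for a 401. A minimal sketch using the expires_in value from the token response:

```shell
# Compute and store the absolute expiry timestamp for the access token.
EXPIRES_IN=3600                    # from the /auth/token response
EXPIRES_AT=$(( $(date +%s) + EXPIRES_IN ))
if [ "$(date +%s)" -ge "$EXPIRES_AT" ]; then
  echo "token expired: re-run the login/token exchange"
else
  echo "token valid for $(( EXPIRES_AT - $(date +%s) )) more seconds"
fi
```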

Next Steps

10.2 - Health API

Liveness and readiness probe endpoints

Overview

The Health API provides endpoints for Kubernetes health probes and service health checking.

Base URL

https://<manager-host>/api/v1/health

Endpoints

GET /api/v1/health/alive

Liveness probe that indicates whether the service is running. Always returns 200 OK.

Request:

GET /api/v1/health/alive

Response (200):

{
  "status": "Ok"
}

Use Case: Kubernetes liveness probe to determine if the pod should be restarted.


GET /api/v1/health/ready

Readiness probe that checks service readiness including downstream dependencies.

Request:

GET /api/v1/health/ready

Success Response (200):

{
  "status": "Ok"
}

Failure Response (503):

{
  "status": "Fail"
}

Use Case: Kubernetes readiness probe to determine if the pod should receive traffic. Returns 503 if any downstream dependencies (database, Kafka, Redis) are unavailable.


Kubernetes Configuration

Example Kubernetes probe configuration:

livenessProbe:
  httpGet:
    path: /api/v1/health/alive
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /api/v1/health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5

Next Steps

10.3 - Selection Input API

Key-value and list storage with search capabilities

Overview

The Selection Input API provides JSON key/value storage with search capabilities. It supports two API versions (v1 and v2) with different operation models.

Base URL

https://<manager-host>/api/v1/selection_input
https://<manager-host>/api/v2/selection_input

Version Comparison

| Feature           | v1 /api/v1/selection_input            | v2 /api/v2/selection_input           |
|-------------------|---------------------------------------|--------------------------------------|
| Primary operation | Merge/UPSERT (POST)                   | Insert/Replace (PUT)                 |
| List append       | N/A                                   | POST to push to list                 |
| Search syntax     | Wildcard prefix (foo* implicit)       | Full wildcard (foo* explicit)        |
| Query params      | search, sort, limit, ttl              | search, ttl, correlation_id          |
| Sort support      | Yes (asc/desc)                        | No                                   |
| Limit support     | Yes                                   | No                                   |
| Use case          | Simple key-value with optional search | List-like operations, full wildcard  |

When to Use Each Version

| Scenario                                | Recommended Version |
|-----------------------------------------|---------------------|
| Simple key-value storage                | v1                  |
| List/queue operations (append to array) | v2 POST             |
| Full wildcard pattern matching          | v2                  |
| Need to sort or paginate results        | v1                  |
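
The difference in search syntax can be illustrated locally with shell glob matching, which follows the same wildcard style (keys below are hypothetical):

```shell
# v1: a search term is treated as a prefix ("foo" behaves like "foo*").
# v2: wildcards must be written explicitly and may appear anywhere.
matches() { case "$2" in $1) echo "match";; *) echo "no match";; esac; }
matches 'foo*'  foobar   # v1 search=foo (implicit trailing *): match
matches 'foo'   foobar   # v2 search=foo (no wildcard): no match
matches '*bar*' rebars   # v2 infix wildcard: match
```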

v1 Endpoints

GET /api/v1/selection_input/{path}

Fetch stored JSON. If value is an object, optional search/limit/sort applies to its keys.

Query Parameters:

  • search - Wildcard prefix search (adds * implicitly)
  • sort - Sort order (asc or desc)
  • limit - Maximum results (must be > 0)

Success Response (200):

{
  "foo": 1,
  "foobar": 2
}

Errors:

  • 404 - Path does not exist
  • 400 - Invalid search/sort/limit parameters
  • 500 - Backend failure

Example:

curl -s "https://cdn-manager/api/v1/selection_input/config?search=foo&limit=2"

POST /api/v1/selection_input/{path}

Upsert (merge) JSON at path. Nested objects are merged recursively.

Query Parameters:

  • ttl - Expiry time as humantime string (e.g., 10m, 1h)

Request:

{
  "feature_flag": true,
  "ratio": 0.5
}

Success: 201 Created echoing the payload

Errors:

  • 500 / 503 - Backend failure

Example:

curl -s -X POST "https://cdn-manager/api/v1/selection_input/config?ttl=10m" \
  -H "Content-Type: application/json" \
  -d '{
    "feature_flag": true,
    "ratio": 0.5
  }'
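
The recursive merge behavior can be previewed locally with jq's deep-merge operator, which behaves analogously (illustrative values):

```shell
# Nested objects merge key-by-key; conflicting scalars take the new value.
EXISTING='{"a":{"x":1,"y":2},"b":1}'
UPDATE='{"a":{"y":9},"c":3}'
jq -nc --argjson e "$EXISTING" --argjson u "$UPDATE" '$e * $u'
# {"a":{"x":1,"y":9},"b":1,"c":3}
```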

DELETE /api/v1/selection_input/{path}

Delete stored value.

Success: 204 No Content

Errors: 503 - Backend failure


v2 Endpoints

GET /api/v2/selection_input/{path}

Fetch stored JSON with optional wildcard filtering.

Query Parameters:

  • search - Full wildcard pattern (e.g., foo*, *bar*)
  • correlation_id - Accepted but currently ignored (logging only)

Success Response (200):

{
  "foo": 1,
  "foobar": 2
}

Errors:

  • 400 - Invalid search pattern
  • 404 - Path does not exist
  • 500 - Backend failure

Example:

curl -s "https://cdn-manager/api/v2/selection_input/config?search=foo*"

PUT /api/v2/selection_input/{path}

Insert/replace value. Old value is discarded.

Query Parameters:

  • ttl - Expiry time as humantime string

Request:

{
  "items": ["a", "b", "c"]
}

Success: 200 OK

Example:

curl -s -X PUT "https://cdn-manager/api/v2/selection_input/catalog" \
  -H "Content-Type: application/json" \
  -d '{
    "items": ["a", "b", "c"]
  }'

POST /api/v2/selection_input/{path}

Push a value to the back of a list-like entry (append to array).

Query Parameters:

  • ttl - Expiry time as humantime string

Request (any JSON value):

{
  "item": 42
}

Or a simple string:

"ready-for-publish"

Success: 200 OK

Example:

curl -s -X POST "https://cdn-manager/api/v2/selection_input/queue" \
  -H "Content-Type: application/json" \
  -d '"ready-for-publish"'

DELETE /api/v2/selection_input/{path}

Delete stored value.

Success: 204 No Content


Next Steps

10.4 - Data Store API

Generic JSON key/value storage

Overview

The Data Store API provides generic JSON key/value storage for short-lived or simple structured data.

Base URL

https://<manager-host>/api/v1/datastore

Endpoints

GET /api/v1/datastore

List all known keys.

Query Parameters:

  • show_hidden - Boolean (default false). When true, includes internal keys starting with _.

Success Response (200):

["user:123", "config:settings", "session:abc"]

Hidden Keys: Keys starting with _ are reserved for internal use (e.g., subnet service). Writing to hidden keys via the datastore API returns 400 Bad Request.
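Example (host placeholder as elsewhere; a sketch of both listing modes):

```shell
base="https://cdn-manager/api/v1/datastore"

# Visible keys only
curl -s "$base"

# Include internal keys prefixed with "_"
curl -s "$base?show_hidden=true"
```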


GET /api/v1/datastore/{key}

Retrieve the JSON value for a specific key.

Success Response (200): The stored JSON value

Errors:

  • 404 - Key does not exist
  • 500 - Backend failure

Example:

curl -s "https://cdn-manager/api/v1/datastore/user:123"

POST /api/v1/datastore/{key}

Create a new JSON value at the specified key. Fails if the key already exists.

Query Parameters:

  • ttl - Expiry time as humantime string (e.g., 60s, 1h)

Request:

{
  "id": 123,
  "name": "alice"
}

Success: 201 Created

Errors:

  • 409 Conflict - Key already exists
  • 500 - Backend failure

Example:

curl -s -X POST "https://cdn-manager/api/v1/datastore/user:123?ttl=1h" \
  -H "Content-Type: application/json" \
  -d '{"id":123,"name":"alice"}'
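Because POST is create-only, repeating the same call is a quick way to observe the 409 behavior; a sketch, using the same key and body as above:

```shell
url="https://cdn-manager/api/v1/datastore/user:123"
body='{"id":123,"name":"alice"}'

# First call creates the key (201); the repeat hits the existing key (409)
first=$(curl -s -o /dev/null -w "%{http_code}" -X POST "$url" -H "Content-Type: application/json" -d "$body")
second=$(curl -s -o /dev/null -w "%{http_code}" -X POST "$url" -H "Content-Type: application/json" -d "$body")
echo "first=$first second=$second"
```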

PUT /api/v1/datastore/{key}

Update or replace the JSON value at an existing key.

Query Parameters:

  • ttl - Expiry time as humantime string

Success: 200 OK

Errors:

  • 404 - Key does not exist
  • 500 - Backend failure

Example:

curl -s -X PUT "https://cdn-manager/api/v1/datastore/user:123" \
  -H "Content-Type: application/json" \
  -d '{"id":123,"name":"alice-updated"}'

DELETE /api/v1/datastore/{key}

Delete the value at the specified key. Idempotent operation.

Success: 204 No Content

Errors: 500 - Backend failure

Example:

curl -s -X DELETE "https://cdn-manager/api/v1/datastore/user:123"


10.5 - Subnets API

CIDR-to-value mappings for routing decisions

Overview

The Subnets API manages CIDR-to-value mappings used for routing decisions. This allows classification of IP ranges for routing purposes.

Base URL

https://<manager-host>/api/v1/subnets

Endpoints

PUT /api/v1/subnets

Create or update subnet mappings.

Request:

{
  "192.168.1.0/24": "office",
  "10.0.0.0/8": "internal",
  "203.0.113.0/24": "external"
}

Success: 200 OK

Errors:

  • 400 - Invalid CIDR format
  • 500 - Backend failure

Example:

curl -s -X PUT "https://cdn-manager/api/v1/subnets" \
  -H "Content-Type: application/json" \
  -d '{
    "192.168.1.0/24": "office",
    "10.0.0.0/8": "internal"
  }'

GET /api/v1/subnets

List all subnet mappings.

Success Response (200): JSON object of CIDR-to-value mappings

Example:

curl -s "https://cdn-manager/api/v1/subnets" | jq '.'

DELETE /api/v1/subnets

Delete all subnet mappings.

Success: 204 No Content
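Example (this call is destructive and clears every mapping, so capturing the status code is a reasonable habit; host is a placeholder):

```shell
base="https://cdn-manager/api/v1/subnets"

# Removes ALL subnet mappings in one call
status=$(curl -s -o /dev/null -w "%{http_code}" -X DELETE "$base")
echo "DELETE /subnets returned $status"
```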


GET /api/v1/subnets/byKey/{subnet}

Retrieve subnet mappings whose CIDR begins with the given prefix.

Example:

curl -s "https://cdn-manager/api/v1/subnets/byKey/192.168" | jq '.'

GET /api/v1/subnets/byValue/{value}

Retrieve subnet mappings with the given classification value.

Example:

curl -s "https://cdn-manager/api/v1/subnets/byValue/office" | jq '.'

DELETE /api/v1/subnets/byKey/{subnet}

Delete subnet mappings whose CIDR begins with the given prefix.


DELETE /api/v1/subnets/byValue/{value}

Delete subnet mappings with the given classification value.
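The scoped deletes mirror the byKey/byValue GET examples above; the prefix and value below are placeholders:

```shell
base="https://cdn-manager/api/v1/subnets"

# Remove every mapping under the 192.168 prefix
curl -s -X DELETE "$base/byKey/192.168"

# Remove every mapping classified as "office"
curl -s -X DELETE "$base/byValue/office"
```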



10.6 - Routing API

GeoIP lookups and IP validation

Overview

The Routing API provides GeoIP information lookup and IP address validation for routing decisions.

Base URL

https://<manager-host>/api/v1/routing

Endpoints

GET /api/v1/routing/geoip

Look up GeoIP information for an IP address.

Query Parameters:

  • ip - IP address to look up

Success Response (200):

{
  "city": {
    "name": "Washington"
  },
  "asn": 64512
}

Errors:

  • 400 - Invalid IP format
  • 500 - Backend failure

Caching: Cache-Control: public, max-age=86400 (24 hours)

Example:

curl -s "https://cdn-manager/api/v1/routing/geoip?ip=149.101.100.0"

GET /api/v1/routing/validate

Check whether an IP address is allowed (not blocked).

Query Parameters:

  • ip - IP address to validate

Success Response (200): Empty body (IP is allowed)

Forbidden Response (403):

Access Denied

Errors:

  • 400 - Invalid IP format
  • 500 - Backend failure

Caching: Cache-Control headers included (default: max-age=300, configurable via [tuning] section)

Example:

curl -i "https://cdn-manager/api/v1/routing/validate?ip=149.101.100.0"

Use Cases

GeoIP-Based Routing

Use the /geoip endpoint to determine the geographic location and ASN of an IP address for routing decisions:

# Get location data for routing
IP_INFO=$(curl -s "https://cdn-manager/api/v1/routing/geoip?ip=203.0.113.50")
CITY=$(echo "$IP_INFO" | jq -r '.city.name')
ASN=$(echo "$IP_INFO" | jq -r '.asn')

echo "Routing based on city: $CITY, ASN: $ASN"

IP Validation

Use the /validate endpoint to check if an IP is allowed before processing requests:

# Check if IP is allowed
RESPONSE=$(curl -s -o /dev/null -w "%{http_code}" \
  "https://cdn-manager/api/v1/routing/validate?ip=203.0.113.50")

if [ "$RESPONSE" = "200" ]; then
  echo "IP is allowed"
elif [ "$RESPONSE" = "403" ]; then
  echo "IP is blocked"
fi


10.7 - Discovery API

Host and namespace discovery

Overview

The Discovery API provides information about discovered hosts and namespaces. Discovery is configured through the Helm chart (see Configuration below).

Base URL

https://<manager-host>/api/v1/discovery

Endpoints

GET /api/v1/discovery/hosts

Return discovered hosts grouped by namespace.

Success Response (200):

{
  "directors": [
    { "name": "director-1.example.com" }
  ],
  "edge-servers": [
    { "name": "cdn1.example.com" },
    { "name": "cdn2.example.com" }
  ]
}

Example:

curl -s "https://cdn-manager/api/v1/discovery/hosts"

GET /api/v1/discovery/namespaces

Return discovery namespaces with their corresponding Confd URIs.

Success Response (200):

[
  {
    "namespace": "edge-servers",
    "confd_uri": "/api/v1/confd/edge-servers"
  },
  {
    "namespace": "directors",
    "confd_uri": "/api/v1/confd/directors"
  }
]

Example:

curl -s "https://cdn-manager/api/v1/discovery/namespaces"

Configuration

Discovery is configured via the Helm chart values.yaml file under manager.discovery:

manager:
  discovery:
    - namespace: "directors"
      hosts:
        - director-1.example.com
        - director-2.example.com
    - namespace: "edge-servers"
      hosts:
        - cdn1.example.com
        - cdn2.example.com

Each entry defines a namespace with a list of hostnames. Optionally, a pattern field can be specified for regex-based host matching.
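As a sketch only (the exact regex semantics of pattern are not specified here), a namespace matched by a pattern instead of a fixed host list might look like:

```yaml
manager:
  discovery:
    - namespace: "edge-servers"
      # Hypothetical regex; verify pattern syntax against your release notes
      pattern: "^cdn[0-9]+\\.example\\.com$"
```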



10.8 - Metrics API

Metrics submission and aggregation

Overview

The Metrics API allows submission and retrieval of metrics data from CDN components.

Base URL

https://<manager-host>/api/v1/metrics

Endpoints

POST /api/v1/metrics

Submit metrics data.

Request:

{
  "example.com": {
    "metric1": 100,
    "metric2": 200
  }
}

Success: 200 OK

Errors: 500 - Validation/backend errors

Example:

curl -s -X POST "https://cdn-manager/api/v1/metrics" \
  -H "Content-Type: application/json" \
  -d '{
    "example.com": {
      "metric1": 100,
      "metric2": 200
    }
  }'

GET /api/v1/metrics

Return aggregated metrics per host.

Response: JSON object with aggregated metrics per host

Note: Metrics are stored per host for up to 5 minutes. Hosts that stop reporting disappear from the aggregation after that window. When no metrics have been reported, the endpoint returns an empty object ({}).

Example:

curl -s "https://cdn-manager/api/v1/metrics"
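Combining submission and retrieval gives a quick end-to-end check; the host and metric names are placeholders:

```shell
base="https://cdn-manager/api/v1/metrics"

# Submit one sample for a host
curl -s -X POST "$base" \
  -H "Content-Type: application/json" \
  -d '{"example.com": {"requests": 1}}'

# Read back the aggregate for that host (empty if nothing reported recently)
curl -s "$base" | jq '."example.com"'
```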

Metrics Retention

  • Metrics are stored for up to 5 minutes in the aggregation layer
  • For long-term metrics storage, data is forwarded to VictoriaMetrics
  • Query historical metrics via Grafana dashboards at /grafana


10.9 - Configuration API

Configuration document management

Overview

The Configuration API provides endpoints for managing the system configuration document. ETag is supported; send If-None-Match for conditional GET (may return 304).

Operational Note: This API is intended for internal verification only. Behavior is undefined in multi-replica clusters because pods do not coordinate config writes.

Base URL

https://<manager-host>/api/v1/configuration

Endpoints

GET /api/v1/configuration

Retrieve the configuration document.

Success: 200 OK with configuration JSON

Conditional GET: Returns 304 Not Modified if If-None-Match header matches current ETag

Example:

# Get ETag from response headers
etag=$(curl -s -D- -o /dev/null "https://cdn-manager/api/v1/configuration" | awk '/ETag/{print $2}' | tr -d '\r')

# Conditional GET - returns 304 if config unchanged
curl -s -H "If-None-Match: $etag" "https://cdn-manager/api/v1/configuration" -o /tmp/cfg.json -w "%{http_code}\n"

PUT /api/v1/configuration

Replace the configuration document.

Request:

{
  "feature_flag": false,
  "ratio": 0.25
}

Success: 200 OK

Errors:

  • 400 - Invalid configuration format
  • 500 - Backend failure

DELETE /api/v1/configuration

Delete the configuration document.

Success: 200 OK


ETag Usage

The configuration API supports ETags for optimistic concurrency control:

# 1. Get current config and ETag
response=$(curl -s -D headers.txt "https://cdn-manager/api/v1/configuration")
etag=$(grep -i ETag headers.txt | cut -d' ' -f2 | tr -d '\r')

# 2. Modify the config as needed
modified_config=$(echo "$response" | jq '.feature_flag = true')

# 3. Update with ETag to prevent overwriting concurrent changes
curl -s -X PUT "https://cdn-manager/api/v1/configuration" \
  -H "Content-Type: application/json" \
  -H "If-Match: $etag" \
  -d "$modified_config"


10.10 - Operator UI API

Blocked tokens, user agents, and referrers

Overview

The Operator UI API provides read-only helpers exposing curated selection input content for the operator interface.

Query Parameters: search, sort, limit (same as selection input v1)

Note: Stored keys for user agents/referrers are URL-safe base64; responses decode them to human-readable values.

Base URL

https://<manager-host>/api/v1/operator_ui

Endpoints

Blocked Household Tokens

GET /api/v1/operator_ui/modules/blocked_tokens

List all blocked household tokens.

Success Response (200):

[
  {
    "household_token": "house-001_token-abc",
    "expire_time": 1625247600
  }
]

GET /api/v1/operator_ui/modules/blocked_tokens/{token}

Get details for a specific blocked token.

Success Response (200):

{
  "household_token": "house-001_token-abc",
  "expire_time": 1625247600
}
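Example (both token endpoints follow the familiar curl pattern; the token value and search prefix are placeholders):

```shell
base="https://cdn-manager/api/v1/operator_ui/modules/blocked_tokens"

# List tokens matching a prefix (search/sort/limit as in selection input v1)
curl -s "$base?search=house-001*&limit=10"

# Fetch one token's details
curl -s "$base/house-001_token-abc"
```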

Blocked User Agents

GET /api/v1/operator_ui/modules/blocked_user_agents

List all blocked user agents.

Success Response (200):

[
  {
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
  },
  {
    "user_agent": "curl/7.68.0"
  }
]

GET /api/v1/operator_ui/modules/blocked_user_agents/{encoded}

Get details for a specific blocked user agent. The path variable is URL-safe base64 encoded.

Example:

# Encode the user agent
ENC=$(python3 -c "import base64; print(base64.urlsafe_b64encode(b'curl/7.68.0').decode().rstrip('='))")

# Get details
curl -s "https://cdn-manager/api/v1/operator_ui/modules/blocked_user_agents/$ENC"

Blocked Referrers

GET /api/v1/operator_ui/modules/blocked_referrers

List all blocked referrers.

Success Response (200):

[
  {
    "referrer": "https://spam-example.com"
  }
]

GET /api/v1/operator_ui/modules/blocked_referrers/{encoded}

Get details for a specific blocked referrer. The path variable is URL-safe base64 encoded.

Example:

# Encode the referrer (matching the stored value shown above)
ENC=$(python3 -c "import base64; print(base64.urlsafe_b64encode(b'https://spam-example.com').decode().rstrip('='))")

# Get details
curl -s "https://cdn-manager/api/v1/operator_ui/modules/blocked_referrers/$ENC"

URL-Safe Base64 Encoding

The Operator UI API uses URL-safe base64 encoding for path parameters. To encode values:

Python:

import base64

# Encode
encoded = base64.urlsafe_b64encode(b'value').decode().rstrip('=')

# Decode
decoded = base64.urlsafe_b64decode(encoded + '=' * (-len(encoded) % 4)).decode()

Bash (with basenc from GNU coreutils; openssl base64 has no URL-safe mode):

# Encode
echo -n "value" | basenc --base64url | tr -d '='

# Decode (basenc requires padding, so restore it first)
enc="encoded"
while [ $(( ${#enc} % 4 )) -ne 0 ]; do enc="${enc}="; done
echo "$enc" | basenc --base64url -d


10.11 - OpenAPI Specification

Complete OpenAPI 3.0 specification

Overview

The CDN Manager API is documented using the OpenAPI 3.0 specification. This appendix provides the complete specification for reference and for generating API clients.

OpenAPI Specification (YAML)

openapi: 3.0.3
info:
  title: AgileTV CDN Manager API
  version: "1.0"
servers:
  - url: https://<manager-host>/api
    description: CDN Manager API server
paths:
  /v1/auth/login:
    post:
      summary: Login and create session
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/LoginRequest'
      responses:
        '200':
          description: Session created
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/LoginResponse'
        '401': { description: Unauthorized, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
        '500': { description: Internal error, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
  /v1/auth/token:
    post:
      summary: Exchange session for access token
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/TokenRequest'
      responses:
        '200':
          description: Access token
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/TokenResponse'
        '401': { description: Unauthorized, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
        '500': { description: Internal error, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
  /v1/auth/logout:
    post:
      summary: Revoke session
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/LogoutRequest'
      responses:
        '200': { description: Revoked, content: { application/json: { schema: { $ref: '#/components/schemas/LogoutResponse' } } } }
        '401': { description: Unauthorized, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
        '500': { description: Internal error, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
  /v1/selection_input{tail}:
    get:
      summary: Read selection input
      parameters:
        - $ref: '#/components/parameters/Tail'
        - $ref: '#/components/parameters/Search'
        - $ref: '#/components/parameters/Sort'
        - $ref: '#/components/parameters/Limit'
      responses:
        '200': { description: JSON value }
        '400': { description: Bad request, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
        '404': { description: Not found }
        '500': { description: Backend failure }
    post:
      summary: Merge selection input
      parameters:
        - $ref: '#/components/parameters/Tail'
        - $ref: '#/components/parameters/Ttl'
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/AnyJson'
      responses:
        '201': { description: Created, content: { application/json: { schema: { $ref: '#/components/schemas/AnyJson' } } } }
        '500': { description: Backend failure }
        '503': { description: Service unavailable }
    delete:
      summary: Delete selection input
      parameters:
        - $ref: '#/components/parameters/Tail'
      responses:
        '204': { description: Deleted }
        '503': { description: Service unavailable }
  /v2/selection_input{tail}:
    get:
      summary: Read selection input v2
      parameters:
        - $ref: '#/components/parameters/TailV2'
        - $ref: '#/components/parameters/Search'
      responses:
        '200': { description: JSON value }
        '400': { description: Invalid search pattern }
        '404': { description: Not found }
        '500': { description: Backend failure }
    put:
      summary: Replace selection input v2
      parameters:
        - $ref: '#/components/parameters/TailV2'
        - $ref: '#/components/parameters/Ttl'
        - $ref: '#/components/parameters/CorrelationId'
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/AnyJson'
      responses:
        '200': { description: Updated }
        '500': { description: Backend failure }
    post:
      summary: Push to selection input v2
      parameters:
        - $ref: '#/components/parameters/TailV2'
        - $ref: '#/components/parameters/Ttl'
        - $ref: '#/components/parameters/CorrelationId'
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/AnyJson'
      responses:
        '200': { description: Pushed }
        '500': { description: Backend failure }
    delete:
      summary: Delete selection input v2
      parameters:
        - $ref: '#/components/parameters/TailV2'
      responses:
        '204': { description: Deleted }
        '500': { description: Backend failure }
  /v1/configuration:
    get:
      summary: Read configuration
      responses:
        '200': { description: Configuration, content: { application/json: { schema: { $ref: '#/components/schemas/AnyJson' } } }, headers: { ETag: { schema: { type: string } } } }
        '304': { description: Not modified }
        '500': { description: Backend failure }
    put:
      summary: Replace configuration
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/AnyJson'
      responses:
        '200': { description: Replaced }
        '500': { description: Backend failure }
    delete:
      summary: Delete configuration
      responses:
        '200': { description: Deleted }
        '500': { description: Backend failure }
  /v1/routing/geoip:
    get:
      summary: GeoIP lookup
      parameters:
        - name: ip
          in: query
          required: true
          schema: { type: string }
      responses:
        '200': { description: GeoIP data, content: { application/json: { schema: { $ref: '#/components/schemas/GeoIpResponse' } } } }
        '400': { description: Invalid IP }
        '500': { description: Backend failure }
  /v1/routing/validate:
    get:
      summary: Validate routing
      parameters:
        - name: ip
          in: query
          required: true
          schema: { type: string }
      responses:
        '200': { description: Allowed }
        '403': { description: Access Denied }
        '400': { description: Invalid IP }
        '500': { description: Backend failure }
  /v1/metrics:
    post:
      summary: Ingest metrics
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/MetricsIngress'
      responses:
        '200': { description: Stored }
        '500': { description: Validation/back-end error }
    get:
      summary: Aggregate metrics
      responses:
        '200': { description: Aggregated metrics, content: { application/json: { schema: { $ref: '#/components/schemas/AnyJson' } } } }
        '500': { description: Backend failure }
  /v1/discovery/hosts:
    get:
      summary: List discovered hosts by namespace
      responses:
        '200':
          description: Discovered hosts keyed by namespace
          content:
            application/json:
              schema:
                type: object
                additionalProperties:
                  type: array
                  items:
                    $ref: '#/components/schemas/DiscoveryHost'
        '500': { description: Backend failure }
  /v1/discovery/namespaces:
    get:
      summary: List discovery namespaces with Confd URIs
      responses:
        '200':
          description: Namespaces with Confd links
          content:
            application/json:
              schema:
                type: array
                items:
                  $ref: '#/components/schemas/DiscoveryNamespace'
        '500': { description: Backend failure }
  /v1/datastore:
    get:
      summary: List datastore keys
      responses:
        '200': { description: Keys list, content: { application/json: { schema: { type: array, items: { type: string } } } } }
        '500': { description: Backend failure }
  /v1/datastore/{key}:
    get:
      summary: Get a JSON value by key
      parameters:
        - name: key
          in: path
          required: true
          schema: { type: string }
      responses:
        '200': { description: JSON value, content: { application/json: { schema: { $ref: '#/components/schemas/AnyJson' } } } }
        '404': { description: Not found }
        '500': { description: Backend failure }
    post:
      summary: Create a JSON value at key
      parameters:
        - name: key
          in: path
          required: true
          schema: { type: string }
        - $ref: '#/components/parameters/Ttl'
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/AnyJson'
      responses:
        '201': { description: Created }
        '409': { description: Conflict (already exists) }
        '500': { description: Backend failure }
    put:
      summary: Update/replace a JSON value at key
      parameters:
        - name: key
          in: path
          required: true
          schema: { type: string }
        - $ref: '#/components/parameters/Ttl'
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/AnyJson'
      responses:
        '200': { description: Updated }
        '404': { description: Not found }
        '500': { description: Backend failure }
    delete:
      summary: Delete a datastore key
      parameters:
        - name: key
          in: path
          required: true
          schema: { type: string }
      responses:
        '204': { description: Deleted }
        '500': { description: Backend failure }
  /v1/subnets:
    get:
      summary: List all subnet mappings
      responses:
        '200': { description: Subnet mappings, content: { application/json: { schema: { type: object, additionalProperties: { type: string } } } } }
        '500': { description: Backend failure }
    put:
      summary: Create or update subnet mappings
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              additionalProperties:
                type: string
              description: Map of CIDR strings to classification values
      responses:
        '200': { description: Created }
        '400': { description: Invalid CIDR format }
        '500': { description: Backend failure }
    delete:
      summary: Delete all subnet mappings
      responses:
        '204': { description: Deleted }
        '500': { description: Backend failure }
  /v1/subnets/byKey/{subnet}:
    get:
      summary: Get subnet mappings by CIDR prefix
      parameters:
        - name: subnet
          in: path
          required: true
          schema: { type: string }
      responses:
        '200': { description: Subnet mappings, content: { application/json: { schema: { type: object, additionalProperties: { type: string } } } } }
        '500': { description: Backend failure }
    delete:
      summary: Delete subnet mappings by CIDR prefix
      parameters:
        - name: subnet
          in: path
          required: true
          schema: { type: string }
      responses:
        '204': { description: Deleted }
        '500': { description: Backend failure }
  /v1/subnets/byValue/{value}:
    get:
      summary: Get subnet mappings by value
      parameters:
        - name: value
          in: path
          required: true
          schema: { type: string }
      responses:
        '200': { description: Subnet mappings, content: { application/json: { schema: { type: object, additionalProperties: { type: string } } } } }
        '500': { description: Backend failure }
    delete:
      summary: Delete subnet mappings by value
      parameters:
        - name: value
          in: path
          required: true
          schema: { type: string }
      responses:
        '204': { description: Deleted }
        '500': { description: Backend failure }
  /v1/operator_ui/modules/blocked_tokens:
    get:
      summary: List blocked tokens
      parameters:
        - $ref: '#/components/parameters/Search'
        - $ref: '#/components/parameters/Sort'
        - $ref: '#/components/parameters/Limit'
      responses:
        '200': { description: Blocked tokens, content: { application/json: { schema: { type: array, items: { $ref: '#/components/schemas/BlockedToken' } } } } }
        '400': { description: Parse error, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
  /v1/operator_ui/modules/blocked_tokens/{token}:
    get:
      summary: Get blocked token
      parameters:
        - name: token
          in: path
          required: true
          schema: { type: string }
      responses:
        '200': { description: Blocked token, content: { application/json: { schema: { $ref: '#/components/schemas/BlockedToken' } } } }
        '404': { description: Not found }
        '400': { description: Parse error, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
  /v1/operator_ui/modules/blocked_user_agents:
    get:
      summary: List blocked user agents
      parameters:
        - $ref: '#/components/parameters/Search'
        - $ref: '#/components/parameters/Sort'
        - $ref: '#/components/parameters/Limit'
      responses:
        '200': { description: Blocked user agents, content: { application/json: { schema: { type: array, items: { $ref: '#/components/schemas/BlockedUserAgent' } } } } }
        '400': { description: Parse error, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
  /v1/operator_ui/modules/blocked_user_agents/{encoded}:
    get:
      summary: Get blocked user agent
      parameters:
        - name: encoded
          in: path
          required: true
          schema: { type: string }
      responses:
        '200': { description: Blocked user agent, content: { application/json: { schema: { $ref: '#/components/schemas/BlockedUserAgent' } } } }
        '400': { description: Parse error, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
  /v1/operator_ui/modules/blocked_referrers:
    get:
      summary: List blocked referrers
      parameters:
        - $ref: '#/components/parameters/Search'
        - $ref: '#/components/parameters/Sort'
        - $ref: '#/components/parameters/Limit'
      responses:
        '200': { description: Blocked referrers, content: { application/json: { schema: { type: array, items: { $ref: '#/components/schemas/BlockedReferrer' } } } } }
        '400': { description: Parse error, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
  /v1/operator_ui/modules/blocked_referrers/{encoded}:
    get:
      summary: Get blocked referrer
      parameters:
        - name: encoded
          in: path
          required: true
          schema: { type: string }
      responses:
        '200': { description: Blocked referrer, content: { application/json: { schema: { $ref: '#/components/schemas/BlockedReferrer' } } } }
        '400': { description: Parse error, content: { application/json: { schema: { $ref: '#/components/schemas/ErrorResponse' } } } }
  /v1/health/alive:
    get:
      summary: Liveness check
      responses:
        '200': { description: Alive, content: { application/json: { schema: { $ref: '#/components/schemas/HealthStatus' } } } }
  /v1/health/ready:
    get:
      summary: Readiness check
      responses:
        '200': { description: Ready, content: { application/json: { schema: { $ref: '#/components/schemas/HealthStatus' } } } }
        '503': { description: Unready, content: { application/json: { schema: { $ref: '#/components/schemas/HealthStatus' } } } }
components:
  parameters:
    Tail:
      name: tail
      in: path
      required: true
      schema: { type: string }
    TailV2:
      name: tail
      in: path
      required: true
      schema: { type: string }
    Search:
      name: search
      in: query
      required: false
      schema: { type: string }
    Sort:
      name: sort
      in: query
      required: false
      schema: { type: string, enum: [asc, desc] }
    Limit:
      name: limit
      in: query
      required: false
      schema: { type: integer, minimum: 1 }
    Ttl:
      name: ttl
      in: query
      required: false
      schema: { type: string, description: Humantime duration }
    CorrelationId:
      name: correlation_id
      in: query
      required: false
      schema: { type: string }
  schemas:
    LoginRequest:
      type: object
      required: [email, password]
      properties:
        email: { type: string, format: email }
        password: { type: string, format: password }
    LoginResponse:
      type: object
      properties:
        session_id: { type: string }
        session_token: { type: string }
        verified_at: { type: string, format: date-time }
        expires_at: { type: string, format: date-time }
    LogoutRequest:
      type: object
      required: [session_id]
      properties:
        session_id: { type: string }
        session_token: { type: string }
    LogoutResponse:
      type: object
      properties:
        status: { $ref: '#/components/schemas/StatusValue' }
    TokenRequest:
      type: object
      required: [session_id, session_token, grant_type]
      properties:
        session_id: { type: string }
        session_token: { type: string }
        scope: { type: string }
        grant_type: { type: string, enum: [session] }
    TokenResponse:
      type: object
      required: [access_token, scope, expires_in, token_type]
      properties:
        access_token: { type: string }
        scope: { type: string }
        expires_in: { type: integer, format: int64 }
        token_type: { type: string, enum: [bearer] }
    ErrorResponse:
      type: object
      properties:
        message: { type: string }
    AnyJson:
      description: Arbitrary JSON value
    MetricsIngress:
      type: object
      additionalProperties:
        type: object
        additionalProperties: { type: number }
    GeoIpResponse:
      type: object
      properties:
        city:
          type: object
          properties:
            name: { type: string }
        asn: { type: integer }
        is_anonymous: { type: boolean }
    BlockedToken:
      type: object
      properties:
        household_token: { type: string }
        expire_time: { type: integer, format: int64 }
    BlockedUserAgent:
      type: object
      properties:
        user_agent: { type: string }
    BlockedReferrer:
      type: object
      properties:
        referrer: { type: string }
    DiscoveryHost:
      type: object
      properties:
        name: { type: string }
    DiscoveryNamespace:
      type: object
      properties:
        namespace: { type: string }
        confd_uri: { type: string }
    HealthStatus:
      type: object
      properties:
        status: { $ref: '#/components/schemas/StatusValue' }
    StatusValue:
      type: string
      enum: [Ok, Fail]
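As an illustration of the token flow described by the TokenRequest/TokenResponse schemas above, a verified session can be exchanged for a bearer token roughly as follows. This is a sketch only: the gateway host and the /api/v1/token path are assumptions — check the paths section of the full specification for the actual endpoint.

```shell
# Exchange a verified session for a bearer token (sketch only).
# TokenRequest requires session_id, session_token and grant_type=session.
GATEWAY="https://manager.example.com"   # hypothetical gateway URL
BODY='{"session_id":"sess-123","session_token":"tok-abc","grant_type":"session"}'

# A successful TokenResponse carries access_token, scope, expires_in and
# token_type=bearer; errors come back as ErrorResponse with a message field.
curl -sk -X POST "$GATEWAY/api/v1/token" \
  -H 'Content-Type: application/json' \
  -d "$BODY" || true
```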

Using the OpenAPI Specification

Generating API Clients

The OpenAPI specification can be used to generate client libraries in multiple languages:

Using openapi-generator:

# Generate Python client
openapi-generator generate -i openapi.yaml -g python -o ./python-client

# Generate TypeScript client
openapi-generator generate -i openapi.yaml -g typescript-axios -o ./typescript-client

# Generate Go client
openapi-generator generate -i openapi.yaml -g go -o ./go-client

Using swagger-codegen:

swagger-codegen generate -i openapi.yaml -l python -o ./python-client

Validating the Specification

To validate the OpenAPI specification:

# Using swagger-cli
swagger-cli validate openapi.yaml

# Using spectral
spectral lint openapi.yaml
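If the toolchain itself is in doubt (for example in an air-gapped environment), it can be smoke-tested against a minimal spec before pointing it at the full openapi.yaml. This sketch assumes swagger-cli is on the PATH:

```shell
# Write a minimal but valid OpenAPI 3 document and validate it.
cat > /tmp/minimal-openapi.yaml <<'EOF'
openapi: 3.0.3
info:
  title: smoke test
  version: 0.0.1
paths: {}
EOF
swagger-cli validate /tmp/minimal-openapi.yaml || echo "swagger-cli not installed"
```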

11 - Troubleshooting Guide

Common issues and resolution procedures

Overview

This guide provides troubleshooting procedures for common issues encountered when operating the AgileTV CDN Manager (ESB3027). Use the diagnostic commands and resolution steps to identify and resolve problems.

Diagnostic Tools

Cluster Status

# Check node status
kubectl get nodes

# Check all pods
kubectl get pods -A

# Check events sorted by time
kubectl get events --sort-by='.lastTimestamp'

# Check resource usage
kubectl top nodes
kubectl top pods

Component Status

# Check deployments
kubectl get deployments

# Check statefulsets
kubectl get statefulsets

# Check persistent volumes
kubectl get pvc
kubectl get pv

# Check services
kubectl get services

# Check ingress
kubectl get ingress

Common Issues

Pods Stuck in Pending State

Symptoms: Pods remain in Pending state indefinitely.

Causes:

  • Insufficient cluster resources (CPU/memory)
  • No nodes match scheduling constraints
  • PersistentVolume not available

Diagnosis:

# Describe the pending pod
kubectl describe pod <pod-name>

# Check events for scheduling failures
kubectl get events --field-selector reason=FailedScheduling

# Check node capacity
kubectl describe nodes | grep -A 5 "Allocated resources"

# Check available PVs
kubectl get pv

Resolution:

# Free up resources by scaling down non-critical workloads
kubectl scale deployment <deployment> --replicas=0

# Or add additional nodes to the cluster

# If PV is stuck, delete and recreate
kubectl delete pvc <pvc-name>
kubectl delete pod <pod-name>

Pods Stuck in ContainerCreating

Symptoms: Pods remain in ContainerCreating state.

Causes:

  • Image pull failures
  • Volume mount issues
  • Network configuration problems

Diagnosis:

kubectl describe pod <pod-name>

# Check for image pull errors
kubectl get events | grep -i "failed to pull"

# Check volume mount status
kubectl get events | grep -i "mount"

Resolution:

# For image pull issues, verify image exists and credentials
kubectl get secret <pull-secret-name> -o yaml

# For volume issues, check Longhorn volume status
kubectl get volumes -n longhorn-system

# Delete stuck pod to trigger recreation
kubectl delete pod <pod-name> --force --grace-period=0

Persistent Volume Mount Failures

Symptoms: Pod fails to start with error “AttachVolume.Attach failed for volume… is not ready for workloads” or similar volume attachment errors.

Causes:

  • Longhorn volume created but unable to be successfully mounted
  • Network connectivity issues between nodes (Longhorn requires iSCSI and NFS traffic)
  • Longhorn service unhealthy
  • Incorrect storage class configuration

Diagnosis:

# Describe the failing pod to see the error
kubectl describe pod <pod-name>

# Check Longhorn volumes status
kubectl get volumes -n longhorn-system

# Check Longhorn UI for detailed volume status
kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80
# Access: http://localhost:8080

Resolution:

# Verify firewall allows Longhorn traffic between nodes
# Ports 9500 and 8500 must be open (see Networking Guide)

# Check Longhorn is healthy
kubectl get pods -n longhorn-system

# If volume is stuck, delete PVC and pod to trigger recreation
kubectl delete pvc <pvc-name>
kubectl delete pod <pod-name>

Pods in CrashLoopBackOff

Symptoms: Pods repeatedly crash and restart.

Causes:

  • Application configuration errors
  • Missing dependencies (database not ready)
  • Resource limits too low
  • Liveness probe failures

Diagnosis:

# View current logs
kubectl logs <pod-name>

# View previous instance logs
kubectl logs <pod-name> -p

# Describe pod for restart reasons
kubectl describe pod <pod-name>

# Check if dependencies are healthy
kubectl get pods | grep -E "(postgres|kafka|redis)"

Resolution:

# For dependency issues, wait for dependencies to be ready
kubectl wait --for=condition=Ready pod/<dependency-pod> --timeout=300s

# For resource issues, increase limits
kubectl edit deployment <deployment-name>

# For configuration issues, check ConfigMaps and Secrets
kubectl get configmap <configmap-name> -o yaml
kubectl get secret <secret-name> -o yaml

# Restart the deployment
kubectl rollout restart deployment/<deployment-name>

Pods in Terminating State

Symptoms: Pods stuck in Terminating state indefinitely.

Causes:

  • Volume detachment issues
  • Node communication problems
  • Finalizer blocking deletion

Diagnosis:

kubectl describe pod <pod-name>

# Check if node is reachable
kubectl get nodes

# Check finalizers
kubectl get pod <pod-name> -o jsonpath='{.metadata.finalizers}'

Resolution:

# Force delete the pod
kubectl delete pod <pod-name> --force --grace-period=0

# If node is unreachable, drain and remove from cluster
kubectl drain <node-name> --ignore-daemonsets --force
kubectl delete node <node-name>
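If the pod still refuses to go away after a force delete, a finalizer found in the diagnosis step is usually blocking deletion. Clearing finalizers skips whatever cleanup they guard, so treat this as a last resort:

```shell
# Remove all finalizers from the stuck pod (last resort: this bypasses
# the cleanup logic the finalizer was protecting).
PATCH='{"metadata":{"finalizers":null}}'
kubectl patch pod <pod-name> --type=merge -p "$PATCH" || true
```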

Service Unreachable

Symptoms: Service endpoints not accessible.

Causes:

  • No ready pods backing the service
  • Network policy blocking traffic
  • Service port mismatch

Diagnosis:

# Check service endpoints
kubectl get endpoints <service-name>

# Check if pods are ready
kubectl get pods -l app=<label>

# Check network policies
kubectl get networkpolicies

# Test connectivity from within cluster
kubectl run test --rm -it --image=busybox -- wget -O- <service-name>:<port>

Resolution:

# Ensure pods are ready and matching service selector
kubectl get pods --show-labels

# Check service selector matches pod labels
kubectl get service <service-name> -o jsonpath='{.spec.selector}'

# Temporarily disable network policy for testing
kubectl edit networkpolicy <policy-name>

Ingress Not Working

Symptoms: External access via ingress fails.

Causes:

  • Traefik ingress controller not running
  • Ingress configuration errors
  • TLS certificate issues
  • DNS resolution problems

Diagnosis:

# Check Traefik pods
kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik

# Check ingress resources
kubectl get ingress

# Describe ingress for errors
kubectl describe ingress <ingress-name>

# Check Traefik logs
kubectl logs -n kube-system -l app.kubernetes.io/name=traefik

# Test DNS resolution
nslookup <hostname>

Resolution:

# Restart Traefik
kubectl rollout restart deployment -n kube-system traefik

# Fix ingress configuration
kubectl edit ingress <ingress-name>

# Renew or recreate TLS secret
kubectl create secret tls <secret-name> --cert=tls.crt --key=tls.key \
  --dry-run=client -o yaml | kubectl apply -f -

# Verify hostname matches certificate
openssl x509 -in tls.crt -noout -subject -issuer

Database Connection Failures

Symptoms: Application cannot connect to PostgreSQL.

Causes:

  • PostgreSQL cluster not ready
  • Connection pool exhausted
  • Network connectivity issues
  • Authentication failures

Diagnosis:

# Check PostgreSQL cluster status
kubectl get clusters

# Check PostgreSQL pods
kubectl get pods -l app.kubernetes.io/name=postgresql

# Check PostgreSQL logs
kubectl logs -l app.kubernetes.io/name=postgresql

# Test connectivity
kubectl exec -it <app-pod> -- psql -h <postgres-service> -U <user> -d <database>

Resolution:

# Wait for PostgreSQL to be ready
kubectl wait --for=condition=Ready pod -l app.kubernetes.io/name=postgresql --timeout=300s

# Check connection string in application config (decode a specific key;
# '{.data}' alone returns a JSON map and will not base64-decode cleanly)
kubectl get secret <secret-name> -o jsonpath='{.data.<key>}' | base64 -d

# Restart application pods
kubectl rollout restart deployment/<deployment-name>

Kafka Connection Issues

Symptoms: Application cannot connect to Kafka.

Causes:

  • Kafka controllers not ready
  • Topic not created
  • Network connectivity issues

Diagnosis:

# Check Kafka pods
kubectl get pods -l app.kubernetes.io/name=kafka

# Check Kafka logs
kubectl logs -l app.kubernetes.io/name=kafka

# List topics
kubectl exec -it <kafka-pod> -- kafka-topics.sh --bootstrap-server localhost:9092 --list
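When the application connects but messages are not being processed, consumer-group lag is often the clue. The same Kafka image ships kafka-consumer-groups.sh (pod name illustrative):

```shell
# Show partition lag for every consumer group.
BOOTSTRAP=localhost:9092
kubectl exec -it <kafka-pod> -- kafka-consumer-groups.sh \
  --bootstrap-server "$BOOTSTRAP" --describe --all-groups || true
```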

Resolution:

# Wait for Kafka controllers to be ready
kubectl wait --for=condition=Ready pod -l app.kubernetes.io/name=kafka --timeout=300s

# Create missing topic
kubectl exec -it <kafka-pod> -- kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic <topic-name> --partitions 3 --replication-factor 3

# Restart application to reconnect
kubectl rollout restart deployment/<deployment-name>

Redis Connection Issues

Symptoms: Application cannot connect to Redis.

Diagnosis:

# Check Redis pods
kubectl get pods -l app.kubernetes.io/name=redis

# Check Redis logs
kubectl logs -l app.kubernetes.io/name=redis

# Test connectivity
kubectl exec -it <redis-pod> -- redis-cli ping

Resolution:

# Wait for Redis to be ready
kubectl wait --for=condition=Ready pod -l app.kubernetes.io/name=redis --timeout=300s

# Restart application
kubectl rollout restart deployment/<deployment-name>

High Memory Usage

Symptoms: Pods approaching or hitting memory limits.

Diagnosis:

# Check memory usage
kubectl top pods

# Check OOMKilled pods
kubectl get pods --field-selector=status.phase=Failed

# Check for memory leaks in logs
kubectl logs <pod-name> | grep -i "memory\|oom"
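A quick way to confirm an out-of-memory kill is the container's last termination state, which records the reason (the path below reads the first container; adjust the index for multi-container pods):

```shell
# Prints "OOMKilled" if the previous instance hit its memory limit.
JSONPATH='{.status.containerStatuses[0].lastState.terminated.reason}'
kubectl get pod <pod-name> -o jsonpath="$JSONPATH" || true
```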

Resolution:

# Temporarily increase memory limit
kubectl edit deployment <deployment-name>

# Or scale horizontally if HPA is enabled
kubectl scale deployment <deployment-name> --replicas=<n>

# Long-term: Update values.yaml and perform helm upgrade

High CPU Usage

Symptoms: Pods consistently using high CPU.

Diagnosis:

# Check CPU usage
kubectl top pods

# Check for runaway processes
kubectl top pods --sort-by=cpu

Resolution:

# Scale horizontally if HPA is enabled
kubectl scale deployment <deployment-name> --replicas=<n>

# Or increase CPU limits
kubectl edit deployment <deployment-name>

Persistent Volume Issues

Symptoms: PVC not binding or volume errors.

Diagnosis:

# Check PVC status
kubectl get pvc

# Check PV status
kubectl get pv

# Check Longhorn volumes
kubectl get volumes -n longhorn-system

# Check Longhorn UI for details
kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80

Resolution:

# For stuck PVC, delete and recreate
kubectl delete pvc <pvc-name>
kubectl delete pod <pod-name>

# For Longhorn issues, check Longhorn UI
# Access via http://localhost:8080

# Recreate Longhorn volume if necessary

Zitadel Authentication Failures

Symptoms: Users cannot authenticate via Zitadel.

Causes:

  • CORS configuration mismatch
  • External domain misconfigured
  • Zitadel pods not healthy

Diagnosis:

# Check Zitadel pods
kubectl get pods -l app.kubernetes.io/name=zitadel

# Check Zitadel logs
kubectl logs -l app.kubernetes.io/name=zitadel

# Verify external domain configuration
helm get values acd-manager -o yaml | grep -A 5 zitadel

Resolution:

# Ensure global.hosts.manager[0].host matches zitadel.zitadel.ExternalDomain
# Update values.yaml if needed

helm upgrade acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml

# Restart Zitadel
kubectl rollout restart deployment -l app.kubernetes.io/name=zitadel

Certificate Errors

Symptoms: TLS/SSL errors in browser or API calls.

Diagnosis:

# Check certificate expiration
kubectl get secret <tls-secret> -o jsonpath='{.data.tls\.crt}' | base64 -d | \
  openssl x509 -noout -dates

# Check certificate subject
kubectl get secret <tls-secret> -o jsonpath='{.data.tls\.crt}' | base64 -d | \
  openssl x509 -noout -subject -issuer
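It can also help to compare the certificate clients actually receive with the secret's contents. This probes the live endpoint (hostname illustrative):

```shell
HOST=manager.example.com   # replace with your external hostname
# Fetch the served certificate and print its validity window and subject.
openssl s_client -connect "$HOST:443" -servername "$HOST" </dev/null 2>/dev/null | \
  openssl x509 -noout -dates -subject || true
```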

Resolution:

# Renew self-signed certificate
helm upgrade acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml \
  --set ingress.selfSigned=true

# Or update manual certificate
kubectl create secret tls <secret-name> \
  --cert=new-cert.crt --key=new-key.key \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart pods to pick up new certificate
kubectl rollout restart deployment <deployment-name>

Log Collection

Collecting Logs for Support

# Capture timestamp once to ensure consistency
TS=$(date +%Y%m%d-%H%M%S)

# Create log collection directory
mkdir -p ~/cdn-logs-$TS
cd ~/cdn-logs-$TS

# Collect pod logs
for pod in $(kubectl get pods -o name); do
  kubectl logs $pod > ${pod#pod/}.log 2>&1
  kubectl logs $pod -p > ${pod#pod/}.previous.log 2>&1 || true
done

# Collect cluster events
kubectl get events --sort-by='.lastTimestamp' > events.log

# Collect pod descriptions
for pod in $(kubectl get pods -o name); do
  kubectl describe $pod > ${pod#pod/}.describe.txt
done

# Compress for transfer
tar czf cdn-logs-$TS.tar.gz *.log *.txt
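Version and cluster-state snapshots are worth adding to the same directory before the tar step above (a sketch; the redirects tolerate missing tools so a partial bundle is still produced):

```shell
# Record component versions and node state alongside the logs.
kubectl version > versions.txt 2>&1 || true
kubectl get nodes -o wide > nodes.txt 2>&1 || true
helm list -A > helm-releases.txt 2>&1 || true
```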

Emergency Procedures

Complete Cluster Recovery

If the cluster is completely down:

  1. Assess node status:

    kubectl get nodes
    
  2. Restart K3s on nodes:

    # On each node
    systemctl restart k3s
    
  3. If primary server failed:

    • Promote another server node
    • Update load balancer/DNS to point to new primary
  4. Restore from backup if necessary:

    • See Upgrade Guide for restore procedures

Data Recovery

For data recovery scenarios:

  • PostgreSQL: Use Cloudnative PG backup/restore
  • Longhorn: Restore from volume snapshots
  • Kafka: Replication handles most failures
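For PostgreSQL, an on-demand backup can be triggered through the cnpg kubectl plugin, assuming the plugin is installed (the cluster name below is illustrative; see the Cloudnative PG documentation for the matching restore procedure):

```shell
CLUSTER=acd-manager-pg   # hypothetical CNPG cluster name
# Request an on-demand backup, then list Backup objects to track progress.
kubectl cnpg backup "$CLUSTER" || true
kubectl get backups || true
```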

Getting Help

If issues persist:

  1. Collect logs using the procedure above
  2. Check release notes for known issues
  3. Contact support with log bundle and issue description

Next Steps

After resolving issues:

  1. Operations Guide - Preventive maintenance procedures
  2. Configuration Guide - Verify configuration is correct
  3. Architecture Guide - Understand component dependencies

12 - Glossary

Terminology and definitions

Overview

This glossary defines key terms and acronyms used throughout the AgileTV CDN Manager (ESB3027) documentation.

A

ACD (Agile Content Delivery)

The overall CDN solution comprising the Manager (ESB3027) and Director (ESB3024) components.

Agent Node

A Kubernetes node that runs workloads but does not participate in the control plane. Agent nodes provide additional capacity for running application pods.

API Gateway

See NGinx Gateway.

ASN (Autonomous System Number)

A unique identifier for a network on the internet. Used in GeoIP-based routing decisions.

C

CDN Director

The Edge Server Business (ESB3024) component that handles actual content routing and delivery. Multiple Directors can be managed by a single CDN Manager.

Cloudnative PG (CNPG)

A Kubernetes operator that manages PostgreSQL clusters. Provides high availability, automatic failover, and backup capabilities for the Manager’s database layer.

Confd

Configuration daemon that synchronizes configuration from the Manager to CDN Directors. Runs as a sidecar or separate deployment.

CORS (Cross-Origin Resource Sharing)

A security mechanism that allows web applications to make requests to a different domain. Zitadel enforces CORS policies requiring the external domain to match the configured hostname.

CrashLoopBackOff

A Kubernetes pod state indicating the container is repeatedly crashing and being restarted. Typically indicates a configuration or dependency issue.

D

Datastore

The internal key-value storage system used by the Manager for short-lived or simple structured data. Backed by Redis.

Descheduler

A Kubernetes component that periodically analyzes pod distribution and evicts pods from overutilized nodes to optimize cluster balance.

Director

See CDN Director.

E

EDB (EnterpriseDB)

A company that provides PostgreSQL-related software and services. The Cloudnative PG operator was originally developed by EDB.

Ephemeral Storage

Temporary storage available to pods. Used for temporary files and caches. Not persistent across pod restarts.

ESB (Edge Server Business)

The product family designation for CDN components. ESB3027 is the Manager, ESB3024 is the Director.

etcd

A distributed key-value store used by Kubernetes for cluster state management. Runs on Server nodes as part of the control plane.

F

FailedScheduling

A Kubernetes event indicating a pod could not be scheduled due to insufficient resources or scheduling constraints.

Flannel

A network overlay solution for Kubernetes. Provides VXLAN-based networking for pod-to-pod communication.

Frontend GUI

See MIB Frontend.

G

GeoIP

Geographic IP lookup service using MaxMind databases. Used for location-based routing decisions.

Grafana

A visualization and dashboard platform for time-series data. Used to display metrics collected by Telegraf and stored in VictoriaMetrics.

H

Helm Chart

A package of pre-configured Kubernetes resources. The CDN Manager is deployed via a Helm chart that handles all component installation.

HPA (Horizontal Pod Autoscaler)

A Kubernetes feature that automatically scales the number of pods based on CPU/memory utilization or custom metrics.

HTTP Server

The main API server component of the Manager, built with Actix Web (Rust framework).

I

Ingress

A Kubernetes resource that exposes HTTP/HTTPS routes from outside the cluster to services within. The CDN Manager uses Traefik as the ingress controller.

Ingress Controller

A component that implements ingress rules. The CDN Manager uses Traefik for primary ingress and NGinx for external Director communication.

K

Kafka

A distributed event streaming platform used by the Manager for asynchronous communication and event processing.

K3s

A lightweight Kubernetes distribution optimized for edge and production deployments. Used as the underlying cluster technology.

Kubernetes (K8s)

An open-source container orchestration platform. The CDN Manager runs on a K3s-based Kubernetes cluster.

L

Longhorn

A distributed block storage system for Kubernetes. Provides persistent volumes for stateful components like PostgreSQL and Kafka.

Liveness Probe

A Kubernetes health check that determines if a container is running properly. Failed liveness probes trigger container restart.

M

Manager

The central management component (ESB3027) for configuring and monitoring CDN Directors.

MaxMind

A provider of IP intelligence databases including GeoIP City, GeoLite2 ASN, and Anonymous IP databases used by the Manager.

MIB Frontend

The web-based configuration GUI for CDN operators. Provides a user interface for managing streams, routers, and other configuration.

Multi-Factor Authentication (MFA)

An authentication method requiring multiple forms of verification. Note: MFA is not currently supported in the CDN Manager and should be skipped during setup.

N

Name-based Virtual Hosting

A technique where multiple hostnames are served from the same IP address. Zitadel uses this for CORS validation.

Namespace

A Kubernetes abstraction for organizing cluster resources. The CDN Manager uses namespaces to group related components.

NGinx Gateway

An NGinx-based gateway that handles external communication with CDN Directors.

Node Token

A secret token used to authenticate new nodes joining a K3s cluster. Located at /var/lib/rancher/k3s/server/node-token on Server nodes.

O

Operator

A method of packaging, deploying, and managing a Kubernetes application. Cloudnative PG is an operator for PostgreSQL.

OOMKilled

A Kubernetes pod state indicating the container was terminated due to exceeding memory limits.

P

PDB (Pod Disruption Budget)

A Kubernetes feature that ensures a minimum number of pods remain available during voluntary disruptions like maintenance.

PersistentVolume (PV)

A piece of storage in the Kubernetes cluster. Created dynamically by Longhorn for stateful components.

PersistentVolumeClaim (PVC)

A request for storage by a pod. Bound to a PersistentVolume.

Pod

The smallest deployable unit in Kubernetes. Contains one or more containers.

PostgreSQL

An open-source relational database. Used by the Manager for persistent data storage, managed by Cloudnative PG.

Probe

A Kubernetes health check mechanism. Types include liveness, readiness, and startup probes.

Prometheus

An open-source monitoring and alerting toolkit. Telegraf exports metrics in Prometheus format.

R

RBAC (Role-Based Access Control)

A method of regulating access to resources based on user roles. Used by Kubernetes for authorization.

Readiness Probe

A Kubernetes health check that determines if a container is ready to receive traffic. Failed readiness probes remove the pod from service load balancing.

Redis

An in-memory data structure store used for caching and as the datastore backend for the Manager.

Replica

A copy of a pod. Multiple replicas provide high availability and load distribution.

Resource Preset

Predefined resource configurations (nano, micro, small, medium, large, xlarge, 2xlarge) for common deployment sizes.

Rolling Update

A deployment strategy that updates pods one at a time to maintain availability during upgrades.

S

Selection Input

A key-value storage mechanism used for configuration data that can be queried with wildcard patterns. Available in v1 and v2 APIs with different semantics.

Server Node

A Kubernetes node that participates in the control plane (etcd, API server). Can also run workloads unless tainted.

Service

A Kubernetes abstraction that defines a logical set of pods and a policy for accessing them. Provides stable networking endpoints.

ServiceAccount

A Kubernetes identity for processes running in pods. Used for authentication between Kubernetes components.

StatefulSet

A Kubernetes workload API object for managing stateful applications. Used for PostgreSQL and Kafka deployments.

Startup Probe

A Kubernetes health check that determines if a container application has started. Disables liveness and readiness checks until it succeeds.

Stream

A content stream configuration defining source and routing parameters.

T

Telegraf

An agent for collecting, processing, aggregating, and writing metrics. Runs on each node to gather system and application metrics.

TLS (Transport Layer Security)

A cryptographic protocol for secure communication. The CDN Manager uses TLS for all external HTTPS connections.

Topology Aware Hints

A Kubernetes feature that prefers routing traffic to pods in the same zone as the source. Reduces latency by keeping traffic local.

Traefik

A modern HTTP reverse proxy and ingress controller. Used as the primary ingress controller for the CDN Manager.

TTL (Time To Live)

The duration after which data expires. Used in the datastore and selection input APIs.

V

Values.yaml

The Helm chart configuration file. Contains all configurable parameters for the CDN Manager deployment.

VictoriaMetrics

A time-series database used for storing metrics data. Provides long-term storage and querying capabilities.

VXLAN

Virtual Extensible LAN. A network virtualization technology used by Flannel for pod networking.

Z

Zitadel

An identity and access management (IAM) platform used for authentication and authorization in the CDN Manager. Provides OAuth2/OIDC capabilities.

Default Credentials

The following table lists all default credentials used by the CDN Manager. Change these defaults before deploying to production.

| Service | Username | Password | Notes |
|---------|----------|----------|-------|
| Zitadel Console | admin@agiletv.dev | Password1! | Primary identity management; accessed at /ui/console |
| Grafana | admin | edgeware | Monitoring dashboards; accessed at /grafana |

Security Warning: These are default credentials only. For production deployments, you must change all default passwords before exposing the system to users.

Zitadel Default Account: Use the default admin@agiletv.dev account only to create a new administrator account with proper roles. After verifying the new account works, disable or delete the default admin account. For details on required roles and administrator permissions, see Zitadel’s Administrator Documentation. See the Next Steps Guide for initial configuration procedures.

Common Abbreviations

| Abbreviation | Meaning |
|--------------|---------|
| API | Application Programming Interface |
| ASN | Autonomous System Number |
| CORS | Cross-Origin Resource Sharing |
| CPU | Central Processing Unit |
| DNS | Domain Name System |
| EDB | EnterpriseDB |
| ESB | Edge Server Business |
| GUI | Graphical User Interface |
| HA | High Availability |
| Helm | Helm Package Manager |
| HPA | Horizontal Pod Autoscaler |
| HTTP | Hypertext Transfer Protocol |
| HTTPS | HTTP Secure |
| IAM | Identity and Access Management |
| IP | Internet Protocol |
| JSON | JavaScript Object Notation |
| K8s | Kubernetes |
| MFA | Multi-Factor Authentication |
| MIB | Management Information Base |
| NIC | Network Interface Card |
| OAuth | Open Authorization |
| OIDC | OpenID Connect |
| PVC | PersistentVolumeClaim |
| PV | PersistentVolume |
| RBAC | Role-Based Access Control |
| SSL | Secure Sockets Layer |
| TCP | Transmission Control Protocol |
| TLS | Transport Layer Security |
| TTL | Time To Live |
| UDP | User Datagram Protocol |
| UI | User Interface |
| VPA | Vertical Pod Autoscaler |
| VXLAN | Virtual Extensible LAN |

Next Steps

After reviewing terminology:

  1. Architecture Guide - Understand component relationships
  2. Configuration Guide - Full configuration reference
  3. Operations Guide - Day-to-day operational procedures