Architecture Guide
Overview
The AgileTV CDN Manager (ESB3027) is a cloud-native Kubernetes application designed for managing CDN operations. This guide provides a detailed description of the system architecture, component interactions, and scaling considerations.
High-Level Architecture
The CDN Manager follows a microservices architecture deployed on Kubernetes. The system is organized into logical layers:
```mermaid
graph LR
    Clients[API Clients] --> Ingress[Ingress Controller]
    Ingress --> Manager[Core Manager]
    Ingress --> Frontend[MIB Frontend]
    Ingress --> Grafana[Grafana]
    Manager --> Redis[(Redis)]
    Manager --> Kafka[(Kafka)]
    Manager --> PostgreSQL[(PostgreSQL)]
    Manager --> Zitadel[Zitadel IAM]
    Manager --> Confd[Configuration Service]
    Grafana --> VM[(VictoriaMetrics)]
    Confd -.-> Gateway[NGinx Gateway]
    Gateway --> Director[CDN Director]
```
Component Architecture
Ingress Layer
The ingress layer manages all incoming traffic to the cluster:
| Component | Role |
|---|---|
| Ingress Controller | Primary ingress for all cluster traffic; routes requests to internal services based on path |
| NGinx Gateway | Reverse proxy for routing traffic to external CDN Directors; used by MIB Frontend to communicate with remote Confd instances on CDN Director nodes |
Traffic flow:
- API clients and the Operator UI connect via the Ingress Controller at the `/api` and `/gui` paths respectively
- Grafana dashboards are accessed via the Ingress Controller at `/grafana`
- The Zitadel authentication console is accessed via the Ingress Controller at `/ui/console`
- MIB Frontend uses the NGinx Gateway when communicating with external Confd instances on CDN Director nodes
Application Services
The application layer contains the core CDN Manager services:
| Component | Role | Scaling |
|---|---|---|
| Core Manager | Main REST API server (v1/v2 endpoints); handles authentication, configuration, routing, and discovery | Horizontally scalable via HPA |
| MIB Frontend | Web-based configuration GUI for operators | Horizontally scalable via HPA |
| Confd | Configuration service for routing configuration; synchronizes with Core Manager application | Single instance |
| Grafana | Monitoring and visualization dashboards | Single instance |
| Selection Input Worker | Consumes selection input events from Kafka and updates configuration | Single instance |
| Metrics Aggregator | Collects and aggregates metrics from CDN components | Single instance |
| Telegraf | System-level metrics collection from cluster nodes | DaemonSet (one per node) |
| Alertmanager | Alert routing and notification management | Single instance |
Data Layer
The data layer provides persistent and ephemeral storage:
| Component | Role | Scaling |
|---|---|---|
| Redis | In-memory caching, session storage, and ephemeral state | Master + replicas (read-only) |
| Kafka | Event streaming for selection input and metrics; provides durable message queue | Controller cluster (odd count) |
| PostgreSQL | Persistent configuration and state storage | 3-node cluster with HA |
| VictoriaMetrics (Analytics) | Real-time and short-term metrics for operational dashboards | Single instance |
| VictoriaMetrics (Billing) | Long-term metrics retention (1+ years) for billing and license compliance | Single instance |
External Integrations
| Component | Role |
|---|---|
| Zitadel IAM | Identity and access management; provides OAuth2/OIDC authentication |
| CDN Director (ESB3024) | Edge routing infrastructure; receives configuration from Confd |
Detailed Component Descriptions
Core Manager
The Core Manager is the central application server that exposes the REST API. It is implemented in Rust using the Actix-web framework.
Key Responsibilities:
- Authentication and session management via Zitadel
- Configuration document storage and retrieval
- Selection input CRUD operations
- Routing rule evaluation and GeoIP lookups
- Service discovery for CDN Directors and edge servers
- Operator UI helper endpoints
API Endpoints:
- `/api/v1/auth/*` - Authentication (login, token, logout)
- `/api/v1/configuration` - Configuration management
- `/api/v1/selection_input/*` - Selection input operations
- `/api/v2/selection_input/*` - Enhanced selection input with list operations
- `/api/v1/routing/*` - Routing evaluation and validation
- `/api/v1/discovery/*` - Host and namespace discovery
- `/api/v1/metrics` - System metrics
- `/api/v1/health/*` - Liveness and readiness probes
- `/api/v1/operator_ui/*` - Operator helper endpoints
Runtime Modes: The Core Manager supports multiple runtime modes, each deployed as a separate container:
- `http-server` - Primary HTTP API server (default)
- `metrics-aggregator` - Background worker for metrics collection
- `selection-input` - Background worker for Kafka selection input consumption
MIB Frontend
The MIB Frontend provides a web-based GUI for configuration management.
Key Features:
- Intuitive web interface for CDN configuration
- Real-time configuration validation
- Integration with Zitadel for SSO authentication
- Uses NGinx Gateway for external Director communication
Confd (Configuration Service)
Confd provides routing configuration services and synchronizes with the Core Manager application.
Key Responsibilities:
- Hosts the service configuration for routing decisions
- Provides API and CLI for configuration management
- Synchronizes routing configuration with Core Manager
- Maintains configuration state in PostgreSQL
Selection Input Worker
The Selection Input Worker processes selection input events from the Kafka stream.
Key Responsibilities:
- Consumes messages from the `selection_input` Kafka topic
- Validates and transforms input data
- Updates configuration in the data store
- Maintains message ordering within partitions
Scaling Limitation: The Selection Input Worker cannot be scaled beyond a single consumer per Kafka partition, as message ordering must be preserved.
Metrics Aggregator
The Metrics Aggregator collects and processes metrics from CDN components.
Key Responsibilities:
- Polls metrics from Director instances
- Aggregates usage statistics
- Writes data to VictoriaMetrics (Analytics) for dashboards
- Writes long-term data to VictoriaMetrics (Billing) for compliance
Telegraf
Telegraf is deployed as a DaemonSet to collect host-level metrics.
Key Responsibilities:
- CPU, memory, disk, and network metrics from each node
- Container-level resource usage
- Kubernetes cluster metrics
- Forwards metrics to VictoriaMetrics
Grafana
Grafana provides visualization and dashboard capabilities.
Features:
- Pre-built dashboards for CDN monitoring
- Custom dashboard support
- VictoriaMetrics as data source
- Alerting integration with Alertmanager
Access: https://<host>/grafana
Alertmanager
Alertmanager handles alert routing and notifications.
Key Responsibilities:
- Receives alerts from Grafana and other sources
- Deduplicates and groups alerts
- Routes to notification channels (email, webhook, etc.)
- Manages alert silencing and inhibition
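As an illustration, a minimal Alertmanager configuration covering the routing, grouping, and inhibition behavior described above might look like the following sketch. The receiver name, webhook URL, and timing values are placeholders, not values from this deployment:

```yaml
route:
  receiver: default-webhook            # fallback receiver for all alerts
  group_by: ['alertname', 'severity']  # group related alerts together
  group_wait: 30s                      # delay before the first notification for a group
  repeat_interval: 4h                  # re-notify interval for unresolved alerts
receivers:
  - name: default-webhook
    webhook_configs:
      - url: http://example.internal/alert-hook  # placeholder endpoint
inhibit_rules:
  - source_matchers: ['severity="critical"']
    target_matchers: ['severity="warning"']
    equal: ['alertname']               # critical alerts suppress matching warnings
```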
Data Storage
Redis
Redis provides in-memory storage for:
- User sessions and authentication tokens
- Ephemeral configuration cache
- Real-time state synchronization
Deployment: Master + read replicas for high availability
Kafka
Kafka provides durable event streaming for:
- Selection input events
- Metrics data streams
- Inter-service communication
Deployment: Controller cluster with 3 replicas for production, 1 replica for lab deployments
Node Affinity: Kafka replicas must be scheduled on separate nodes to ensure high availability. The Helm chart configures pod anti-affinity rules to enforce this distribution.
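The anti-affinity rule described above can be expressed roughly as follows. The label selector is illustrative; the actual chart may use different labels:

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: kafka   # illustrative label
        topologyKey: kubernetes.io/hostname # at most one replica per node
```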
Topics:
- `selection_input` - Selection input events
- `metrics` - Metrics data streams
Note: For lab/single-node deployments, the Kafka replica count must be set to 1 in the Helm values. Production deployments require 3 replicas for fault tolerance.
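For example, a lab deployment might override the replica count in the Helm values. The exact key path is chart-specific and shown here as an assumption:

```yaml
kafka:
  replicaCount: 1   # lab/single-node only; production requires 3
```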
PostgreSQL
PostgreSQL provides persistent storage for:
- Configuration documents
- User and permission data
- System state
Deployment: 3-node cluster managed by the CloudNativePG (CNPG) operator
High Availability: The CNPG operator manages automatic failover and ensures high availability:
- One primary node handles read/write operations
- Two replica nodes provide redundancy and can be promoted to primary on failure
- Automatic failover occurs within seconds of primary node failure
- Synchronous replication ensures data consistency
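A CNPG `Cluster` resource implementing this topology could look like the sketch below; the cluster name and storage size are placeholders:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: cdn-manager-db        # placeholder name
spec:
  instances: 3                # one primary + two replicas
  minSyncReplicas: 1          # synchronous replication for data consistency
  maxSyncReplicas: 1
  storage:
    size: 10Gi                # placeholder size
```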
Note: The PostgreSQL cluster is deployed and managed automatically by the CNPG operator. Manual intervention is typically not required for normal operations.
VictoriaMetrics
Two VictoriaMetrics instances serve different purposes:
VictoriaMetrics (Analytics):
- Real-time and short-term metrics storage
- Supports Grafana dashboards
- Retention: Configurable (typically 30-90 days)
VictoriaMetrics (Billing):
- Long-term metrics retention
- Billing and license compliance data
- Retention: Minimum 1 year
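Retention in VictoriaMetrics is controlled by its `-retentionPeriod` flag; the two instances could be configured along these lines. The Helm value key paths are illustrative:

```yaml
victoria-metrics-analytics:
  extraArgs:
    retentionPeriod: "60d"   # short-term dashboard data (30-90 days typical)
victoria-metrics-billing:
  extraArgs:
    retentionPeriod: "1y"    # minimum 1 year for billing/compliance
```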
Authentication and Authorization
Zitadel Integration
Zitadel provides identity and access management:
Authentication Flow:
- User accesses MIB Frontend or API
- Redirected to Zitadel for authentication
- Zitadel validates credentials and issues session token
- Session token exchanged for access token
- Access token included in API requests (Bearer authentication)
Default Credentials: See the Glossary for default login credentials.
Access Paths:
- Zitadel Console: `/ui/console`
- API authentication: `/api/v1/auth/*`
CORS Configuration
Zitadel enforces Cross-Origin Resource Sharing (CORS) policies. The external hostname configured in Zitadel must match the first entry in `global.hosts.manager` in the Helm values.
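For example, if the manager is reached at `cdn-manager.example.com`, the Helm values and the Zitadel external hostname must agree. The hostname is a placeholder:

```yaml
global:
  hosts:
    manager:
      - cdn-manager.example.com   # first entry must match Zitadel's external hostname
```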
Network Architecture
Traffic Flow
```mermaid
graph TB
    External[External Clients] --> Ingress[Ingress Controller]
    External --> Redis[(Redis)]
    External --> Kafka[(Kafka)]
    External --> Telegraf[Telegraf]
    Ingress --> Manager[Core Manager]
    Ingress --> Frontend[MIB Frontend]
    Ingress --> Grafana[Grafana]
    Ingress --> Zitadel[Zitadel]
```
Note: Certain services (Redis, Kafka, Telegraf) can be accessed directly by external clients without traversing the ingress controller. This is typically used for metrics collection, event streaming, and direct data access scenarios.
Internal Communication
All internal services communicate over the Kubernetes overlay network (Flannel VXLAN). Services discover each other via Kubernetes DNS.
External Communication
- CDN Directors: Accessed via NGinx Gateway for simplified routing
- MaxMind GeoIP: Local database files (no external calls)
Scaling
Horizontal Pod Autoscaler (HPA)
The following components support automatic horizontal scaling via HPA:
| Component | Minimum | Maximum | Scale Metrics |
|---|---|---|---|
| Core Manager | 3 | 8 | CPU (50%), Memory (80%) |
| NGinx Gateway | 2 | 4 | CPU (75%), Memory (80%) |
| MIB Frontend | 2 | 4 | CPU (75%), Memory (90%) |
Note: HPA is enabled by default in the Helm chart. The default configuration is tuned for production deployments. Adjust min/max values based on expected load and available cluster capacity.
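The Core Manager row above corresponds to an `autoscaling/v2` HPA similar to this sketch; the resource names are assumptions:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: core-manager           # assumed name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: core-manager         # assumed target Deployment
  minReplicas: 3
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```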
Manual Scaling
Components can also be scaled manually by setting replica counts in the Helm values:
```yaml
manager:
  replicaCount: 3
mib-frontend:
  replicaCount: 2
```
Important: When manually setting replica counts, you must disable the Horizontal Pod Autoscaler (HPA) for the corresponding component. If HPA remains enabled, it will override manual replica settings. To disable HPA, set `autoscaling.hpa.enabled: false` for the component in your Helm values.
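Combining both settings in the Helm values might look like this; the key paths follow the pattern stated above and should be treated as illustrative:

```yaml
manager:
  replicaCount: 3
  autoscaling:
    hpa:
      enabled: false   # required so HPA does not override the manual count
```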
Components That Do Not Scale
The following components do not support horizontal scaling:
| Component | Reason |
|---|---|
| Confd | Single instance required for configuration consistency |
| PostgreSQL | CloudNativePG cluster; scaled by adding replicas via operator configuration |
| Kafka | Scaled by adding controllers, not via replica count |
| VictoriaMetrics | Stateful; single instance per role |
| Redis | Master is single; replicas are read-only |
| Grafana | Single instance sufficient for dashboard access |
| Alertmanager | Single instance for alert routing |
| Selection Input Worker | Kafka message ordering requires single consumer |
| Metrics Aggregator | Single instance for consistent metrics aggregation |
Node Scaling
Additional Agent nodes can be added to the cluster at any time to increase workload capacity. Kubernetes automatically schedules pods to nodes with available resources.
Cluster Balancing
The CDN Manager deployment includes the Kubernetes Descheduler to maintain balanced resource utilization across cluster nodes:
- Automatic Rebalancing: The descheduler periodically analyzes pod distribution and evicts pods from overutilized nodes
- Node Balance: Helps prevent resource hotspots by redistributing workloads across available nodes
- Integration with HPA: Works in conjunction with Horizontal Pod Autoscaler to optimize both pod count and placement
The descheduler runs as a background process and does not require manual intervention under normal operating conditions.
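A typical descheduler policy enabling the `LowNodeUtilization` balancing strategy looks roughly like this; the thresholds are illustrative, not the chart defaults:

```yaml
apiVersion: descheduler/v1alpha2
kind: DeschedulerPolicy
profiles:
  - name: default
    pluginConfig:
      - name: LowNodeUtilization
        args:
          thresholds:            # nodes below these values are underutilized
            cpu: 20
            memory: 20
          targetThresholds:      # nodes above these values are overutilized
            cpu: 60
            memory: 60
    plugins:
      balance:
        enabled:
          - LowNodeUtilization
```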
Resource Configuration
For detailed resource preset configurations and planning guidance, see the Configuration Guide.
High Availability
Server Node Redundancy
Production deployments require a minimum of 3 Server nodes:
- Survives loss of 1 server node
- Maintains quorum for etcd and Kafka
For enhanced availability, use 5 Server nodes:
- Survives loss of 2 server nodes
- Recommended for critical production environments
For large-scale deployments, 7 or more Server nodes can be used:
- Survives loss of 3+ server nodes
- Suitable for high-capacity production environments
Pod Distribution
Kubernetes automatically distributes pods across nodes to maximize availability:
- Pods from the same Deployment are scheduled on different nodes when possible
- Pod Disruption Budgets (PDB) ensure minimum availability during maintenance
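A PDB for the Core Manager might be sketched as follows; the name and label selector are assumptions:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: core-manager-pdb          # assumed name
spec:
  minAvailable: 2                 # keep at least two replicas during maintenance
  selector:
    matchLabels:
      app.kubernetes.io/name: core-manager   # illustrative label
```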
Data Replication
| Component | Replication Strategy |
|---|---|
| Redis | Single instance (backup via Longhorn snapshots) |
| Kafka | Replicated partitions (default: 3) |
| PostgreSQL | 3-node cluster via CloudNativePG |
| VictoriaMetrics | Single instance (backup via snapshots) |
| Longhorn | Single replica with pod-node affinity |
Longhorn Storage: Longhorn volumes are configured with a single replica by default. Pod scheduling is configured with node affinity to prefer scheduling pods on the same node as their persistent volume data. This approach optimizes I/O performance while maintaining data locality.
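In Longhorn terms, this corresponds to a StorageClass along these lines. The `numberOfReplicas` and `dataLocality` parameters are real Longhorn settings; the class name is a placeholder:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-single-replica   # placeholder name
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "1"           # single replica, as described above
  dataLocality: "best-effort"     # prefer keeping the replica on the pod's node
```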
Next Steps
After understanding the architecture:
- Installation Guide - Deploy the CDN Manager
- Configuration Guide - Configure components for your environment
- Operations Guide - Day-to-day operational procedures
- Performance Tuning Guide - Optimize system performance
- Metrics & Monitoring - Set up monitoring and alerting