Operations Guide
Overview
This guide covers day-to-day operational procedures for managing the AgileTV CDN Manager (ESB3027). Topics include routine maintenance, backup procedures, log management, and common operational tasks.
Prerequisites
Before performing operations, ensure you have:
- kubectl access to the cluster
- helm CLI installed
- Access to the node where values.yaml is stored
- Appropriate RBAC permissions for administrative tasks
Cluster Access
There are two supported methods for accessing the Kubernetes cluster:
- SSH to a Server Node (recommended for operations staff): SSH into any Server node and run kubectl commands directly
- Remote kubectl: install kubectl on your local machine and configure it to connect to the cluster remotely
Method 1: SSH to Server Node (Recommended)
The kubectl command-line tool is pre-configured on all Server nodes and can be used directly without additional setup:
# SSH to any Server node
ssh root@<server-ip>
# Run kubectl commands directly
kubectl get nodes
kubectl get pods
This method is recommended for day-to-day operations as it requires no local configuration and provides direct access to the cluster.
Method 2: Remote kubectl from Local Machine
To use kubectl from your local workstation or laptop:
Step 1: Install kubectl
Download and install kubectl for your operating system:
- Official Documentation: Install kubectl
- macOS (Homebrew): brew install kubectl
- Linux: Download from the official Kubernetes release page
- Windows: Download from the official Kubernetes release page
Step 2: Copy kubeconfig from Server Node
# Copy kubeconfig from any Server node (create ~/.kube first if it does not exist)
mkdir -p ~/.kube
scp root@<server-ip>:/etc/rancher/k3s/k3s.yaml ~/.kube/config
Step 3: Update kubeconfig
Edit the kubeconfig file to point to the correct server address:
# Replace localhost with the actual server IP
# macOS/Linux:
sed -i '' 's/127.0.0.1/<server-ip>/g' ~/.kube/config # macOS
sed -i 's/127.0.0.1/<server-ip>/g' ~/.kube/config # Linux
# Or manually edit ~/.kube/config and change:
# server: https://127.0.0.1:6443
# to:
# server: https://<server-ip>:6443
Step 4: Verify connectivity
kubectl get nodes
Managing Multiple Clusters
If you manage multiple Kubernetes clusters from the same machine, you can maintain multiple kubeconfig files:
# Set KUBECONFIG environment variable to include multiple config files
export KUBECONFIG=~/.kube/config-prod:~/.kube/config-lab
# View all contexts
kubectl config get-contexts
# Switch between clusters
kubectl config use-context <context-name>
# View current context
kubectl config current-context
For more information, see the official Kubernetes documentation: Organizing Cluster Access
Helm Commands
Helm releases are managed cluster-wide:
# List all releases
helm list
# View release history
helm history acd-manager
# Get deployed values
helm get values acd-manager -o yaml
# Get deployed manifest
helm get manifest acd-manager
Note: If using remote kubectl, ensure helm is installed on your local machine. See Helm Installation for instructions.
Backup Procedures
PostgreSQL Backup
PostgreSQL is managed by the Cloudnative PG operator, which provides continuous backup capabilities.
# Check backup status
kubectl get backup
# Create manual backup
kubectl apply -f - <<EOF
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: manual-backup-$(date +%Y%m%d-%H%M%S)
spec:
  cluster:
    name: acd-cluster-postgresql
EOF
# List available backups
kubectl get backup -o wide
# Restore from backup (requires downtime)
# See Upgrade Guide for restore procedures
Longhorn Volume Backups
Longhorn provides snapshot and backup capabilities for persistent volumes:
# List all Longhorn volumes (fully qualified to avoid ambiguity with other CRDs)
kubectl get volumes.longhorn.io -n longhorn-system
# Create snapshot via Longhorn UI
# Port-forward to Longhorn UI (do not expose via ingress)
kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80
# Access: http://localhost:8080
# WARNING: Longhorn UI grants access to sensitive storage information
# and should never be exposed through the ingress controller
Accessing Internal Services
For debugging and troubleshooting, you may need direct access to internal services.
PostgreSQL
PostgreSQL is managed by the Cloudnative PG operator. Connection details are stored in the acd-cluster-postgresql-app Secret:
# View connection details
kubectl describe secret acd-cluster-postgresql-app
# Extract individual fields
PG_HOST=$(kubectl get secret acd-cluster-postgresql-app -o jsonpath='{.data.host}' | base64 -d)
PG_USER=$(kubectl get secret acd-cluster-postgresql-app -o jsonpath='{.data.username}' | base64 -d)
PG_PASS=$(kubectl get secret acd-cluster-postgresql-app -o jsonpath='{.data.password}' | base64 -d)
PG_DB=$(kubectl get secret acd-cluster-postgresql-app -o jsonpath='{.data.dbname}' | base64 -d)
# Connect via psql
kubectl exec -it acd-cluster-postgresql-0 -- psql -U $PG_USER -d $PG_DB
Secret fields: The CNPG operator populates the following fields: username, password, host, port, dbname, uri, jdbc-uri, fqdn-uri, fqdn-jdbc-uri, pgpass.
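All Secret values are base64-encoded, which is why the extraction commands above pipe each field through base64 -d. A minimal standalone illustration of that decoding step (the sample value below is hypothetical, not taken from a real cluster):

```shell
# Kubernetes stores Secret data base64-encoded; `base64 -d` recovers the raw value.
# Hypothetical sample standing in for one field of the acd-cluster-postgresql-app Secret:
encoded="YWNkLWNsdXN0ZXItcG9zdGdyZXNxbC1ydw=="
decoded=$(printf '%s' "$encoded" | base64 -d)
echo "$decoded"   # prints the made-up value: acd-cluster-postgresql-rw
```

The same pattern applies to every field listed above; only the jsonpath expression changes.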
Redis
Redis runs on port 6379 with no authentication:
# Connect via redis-cli
kubectl exec -it acd-manager-redis-master-0 -- redis-cli
# Or connect from another pod
kubectl run redis-test --rm -it --image=redis -- redis-cli -h acd-manager-redis-master
Kafka
Kafka is accessible on port 9095 from any cluster node:
# Connect from within cluster
kubectl exec -it acd-manager-kafka-controller-0 -- kafka-topics.sh --bootstrap-server localhost:9092 --list
# Connect from external (via any node IP)
kafka-topics.sh --bootstrap-server <node-ip>:9095 --list
The selection_input topic is pre-configured for selection input events.
Longhorn Storage
Longhorn is a distributed block storage system for Kubernetes that provides persistent volumes for stateful applications such as PostgreSQL and Kafka.
Architecture
Longhorn deploys controller and replica engines on each node, forming a distributed storage system. When a volume is created, Longhorn replicates data across multiple nodes to ensure durability even in the event of node failures.
Storage Protocols:
- iSCSI: Used for standard Read-Write-Once (RWO) volumes
- NFS: Used for Read-Write-Many (RWX) volumes that can be mounted by multiple pods simultaneously
Configuration
The CDN Manager deploys Longhorn with a single replica configuration, which differs from the Longhorn default of 3 replicas. This configuration is optimized for the cluster architecture where:
- Pod-node affinity is configured to schedule pods on the same node as their persistent volume data
- This optimizes I/O performance by reducing network traffic
- Data locality is maintained while still providing volume portability
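The single-replica, data-local setup described above corresponds roughly to the following Longhorn chart values. This is a sketch only: the value names come from the upstream Longhorn Helm chart, not from the CDN Manager chart itself, so verify them against the deployed chart version before relying on them.

```yaml
# Sketch of Longhorn chart values matching the description above
# (names assumed from the upstream Longhorn Helm chart; verify before use)
defaultSettings:
  defaultReplicaCount: 1            # CDN Manager default, vs. Longhorn's default of 3
  defaultDataLocality: best-effort  # keep replica data on the same node as the pod
persistence:
  defaultClassReplicaCount: 1       # replica count for the default StorageClass
```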
Capacity Planning
Longhorn storage requires an additional 30% capacity headroom for internal operations and scaling. If less than 30% of the total partition capacity is available, Longhorn may mark volumes as “full” and prevent further writes.
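A quick way to sanity-check the 30% rule is a short shell calculation. The numbers below are illustrative only; in practice, take the totals from df on the Longhorn data partition:

```shell
# Illustrative check of Longhorn's 30% headroom rule.
# In practice, derive these values from `df` on the Longhorn data partition.
total_gib=500   # partition size (hypothetical)
free_gib=120    # currently free (hypothetical)

required_gib=$(( total_gib * 30 / 100 ))
if [ "$free_gib" -lt "$required_gib" ]; then
  echo "WARNING: ${free_gib}GiB free < ${required_gib}GiB required headroom"
fi
```

With these sample numbers the check warns, since 120GiB free is below the 150GiB (30% of 500GiB) threshold.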
For detailed storage requirements and disk partitioning guidance, see the System Requirements Guide.
Configuration Backup
Always backup your Helm values before making changes:
# Export currently deployed values
helm get values acd-manager -o yaml > ~/values-deployed-$(date +%Y%m%d).yaml
# Backup your custom values file (use a distinct name so the two backups
# do not overwrite each other)
cp ~/values.yaml ~/values-file-$(date +%Y%m%d).yaml
Backup Schedule Recommendations
| Component | Frequency | Retention |
|---|---|---|
| PostgreSQL | Daily | 30 days |
| Longhorn Snapshots | Before changes | 7 days |
| Configuration | Before each change | Indefinite |
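The daily PostgreSQL backup recommended in the table above can be automated with the CNPG operator's ScheduledBackup resource. The manifest below is a sketch: the schedule syntax follows the CloudNativePG documentation (six-field cron, seconds first), but verify the field names against your operator version before applying it with kubectl apply -f.

```yaml
# Sketch: daily CNPG backup at 02:00 (verify against your CNPG version)
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: daily-backup
spec:
  schedule: "0 0 2 * * *"   # six-field cron: seconds, minutes, hours, ...
  cluster:
    name: acd-cluster-postgresql
```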
Updating MaxMind GeoIP Databases
The MaxMind GeoIP databases (GeoIP2-City, GeoLite2-ASN, GeoIP2-Anonymous-IP) are used for GeoIP-based routing and validation features. These databases should be updated periodically to ensure accurate IP geolocation data.
Prerequisites
- Updated MaxMind database files (.mmdb format) obtained from MaxMind
- Access to the cluster via kubectl
- Helm CLI installed
Update Procedure
Step 1: Create New Volume with Updated Databases
Run the volume generation utility with a unique volume name that includes a revision identifier:
# Mount the installation ISO if not already mounted
mkdir -p /mnt/esb3027
mount -o loop,ro esb3027-acd-manager-X.Y.Z.iso /mnt/esb3027
# Generate new volume with updated databases
/mnt/esb3027/generate-maxmind-volume
When prompted:
- Provide the paths to the three database files:
  - GeoIP2-City.mmdb
  - GeoLite2-ASN.mmdb
  - GeoIP2-Anonymous-IP.mmdb
- Enter a unique volume name with a revision number or date, for example: maxmind-geoip-2026-04 or maxmind-geoip-v2
Tip: Using a revision-based naming convention simplifies rollback if needed.
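The date-based convention suggested above can be scripted when preparing an update. This is a naming sketch only; the volume name itself is your choice:

```shell
# Derive a revision-based volume name from the current year and month,
# matching the maxmind-geoip-YYYY-MM convention suggested above.
volume_name="maxmind-geoip-$(date +%Y-%m)"
echo "$volume_name"
```

Recording the generated name (for example in a change log) makes the Step 2 edit and any later rollback unambiguous.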
Step 2: Update Helm Configuration
Edit your values.yaml file to reference the new volume:
manager:
  maxmindDbVolume: maxmind-geoip-2026-04
Replace maxmind-geoip-2026-04 with the volume name you specified in Step 1.
Step 3: Apply Configuration Update
Upgrade the Helm release with the updated configuration:
helm upgrade acd-manager /mnt/esb3027/charts/acd-manager --values ~/values.yaml
Step 4: Rolling Restart (Optional)
To ensure all pods immediately use the new database files, perform a rolling restart of the manager deployment:
kubectl rollout restart deployment acd-manager
Monitor the rollout status:
kubectl rollout status deployment acd-manager
Step 5: Verify Update
Verify the pods are running with the new volume:
kubectl get pods
kubectl describe pod -l app.kubernetes.io/component=manager | grep -A 5 "Volumes"
Step 6: Clean Up Old Volume (Optional)
After verifying the new databases are working correctly, you can delete the old persistent volume:
# List persistent volumes to find the old one
kubectl get pv
# Delete the old volume
kubectl delete pv <old-volume-name>
Caution: Ensure the new volume is functioning correctly before deleting the old volume. Keep the old volume for at least 24-48 hours as a rollback option.
Rollback Procedure
If issues occur after updating the databases:
- Revert the maxmindDbVolume value in your values.yaml to the previous volume name
- Run helm upgrade with the reverted configuration
- Optionally restart the deployment: kubectl rollout restart deployment acd-manager
Update Frequency Recommendations
| Database | Recommended Update Frequency |
|---|---|
| GeoIP2-City | Weekly or monthly |
| GeoLite2-ASN | Monthly |
| GeoIP2-Anonymous-IP | Weekly or monthly |
MaxMind releases database updates on a regular schedule. Subscribe to MaxMind notifications to stay informed of new releases.
Log Management
Application Logs
# View manager logs
kubectl logs -l app.kubernetes.io/component=manager
# Follow logs in real-time
kubectl logs -l app.kubernetes.io/component=manager -f
# View logs from specific pod
kubectl logs <pod-name>
# View previous instance logs (after crash)
kubectl logs <pod-name> -p
# View logs with timestamps
kubectl logs <pod-name> --timestamps
# View logs from all containers in pod
kubectl logs <pod-name> --all-containers
Component-Specific Logs
# Zitadel logs
kubectl logs -l app.kubernetes.io/name=zitadel
# Gateway logs
kubectl logs -l app.kubernetes.io/component=gateway
# Confd logs
kubectl logs -l app.kubernetes.io/component=confd
# MIB Frontend logs
kubectl logs -l app.kubernetes.io/component=mib-frontend
# PostgreSQL logs
kubectl logs -l app.kubernetes.io/name=postgresql
# Kafka logs
kubectl logs -l app.kubernetes.io/name=kafka
# Redis logs
kubectl logs -l app.kubernetes.io/name=redis
Log Aggregation
Logs are collected by Telegraf and sent to VictoriaMetrics:
# Access Grafana for log visualization
# https://<manager-host>/grafana
# Query logs via Grafana Explore
# Select VictoriaMetrics datasource and use log queries
Log Rotation
Container logs are automatically rotated by Kubernetes:
- Default max size: 10MB per container
- Default max files: 5 rotated files
- Total per container: ~50MB maximum (10MB × 5 files)
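On k3s these limits can be tuned via kubelet arguments. The fragment below is a sketch assuming a k3s config file at /etc/rancher/k3s/config.yaml; the flag names follow the kubelet CLI and should be verified against your k3s/kubelet version:

```yaml
# /etc/rancher/k3s/config.yaml (sketch; verify flag names for your version)
kubelet-arg:
  - "container-log-max-size=10Mi"
  - "container-log-max-files=5"
```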
Scaling Operations
Manual Scaling
Note: If HPA (Horizontal Pod Autoscaler) is enabled for a deployment, manual scaling changes will be overridden by the HPA. To manually scale, you must first disable the HPA.
# Check if HPA is enabled
kubectl get hpa
# Disable HPA before manual scaling by deleting it
# (patching minReplicas/maxReplicas to null is rejected,
# because maxReplicas is a required field)
kubectl delete hpa acd-manager
# Scale manager replicas
kubectl scale deployment acd-manager --replicas=3
# Scale gateway replicas
kubectl scale deployment acd-manager-gateway --replicas=2
# Scale MIB frontend replicas
kubectl scale deployment acd-manager-mib-frontend --replicas=2
HPA Configuration
# View HPA status
kubectl get hpa
# Describe HPA details
kubectl describe hpa acd-manager
# Edit HPA configuration
kubectl edit hpa acd-manager
Configuration Updates
Updating Helm Values
# Edit values file
vi ~/values.yaml
# Validate with dry-run
helm upgrade acd-manager /mnt/esb3027/charts/acd-manager \
--values ~/values.yaml \
--dry-run
# Apply changes
helm upgrade acd-manager /mnt/esb3027/charts/acd-manager \
--values ~/values.yaml
# Verify rollout
kubectl rollout status deployment/acd-manager
Rolling Back Changes
# View revision history
helm history acd-manager
# Rollback to previous revision
helm rollback acd-manager
# Rollback to specific revision
helm rollback acd-manager <revision>
# Verify rollback
helm history acd-manager
Certificate Management
Checking Certificate Expiration
# Check TLS secret expiration
kubectl get secret acd-manager-tls -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates
# Check via Grafana dashboard
# Certificate expiration metrics are available in Grafana
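The expiration dates printed above can be turned into a simple pass/fail check with openssl x509 -checkend. The sketch below generates a throwaway self-signed certificate purely for illustration; in production, point the check at the certificate extracted from the acd-manager-tls Secret instead:

```shell
# Generate a throwaway 90-day self-signed cert for illustration only.
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/demo-key.pem \
  -out /tmp/demo-cert.pem -days 90 -subj "/CN=demo" 2>/dev/null

# -checkend exits 0 if the cert is still valid N seconds from now;
# here N is 30 days, a reasonable renewal threshold.
if openssl x509 -checkend $((30 * 86400)) -noout -in /tmp/demo-cert.pem; then
  echo "certificate valid for at least 30 more days"
else
  echo "WARNING: certificate expires within 30 days"
fi
```

Because -checkend signals via exit status, the same check drops cleanly into cron jobs or monitoring scripts.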
Renewing Certificates
# For Helm-managed self-signed certificates
helm upgrade acd-manager /mnt/esb3027/charts/acd-manager \
--values ~/values.yaml \
--set ingress.selfSigned=true
# For manual certificates, update the secret
kubectl create secret tls acd-manager-tls \
--cert=new-tls.crt \
--key=new-tls.key \
--dry-run=client -o yaml | kubectl apply -f -
# Restart pods to pick up new certificate
kubectl rollout restart deployment acd-manager
Health Checks
Component Health
# Check all pods
kubectl get pods
# Check specific component
kubectl get pods -l app.kubernetes.io/component=manager
# Check persistent volumes
kubectl get pvc
# Check cluster status
kubectl get nodes
# Check ingress
kubectl get ingress
API Health Endpoints
# Liveness check
curl -k https://<manager-host>/api/v1/health/alive
# Readiness check
curl -k https://<manager-host>/api/v1/health/ready
Database Health
# PostgreSQL cluster status
kubectl get clusters -n default
# Check PostgreSQL pods
kubectl get pods -l app.kubernetes.io/name=postgresql
# Kafka cluster status
kubectl get pods -l app.kubernetes.io/name=kafka
# Redis status
kubectl get pods -l app.kubernetes.io/name=redis
Maintenance Windows
Planned Maintenance
Before performing maintenance:
- Notify users of potential service impact
- Verify backups are current
- Document the maintenance procedure
- Prepare rollback plan
Node Maintenance
# Cordon node to prevent new pods
kubectl cordon <node-name>
# Drain node (evict pods)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# Perform maintenance
# Uncordon node
kubectl uncordon <node-name>
Cluster Upgrades
See the Upgrade Guide for cluster upgrade procedures.
Troubleshooting Quick Reference
Common Commands
# Describe problematic pod
kubectl describe pod <pod-name>
# View pod events
kubectl get events --sort-by='.lastTimestamp'
# Check resource usage
kubectl top pods
kubectl top nodes
# Exec into container
kubectl exec -it <pod-name> -- /bin/sh
# Check network policies
kubectl get networkpolicies
# Check service endpoints
kubectl get endpoints
Restarting Components
# Restart deployment
kubectl rollout restart deployment/<deployment-name>
# Restart statefulset
kubectl rollout restart statefulset/<statefulset-name>
# Delete pod (auto-recreated)
kubectl delete pod <pod-name>
Security Operations
Rotating Service Account Tokens
# Delete service account secret (auto-regenerated)
kubectl delete secret <service-account-token-secret>
# Tokens are automatically regenerated
Updating RBAC Permissions
# View current roles
kubectl get roles
kubectl get clusterroles
# View role bindings
kubectl get rolebindings
kubectl get clusterrolebindings
# Edit role
kubectl edit role <role-name>
Audit Log Access
# K3s audit logs location
/var/lib/rancher/k3s/server/logs/audit.log
# View recent audit events
tail -f /var/lib/rancher/k3s/server/logs/audit.log
Disaster Recovery
Pod Recovery
Pods are automatically recreated if they fail:
# Check pod status
kubectl get pods
# If pod is stuck in Terminating
kubectl delete pod <pod-name> --force --grace-period=0
# If pod is stuck in Pending, check resources
kubectl describe pod <pod-name>
kubectl get events --sort-by='.lastTimestamp'
Node Failure Recovery
When a node fails:
- Automatic: Pods are rescheduled on healthy nodes (after timeout)
- Manual: Force delete stuck pods
# Force delete pods on failed node
kubectl delete pod --all --force --grace-period=0 \
--field-selector spec.nodeName=<failed-node>
Data Recovery
For data recovery scenarios, refer to:
- PostgreSQL: Cloudnative PG backup/restore procedures
- Longhorn: Volume snapshot restoration
- Kafka: Partition replication handles node failures
Routine Maintenance Checklist
Daily
- Review Grafana dashboards for anomalies
- Check alert notifications
- Verify backup completion
Weekly
- Review pod restart counts
- Check certificate expiration dates
- Review log storage usage
- Verify HPA is functioning correctly
Monthly
- Test backup restoration procedure
- Review and rotate credentials if needed
- Update documentation if configuration changed
- Review resource utilization trends
Next Steps
After mastering operations:
- Troubleshooting Guide - Deep dive into problem resolution
- Performance Tuning Guide - Optimize system performance
- Metrics & Monitoring Guide - Comprehensive monitoring setup
- API Guide - REST API reference and automation