Operations Guide

How to operate ESB3027 AgileTV CDN Manager

Overview

This guide details some of the common commands that will be necessary to operate the ESB3027 AgileTV CDN Manager software. Before starting, you will need at least a basic understanding of the following command line tooling.

Getting and Describing Kubernetes Resources

The two most common commands in Kubernetes are get and describe for a specific resource such as a Pod or Service. Using kubectl get typically lists all resources of a particular type; for example, kubectl get pods will display all pods in the current namespace. To obtain more detailed information about a specific resource, use kubectl describe <resource>, such as kubectl describe pod postgresql-0 to view details about that particular pod.
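
As a quick illustration, the following lists the Services in the current namespace and then lists the Pods with additional detail, such as the node each pod is running on:

kubectl get services
kubectl get pods -o wide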

When describing a pod, the output includes a recent Event history at the bottom. This can be extremely helpful for troubleshooting issues, such as why a pod failed to deploy or was restarted. However, keep in mind that this event history only reflects the most recent events from the past few hours, so it may not provide insights into problems that occurred days or weeks ago.
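
Recent events can also be listed directly, which is useful when it is not yet clear which pod is affected. For example, the following shows the events in the current namespace sorted by creation time:

kubectl get events --sort-by=.metadata.creationTimestamp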

Obtaining Logs

Each Pod maintains its own logs for each container. To fetch the logs of a specific pod, use kubectl logs <pod_name>. Adding the -f flag will stream the logs in follow mode, allowing real-time monitoring. If a pod contains multiple containers, by default, only the logs from the primary container are shown. To view logs from a different container within the same pod, use the -c <container_name> flag.
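
For example, to follow the logs of the gateway pod in real time, or to inspect the logs of the previous container instance after a restart, commands along the following lines can be used (the pod names are taken from the examples later in this guide and will differ between installations):

kubectl logs -f acd-manager-gateway-7594479477-z4bbr
kubectl logs --previous acd-manager-6c85ddd747-rdlg6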

Since each pod maintains its own logs, retrieving logs from all replicas of a Deployment or StatefulSet may be necessary to get a complete view. You can use label selectors to collect logs from all pods associated with the same application. For example, to fetch logs from all pods belonging to the “acd-manager” deployment, run:

kubectl logs -l app.kubernetes.io/name=acd-manager

To find the labels associated with a specific Deployment or ReplicaSet, describe the resource and look for the “Labels” field.
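
Alternatively, the labels can be listed directly alongside the pods, which avoids describing each resource individually:

kubectl get pods --show-labels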

The following table describes the common labels currently used by deployments in the cluster.

Component Labels

Label (key=value)                                Description
app.kubernetes.io/component=manager              Identifies the ACD Manager service
app.kubernetes.io/component=confd                Identifies the confd service
app.kubernetes.io/component=frontend             Identifies the GUI (frontend) service
app.kubernetes.io/component=gateway              Identifies the API gateway service
app.kubernetes.io/component=grafana              Identifies the Grafana monitoring service
app.kubernetes.io/component=metrics-aggregator   Identifies the metrics aggregator service
app.kubernetes.io/component=mib-frontend         Identifies the MIB frontend service
app.kubernetes.io/component=server               Identifies the Prometheus server component
app.kubernetes.io/component=selection-input      Identifies the selection input service
app.kubernetes.io/component=start                Identifies the Zitadel startup/init component
app.kubernetes.io/component=primary              Identifies the PostgreSQL primary node
app.kubernetes.io/component=controller-eligible  Identifies the Kafka controller-eligible node
app.kubernetes.io/component=alertmanager         Identifies the Prometheus Alertmanager
app.kubernetes.io/component=master               Identifies the Redis master node
app.kubernetes.io/component=replica              Identifies the Redis replica node

Instance, Name, and Part-of Labels

Label (key=value)                                Description
app.kubernetes.io/instance=acd-manager           Helm release instance name (acd-manager)
app.kubernetes.io/instance=acd-cluster           Helm release instance name (acd-cluster)
app.kubernetes.io/name=acd-manager               Resource name: acd-manager
app.kubernetes.io/name=confd                     Resource name: confd
app.kubernetes.io/name=grafana                   Resource name: grafana
app.kubernetes.io/name=mib-frontend              Resource name: mib-frontend
app.kubernetes.io/name=prometheus                Resource name: prometheus
app.kubernetes.io/name=telegraf                  Resource name: telegraf
app.kubernetes.io/name=zitadel                   Resource name: zitadel
app.kubernetes.io/name=postgresql                Resource name: postgresql
app.kubernetes.io/name=kafka                     Resource name: kafka
app.kubernetes.io/name=redis                     Resource name: redis
app.kubernetes.io/name=victoria-metrics-single   Resource name: victoria-metrics-single
app.kubernetes.io/part-of=prometheus             Part of the Prometheus stack
app.kubernetes.io/part-of=kafka                  Part of the Kafka stack
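
Label selectors can also be combined, with the comma acting as a logical AND. For example, assuming the gateway pods carry both of the labels listed above, the following would fetch logs only from the API gateway pods of the acd-manager release:

kubectl logs -l app.kubernetes.io/instance=acd-manager,app.kubernetes.io/component=gateway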

Restarting a Pod

Since Kubernetes maintains a fixed number of replicas for each Deployment or ReplicaSet, deleting a pod will cause Kubernetes to immediately recreate it, effectively restarting the pod. For example, to restart the pod acd-manager-6c85ddd747-5j5gt, run:

kubectl delete pod acd-manager-6c85ddd747-5j5gt

Kubernetes will automatically detach that pod from any associated Service, preventing new connections from reaching it. It then spawns a new instance, which goes through startup, liveness, and readiness probes. Once the new pod passes the readiness probes and is marked as ready, the Service will start forwarding new traffic to it.

If multiple replicas are running, traffic will be distributed among the existing pods while the new pod is initializing, ensuring a seamless, zero-downtime operation.
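
To confirm that the replacement pod starts up and becomes ready, the matching pods can be watched while the restart is in progress, for example:

kubectl get pods -l app.kubernetes.io/name=acd-manager -w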

Stopping and Starting a Deployment

Unlike traditional services, Kubernetes does not have a concept of stopping a service directly. Instead, you can temporarily scale a Deployment to zero replicas, which has the same effect.

For example, to stop the acd-manager Deployment, run:

kubectl scale deployment acd-manager --replicas=0

To restart it later, scale the deployment back to its original number of replicas, e.g.,

kubectl scale deployment acd-manager --replicas=1
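
Before scaling a deployment down, it can be helpful to note its current replica count so that the same value can be restored afterwards. One way to read it is:

kubectl get deployment acd-manager -o jsonpath='{.spec.replicas}'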

If you want to perform a simple restart of all pods within a deployment, you can delete all pods with a specific label, and Kubernetes will automatically recreate them. For example, to restart all pods with the component label “manager,” use:

kubectl delete pod -l app.kubernetes.io/component=manager

This command causes Kubernetes to delete all matching pods, which are then recreated, effectively restarting the service without changing the deployment configuration.
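
To verify that the deployment has fully recovered, the rollout status can be checked; this command waits until all replicas are ready again:

kubectl rollout status deployment/acd-manager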

Running commands inside a pod

Sometimes it is necessary to run a command inside an existing Pod, for example to open an interactive shell.

The command kubectl exec -it <podname> -- <command> does exactly that. For example, to run the confcli tool inside the confd pod acd-manager-confd-558f49ffb5-n8dmr, use the following command:

kubectl exec -it acd-manager-confd-558f49ffb5-n8dmr -- /usr/bin/python3.11 /usr/local/bin/confcli

Note: The confd container does not have a shell, so specifying the python interpreter is necessary on this image.
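
If a container image does include a shell, an interactive session can be opened directly. For example, assuming the PostgreSQL image ships bash (use sh otherwise), a shell can be obtained with:

kubectl exec -it acd-cluster-postgresql-0 -- bash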

Monitoring resource usage

Kubernetes includes an internal metrics API which can give some insight into the resource usage of the Pods and of the Nodes.

To list the current resource usage of the Pods in the current namespace, issue the following:

kubectl top pods

This will give output similar to the following:

NAME                                             CPU(cores)   MEMORY(bytes)
acd-cluster-postgresql-0                         3m           44Mi
acd-manager-6c85ddd747-rdlg6                     4m           15Mi
acd-manager-confd-558f49ffb5-n8dmr               1m           47Mi
acd-manager-gateway-7594479477-z4bbr             0m           10Mi
acd-manager-grafana-78c76d8c5-c2tl6              18m          144Mi
acd-manager-kafka-controller-0                   19m          763Mi
acd-manager-kafka-controller-1                   19m          967Mi
acd-manager-kafka-controller-2                   25m          1127Mi
acd-manager-metrics-aggregator-f6ff99654-tjbfs   4m           2Mi
acd-manager-mib-frontend-67678c69df-tkklr        1m           26Mi
acd-manager-prometheus-alertmanager-0            2m           25Mi
acd-manager-prometheus-server-768f5d5c-q78xb     5m           53Mi
acd-manager-redis-master-0                       12m          18Mi
acd-manager-redis-replicas-0                     15m          14Mi
acd-manager-selection-input-844599bc4d-x7dct     3m           3Mi
acd-manager-telegraf-585dfc5ff8-n8m5c            1m           27Mi
acd-manager-victoria-metrics-single-server-0     2m           10Mi
acd-manager-zitadel-69b6546f8f-v9lkp             1m           76Mi
acd-manager-zitadel-69b6546f8f-wwcmx             1m           72Mi
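
The list can also be sorted to quickly identify the heaviest consumers, for example by memory usage:

kubectl top pods --sort-by=memory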

Querying the metrics API for the nodes gives the aggregated totals for each node:

kubectl top nodes

Yields output similar to the following:

NAME                 CPU(cores)   CPU(%)   MEMORY(bytes)   MEMORY(%)
k3d-local-agent-0    118m         0%       1698Mi          21%
k3d-local-agent-1    120m         0%       661Mi           8%
k3d-local-agent-2    84m          0%       1054Mi          13%
k3d-local-server-0   115m         0%       1959Mi          25%
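
For a more detailed breakdown of a single node, including the resource requests and limits of the pods scheduled on it, describe the node:

kubectl describe node <node-name>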

Taking a node out of service

A node can be taken out of service temporarily for maintenance with minimal downtime, provided the other nodes in the cluster have enough spare capacity to host the pods from the target node.

Step 1: Cordon the node.
This prevents new pods from being scheduled on the node:

kubectl cordon <node-name>

Step 2: Drain the node.
This moves existing pods off the node, respecting DaemonSets and local data:

kubectl drain <node-name> --ignore-daemonsets --delete-local-data
  • The --ignore-daemonsets flag skips DaemonSet-managed pods, which are typically managed separately.
  • The --delete-local-data flag allows pods using local ephemeral (emptyDir) storage to be evicted; their local data is deleted. On newer kubectl versions this flag is named --delete-emptydir-data.

Once drained, the node is effectively out of service.

To bring the node back into service:
Uncordon the node with:

kubectl uncordon <node-name>

This allows Kubernetes to schedule new pods on the node again. Existing pods are not automatically moved back; restart or reschedule them manually if desired. Because the node now has more free capacity, the scheduler will tend to place newly created pods there, which gradually rebalances the load across the cluster.
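
At any point during this procedure, the scheduling state of the nodes can be verified; a cordoned node is reported with SchedulingDisabled in its status:

kubectl get nodes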

Backup and restore of persistent volumes

The Longhorn storage driver, which provides the persistent storage used in the cluster (see the Storage Guide for more details), offers built-in mechanisms for backing up, restoring, and snapshotting volumes. These operations can be performed entirely from within the Longhorn WebUI. See the relevant section of the Storage Guide for details on accessing that UI, since it requires setting up a port forward, which is described there.

See the relevant Longhorn Documentation for how to configure Longhorn and to manage Snapshotting and Backup and Restore.