Operations Guide

How to operate ESB3027 AgileTV CDN Manager

Overview

This guide details some of the common commands that will be necessary to operate the ESB3027 AgileTV CDN Manager software. Before starting, you will need at least a basic understanding of the following command line tooling.

Getting and Describing Kubernetes Resources

The two most common commands in Kubernetes are get and describe for a specific resource such as a Pod or Service. Using kubectl get typically lists all resources of a particular type; for example, kubectl get pods will display all pods in the current namespace. To obtain more detailed information about a specific resource, use kubectl describe <resource>, such as kubectl describe pod postgresql-0 to view details about that particular pod.
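
As a quick illustration, the following lists the Services in the current namespace and then lists the Pods with additional detail, such as the node each pod is running on:

kubectl get services
kubectl get pods -o wide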

When describing a pod, the output includes a recent Event history at the bottom. This can be extremely helpful for troubleshooting issues, such as why a pod failed to deploy or was restarted. However, keep in mind that this event history only reflects the most recent events from the past few hours, so it may not provide insights into problems that occurred days or weeks ago.
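
Recent events can also be listed directly, which is useful when it is not yet clear which pod is affected. For example, the following shows the events in the current namespace sorted by creation time:

kubectl get events --sort-by=.metadata.creationTimestamp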

Obtaining Logs

Each Pod maintains its own logs for each container. To fetch the logs of a specific pod, use kubectl logs <pod_name>. Adding the -f flag will stream the logs in follow mode, allowing real-time monitoring. If a pod contains multiple containers, by default, only the logs from the primary container are shown. To view logs from a different container within the same pod, use the -c <container_name> flag.
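
For example, to follow the logs of the gateway pod in real time, or to inspect the logs of the previous container instance after a restart, commands along the following lines can be used (the pod names are taken from the examples later in this guide and will differ between installations):

kubectl logs -f acd-manager-gateway-7594479477-z4bbr
kubectl logs --previous acd-manager-6c85ddd747-rdlg6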

Since each pod maintains its own logs, retrieving logs from all replicas of a Deployment or StatefulSet may be necessary to get a complete view. You can use label selectors to collect logs from all pods associated with the same application. For example, to fetch logs from all pods belonging to the “acd-manager” deployment, run:

kubectl logs -l app.kubernetes.io/name=acd-manager

To find the labels associated with a specific Deployment or ReplicaSet, describe the resource and look for the “Labels” field.
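
Alternatively, the labels can be listed directly alongside the pods, which avoids describing each resource individually:

kubectl get pods --show-labels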

The following table describes the common labels currently used by deployments in the cluster.

Component Labels

Label (key=value)                                Description
app.kubernetes.io/component=manager              Identifies the ACD Manager service
app.kubernetes.io/component=confd                Identifies the confd service
app.kubernetes.io/component=frontend             Identifies the GUI (frontend) service
app.kubernetes.io/component=gateway              Identifies the API gateway service
app.kubernetes.io/component=grafana              Identifies the Grafana monitoring service
app.kubernetes.io/component=metrics-aggregator   Identifies the metrics aggregator service
app.kubernetes.io/component=mib-frontend         Identifies the MIB frontend service
app.kubernetes.io/component=server               Identifies the Prometheus server component
app.kubernetes.io/component=selection-input      Identifies the selection input service
app.kubernetes.io/component=start                Identifies the Zitadel startup/init component
app.kubernetes.io/component=primary              Identifies the PostgreSQL primary node
app.kubernetes.io/component=controller-eligible  Identifies the Kafka controller-eligible node
app.kubernetes.io/component=alertmanager         Identifies the Prometheus Alertmanager
app.kubernetes.io/component=master               Identifies the Redis master node
app.kubernetes.io/component=replica              Identifies the Redis replica node

Instance, Name, and Part-of Labels

Label (key=value)                                Description
app.kubernetes.io/instance=acd-manager           Helm release instance name (acd-manager)
app.kubernetes.io/instance=acd-cluster           Helm release instance name (acd-cluster)
app.kubernetes.io/name=acd-manager               Resource name: acd-manager
app.kubernetes.io/name=confd                     Resource name: confd
app.kubernetes.io/name=grafana                   Resource name: grafana
app.kubernetes.io/name=mib-frontend              Resource name: mib-frontend
app.kubernetes.io/name=prometheus                Resource name: prometheus
app.kubernetes.io/name=telegraf                  Resource name: telegraf
app.kubernetes.io/name=zitadel                   Resource name: zitadel
app.kubernetes.io/name=postgresql                Resource name: postgresql
app.kubernetes.io/name=kafka                     Resource name: kafka
app.kubernetes.io/name=redis                     Resource name: redis
app.kubernetes.io/name=victoria-metrics-single   Resource name: victoria-metrics-single
app.kubernetes.io/part-of=prometheus             Part of the Prometheus stack
app.kubernetes.io/part-of=kafka                  Part of the Kafka stack
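
Label selectors can also be combined, with the comma acting as a logical AND. For example, assuming the gateway pods carry both of the labels listed above, the following would fetch logs only from the API gateway pods of the acd-manager release:

kubectl logs -l app.kubernetes.io/instance=acd-manager,app.kubernetes.io/component=gateway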

Restarting a Pod

Since Kubernetes maintains a fixed number of replicas for each Deployment or ReplicaSet, deleting a pod will cause Kubernetes to immediately recreate it, effectively restarting the pod. For example, to restart the pod acd-manager-6c85ddd747-5j5gt, run:

kubectl delete pod acd-manager-6c85ddd747-5j5gt

Kubernetes will automatically detach that pod from any associated Service, preventing new connections from reaching it. It then spawns a new instance, which goes through startup, liveness, and readiness probes. Once the new pod passes the readiness probes and is marked as ready, the Service will start forwarding new traffic to it.

If multiple replicas are running, traffic will be distributed among the existing pods while the new pod is initializing, ensuring a seamless, zero-downtime operation.
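
To confirm that the replacement pod starts up and becomes ready, the matching pods can be watched while the restart is in progress, for example:

kubectl get pods -l app.kubernetes.io/name=acd-manager -w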

Stopping and Starting a Deployment

Unlike traditional services, Kubernetes does not have a concept of stopping a service directly. Instead, you can temporarily scale a Deployment to zero replicas, which has the same effect.

For example, to stop the acd-manager Deployment, run:

kubectl scale deployment acd-manager --replicas=0

To restart it later, scale the deployment back to its original number of replicas, e.g.,

kubectl scale deployment acd-manager --replicas=1
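
Before scaling a deployment down, it can be helpful to note its current replica count so that the same value can be restored afterwards. One way to read it is:

kubectl get deployment acd-manager -o jsonpath='{.spec.replicas}'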

If you want to perform a simple restart of all pods within a deployment, you can delete all pods with a specific label, and Kubernetes will automatically recreate them. For example, to restart all pods with the component label “manager,” use:

kubectl delete pod -l app.kubernetes.io/component=manager

This command causes Kubernetes to delete all matching pods, which are then recreated, effectively restarting the service without changing the deployment configuration.
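
To verify that the deployment has fully recovered, the rollout status can be checked; this command waits until all replicas are ready again:

kubectl rollout status deployment/acd-manager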

Running commands inside a pod

Sometimes it is necessary to run a command inside an existing Pod, for example to open an interactive shell.

The command kubectl exec -it <podname> -- <command> does exactly that. For example, to run the confcli tool inside the confd pod acd-manager-confd-558f49ffb5-n8dmr, use the following command:

kubectl exec -it acd-manager-confd-558f49ffb5-n8dmr -- /usr/bin/python3.11 /usr/local/bin/confcli

Note: The confd container does not have a shell, so specifying the python interpreter is necessary on this image.
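
If a container image does include a shell, an interactive session can be opened directly. For example, assuming the PostgreSQL image ships bash (use sh otherwise), a shell can be obtained with:

kubectl exec -it acd-cluster-postgresql-0 -- bash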

Monitoring resource usage

Kubernetes includes an internal metrics API which can give some insight into the resource usage of the Pods and of the Nodes.

To list the current resource usage of the Pods in the current namespace, issue the following:

kubectl top pods

This will give output similar to the following:

NAME                                             CPU(cores)   MEMORY(bytes)
acd-cluster-postgresql-0                         3m           44Mi
acd-manager-6c85ddd747-rdlg6                     4m           15Mi
acd-manager-confd-558f49ffb5-n8dmr               1m           47Mi
acd-manager-gateway-7594479477-z4bbr             0m           10Mi
acd-manager-grafana-78c76d8c5-c2tl6              18m          144Mi
acd-manager-kafka-controller-0                   19m          763Mi
acd-manager-kafka-controller-1                   19m          967Mi
acd-manager-kafka-controller-2                   25m          1127Mi
acd-manager-metrics-aggregator-f6ff99654-tjbfs   4m           2Mi
acd-manager-mib-frontend-67678c69df-tkklr        1m           26Mi
acd-manager-prometheus-alertmanager-0            2m           25Mi
acd-manager-prometheus-server-768f5d5c-q78xb     5m           53Mi
acd-manager-redis-master-0                       12m          18Mi
acd-manager-redis-replicas-0                     15m          14Mi
acd-manager-selection-input-844599bc4d-x7dct     3m           3Mi
acd-manager-telegraf-585dfc5ff8-n8m5c            1m           27Mi
acd-manager-victoria-metrics-single-server-0     2m           10Mi
acd-manager-zitadel-69b6546f8f-v9lkp             1m           76Mi
acd-manager-zitadel-69b6546f8f-wwcmx             1m           72Mi
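
The list can also be sorted to quickly identify the heaviest consumers, for example by memory usage:

kubectl top pods --sort-by=memory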

Querying the metrics API for the nodes gives the aggregated totals for each node:

kubectl top nodes

Yields output similar to the following:

NAME                 CPU(cores)   CPU(%)   MEMORY(bytes)   MEMORY(%)
k3d-local-agent-0    118m         0%       1698Mi          21%
k3d-local-agent-1    120m         0%       661Mi           8%
k3d-local-agent-2    84m          0%       1054Mi          13%
k3d-local-server-0   115m         0%       1959Mi          25%
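
For a more detailed breakdown of a single node, including the resource requests and limits of the pods scheduled on it, describe the node:

kubectl describe node <node-name>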

Taking a node out of service

A node can be taken out of service temporarily for maintenance with minimal downtime, provided the other nodes in the cluster have enough spare capacity to host the pods from the target node.

Step 1: Cordon the node.
This prevents new pods from being scheduled on the node:

kubectl cordon <node-name>

Step 2: Drain the node.
This moves existing pods off the node, respecting DaemonSets and local data:

kubectl drain <node-name> --ignore-daemonsets --delete-local-data
  • The --ignore-daemonsets flag skips DaemonSet-managed pods, which are typically managed separately.
  • The --delete-local-data flag allows pods using local ephemeral (emptyDir) storage to be evicted; their local data is deleted. On newer kubectl versions this flag is named --delete-emptydir-data.

Once drained, the node is effectively out of service.

To bring the node back into service:
Uncordon the node with:

kubectl uncordon <node-name>

This allows Kubernetes to schedule new pods on the node again. Existing pods are not automatically moved back; restart or reschedule them manually if desired. Because the node now has more free capacity, the scheduler will tend to place newly created pods there, which gradually rebalances the load across the cluster.
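
At any point during this procedure, the scheduling state of the nodes can be verified; a cordoned node is reported with SchedulingDisabled in its status:

kubectl get nodes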

Backup and restore of persistent volumes

The Longhorn storage driver, which provides the persistent storage used in the cluster (see the Storage Guide for more details), offers built-in mechanisms for backing up, restoring, and snapshotting volumes. These operations can be performed entirely from within the Longhorn WebUI. See the relevant section of the Storage Guide for details on accessing that UI, since it requires setting up a port forward, which is described there.

See the relevant Longhorn Documentation for how to configure Longhorn and to manage Snapshotting and Backup and Restore.