Solutions
1 - Cache hardware metrics: monitoring and routing
Observability and monitoring are important in a CDN. To facilitate this, we use the open source application Telegraf as an agent on the caches to collect metrics, such as CPU usage, memory usage, and network interface statistics.
These metrics can be exported to two services installed alongside the Director:
acd-metrics-aggregator
: aggregates cache metrics from the CDN and exports them to the Director’s selection input API.
acd-telegraf-metrics-database
: stores a smaller amount of time-series cache metrics produced by Telegraf metrics agents and allows the metrics to be scraped by a Prometheus instance, enabling alarms and visualization of the metrics in Grafana.
Telegraf Metrics Agent
All deployments of ESB2001 Orbit TV Server (from version 3.6.2) and ESB3004 SW Streamer (from version 1.36.2) come bundled with an instance of Telegraf running as a metrics agent. If you are using other caches in your CDN, it is possible to install and configure Telegraf manually, if licensing allows and the caches are accessible.
Configuring Telegraf is done using the confcli
tool. Before configuration,
ensure that the service telegraf-configd
is active by running
systemctl status telegraf-configd
systemctl restart telegraf-configd # if the service is not active
If telegraf-configd
fails to start, check the log file /var/log/telegraf-configd.log
for errors.
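For a quick look at recent errors, the log can also be inspected directly:
tail -n 50 /var/log/telegraf-configd.log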
Configuration
Once the configuration daemon is running, Telegraf can be configured using
confcli
:
$ confcli integration.acd.telegraf
{
"telegraf": {
"aggregators": [],
"databases": [],
"enable": true,
"enable_tls_verification": true,
"hostname": "",
"interfaces": [],
"pushInterval": 5,
"secrets": {
"enable": false,
"key": "metrics_auth_token",
"secretStoreID": "telegraf_configd_secretstore"
}
}
}
The configuration fields are:
aggregators
: a list of URLs to acd-metrics-aggregator instances to which the metrics will be exported.
databases
: a list of URLs to acd-telegraf-metrics-database instances to which the metrics will be exported.
enable
: whether the Telegraf instance is enabled or not.
enable_tls_verification
: whether to verify the TLS certificate of the target when exporting metrics.
hostname
: the hostname of the cache. This is used as a tag in the metrics. If not set, the hostname of the machine is used.
interfaces
: a list of network interfaces to collect metrics from. Supports wildcards, e.g. eths*.
pushInterval
: the interval in seconds for pushing metrics to targets.
secrets
: configuration for using a secret token to authenticate with the targets.
secrets.enable
: sets the header Authorization: Token secret_value when exporting metrics to the targets.
secrets.key
: the key/name of the secret. Defaults to metrics_auth_token.
secrets.secretStoreID
: the ID of the secret store to use. Defaults to telegraf_configd_secretstore.
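As a minimal sketch of how individual fields are changed, the confcli set syntax shown later in this guide applies to scalar fields. The values below are examples only; list-valued fields such as aggregators and interfaces are shown in the full example at the end of this page.
confcli integration.acd.telegraf.pushInterval 10   # push metrics every 10 seconds (example value)
confcli integration.acd.telegraf.hostname cache-1  # example hostname tag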
Targets
acd-metrics-aggregator
acd-metrics-aggregator
is a service that aggregates cache metrics in the CDN and
exports them to the Director’s selection input API.
The service is installed alongside the Director.
The service is started by default after installation. You can check the status of the service by running
systemctl status acd-metrics-aggregator
systemctl restart acd-metrics-aggregator # if the service is not active
The logs are accessed by running
journalctl -u acd-metrics-aggregator
The configuration file for the aggregator is located at
/opt/edgeware/acd/metrics/aggregator-conf.json
. The default configuration
created during installation will export metrics to the local Director’s
selection input API:
{
"tls_cert": "",
"tls_key": "",
"metrics_listen_port": 8087,
"interval": 5,
"log_level": "INFO",
"targets": [
{
"url": "https://127.0.0.1:5001/v1/timestamped_selection_input",
"http_headers": {
"x-api-key": "8a3094e875b841d480482cfd82e3e313"
}
}
],
"secrets": {
"enable": false,
"key": "metrics_auth_token"
}
}
The configuration fields are:
tls_cert
: the path to the TLS certificate file if acd-metrics-aggregator should use TLS.
tls_key
: the path to the TLS key file if acd-metrics-aggregator should use TLS.
  - Note that the service is run as a container with the directory /opt/edgeware/acd/ssl mounted as a volume with the same path in the container. The certificate and key files should be placed in this directory.
metrics_listen_port
: the port on which the service listens for metrics.
interval
: the interval in seconds for pushing metrics to the targets.
log_level
: the log level for the aggregator. Can be DEBUG, INFO or WARNING.
targets
: a list of targets to which the metrics will be exported.
  url
  : the URL of the target.
  http_headers
  : HTTP headers to send with the request.
  x-api-key
  : the API key for the Director’s REST API. Populated by default.
secrets
: configuration for using a secret token to authenticate incoming metrics requests.
  enable
  : requires the header Authorization: Token secret_value when receiving metrics.
  key
  : the key/name of the secret. Defaults to metrics_auth_token.
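After editing the configuration file, a quick syntax check followed by a restart is usually enough to apply the changes. This is a sketch; python3 -m json.tool is just one convenient way to validate the JSON.
python3 -m json.tool /opt/edgeware/acd/metrics/aggregator-conf.json > /dev/null && echo "JSON OK"
systemctl restart acd-metrics-aggregator
journalctl -u acd-metrics-aggregator --since "5 minutes ago"   # check for errors after the restart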
acd-telegraf-metrics-database
acd-telegraf-metrics-database
is a service that stores cache metrics in a time
series database to be scraped by a Prometheus instance, allowing for alarms and
visualization of the metrics in Grafana. The service is installed alongside the
Director.
This service runs Telegraf in a container, receiving metrics from the Telegraf metrics agents. Note that this service is not a full-fledged time series database, like InfluxDB, but a smaller database that stores a limited amount of data. The service acts as a middleman between the Telegraf metrics agents and the scraping Prometheus instance.
The service is started by default after installation. You can check the status of the service by running
systemctl status acd-telegraf-metrics-database
systemctl restart acd-telegraf-metrics-database # if the service is not active
The logs are accessed by running
journalctl -u acd-telegraf-metrics-database
The configuration file for the database is located at
/opt/edgeware/acd/metrics/telegraf-metrics-database.conf
:
# Global tags can be specified here in key="value" format.
[global_tags]
# Example configuration for aggregator
[agent]
interval = "5s" # Data collection interval
round_interval = true # Round collection interval to 'interval'
metric_batch_size = 1000 # Max metrics sent in one batch
metric_buffer_limit = 10000 # Max metrics stored when outputs are unavailable
collection_jitter = "0s" # Collection jitter to stagger collection times
flush_interval = "5s" # Metrics output interval
flush_jitter = "0s" # Output jitter to stagger output times
precision = "" # Timestamp precision in output
debug = false # Enable more log info for debugging
quiet = false # Enable less log info
logfile = "" # Log file path, empty means log to stderr
hostname = "" # Host identifier, empty means use os.Hostname()
omit_hostname = false # Include hostname in output
# Listen for Telegraf metrics agents
[[inputs.influxdb_v2_listener]]
service_address = ":8086"
# tls_cert = ""
# tls_key = ""
# token = "@{secretstore:metrics_auth_token}" # Uncomment this line to use token authentication
# Expose port for prometheus to scrape all metrics from
[[outputs.prometheus_client]]
listen = ":12001"
# Secretstore configuration
[[secretstores.docker]]
id = "secretstore"
See https://docs.influxdata.com/telegraf/v1/plugins/ for the full plugin directory and plugin documentation. The important fields are:
inputs.influxdb_v2_listener
: plugin that listens for incoming metrics from cache agents.
  service_address
  : the address and port to listen on.
  tls_cert
  : the path to the TLS certificate file if TLS should be used.
  tls_key
  : the path to the TLS key file if TLS should be used.
  - Note that the service is run as a container with the directory /opt/edgeware/acd/ssl mounted as a volume with the same path in the container. The certificate and key files should be placed in this directory.
  token
  : the token to use for authentication. This field is commented out by default. Uncomment this line to use token authentication.
outputs.prometheus_client
: opens a port for Prometheus to scrape metrics from.
  listen
  : the port to listen on.
secretstores.docker
: plugin for using Podman secrets to authenticate incoming metrics.
  id
  : the ID of the secret store to use. Used in the token field for the inputs.influxdb_v2_listener plugin.
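After changing the configuration file, restart the service. As a quick check from the Director host, the Prometheus endpoint configured above (port 12001) can be queried locally.
systemctl restart acd-telegraf-metrics-database
curl -s http://localhost:12001/metrics | head -n 5   # should print Prometheus-formatted metrics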
Using Secrets for Request Authorization
Secrets can be used to authenticate incoming metrics requests to
acd-metrics-aggregator
and acd-telegraf-metrics-database
by requiring the
header Authorization: Token secret_value
when receiving metrics. The Telegraf
metrics agent can be configured to attach this header when exporting metrics to
acd-metrics-aggregator
and acd-telegraf-metrics-database
.
Cache Metrics Agent
The secret configuration on the Telegraf metrics agent can be seen by running
$ confcli integration.acd.telegraf.secrets
{
"secrets": {
"enable": false,
"key": "metrics_auth_token",
"secretStoreID": "telegraf_configd_secretstore"
}
}
To set the secret value with the key/name metrics_auth_token
using the secret
store telegraf_configd_secretstore
, run
$ telegraf secret set telegraf_configd_secretstore metrics_auth_token
Enter secret value:
This will prompt you to enter the secret value. Once the secret value is set, the Telegraf metrics agent can use the secret to authenticate with the targets by running
$ confcli integration.acd.telegraf.secrets.enable true
integration.acd.telegraf.secrets.enable = True
This will set the header Authorization: Token secret_value
when exporting
metrics to the configured targets.
Note that if you change the secret value, you need to restart the Telegraf metrics agent, either by updating the configuration (e.g. disabling the service and enabling it again) or by running
systemctl restart telegraf
acd-metrics-aggregator
and acd-telegraf-metrics-database
Both acd-metrics-aggregator
and acd-telegraf-metrics-database
use secrets
supplied by Podman to enable request authorization. During installation, a
placeholder secret is created with the name metrics_auth_token
in Podman. This
secret is loaded into the respective containers when starting the services.
To securely set the secret value, the following commands will prompt you to
enter the secret value and store it in a temporary environment variable. The
secret value is then piped to podman
to store the secret value. Lastly,
the environment variable is unset
to ensure that the secret value is removed
from the environment:
read -sp "Enter secret value: " SECRET_VALUE
printf "$SECRET_VALUE" | podman secret create --replace metrics_auth_token -
unset SECRET_VALUE
This replaces the placeholder secret metrics_auth_token with the value you entered at the prompt.
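To confirm that the secret is now present in Podman, without revealing its value, the standard Podman commands can be used.
podman secret ls | grep metrics_auth_token
podman secret inspect metrics_auth_token   # shows metadata only, not the secret value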
Enabling Request Authorization
To use the secret for request authorization in acd-metrics-aggregator
, modify
the secrets
field in configuration file
/opt/edgeware/acd/metrics/aggregator-conf.json
to:
{
"secrets": {
"enable": true,
"key": "metrics_auth_token"
}
}
Then restart the service:
systemctl restart acd-metrics-aggregator
To use the secret for request authorization in acd-telegraf-metrics-database
,
modify the configuration file to use the secret by uncommenting the token
field in the inputs.influxdb_v2_listener
plugin:
[[inputs.influxdb_v2_listener]]
service_address = ":8086"
token = "@{secretstore:metrics_auth_token}"
Make sure the secret store ID and secret key in the token
field match the
values in the secretstores.docker
section of the configuration file.
Then restart the service:
systemctl restart acd-telegraf-metrics-database
All incoming metrics requests to acd-metrics-aggregator
and
acd-telegraf-metrics-database
will now require the header
Authorization: Token secret_value
, otherwise the request will be rejected.
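As a rough way to verify that authorization is enforced on acd-telegraf-metrics-database, a write can be attempted against the InfluxDB v2 listener with and without the token. The host, port and line-protocol payload below are the example values used in this guide, and the exact status codes may differ in your setup.
# without the token, the request should be rejected (typically 401)
curl -s -o /dev/null -w "%{http_code}\n" -X POST "http://director-1:8086/api/v2/write" --data-binary 'probe,host=test value=1'
# with the correct token, the request should be accepted
curl -s -o /dev/null -w "%{http_code}\n" -X POST "http://director-1:8086/api/v2/write" -H "Authorization: Token <secret_value>" --data-binary 'probe,host=test value=1'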
Example
A Telegraf metrics agent can be configured to export metrics to a Director that
is installed on the host director-1
like this.
Assuming that acd-metrics-aggregator
is listening on port 8087 and
acd-telegraf-metrics-database
is listening on port 8086, the following
configuration will track the CPU usage, memory usage, and network interface
statistics for interfaces eths0
and eths1
:
$ confcli integration.acd.telegraf
{
"telegraf": {
"aggregators": [
"http://director-1:8087/metrics"
],
"databases": [
"http://director-1:8086"
],
"enable": true,
"enable_tls_verification": true,
"hostname": "cache-1",
"interfaces": [
"eths0",
"eths1"
],
"pushInterval": 5,
"secrets": {
"enable": false,
"key": "metrics_auth_token",
"secretStoreID": "telegraf_configd_secretstore"
}
}
}
Note that each entry in aggregators
requires the path /metrics
to be
appended to the URL. The entries in databases
do not require any path to be
appended to the URL.
Once the configuration is set, the cache metrics agent will start exporting
metrics to the instances of acd-metrics-aggregator
and
acd-telegraf-metrics-database
running on director-1
.
acd-metrics-aggregator
will aggregate the metrics and export them to the
Director’s selection input API. The metrics can be viewed by running
curl -k https://director-1:5001/v1/selection_input -H "x-api-key: <your-api-key>"
{
"cache-1": {
"hardware_metrics": {
"/media": {
"free": 2247610073088,
"total": 2261300281344,
"used": 13690208256,
"used_percent": 0.6054131054131053
},
"/non-volatile": {
"free": 1722658816,
"total": 1934635008,
"used": 95219712,
"used_percent": 5.237957901662393
},
"/var/log": {
"free": 487481344,
"total": 536870912,
"used": 49389568,
"used_percent": 9.19952392578125
},
"cpu_load1": 0.07,
"cpu_load15": 0,
"cpu_load5": 0.03,
"ehsd_online": 1,
"mem_available": 7252512768,
"mem_available_percent": 88.21865022898756,
"mem_total": 8221065216,
"mem_used": 818151424,
"n_cpus": 4
},
"per_interface_metrics": {
"eths0": {
"bytes_recv": 155734911,
"bytes_recv_rate": 1552,
"bytes_sent": 2967510,
"bytes_sent_rate": 1378,
"drop_in": 843508,
"drop_in_rate": 7,
"drop_out": 0,
"drop_out_rate": 0,
"err_in": 0,
"err_in_rate": 0,
"err_out": 0,
"err_out_rate": 0,
"interface_up": true,
"link": 1,
"megabits_recv": 1245.879288,
"megabits_recv_rate": 0.012416,
"megabits_sent": 23.74008,
"megabits_sent_rate": 0.011024,
"speed": 10000,
"speed_rate": 0
},
"eths1": {
"bytes_recv": 66197103,
"bytes_recv_rate": 256.5,
"bytes_sent": 612,
"bytes_sent_rate": 0,
"drop_in": 3399,
"drop_in_rate": 0,
"drop_out": 0,
"drop_out_rate": 0,
"err_in": 0,
"err_in_rate": 0,
"err_out": 0,
"err_out_rate": 0,
"interface_up": true,
"link": 1,
"megabits_recv": 529.576824,
"megabits_recv_rate": 0.002052,
"megabits_sent": 0.004896,
"megabits_sent_rate": 0,
"speed": 10000,
"speed_rate": 0
}
}
}
}
acd-telegraf-metrics-database
will store the metrics in a time series database
to be scraped by a Prometheus instance. The metrics can be scraped manually by
running
curl -k http://director-1:12001/metrics
# HELP disk_used_percent Telegraf collected metric
# TYPE disk_used_percent untyped
disk_used_percent{device="dm-0",fstype="xfs",host="cache-1",label="vg_main-lv_root",metric_owner="telegraf-configd",mode="rw",path="/"} 12.752679256881688
# HELP ehsd_ehsd_online Telegraf collected metric
# TYPE ehsd_ehsd_online untyped
ehsd_ehsd_online{host="cache-1",metric_owner="telegraf-configd"} 1
# HELP net_megabits_sent_rate Telegraf collected metric
# TYPE net_megabits_sent_rate untyped
net_megabits_sent_rate{host="cache-1",interface="eth0",metric_owner="telegraf-configd"} 0.002174
...
Using Hardware Metrics in Routing
When the hardware metrics have been successfully injected into the selection input API, the data is ready to be used for routing decisions. In addition to creating routing conditions, hardware metrics are particularly suited to host health checks.
Host Health Checks
Host health checks are Lua functions that determine whether or not a host is ready to take on more clients. To configure host health checks, see configuring CDNs and hosts. See Built-in Lua functions for documentation on how to use the built-in health functions. Note that these health check functions can also be used for making routing conditions in the routing tree.
Routing Based on Cache Metrics
Instead of using health check functions to incorporate hardware metrics into the routing decisions, regular routing conditions can be used.
As an example, using the health check function cpu_load_ok()
in routing can be
configured as follows:
$ confcli services.routing.rules -w
Running wizard for resource 'rules'
Hint: Hitting return will set a value to its default.
Enter '?' to receive the help string
rules : [
rule can be one of
1: allow
2: consistentHashing
3: contentPopularity
4: deny
5: firstMatch
6: random
7: rawGroup
8: rawHost
9: split
10: weighted
Choose element index or name: firstMatch
Adding a 'firstMatch' element
rule : {
name (default: ): dont_overload_cache
type (default: firstMatch):
targets : [
target : {
onMatch (default: ): default_host
condition (default: always()): cpu_load_ok()
}
Add another 'target' element to array 'targets'? [y/N]: y
target : {
onMatch (default: ): offload_host
condition (default: always()):
}
Add another 'target' element to array 'targets'? [y/N]: n
]
}
Add another 'rule' element to array 'rules'? [y/N]: n
]
Generated config:
{
"rules": [
{
"name": "dont-overload-cache",
"type": "firstMatch",
"targets": [
{
"onMatch": "default_host",
"condition": "cpu_load_ok()"
},
{
"onMatch": "offload_host",
"condition": "always()"
}
]
}
]
}
Merge and apply the config? [y/n]: y
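To confirm that the rule was merged, the stored rules can be printed again:
confcli services.routing.rules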
2 - Private CDN Offload Routing with DNS
This shows a setup for DNS-based routing and third-party CDN offload from a private main CDN. The following components are used in the setup:
- A private CDN with Convoy (ESB3006), Request Router (ESB3008) and SW Streamers (ESB3004)
- ESB3024 Router to offload traffic to an external CDN
- CoreDNS to support DNS-based routing using ESB3024
The following figure shows an overview of the different components and how they interact.
A client retrieves a DNS name from its content portal. The DNS name is then resolved by CoreDNS, which asks the router for a suitable host. The router either returns a host from ESB3008 Request Router or a configured offload host.
The following sequence diagram illustrates the flow.
Follow the links below to configure different use cases of this setup.
3 - Token blocking
A common form of CDN piracy is sharing unique session tokens between multiple users. If your tokens do not encode device identifiers, it might be possible for attackers to share a token between multiple devices.
This is a common problem in the industry, and one solution is to block tokens after they have been used. This can be accomplished by storing the used tokens in a central database and checking against that database before allowing a new session to be created.
How it works
AgileTV CDN Director can be configured to produce and consume messages to and from a Kafka broker/cluster, preferably the Kafka cluster deployed by the AgileTV CDN Manager. Using Kafka, the Director can produce messages containing the used tokens, which other Director instances can consume and store locally in selection input.
Configuration
Kafka
The Manager deploys a Kafka cluster that can be used to store used tokens. The
Kafka cluster has a default topic called selection_input
that can be used to
store selection input data. This can be verified by running the following command
on the Manager host and checking that the topic selection_input
is listed:
kubectl exec acd-manager-kafka-controller-0 -- kafka-topics.sh --bootstrap-server localhost:9092 --list
__consumer_offsets
selection_input
Note that the used tokens can be stored under any Kafka topic, but for this
example we will use the default selection_input
topic. If you want to use a
different topic, refer to the Kafka documentation
for more information on how to create a new topic. Note that other services deployed by the Manager may also use the selection_input topic, so keep this in mind when choosing which topic to use.
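If you decide to use a dedicated topic, it can be created with the same kafka-topics.sh tool used above. The topic name, partition count and replication factor below are placeholders.
kubectl exec acd-manager-kafka-controller-0 -- kafka-topics.sh --bootstrap-server localhost:9092 --create --topic blocked_tokens --partitions 1 --replication-factor 1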
Connecting to the Kafka cluster
All instances of the Director must be configured to connect to the same Kafka
cluster. Assuming that an instance of the Manager is running on the host
manager-host
, the Kafka cluster can be accessed through port 9095. The
following configuration will make the Director connect to the Kafka cluster
served by the Manager:
confcli integration.kafka.bootstrapServers
{
"bootstrapServers": [
"manager-host:9095"
]
}
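As a simple sketch, basic reachability of the Kafka listener can be checked from each Director host before continuing, assuming nc is available.
nc -zv manager-host 9095   # expect a "succeeded"/"open" result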
Data streams
To consume and produce messages to and from the Kafka cluster, data streams
are used. Assuming that the Kafka topic selection_input
is used to store the used
tokens, the following configuration will make the Director consume messages from
the selection_input
topic and store them as selection input data:
confcli services.routing.dataStreams
{
"dataStreams": {
"incoming": [
{
"name": "incomingSelectionInput",
"source": "kafka",
"target": "selectionInput",
"kafkaTopics": [
"selection_input"
]
}
],
"outgoing": [
{
"name": "outgoingSelectionInput",
"type": "kafka",
"kafkaTopic": "selection_input"
}
]
}
}
The incoming
section specifies that the Director will consume messages from
the selection_input
topic and store them as selection input data. Messages are
consumed periodically and no more configuration is needed.
The outgoing
section specifies that the Director will produce messages to the
selection_input
topic. The kafkaTopic
option specifies the name of the topic
to produce messages to. However, to actually produce messages, we need to
employ a Lua function.
Lua function
The Lua function blocked_tokens.add
produces messages
to the Kafka topic. This message will be consumed by other Director instances
configured to consume messages from the selection_input
topic and be stored as
selection input data.
During routing we also need to check if the token attached to the request is
blocked and deny the request accordingly. This can be done using the Lua function
blocked_tokens.is_blocked
.
These two functions can be utilized in a custom Lua function to check if the token is blocked and produce a message to the Kafka topic if it is not.
Consider a scenario where the token is included in the query string of a request.
The Lua function below checks if the token exists in the query string, retrieves
it, and determines whether it is blocked using blocked_tokens.is_blocked
. If
the token is not blocked, it is added to the outgoing data stream with a
time-to-live (TTL) of 1 hour (3600 seconds) using blocked_tokens.add
. However,
if the token is blocked, the function returns 1, signaling that the token is
blocked and the request should be blocked.
function is_token_used()
    -- This function checks if the token is present in the query string
    -- and if it is blocked. If the token is missing or blocked, the function
    -- returns 1. If the token has not been blocked, the function returns 0.

    -- Verify if the token is present in the query string
    local token = request_query_params.token
    if token == nil then
        -- Token is not present, return 1
        return 1
    end

    -- Check if the token is blocked
    if not blocked_tokens.is_blocked(token) then
        -- Token is not blocked, add the blocked token to the outgoing data
        -- stream with a TTL of 1 hour (3600 seconds)
        local ttl_s = 3600
        blocked_tokens.add('outgoingSelectionInput', token, ttl_s)
        return 0
    else
        -- Token is blocked, return 1
        return 1
    end
end
Save this function in a file called is_token_used.lua
and upload it to the
Director using the Lua API.
When the function has been uploaded to the Director, it can be used in a deny
rule block to fully handle all token blocking:
confcli services.routing.rules
{
"rules": [
{
"name": "checkTokenRule",
"type": "deny",
"condition": "is_token_used()",
"onMiss": "someOtherRule"
}
]
}
This rule will call the function is_token_used()
for every request. If the
token is missing or blocked, the function will return 1 and the request will be
denied. If the token is not blocked, the function will add the token to the
outgoing data stream to block it for future use and return 0, allowing the
request to continue to the next rule in the chain, which is someOtherRule
in
this case.
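Once the rule is live and requests with fresh tokens are flowing, the produced messages can be spot-checked from the Manager host by consuming a few messages from the topic. This is a sketch using the standard Kafka console consumer.
kubectl exec acd-manager-kafka-controller-0 -- kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic selection_input --from-beginning --max-messages 5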
4 - Monitor ACD with Prometheus, Grafana and Alert Manager
ACD can be monitored with standard monitoring solutions, which allows you to adapt the monitoring to your needs. The following shows a setup with these components:
- Prometheus as metrics database
- Grafana to create Dashboards
- Alert Manager to create alarms from Prometheus alerts
Prometheus and Grafana can also be installed with the ESB3024 Router installer, in which case Grafana is pre-populated with standard routing dashboards as well as the router troubleshooting dashboards.
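As a minimal sketch of wiring an existing Prometheus installation to the metrics endpoint described earlier, a scrape job pointing at the acd-telegraf-metrics-database exporter could look like the snippet below. The hostname, port and file paths are assumptions; merge the job into the scrape_configs section of your prometheus.yml and validate the result with promtool.
# write the example scrape job to a temporary file (director-1:12001 is the example endpoint from above)
cat > /tmp/acd-scrape-job.yml <<'EOF'
scrape_configs:
  - job_name: 'acd-telegraf-metrics-database'
    static_configs:
      - targets: ['director-1:12001']
EOF
# after merging the job into prometheus.yml, validate the configuration (path is an assumption)
promtool check config /etc/prometheus/prometheus.yml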