Solutions

ACD Reference Solutions. Blueprints for different families of use cases.

1 - Cache hardware metrics: monitoring and routing

How to set up monitoring and make routing decisions based on cache hardware metrics

Observability and monitoring are important in a CDN. To facilitate this, we use the open source application Telegraf as an agent on the caches to collect metrics such as CPU usage, memory usage, and network interface statistics.

These metrics can be exported to two services installed alongside the Director:

  • acd-metrics-aggregator: aggregates cache metrics from the CDN and exports them to the Director’s selection input API.
  • acd-telegraf-metrics-database: stores a smaller amount of time-series cache metrics produced by Telegraf metrics agents and allows the metrics to be scraped by a Prometheus instance, enabling alarms and visualization of the metrics in Grafana.

Telegraf Metrics Agent

All deployments of ESB2001 Orbit TV Server (from version 3.6.2) and ESB3004 SW Streamer (from version 1.36.2) come bundled with an instance of Telegraf running as a metrics agent. If you are using other caches in your CDN, Telegraf can be installed and configured manually, provided that licensing allows it and the caches are accessible.

Configuring Telegraf is done using the confcli tool. Before configuration, ensure that the service telegraf-configd is active by running

systemctl status telegraf-configd
systemctl restart telegraf-configd # if the service is not active

If telegraf-configd fails to start, check the log file /var/log/telegraf-configd.log for errors.
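
For example, the most recent log entries can be inspected with

tail -n 50 /var/log/telegraf-configd.log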

Configuration

Once the configuration daemon is running, Telegraf can be configured using confcli:

$ confcli integration.acd.telegraf
{
  "telegraf": {
    "aggregators": [],
    "databases": [],
    "enable": true,
    "enable_tls_verification": true,
    "hostname": "",
    "interfaces": [],
    "pushInterval": 5,
    "secrets": {
      "enable": false,
      "key": "metrics_auth_token",
      "secretStoreID": "telegraf_configd_secretstore"
    }
  }
}

The configuration fields are:

  • aggregators: a list of URLs to acd-metrics-aggregator instances to which the metrics will be exported.
  • databases: a list of URLs to acd-telegraf-metrics-database instances to which the metrics will be exported.
  • enable: whether the Telegraf instance is enabled or not.
  • enable_tls_verification: whether to verify the TLS certificate of the target when exporting metrics.
  • hostname: the hostname of the cache. This is used as a tag in the metrics. If not set, the hostname of the machine is used.
  • interfaces: a list of network interfaces to collect metrics from. Supports wildcards, e.g. eths*.
  • pushInterval: the interval in seconds for pushing metrics to targets.
  • secrets: configuration for using a secret token to authenticate with the targets.
    • secrets.enable: sets the header Authorization: Token secret_value when exporting metrics to the targets.
    • secrets.key: the key/name of the secret. Defaults to metrics_auth_token.
    • secrets.secretStoreID: the ID of the secret store to use. Defaults to telegraf_configd_secretstore.
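
Individual fields can be set directly with confcli. As a small sketch, the push interval and the reported hostname could be changed like this (the values are illustrative):

$ confcli integration.acd.telegraf.pushInterval 10
$ confcli integration.acd.telegraf.hostname cache-1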

Targets

acd-metrics-aggregator

acd-metrics-aggregator is a service that aggregates cache metrics in the CDN and exports them to the Director’s selection input API. The service is installed alongside the Director.

The service is started by default after installation. You can check the status of the service by running

systemctl status acd-metrics-aggregator
systemctl restart acd-metrics-aggregator # if the service is not active

The logs are accessed by running

journalctl -u acd-metrics-aggregator

The configuration file for the aggregator is located at /opt/edgeware/acd/metrics/aggregator-conf.json. The default configuration created during installation will export metrics to the local Director’s selection input API:

{
  "tls_cert": "",
  "tls_key": "",
  "metrics_listen_port": 8087,
  "interval": 5,
  "log_level": "INFO",
  "targets": [
    {
      "url": "https://127.0.0.1:5001/v1/timestamped_selection_input",
      "http_headers": {
        "x-api-key": "8a3094e875b841d480482cfd82e3e313"
      }
    }
  ],
  "secrets": {
    "enable": false,
    "key": "metrics_auth_token"
  }
}

The configuration fields are:

  • tls_cert: the path to the TLS certificate file if acd-metrics-aggregator should use TLS.
  • tls_key: the path to the TLS key file if acd-metrics-aggregator should use TLS.
    • Note that the service is run as a container with the directory /opt/edgeware/acd/ssl mounted as a volume with the same path in the container. The certificate and key files should be placed in this directory.
  • metrics_listen_port: the port on which the service listens for metrics.
  • interval: the interval in seconds for pushing metrics to the targets.
  • log_level: the log level for the aggregator. Can be DEBUG, INFO or WARNING.
  • targets: a list of targets to which the metrics will be exported.
    • url: the URL of the target.
    • http_headers: HTTP headers to send with the request.
      • x-api-key: the API key for the Director’s REST API. Populated by default.
  • secrets: configuration for using secret token to authenticate incoming metrics requests.
    • enable: requires the header Authorization: Token secret_value when receiving metrics.
    • key: the key/name of the secret. Defaults to metrics_auth_token.
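
For example, to serve metrics ingestion over TLS, point tls_cert and tls_key at certificate and key files placed in the mounted /opt/edgeware/acd/ssl directory. This is a sketch; the file names are illustrative:

{
  "tls_cert": "/opt/edgeware/acd/ssl/server.crt",
  "tls_key": "/opt/edgeware/acd/ssl/server.key"
}

Restart acd-metrics-aggregator afterwards for the change to take effect.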

acd-telegraf-metrics-database

acd-telegraf-metrics-database is a service that stores cache metrics in a time series database to be scraped by a Prometheus instance, allowing for alarms and visualization of the metrics in Grafana. The service is installed alongside the Director.

This service runs Telegraf in a container, receiving metrics from the Telegraf metrics agents. Note that this service is not a full-fledged time series database like InfluxDB, but a smaller database that stores a limited amount of data. The service acts as a middleman between the Telegraf metrics agents and the scraping Prometheus instance.

The service is started by default after installation. You can check the status of the service by running

systemctl status acd-telegraf-metrics-database
systemctl restart acd-telegraf-metrics-database # if the service is not active

The logs are accessed by running

journalctl -u acd-telegraf-metrics-database

The configuration file for the database is located at /opt/edgeware/acd/metrics/telegraf-metrics-database.conf:

# Global tags can be specified here in key="value" format.
[global_tags]

# Telegraf agent settings
[agent]
  interval = "5s"  # Data collection interval
  round_interval = true  # Round collection interval to 'interval'
  metric_batch_size = 1000  # Max metrics sent in one batch
  metric_buffer_limit = 10000  # Max metrics stored when outputs are unavailable
  collection_jitter = "0s"  # Collection jitter to stagger collection times
  flush_interval = "5s"  # Metrics output interval
  flush_jitter = "0s"  # Output jitter to stagger output times
  precision = ""  # Timestamp precision in output
  debug = false  # Enable more log info for debugging
  quiet = false  # Enable less log info
  logfile = ""  # Log file path, empty means log to stderr
  hostname = ""  # Host identifier, empty means use os.Hostname()
  omit_hostname = false  # Include hostname in output

# Listen for Telegraf metrics agents
[[inputs.influxdb_v2_listener]]
  service_address = ":8086"
#  tls_cert = ""
#  tls_key = ""
#  token = "@{secretstore:metrics_auth_token}" # Uncomment this line to use token authentication

# Expose port for prometheus to scrape all metrics from
[[outputs.prometheus_client]]
  listen = ":12001"

# Secretstore configuration
[[secretstores.docker]]
  id = "secretstore"

See https://docs.influxdata.com/telegraf/v1/plugins/ for the full plugin directory and plugin documentation. The important fields are:

  • inputs.influxdb_v2_listener: plugin that listens for incoming metrics from cache agents.
    • service_address: the address and port to listen on.
    • tls_cert: the path to the TLS certificate file if TLS should be used.
    • tls_key: the path to the TLS key file if TLS should be used.
      • Note that the service is run as a container with the directory /opt/edgeware/acd/ssl mounted as a volume with the same path in the container. The certificate and key files should be placed in this directory.
    • token: the token to use for authentication. This field is commented out by default. Uncomment this line to use token authentication.
  • outputs.prometheus_client: opens a port for Prometheus to scrape metrics from.
    • listen: the port to listen on.
  • secretstores.docker: plugin for using Podman secrets to authenticate incoming metrics.
    • id: the ID of the secret store to use. Used in the token field for the inputs.influxdb_v2_listener plugin.
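
For a quick check on the Director host, the endpoint opened by outputs.prometheus_client can be queried directly; assuming the default port 12001 from the configuration above:

curl -s http://127.0.0.1:12001/metrics | head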

Using Secrets for Request Authorization

Secrets can be used to authenticate incoming metrics requests to acd-metrics-aggregator and acd-telegraf-metrics-database by requiring the header Authorization: Token secret_value when receiving metrics. The Telegraf metrics agent can be configured to attach this header when exporting metrics to acd-metrics-aggregator and acd-telegraf-metrics-database.

Cache Metrics Agent

The secret configuration on the Telegraf metrics agent can be seen by running

$ confcli integration.acd.telegraf.secrets
{
    "secrets": {
        "enable": false,
        "key": "metrics_auth_token",
        "secretStoreID": "telegraf_configd_secretstore"
    }
}

To set the secret value with the key/name metrics_auth_token using the secret store telegraf_configd_secretstore, run

$ telegraf secret set telegraf_configd_secretstore metrics_auth_token
Enter secret value:

This will prompt you to enter the secret value. Once the secret value is set, the Telegraf metrics agent can use the secret to authenticate with the targets by running

$ confcli integration.acd.telegraf.secrets.enable true
integration.acd.telegraf.secrets.enable = True

This will set the header Authorization: Token secret_value when exporting metrics to the configured targets.

Note that if you change the secret value, you need to restart the Telegraf metrics agent by either updating the configuration, e.g. disabling the service and enabling it again, or by running

systemctl restart telegraf
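
If exports fail after a secret change, authentication errors typically show up in the agent's logs. Assuming the agent runs as the telegraf systemd unit, as the restart command above implies, the logs can be followed with

journalctl -u telegraf -f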

acd-metrics-aggregator and acd-telegraf-metrics-database

Both acd-metrics-aggregator and acd-telegraf-metrics-database use secrets supplied by Podman to enable request authorization. During installation, a placeholder secret is created with the name metrics_auth_token in Podman. This secret is loaded into the respective containers when starting the services.

To set the secret value securely, the following commands prompt you to enter it and store it in a temporary environment variable. The value is then piped to podman to create the secret. Lastly, the environment variable is unset so that the value is removed from the environment:

read -sp "Enter secret value: " SECRET_VALUE
printf '%s' "$SECRET_VALUE" | podman secret create --replace metrics_auth_token -
unset SECRET_VALUE

The Podman secret metrics_auth_token now holds the value you entered.
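
You can confirm that the secret exists; podman does not display the stored value:

podman secret ls | grep metrics_auth_token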

Enabling Request Authorization

To use the secret for request authorization in acd-metrics-aggregator, modify the secrets field in configuration file /opt/edgeware/acd/metrics/aggregator-conf.json to:

{
  "secrets": {
    "enable": true,
    "key": "metrics_auth_token"
  }
}

Then restart the service:

systemctl restart acd-metrics-aggregator

To use the secret for request authorization in acd-telegraf-metrics-database, modify the configuration file to use the secret by uncommenting the token field in the inputs.influxdb_v2_listener plugin:

[[inputs.influxdb_v2_listener]]
  service_address = ":8086"
  token = "@{secretstore:metrics_auth_token}"

Make sure the secret store ID and secret key in the token field match the values in the secretstores.docker section of the configuration file.

Then restart the service:

systemctl restart acd-telegraf-metrics-database

All incoming metrics requests to acd-metrics-aggregator and acd-telegraf-metrics-database will now require the header Authorization: Token secret_value, otherwise the request will be rejected.
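
As a quick verification sketch against acd-telegraf-metrics-database, assuming that its influxdb_v2_listener implements the InfluxDB v2 write API at /api/v2/write: a write without the token should be rejected, while one carrying it should be accepted. Note that the second call injects a throwaway metric named test_metric:

# without the token: expect an authorization error (HTTP 401)
curl -s -o /dev/null -w '%{http_code}\n' -X POST 'http://director-1:8086/api/v2/write' \
  --data-binary 'test_metric value=1'
# with the token: expect success (HTTP 204)
curl -s -o /dev/null -w '%{http_code}\n' -X POST 'http://director-1:8086/api/v2/write' \
  -H 'Authorization: Token <secret_value>' \
  --data-binary 'test_metric value=1'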

Example

A Telegraf metrics agent can be configured to export metrics to a Director installed on the host director-1 as follows.

Assuming that acd-metrics-aggregator is listening on port 8087 and acd-telegraf-metrics-database is listening on port 8086, the following configuration will track the CPU usage, memory usage, and network interface statistics for interfaces eths0 and eths1:

$ confcli integration.acd.telegraf
{
    "telegraf": {
        "aggregators": [
            "http://director-1:8087/metrics"
        ],
        "databases": [
            "http://director-1:8086"
        ],
        "enable": true,
        "enable_tls_verification": true,
        "hostname": "cache-1",
        "interfaces": [
            "eths0",
            "eths1"
        ],
        "pushInterval": 5,
        "secrets": {
            "enable": false,
            "key": "metrics_auth_token",
            "secretStoreID": "telegraf_configd_secretstore"
        }
    }
}

Note that each entry in aggregators requires the path /metrics to be appended to the URL. The entries in databases do not require any path to be appended to the URL.

Once the configuration is set, the cache metrics agent will start exporting metrics to the instances of acd-metrics-aggregator and acd-telegraf-metrics-database running on director-1.

acd-metrics-aggregator will aggregate the metrics and export them to the Director’s selection input API. The metrics can be viewed by running

curl -k https://director-1:5001/v1/selection_input -H "x-api-key: <your-api-key>"
{
  "cache-1": {
    "hardware_metrics": {
      "/media": {
        "free": 2247610073088,
        "total": 2261300281344,
        "used": 13690208256,
        "used_percent": 0.6054131054131053
      },
      "/non-volatile": {
        "free": 1722658816,
        "total": 1934635008,
        "used": 95219712,
        "used_percent": 5.237957901662393
      },
      "/var/log": {
        "free": 487481344,
        "total": 536870912,
        "used": 49389568,
        "used_percent": 9.19952392578125
      },
      "cpu_load1": 0.07,
      "cpu_load15": 0,
      "cpu_load5": 0.03,
      "ehsd_online": 1,
      "mem_available": 7252512768,
      "mem_available_percent": 88.21865022898756,
      "mem_total": 8221065216,
      "mem_used": 818151424,
      "n_cpus": 4
    },
    "per_interface_metrics": {
      "eths0": {
        "bytes_recv": 155734911,
        "bytes_recv_rate": 1552,
        "bytes_sent": 2967510,
        "bytes_sent_rate": 1378,
        "drop_in": 843508,
        "drop_in_rate": 7,
        "drop_out": 0,
        "drop_out_rate": 0,
        "err_in": 0,
        "err_in_rate": 0,
        "err_out": 0,
        "err_out_rate": 0,
        "interface_up": true,
        "link": 1,
        "megabits_recv": 1245.879288,
        "megabits_recv_rate": 0.012416,
        "megabits_sent": 23.74008,
        "megabits_sent_rate": 0.011024,
        "speed": 10000,
        "speed_rate": 0
      },
      "eths1": {
        "bytes_recv": 66197103,
        "bytes_recv_rate": 256.5,
        "bytes_sent": 612,
        "bytes_sent_rate": 0,
        "drop_in": 3399,
        "drop_in_rate": 0,
        "drop_out": 0,
        "drop_out_rate": 0,
        "err_in": 0,
        "err_in_rate": 0,
        "err_out": 0,
        "err_out_rate": 0,
        "interface_up": true,
        "link": 1,
        "megabits_recv": 529.576824,
        "megabits_recv_rate": 0.002052,
        "megabits_sent": 0.004896,
        "megabits_sent_rate": 0,
        "speed": 10000,
        "speed_rate": 0
      }
    }
  }
}

acd-telegraf-metrics-database will store the metrics in a time series database to be scraped by a Prometheus instance. The metrics can be scraped manually by running

curl -k http://director-1:12001/metrics
# HELP disk_used_percent Telegraf collected metric
# TYPE disk_used_percent untyped
disk_used_percent{device="dm-0",fstype="xfs",host="cache-1",label="vg_main-lv_root",metric_owner="telegraf-configd",mode="rw",path="/"} 12.752679256881688
# HELP ehsd_ehsd_online Telegraf collected metric
# TYPE ehsd_ehsd_online untyped
ehsd_ehsd_online{host="cache-1",metric_owner="telegraf-configd"} 1
# HELP net_megabits_sent_rate Telegraf collected metric
# TYPE net_megabits_sent_rate untyped
net_megabits_sent_rate{host="cache-1",interface="eth0",metric_owner="telegraf-configd"} 0.002174
...
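
On the Prometheus side, a minimal scrape configuration for this endpoint could look like the following sketch; the job name and scrape interval are illustrative:

scrape_configs:
  - job_name: 'acd-telegraf-metrics-database'
    scrape_interval: 15s
    static_configs:
      - targets: ['director-1:12001']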

Using Hardware Metrics in Routing

When the hardware metrics have been successfully injected into the selection input API, the data is ready to be used for routing decisions. In addition to creating routing conditions, hardware metrics are particularly suited to host health checks.

Host Health Checks

Host health checks are Lua functions that determine whether or not a host is ready to take on more clients. To configure host health checks, see configuring CDNs and hosts. See Built-in Lua functions for documentation on how to use the built-in health functions. Note that these health check functions can also be used for making routing conditions in the routing tree.

Routing Based on Cache Metrics

Instead of using health check functions to incorporate hardware metrics into the routing decisions, regular routing conditions can be used.

As an example, the health check function cpu_load_ok() can be used as a routing condition as follows:

$ confcli services.routing.rules -w
Running wizard for resource 'rules'

Hint: Hitting return will set a value to its default.
Enter '?' to receive the help string

rules : [
  rule can be one of
    1: allow
    2: consistentHashing
    3: contentPopularity
    4: deny
    5: firstMatch
    6: random
    7: rawGroup
    8: rawHost
    9: split
    10: weighted
  Choose element index or name: firstMatch
  Adding a 'firstMatch' element
    rule : {
      name (default: ): dont_overload_cache
      type (default: firstMatch): 
      targets : [
        target : {
          onMatch (default: ): default_host
          condition (default: always()): cpu_load_ok()
        }
        Add another 'target' element to array 'targets'? [y/N]: y
        target : {
          onMatch (default: ): offload_host
          condition (default: always()): 
        }
        Add another 'target' element to array 'targets'? [y/N]: n
      ]
    }
  Add another 'rule' element to array 'rules'? [y/N]: n
]
Generated config:
{
  "rules": [
    {
      "name": "dont-overload-cache",
      "type": "firstMatch",
      "targets": [
        {
          "onMatch": "default_host",
          "condition": "cpu_load_ok()"
        },
        {
          "onMatch": "offload_host",
          "condition": "always()"
        }
      ]
    }
  ]
}
Merge and apply the config? [y/n]: y

2 - Private CDN Offload Routing with DNS

Use CoreDNS and ESB3024 Router to offload traffic from a private CDN

This solution shows a setup for DNS-based routing and third-party CDN offload from a private main CDN. The following components are used in the setup:

  • A private Edgeware CDN with Convoy (ESB3006), Request Router (ESB3008) and SW Streamers (ESB3004)
  • ESB3024 Router to offload traffic to an external CDN
  • CoreDNS to support DNS-based routing using ESB3024

The following figure shows an overview of the different components and how they interact.

[Figure: Solution Overview]

A client retrieves a DNS name from its content portal. The DNS name is then resolved by CoreDNS, which asks the router for a suitable host. The router returns either a host from the ESB3008 Request Router or a configured offload host.

The following sequence diagram illustrates the flow.

[Figure: DNS Resolution Sequence Diagram]

Follow the links below to configure different use cases of this setup.

3 - Monitor ACD with Prometheus, Grafana and Alert Manager

Use Prometheus, Grafana and Alert Manager to monitor ACD

ACD can be monitored with standard monitoring solutions, allowing you to adapt the monitoring to your needs. The following setup uses these components:

  • Prometheus as metrics database
  • Grafana to create Dashboards
  • Alert Manager to create alarms from Prometheus alerts

[Figure: Monitoring overview]

Prometheus and Grafana can also be installed with the ESB3024 Router installer, in which case Grafana comes pre-populated with standard routing dashboards as well as the router troubleshooting dashboards.
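
As a starting point for alarms, Prometheus alerting rules can be built on the metrics exported by acd-telegraf-metrics-database. The sketch below fires when a cache's ehsd_ehsd_online metric (shown in the scrape example in the first solution) stays at 0 for five minutes; the rule name, duration and labels are illustrative:

groups:
  - name: acd-cache-alerts
    rules:
      - alert: CacheOffline
        expr: ehsd_ehsd_online == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Cache {{ $labels.host }} reports ehsd offline"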