Cache hardware metrics: monitoring and routing
Observability and monitoring are important in a CDN. To facilitate this, we use the open source application Telegraf in two key roles: as a metrics agent and as a metrics aggregator. Which role a given Telegraf instance performs depends on its configuration.
The image below shows an example architecture with multiple caches and multiple instances of ESB3024 Router.
Metrics agent
All deployments of ESB2001 Orbit TV Server (from version 3.6.0) and ESB3004 SW Streamer (from version 1.36.0) come bundled with an instance of Telegraf running as a metrics agent, collecting hardware metrics such as CPU, memory and network interface usage. If you are using other caches in your CDN, Telegraf can be installed and configured manually.
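The bundled agents ship with a complete configuration. For other caches, the agent needs input plugins for the hardware metrics and an output pointing at the aggregators. Below is a minimal sketch, not a shipped configuration: the plugin selection is inferred from the metrics shown later in this article, and the aggregator hostnames and port 8086 are placeholders.
# /etc/telegraf/telegraf.conf on a third-party cache -- minimal agent sketch
# Collect CPU, load, memory, disk and network interface metrics.
[[inputs.cpu]]
[[inputs.system]]
[[inputs.mem]]
[[inputs.disk]]
[[inputs.net]]

# Push the collected metrics to the aggregators (see "Metrics aggregator" below).
[[outputs.influxdb]]
  urls = ["http://aggregator-host-1:8086", "http://aggregator-host-2:8086"]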
Telegraf is configured via a configuration daemon, telegraf-configd. On the Orbit TV Server it starts automatically, but on the SW Streamer it needs to be started manually by typing this:
systemctl start telegraf-configd
To make it start on every boot, type
systemctl enable telegraf-configd
Once the configuration daemon is running, Telegraf can be configured using confcli:
$ confcli integration.acd.telegraf
{
    "telegraf": {
        "enable": true,
        "exportUrls": [],
        "hostname": ""
    }
}
To enable or disable the Telegraf instance, type
$ confcli integration.acd.telegraf.enable true
integration.acd.telegraf.enable = true
$ confcli integration.acd.telegraf.enable false
integration.acd.telegraf.enable = false
The Telegraf agents need to export their metrics to an aggregator to be useful. The field integration.acd.telegraf.exportUrls is a list of Telegraf aggregator instances to which the metrics will be exported. It is recommended to use at least two Telegraf aggregator instances for redundancy.
As an example, type this to configure a cache to export metrics to http://aggregator-host-1:8086 and http://aggregator-host-2:8086:
$ confcli integration.acd.telegraf.exportUrls -w
Running wizard for resource 'exportUrls'
<A list of aggregator URLs to export metrics to (default: [])>
Hint: Hitting return will set a value to its default.
Enter '?' to receive the help string

exportUrls <A list of aggregator URLs to export metrics to (default: [])>: [
    exportUrl (default: ): http://aggregator-host-1:8086
    Add another 'exportUrl' element to array 'exportUrls'? [y/N]: y
    exportUrl (default: ): http://aggregator-host-2:8086
    Add another 'exportUrl' element to array 'exportUrls'? [y/N]: n
]
Generated config:
{
    "exportUrls": [
        "http://aggregator-host-1:8086",
        "http://aggregator-host-2:8086"
    ]
}
Merge and apply the config? [y/n]: y
The Telegraf agent tags all metrics it collects with the hostname of the machine it is running on. To change this hostname, modify the field integration.acd.telegraf.hostname:
$ confcli integration.acd.telegraf.hostname 'new-hostname'
integration.acd.telegraf.hostname = 'new-hostname'
Metrics aggregator
Telegraf is not installed together with the router, since the installation depends on your system. Either follow these steps or ask an Agile Content integrator to set it up for you. To set up Telegraf as an aggregator on your desired hosts, two steps need to be performed.
- Install Telegraf according to official instructions.
- Configure Telegraf to act as an aggregator by adding a configuration file at /etc/telegraf/telegraf.conf, as sketched below.
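The shipped aggregator configuration is more elaborate, but a minimal sketch of the key sections could look as follows. The hostnames, ports and API key placeholder are assumptions: agents are assumed to push to this host on port 8086, Prometheus scrapes port 12001 (see Monitoring below), and a single router is assumed at https://router-host-1:5001.
# /etc/telegraf/telegraf.conf -- minimal aggregator sketch (assumed values)

# Receive metrics pushed by the Telegraf agents on the caches
# (the agents' exportUrls point at this listener).
[[inputs.influxdb_listener]]
  service_address = ":8086"

# Expose the aggregated metrics for Prometheus scraping.
[[outputs.prometheus_client]]
  listen = ":12001"

# Forward metrics to the router's timestamped selection input endpoint.
# Repeat this whole section, headers included, once per router host.
[[outputs.http]]
  url = "https://router-host-1:5001/v2/timestamped_selection_input"
  method = "POST"
  data_format = "json"
  insecure_skip_verify = true
  [outputs.http.headers]
    X-API-Key = "<REST API key of router-host-1>"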
This makes Telegraf export the metrics to the timestamped selection input endpoint (i.e. https://acd-host:5001/v2/timestamped_selection_input). By default, metrics sent to this endpoint have a timeout of 15 seconds, after which the data will be removed to avoid using stale data. This timeout is configurable by running
$ confcli services.routing.tuning.general.timestampedSelectionInputTimeoutSeconds 30
services.routing.tuning.general.timestampedSelectionInputTimeoutSeconds = 30
Due to a limitation of the [[outputs.http]] plugin, multiple output URLs cannot be defined as a list; the plugin only accepts a single URL. This means that the whole [[outputs.http]] section, including [outputs.http.headers], needs to be repeated for each router host.
Each router section needs to be modified in two places (see the example after this list):
- Below [[outputs.http]], the url directive must point to the router's address.
- Below [outputs.http.headers], X-API-Key must contain the router's REST API key. The key is stored in /opt/edgeware/acd/router/cache/rest-api-key.json on the router server.
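For instance, with two routers the configuration would contain two copies of the whole output section. The second copy might look like this sketch, where the hostname and key placeholder are assumptions:
[[outputs.http]]
  url = "https://router-host-2:5001/v2/timestamped_selection_input"
  method = "POST"
  data_format = "json"
  insecure_skip_verify = true
  [outputs.http.headers]
    X-API-Key = "<REST API key of router-host-2>"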
After the configuration file is complete, type systemctl reload telegraf to make Telegraf reload the new configuration.
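To verify that Telegraf is running with the new configuration and exporting without errors, the standard systemd tools can be used:
systemctl status telegraf
journalctl -u telegraf -f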
When the aggregator is successfully started and configured, a metrics JSON packet is sent to the selection input API with the following structure:
{
    "cache-1": {
        "hardware_metrics": {
            "/": {
                "free": 18113810432,
                "total": 34887954432,
                "used": 16774144000,
                "used_percent": 48.08004445400899
            },
            "cpu_load1": 0.70,
            "cpu_load5": 0.53,
            "cpu_load15": 0.40,
            "mem_available": 3129425920,
            "mem_available_percent": 76.34786936062332,
            "mem_total": 4098904064,
            "mem_used": 519901184,
            "n_cpus": 2
        },
        "per_interface_metrics": {
            "eth0": {
                "bytes_recv": 57585596566,
                "bytes_recv_rate": 939.4, // bytes per second
                "bytes_sent": 10127106702,
                "bytes_sent_rate": 637.9, // bytes per second
                "drop_in": 1800079,
                "drop_in_rate": 0,
                "drop_out": 0,
                "drop_out_rate": 0,
                "err_in": 0,
                "err_in_rate": 0,
                "err_out": 0,
                "err_out_rate": 0,
                "interface_up": true,
                "link": 1,
                "megabits_recv": 460684,
                "megabits_recv_rate": 0.0075152, // megabits per second
                "megabits_sent": 81016,
                "megabits_sent_rate": 0.0051032, // megabits per second
                "speed": 100
            }
        },
        "timestamp": 1709296800
    }
}
Monitoring
All router installations come bundled with a Prometheus instance that can be configured to scrape the aggregator instances, allowing alarms and visualization of all caches’ metrics.
On a router host, modify the configuration file /opt/edgeware/acd/prometheus/prometheus.yaml to include a scrape job for the aggregators in your system:
global:
  scrape_interval: 15s
rule_files:
  - recording-rules.yaml
scrape_configs:
  - job_name: 'router-scraper'
    scheme: https
    tls_config:
      insecure_skip_verify: true
    static_configs:
      - targets:
          - 10.16.48.106:5001
    metrics_path: /m1/v1/metrics
    honor_timestamps: true
  - job_name: 'edns-proxy-scraper'
    scheme: http
    static_configs:
      - targets:
          - 10.16.48.106:8888
    metrics_path: /metrics
    honor_timestamps: true
  - job_name: 'metric-aggregators'
    scheme: http
    static_configs:
      - targets:
          - aggregator-host-1:12001
          - aggregator-host-2:12001
    metrics_path: /metrics
    honor_timestamps: true
Once the new configuration file is set, restart the Prometheus instance with systemctl restart acd-prometheus. When the Prometheus instance starts scraping the aggregators, all caches' metrics will be available for visualization in Grafana and alarms in AlertManager.
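With the metrics in Prometheus, alerting rules can be added alongside the scrape configuration. As a sketch, a rule file referenced from rule_files could alert on sustained CPU load; the metric name system_load5 assumes Telegraf's default system plugin naming, and the threshold is only an example:
groups:
  - name: cache-hardware-alerts
    rules:
      - alert: CacheHighCpuLoad
        # Assumed metric name from Telegraf's system plugin; threshold is an example.
        expr: system_load5 > 8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load on {{ $labels.host }}"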
Using hardware metrics in routing
When the hardware metrics have been successfully injected into the selection input API, the data is ready to be used for routing decisions. In addition to creating routing conditions, hardware metrics are particularly suited to host health checks.
Host health checks
Host health checks are Lua functions that determine whether or not a host is ready to take on more clients. To configure host health checks, see configuring CDNs and hosts. See Built-in Lua functions for documentation on how to use the built-in health functions. Note that these health check functions can also be used for making routing conditions in the routing tree.
Routing based on cache metrics
Instead of using health check functions to incorporate hardware metrics into the routing decisions, regular routing conditions can be used.
As an example, using the health check function cpu_load_ok() in routing can be configured as follows:
$ confcli services.routing.rules -w
Running wizard for resource 'rules'
Hint: Hitting return will set a value to its default.
Enter '?' to receive the help string
rules : [
    rule can be one of
        1: allow
        2: consistentHashing
        3: contentPopularity
        4: deny
        5: firstMatch
        6: random
        7: rawGroup
        8: rawHost
        9: split
        10: weighted
    Choose element index or name: firstMatch
    Adding a 'firstMatch' element
    rule : {
        name (default: ): dont_overload_cache
        type (default: firstMatch):
        targets : [
            target : {
                onMatch (default: ): default_host
                condition (default: always()): cpu_load_ok()
            }
            Add another 'target' element to array 'targets'? [y/N]: y
            target : {
                onMatch (default: ): offload_host
                condition (default: always()):
            }
            Add another 'target' element to array 'targets'? [y/N]: n
        ]
    }
    Add another 'rule' element to array 'rules'? [y/N]: n
]
Generated config:
{
    "rules": [
        {
            "name": "dont_overload_cache",
            "type": "firstMatch",
            "targets": [
                {
                    "onMatch": "default_host",
                    "condition": "cpu_load_ok()"
                },
                {
                    "onMatch": "offload_host",
                    "condition": "always()"
                }
            ]
        }
    ]
}
Merge and apply the config? [y/n]: y
The applied rule results in the following routing tree configuration, where the first member is used as long as cpu_load_ok() holds:
{
    "routing": {
        "id": "dont_overload_cache",
        "member_order": "sequential",
        "members": [
            {
                "id": "default-node",
                "host_id": "default_host",
                "weight_function": "return cpu_load_ok()"
            },
            {
                "id": "offload-node",
                "host_id": "offload_host",
                "weight_function": "return 1"
            }
        ]
    }
}