Cache hardware metrics: monitoring and routing

How to set up monitoring and make routing decisions based on cache hardware metrics

Observability and monitoring are important in a CDN. To support this, we use the open source application Telegraf in two key roles: as a metrics agent and as a metrics aggregator. Which role a Telegraf instance performs depends on its configuration.

The image below demonstrates an example architecture when using multiple caches and instances of ESB3024 Router.

Architecture

Metrics agent

All deployments of ESB2001 Orbit TV Server (from version 3.6.0) and ESB3004 SW Streamer (from version 1.36.0) come bundled with an instance of Telegraf running as a metrics agent, collecting information on hardware metrics such as usage of CPU, memory and network interfaces. If you are using other caches in your CDN, Telegraf can be manually installed and configured.

Telegraf is configured via a configuration daemon, telegraf-configd. On the Orbit TV Server it starts automatically, but on the SW Streamer it needs to be started manually by typing:

systemctl start telegraf-configd

To make it always start during boot, type

systemctl enable telegraf-configd

Once the configuration daemon is running, Telegraf can be configured using confcli:

$ confcli integration.acd.telegraf
{
  "telegraf": {
    "enable": true,
    "exportUrls": [],
    "hostname": ""
  }
}

To enable or disable the Telegraf instance, type

$ confcli integration.acd.telegraf.enable true
integration.acd.telegraf.enable = True

$ confcli integration.acd.telegraf.enable false
integration.acd.telegraf.enable = false

The Telegraf agents need to export their metrics to an aggregator to be useful. The field integration.acd.telegraf.exportUrls is a list of Telegraf aggregator instances to which the metrics will be exported. It is recommended to use at least two Telegraf aggregator instances for redundancy.

As an example, type this to configure a cache to export metrics to http://aggregator-host-1:8086 and http://aggregator-host-2:8086:

$ confcli integration.acd.telegraf.exportUrls -w
Running wizard for resource 'exportUrls'
<A list of aggregator URLs to export metrics to (default: [])>

Hint: Hitting return will set a value to its default.
Enter '?' to receive the help string

exportUrls <A list of aggregator URLs to export metrics to (default: [])>: [
  exportUrl (default: ): http://aggregator-host-1:8086
  Add another 'exportUrl' element to array 'exportUrls'? [y/N]: y
  exportUrl (default: ): http://aggregator-host-2:8086
  Add another 'exportUrl' element to array 'exportUrls'? [y/N]: n
]
Generated config:
{
  "exportUrls": [
    "http://aggregator-host-1:8086",
    "http://aggregator-host-2:8086"
  ]
}
Merge and apply the config? [y/n]: y

The Telegraf agent tags all metrics it collects with the hostname of the machine it is running on. To change this hostname, modify the field integration.acd.telegraf.hostname:

$ confcli integration.acd.telegraf.hostname 'new-hostname'
integration.acd.telegraf.hostname = 'new-hostname'

Metrics aggregator

Telegraf is not installed together with the router since it is dependent on your system. Either follow these steps or ask an Agile Content integrator to set it up for your system. To set up Telegraf as an aggregator on your desired hosts, two steps need to be performed.

  1. Install Telegraf according to official instructions.
  2. Configure Telegraf to act as an aggregator by adding the following configuration file: /etc/telegraf/telegraf.conf

This makes Telegraf export the metrics to the timestamped selection input endpoint (i.e. https://acd-host:5001/v2/timestamped_selection_input). By default, metrics sent to this endpoint have a timeout of 15 seconds, after which the data will be removed to avoid using stale data. This timeout is configurable by running

$ confcli services.routing.tuning.general.timestampedSelectionInputTimeoutSeconds 30
services.routing.tuning.general.timestampedSelectionInputTimeoutSeconds = 30
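
The effect of the timeout can be illustrated with a small Python sketch. This is not the router's implementation; it only models the behaviour described above, assuming each entry carries a Unix timestamp in seconds and that an entry exactly at the timeout is still kept. The function name drop_stale and the sample cache names are hypothetical.

```python
import time

# Default taken from the text above: data older than 15 seconds is discarded.
DEFAULT_TIMEOUT_SECONDS = 15


def drop_stale(entries, now=None, timeout=DEFAULT_TIMEOUT_SECONDS):
    """Return only the entries whose 'timestamp' is at most `timeout` seconds old.

    `entries` maps a cache name to a dict holding a Unix 'timestamp' in seconds.
    Keeping an entry that is exactly `timeout` seconds old is an assumption.
    """
    if now is None:
        now = time.time()
    return {
        name: data
        for name, data in entries.items()
        if now - data["timestamp"] <= timeout
    }


if __name__ == "__main__":
    now = 1709296810
    entries = {
        "cache-1": {"timestamp": 1709296800},  # 10 seconds old
        "cache-2": {"timestamp": 1709296785},  # 25 seconds old
    }
    print(sorted(drop_stale(entries, now=now)))              # ['cache-1']
    print(sorted(drop_stale(entries, now=now, timeout=30)))  # ['cache-1', 'cache-2']
```

With the default of 15 seconds only the fresher entry survives. Raising the timeout to 30 seconds, as in the confcli command above, keeps both.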

Due to a limitation of the [[outputs.http]] plugin, multiple output URLs cannot be defined as a list, because the plugin only accepts a single URL. This means that the whole [[outputs.http]] section, including its [outputs.http.headers] subsection, needs to be repeated for each router host.

Each router section needs to be modified in two places:

  1. Below [[outputs.http]], the url directive has to point to the router’s address.
  2. Below [outputs.http.headers], X-API-Key has to contain the router’s REST API key. The key is stored in /opt/edgeware/acd/router/cache/rest-api-key.json on the router server.
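
Because the section has to be repeated once per router, it can be convenient to generate the repeated parts. The Python sketch below is only an illustration: it renders the two directives named above (url and X-API-Key) and leaves a placeholder comment where the remaining directives of the aggregator configuration file belong. The function name, host names and API keys are hypothetical.

```python
def render_router_outputs(routers):
    """Render one [[outputs.http]] section per (url, api_key) pair.

    Only the two per-router directives described in the text are emitted.
    All other directives must be copied from the aggregator configuration file.
    """
    sections = []
    for url, api_key in routers:
        sections.append(
            "[[outputs.http]]\n"
            f'  url = "{url}"\n'
            "  # ... remaining directives from the aggregator configuration file ...\n"
            "  [outputs.http.headers]\n"
            f'    X-API-Key = "{api_key}"\n'
        )
    return "\n".join(sections)


if __name__ == "__main__":
    # Hypothetical router hosts and placeholder keys, for illustration only.
    routers = [
        ("https://acd-host-1:5001/v2/timestamped_selection_input", "KEY-OF-ROUTER-1"),
        ("https://acd-host-2:5001/v2/timestamped_selection_input", "KEY-OF-ROUTER-2"),
    ]
    print(render_router_outputs(routers))
```

Each router gets its own complete section, which works around the single-URL limitation of the plugin.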

After the configuration file is complete, type systemctl reload telegraf to make Telegraf reload the new configuration.

When the aggregator has been successfully configured and started, it sends the metrics to the selection input API as a JSON packet with the following structure:

{
  "cache-1": {
    "hardware_metrics": {
      "/": {
        "free": 18113810432,
        "total": 34887954432,
        "used": 16774144000,
        "used_percent": 48.08004445400899
      },
      "cpu_load1": 0.70,
      "cpu_load5": 0.53,
      "cpu_load15": 0.40,
      "mem_available": 3129425920,
      "mem_available_percent": 76.34786936062332,
      "mem_total": 4098904064,
      "mem_used": 519901184,
      "n_cpus": 2
    },
    "per_interface_metrics": {
      "eth0": {
        "bytes_recv": 57585596566,
        "bytes_recv_rate": 939.4, // bytes per second
        "bytes_sent": 10127106702,
        "bytes_sent_rate": 637.9, // bytes per second
        "drop_in": 1800079,
        "drop_in_rate": 0,
        "drop_out": 0,
        "drop_out_rate": 0,
        "err_in": 0,
        "err_in_rate": 0,
        "err_out": 0,
        "err_out_rate": 0,
        "interface_up": true,
        "link": 1,
        "megabits_recv": 460684,
        "megabits_recv_rate": 0.0075152, // megabits per second
        "megabits_sent": 81016,
        "megabits_sent_rate": 0.0051032, // megabits per second
        "speed": 100
      }
    },
    "timestamp": 1709296800
  }
}
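
To show how this structure can be consumed, the following Python sketch derives two figures per cache: the one-minute load per CPU core and the send utilization of each interface. It is a minimal sketch, not part of the product. It assumes that "speed" is the link speed in megabits per second, and it uses a trimmed copy of the packet above without the explanatory // comments, since those are not valid JSON. The function name summarize is hypothetical.

```python
import json

# Trimmed copy of the packet shown above, without the // comments.
SAMPLE = """
{
  "cache-1": {
    "hardware_metrics": {
      "cpu_load1": 0.70,
      "mem_available_percent": 76.34786936062332,
      "n_cpus": 2
    },
    "per_interface_metrics": {
      "eth0": {"interface_up": true, "megabits_sent_rate": 0.0051032, "speed": 100}
    },
    "timestamp": 1709296800
  }
}
"""


def summarize(payload):
    """Return load per CPU core and per-interface send utilization for each cache."""
    summary = {}
    for cache, data in payload.items():
        hw = data["hardware_metrics"]
        utilization = {}
        for name, nic in data["per_interface_metrics"].items():
            # Assumption: "speed" is the link speed in megabits per second.
            speed = nic["speed"]
            utilization[name] = nic["megabits_sent_rate"] / speed if speed > 0 else None
        summary[cache] = {
            "load_per_cpu": hw["cpu_load1"] / hw["n_cpus"],
            "mem_available_percent": hw["mem_available_percent"],
            "tx_utilization": utilization,
        }
    return summary


if __name__ == "__main__":
    for cache, values in summarize(json.loads(SAMPLE)).items():
        print(cache, values)
```

Dividing the load average by n_cpus makes caches with different core counts comparable, which is useful when the same threshold is applied to all of them.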

Monitoring

All router installations come bundled with a Prometheus instance that can be configured to scrape the aggregator instances, allowing alarms and visualization of all caches’ metrics.

On a router host, modify the configuration file /opt/edgeware/acd/prometheus/prometheus.yaml to include a scrape job for the aggregators in your system:

global:
  scrape_interval: 15s

rule_files:
  - recording-rules.yaml

scrape_configs:
  - job_name: 'router-scraper'
    scheme: https
    tls_config:
      insecure_skip_verify: true
    static_configs:
    - targets:
      - 10.16.48.106:5001
    metrics_path: /m1/v1/metrics
    honor_timestamps: true
  - job_name: 'edns-proxy-scraper'
    scheme: http
    static_configs:
    - targets:
      - 10.16.48.106:8888
    metrics_path: /metrics
    honor_timestamps: true
  - job_name: 'metric-aggregators'
    scheme: http
    static_configs:
    - targets:
      - aggregator-host-1:12001
      - aggregator-host-2:12001
    metrics_path: /metrics
    honor_timestamps: true

Once the new configuration file is set, restart the Prometheus instance with systemctl restart acd-prometheus. When the Prometheus instance starts scraping the aggregators, all caches’ metrics will be available for visualization in Grafana and alarms in AlertManager.

Using hardware metrics in routing

When the hardware metrics have been successfully injected into the selection input API, the data is ready to be used for routing decisions. In addition to creating routing conditions, hardware metrics are particularly suited to host health checks.

Host health checks

Host health checks are Lua functions that determine whether or not a host is ready to take on more clients. To configure host health checks, see configuring CDNs and hosts. See Built-in Lua functions for documentation on how to use the built-in health functions. Note that these health check functions can also be used for making routing conditions in the routing tree.

Routing based on cache metrics

As an alternative to configuring host health checks, hardware metrics can be incorporated into routing decisions through regular routing conditions.

As an example, using the health check function cpu_load_ok() in routing can be configured as follows:

$ confcli services.routing.rules -w
Running wizard for resource 'rules'

Hint: Hitting return will set a value to its default.
Enter '?' to receive the help string

rules : [
  rule can be one of
    1: allow
    2: consistentHashing
    3: contentPopularity
    4: deny
    5: firstMatch
    6: random
    7: rawGroup
    8: rawHost
    9: split
    10: weighted
  Choose element index or name: firstMatch
  Adding a 'firstMatch' element
    rule : {
      name (default: ): dont_overload_cache
      type (default: firstMatch): 
      targets : [
        target : {
          onMatch (default: ): default_host
          condition (default: always()): cpu_load_ok()
        }
        Add another 'target' element to array 'targets'? [y/N]: y
        target : {
          onMatch (default: ): offload_host
          condition (default: always()): 
        }
        Add another 'target' element to array 'targets'? [y/N]: n
      ]
    }
  Add another 'rule' element to array 'rules'? [y/N]: n
]
Generated config:
{
  "rules": [
    {
      "name": "dont_overload_cache",
      "type": "firstMatch",
      "targets": [
        {
          "onMatch": "default_host",
          "condition": "cpu_load_ok()"
        },
        {
          "onMatch": "offload_host",
          "condition": "always()"
        }
      ]
    }
  ]
}
Merge and apply the config? [y/n]: y
{
  "routing": {
    "id": "dont_overload_cache",
    "member_order": "sequential",
    "members": [
      {
        "id": "default-node",
        "host_id": "default-host",
        "weight_function": "return cpu_load_ok()"
      },
      {
        "id": "offload-node",
        "host_id": "offload-host",
        "weight_function": "return 1"
      }
    ]
  }
}
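
The behaviour of the firstMatch rule above can be modelled in a few lines of Python. This is a sketch of the selection logic only, not the router's code. The real cpu_load_ok() is a built-in Lua function whose exact criteria are documented separately, so the Python stand-in below, its use of cpu_load5 and its threshold of 0.9 per CPU core are all hypothetical.

```python
def first_match(targets, metrics):
    """Return the onMatch value of the first target whose condition holds."""
    for target in targets:
        if target["condition"](metrics):
            return target["onMatch"]
    return None  # no target matched


def cpu_load_ok(metrics, max_load_per_cpu=0.9):
    # Hypothetical stand-in for the built-in Lua function of the same name.
    return metrics["cpu_load5"] / metrics["n_cpus"] <= max_load_per_cpu


def always(metrics):
    # Mirrors the always() condition: matches regardless of the metrics.
    return True


# Same order as in the rule above: try default_host first, then fall back.
TARGETS = [
    {"onMatch": "default_host", "condition": cpu_load_ok},
    {"onMatch": "offload_host", "condition": always},
]

if __name__ == "__main__":
    print(first_match(TARGETS, {"cpu_load5": 0.53, "n_cpus": 2}))  # default_host
    print(first_match(TARGETS, {"cpu_load5": 3.80, "n_cpus": 2}))  # offload_host
```

The order of the targets matters: because the second target uses always(), it acts as a catch-all and is only reached when the condition of the first target fails.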