Cache hardware metrics: monitoring and routing
Observability and monitoring are important in a CDN. To facilitate this, we use the open source application Telegraf in two key roles: as a metrics agent and as a metrics aggregator. Which role a given Telegraf instance performs depends on its configuration.
The image below shows an example architecture with multiple caches and multiple instances of ESB3024 Router.
Metrics agent
All deployments of ESB2001 Orbit TV Server (from version 3.6.0) and ESB3004 SW Streamer (from version 1.36.0) come bundled with an instance of Telegraf running as a metrics agent, collecting hardware metrics such as CPU, memory and network interface usage. If you are using other caches in your CDN, Telegraf can be installed and configured manually.
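The bundled agents ship with a complete configuration. For other caches, the agent needs input plugins for the hardware metrics and an output pointing at the aggregators. Below is a minimal sketch, not a shipped configuration: the plugin selection is inferred from the metrics shown later in this article, and the aggregator hostnames and port 8086 are placeholders.
# /etc/telegraf/telegraf.conf on a third-party cache -- minimal agent sketch
# Collect CPU, load, memory, disk and network interface metrics.
[[inputs.cpu]]
[[inputs.system]]
[[inputs.mem]]
[[inputs.disk]]
[[inputs.net]]

# Push the collected metrics to the aggregators (see "Metrics aggregator" below).
[[outputs.influxdb]]
  urls = ["http://aggregator-host-1:8086", "http://aggregator-host-2:8086"]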
Telegraf is configured via a configuration daemon, telegraf-configd. On the Orbit TV Server it starts automatically, but on the SW Streamer it needs to be started manually by typing this:
systemctl start telegraf-configd
To make it start on every boot, type
systemctl enable telegraf-configd
Once the configuration daemon is running, Telegraf can be configured using confcli:
$ confcli integration.acd.telegraf
{
    "telegraf": {
        "enable": true,
        "exportUrls": [],
        "hostname": ""
    }
}
To enable or disable the Telegraf instance, type
$ confcli integration.acd.telegraf.enable true
integration.acd.telegraf.enable = true
$ confcli integration.acd.telegraf.enable false
integration.acd.telegraf.enable = false
The Telegraf agents need to export their metrics to an aggregator to be useful. The field integration.acd.telegraf.exportUrls is a list of Telegraf aggregator instances to which the metrics will be exported. It is recommended to use at least two Telegraf aggregator instances for redundancy.
As an example, type this to configure a cache to export metrics to http://aggregator-host-1:8086 and http://aggregator-host-2:8086:
$ confcli integration.acd.telegraf.exportUrls -w
Running wizard for resource 'exportUrls'
<A list of aggregator URLs to export metrics to (default: [])>
Hint: Hitting return will set a value to its default.
Enter '?' to receive the help string

exportUrls <A list of aggregator URLs to export metrics to (default: [])>: [
    exportUrl (default: ): http://aggregator-host-1:8086
    Add another 'exportUrl' element to array 'exportUrls'? [y/N]: y
    exportUrl (default: ): http://aggregator-host-2:8086
    Add another 'exportUrl' element to array 'exportUrls'? [y/N]: n
]
Generated config:
{
    "exportUrls": [
        "http://aggregator-host-1:8086",
        "http://aggregator-host-2:8086"
    ]
}
Merge and apply the config? [y/n]: y
The Telegraf agent tags all metrics it collects with the hostname of the machine it is running on. To change this hostname, modify the field integration.acd.telegraf.hostname:
$ confcli integration.acd.telegraf.hostname 'new-hostname'
integration.acd.telegraf.hostname = 'new-hostname'
Metrics aggregator
Telegraf is not installed together with the router, since the installation depends on your system. Either follow these steps or ask an Agile Content integrator to set it up for you. To set up Telegraf as an aggregator on your desired hosts, two steps need to be performed.
- Install Telegraf according to official instructions.
- Configure Telegraf to act as an aggregator by adding a configuration file at /etc/telegraf/telegraf.conf, as sketched below.
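The shipped aggregator configuration is more elaborate, but a minimal sketch of the key sections could look as follows. The hostnames, ports and API key placeholder are assumptions: agents are assumed to push to this host on port 8086, Prometheus scrapes port 12001 (see Monitoring below), and a single router is assumed at https://router-host-1:5001.
# /etc/telegraf/telegraf.conf -- minimal aggregator sketch (assumed values)

# Receive metrics pushed by the Telegraf agents on the caches
# (the agents' exportUrls point at this listener).
[[inputs.influxdb_listener]]
  service_address = ":8086"

# Expose the aggregated metrics for Prometheus scraping.
[[outputs.prometheus_client]]
  listen = ":12001"

# Forward metrics to the router's timestamped selection input endpoint.
# Repeat this whole section, headers included, once per router host.
[[outputs.http]]
  url = "https://router-host-1:5001/v2/timestamped_selection_input"
  method = "POST"
  data_format = "json"
  insecure_skip_verify = true
  [outputs.http.headers]
    X-API-Key = "<REST API key of router-host-1>"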
This makes Telegraf export the metrics to the timestamped selection input endpoint (i.e. https://acd-host:5001/v2/timestamped_selection_input). By default, metrics sent to this endpoint have a timeout of 15 seconds, after which the data will be removed to avoid using stale data. This timeout is configurable by running
$ confcli services.routing.tuning.general.timestampedSelectionInputTimeoutSeconds 30
services.routing.tuning.general.timestampedSelectionInputTimeoutSeconds = 30
Due to a limitation of the [[outputs.http]] plugin, multiple output URLs cannot be defined as a list; the plugin only accepts a single URL. This means that the whole [[outputs.http]] section, including [outputs.http.headers], needs to be repeated for each router host.
Each router section needs to be modified in two places (see the example after this list):
- Below [[outputs.http]], the url directive must point to the router's address.
- Below [outputs.http.headers], X-API-Key must contain the router's REST API key. The key is stored in /opt/edgeware/acd/router/cache/rest-api-key.json on the router server.
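For instance, with two routers the configuration would contain two copies of the whole output section. The second copy might look like this sketch, where the hostname and key placeholder are assumptions:
[[outputs.http]]
  url = "https://router-host-2:5001/v2/timestamped_selection_input"
  method = "POST"
  data_format = "json"
  insecure_skip_verify = true
  [outputs.http.headers]
    X-API-Key = "<REST API key of router-host-2>"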
After the configuration file is complete, type systemctl reload telegraf to make Telegraf reload the new configuration.
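To verify that Telegraf is running with the new configuration and exporting without errors, the standard systemd tools can be used:
systemctl status telegraf
journalctl -u telegraf -f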
When the aggregator is successfully started and configured, a metrics JSON packet is sent to the selection input API with the following structure:
{
    "cache-1": {
        "hardware_metrics": {
            "/": {
                "free": 18113810432,
                "total": 34887954432,
                "used": 16774144000,
                "used_percent": 48.08004445400899
            },
            "cpu_load1": 0.70,
            "cpu_load5": 0.53,
            "cpu_load15": 0.40,
            "mem_available": 3129425920,
            "mem_available_percent": 76.34786936062332,
            "mem_total": 4098904064,
            "mem_used": 519901184,
            "n_cpus": 2
        },
        "per_interface_metrics": {
            "eth0": {
                "bytes_recv": 57585596566,
                "bytes_recv_rate": 939.4, // bytes per second
                "bytes_sent": 10127106702,
                "bytes_sent_rate": 637.9, // bytes per second
                "drop_in": 1800079,
                "drop_in_rate": 0,
                "drop_out": 0,
                "drop_out_rate": 0,
                "err_in": 0,
                "err_in_rate": 0,
                "err_out": 0,
                "err_out_rate": 0,
                "interface_up": true,
                "link": 1,
                "megabits_recv": 460684,
                "megabits_recv_rate": 0.0075152, // megabits per second
                "megabits_sent": 81016,
                "megabits_sent_rate": 0.0051032, // megabits per second
                "speed": 100
            }
        },
        "timestamp": 1709296800
    }
}
Monitoring
All router installations come bundled with a Prometheus instance that can be configured to scrape the aggregator instances, allowing alarms and visualization of all caches’ metrics.
On a router host, modify the configuration file /opt/edgeware/acd/prometheus/prometheus.yaml to include a scrape job for the aggregators in your system:
global:
  scrape_interval: 15s
rule_files:
  - recording-rules.yaml
scrape_configs:
  - job_name: 'router-scraper'
    scheme: https
    tls_config:
      insecure_skip_verify: true
    static_configs:
      - targets:
          - 10.16.48.106:5001
    metrics_path: /m1/v1/metrics
    honor_timestamps: true
  - job_name: 'edns-proxy-scraper'
    scheme: http
    static_configs:
      - targets:
          - 10.16.48.106:8888
    metrics_path: /metrics
    honor_timestamps: true
  - job_name: 'metric-aggregators'
    scheme: http
    static_configs:
      - targets:
          - aggregator-host-1:12001
          - aggregator-host-2:12001
    metrics_path: /metrics
    honor_timestamps: true
Once the new configuration file is set, restart the Prometheus instance with systemctl restart acd-prometheus. When the Prometheus instance starts scraping the aggregators, all caches' metrics will be available for visualization in Grafana and alarms in AlertManager.
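With the metrics in Prometheus, alerting rules can be added alongside the scrape configuration. As a sketch, a rule file referenced from rule_files could alert on sustained CPU load; the metric name system_load5 assumes Telegraf's default system plugin naming, and the threshold is only an example:
groups:
  - name: cache-hardware-alerts
    rules:
      - alert: CacheHighCpuLoad
        # Assumed metric name from Telegraf's system plugin; threshold is an example.
        expr: system_load5 > 8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load on {{ $labels.host }}"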
Using hardware metrics in routing
When the hardware metrics have been successfully injected into the selection input API, the data is ready to be used for routing decisions. In addition to creating routing conditions, hardware metrics are particularly suited to host health checks.
Host health checks
Host health checks are Lua functions that determine whether or not a host is ready to take on more clients. To configure host health checks, see configuring CDNs and hosts. See Built-in Lua functions for documentation on how to use the built-in health functions. Note that these health check functions can also be used for making routing conditions in the routing tree.
Routing based on cache metrics
Instead of using health check functions to incorporate hardware metrics into the routing decisions, regular routing conditions can be used.
As an example, using the health check function cpu_load_ok() in routing can be configured as follows:
$ confcli services.routing.rules -w
Running wizard for resource 'rules'
Hint: Hitting return will set a value to its default.
Enter '?' to receive the help string
rules : [
    rule can be one of
        1: allow
        2: consistentHashing
        3: contentPopularity
        4: deny
        5: firstMatch
        6: random
        7: rawGroup
        8: rawHost
        9: split
        10: weighted
    Choose element index or name: firstMatch
    Adding a 'firstMatch' element
    rule : {
        name (default: ): dont_overload_cache
        type (default: firstMatch):
        targets : [
            target : {
                onMatch (default: ): default_host
                condition (default: always()): cpu_load_ok()
            }
            Add another 'target' element to array 'targets'? [y/N]: y
            target : {
                onMatch (default: ): offload_host
                condition (default: always()):
            }
            Add another 'target' element to array 'targets'? [y/N]: n
        ]
    }
    Add another 'rule' element to array 'rules'? [y/N]: n
]
Generated config:
{
    "rules": [
        {
            "name": "dont_overload_cache",
            "type": "firstMatch",
            "targets": [
                {
                    "onMatch": "default_host",
                    "condition": "cpu_load_ok()"
                },
                {
                    "onMatch": "offload_host",
                    "condition": "always()"
                }
            ]
        }
    ]
}
Merge and apply the config? [y/n]: y
The applied rule results in the following routing tree configuration, where the first member is used as long as cpu_load_ok() holds:
{
    "routing": {
        "id": "dont_overload_cache",
        "member_order": "sequential",
        "members": [
            {
                "id": "default-node",
                "host_id": "default_host",
                "weight_function": "return cpu_load_ok()"
            },
            {
                "id": "offload-node",
                "host_id": "offload_host",
                "weight_function": "return 1"
            }
        ]
    }
}