1 - System troubleshooting
ESB3024 contains the tool ew-sysinfo, which gives an overview of how the system is doing. Run the command and the tool will output information about the system and the installed ESB3024 services.
The output format can be changed using the --format flag. Possible values are human (default) and json, e.g.:
$ ew-sysinfo
system:
  os: ['5.4.17-2136.321.4.el8uek.x86_64', 'Oracle Linux Server 8.8']
  cpu_cores: 2
  cpu_load_average: [0.03, 0.03, 0.0]
  memory_usage: 478 MB
  memory_load_average: [0.03, 0.03, 0.0]
  boot_time: 2023-09-08T08:30:57Z
  uptime: 6 days, 3:43:44.640665
  processes: 122
  open_sockets:
    ipv4: 12
    ipv6: 18
    ip_total: 30
    tcp_over_ipv4: 9
    tcp_over_ipv6: 16
    tcp_total: 25
    udp_over_ipv4: 3
    udp_over_ipv6: 2
    udp_total: 5
    total: 145
system_disk (/):
  total: 33271 MB
  used: 7978 MB (24.00%)
  free: 25293 MB
journal_disk (/run/log/journal):
  total: 1954 MB
  used: 217 MB (11.10%)
  free: 1736 MB
vulnerabilities:
  meltdown: Mitigation: PTI
  spectre_v1: Mitigation: usercopy/swapgs barriers and __user pointer sanitization
  spectre_v2: Mitigation: Retpolines, STIBP: disabled, RSB filling, PBRSB-eIBRS: Not affected
processes:
  orc-re:
    pid: 177199
    status: sleeping
    cpu_usage_percent: 1.0%
    cpu_load_average: 131.11%
    memory_usage: 14 MB (0.38%)
    num_threads: 10
hints:
  get_raw_router_config: cat /opt/edgeware/acd/router/cache/config.json
  get_confd_config: cat /opt/edgeware/acd/confd/store/__active
  get_router_logs: journalctl -u acd-router
  get_edns_proxy_logs: journalctl -u acd-edns-proxy
  check_firewall_status: systemctl status firewalld
  check_firewall_config: iptables -nvL
# For --format=json, it's recommended to pipe the output to a JSON interpreter
# such as jq
$ ew-sysinfo --format=json | jq
{
  "system": {
    "os": [
      "5.4.17-2136.321.4.el8uek.x86_64",
      "Oracle Linux Server 8.8"
    ],
    "cpu_cores": 2,
    "cpu_load_average": [
      0.01,
      0.0,
      0.0
    ],
    "memory_usage": "479 MB",
    "memory_load_average": [
      0.01,
      0.0,
      0.0
    ],
    "boot_time": "2023-09-08 08:30:57",
    "uptime": "6 days, 5:12:24.617114",
    "processes": 123,
    "open_sockets": {
      "ipv4": 13,
      "ipv6": 18,
      "ip_total": 31,
      "tcp_over_ipv4": 10,
      "tcp_over_ipv6": 16,
      "tcp_total": 26,
      "udp_over_ipv4": 3,
      "udp_over_ipv6": 2,
      "udp_total": 5,
      "total": 146
    }
  },
  "system_disk (/)": {
    "total": "33271 MB",
    "used": "7977 MB (24.00%)",
    "free": "25293 MB"
  },
  "journal_disk (/run/log/journal)": {
    "total": "1954 MB",
    "used": "225 MB (11.50%)",
    "free": "1728 MB"
  },
  "vulnerabilities": {
    "meltdown": "Mitigation: PTI",
    "spectre_v1": "Mitigation: usercopy/swapgs barriers and __user pointer sanitization",
    "spectre_v2": "Mitigation: Retpolines, STIBP: disabled, RSB filling, PBRSB-eIBRS: Not affected"
  },
  "processes": {
    "orc-re": {
      "pid": 177199,
      "status": "sleeping",
      "cpu_usage_percent": "0.0%",
      "cpu_load_average": "137.63%",
      "memory_usage": "14 MB (0.38%)",
      "num_threads": 10
    }
  }
}
Note that your system might have different monitored processes and field names.
The field hints is different from the rest. It lists common commands that can be used to further monitor system performance, which is useful for quickly troubleshooting a faulty system.
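When the JSON output is consumed by a script rather than read by a person, the string-valued fields need a little parsing. The following is a minimal sketch, assuming the section names and the "N MB (P%)" value format shown in the example above, which may differ on your system. The helper name and the 80% threshold are illustrative and not part of ew-sysinfo.

```python
import json
import re

# Trimmed sample of `ew-sysinfo --format=json` output, copied from the example above.
SAMPLE = """
{
  "system_disk (/)": {"total": "33271 MB", "used": "7977 MB (24.00%)", "free": "25293 MB"},
  "journal_disk (/run/log/journal)": {"total": "1954 MB", "used": "225 MB (11.50%)", "free": "1728 MB"}
}
"""


def disk_usage_percent(sysinfo: dict) -> dict:
    """Return {section name: used percent} for every section that has a 'used' field."""
    usage = {}
    for name, section in sysinfo.items():
        if not isinstance(section, dict) or "used" not in section:
            continue
        # 'used' looks like "7977 MB (24.00%)"; extract the number before the '%'.
        match = re.search(r"\(([\d.]+)%\)", str(section["used"]))
        if match:
            usage[name] = float(match.group(1))
    return usage


usage = disk_usage_percent(json.loads(SAMPLE))
for name, percent in usage.items():
    # 80% is an arbitrary example threshold, not a product recommendation.
    flag = "WARNING" if percent >= 80.0 else "ok"
    print(f"{name}: {percent:.1f}% used [{flag}]")
```

On a live system the input could instead come from the tool itself, for example through subprocess.run(["ew-sysinfo", "--format=json"], capture_output=True, text=True).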
2 - Scraping data with Prometheus
Prometheus is a third-party data scraper which is installed as a containerized service in the default installation of ESB3024 Router. It periodically reads metrics data from different services, such as acd-router, aggregates it and makes it available to other services that visualize the data. Those services include Grafana and Alertmanager.
The Prometheus configuration file can be found on the host at /opt/edgeware/acd/prometheus/prometheus.yaml.
Accessing Prometheus
Prometheus has a web interface that is listening for HTTP connections on port 9090. There is no authentication, so anyone who has access to the host that is running Prometheus can access the interface.
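Besides the web interface, the same port serves Prometheus' HTTP API, which is convenient for scripted checks. The following is a minimal sketch that uses the standard instant-query endpoint /api/v1/query. The host name acd-router-1 and the helper names are assumptions made for illustration; substitute the host that runs Prometheus.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(host: str, promql: str, port: int = 9090) -> str:
    """Build a URL for Prometheus' instant-query endpoint (/api/v1/query)."""
    return f"http://{host}:{port}/api/v1/query?{urlencode({'query': promql})}"


def instant_query(host: str, promql: str) -> list:
    """Run an instant query and return the list of result samples."""
    with urlopen(build_query_url(host, promql), timeout=5) as response:
        body = json.load(response)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]


# Only the URL is built here; call instant_query() on a host that can reach Prometheus.
url = build_query_url("acd-router-1", "num_sessions")
print(url)
```

Because the interface is unauthenticated, a script like this needs nothing more than network access to port 9090.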
Starting / Stopping Prometheus
After the service is configured, it can be managed via systemd, under the service unit acd-prometheus.
systemctl start acd-prometheus
Logging
The container logs are automatically published to the system journal, under the same unit descriptor, and can be viewed using journalctl:
journalctl -u acd-prometheus
3 - Visualizing data with Grafana
3.1 - Managing Grafana
Grafana displays graphs based on data from Prometheus. A default deployment of Grafana is running in a container alongside ESB3024 Router.
Grafana’s configuration and runtime files are stored under /opt/edgeware/acd/grafana. It comes with default dashboards that are documented at Grafana dashboards.
Accessing Grafana
Grafana’s web interface is listening for HTTP connections on port 3000. It has two default accounts, edgeware and admin.
The edgeware account can only view graphs, while the admin account can also edit graphs. The accounts with default passwords are shown in the table below.
Account | Default password |
---|---|
edgeware | edgeware |
admin | edgeware |
Starting / Stopping Grafana
Grafana can be managed via systemd, under the service unit acd-grafana.
systemctl start acd-grafana
Logging
The container logs are automatically published to the system journal, under the same unit descriptor, and can be viewed using journalctl:
journalctl -u acd-grafana
3.2 - Grafana Dashboards
Grafana will be populated with pre-configured graphs which present some metrics on a time scale. Below is a comprehensive list of those dashboards, along with short descriptions.
Router Monitoring dashboard
This dashboard is set as the home dashboard by default; it is what the user sees after logging in.
Number Of Initial Routing Decisions
HTTP Status Codes
Total number of responses sent back to incoming requests, shown by their status codes. Metric: client_response_status
Incoming HTTP and HTTPS Requests
Total number of incoming requests that were deemed valid, divided into SSL and Unencrypted categories. Metric: num_valid_http_requests
Debugging Information dashboard
Number of Lua Exceptions
Number of exceptions encountered so far while evaluating Lua rules. Metric: lua_num_errors
Number of Lua Contexts
Number of active Lua interpreters, both running and idle. Metric: lua_num_evaluators
Time Spent In Lua
Number of microseconds the Lua interpreters were running. Metric: lua_time_spent
Router Latencies
Histogram-like graph showing how many responses were sent within the given latency interval. Metric: orc_latency_bucket
Internal debugging
A folder that contains dashboards intended for internal use.
ACD: Incoming Internet Connections dashboard
SSL Warnings
Rate of warnings logged during TLS connections. Metric: num_ssl_warnings_total
SSL Errors
Rate of errors logged during TLS connections. Metric: num_ssl_errors_total
Valid Internet HTTPS Requests
Rate of incoming requests that were deemed valid, HTTPS only. Metric: num_valid_http_requests
Invalid Internet HTTPS Requests
Rate of incoming requests that were deemed invalid, HTTPS only. Metric: num_invalid_http_requests
Valid Internet HTTP Requests
Rate of incoming requests that were deemed valid, HTTP only. Metric: num_valid_http_requests
Invalid Internet HTTP Requests
Rate of incoming requests that were deemed invalid, HTTP only. Metric: num_invalid_http_requests
Prometheus: ACD dashboard
Logged Warnings
Rate of logged warnings since the router has started, divided into CDN-related and CDN-unrelated. Metric: num_log_warnings_total
Logged Errors
Rate of logged errors since the router has started. Metric: num_log_errors_total
HTTP Requests
Rate of responses sent to incoming connections. Metric: orc_latency_count
Number Of Active Sessions
Number of sessions opened on router that are still active. Metric: num_sessions
Total Number Of Sessions
Total number of sessions opened on router. Metric: num_sessions
Session Type Counts (Non-Stacked)
Number of active sessions divided by type; see metric documentation linked below for up-to-date list of types. Metric: num_sessions
Prometheus/ACD: Subrunners
Client Connections
Number of currently open client connections per subrunner. Metric: subrunner_client_conns
Asynchronous Queues (Current)
Number of queued events per subrunner, roughly corresponding to load. Metric: subrunner_async_queue
Used <Send/receive> Data Blocks
Number of send or receive data blocks currently in use per subrunner, as decided by the “Send/receive” drop down box. Metric: subrunner_used_send_data_blocks and subrunner_used_receive_data_blocks
Asynchronous Queues (Max)
Maximum number of events waiting in queue. Metric: subrunner_max_async_queue
Total <Send/receive> Data Blocks
Number of send or receive data blocks allocated per subrunner, as decided by the “Send/receive” drop down box. Metric: subrunner_total_send_data_blocks and subrunner_total_receive_data_blocks
Low Queue (Current)
Number of low priority events queued per subrunner. Metric: subrunner_low_queue
Medium Queue (Current)
Number of medium priority events queued per subrunner. Metric: subrunner_medium_queue
High Queue (Current)
Number of high priority events queued per subrunner. Metric: subrunner_high_queue
Low Queue (Max)
Maximum number of events waiting in low priority queue. Metric: subrunner_max_low_queue
Medium Queue (Max)
Maximum number of events waiting in medium priority queue. Metric: subrunner_max_medium_queue
High Queue (Max)
Maximum number of events waiting in high priority queue. Metric: subrunner_max_high_queue
Wakeups
The number of times a subrunner has been woken up from sleep. Metric: subrunner_io_wakeups
Overloaded
The number of times the number of queued events for a subrunner exceeded its maximum. Metric: subrunner_times_worker_overloaded
Autopause
Number of sockets that have been automatically paused. This happens when the work manager is under heavy load. Metric: subrunner_io_autopause_sockets
4 - Alarms and Alerting
Alerts are generated by the third-party service Prometheus, which sends them to the Alertmanager service. A default containerized instance of Alertmanager is deployed alongside ESB3024 Router. Out of the box, Alertmanager ships with only a sample configuration file and requires manual configuration before alerting can be enabled. Because there are many possible configurations for how alerts are detected and where they are pushed, follow the official Alertmanager documentation to configure the service.
The router ships with Alertmanager 0.25, the documentation for which can be found at prometheus.io.
The Alertmanager configuration file can be found on the host at /opt/edgeware/acd/alertmanager/alertmanager.yml.
Accessing Alertmanager
Alertmanager has a web interface that is listening for HTTP connections on port 9093. There is no authentication, so anyone who has access to the host that is running Alertmanager can access the interface.
Starting / Stopping Alertmanager
After the service is configured, it can be managed via systemd, under the service unit acd-alertmanager.
systemctl start acd-alertmanager
Logging
The container logs are automatically published to the system journal, under the same unit descriptor, and can be viewed using journalctl:
journalctl -u acd-alertmanager
5 - Monitoring multiple routers
By default, an instance of Prometheus only monitors the ESB3024 Router that is installed on the same host as Prometheus itself. It is possible to make it monitor other router instances and visualize all instances on one Grafana instance.
Configuring Prometheus
This is configured in the scraping configuration of Prometheus, which is found in the file /opt/edgeware/acd/prometheus/prometheus.yaml and typically looks like this:
global:
  scrape_interval: 15s
rule_files:
  - recording-rules.yaml
# A scrape configuration for router metrics
scrape_configs:
  - job_name: 'router-scraper'
    scheme: https
    tls_config:
      insecure_skip_verify: true
    static_configs:
      - targets:
        - acd-router-1:5001
    metrics_path: /m1/v1/metrics
    honor_timestamps: true
  - job_name: 'edns-proxy-scraper'
    scheme: http
    static_configs:
      - targets:
        - acd-router-1:8888
    metrics_path: /metrics
    honor_timestamps: true
More routers can be added to the scrape configuration by adding them under targets in the scraper jobs.
For instance, to monitor acd-router-2 and acd-router-3 alongside acd-router-1, the configuration file needs to be modified like this:
global:
  scrape_interval: 15s
rule_files:
  - recording-rules.yaml
# A scrape configuration for router metrics
scrape_configs:
  - job_name: 'router-scraper'
    scheme: https
    tls_config:
      insecure_skip_verify: true
    static_configs:
      - targets:
        - acd-router-1:5001
        - acd-router-2:5001
        - acd-router-3:5001
    metrics_path: /m1/v1/metrics
    honor_timestamps: true
  - job_name: 'edns-proxy-scraper'
    scheme: http
    static_configs:
      - targets:
        - acd-router-1:8888
        - acd-router-2:8888
        - acd-router-3:8888
    metrics_path: /metrics
    honor_timestamps: true
After the file has been modified, Prometheus needs to be restarted by running:
systemctl restart acd-prometheus
It is possible to use the same configuration on multiple routers, so that all routers in a deployment can monitor each other.
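After a restart, it can be worth confirming that Prometheus actually reaches every configured router. The following is a minimal sketch, assuming Prometheus' standard /api/v1/targets endpoint on port 9090. The sample response is hand-written and trimmed for illustration, not captured from a real system, and the helper name is hypothetical.

```python
import json

# Hypothetical, trimmed response from GET http://<prometheus-host>:9090/api/v1/targets
SAMPLE_RESPONSE = """
{"status": "success", "data": {"activeTargets": [
  {"labels": {"instance": "acd-router-1:5001", "job": "router-scraper"}, "health": "up"},
  {"labels": {"instance": "acd-router-2:5001", "job": "router-scraper"}, "health": "up"},
  {"labels": {"instance": "acd-router-3:5001", "job": "router-scraper"}, "health": "down"}
]}}
"""


def unhealthy_targets(targets_response: dict) -> list:
    """Return (job, instance) pairs for all active targets whose health is not 'up'."""
    failing = []
    for target in targets_response["data"]["activeTargets"]:
        if target.get("health") != "up":
            labels = target.get("labels", {})
            failing.append((labels.get("job"), labels.get("instance")))
    return failing


failing = unhealthy_targets(json.loads(SAMPLE_RESPONSE))
print(failing)
```

An empty list means that every target listed in the scrape configuration was reachable at the last scrape.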
Selecting router in Grafana
In the top left corner, the Grafana dashboards have a drop-down menu labeled “ACD Router”, which lets you choose which router to monitor.
6 - Routing Rule Evaluation Metrics
ESB3024 Router counts the number of times a node or any of its children is selected in the routing table.
The visit counters can be retrieved with the following endpoints:
/v1/node_visits
Returns visit counters for each node as a flat list of host:counter pairs in JSON.
Example output:
{ "node1": "1", "node2": "1", "node3": "1", "top": "3" }
/v1/node_visits_graph
Returns a full graph of nodes with their respective visit counters in GraphML.
Example output:
<?xml version="1.0"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
  <key id="visits" for="node" attr.name="visits" attr.type="string" />
  <graph id="G" edgedefault="directed">
    <node id="routing_table">
      <data key="visits">5</data>
    </node>
    <node id="cdn1">
      <data key="visits">1</data>
    </node>
    <node id="node1">
      <data key="visits">1</data>
    </node>
    <node id="cdn2">
      <data key="visits">2</data>
    </node>
    <node id="node2">
      <data key="visits">2</data>
    </node>
    <node id="cdn3">
      <data key="visits">2</data>
    </node>
    <node id="node3">
      <data key="visits">2</data>
    </node>
    <edge id="e0" source="cdn1" target="node1" />
    <edge id="e1" source="routing_table" target="cdn1" />
    <edge id="e2" source="cdn2" target="node2" />
    <edge id="e3" source="routing_table" target="cdn2" />
    <edge id="e4" source="cdn3" target="node3" />
    <edge id="e5" source="routing_table" target="cdn3" />
  </graph>
</graphml>
To receive the graph as JSON, specify Accept:application/json in the request headers.
Example output:
{
  "edges": [
    { "source": "cdn1", "target": "node1" },
    { "source": "routing_table", "target": "cdn1" },
    { "source": "cdn2", "target": "node2" },
    { "source": "routing_table", "target": "cdn2" },
    { "source": "cdn3", "target": "node3" },
    { "source": "routing_table", "target": "cdn3" }
  ],
  "nodes": [
    { "id": "routing_table", "visits": "5" },
    { "id": "cdn1", "visits": "1" },
    { "id": "node1", "visits": "1" },
    { "id": "cdn2", "visits": "2" },
    { "id": "node2", "visits": "2" },
    { "id": "cdn3", "visits": "2" },
    { "id": "node3", "visits": "2" }
  ]
}
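The JSON graph is easier to read once the edges are turned back into a tree. The following is a minimal sketch that walks the example output above from the routing_table root and lists each node with its depth and visit count. It assumes the graph is a tree without cycles and that visit counters are strings holding integers, as in the example; the function name is illustrative.

```python
import json
from collections import defaultdict

# The JSON example output from above, unchanged in content.
GRAPH = json.loads("""
{ "edges": [ { "source": "cdn1", "target": "node1" },
             { "source": "routing_table", "target": "cdn1" },
             { "source": "cdn2", "target": "node2" },
             { "source": "routing_table", "target": "cdn2" },
             { "source": "cdn3", "target": "node3" },
             { "source": "routing_table", "target": "cdn3" } ],
  "nodes": [ { "id": "routing_table", "visits": "5" },
             { "id": "cdn1", "visits": "1" }, { "id": "node1", "visits": "1" },
             { "id": "cdn2", "visits": "2" }, { "id": "node2", "visits": "2" },
             { "id": "cdn3", "visits": "2" }, { "id": "node3", "visits": "2" } ] }
""")


def visits_tree(graph: dict, root: str) -> list:
    """Return (depth, node id, visits) tuples in depth-first order, starting at root."""
    visits = {node["id"]: int(node["visits"]) for node in graph["nodes"]}
    children = defaultdict(list)
    for edge in graph["edges"]:
        children[edge["source"]].append(edge["target"])
    rows, stack = [], [(0, root)]
    while stack:
        depth, node_id = stack.pop()
        rows.append((depth, node_id, visits.get(node_id, 0)))
        # Push children in reverse so they are visited in their original order.
        for child in reversed(children[node_id]):
            stack.append((depth + 1, child))
    return rows


rows = visits_tree(GRAPH, "routing_table")
for depth, node_id, count in rows:
    print(f"{'  ' * depth}{node_id}: {count}")
```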
Resetting Visit Counters
A node visit counter with an id not matching any node id of a newly applied routing table is destroyed.
Reset all counters to zero by momentarily applying a configuration with a placeholder routing root node that has a unique id and an empty members list, e.g.:
"routing": {
  "id": "empty_routing_table",
  "members": []
}
… and immediately reapply the desired configuration.
7 - Metrics
ESB3024 Router collects a large number of metrics that can give insight into its condition at runtime. Those metrics are available in Prometheus’ text-based exposition format at endpoint :5001/m1/v1/metrics.
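For ad hoc inspection without Prometheus, the text-based exposition format can be parsed with a few lines of code. The following is a minimal sketch that handles plain name{labels} value lines and skips comments; it does not cover every corner of the format, such as timestamps. The sample values and HELP text are made up for illustration, and the function name is hypothetical.

```python
import re

# Made-up sample in Prometheus' text-based exposition format.
SAMPLE = '''\
# HELP num_requests Total number of requests received by the router.
# TYPE num_requests counter
num_requests{selector="initial"} 120
num_requests{selector="instream"} 30
lua_num_errors 0
'''

LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> list:
    """Return a list of (metric name, labels dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if not match:
            continue  # ignore lines this sketch does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


samples = parse_metrics(SAMPLE)
for name, labels, value in samples:
    print(name, labels, value)
```

On a live router the text would be fetched from the metrics endpoint instead of the embedded sample.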
Below is the description of these metrics along with their labels.
client_response_status
Number of responses sent back to incoming requests.
- Type: counter
lua_num_errors
Number of errors encountered when evaluating Lua rules.
- Type: counter
lua_num_evaluators
Number of Lua rules evaluators (active interpreters).
- Type: gauge
lua_time_spent
Time spent by running Lua evaluators, in microseconds.
- Type: counter
num_configuration_changes
Number of times configuration has been changed since the router has started.
- Type: counter
num_endpoint_requests
Number of requests redirected per CDN endpoint.
- Type: counter
- Labels:
  - endpoint - CDN endpoint address.
  - selector - whether the request was counted during initial or instream selection.
num_invalid_http_requests
Number of client requests that use either the wrong method or the wrong URL path, plus all requests that cannot be parsed as HTTP.
- Type: counter
- Labels:
  - source - name of internal filter function that classified request as invalid. Probably not of much use outside debugging.
  - type - whether the request was HTTP (Unencrypted) or HTTPS (SSL).
num_log_errors_total
Number of logged errors since the router has started.
- Type: counter
num_log_warnings_total
Number of logged warnings since the router has started.
- Type: counter
num_managed_redirects
Number of redirects to the router itself, which allows session management.
- Type: counter
num_manifests
Number of cached manifests.
- Type: gauge
- Labels:
  - count - state of manifest in cache, can be either lru, evicted or total.
num_qoe_losses
Number of “lost” QoE decisions per CDN.
- Type: counter
- Labels:
  - cdn_id - ID of the CDN that lost the QoE battle.
  - cdn_name - name of the CDN that lost the QoE battle.
  - selector - whether the decision was taken during initial or instream selection.
num_qoe_wins
Number of “won” QoE decisions per CDN.
- Type: counter
- Labels:
  - cdn_id - ID of the CDN that won the QoE battle.
  - cdn_name - name of the CDN that won the QoE battle.
  - selector - whether the decision was taken during initial or instream selection.
num_rejected_requests
Deprecated, should always be at 0.
- Type: counter
- Labels:
  - selector - whether the request was counted during initial or instream selection.
num_requests
Total number of requests received by the router.
- Type: counter
- Labels:
  - selector - whether the request was counted during initial or instream selection.
num_sessions
Number of sessions opened on router.
- Type: gauge
- Labels:
  - state - either active or inactive.
  - type - one of: initial, instream, qoe_on, qoe_off, qoe_agent or sp_agent.
num_ssl_errors_total
Number of all errors logged during TLS connections, both incoming and outgoing.
- Type: counter
num_ssl_warnings_total
Number of all warnings logged during TLS connections, both incoming and outgoing.
- Type: counter
- Labels:
  - category - which kind of TLS connection triggered the warning. Can be one of: cdn, content, generic, repeated_session or empty.
num_unhandled_requests
Number of requests for which no CDN could be found.
- Type: counter
- Labels:
  - selector - whether the request was counted during initial or instream selection.
num_unmanaged_redirects
Number of redirects to “outside” the router - usually to CDN.
- Type: counter
- Labels:
  - cdn_id - ID of CDN picked for redirection.
  - cdn_name - name of CDN picked for redirection.
  - selector - whether the redirect was the result of initial or instream selection.
num_valid_http_requests
Number of received requests that were not deemed invalid, see num_invalid_http_requests.
- Type: counter
- Labels:
  - source - name of internal filter function that classified request as invalid. Probably not of much use outside debugging.
  - type - whether the request was HTTP (Unencrypted) or HTTPS (SSL).
orc_latency_bucket
Total number of responses sorted into “latency buckets” - labels denoting latency interval.
- Type: counter
- Labels:
  - le - latency bucket that given response falls into.
  - orc_status_code - HTTP status code of given response.
orc_latency_count
Total number of responses.
- Type: counter
- Labels:
  - tls - whether the response was sent via SSL/TLS connection or not.
  - orc_status_code - HTTP status code of given response.
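When reading orc_latency_bucket values by hand, it helps to know how le buckets are usually interpreted. The following is a minimal sketch, assuming the metric follows the common Prometheus histogram convention in which each le bucket is cumulative, that is, it counts all responses at or below that bound. That convention is an assumption here, and the bucket bounds and counts below are invented for illustration.

```python
import math


def per_interval_counts(cumulative: dict) -> list:
    """Convert cumulative {upper bound: count} buckets into per-interval counts.

    Returns a list of ((lower bound, upper bound), count) tuples.
    """
    result = []
    previous_bound, previous_count = 0.0, 0
    for bound in sorted(cumulative):
        count = cumulative[bound]
        # Responses in this interval = cumulative count minus the previous bucket.
        result.append(((previous_bound, bound), count - previous_count))
        previous_bound, previous_count = bound, count
    return result


# Invented cumulative counts keyed by the "le" upper bound; +Inf catches everything.
buckets = {0.001: 40, 0.01: 90, 0.1: 99, math.inf: 100}
intervals = per_interval_counts(buckets)
for (low, high), count in intervals:
    print(f"{low} < latency <= {high}: {count} responses")
```

Under that assumption, the count in the +Inf bucket equals the total number of responses for the same label set.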
ssl_certificate_days_remaining
Number of days until an SSL certificate expires.
- Type: gauge
- Labels:
  - domain - the common name of the domain that the certificate authenticates.
  - not_valid_after - the expiry time of the certificate.
  - not_valid_before - when the certificate starts being valid.
  - usable - if the certificate is usable to the router, see the ssl_certificate_usable_count metric for an explanation.
ssl_certificate_usable_count
Number of usable SSL certificates. A certificate is usable if it is valid and authenticates a domain name that points to the router.
- Type: gauge
7.1 - Internal Metrics
A subrunner is an internal module of ESB3024 Router which handles routing requests. The subrunner metrics are technical and mainly of interest to Agile Content, so they are only briefly described here.
subrunner_async_queue
Number of queued events per subrunner, roughly corresponding to load.
- Type: gauge
- Labels:
  - subrunner_id - ID of given subrunner.
subrunner_client_conns
Number of currently open client connections per subrunner.
- Type: gauge
- Labels:
  - subrunner_id - ID of given subrunner.
subrunner_high_queue
Number of high priority events queued per subrunner.
- Type: gauge
- Labels:
  - subrunner_id - ID of given subrunner.
subrunner_io_autopause_sockets
Number of sockets that have been automatically paused. This happens when the work manager is under heavy load.
- Type: counter
- Labels:
  - subrunner_id - ID of given subrunner.
subrunner_io_send_data_fast_attempts
A fast data path was added that in many cases increases the performance of the router. This metric was added to verify that the fast data path is taken.
- Type: counter
- Labels:
  - subrunner_id - ID of given subrunner.
subrunner_io_wakeups
The number of times a subrunner has been woken up from sleep.
- Type: counter
- Labels:
  - subrunner_id - ID of given subrunner.
subrunner_low_queue
Number of low priority events queued per subrunner.
- Type: gauge
- Labels:
  - subrunner_id - ID of given subrunner.
subrunner_max_async_queue
Maximum number of events waiting in queue.
- Type: gauge
- Labels:
  - subrunner_id - ID of given subrunner.
subrunner_max_high_queue
Maximum number of events waiting in high priority queue.
- Type: gauge
- Labels:
  - subrunner_id - ID of given subrunner.
subrunner_max_low_queue
Maximum number of events waiting in low priority queue.
- Type: gauge
- Labels:
  - subrunner_id - ID of given subrunner.
subrunner_max_medium_queue
Maximum number of events waiting in medium priority queue.
- Type: gauge
- Labels:
  - subrunner_id - ID of given subrunner.
subrunner_medium_queue
Number of medium priority events queued per subrunner.
- Type: gauge
- Labels:
  - subrunner_id - ID of given subrunner.
subrunner_times_worker_overloaded
Number of times when queued events for given subrunner exceeded the tuning.overload_threshold value (defaults to 32).
- Type: counter
- Labels:
  - subrunner_id - ID of given subrunner.
subrunner_total_receive_data_blocks
Number of receive data blocks allocated per subrunner.
- Type: gauge
- Labels:
  - subrunner_id - ID of given subrunner.
subrunner_total_send_data_blocks
Number of send data blocks allocated per subrunner.
- Type: gauge
- Labels:
  - subrunner_id - ID of given subrunner.
subrunner_used_receive_data_blocks
Number of receive data blocks currently in use per subrunner. Same as subrunner_total_receive_data_blocks.
- Type: gauge
- Labels:
  - subrunner_id - ID of given subrunner.
subrunner_used_send_data_blocks
Number of send data blocks currently in use per subrunner. Same as subrunner_total_send_data_blocks.
- Type: gauge
- Labels:
  - subrunner_id - ID of given subrunner.