1 - System troubleshooting

Using ew-sysinfo to monitor and troubleshoot ESB3024

ESB3024 contains the tool ew-sysinfo, which gives an overview of how the system is doing. Run the command without arguments and it prints information about the system and the installed ESB3024 services.

The output format can be changed with the --format flag. Possible values are human (default) and json, e.g.:

$ ew-sysinfo
system:
   os: ['5.4.17-2136.321.4.el8uek.x86_64', 'Oracle Linux Server 8.8']
   cpu_cores: 2
   cpu_load_average: [0.03, 0.03, 0.0]
   memory_usage: 478 MB
   memory_load_average: [0.03, 0.03, 0.0]
   boot_time: 2023-09-08T08:30:57Z
   uptime: 6 days, 3:43:44.640665
   processes: 122
   open_sockets:
      ipv4: 12
      ipv6: 18
      ip_total: 30
      tcp_over_ipv4: 9
      tcp_over_ipv6: 16
      tcp_total: 25
      udp_over_ipv4: 3
      udp_over_ipv6: 2
      udp_total: 5
      total: 145
system_disk (/):
   total: 33271 MB
   used: 7978 MB (24.00%)
   free: 25293 MB
journal_disk (/run/log/journal):
   total: 1954 MB
   used: 217 MB (11.10%)
   free: 1736 MB
vulnerabilities:
   meltdown: Mitigation: PTI
   spectre_v1: Mitigation: usercopy/swapgs barriers and __user pointer sanitization
   spectre_v2: Mitigation: Retpolines, STIBP: disabled, RSB filling, PBRSB-eIBRS: Not affected
processes:
   orc-re:
      pid: 177199
      status: sleeping
      cpu_usage_percent: 1.0%
      cpu_load_average: 131.11%
      memory_usage: 14 MB (0.38%)
      num_threads: 10
hints:
   get_raw_router_config: cat /opt/edgeware/acd/router/cache/config.json
   get_confd_config: cat /opt/edgeware/acd/confd/store/__active
   get_router_logs: journalctl -u acd-router
   get_edns_proxy_logs: journalctl -u acd-edns-proxy
   check_firewall_status: systemctl status firewalld
   check_firewall_config: iptables -nvL
# For --format=json, it's recommended to pipe the output to a JSON interpreter
# such as jq

$ ew-sysinfo --format=json | jq
{
  "system": {
    "os": [
      "5.4.17-2136.321.4.el8uek.x86_64",
      "Oracle Linux Server 8.8"
    ],
    "cpu_cores": 2,
    "cpu_load_average": [
      0.01,
      0.0,
      0.0
    ],
    "memory_usage": "479 MB",
    "memory_load_average": [
      0.01,
      0.0,
      0.0
    ],
    "boot_time": "2023-09-08 08:30:57",
    "uptime": "6 days, 5:12:24.617114",
    "processes": 123,
    "open_sockets": {
      "ipv4": 13,
      "ipv6": 18,
      "ip_total": 31,
      "tcp_over_ipv4": 10,
      "tcp_over_ipv6": 16,
      "tcp_total": 26,
      "udp_over_ipv4": 3,
      "udp_over_ipv6": 2,
      "udp_total": 5,
      "total": 146
    }
  },
  "system_disk (/)": {
    "total": "33271 MB",
    "used": "7977 MB (24.00%)",
    "free": "25293 MB"
  },
  "journal_disk (/run/log/journal)": {
    "total": "1954 MB",
    "used": "225 MB (11.50%)",
    "free": "1728 MB"
  },
  "vulnerabilities": {
    "meltdown": "Mitigation: PTI",
    "spectre_v1": "Mitigation: usercopy/swapgs barriers and __user pointer sanitization",
    "spectre_v2": "Mitigation: Retpolines, STIBP: disabled, RSB filling, PBRSB-eIBRS: Not affected"
  },
  "processes": {
    "orc-re": {
      "pid": 177199,
      "status": "sleeping",
      "cpu_usage_percent": "0.0%",
      "cpu_load_average": "137.63%",
      "memory_usage": "14 MB (0.38%)",
      "num_threads": 10
    }
  }
}

Note that your system might have different monitored processes and field names.

The hints field differs from the rest: instead of reporting a measurement, it lists common commands for inspecting the system further, which is useful when quickly troubleshooting a faulty system.
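The JSON format lends itself to scripted checks. The following is a minimal sketch, assuming the field names and the "<n> MB (<p>%)" layout of the used field match the sample output above; as noted, field names may differ between systems. The function names and the thresholds are illustrative only.

```python
import json
import re

# Abridged sample taken from the JSON output shown above.
SAMPLE = """
{
  "system_disk (/)": {"total": "33271 MB", "used": "7977 MB (24.00%)", "free": "25293 MB"},
  "journal_disk (/run/log/journal)": {"total": "1954 MB", "used": "225 MB (11.50%)", "free": "1728 MB"}
}
"""

def disk_usage_percent(report):
    """Map each '*_disk (...)' section name to its used percentage."""
    usage = {}
    for name, section in report.items():
        if "_disk" not in name:
            continue
        # The percentage is embedded in the 'used' string, e.g. "7977 MB (24.00%)".
        match = re.search(r"\(([\d.]+)%\)", section["used"])
        if match:
            usage[name] = float(match.group(1))
    return usage

def disks_over(report, threshold_percent):
    """Return the disk sections whose usage exceeds the threshold, sorted by name."""
    return sorted(n for n, p in disk_usage_percent(report).items() if p > threshold_percent)

report = json.loads(SAMPLE)
print(disk_usage_percent(report))
print(disks_over(report, 20.0))
```

In practice the report would come from the output of ew-sysinfo --format=json rather than from an embedded sample.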

2 - Scraping data with Prometheus

Prometheus is a third-party data scraper which is installed as a containerized service in the default installation of ESB3024 Router. It periodically reads metrics data from different services, such as acd-router, aggregates it and makes it available to other services that visualize the data. Those services include Grafana and Alertmanager.

The Prometheus configuration file can be found on the host at /opt/edgeware/acd/prometheus/prometheus.yaml.

Accessing Prometheus

Prometheus has a web interface that is listening for HTTP connections on port 9090. There is no authentication, so anyone who has access to the host that is running Prometheus can access the interface.
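Besides the web pages, the same port serves the Prometheus HTTP API, which can be used to read metrics from a script. The sketch below builds the URL for an instant query (/api/v1/query) and extracts the values from a response. The host name acd-router-1 is only an example, and the sample response is hand-written with made-up numbers.

```python
import json
from urllib.parse import urlencode

def instant_query_url(host, promql, port=9090):
    """Build the URL for Prometheus's instant-query endpoint (/api/v1/query)."""
    return f"http://{host}:{port}/api/v1/query?{urlencode({'query': promql})}"

def vector_values(response_body):
    """Extract (labels, value) pairs from an instant-query response of type 'vector'."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    # Each result carries its labels in 'metric' and [timestamp, "value"] in 'value'.
    return [(item["metric"], float(item["value"][1])) for item in payload["data"]["result"]]

# Hand-written response in the shape Prometheus returns; the numbers are made up.
SAMPLE_RESPONSE = """
{"status": "success",
 "data": {"resultType": "vector",
          "result": [{"metric": {"__name__": "num_requests", "selector": "initial"},
                      "value": [1694500000.0, "42"]}]}}
"""

print(instant_query_url("acd-router-1", "num_requests"))
print(vector_values(SAMPLE_RESPONSE))
```

The response body can be fetched with, for example, urllib.request.urlopen(url).read() from a machine that can reach port 9090 on the host.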

Starting / Stopping Prometheus

After the service is configured, it can be managed via systemd, under the service unit acd-prometheus.

systemctl start acd-prometheus

Logging

The container logs are automatically published to the system journal, under the same unit descriptor, and can be viewed using journalctl:

journalctl -u acd-prometheus

3 - Visualizing data with Grafana

3.1 - Managing Grafana

Grafana displays graphs based on data from Prometheus. A default deployment of Grafana is running in a container alongside ESB3024 Router.

Grafana’s configuration and runtime files are stored under /opt/edgeware/acd/grafana. It comes with default dashboards that are documented at Grafana dashboards.

Accessing Grafana

Grafana’s web interface is listening for HTTP connections on port 3000. It has two default accounts, edgeware and admin.

The edgeware account can only view graphs, while the admin account can also edit graphs. The accounts with default passwords are shown in the table below.

Account     Default password
edgeware    edgeware
admin       edgeware

Starting / Stopping Grafana

Grafana can be managed via systemd, under the service unit acd-grafana.

systemctl start acd-grafana

Logging

The container logs are automatically published to the system journal, under the same unit descriptor, and can be viewed using journalctl:

journalctl -u acd-grafana

3.2 - Grafana Dashboards

Dashboards in default Grafana installation

Grafana will be populated with pre-configured graphs which present some metrics on a time scale. Below is a comprehensive list of those dashboards, along with short descriptions.

Router Monitoring dashboard

This dashboard is set as the home dashboard by default; it is what the user sees after logging in.

Number Of Initial Routing Decisions

HTTP Status Codes

Total number of responses sent back to incoming requests, shown by their status codes. Metric: client_response_status

Incoming HTTP and HTTPS Requests

Total number of incoming requests that were deemed valid, divided into SSL and Unencrypted categories. Metric: num_valid_http_requests

Debugging Information dashboard

Number of Lua Exceptions

Number of exceptions encountered so far while evaluating Lua rules. Metric: lua_num_errors

Number of Lua Contexts

Number of active Lua interpreters, both running and idle. Metric: lua_num_evaluators

Time Spent In Lua

Number of microseconds the Lua interpreters were running. Metric: lua_time_spent

Router Latencies

Histogram-like graph showing how many responses were sent within the given latency interval. Metric: orc_latency_bucket

Internal debugging

A folder that contains dashboards intended for internal use.

ACD: Incoming Internet Connections dashboard

SSL Warnings

Rate of warnings logged during TLS connections. Metric: num_ssl_warnings_total

SSL Errors

Rate of errors logged during TLS connections. Metric: num_ssl_errors_total

Valid Internet HTTPS Requests

Rate of incoming requests that were deemed valid, HTTPS only. Metric: num_valid_http_requests

Invalid Internet HTTPS Requests

Rate of incoming requests that were deemed invalid, HTTPS only. Metric: num_invalid_http_requests

Valid Internet HTTP Requests

Rate of incoming requests that were deemed valid, HTTP only. Metric: num_valid_http_requests

Invalid Internet HTTP Requests

Rate of incoming requests that were deemed invalid, HTTP only. Metric: num_invalid_http_requests

Prometheus: ACD dashboard

Logged Warnings

Rate of logged warnings since the router has started, divided into CDN-related and CDN-unrelated. Metric: num_log_warnings_total

Logged Errors

Rate of logged errors since the router has started. Metric: num_log_errors_total

HTTP Requests

Rate of responses sent to incoming connections. Metric: orc_latency_count

Number Of Active Sessions

Number of sessions opened on router that are still active. Metric: num_sessions

Total Number Of Sessions

Total number of sessions opened on router. Metric: num_sessions

Session Type Counts (Non-Stacked)

Number of active sessions divided by type; see metric documentation linked below for up-to-date list of types. Metric: num_sessions

Prometheus/ACD: Subrunners

Client Connections

Number of currently open client connections per subrunner. Metric: subrunner_client_conns

Asynchronous Queues (Current)

Number of queued events per subrunner, roughly corresponding to load. Metric: subrunner_async_queue

Used <Send/receive> Data Blocks

Number of send or receive data blocks currently in use per subrunner, as decided by the “Send/receive” drop down box. Metric: subrunner_used_send_data_blocks and subrunner_used_receive_data_blocks

Asynchronous Queues (Max)

Maximum number of events waiting in queue. Metric: subrunner_max_async_queue

Total <Send/receive> Data Blocks

Number of send or receive data blocks allocated per subrunner, as decided by the “Send/receive” drop down box. Metric: subrunner_total_send_data_blocks and subrunner_total_receive_data_blocks

Low Queue (Current)

Number of low priority events queued per subrunner. Metric: subrunner_low_queue

Medium Queue (Current)

Number of medium priority events queued per subrunner. Metric: subrunner_medium_queue

High Queue (Current)

Number of high priority events queued per subrunner. Metric: subrunner_high_queue

Low Queue (Max)

Maximum number of events waiting in low priority queue. Metric: subrunner_max_low_queue

Medium Queue (Max)

Maximum number of events waiting in medium priority queue. Metric: subrunner_max_medium_queue

High Queue (Max)

Maximum number of events waiting in high priority queue. Metric: subrunner_max_high_queue

Wakeups

The number of times a subrunner has been woken up from sleep. Metric: subrunner_io_wakeups

Overloaded

The number of times the number of queued events for a subrunner exceeded its maximum. Metric: subrunner_times_worker_overloaded

Autopause

Number of sockets that have been automatically paused. This happens when the work manager is under heavy load. Metric: subrunner_io_autopause_sockets

4 - Alarms and Alerting

Configuring alarms and alerting

Alerts are generated by the third-party service Prometheus, which sends them to the Alertmanager service. A default containerized instance of Alertmanager is deployed alongside ESB3024 Router. Out of the box, Alertmanager ships with only a sample configuration file and requires manual configuration before alerting can be enabled. Because there are many ways to configure both how alerts are detected and where they are pushed, follow the official Alertmanager documentation when configuring the service.

The router ships with Alertmanager 0.25, the documentation for which can be found at prometheus.io. The Alertmanager configuration file can be found on the host at /opt/edgeware/acd/alertmanager/alertmanager.yml.

Accessing Alertmanager

Alertmanager has a web interface that is listening for HTTP connections on port 9093. There is no authentication, so anyone who has access to the host that is running Alertmanager can access the interface.

Starting / Stopping Alertmanager

After the service is configured, it can be managed via systemd, under the service unit acd-alertmanager.

systemctl start acd-alertmanager

Logging

The container logs are automatically published to the system journal, under the same unit descriptor, and can be viewed using journalctl:

journalctl -u acd-alertmanager

5 - Monitoring multiple routers

By default, an instance of Prometheus monitors only the ESB3024 Router that is installed on the same host. It is possible to make it monitor other router instances as well and to visualize all of them in one Grafana instance.

Configuring Prometheus

This is done in the Prometheus scrape configuration, found in the file /opt/edgeware/acd/prometheus/prometheus.yaml. The file typically looks like this:

global:
  scrape_interval:     15s

rule_files:
  - recording-rules.yaml

# A scrape configuration for router metrics
scrape_configs:
  - job_name: 'router-scraper'
    scheme: https
    tls_config:
      insecure_skip_verify: true
    static_configs:
    - targets:
      - acd-router-1:5001
    metrics_path: /m1/v1/metrics
    honor_timestamps: true
  - job_name: 'edns-proxy-scraper'
    scheme: http
    static_configs:
    - targets:
      - acd-router-1:8888
    metrics_path: /metrics
    honor_timestamps: true

More routers can be added to the scrape configuration by simply adding more routers under targets in the scraper jobs.

For instance, to monitor acd-router-2 and acd-router-3 alongside acd-router-1, the configuration file needs to be modified like this:

global:
  scrape_interval:     15s

rule_files:
  - recording-rules.yaml

# A scrape configuration for router metrics
scrape_configs:
  - job_name: 'router-scraper'
    scheme: https
    tls_config:
      insecure_skip_verify: true
    static_configs:
    - targets:
      - acd-router-1:5001
      - acd-router-2:5001
      - acd-router-3:5001
    metrics_path: /m1/v1/metrics
    honor_timestamps: true
  - job_name: 'edns-proxy-scraper'
    scheme: http
    static_configs:
    - targets:
      - acd-router-1:8888
      - acd-router-2:8888
      - acd-router-3:8888
    metrics_path: /metrics
    honor_timestamps: true

After the file has been modified, Prometheus needs to be restarted by typing:

systemctl restart acd-prometheus

It is possible to use the same configuration on multiple routers, so that all routers in a deployment can monitor each other.

Selecting router in Grafana

In the top left corner, the Grafana dashboards have a drop-down menu labeled “ACD Router”, which lets you choose which router to monitor.

6 - Routing Rule Evaluation Metrics

Node Visit counters

ESB3024 Router counts the number of times each node, or any of its children, is selected in the routing table.

The visit counters can be retrieved with the following endpoints:

/v1/node_visits

  • Returns visit counters for each node as a flat list of host:counter pairs in JSON.

  • Example output:

    {
      "node1": "1",
      "node2": "1",
      "node3": "1",
      "top": "3"
    }
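Note that the counters are returned as JSON strings, not numbers. The sketch below converts them to integers and computes how a parent's visits are distributed over its children. The flat list does not say which node is a parent of which, so the caller has to supply that relationship; treating top as the parent of the other three nodes is an assumption made for this example.

```python
import json

# Example response body from /v1/node_visits, as shown above.
SAMPLE = '{"node1": "1", "node2": "1", "node3": "1", "top": "3"}'

def parse_node_visits(body):
    """Convert the string counters into integers."""
    return {node: int(count) for node, count in json.loads(body).items()}

def visit_share(visits, parent, children):
    """Fraction of the parent's visits that went to each child node."""
    total = visits[parent]
    return {child: (visits[child] / total if total else 0.0) for child in children}

visits = parse_node_visits(SAMPLE)
print(visits)
print(visit_share(visits, "top", ["node1", "node2", "node3"]))
```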
    

/v1/node_visits_graph

  • Returns a full graph of nodes with their respective visit counters in GraphML.

  • Example output:

    <?xml version="1.0"?>
    <graphml xmlns="http://graphml.graphdrawing.org/xmlns"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns
    http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
      <key id="visits" for="node" attr.name="visits" attr.type="string" />
      <graph id="G" edgedefault="directed">
        <node id="routing_table">
          <data key="visits">5</data>
        </node>
        <node id="cdn1">
          <data key="visits">1</data>
        </node>
        <node id="node1">
          <data key="visits">1</data>
        </node>
        <node id="cdn2">
          <data key="visits">2</data>
        </node>
        <node id="node2">
          <data key="visits">2</data>
        </node>
        <node id="cdn3">
          <data key="visits">2</data>
        </node>
        <node id="node3">
          <data key="visits">2</data>
        </node>
        <edge id="e0" source="cdn1" target="node1" />
        <edge id="e1" source="routing_table" target="cdn1" />
        <edge id="e2" source="cdn2" target="node2" />
        <edge id="e3" source="routing_table" target="cdn2" />
        <edge id="e4" source="cdn3" target="node3" />
        <edge id="e5" source="routing_table" target="cdn3" />
      </graph>
    </graphml>
    
  • To receive the graph as JSON, specify Accept: application/json in the request headers.

  • Example output:

    {
      "edges": [
        {
          "source": "cdn1",
          "target": "node1"
        },
        {
          "source": "routing_table",
          "target": "cdn1"
        },
        {
          "source": "cdn2",
          "target": "node2"
        },
        {
          "source": "routing_table",
          "target": "cdn2"
        },
        {
          "source": "cdn3",
          "target": "node3"
        },
        {
          "source": "routing_table",
          "target": "cdn3"
        }
      ],
      "nodes": [
        {
          "id": "routing_table",
          "visits": "5"
        },
        {
          "id": "cdn1",
          "visits": "1"
        },
        {
          "id": "node1",
          "visits": "1"
        },
        {
          "id": "cdn2",
          "visits": "2"
        },
        {
          "id": "node2",
          "visits": "2"
        },
        {
          "id": "cdn3",
          "visits": "2"
        },
        {
          "id": "node3",
          "visits": "2"
        }
      ]
    }
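The JSON graph can be turned into an indented tree for quick inspection. This is a minimal sketch, assuming the graph is a tree rooted at routing_table as in the example above; it does not guard against cycles.

```python
import json

# The JSON graph shown above, written compactly.
SAMPLE = """
{"edges": [{"source": "cdn1", "target": "node1"},
           {"source": "routing_table", "target": "cdn1"},
           {"source": "cdn2", "target": "node2"},
           {"source": "routing_table", "target": "cdn2"},
           {"source": "cdn3", "target": "node3"},
           {"source": "routing_table", "target": "cdn3"}],
 "nodes": [{"id": "routing_table", "visits": "5"},
           {"id": "cdn1", "visits": "1"}, {"id": "node1", "visits": "1"},
           {"id": "cdn2", "visits": "2"}, {"id": "node2", "visits": "2"},
           {"id": "cdn3", "visits": "2"}, {"id": "node3", "visits": "2"}]}
"""

def load_graph(body):
    """Return (visits per node id, children per node id) from the JSON graph."""
    graph = json.loads(body)
    visits = {node["id"]: int(node["visits"]) for node in graph["nodes"]}
    children = {}
    for edge in graph["edges"]:
        children.setdefault(edge["source"], []).append(edge["target"])
    return visits, children

def render_tree(node, visits, children, depth=0):
    """Return one 'id: visits' line per node, indented by depth."""
    lines = [f"{'  ' * depth}{node}: {visits[node]}"]
    for child in sorted(children.get(node, [])):
        lines.extend(render_tree(child, visits, children, depth + 1))
    return lines

visits, children = load_graph(SAMPLE)
print("\n".join(render_tree("routing_table", visits, children)))
```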
    

Resetting Visit Counters

A node visit counter with an id not matching any node id of a newly applied routing table is destroyed.

Reset all counters to zero by momentarily applying a configuration with a placeholder routing root node that has a unique id and an empty members list, e.g.:

"routing": {
  "id": "empty_routing_table",
  "members": []
}

… and immediately reapply the desired configuration.
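The placeholder configuration can be derived from the desired one in a script. The sketch below is illustrative only: it assumes the configuration is available as a dictionary with a top-level routing key, and that nested nodes carry an id and an optional members list. The sample configuration is hypothetical and heavily simplified, and how a configuration is applied is deliberately left out.

```python
import copy

def node_ids(node):
    """Collect the id of a node and of all nodes nested under 'members'."""
    ids = {node["id"]}
    for member in node.get("members", []):
        ids |= node_ids(member)
    return ids

def placeholder_config(config, placeholder_id="empty_routing_table"):
    """Return a copy of config whose routing root is an empty placeholder node."""
    # The placeholder id must not match any existing node id,
    # otherwise that node's counter would survive the reset.
    if placeholder_id in node_ids(config["routing"]):
        raise ValueError("placeholder id collides with an existing node id")
    reset = copy.deepcopy(config)
    reset["routing"] = {"id": placeholder_id, "members": []}
    return reset

# Hypothetical, heavily simplified configuration; real ones carry many more keys.
desired = {"routing": {"id": "routing_table", "members": [{"id": "cdn1"}]}}
reset = placeholder_config(desired)
print(reset)
```

The reset dictionary would be applied first and the desired one right after, using whatever mechanism the deployment normally uses to apply configurations.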

7 - Metrics

Metrics endpoint

ESB3024 Router collects a large number of metrics that can give insight into its condition at runtime. Those metrics are available in the Prometheus text-based exposition format at the endpoint :5001/m1/v1/metrics.

Below is the description of these metrics along with their labels.
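For ad hoc inspection, the text format can be parsed with a few lines of code. The following is a simplified sketch rather than a complete parser: it skips comment lines, ignores optional timestamps and does not unescape label values. The sample text is hand-written; the metric and label names come from this page, but the values are made up.

```python
import re

# name, optional {labels}, then the sample value
SAMPLE_LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# one label="value" pair inside the braces
LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')

def parse_metrics(text):
    """Return (name, labels, value) tuples for every sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_LINE.match(line)
        if match:
            name, raw_labels, value = match.groups()
            labels = dict(LABEL.findall(raw_labels or ""))
            samples.append((name, labels, float(value)))
    return samples

# Hand-written sample; the values are made up.
SAMPLE = '''
# TYPE num_requests counter
num_requests{selector="initial"} 120
num_requests{selector="instream"} 30
lua_num_errors 0
'''

for sample in parse_metrics(SAMPLE):
    print(sample)
```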

client_response_status

Number of responses sent back to incoming requests.

lua_num_errors

Number of errors encountered when evaluating Lua rules.

  • Type: counter

lua_num_evaluators

Number of Lua rules evaluators (active interpreters).

lua_time_spent

Time spent by running Lua evaluators, in microseconds.

  • Type: counter

num_configuration_changes

Number of times configuration has been changed since the router has started.

  • Type: counter

num_endpoint_requests

Number of requests redirected per CDN endpoint.

  • Type: counter
  • Labels:
    • endpoint - CDN endpoint address.
    • selector - whether the request was counted during initial or instream selection.

num_invalid_http_requests

Number of client requests that use either the wrong method or the wrong URL path, plus all requests that cannot be parsed as HTTP.

  • Type: counter
  • Labels:
    • source - name of internal filter function that classified request as invalid. Probably not of much use outside debugging.
    • type - whether the request was HTTP (Unencrypted) or HTTPS (SSL).

num_log_errors_total

Number of logged errors since the router has started.

  • Type: counter

num_log_warnings_total

Number of logged warnings since the router has started.

  • Type: counter

num_managed_redirects

Number of redirects to the router itself, which allows session management.

  • Type: counter

num_manifests

Number of cached manifests.

  • Type: gauge
  • Labels:
    • count - state of manifest in cache, can be either lru, evicted or total.

num_qoe_losses

Number of “lost” QoE decisions per CDN.

  • Type: counter
  • Labels:
    • cdn_id - ID of CDN that lost the QoE battle.
    • cdn_name - name of CDN that lost the QoE battle.
    • selector - whether the decision was taken during initial or instream selection.

num_qoe_wins

Number of “won” QoE decisions per CDN.

  • Type: counter
  • Labels:
    • cdn_id - ID of CDN that won QoE battle.
    • cdn_name - name of CDN that won QoE battle.
    • selector - whether the decision was taken during initial or instream selection.

num_rejected_requests

Deprecated, should always be at 0.

  • Type: counter
  • Labels:
    • selector - whether the request was counted during initial or instream selection.

num_requests

Total number of requests received by the router.

  • Type: counter
  • Labels:
    • selector - whether the request was counted during initial or instream selection.
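Counters such as num_requests only ever increase while the router is running, so they are usually read as a rate between two scrapes. A minimal sketch of that calculation follows; it assumes that a counter going backwards means the router restarted and the counter began again from zero. The 15 second interval matches the scrape_interval of the default Prometheus configuration, and the counter values are made up.

```python
def counter_rate(previous, current, interval_seconds):
    """Per-second increase of a counter between two scrapes.

    If the counter went backwards, the router is assumed to have restarted
    and the counter to have started again from zero.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    increase = current - previous if current >= previous else current
    return increase / interval_seconds

print(counter_rate(1200, 1500, 15))  # 300 requests in 15 seconds
print(counter_rate(1500, 30, 15))    # counter reset between the two scrapes
```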

num_sessions

Number of sessions opened on router.

  • Type: gauge
  • Labels:
    • state - either active or inactive.
    • type - one of: initial, instream, qoe_on, qoe_off, qoe_agent or sp_agent.

num_ssl_errors_total

Number of all errors logged during TLS connections, both incoming and outgoing.

  • Type: counter

num_ssl_warnings_total

Number of all warnings logged during TLS connections, both incoming and outgoing.

  • Type: counter
  • Labels:
    • category - which kind of TLS connection triggered the warning. Can be one of: cdn, content, generic, repeated_session or empty.

num_unhandled_requests

Number of requests for which no CDN could be found.

  • Type: counter
  • Labels:
    • selector - whether the request was counted during initial or instream selection.

num_unmanaged_redirects

Number of redirects to “outside” the router - usually to CDN.

  • Type: counter
  • Labels:
    • cdn_id - ID of CDN picked for redirection.
    • cdn_name - name of CDN picked for redirection.
    • selector - whether the redirect was result of initial or instream selection.

num_valid_http_requests

Number of received requests that were not deemed invalid, see num_invalid_http_requests.

  • Type: counter
  • Labels:
    • source - name of internal filter function that classified the request. Probably not of much use outside debugging.
    • type - whether the request was HTTP (Unencrypted) or HTTPS (SSL).

orc_latency_bucket

Total number of responses sorted into “latency buckets” - labels denoting latency interval.

  • Type: counter
  • Labels:
    • le - latency bucket that given response falls into.
    • orc_status_code - HTTP status code of given response.
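A latency quantile can be estimated from the buckets. The sketch below assumes the buckets are cumulative, as in a standard Prometheus histogram, meaning that each count includes all responses with a latency up to the le bound. It interpolates linearly inside the bucket that contains the target rank, in the spirit of PromQL's histogram_quantile(). The bucket bounds and counts are made up.

```python
import math

def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                return lower_bound  # cannot interpolate into the +Inf bucket
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            # Linear interpolation between the bucket's lower and upper bound.
            return lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / in_bucket
        lower_bound, lower_count = upper_bound, count
    return lower_bound

# Made-up cumulative buckets: (le, count)
BUCKETS = [(1.0, 50), (5.0, 90), (10.0, 100), (math.inf, 100)]
print(estimate_quantile(0.5, BUCKETS))
print(estimate_quantile(0.95, BUCKETS))
```

The result is expressed in the same unit as the le bounds.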

orc_latency_count

Total number of responses.

  • Type: counter
  • Labels:
    • tls - whether the response was sent via SSL/TLS connection or not.
    • orc_status_code - HTTP status code of given response.

ssl_certificate_days_remaining

Number of days until an SSL certificate expires.

  • Type: gauge
  • Labels:
    • domain - the common name of the domain that the certificate authenticates.
    • not_valid_after - the expiry time of the certificate.
    • not_valid_before - when the certificate starts being valid.
    • usable - if the certificate is usable to the router, see the ssl_certificate_usable_count metric for an explanation.
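This metric is a natural candidate for an expiry check. The sketch below lists certificates that expire within a warning window, given (labels, value) pairs for the metric. The 30 day window and the example domains are arbitrary choices for illustration.

```python
def expiring_certificates(samples, warn_days=30):
    """Return (domain, days_remaining) for certificates close to expiry.

    `samples` is an iterable of (labels, value) pairs for the
    ssl_certificate_days_remaining metric, soonest expiry first in the result.
    """
    soon = [(labels.get("domain", "<unknown>"), days)
            for labels, days in samples if days <= warn_days]
    return sorted(soon, key=lambda item: item[1])

# Made-up samples; only the 'domain' label is used here.
SAMPLES = [
    ({"domain": "router.example.com"}, 12.0),
    ({"domain": "cdn.example.com"}, 200.0),
    ({"domain": "edge.example.com"}, 5.0),
]
print(expiring_certificates(SAMPLES))
```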

ssl_certificate_usable_count

Number of usable SSL certificates. A certificate is usable if it is valid and authenticates a domain name that points to the router.

  • Type: gauge

7.1 - Internal Metrics

A subrunner is an internal module of ESB3024 Router which handles routing requests. The subrunner metrics are technical and mainly of interest for Agile Content. These metrics will be briefly described here.

subrunner_async_queue

Number of queued events per subrunner, roughly corresponding to load.

  • Type: gauge
  • Labels:
    • subrunner_id - ID of given subrunner.

subrunner_client_conns

Number of currently open client connections per subrunner.

  • Type: gauge
  • Labels:
    • subrunner_id - ID of given subrunner.

subrunner_high_queue

Number of high priority events queued per subrunner.

  • Type: gauge
  • Labels:
    • subrunner_id - ID of given subrunner.

subrunner_io_autopause_sockets

Number of sockets that have been automatically paused. This happens when the work manager is under heavy load.

  • Type: counter
  • Labels:
    • subrunner_id - ID of given subrunner.

subrunner_io_send_data_fast_attempts

A fast data path was added that in many cases increases the performance of the router. This metric was added to verify that the fast data path is taken.

  • Type: counter
  • Labels:
    • subrunner_id - ID of given subrunner.

subrunner_io_wakeups

The number of times a subrunner has been woken up from sleep.

  • Type: counter
  • Labels:
    • subrunner_id - ID of given subrunner.

subrunner_low_queue

Number of low priority events queued per subrunner.

  • Type: gauge
  • Labels:
    • subrunner_id - ID of given subrunner.

subrunner_max_async_queue

Maximum number of events waiting in queue.

  • Type: gauge
  • Labels:
    • subrunner_id - ID of given subrunner.

subrunner_max_high_queue

Maximum number of events waiting in high priority queue.

  • Type: gauge
  • Labels:
    • subrunner_id - ID of given subrunner.

subrunner_max_low_queue

Maximum number of events waiting in low priority queue.

  • Type: gauge
  • Labels:
    • subrunner_id - ID of given subrunner.

subrunner_max_medium_queue

Maximum number of events waiting in medium priority queue.

  • Type: gauge
  • Labels:
    • subrunner_id - ID of given subrunner.

subrunner_medium_queue

Number of medium priority events queued per subrunner.

  • Type: gauge
  • Labels:
    • subrunner_id - ID of given subrunner.

subrunner_times_worker_overloaded

Number of times the number of queued events for a given subrunner exceeded the tuning.overload_threshold value (defaults to 32).

  • Type: counter
  • Labels:
    • subrunner_id - ID of given subrunner.

subrunner_total_receive_data_blocks

Number of receive data blocks allocated per subrunner.

  • Type: gauge
  • Labels:
    • subrunner_id - ID of given subrunner.

subrunner_total_send_data_blocks

Number of send data blocks allocated per subrunner.

  • Type: gauge
  • Labels:
    • subrunner_id - ID of given subrunner.

subrunner_used_receive_data_blocks

Number of receive data blocks currently in use per subrunner. Same as subrunner_total_receive_data_blocks.

  • Type: gauge
  • Labels:
    • subrunner_id - ID of given subrunner.

subrunner_used_send_data_blocks

Number of send data blocks currently in use per subrunner. Same as subrunner_total_send_data_blocks.

  • Type: gauge
  • Labels:
    • subrunner_id - ID of given subrunner.