SLA Anomaly Detection & Correlation

The Lumeus AI/ML Engine proactively monitors application, device, and infrastructure components to provide meaningful insights into Day-2 network operations and application performance.

Some of the characteristics and functions of the AI/ML engine are:

- Agent-less data collection through APIs from network and security controllers, Application Performance Monitoring (APM) tools, and NetFlow/sFlow records
- Proactive monitoring that uses SLA anomaly detection and correlation to identify critical application, device, and infrastructure performance issues
- Streamlined operations and troubleshooting: logs are correlated across vendors and platforms to distinguish network, service provider, and application issues
- Faster resolution through context-based alerting to incident management systems when an issue needs to be escalated. Relevant information is attached to each ticket, which reduces management costs
- Greater visibility and collaboration across IT and Security teams by addressing gaps across application, platform, and cloud workflows

SLA Anomaly Detection and Correlation

Lumeus' proprietary Anomaly Detection algorithms help identify key issues in your applications, devices, and infrastructure.

Anomaly detection is done on top of metric data. Below are some of the metrics monitored:

| Metric | Parameters | Source |
| --- | --- | --- |
| Application Performance SLA | Latency, jitter, packet loss | SD-WAN gateways, Lumeus Gateways |
| Network/Cloud WAN gateway usage | Sent/received rate, bytes, drops | CSP gateways, e.g. AWS Transit Gateway, Azure ExpressRoute Gateway |
| Port, interface, and tunnel metrics | Available bandwidth, port counters | On-prem firewalls, routers |
| Routing protocol metrics | BGP, OSPF counters | On-prem WAN gateways |
| Application metrics | HTTP 5xx errors, requests/sessions received, etc. | APM tools, e.g. Azure App Insights |
| User-based metrics | Name, source IP, location, etc. | Lumeus Gateways, vendor firewalls |

- The algorithms use unsupervised learning to baseline normal behavior dynamically, so there is no need to set static thresholds or baselines.
- Thresholds are calculated per device or per group of devices (segment) at a global level, then correlated to find deviations.
- Initial training uses historical data available from vendor controllers, so Lumeus can produce meaningful insights within a few hours.
- The algorithms are built on top of existing open-source algorithms, with customizations and enhancements.
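
To make the idea of dynamic baselining concrete, here is a minimal sketch. It is not the Lumeus implementation, which is proprietary; it assumes a simple rolling median with a median absolute deviation (MAD) band as a stand-in. The function name `rolling_mad_anomalies`, the `window` and `k` parameters, and the sample latency values are all hypothetical.

```python
from statistics import median

def rolling_mad_anomalies(values, window=12, k=3.0):
    """Flag points that deviate from a rolling median baseline.

    Illustrative stand-in only: the baseline is the median of the previous
    `window` samples, and the allowed band is k scaled MADs around it.
    """
    flags = []
    for i, x in enumerate(values):
        history = values[max(0, i - window):i]
        if len(history) < window:
            flags.append(False)  # not enough history to form a baseline yet
            continue
        base = median(history)
        mad = median(abs(h - base) for h in history)
        # 1.4826 makes MAD comparable to a standard deviation for normal data;
        # the small floor avoids a zero-width band on perfectly flat history.
        spread = max(1.4826 * mad, 1e-9)
        flags.append(abs(x - base) > k * spread)
    return flags

# Hypothetical latency samples in milliseconds, with one spike at index 13.
latency_ms = [20, 21, 19, 20, 22, 20, 21, 19, 20, 21, 20, 22, 21, 95, 20, 21]
flags = rolling_mad_anomalies(latency_ms)
anomalous = [i for i, flagged in enumerate(flags) if flagged]
print(anomalous)  # [13]
```

No threshold is configured by hand here: the band is derived from the recent history of the metric itself, which is the property the bullets above describe.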

Escalations

One of the biggest challenges with any AI-based anomaly detection system is Alert Fatigue. At Lumeus, we have made dedicated efforts to reduce the number of alerts generated. An Alert is generated only when an anomaly turns into an Escalation.

The engine uses the following rules to classify an anomaly as an escalation:
- Long-running anomalies: issues that persist for a long duration. The duration can be tuned by the user to escalate by custom criteria.
- Periodic short-lived anomalies: anomalies that are short-lived but occur frequently during the day. For example, faulty wiring or a cellular network can cause periodic spikes in latency that require the user's attention for further evaluation.

Whenever a user edits any of the above parameters, the system shows the escalations to expect under the new settings, based on historical data, which helps in fine-tuning the parameters.
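
The two rules and the parameter preview can be sketched as follows. This is an illustration under stated assumptions, not the engine's actual logic: the `Anomaly` type, the function names, and the parameters `long_running`, `short_lived`, and `min_repeats` (with their default values) are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Anomaly:
    start: datetime
    end: datetime

def classify_escalations(anomalies, long_running=timedelta(minutes=30),
                         short_lived=timedelta(minutes=5), min_repeats=4):
    """Return the reasons, if any, that one day's anomalies escalate."""
    reasons = []
    # Rule 1: any single anomaly that lasts at least `long_running`.
    if any(a.end - a.start >= long_running for a in anomalies):
        reasons.append("long_running")
    # Rule 2: at least `min_repeats` short-lived anomalies in the same day.
    short = [a for a in anomalies if a.end - a.start <= short_lived]
    if len(short) >= min_repeats:
        reasons.append("periodic_short_lived")
    return reasons

def preview_escalations(history_by_day, **params):
    """Replay historical anomalies with candidate parameters."""
    preview = {}
    for day, anomalies in history_by_day.items():
        reasons = classify_escalations(anomalies, **params)
        if reasons:
            preview[day] = reasons
    return preview

# Hypothetical history: one 45-minute anomaly, five 2-minute blips, one blip.
t0 = datetime(2024, 1, 1, 9, 0)
history = {
    "day-1": [Anomaly(t0, t0 + timedelta(minutes=45))],
    "day-2": [Anomaly(t0 + timedelta(hours=h), t0 + timedelta(hours=h, minutes=2))
              for h in range(5)],
    "day-3": [Anomaly(t0, t0 + timedelta(minutes=2))],
}
default_preview = preview_escalations(history)
stricter_preview = preview_escalations(history, long_running=timedelta(hours=1))
print(default_preview)   # {'day-1': ['long_running'], 'day-2': ['periodic_short_lived']}
print(stricter_preview)  # {'day-2': ['periodic_short_lived']}
```

Raising the long-running threshold to one hour drops the day-1 escalation, which is the kind of before-and-after comparison a preview over historical data makes possible.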

Alerts

Whenever an escalation is detected, the engine automatically creates a ticket in your Incident Management System.

- Contextual information is added to the ticket, e.g. applications impacted, site(s) affected, service provider, and relevant logs.
- When the anomalies clear, the ticket is automatically resolved with appropriate comments.
- If the same issue occurs again within the same day, the existing ticket is re-opened instead of a new one being created. This way, the issue history is automatically available on the ticket.
- Users can also subscribe to daily or weekly summary email reports for all alerts generated.
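
The ticket lifecycle described above (create, resolve, re-open within the same day) can be sketched with an in-memory tracker. This is a hypothetical model, not a real Incident Management System integration: `Ticket`, `TicketTracker`, and the `issue_key` format are invented for illustration, and "same day" is assumed to mean the same calendar date on which the ticket was opened.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Ticket:
    issue_key: str
    opened_at: datetime
    status: str = "open"
    comments: list = field(default_factory=list)

class TicketTracker:
    """In-memory stand-in for an incident management system (hypothetical)."""

    def __init__(self):
        self.tickets = []

    def on_escalation(self, issue_key, now, context):
        # Reuse a ticket for the same issue opened on the same calendar date.
        for ticket in reversed(self.tickets):
            if ticket.issue_key == issue_key and ticket.opened_at.date() == now.date():
                if ticket.status == "resolved":
                    ticket.status = "open"
                    ticket.comments.append(f"{now.isoformat()} re-opened: {context}")
                return ticket
        ticket = Ticket(issue_key, now,
                        comments=[f"{now.isoformat()} opened: {context}"])
        self.tickets.append(ticket)
        return ticket

    def on_cleared(self, issue_key, now):
        # Resolve every open ticket for this issue and record why.
        for ticket in self.tickets:
            if ticket.issue_key == issue_key and ticket.status == "open":
                ticket.status = "resolved"
                ticket.comments.append(f"{now.isoformat()} resolved: anomaly cleared")

tracker = TicketTracker()
key = "site-a/latency"  # hypothetical issue identifier
first = tracker.on_escalation(key, datetime(2024, 1, 1, 9, 0), "apps impacted: crm")
tracker.on_cleared(key, datetime(2024, 1, 1, 9, 40))
again = tracker.on_escalation(key, datetime(2024, 1, 1, 15, 0), "apps impacted: crm")
next_day = tracker.on_escalation(key, datetime(2024, 1, 2, 8, 0), "apps impacted: crm")
print(again is first, next_day is first, len(first.comments))  # True False 3
```

Because the recurrence re-opens the original ticket, its comment list carries the full open, resolve, and re-open history, while the next day's occurrence starts a fresh ticket.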

