SLA Anomaly Detection & Correlation
The Lumeus AI/ML Engine proactively monitors application, device, and infrastructure components to provide meaningful insights into Day-2 network operations and application performance.
Some of the characteristics and functions of the AI/ML engine are:
Agent-less collection of data using APIs from network and security controllers, Application Performance Monitoring (APM) tools, and NetFlow/sFlow records
Proactive monitoring using SLA Anomaly Detection + Correlation to identify critical application, device, and infrastructure performance issues
Optimized operations and troubleshooting by correlating logs across vendors and platforms to distinguish network, service provider, and application issues
Reduced resolution time through context-based alerting to incident management systems when an issue needs to be escalated. Relevant information is attached to each ticket, which reduces management costs
Greater visibility and collaboration across IT and Security teams by addressing gaps across application, platform, and cloud workflows
SLA Anomaly Detection and Correlation
Lumeus' proprietary Anomaly Detection algorithms help identify key issues in your applications, devices, and infrastructure.
Anomaly detection is done on top of metric data. Below are some of the metrics monitored:
Metric | Parameters | Source |
---|---|---|
Application Performance SLA | Latency, Jitter, Packet loss | SD-WAN Gateways, Lumeus Gateways |
Network/Cloud WAN Gateways usage | Sent/received rate, Bytes, Drops | CSP Gateway e.g. AWS Transit Gateway, Azure ExpressRoute Gateway |
Port, Interfaces, and Tunnel Metrics | Available Bandwidth, Port Counters | On-Prem firewall, Routers |
Routing Protocol Metrics | BGP, OSPF Counters | On-Prem WAN Gateways |
Application Metrics | HTTP 5xx errors, Requests/Sessions received, etc. | APM tools e.g. Azure App Insights |
User-based Metrics | Name, Source IP, Location, etc. | Lumeus Gateways, Vendor Firewalls |
The algorithms use unsupervised learning to dynamically baseline normal behavior, meaning there is no need to set static thresholds or baselines.
The thresholds are calculated for each device or group of devices (segments) at a global level and correlated to find any deviations.
The engine uses historical data available from vendor controllers for its initial training, so Lumeus can produce meaningful insights within a few hours.
The algorithms are built on top of existing open-source algorithms, with customizations and enhancements.
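The idea of a dynamic baseline can be illustrated with a minimal sketch. It is not the Lumeus algorithm, which is proprietary; it assumes a rolling median with a band derived from the median absolute deviation (MAD), warmed up on historical samples. The names `RollingBaseline` and `detect`, and the parameters `window`, `k`, and `min_samples`, are hypothetical.

```python
from collections import deque
from statistics import median


class RollingBaseline:
    """Hypothetical dynamic baseline: rolling median with a MAD-based band."""

    def __init__(self, window=48, k=4.0, min_samples=12):
        self.values = deque(maxlen=window)  # most recent samples only
        self.k = k                          # band width, in robust deviations
        self.min_samples = min_samples      # samples needed before scoring

    def is_anomalous(self, value):
        """Score a sample against the current baseline, then learn from it."""
        anomalous = False
        if len(self.values) >= self.min_samples:
            center = median(self.values)
            mad = median(abs(v - center) for v in self.values)
            # 1.4826 scales MAD to a standard deviation for normal data;
            # the floor keeps the band from collapsing to zero width.
            spread = max(1.4826 * mad, 1e-9)
            anomalous = abs(value - center) > self.k * spread
        self.values.append(value)
        return anomalous


def detect(history, live):
    """history and live map a device name to a list of metric samples."""
    flagged = []
    for device, samples in live.items():
        baseline = RollingBaseline()  # one baseline per device
        for v in history.get(device, []):
            baseline.is_anomalous(v)  # warm up on historical data
        for i, v in enumerate(samples):
            if baseline.is_anomalous(v):
                flagged.append((device, i, v))
    return flagged


history = {"branch-gw-1": [20, 21, 19, 20, 22, 21, 20, 19, 21, 20, 22, 20]}
live = {"branch-gw-1": [21, 95, 20]}
print(detect(history, live))
```

Because the band is recomputed from recent samples, no static latency threshold is configured anywhere. A device without enough history is simply not scored until `min_samples` values have been seen.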
Escalations
One of the biggest challenges with any AI-based anomaly detection system is Alert Fatigue. At Lumeus, we have made dedicated efforts to reduce the number of alerts generated. An Alert is generated only when an anomaly turns into an Escalation.
The engine uses various rules to classify an anomaly as an escalation.
Long-running anomalies: These issues persist for a long duration. Users can tune the duration threshold to match their own escalation criteria.
Periodic short-lived anomalies: These anomalies are short-lived but occur frequently during the day. For example, faulty wiring or a cellular network can cause periodic spikes in latency that require the user's attention for further evaluation.
Whenever a user edits any of the above parameters, the system shows the expected future Escalations based on historical data, which helps in fine-tuning the parameters.
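The two escalation rules and the parameter preview can be sketched as follows. This is an illustration under stated assumptions, not the engine's actual rule set: the `Anomaly` type, the functions `escalations` and `preview`, and the parameters `min_duration`, `min_count`, and `window` (with their default values) are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Anomaly:
    start: datetime
    end: datetime


def escalations(anomalies, min_duration=timedelta(minutes=30),
                min_count=5, window=timedelta(hours=24)):
    """Return (reason, anomaly) pairs for anomalies that would escalate."""
    out = []
    short_starts = []  # start times of recent short-lived anomalies
    for a in sorted(anomalies, key=lambda x: x.start):
        if a.end - a.start >= min_duration:
            out.append(("long_running", a))
            continue
        # Keep only short anomalies that fall inside the sliding window.
        short_starts = [s for s in short_starts if a.start - s < window]
        short_starts.append(a.start)
        if len(short_starts) >= min_count:
            out.append(("periodic", a))
            short_starts = []  # start counting again after escalating
    return out


def preview(history, **params):
    """Replay historical anomalies to count escalations per rule."""
    return Counter(reason for reason, _ in escalations(history, **params))


t0 = datetime(2024, 1, 1)
history = [Anomaly(t0, t0 + timedelta(minutes=45))]
history += [
    Anomaly(t0 + timedelta(hours=h), t0 + timedelta(hours=h, minutes=2))
    for h in range(2, 7)
]
print(preview(history))               # default parameters
print(preview(history, min_count=6))  # stricter periodic rule
```

Replaying the same history with different parameter values shows how a change would affect the number of escalations before it is applied.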
Alerts
Whenever an escalation is detected, the engine automatically creates a ticket in your Incident Management System.
All contextual information is added to the ticket, e.g. applications impacted, site(s) affected, service provider, and relevant logs.
Whenever anomalies are cleared, the ticket is automatically resolved with appropriate comments.
If the same issue occurs again within the same day, the existing ticket is re-opened instead of creating a new one. This way, the issue history is automatically available on the ticket.
Users can also subscribe to daily/weekly summary email reports for all alerts generated.
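The ticket lifecycle described above (create on escalation, resolve on clear, re-open on recurrence) can be sketched with an in-memory tracker. This is a hypothetical model and not an integration with any real incident management API; `Ticket`, `TicketTracker`, and `issue_key` are illustrative names, and "within the same day" is assumed to mean the same calendar day.

```python
from dataclasses import dataclass, field
from datetime import date, datetime


@dataclass
class Ticket:
    issue_key: str
    opened_on: date
    status: str = "open"
    comments: list = field(default_factory=list)


class TicketTracker:
    """Hypothetical in-memory stand-in for an incident management system."""

    def __init__(self):
        self.tickets = {}  # issue_key -> most recent Ticket

    def on_escalation(self, issue_key, when, context):
        ticket = self.tickets.get(issue_key)
        if ticket and ticket.opened_on == when.date():
            # Same issue, same calendar day: reuse the existing ticket.
            verb = "re-opened" if ticket.status == "resolved" else "updated"
            ticket.status = "open"
            ticket.comments.append(f"{when.isoformat()} {verb}: {context}")
            return ticket
        ticket = Ticket(issue_key, when.date())
        ticket.comments.append(f"{when.isoformat()} opened: {context}")
        self.tickets[issue_key] = ticket
        return ticket

    def on_cleared(self, issue_key, when):
        ticket = self.tickets.get(issue_key)
        if ticket and ticket.status == "open":
            ticket.status = "resolved"
            ticket.comments.append(f"{when.isoformat()} resolved: anomaly cleared")
        return ticket


tracker = TicketTracker()
ctx = {"site": "branch-1", "application": "crm", "provider": "isp-a"}
first = tracker.on_escalation("branch-1/latency", datetime(2024, 1, 1, 9), ctx)
tracker.on_cleared("branch-1/latency", datetime(2024, 1, 1, 10))
again = tracker.on_escalation("branch-1/latency", datetime(2024, 1, 1, 15), ctx)
print(again is first, again.status, len(again.comments))
```

Keeping every state change as a comment on the one ticket is what makes the issue history available without searching for related tickets.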