SLA Anomaly Detection & Correlation
The Lumeus AI/ML Engine proactively monitors application, device, and infrastructure components to provide meaningful insights into Day-2 network operations and application performance.
Some of the characteristics and functions of the AI/ML engine are:
- Agent-less collection of data using APIs from network and security controllers, Application Performance Monitoring (APM) tools, and NetFlow/sFlow records
- Proactive monitoring using SLA Anomaly Detection + Correlation to identify critical application, device, and infrastructure performance issues
- Optimized operations and troubleshooting by correlating logs across vendors and platforms to distinguish network, service provider, and application issues
- Reduced resolution time through context-based alerting to incident management systems when an issue needs to be escalated. Relevant information is attached to the tickets, which reduces management costs
- Greater visibility and collaboration across IT and Security teams by addressing gaps across application, platform, and cloud workflows
Lumeus' proprietary Anomaly Detection algorithms help identify key issues in your applications, devices, and infrastructure.
Anomaly detection runs on metric data. Some of the monitored metrics, with their parameters and sources, are listed at the end of this page.
The algorithms use unsupervised learning to dynamically baseline normal behavior, meaning there is no need to set static thresholds or baselines.
The thresholds are calculated for each device or group of devices (segments) at a global level and correlated to find any deviations.
The engine uses historical data available from vendor controllers for the initial training, so Lumeus can produce meaningful insights within a few hours.
The algorithms are built on top of existing open-source algorithms, with customizations and enhancements.
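Lumeus' algorithms are proprietary and are not described on this page, so the following is only a generic sketch of dynamic baselining. It assumes a rolling median with a median-absolute-deviation (MAD) band as a stand-in technique. The function name `flag_anomalies` and the `window` and `k` parameters are illustrative assumptions, not Lumeus settings.

```python
from statistics import median


def flag_anomalies(values, window=12, k=3.0):
    """Flag points that deviate from a rolling median baseline.

    Hypothetical helper: `window` (samples of history) and `k` (band width)
    are illustrative and are not taken from the Lumeus product.
    """
    flags = []
    for i, x in enumerate(values):
        history = values[max(0, i - window):i]
        if len(history) < window:
            # Not enough history yet to form a baseline.
            flags.append(False)
            continue
        baseline = median(history)
        mad = median(abs(v - baseline) for v in history)
        # 1.4826 scales MAD to be comparable to a standard deviation for
        # normally distributed data; the floor avoids a zero-width band.
        band = k * max(1.4826 * mad, 1e-9)
        flags.append(abs(x - baseline) > band)
    return flags


# Example: latency samples in milliseconds with one spike.
latency_ms = [20, 21, 19, 20, 22, 20, 21, 19, 20, 21, 20, 22, 21, 95, 20]
flags = flag_anomalies(latency_ms)
```

Because the band is derived from recent history rather than a fixed number, no static threshold has to be configured per device. This mirrors the idea in the text without claiming to match the actual implementation.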
One of the biggest challenges with any AI-based anomaly detection system is Alert Fatigue. At Lumeus, we have made dedicated efforts to reduce the number of alerts generated. An Alert is generated only when an anomaly turns into an Escalation.
The engine uses various rules to classify an anomaly as an escalation.
- Long-running anomalies: These issues persist for a long duration. The user can tune the duration to escalate by custom criteria.
- Periodic short-lived anomalies: These anomalies are short-lived but occur frequently during the day. For example, faulty wiring or a cellular network can cause periodic spikes in latency that need the user's attention for further evaluation.
Whenever a user edits any of the above parameters, the system shows the expected future Escalations based on historical data, which helps in fine-tuning the parameters.
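The two escalation rules can be sketched as follows. This is a minimal illustration under stated assumptions, not the engine's actual rule set: the `Anomaly` type, the `find_escalations` function, and the `long_running` and `min_repeats` parameters are hypothetical names, and "frequently during the day" is assumed to mean a count of short anomalies per metric per calendar day.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Anomaly:
    metric: str
    start: datetime
    end: datetime


def find_escalations(anomalies, long_running=timedelta(minutes=30), min_repeats=5):
    """Return (metric, day, reason) tuples for anomalies that escalate.

    Both thresholds are hypothetical, user-tunable parameters.
    """
    escalations = []
    short_counts = Counter()
    for a in anomalies:
        if a.end - a.start >= long_running:
            # Rule 1: the anomaly persisted for a long duration.
            escalations.append((a.metric, a.start.date(), "long-running"))
        else:
            # Count short-lived anomalies per metric per day for rule 2.
            short_counts[(a.metric, a.start.date())] += 1
    for (metric, day), count in short_counts.items():
        if count >= min_repeats:
            # Rule 2: short-lived anomalies that recur often in one day.
            escalations.append((metric, day, "periodic"))
    return escalations


# Example history: six 2-minute latency spikes and one 45-minute loss event.
day = datetime(2024, 1, 15)
history = [
    Anomaly("latency", day + timedelta(hours=h), day + timedelta(hours=h, minutes=2))
    for h in range(8, 14)
]
history.append(Anomaly("packet_loss", day + timedelta(hours=9),
                       day + timedelta(hours=9, minutes=45)))

default_preview = find_escalations(history)
stricter_preview = find_escalations(history, min_repeats=10)
```

Re-running `find_escalations` over stored historical anomalies with different parameter values gives a rough preview of how many escalations a setting would produce, similar in spirit to the preview described above.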
Whenever an escalation is detected, the engine automatically creates a ticket in your Incident Management System.
All contextual information is added to the ticket, for example the applications impacted, site(s) affected, service provider, and relevant logs.
Whenever anomalies are cleared, the ticket is automatically resolved with appropriate comments.
If the same issue occurs again within the same day, the existing ticket is re-opened instead of creating a new one. This way, the issue history is automatically available on the ticket.
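The ticket lifecycle described above (create, resolve, re-open on same-day recurrence) can be sketched with an in-memory tracker. Everything here is hypothetical: `Ticket`, `TicketTracker`, and the method names are illustrative, no real Incident Management System API is called, and "within the same day" is assumed to mean the same calendar day as the ticket was opened.

```python
from dataclasses import dataclass, field
from datetime import date, datetime


@dataclass
class Ticket:
    issue_key: str
    opened_on: date
    context: dict
    status: str = "open"
    comments: list = field(default_factory=list)


class TicketTracker:
    """In-memory stand-in for an incident management integration."""

    def __init__(self):
        self._latest = {}  # issue_key -> most recent Ticket

    def on_escalation(self, issue_key, when, context):
        ticket = self._latest.get(issue_key)
        if ticket is not None and ticket.opened_on == when.date():
            if ticket.status == "resolved":
                # Same issue recurred on the same day: re-open, keep history.
                ticket.status = "open"
                ticket.comments.append(f"{when.isoformat()} re-opened: issue recurred")
            return ticket
        # First occurrence of this issue today: create a new ticket.
        ticket = Ticket(issue_key, when.date(), context)
        ticket.comments.append(f"{when.isoformat()} opened")
        self._latest[issue_key] = ticket
        return ticket

    def on_cleared(self, issue_key, when):
        ticket = self._latest.get(issue_key)
        if ticket is not None and ticket.status == "open":
            ticket.status = "resolved"
            ticket.comments.append(f"{when.isoformat()} resolved: anomaly cleared")
        return ticket


tracker = TicketTracker()
context = {"site": "branch-01", "application": "crm"}  # illustrative values
first = tracker.on_escalation("latency:branch-01", datetime(2024, 1, 15, 9, 0), context)
tracker.on_cleared("latency:branch-01", datetime(2024, 1, 15, 9, 40))
again = tracker.on_escalation("latency:branch-01", datetime(2024, 1, 15, 15, 0), context)
```

In this sketch `again` is the same object as `first`, so the comment list carries the full open, resolve, and re-open history on one ticket.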
Users can also subscribe to daily/weekly summary email reports for all alerts generated.
Metric | Parameters | Source |
---|---|---|
Application Performance SLA | Latency, Jitter, Packet loss | SD-WAN Gateways, Lumeus Gateways |
Network/Cloud WAN Gateways usage | Sent/received rate, Bytes, Drops | CSP Gateway, e.g. AWS Transit Gateway, Azure ExpressRoute Gateway |
Port, Interfaces, and Tunnel Metrics | Available Bandwidth, Port Counters | On-Prem Firewalls, Routers |
Routing Protocol Metrics | BGP, OSPF Counters | On-Prem WAN Gateways |
Application Metrics | HTTP 5xx errors, Requests/Sessions received, etc. | APM tools, e.g. Azure App Insights |
User-based Metrics | Name, Source IP, Location, etc. | Lumeus Gateways, Vendor Firewalls |