Intelligent Operation and Maintenance

As Huawei Cloud has grown rapidly, the number of Regions, PODs, and servers has expanded quickly. With numerous servers going online, the large volume of monitoring indicators, alarms, and events poses great challenges to O&M. Take the basic network as an example: with tens of thousands of emergency and critical alarms and countless secondary and reminder alarms per month, manual work alone can no longer meet O&M needs. It is becoming increasingly urgent to collect and record indicators through the network monitoring platform in order to improve the efficiency of fault diagnosis and shorten the time needed to close the loop on problems.

To this end, the Algorithm Competence Center of the Ministry of Innovation has conducted research on network-domain O&M algorithms such as log and KPI anomaly detection, alarm compression, and root cause recommendation. On this page, we introduce the progress of this research one by one.

1. Alarm compression algorithm

Commonly used anomaly alerting algorithms are based on a single curve and simple thresholding rules. Such schemes tend to generate a large number of invalid alarms, are difficult to scale and manage, cover anomaly scenarios incompletely, and have low accuracy. It is therefore necessary to consider the shape of the curve, build a more accurate and robust anomaly detector, and correlate its outputs into events that are meaningful to human operators. To balance alarm accuracy, robustness, and real-time performance against end-to-end manual processing, we first cluster the alarm curves to reduce dimensionality, then apply the anomaly detector to achieve accurate detection, and finally use global information to associate otherwise independent alarms. This algorithm achieves an effective alarm compression rate of XX% and reduces the false alarm rate by XX%, which supports the improvement of O&M efficiency.
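The three stages described above (cluster the curves, detect on each cluster, associate the results into one event) can be sketched roughly as follows. This is only an illustration under stated assumptions: the greedy correlation-based clustering, the z-score detector, the thresholds, and all function and curve names are hypothetical stand-ins, not the algorithm actually deployed.

```python
from statistics import mean, pstdev

def pearson(a, b):
    """Pearson correlation of two equal-length series (0.0 if either is flat)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / (va * vb) ** 0.5

def cluster_curves(curves, min_corr=0.9):
    """Greedy shape clustering: a curve joins the first cluster whose
    representative (first member) it correlates with strongly enough."""
    clusters = []
    for name, series in curves.items():
        for members in clusters:
            if pearson(curves[members[0]], series) >= min_corr:
                members.append(name)
                break
        else:
            clusters.append([name])
    return clusters

def detect_anomalies(series, baseline_len, z_threshold=3.0):
    """Indices after the baseline window whose z-score against the
    baseline exceeds the threshold (assumed stand-in detector)."""
    base = series[:baseline_len]
    mu = mean(base)
    sigma = pstdev(base) or 1e-9  # avoid division by zero on flat baselines
    return [i for i in range(baseline_len, len(series))
            if abs(series[i] - mu) / sigma > z_threshold]

def compress_alarms(curves, baseline_len, min_corr=0.9):
    """Run the detector once per cluster and emit one event per anomalous cluster."""
    events = []
    for members in cluster_curves(curves, min_corr):
        idx = detect_anomalies(curves[members[0]], baseline_len)
        if idx:
            events.append({"curves": members, "start": idx[0], "end": idx[-1]})
    return events

# Hypothetical traffic curves: port_a and port_b spike together, port_c is normal.
curves = {
    "port_a": [10, 11, 10, 11, 10, 11, 50, 52],
    "port_b": [20, 22, 20, 22, 20, 22, 100, 104],
    "port_c": [5, 5, 6, 5, 6, 5, 5, 6],
}
events = compress_alarms(curves, baseline_len=6)
print(events)
```

With per-curve thresholding, the two spiking ports would raise separate alarms at each anomalous point; here they collapse into a single event covering both curves.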

2. WAN root cause analysis algorithm

Public network quality detection has to cope with a large amount of data and the corresponding alarms, and analyzing each alarm manually is time-consuming and labor-intensive. In addition, because of the particular nature of the WAN, fault location involves the entire cloud path, which has long links and numerous devices. This requires checking more than XX indicators scattered across multiple systems at the same time, and the logic for combining these indicators is complicated, which makes root cause location very difficult. By analyzing the original data and alarms, and by combining the features of multiple indicators according to the characteristics of the business, we reduced the fault location time by a factor of XX using a rule tree method and greatly improved O&M efficiency.
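As a rough illustration of what a rule tree for root cause location might look like, the sketch below walks a small tree of indicator checks until it reaches a conclusion. The indicator names (`packet_loss_pct`, `isp_link_down`, `device_cpu_pct`), the thresholds, and the conclusions are all hypothetical; the real tree combines many more indicators from multiple systems.

```python
from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class Rule:
    """An internal node: a check on the indicators with two outgoing branches.
    A branch is either another Rule or a string naming the conclusion."""
    description: str
    predicate: Callable[[dict], bool]
    if_true: Union["Rule", str]
    if_false: Union["Rule", str]

def locate_root_cause(node, indicators):
    """Walk the tree from the root; return the conclusion and the checks taken."""
    path = []
    while isinstance(node, Rule):
        outcome = node.predicate(indicators)
        path.append((node.description, outcome))
        node = node.if_true if outcome else node.if_false
    return node, path

# Hypothetical three-level tree for a WAN packet-loss alarm.
TREE = Rule(
    "packet loss above 1%",
    lambda m: m["packet_loss_pct"] > 1.0,
    Rule(
        "ISP link reported down",
        lambda m: m["isp_link_down"],
        "ISP link failure",
        Rule(
            "border device CPU above 90%",
            lambda m: m["device_cpu_pct"] > 90,
            "border device overload",
            "unknown loss: escalate to manual analysis",
        ),
    ),
    "no WAN fault detected",
)

cause, path = locate_root_cause(
    TREE, {"packet_loss_pct": 3.2, "isp_link_down": False, "device_cpu_pct": 97}
)
print(cause)
for description, outcome in path:
    print(f"  {description}: {outcome}")
```

Returning the path alongside the conclusion keeps the decision explainable, which matters when the result is handed to an O&M engineer for confirmation.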

3. Digital twins

With the continuous growth of public cloud services, the infrastructure network of Huawei Cloud is also expanding. However, the data describing the physical network is scattered, fragmented, inconsistent, or missing state, and some of it even lives in manually maintained Excel files. No one actually knows the global network configuration and status, and the network views held by different departments or organizations are not unified. Inconsistent data may cause problems in network planning, network changes, and network O&M, and may even lead to network failures.

The digital twin work aims to build a dynamic and accurate digital twin of the network, with the network model as the skeleton and network data as the blood. A digital twin based on the physical network can support every stage of the network life cycle, such as network planning, network changes, and network O&M.

The digital twin model has been applied to the following two scenarios, and its application to network planning scenarios is in progress:

1) To address the inaccuracy of some network data, machine learning algorithms built on the digital twin model can effectively find non-standard architectures and configurations as well as abnormal data, and provide recommended repairs, with an alarm clustering accuracy of up to XX%.

2) In intelligent O&M scenarios, fine-grained anomaly discovery and root cause location must rely on associated knowledge. Combining the rich prior knowledge contained in the digital twin model with machine learning algorithms can therefore effectively implement alarm aggregation and root cause recommendation, and the alarm aggregation compression rate can reach XX%.
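To make the second scenario more concrete, the sketch below shows one simple way topology knowledge could drive alarm aggregation: alarms on the same or directly linked devices that occur close together in time are merged into one event, and the earliest alarm in each event is proposed as the root cause candidate. The link list stands in for the digital twin, and the device names, the 60-second window, and the "earliest alarm" heuristic are all assumptions made for illustration.

```python
from collections import defaultdict

def aggregate_alarms(links, alarms, window_s=60):
    """links: iterable of (device, device) pairs taken from the topology model.
    alarms: list of (timestamp_s, device, message) tuples.
    Returns a list of events ordered by the time of their first alarm."""
    neighbors = defaultdict(set)
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Union-find over alarm indices.
    parent = list(range(len(alarms)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Merge alarms on the same or adjacent devices within the time window.
    for i, (t_i, dev_i, _) in enumerate(alarms):
        for j in range(i + 1, len(alarms)):
            t_j, dev_j, _ = alarms[j]
            related = dev_i == dev_j or dev_j in neighbors[dev_i]
            if related and abs(t_i - t_j) <= window_s:
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i, alarm in enumerate(alarms):
        groups[find(i)].append(alarm)

    events = []
    for members in groups.values():
        members.sort()  # by timestamp first
        events.append({"root_candidate": members[0][1], "alarms": members})
    return sorted(events, key=lambda e: e["alarms"][0][0])

# Hypothetical topology and alarm stream.
links = [("core-1", "agg-1"), ("agg-1", "tor-1"),
         ("agg-1", "tor-2"), ("core-1", "agg-2")]
alarms = [
    (100, "agg-1", "link down"),
    (105, "tor-1", "BGP peer lost"),
    (107, "tor-2", "BGP peer lost"),
    (400, "agg-2", "fan failure"),
]
events = aggregate_alarms(links, alarms)
for e in events:
    print(e["root_candidate"], len(e["alarms"]))
```

In this toy stream, four alarms are reduced to two events. The pairwise comparison is quadratic in the number of alarms, so a real system would need to bucket alarms by time before comparing them.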

4. Log analysis

As Huawei Cloud expands globally, the amount of network equipment in its data centers and the complexity of overall network traffic continue to increase. It is particularly urgent to reduce the operational burden on O&M personnel and improve the utilization of physical equipment through intelligent log analysis and multi-source alarm correlation. We use AI data analysis methods to mine typical failure modes and build models from the logs and alarm data of the data center virtual network, and we build a fault knowledge base to assist in online alarm aggregation and fault location.

1) Virtual network: An online anomaly detection model based on a half-space forest is proposed and evaluated on multiple public data sets. Compared with the popular isolation forest algorithm, the accuracy of the new model is higher by XX% on average;

2) Physical network: A bidirectional LSTM semantic model based on NLP is proposed and evaluated on one public data set and two live network data sets. In the next-state prediction task, the accuracies of the current model on the public data set, the first version of the live network data, and the second version of the live network data are XX%, XX%, and XX%, respectively. It is expected that XX will launch a fault diagnosis platform to achieve quantitative analysis of global faults.

Figure (wuliwang.png): NLP-based semantic anomaly detection scheme for the physical network
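For readers unfamiliar with the half-space approach mentioned for the virtual network, the sketch below shows the general idea of streaming half-space trees: each tree recursively halves the feature space along random dimensions, counts how many recent points fall into each cell, and gives low scores to points that land in sparsely populated cells. It is a simplified illustration, not the proposed model: it assumes features already scaled to [0, 1], omits other refinements, and all class names and parameter values are hypothetical.

```python
import random

class HalfSpaceTree:
    """One random tree of fixed depth over the unit hypercube [0, 1]^n_dims."""

    def __init__(self, n_dims, max_depth, rng):
        self.max_depth = max_depth
        n_nodes = 2 ** (max_depth + 1)  # heap layout, index 0 unused
        self.dim = [0] * n_nodes
        self.split = [0.0] * n_nodes
        self.ref_mass = [0] * n_nodes     # counts from the previous window
        self.latest_mass = [0] * n_nodes  # counts from the current window
        self._build(1, 0, [0.0] * n_dims, [1.0] * n_dims, rng)

    def _build(self, node, depth, lo, hi, rng):
        if depth == self.max_depth:
            return
        d = rng.randrange(len(lo))   # random dimension
        mid = (lo[d] + hi[d]) / 2    # split the cell in half
        self.dim[node], self.split[node] = d, mid
        left_hi, right_lo = hi[:], lo[:]
        left_hi[d] = right_lo[d] = mid
        self._build(2 * node, depth + 1, lo, left_hi, rng)
        self._build(2 * node + 1, depth + 1, right_lo, hi, rng)

    def _path(self, x):
        node = 1
        for depth in range(self.max_depth + 1):
            yield node, depth
            if depth < self.max_depth:
                node = 2 * node + int(x[self.dim[node]] >= self.split[node])

    def update(self, x):
        for node, _ in self._path(x):
            self.latest_mass[node] += 1

    def score(self, x, size_limit):
        # Stop at the first sparsely populated node, or at a leaf.
        for node, depth in self._path(x):
            if depth == self.max_depth or self.ref_mass[node] <= size_limit:
                return self.ref_mass[node] * 2 ** depth

    def roll_window(self):
        self.ref_mass = self.latest_mass
        self.latest_mass = [0] * len(self.ref_mass)

class HalfSpaceForest:
    def __init__(self, n_dims, n_trees=25, max_depth=8,
                 window=100, size_limit=5, seed=0):
        rng = random.Random(seed)
        self.trees = [HalfSpaceTree(n_dims, max_depth, rng) for _ in range(n_trees)]
        self.window, self.size_limit, self.seen = window, size_limit, 0

    def score(self, x):
        """Mass score of x; LOWER means more anomalous."""
        return sum(t.score(x, self.size_limit) for t in self.trees)

    def learn(self, x):
        for t in self.trees:
            t.update(x)
        self.seen += 1
        if self.seen % self.window == 0:  # window full: swap mass profiles
            for t in self.trees:
                t.roll_window()

# Hypothetical 2-D stream: normal points cluster near (0.3, 0.3).
data_rng = random.Random(1)
train = [(0.3 + data_rng.uniform(-0.02, 0.02),
          0.3 + data_rng.uniform(-0.02, 0.02)) for _ in range(100)]
forest = HalfSpaceForest(n_dims=2)
for point in train:
    forest.learn(point)

print("normal :", forest.score(train[0]))
print("outlier:", forest.score((0.9, 0.9)))
```

Because the trees are built without looking at the data and only the counts change, the model can be updated in constant time per point, which is what makes this family of methods suitable for online detection. Scores are only meaningful once the first window has been learned.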