Introduction to Resource Management (1)

"The Romance of the Three Kingdoms" has a very famous story of Zhuge Liang's Northern Expedition-Lost Street Pavilion. There is a wonderful description: When Yi Sima went back to the base, he ordered soldiers to investigate who is the leader of the enemy. Then the reply is “Su Ma, the brother of Liang Ma”. Yi says with laughing: “He is nothing but a mediocrity.  How can Kongming win when using such a general?”. Then he asks, “Are there any army besides the Jieting”. The spy says,  “ Ping Wang deploy his army in about 10 kilometers from there”. Then, Yi order He Zhang take a team to block Ping Wang and also asks Dan Shen and Yi Shen take their own team to surround  the mountain to cut the road of water supply , and then defeat them When Shu’s soldiers break their own orders because of supply. This is the key scheduling strategy. In a few short words, the Secretary-General's operation to prepare the curtain, the deployment of troops will be the process of the description of the most vivid. This is also the ancient text of the early involved in the concept of "dispatching" description. And the next story, presumably more familiar, due to the failure of the military command, leading to Zhuge Liang, Ma Jian, Ma Jian paid the price of life. This is a typical case of a vivid contrast between success and failure in scheduling. Scheduling issues are ubiquitous. Today, I'd like to talk about scheduling issues in cloud computing. The core of cloud computing is to properly allocate, manage, schedule, and orchestrate data center resources and organize tens of millions of basic resources such as servers and networks to maximize the efficiency of these resources and achieve huge leaps in costs and efficiency. Before the development of public cloud, major Internet companies regarded their resource management and scheduling systems as the foundation and core of the entire service. Google has Borg/Omega, Microsoft has Apollo, Alibaba has Feitian/Fuxi/Sigma, and Baidu has Matrix. HUAWEI CLOUD has just released a new-generation operating system, Alkaid, for commercial use in the 5G+cloud+AI era. Alkaid has three of the five key capabilities: all-domain scheduling, dynamic negotiation and governance, and multi-objective optimization, resource management and scheduling technologies and algorithms are closely related. As mentioned in the preceding story, Sima Yi first inquired the general, performed precise profiling, collected deployment information, and made a series of deployments to complete a complete scheduling decision. Resource management and scheduling also need to focus on the entire resource lifecycle, which is similar to a chain, as shown in the following figure. In the future, we will introduce the algorithms for the entire scheduling cycle one by one.

IaaS resource offline scheduling algorithm:

Because of uneven flavor sales on the live network and the limitations of the current scheduling policies, some resource pools on Huawei Cloud have low allocation rates: when a request cannot be placed on the first attempt, the average cluster fragmentation rate exceeds XX%. In addition, because tenant behavior is unpredictable and VMs are created and deleted dynamically, a large number of fragments accumulate in the resource pool. At present, defragmentation can only be performed manually in the background, and the migration order and migration cost remain open problems. Based on Pareto optimization and operations research techniques, this algorithm aims to implement unified resource pool management with dynamic adaptation and allocation, overcome the low resource allocation rate of the cloud data center, and support the future public cloud goal of building the most cost-effective and most stable IaaS with X-fold annual growth.
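To make the placement side of this concrete, here is a minimal sketch of an offline packing pass, assuming a simplified two-dimensional (vCPU/memory) resource model and a best-fit-decreasing heuristic; the real algorithm combines Pareto optimization and operations research techniques and must also account for migration order and cost. All class and function names below are illustrative.

from dataclasses import dataclass, field

@dataclass
class Host:
    cpu_free: int                 # free vCPUs
    mem_free: int                 # free memory (GB)
    vms: list = field(default_factory=list)

@dataclass
class VM:
    name: str
    cpu: int
    mem: int

def fragmentation_rate(hosts, ref_cpu, ref_mem):
    # Share of free CPU capacity stranded on hosts that can no longer
    # fit one more reference flavor (ref_cpu x ref_mem).
    stranded = sum(h.cpu_free for h in hosts
                   if h.cpu_free < ref_cpu or h.mem_free < ref_mem)
    total_free = sum(h.cpu_free for h in hosts) or 1
    return stranded / total_free

def offline_pack(vms, hosts):
    # Best-fit decreasing: place the largest VMs first, and for each VM
    # pick the feasible host whose remaining slack would be smallest.
    for vm in sorted(vms, key=lambda v: (v.cpu, v.mem), reverse=True):
        candidates = [h for h in hosts
                      if h.cpu_free >= vm.cpu and h.mem_free >= vm.mem]
        if not candidates:
            continue  # in practice this would trigger capacity expansion
        best = min(candidates,
                   key=lambda h: (h.cpu_free - vm.cpu, h.mem_free - vm.mem))
        best.cpu_free -= vm.cpu
        best.mem_free -= vm.mem
        best.vms.append(vm.name)
    return hosts

In practice the same packing model would also drive the defragmentation plan, with an extra cost term charged for every migration it proposes.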

Multi-dimensional online scheduling algorithm for cloud resources:

Select the best available server for each real-time VM request from users. Even when a large amount of capacity remains in aggregate, large-specification flavors may fail to be provisioned because no single host has enough contiguous resources. Scheduling scenarios are complex, with many factors and many measurable dimensions to consider; how to model online scheduling and how to optimize and improve the system are the problems this cooperation project aims to solve. The project builds an online scheduling simulation platform for algorithm R&D and verification by analyzing HUAWEI CLOUD ECS request data and modeling live network scenarios, and improves the resource allocation rate in different scenarios by using techniques such as optimization and reinforcement learning.
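As an illustration of multi-dimensional online selection, the sketch below scores each feasible host by its weighted remaining slack and picks the tightest fit; the dimensions, weights, and host/request structures are hypothetical, and the production scheduler considers many more factors than this.

import math

# Hypothetical resource dimensions considered by the online scheduler.
DIMS = ("cpu", "mem", "disk", "net")

def score(host_free, vm_req, weights):
    # Lower is better: weighted slack left on the host after placement.
    # Infeasible hosts get an infinite score.
    if any(host_free[d] < vm_req[d] for d in DIMS):
        return math.inf
    return sum(weights[d] * (host_free[d] - vm_req[d]) for d in DIMS)

def schedule_online(vm_req, hosts, weights):
    # Pick the feasible host with the tightest (lowest-slack) fit.
    scored = [(score(h["free"], vm_req, weights), h) for h in hosts]
    best_score, best_host = min(scored, key=lambda x: x[0])
    if best_score == math.inf:
        return None  # no host fits; the request fails or triggers migration
    for d in DIMS:
        best_host["free"][d] -= vm_req[d]
    return best_host["name"]

hosts = [
    {"name": "h1", "free": {"cpu": 16, "mem": 64, "disk": 500, "net": 10}},
    {"name": "h2", "free": {"cpu": 8,  "mem": 32, "disk": 200, "net": 10}},
]
weights = {"cpu": 1.0, "mem": 0.5, "disk": 0.01, "net": 0.1}
print(schedule_online({"cpu": 8, "mem": 32, "disk": 100, "net": 1}, hosts, weights))

A tight-fit rule like this keeps large hosts free for large flavors, which is exactly the failure mode described above; reinforcement learning or other optimization would replace the hand-set weights.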

PaaS resource scheduling algorithm:

The traditional container scheduling solution for big data processing (batch services) considers only the computing capacity of a single node and ignores the impact of data transmission between containers and of the job's execution dependency graph on job completion time. This project plans to optimize the container scheduling policy based on the data center network topology and the characteristics of big data jobs. The objective is to improve the running performance of tenants' big data jobs by 30%, improve user experience, increase job throughput, and surpass competitors to build the most cost-effective container service.
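The sketch below shows one way topology awareness can enter the placement decision, assuming a simplified model in which the cost of each DAG edge depends only on whether the two tasks land on the same node or the same rack; the nodes, racks, tasks, and data volumes are illustrative, not the project's actual policy.

# Greedy, topology-aware placement of a batch job DAG (toy model).

def transfer_cost(placement, edges, rack_of, same_rack_cost=1, cross_rack_cost=5):
    # Estimated network cost of the data exchanged along already-placed DAG edges.
    cost = 0
    for (src, dst), gb in edges.items():
        a, b = placement[src], placement[dst]
        if a == b:
            continue  # co-located tasks exchange data locally
        cost += gb * (same_rack_cost if rack_of[a] == rack_of[b] else cross_rack_cost)
    return cost

def greedy_place(tasks, edges, nodes, rack_of, slots):
    # Place tasks one by one on the node that adds the least transfer cost.
    placement = {}
    for t in tasks:
        best_node, best_cost = None, None
        for n in nodes:
            if slots[n] == 0:
                continue
            trial = dict(placement, **{t: n})
            # Only edges whose endpoints are both placed contribute so far.
            known = {e: gb for e, gb in edges.items()
                     if e[0] in trial and e[1] in trial}
            c = transfer_cost(trial, known, rack_of)
            if best_cost is None or c < best_cost:
                best_node, best_cost = n, c
        placement[t] = best_node
        slots[best_node] -= 1
    return placement

nodes = ["n1", "n2", "n3"]
rack_of = {"n1": "rackA", "n2": "rackA", "n3": "rackB"}
slots = {"n1": 2, "n2": 1, "n3": 2}
tasks = ["map1", "map2", "reduce1"]
edges = {("map1", "reduce1"): 20, ("map2", "reduce1"): 10}  # GB shuffled per edge
print(greedy_place(tasks, edges, nodes, rack_of, slots))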

Cloud Capacity Prediction:

Based on historical resource allocation and flavor sales on the live network, the capacity prediction function uses prediction algorithms to plan future capacity expansion of the infrastructure in advance. Scientific capacity prediction meets user demand without wasting inventory. The solution combines the strengths of several algorithms, including ARIMA (autoregression plus moving average), Prophet (trend, seasonality, and holidays), LightGBM (gradient-boosted decision trees), and the LSTM (long short-term memory) deep learning model, and designs and implements adaptive prediction architectures for different data shapes. The pipeline includes data cleaning, noise reduction, pre-training, and evaluation modules, and overcomes the difficulty that the traditional Gaussian-weighted linear algorithm has in accurately predicting sharp fluctuations in live network capacity.

Infrastructure capacity prediction service architecture
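A minimal sketch of the adaptive selection idea follows, assuming every forecaster exposes a common fit/predict interface and is chosen per series by backtesting on a held-out window. Only two toy forecasters are implemented here; ARIMA, Prophet, LightGBM, or LSTM wrappers would plug into the same interface in the real architecture. All names and figures are illustrative.

class NaiveLast:
    # Forecast every future point as the last observed value.
    def fit(self, history):
        self.last = history[-1]
        return self
    def predict(self, horizon):
        return [self.last] * horizon

class MovingAverage:
    # Forecast every future point as the mean of the last `window` values.
    def __init__(self, window=4):
        self.window = window
    def fit(self, history):
        tail = history[-self.window:]
        self.mean = sum(tail) / len(tail)
        return self
    def predict(self, horizon):
        return [self.mean] * horizon

def mae(pred, actual):
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(actual)

def select_model(series, candidates, holdout=4):
    # Backtest every candidate on the last `holdout` points and keep the best.
    train, test = series[:-holdout], series[-holdout:]
    errors = {name: mae(cls().fit(train).predict(holdout), test)
              for name, cls in candidates.items()}
    best = min(errors, key=errors.get)
    return best, candidates[best]().fit(series)

sold_cores = [120, 132, 128, 140, 155, 149, 160, 171, 168, 180, 190, 186]
name, model = select_model(sold_cores, {"naive": NaiveLast, "ma": MovingAverage})
print(name, model.predict(2))  # forecast the next two periods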

Algorithms for scheduling model evolution platform:

When no real host machines are available, the intelligent scheduling algorithm iteration platform can simulate, at low cost and high speed, the 1:1 scheduling and allocation of physical host resources and VM allocation requests on the live network, and can evaluate and automatically optimize the algorithms used on the live network and on networks about to be launched, across multiple dimensions. Based on the current policy, the allocation result, the score, and the theoretical optimum, it automatically iterates policy parameters, gradually improves resource utilization, selects the best policy, releases it to the live network, and updates the online/offline scheduling policy in real time. The iteration platform must support quantitative evaluation of various decisions; training, release, and update of various algorithms; scheduling; capacity planning; and procurement selection.
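The parameter-iteration loop could look roughly like the sketch below, assuming a replay simulator that scores a candidate policy by the allocation rate it achieves on recorded live-network requests; the simulator is stubbed with a toy function here, and the search is a simple random local search rather than the platform's actual optimizer.

import random

def simulate(requests, params):
    # Stub: the real platform replays live-network VM requests 1:1 and
    # returns the allocation rate achieved under the candidate policy.
    cpu_w, mem_w = params
    return 1.0 - abs(cpu_w - 0.7) - abs(mem_w - 0.3) + random.uniform(-0.01, 0.01)

def iterate_policy(requests, init_params, rounds=50, step=0.05):
    # Random local search over scoring weights, keeping the best-scoring policy.
    best_params, best_rate = init_params, simulate(requests, init_params)
    for _ in range(rounds):
        candidate = tuple(max(0.0, p + random.uniform(-step, step))
                          for p in best_params)
        rate = simulate(requests, candidate)
        if rate > best_rate:
            best_params, best_rate = candidate, rate
    return best_params, best_rate

params, rate = iterate_policy(requests=[], init_params=(0.5, 0.5))
print("released policy:", params, "estimated allocation rate:", round(rate, 3))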

Implementation:

The live-network resource allocation algorithm has been debugged and tuned through scheduling simulation; based on this algorithm, the resource allocation rate has increased by 6%.

The scheduling algorithm iteration platform has been deployed on HUAWEI CLOUD. It supports playback and dynamic display of VM scheduling data at various granularities (regions, AZs, and pods) across live-network scenarios, helping to explore and locate unreasonable decisions in the allocation process, support algorithm optimization and iteration, evaluate an algorithm's allocation effect, compare algorithms across multiple dimensions, and support the evaluation of new algorithms.