Algorithm_Innovation_Lab_Research_yaoguang--HUAWEI CLOUD

Alkaid

Alkaid, the seventh star of the Big Dipper (Beidou), has been used since ancient times to judge the change of the four seasons and to keep time. The newly released intelligent cloud operating system takes its name. How does this smart cloud brain command all kinds of resources on the cloud and achieve the best match between tenant demand and resource supply? We looked for the answer with the Huawei Cloud Alkaid Lab and the Huawei Cloud Algorithm Innovation Lab.

# A First Look at Resource Scheduling # Cloud OS: My Job Is Too Hard

With virtualization technology, the vast computing and storage resources of a data center can be offered as cloud services. As data centers scale up and edge computing extends to smaller sites, the cloud operating system, which is responsible for efficient and accurate resource scheduling, faces three major challenges:

The first challenge is the resource consumption and sales model of cloud computing. Resource requests usually arrive at random, are billed on demand, and are released as soon as they are no longer needed, so the problem cannot be solved against fixed metrics;

The second challenge comes from the rapid growth of HUAWEI CLOUD. According to Frost & Sullivan's research on China's public cloud market, HUAWEI CLOUD rose to No. 3 in IaaS market share in Q3 2019, making it the fastest-growing top vendor. This rapid growth means that the distribution of user resource requests changes over time, whereas most traditional research designs solutions for a stable request distribution;

The third challenge is the architecture of the servers themselves. Different combinations of server architectures lead to differences in performance. These architectural differences are like adding many different partitions inside the boxes of a bin packing problem: resources must be placed while performance constraints are also respected.

# A Tribute to the Classics # Why Does the Traditional Bin Packing Model Not Work?

The packing problem dates back to the layout problems that Gauss began to study in 1831, and is essentially about fitting as much as possible into a box. Cloud VM deployment is the process of allocating VMs with various resource requirements to physical machines, as shown in the following figure. The cloud operating system continuously receives VM creation requests, and for each one it must decide which physical machine to deploy it on so that the fragmentation rate stays as low as possible.

Fig.1 When the bin packing algorithm meets resource scheduling on the cloud
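To make the packing analogy concrete, the following minimal sketch shows two classic online bin packing heuristics, First Fit and Best Fit, applied to VM placement. It is an illustration only, not Alkaid's scheduler: it assumes a single resource dimension (for example, vCPUs), and the function names `first_fit`, `best_fit`, and `place_all` are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Policy = Callable[[List[int], int], Optional[int]]


def first_fit(free: List[int], demand: int) -> Optional[int]:
    """Return the index of the first host with enough free capacity."""
    for i, cap in enumerate(free):
        if cap >= demand:
            return i
    return None


def best_fit(free: List[int], demand: int) -> Optional[int]:
    """Return the feasible host that would be left with the least spare capacity."""
    candidates = [(cap - demand, i) for i, cap in enumerate(free) if cap >= demand]
    return min(candidates)[1] if candidates else None


def place_all(
    capacities: List[int], demands: List[int], policy: Policy
) -> Tuple[List[Optional[int]], List[int]]:
    """Place requests one by one, in arrival order, without knowing future requests."""
    free = list(capacities)
    placement: List[Optional[int]] = []
    for demand in demands:
        host = policy(free, demand)
        if host is not None:
            free[host] -= demand  # reserve capacity on the chosen host
        placement.append(host)    # None means the request could not be placed
    return placement, free


if __name__ == "__main__":
    print(place_all([10, 6], [6, 10], first_fit))
    print(place_all([10, 6], [6, 10], best_fit))
```

With host capacities `[10, 6]` and requests `[6, 10]`, First Fit puts the first VM on host 0 and then cannot place the second, while Best Fit places both. The gap comes purely from arrival order and leftover fragments, which is the effect a cloud scheduler tries to minimize.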

This process shows that, compared with the classic bin packing problem, cloud resource scheduling has new constraints:

(1) In a real-time cloud environment, virtual machines are deployed on physical machines dynamically, one after another, and neither the creation and deletion requests nor their resource requirements are known in advance;

(2) High resource usage on a physical machine may cause service load to fluctuate. Resource scheduling must therefore take full account of each physical machine's resource and performance constraints and leave room for performance bursts;

(3) Depending on whether workloads are online or offline, scheduling must also consider the "noisy neighbor" effect, in which virtual machines on the same physical machine interfere with one another through resource preemption, and must minimize its impact.

# Becoming a Smart Cloud Brain # Alkaid's Path of Learning and Growth

Following the classic bin packing approach, the experts behind Alkaid first tried operations research heuristics such as FirstFit and BestFit. Take a single physical machine as an example: the fit between the requested resources and the available resources is measured by the cosine of the angle between the two resource vectors, and this value indicates how well the request would use the machine's available resources, as shown in the following figure.

Fig.2 Scheduling method based on cosine angle
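The cosine-based matching can be sketched as follows. This is a minimal illustration under stated assumptions, not the production algorithm: resources are modeled as two-dimensional vectors (vCPUs, memory in GiB), a host is considered only if it can hold the request in every dimension, and the function names `cosine` and `pick_host` are hypothetical. The score is cos θ = (r · a) / (|r| |a|), where r is the request vector and a is the host's available-resource vector.

```python
import math
from typing import Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two resource vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pick_host(
    request: Sequence[float], hosts: Sequence[Sequence[float]]
) -> Optional[int]:
    """Pick the feasible host whose free-resource vector is best aligned with the request."""
    best_index, best_score = None, -1.0
    for i, available in enumerate(hosts):
        # Skip hosts that cannot hold the request in every dimension.
        if not all(r <= a for r, a in zip(request, available)):
            continue
        score = cosine(request, available)
        if score > best_score:
            best_index, best_score = i, score
    return best_index


if __name__ == "__main__":
    # Free (vCPU, memory GiB) per host; values are made up for illustration.
    hosts = [(16, 16), (8, 16), (2, 64)]
    print(pick_host((4, 8), hosts))
```

In this example the request `(4, 8)` has the same CPU-to-memory ratio as host 1's free resources `(8, 16)`, so that host scores highest. Placing the VM there consumes both dimensions in proportion and leaves less stranded capacity than host 0, while host 2 is excluded because it has too few free vCPUs.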

From the perspective of the whole resource pool, the objective function changes as requests arrive at random and the pool scales. To handle this, Alkaid introduced reinforcement learning, which has strong search capability: it tries various strategies in advance through simulation and repeatedly reinforces those that yield the greatest eventual benefit. Reinforcement learning is data-driven, and a maze game illustrates the idea.

Fig.3 Simulation of optimal scheduling through reinforcement learning algorithm

While searching for bamboo, the panda receives a different reward at every decision step, depending on whether it "hits a wall", "passes", or "eats bamboo". Through repeated simulation, it learns the reward of each action (decision) at each location (state). In scheduling, this learned relationship between states, actions, and rewards is the basis on which the resource pool selects a machine to serve a request.
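The maze idea can be sketched with tabular Q-learning, a standard reinforcement learning method. The article does not say which algorithm Alkaid uses, so this is an assumed stand-in for illustration. The 3x3 grid, the reward values, the hyperparameters, and the function names `step`, `train`, and `greedy_path` are all hypothetical.

```python
import random

# S = start, # = wall, B = bamboo, . = free cell (layout is made up)
GRID = [
    "S.#",
    ".#.",
    "..B",
]
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
REWARD_WALL, REWARD_PASS, REWARD_BAMBOO = -1.0, -0.1, 10.0  # assumed values


def step(state, action):
    """Apply an action; return (next_state, reward, done)."""
    row, col = state[0] + action[0], state[1] + action[1]
    outside = not (0 <= row < len(GRID) and 0 <= col < len(GRID[0]))
    if outside or GRID[row][col] == "#":
        return state, REWARD_WALL, False      # "hit a wall": stay in place
    if GRID[row][col] == "B":
        return (row, col), REWARD_BAMBOO, True  # "eat bamboo": episode ends
    return (row, col), REWARD_PASS, False     # "pass": small step cost


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn Q(state, action) from repeated simulated episodes."""
    rng = random.Random(seed)
    q = {}
    for _ in range(episodes):
        state = (0, 0)
        for _ in range(100):  # cap the episode length
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))  # explore
            else:
                a = max(range(len(ACTIONS)), key=lambda i: q.get((state, i), 0.0))
            nxt, reward, done = step(state, ACTIONS[a])
            best_next = 0.0 if done else max(
                q.get((nxt, i), 0.0) for i in range(len(ACTIONS))
            )
            old = q.get((state, a), 0.0)
            # Q-learning update toward reward plus discounted best future value.
            q[(state, a)] = old + alpha * (reward + gamma * best_next - old)
            state = nxt
            if done:
                break
    return q


def greedy_path(q, max_steps=20):
    """Follow the highest-valued action from the start cell."""
    state, path = (0, 0), [(0, 0)]
    for _ in range(max_steps):
        a = max(range(len(ACTIONS)), key=lambda i: q.get((state, i), 0.0))
        state, _, done = step(state, ACTIONS[a])
        path.append(state)
        if done:
            break
    return path


if __name__ == "__main__":
    print(greedy_path(train()))
```

In the scheduling analogy, a state would describe the resource pool and the incoming request, an action would be the choice of physical machine, and the reward would reflect outcomes such as fragmentation. That mapping is an interpretation of the text rather than a documented design.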

Furthermore, cloud servers with different architectures and the differing QoS requirements of tenants mean that the environment in which the reinforcement learning algorithm operates keeps changing, just like the maze in the preceding figure. The historical data used for reinforcement learning training is therefore neither general nor adversarial enough. At this point, Alkaid began to learn and evolve on its own from historical data, so that it can cope with resource scheduling as the platform scales rapidly.

Fig.4 Self-learning scheduling strategy tuning based on the Alkaid scheduling algorithm

To verify the feasibility of the solution, the Alkaid Lab ran simulation tests of the Alkaid resource scheduling algorithm, which is driven by both expert experience and model data, on random request sequences generated from HUAWEI CLOUD live network data.

Table 1 Simulation Test Scenario 1

Table 2 Simulation Test Scenario 2

The test results show that, with the Alkaid resource scheduling algorithm, the average fragmentation rate improves by 30%, the number of servers required drops by about 6%, and the interval between resource pool defragmentation triggers lengthens by about 50%.
