Deploying a Scalable HPC Cluster with Slurm
Version: 2.0.0
Last Updated: April 2024
Built By: Huawei Cloud
Time Required for Deployment: About 40 minutes
Time Required for Uninstallation: About 10 minutes

Solution Overview
This solution helps you quickly set up a scalable HPC environment on Huawei Cloud based on the open-source Slurm scheduler and Huawei's open-source Gearbox program. Slurm runs in "configless" mode on the cloud servers that act as compute nodes. Gearbox interconnects with Huawei Cloud Auto Scaling and Cloud Eye: it monitors the job status of the Slurm cluster and scales cloud servers in or out in real time. Newly created cloud servers are automatically registered with and added to the cluster, and cloud servers to be removed are automatically deregistered from the cluster before they are destroyed.
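
For illustration, the following minimal Java sketch shows the kind of signal a Gearbox-style monitor can derive from the Slurm job queue: it counts pending and running jobs with squeue and combines them into a single workload value. The class name and the weighting are assumptions made for this example, not the actual Gearbox implementation.

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Illustrative sketch only: derives a simple workload value from the Slurm
// job queue, similar in spirit to the custom metric Gearbox reports to Cloud Eye.
public class WorkloadProbe {

    // Run a command and count the lines it prints to standard output.
    private static int countLines(String... command) throws Exception {
        Process p = new ProcessBuilder(command).start();
        int lines = 0;
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            while (r.readLine() != null) {
                lines++;
            }
        }
        p.waitFor();
        return lines;
    }

    public static void main(String[] args) throws Exception {
        // Pending (PD) and running (R) jobs, one job ID per line, no header.
        int pending = countLines("squeue", "-h", "-t", "PD", "-o", "%i");
        int running = countLines("squeue", "-h", "-t", "R", "-o", "%i");

        // A simple indicator: jobs waiting for resources dominate the value, so an
        // Auto Scaling alarm policy can scale out while the value stays high.
        int workload = pending * 2 + running;

        // A Gearbox-style monitor would report a value like this to Cloud Eye;
        // here it is only printed.
        System.out.println("pending=" + pending + " running=" + running + " workload=" + workload);
    }
}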
Solution Architecture
This solution helps you quickly set up a scalable HPC environment on Huawei Cloud.

Solution Description
- Create two Linux Elastic Cloud Servers (ECSs), install the open-source Slurm software, install the Gearbox program on the scheduling node, and configure the Java environment.
- Create one Elastic IP (EIP) for internal and external communication.
- Create security groups and configure rules that control access to the ECSs and secure the ECS environment.
- Use Image Management Service (IMS) to prepare the initialization environment for compute nodes created during auto scaling.
- Use Auto Scaling (AS) to create and configure an auto scaling group and to define scaling policies that automatically scale cluster resources in or out.
- Use Cloud Eye (CES) for resource monitoring. The Gearbox program monitors the job status, calculates a workload value, and reports it to Cloud Eye as a custom metric (see the reporting sketch after this list).
- Use Scalable File Service (SFS) to mount SFS file systems to the ECSs to provide shared file storage for the cluster.
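
As a rough sketch of the metric-reporting step mentioned above, the following Java code posts a workload value to Cloud Eye as a custom metric. It assumes the Cloud Eye "add monitoring data" endpoint (POST /V1.0/{project_id}/metric-data) with token authentication; the endpoint, project ID, token, namespace, and dimension values are placeholders to replace, and this is not the actual Gearbox code.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Illustrative sketch: report one custom metric sample to Cloud Eye.
public class CloudEyeReporter {

    // Placeholders: replace with your region endpoint, project ID, and a valid IAM token.
    private static final String ENDPOINT = "https://ces.cn-north-4.myhuaweicloud.com";
    private static final String PROJECT_ID = "<project_id>";
    private static final String IAM_TOKEN = "<X-Auth-Token>";

    public static void main(String[] args) throws Exception {
        int workload = 42; // value computed from the Slurm job queue (see the earlier sketch)

        // One metric sample; namespace and dimension names are example values only.
        String body = "[{"
                + "\"metric\":{"
                + "\"namespace\":\"SLURM.CLUSTER\","
                + "\"metric_name\":\"workload\","
                + "\"dimensions\":[{\"name\":\"autoscaling_group\",\"value\":\"slurm-compute\"}]"
                + "},"
                + "\"ttl\":172800,"
                + "\"collect_time\":" + System.currentTimeMillis() + ","
                + "\"value\":" + workload + ","
                + "\"unit\":\"count\""
                + "}]";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(ENDPOINT + "/V1.0/" + PROJECT_ID + "/metric-data"))
                .header("Content-Type", "application/json")
                .header("X-Auth-Token", IAM_TOKEN)
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}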