Cluster version: v1.23.8-r0, v1.25.3-r0, or later
OS: Huawei Cloud EulerOS 2.0
GPU type: Tesla T4 and Tesla V100
Driver version: 535.216.03, 470.57.02, 510.47.03, and 535.54.03
Runtime: containerd
Add-ons: The following add-ons must be installed in the cluster: Volcano Scheduler 1.10.5 or later; CCE AI Suite (NVIDIA GPU)
Enabling AI Performance-based Scheduling
In AI and big data collaborative scheduling scenarios, Volcano's Dominant Resource Fairness (DRF) and gang scheduling can be used to improve training performance and resource utilization.
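The sketch below is a minimal Volcano Job showing how gang scheduling is expressed for a distributed training workload; the job name, queue, image, and replica counts are placeholders, and DRF is enabled as a plugin in the volcano-scheduler configuration rather than in the workload manifest itself.

    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    metadata:
      name: tf-training-demo            # hypothetical job name
    spec:
      schedulerName: volcano            # schedule with the Volcano Scheduler add-on
      queue: default
      minAvailable: 4                   # gang scheduling: all 4 workers start together or not at all
      tasks:
        - replicas: 4
          name: worker
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: trainer
                  image: tensorflow/tensorflow:2.15.0   # hypothetical training image
                  resources:
                    requests:
                      cpu: "4"
                      memory: 8Gi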
CCE AI Suite (NVIDIA GPU) (v2.1.8, v2.7.5 or later), Volcano Scheduler (v1.10.5 or later), and CCE Cluster Autoscaler (v1.27.150, v1.28.78, v1.29.41, or later) have been installed in the cluster.
Table 1 lists CCE AI Suite (NVIDIA GPU) exception events and isolation results.
Possible Cause
For even scheduling on virtual GPUs, the cluster version must be compatible with the CCE AI Suite (NVIDIA GPU) add-on version.
When the CCE AI Suite (Ascend NPU) add-on reported information, only the chip logic IDs were updated, while the mapping between the chip logic IDs and NPU IDs remained unchanged.
Volcano provides end users with computing frameworks from multiple domains such as AI, big data, gene sequencing, and rendering. It also offers job scheduling, job management, and queue management for computing applications. Kubernetes typically uses its default scheduler to schedule workloads.
The CCE AI Suite (Ascend NPU) add-on of v2.1.15 or later has been installed in the cluster. For details, see CCE AI Suite (Ascend NPU). An NPU driver has been installed on the NPU nodes, and the driver version is 23.0.1 or later. If an earlier driver is installed, uninstall the original NPU driver before installing a supported version.
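As a rough illustration of how an NPU is consumed once the add-on and driver are in place, the pod below requests a single Ascend chip; the pod name, image, and the extended resource name are assumptions and should be checked against the CCE AI Suite (Ascend NPU) documentation for your chip type.

    apiVersion: v1
    kind: Pod
    metadata:
      name: npu-demo                          # hypothetical name
    spec:
      containers:
        - name: infer
          image: ascend-infer:latest          # hypothetical inference image
          resources:
            requests:
              huawei.com/ascend-310: 1        # assumed extended resource name; varies by Ascend chip type
            limits:
              huawei.com/ascend-310: 1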
Offline jobs: Such jobs run for a short time, have high computing requirements, and can tolerate high latency, such as AI and big data services.
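As a hedged sketch of how an offline job might be marked for hybrid online/offline scheduling, the pod below carries a QoS-level annotation; the annotation key and value are assumptions drawn from Volcano-based colocation setups, not a confirmed CCE interface, and the name and image are placeholders.

    apiVersion: v1
    kind: Pod
    metadata:
      name: offline-batch-demo                # hypothetical name
      annotations:
        volcano.sh/qos-level: "-1"            # assumed annotation marking this as an offline (low-priority) job
    spec:
      schedulerName: volcano
      containers:
        - name: batch
          image: spark-batch:latest           # hypothetical batch-processing image
          resources:
            requests:
              cpu: "2"
              memory: 4Gi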
Notes and Constraints
To support Kubernetes' default GPU scheduling on GPU nodes, the CCE AI Suite (NVIDIA GPU) add-on must be v2.0.10 or later, and the Volcano Scheduler add-on must be v1.10.5 or later.
Example of Shared GPU Scheduling
Use kubectl to access the cluster.
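As a minimal sketch of a workload that consumes a GPU through Kubernetes' default GPU scheduling, the pod below requests one whole GPU; the pod name and image are placeholders. Shared (virtualized) GPU scheduling uses add-on-specific resource names for sliced GPU memory and compute, so check the add-on documentation for the exact resource keys before adapting this example.

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-demo                          # hypothetical name
    spec:
      containers:
        - name: cuda
          image: nvidia/cuda:12.2.0-base-ubuntu22.04   # hypothetical CUDA image
          command: ["sleep", "infinity"]
          resources:
            limits:
              nvidia.com/gpu: 1               # one dedicated GPU via default GPU scheduling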
The CCE AI Suite (NVIDIA GPU) add-on has been installed in the cluster, and the add-on version is 2.0.10 or later. At least one NVIDIA GPU node is available in the cluster.
Therefore, the VPC network model applies to scenarios that have high requirements on performance, such as AI computing and big data computing.
(Optional) GPU Quota: Configurable only when the cluster contains GPU nodes and the CCE AI Suite (NVIDIA GPU) add-on has been installed.
Do not use: No GPU will be used.
GPU card: The GPU is dedicated for the container.
Add-ons: CCE AI Suite (Ascend NPU), CCE AI Suite (NVIDIA GPU), Cloud Native Cluster Monitoring, Cloud Native Log Collection, and Grafana. Monitoring Center: the Cloud Native Cluster Monitoring add-on of 3.12.0 or later must be installed in the cluster.
AI computing performance is 3 to 5 times higher when NUMA-based BMSs and high-speed InfiniBand network cards are used.
Highly Available and Secure
HA: CCE supports three control plane nodes on the cluster management plane. These nodes run in different AZs to ensure cluster HA.