Key CCE AI Suite (NVIDIA GPU) Parameters
Check Items
Check whether the configuration of CCE AI Suite (NVIDIA GPU) in the cluster has been intrusively modified. If so, upgrading the cluster may fail.
Solution
Use kubectl to access the cluster.
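For example, the add-on workloads and their configuration can be reviewed with standard kubectl commands. This is a minimal sketch; the DaemonSet name nvidia-gpu-device-plugin is an assumption and may differ in your cluster:
kubectl get ds -n kube-system | grep -i nvidia
# Review the located DaemonSet for manual (intrusive) changes, for example:
kubectl get ds nvidia-gpu-device-plugin -n kube-system -o yaml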
How Can I Drain a GPU Node After Upgrading or Rolling Back the CCE AI Suite (NVIDIA GPU) Add-on?
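A typical drain sequence uses standard kubectl commands; replace <node-name> with the GPU node to be drained. This is a general sketch, not an add-on-specific procedure:
# Mark the node unschedulable
kubectl cordon <node-name>
# Evict the pods on the node, keeping DaemonSet-managed pods
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# After the upgrade or rollback is verified, allow scheduling again
kubectl uncordon <node-name>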
What Can I Do If Certain Alarms Are Displayed in the GPU Node Events After the CCE AI Suite (NVIDIA GPU) Add-on Is Upgraded?
Comparison by add-on version: CCE AI Suite (Ascend NPU) 1.x.x, 2.0.0 to 2.1.6, and 2.1.7 to the latest version.
For a 310 series card with a driver version earlier than 23.0.rc0, CCE AI Suite (Ascend NPU) 1.x.x requires you to manually mount the drivers and npu-smi to a service pod.
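The following is a minimal, hypothetical sketch of such a manual mount using hostPath volumes. The pod and image names are placeholders, and the hostPath locations (/usr/local/Ascend/driver and /usr/local/sbin/npu-smi) are assumptions that must match where the driver and npu-smi actually reside on your nodes:
apiVersion: v1
kind: Pod
metadata:
  name: npu-service                     # hypothetical name
spec:
  containers:
    - name: app
      image: my-npu-app:latest          # hypothetical image
      volumeMounts:
        - name: ascend-driver
          mountPath: /usr/local/Ascend/driver
        - name: npu-smi
          mountPath: /usr/local/sbin/npu-smi
  volumes:
    - name: ascend-driver
      hostPath:
        path: /usr/local/Ascend/driver  # assumed driver location on the node
    - name: npu-smi
      hostPath:
        path: /usr/local/sbin/npu-smi   # assumed npu-smi location on the node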
Options:
0: The Ascend AI processor is unhealthy.
1: The Ascend AI processor is healthy.
Labels:
container_name (String): a container name
id (String): an NPU ID
model_name (String): name of an Ascend AI processor
namespace (String): a namespace name
pcie_bus_info (String): PCIe information of an Ascend AI processor
You have deployed an inference service using the AI Inference Framework add-on by referring to AI Inference Framework Add-on.
Constraints
kagent needs to be installed immediately after it is started. Ensure that the pods in the cluster can access the public network.
Parent Topic: AI Data Acceleration
The CCE AI Suite (Ascend NPU) add-on of v2.1.23 or later has been installed. For details about how to install the add-on, see CCE AI Suite (Ascend NPU). The Volcano Scheduler add-on has been installed. For details about the add-on version requirements, see Table 1.
In the AI task performance enhanced scheduling pane, select whether to enable DRF. This function improves the service throughput of the cluster and the running performance of services. Click Confirm.
Parent Topic: AI Performance-based Scheduling
Gang is mainly used in scenarios that require multi-process collaboration, such as AI and big data scenarios.
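For illustration, the following is a minimal Volcano Job sketch that uses gang scheduling through minAvailable, so the job is scheduled only when all four worker pods can start together. The image name and resource sizes are placeholders:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: gang-demo
spec:
  schedulerName: volcano
  minAvailable: 4                        # gang: schedule only when 4 pods can run at the same time
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: worker
              image: training-image:latest   # hypothetical image
              resources:
                requests:
                  cpu: "2"
                  memory: 4Gi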
The CCE AI Suite (Ascend NPU) add-on of v2.1.23 or later has been installed in the cluster. For details, see CCE AI Suite (Ascend NPU).
Notes and Constraints
In a single pod, only one container can request NPU resources, and init containers cannot request NPU resources.
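The following sketch illustrates this constraint: only one container requests NPU resources, and no init container does. The extended resource name huawei.com/ascend-310 is an assumption; the actual name depends on the NPU model and add-on configuration:
apiVersion: v1
kind: Pod
metadata:
  name: npu-pod                      # hypothetical name
spec:
  containers:
    - name: worker                   # the only container that requests NPU resources
      image: ascend-app:latest       # hypothetical image
      resources:
        requests:
          huawei.com/ascend-310: 1   # assumed extended resource name
        limits:
          huawei.com/ascend-310: 1
    - name: sidecar                  # other containers must not request NPU resources
      image: busybox:latest
      command: ["sleep", "infinity"]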
In this case, upgrade the CCE AI Suite (NVIDIA GPU) driver to version 535.161.08 or later, and restart the node. Parent Topic: Troubleshooting for Pre-upgrade Check Exceptions
NPU services: NPU nodes with the add-on described in CCE AI Suite (Ascend NPU) installed
GPU services: GPU nodes with the add-on described in CCE AI Suite (NVIDIA GPU) installed
Notes and Constraints
The LeaderWorkerSet add-on cannot be upgraded online.
If AI algorithm engineers want to run a model training task, they have to build an entire AI computing platform first. Imagine how time- and labor-consuming that is and how much knowledge and experience it requires.
If the CCE AI Suite (NVIDIA GPU) add-on version is earlier than 2.0.0, the driver installation directory is /opt/cloud/cce/nvidia.
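For example, assuming the default layout under that directory (an assumption; the exact layout may vary by driver version), nvidia-smi can be run on the node as follows to confirm the installation:
/opt/cloud/cce/nvidia/bin/nvidia-smi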
GPU Metrics
The CCE AI Suite (NVIDIA GPU) add-on provides GPU monitoring metrics, giving you additional GPU observability options. This section describes the metrics provided by CCE AI Suite (NVIDIA GPU).
Each vNPU contains a specific number of AI cores, AI CPUs, and memory. For example, if one container requests four AI cores and another requests two AI cores, CCE will allocate two vNPUs to meet these requests. For details, see Figure 1.
Prerequisites
The CCE AI Suite (Ascend NPU) add-on of a version later than v2.1.53 has been installed in the cluster. For details, see CCE AI Suite (Ascend NPU). An NPU driver has been installed on the NPU nodes, and the driver version is 23.0.1 or later.
In CCE AI Suite (NVIDIA GPU) versions 2.7.60, 2.1.44, or later, the gpu_pod_memory_used value may be about 100 KB higher than the actual value.