Key CCE AI Suite (NVIDIA GPU) Parameters
Check Items
Check whether the configuration of CCE AI Suite (NVIDIA GPU) in the cluster has been intrusively modified. If so, upgrading the cluster may fail.
Solution
Use kubectl to access the cluster.
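For example, the add-on workloads and their configuration can be reviewed with standard kubectl commands. This is a minimal sketch; the DaemonSet name nvidia-gpu-device-plugin is an assumption and may differ in your cluster:
kubectl get ds -n kube-system | grep -i nvidia
# Review the located DaemonSet for manual (intrusive) changes, for example:
kubectl get ds nvidia-gpu-device-plugin -n kube-system -o yaml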
How Can I Drain a GPU Node After Upgrading or Rolling Back the CCE AI Suite (NVIDIA GPU) Add-on?
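A typical drain sequence uses standard kubectl commands; replace <node-name> with the GPU node to be drained. This is a general sketch, not an add-on-specific procedure:
# Mark the node unschedulable
kubectl cordon <node-name>
# Evict the pods on the node, keeping DaemonSet-managed pods
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# After the upgrade or rollback is verified, allow scheduling again
kubectl uncordon <node-name>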
What Can I Do If Certain Alarms Are Displayed in the GPU Node Events After the CCE AI Suite (NVIDIA GPU) Add-on Is Upgraded?
Comparison by add-on version: CCE AI Suite (Ascend NPU) 1.x.x, 2.0.0 to 2.1.6, and 2.1.7 to the latest version.
For a 310 series card with a driver version earlier than 23.0.rc0, CCE AI Suite (Ascend NPU) 1.x.x requires you to manually mount the drivers and npu-smi to a service pod.
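The following is a minimal, hypothetical sketch of such a manual mount using hostPath volumes. The pod and image names are placeholders, and the hostPath locations (/usr/local/Ascend/driver and /usr/local/sbin/npu-smi) are assumptions that must match where the driver and npu-smi actually reside on your nodes:
apiVersion: v1
kind: Pod
metadata:
  name: npu-service                     # hypothetical name
spec:
  containers:
    - name: app
      image: my-npu-app:latest          # hypothetical image
      volumeMounts:
        - name: ascend-driver
          mountPath: /usr/local/Ascend/driver
        - name: npu-smi
          mountPath: /usr/local/sbin/npu-smi
  volumes:
    - name: ascend-driver
      hostPath:
        path: /usr/local/Ascend/driver  # assumed driver location on the node
    - name: npu-smi
      hostPath:
        path: /usr/local/sbin/npu-smi   # assumed npu-smi location on the node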
Options:
0: The Ascend AI processor is unhealthy.
1: The Ascend AI processor is healthy.
Labels:
container_name (String): a container name
id (String): an NPU ID
model_name (String): name of an Ascend AI processor
namespace (String): a namespace name
pcie_bus_info (String): PCIe information of an Ascend AI processor
You have deployed an inference service using the AI Inference Framework add-on by referring to AI Inference Framework Add-on.
Constraints
kagent needs to be installed immediately after it is started. Ensure that the pods in the cluster can access the public network.
Parent Topic: AI Data Acceleration
The CCE AI Suite (Ascend NPU) add-on of v2.1.23 or later has been installed. For details about how to install the add-on, see CCE AI Suite (Ascend NPU). The Volcano Scheduler add-on has been installed. For details about the add-on version requirements, see Table 1.
In the AI task performance enhanced scheduling pane, select whether to enable DRF. This function improves the service throughput of the cluster and the running performance of services. Click Confirm.
Parent Topic: AI Performance-based Scheduling
Gang is mainly used in scenarios that require multi-process collaboration, such as AI and big data scenarios.
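For illustration, the following is a minimal Volcano Job sketch that uses gang scheduling through minAvailable, so the job is scheduled only when all four worker pods can start together. The image name and resource sizes are placeholders:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: gang-demo
spec:
  schedulerName: volcano
  minAvailable: 4                        # gang: schedule only when 4 pods can run at the same time
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: worker
              image: training-image:latest   # hypothetical image
              resources:
                requests:
                  cpu: "2"
                  memory: 4Gi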
The CCE AI Suite (Ascend NPU) add-on of v2.1.23 or later has been installed in the cluster. For details, see CCE AI Suite (Ascend NPU).
Notes and Constraints
In a single pod, only one container can request NPU resources, and init containers cannot request NPU resources.
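The following sketch illustrates this constraint: only one container requests NPU resources, and no init container does. The extended resource name huawei.com/ascend-310 is an assumption; the actual name depends on the NPU model and add-on configuration:
apiVersion: v1
kind: Pod
metadata:
  name: npu-pod                      # hypothetical name
spec:
  containers:
    - name: worker                   # the only container that requests NPU resources
      image: ascend-app:latest       # hypothetical image
      resources:
        requests:
          huawei.com/ascend-310: 1   # assumed extended resource name
        limits:
          huawei.com/ascend-310: 1
    - name: sidecar                  # other containers must not request NPU resources
      image: busybox:latest
      command: ["sleep", "infinity"]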
In this case, upgrade the CCE AI Suite (NVIDIA GPU) driver to version 535.161.08 or later, and restart the node. Parent Topic: Troubleshooting for Pre-upgrade Check Exceptions
NPU services: NPU nodes with the add-on described in CCE AI Suite (Ascend NPU) installed
GPU services: GPU nodes with the add-on described in CCE AI Suite (NVIDIA GPU) installed
Notes and Constraints
The LeaderWorkerSet add-on cannot be upgraded online.
If AI algorithm engineers want to run a model training task, they have to build an entire AI computing platform first. Imagine how time- and labor-consuming that is and how much knowledge and experience it requires.
If the CCE AI Suite (NVIDIA GPU) add-on version is earlier than 2.0.0, the driver installation directory is /opt/cloud/cce/nvidia.
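For example, assuming the default layout under that directory (an assumption; the exact layout may vary by driver version), nvidia-smi can be run on the node as follows to confirm the installation:
/opt/cloud/cce/nvidia/bin/nvidia-smi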
GPU Metrics
The CCE AI Suite (NVIDIA GPU) add-on provides GPU monitoring metrics, giving you additional GPU observability options. This section describes the metrics provided by CCE AI Suite (NVIDIA GPU).
Each vNPU contains a specific number of AI cores, AI CPUs, and memory. For example, if one container requests four AI cores and another requests two AI cores, CCE will allocate two vNPUs to meet these requests. For details, see Figure 1.
Prerequisites
The CCE AI Suite (Ascend NPU) add-on of a version later than v2.1.53 has been installed in the cluster. For details, see CCE AI Suite (Ascend NPU). An NPU driver has been installed on the NPU nodes, and the driver version is 23.0.1 or later.
In CCE AI Suite (NVIDIA GPU) versions 2.7.60, 2.1.44, or later, the gpu_pod_memory_used value may be about 100 KB higher than the actual value.