To avoid repeated loading, the platform allows the model package to be loaded from the local storage space of the node in the resource pool and keeps the loaded files valid even when the service is stopped or restarted (using the hash value to ensure data consistency).
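The consistency check mentioned above can be pictured with a minimal sketch. The source does not say which hash algorithm the platform uses or where the expected value is stored, so SHA-256 and the helper names `file_sha256` and `cache_is_valid` are assumptions made purely for illustration.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large model files are not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_is_valid(cached_file: Path, expected_digest: str) -> bool:
    """True if the local copy exists and matches the recorded digest.

    Hypothetical helper: the platform's real check is not documented here.
    """
    return cached_file.is_file() and file_sha256(cached_file) == expected_digest
```

With a check of this kind, a file already present on the node only needs to be loaded again when its digest no longer matches the recorded one.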
Server Model Select a server model and select nodes in the node list. You can search for node information using keywords. Snt9b nodes and Snt9b23 supernodes are supported. Diagnosis Item You can select Parameter Plane Network Diagnosis, Ascend Device Diagnosis, or both.
For details, see Creating a Notebook Instance (New Page). The default resource specification is 5 GB, but you can expand it as needed.
Server Model Select a server model and select nodes in the node list. You can search for node information using keywords. Snt9b nodes and Snt9b23 supernodes are supported. Test Case You can select any of the following pressure test cases.
Why Can I Leave the IP Address of the Master Node Blank for DDP? The init method parameter in parser.add_argument('--init_method', default=None, help='tcp_port') contains the IP address and port number of the master node, which are automatically input by the platform.
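As a rough illustration of how a training script might consume this argument, the sketch below parses `--init_method` and splits it into the master address and port. It assumes the injected value has the form `tcp://<ip>:<port>`, the form that PyTorch's `init_method` accepts for TCP initialization; the exact string the platform passes is not shown in the source, and `build_parser` and `split_init_method` are hypothetical helper names.

```python
import argparse
from typing import Tuple
from urllib.parse import urlparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Same argument as in the text; per the source, the platform supplies the value.
    parser.add_argument('--init_method', default=None, help='tcp_port')
    return parser


def split_init_method(init_method: str) -> Tuple[str, int]:
    """Split an assumed tcp://host:port string into (host, port)."""
    parsed = urlparse(init_method)
    if parsed.scheme != "tcp" or parsed.hostname is None or parsed.port is None:
        raise ValueError(f"unexpected init_method: {init_method!r}")
    return parsed.hostname, parsed.port
```

Because the default is `None`, the script itself never needs a hard-coded master IP address; it only reads whatever value is supplied at launch.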
Minimum Number of PUs and Sequence Length Supported by Each Model Model Training Time and Cluster Scale Prediction Training time and the number of PUs depend on the model, cluster specifications (Snt9b B3/B2/B1 or Snt9b23), and dataset size.
Synchronization for Existing Nodes (labels and taints) and Synchronization for Existing Nodes (labels) can be modified synchronously for existing nodes (by selecting the check boxes). The updated resource tag information in the node pool is synchronized to its nodes.
(Optional) Custom Instance Injection Use this function to configure Server nodes if you want to: Use scripts to simplify the Server node configuration. Use scripts to initialize OSs. Use existing scripts and upload them to the server when creating the Server node.
Software Versions Required by Different Models A resource pool for elastic clusters can use either Elastic Bare Metal Servers (BMSs) or Elastic Cloud Servers (ECSs) as nodes. Each node model has its own operating system (OS) and compatible CCE cluster versions.
(If the port number is in use, change it to another one.) Access the snt9b23 container.
Table 7 Node management parameters Parameter Description Server Name Lite Server name, which can contain 1 to 64 characters. Only digits, letters, underscores (_), and hyphens (-) are allowed. CAUTION: The server name in the order will not be changed.
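The naming rule for Server Name can be checked before submitting an order. The following is a minimal sketch that encodes only the rule stated above (1 to 64 characters; digits, letters, underscores, and hyphens); it assumes "letters" means ASCII letters, and `is_valid_server_name` is a hypothetical helper, not part of any Huawei Cloud SDK.

```python
import re

# 1 to 64 characters drawn from ASCII letters, digits, underscore, and hyphen.
SERVER_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,64}")


def is_valid_server_name(name: str) -> bool:
    """Return True if the name satisfies the Server Name rule in Table 7."""
    return SERVER_NAME_PATTERN.fullmatch(name) is not None
```

For example, `is_valid_server_name("lite-server_01")` is accepted, while a name containing a space or exceeding 64 characters is rejected.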
Figure 14 RoCE test result (receive end) Figure 15 RoCE test result (server) If the RoCE bandwidth test has been started for a NIC, the following error message is displayed when the task is started again.
ak := os.Getenv("HUAWEICLOUD_SDK_AK")
sk := os.Getenv("HUAWEICLOUD_SDK_SK")
auth := basic.NewCredentialsBuilder().
    WithAk(ak).
    WithSk(sk).
    Build()
client := bms.NewBmsClient(
    bms.BmsClientBuilder().
        WithRegion(region.ValueOf("cn-north-4")).
In the navigation pane, choose Model Training > Training Jobs. In the job list, click Export to export training job details in a certain time range as an Excel file. A maximum of 200 rows of data can be exported.
You need to import a model package. The new image is larger than 35 GB and needs to be created on a server such as ECS. For details, see Creating a Custom Image on ECS. Figure 1 Creating a custom image for a model Constraints No malicious code is allowed.
Template Parameters Enter the node directory on the Lite Server for storing logs. The default value is /root/log_collection. Server Model Snt9b and Snt9b23 supernodes are supported. Collection Items Select Device side log, Host side log, NPU environment log, or all of them.
): Multiple GPUs work together on one server to speed up training using data parallelism.