To avoid repeated loading, the platform can load the model package from the local storage space of a node in the resource pool, and it keeps the loaded files valid even after the service is stopped or restarted, using a hash value to verify data consistency.
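The hash-based consistency check described above can be sketched as follows. This is a hedged illustration only: the source does not state which hash algorithm the platform uses, so SHA-256 and the helper names here are assumptions.

```python
import hashlib

def file_sha256(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large model packages fit in memory."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

def cache_is_valid(cached_path, expected_hash):
    """Reuse a locally cached model package only if its hash still matches."""
    # Illustrative helper, not a platform API.
    return file_sha256(cached_path) == expected_hash
```

If the hash of the cached copy no longer matches the expected value, the package would be downloaded again instead of being reused.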
Server Model: Only Ascend Snt9b and Ascend Snt9b23 are supported.
Type: You can select Single node or Integrated rack, or search for a specific node by keyword.
Diagnosis Item: You can select Parameter Plane Network Diagnosis, Ascend Device Diagnosis, or both.
For details, see Creating a Notebook Instance (New Page). The default storage size is 5 GB, but you can expand it as needed.
Server Model: Only Ascend Snt9b and Ascend Snt9b23 are supported.
Type: You can select Single node or Integrated rack, or search for a specific node by keyword.
Test Case: You can select any of the following pressure test cases.
Server Model: Only Ascend Snt9b and Ascend Snt9b23 are supported.
Type: You can select Single node or Integrated rack.
Select Node: Click Select Node. In the node list displayed on the right, select the nodes where the driver and firmware need to be upgraded.
Why Can I Leave the IP Address of the Master Node Blank for DDP? The --init_method argument defined by parser.add_argument('--init_method', default=None, help='tcp_port') carries the IP address and port number of the master node, and the platform fills them in automatically.
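A minimal sketch of the pattern described above: the script accepts --init_method, and when it is left blank, the init URL is built from values injected into the environment. MASTER_ADDR and MASTER_PORT are the usual PyTorch conventions; whether this platform sets exactly these variables is an assumption, not stated in the source.

```python
import argparse
import os

def resolve_init_method(argv=None):
    """Return the DDP init URL, falling back to platform-injected env vars."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--init_method', default=None, help='tcp_port')
    args = parser.parse_args(argv)
    if args.init_method:
        # Explicitly supplied, e.g. tcp://10.0.0.1:6000
        return args.init_method
    # Left blank: build the URL from values the platform injects
    # (env var names are an assumption for illustration).
    addr = os.environ.get('MASTER_ADDR', '127.0.0.1')
    port = os.environ.get('MASTER_PORT', '29500')
    return f'tcp://{addr}:{port}'

# The result would then typically be passed to
# torch.distributed.init_process_group(backend='nccl', init_method=...).
```

This is why the master IP can be left blank: the script still receives a complete tcp:// address at runtime.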
Minimum Number of PUs and Sequence Length Supported by Each Model
Model Training Time and Cluster Scale Prediction
Training time and the number of PUs depend on the model, cluster specifications (Snt9b B3/B2/B1 or Snt9b23), and dataset size.
If you select the Synchronization for Existing Nodes (labels and taints) or Synchronization for Existing Nodes (labels) check boxes, the changes are also applied to existing nodes: the updated resource tag information in the node pool is synchronized to its nodes.
Software Versions Required by Different Models A resource pool for elastic clusters can use either Elastic Bare Metal Servers (BMSs) or Elastic Cloud Servers (ECSs) as nodes. Each node model has its own operating system (OS) and compatible CCE cluster versions.
(If the port number is in use, change it to another one.) Access the snt9b23 container.
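As a hedged illustration of the parenthetical note above, the following sketch checks whether a port is already in use before choosing it for the container mapping. The helper name is illustrative, not a platform API.

```python
import socket

def port_in_use(port, host='127.0.0.1'):
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # connect_ex returns 0 on a successful connection, i.e. the port is taken.
        return s.connect_ex((host, port)) == 0
```

If the check returns True, pick another port number before accessing the snt9b23 container.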
Parent topic: Managing Model Training Jobs
Table 7 Node management parameters
Server Name: Server name, which can contain 1 to 64 characters. Only digits, letters, underscores (_), and hyphens (-) are allowed.
CAUTION: The server name in the order will not be changed.
Table 3 Parameters for resource configurations
Server: Server name, which can contain at most 64 characters, including letters, digits, hyphens (-), and underscores (_).
CAUTION: The server name in the order will not be changed.
npu_opt_media_snr_lane0 (NPU Optical Module Channel 0 Optical SNR): signal-to-noise ratio (SNR) on the media (optical) side of channel 0 in the NPU optical module. Unit: dB. Value range: natural number. Dimensions: instance_id, npu. Requires telescope 2.7.5.9 or later.
81 npu_opt_media_snr_lane1 (NPU Optical Module Channel 1 Optical SNR): signal-to-noise ratio (SNR) on the media (optical) side of channel 1 in the NPU optical module.
Changing or Resetting the Lite Server OS
Scenario
You can change or reset the Lite Server node OS if a BMS is used. Change the OS in either of the following ways:
(Recommended) Change or reset the OS on the server page of the ModelArts console.
Change the OS on the BMS console.
In the navigation pane, choose Model Training > Training Jobs. In the job list, click Export to export training job details in a certain time range as an Excel file. A maximum of 200 rows of data can be exported.
You need to import a model package. Because the new image is larger than 35 GB, it must be created on a server such as an ECS. For details, see Creating a Custom Image on ECS.
Figure 1 Creating a custom image for a model
Constraints: No malicious code is allowed.
import tensorflow as tf
import moxing as mox
from tensorflow.examples.tutorials.mnist import input_data

FLAGS = tf.flags.FLAGS

# Copy the training data from OBS (FLAGS.data_url) to the local cache first.
TMP_CACHE_PATH = '/cache/data'
mox.file.copy_parallel(FLAGS.data_url, TMP_CACHE_PATH)  # pass the flag value, not the literal string 'FLAGS.data_url'
mnist = input_data.read_data_sets(TMP_CACHE_PATH, one_hot=True)

Parent topic: ModelArts Standard Model Training
For the 300IDuo model, set is_300_iduo to True.
Multiple GPUs work together on one server to speed up training using data parallelism.
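The data-parallel idea above can be sketched in plain Python, standing in for a real GPU framework (the model, data, and helper names are illustrative, not from the source): each worker computes gradients on its own shard of the batch, the gradients are averaged, and the shared parameter is updated once.

```python
# Illustrative sketch of data parallelism with a single-parameter linear
# model y = w * x and squared-error loss; "workers" are just loop iterations.

def grad_on_shard(w, shard):
    """Mean gradient of (w*x - y)**2 with respect to w over one data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, batch, n_workers, lr=0.1):
    """Split the batch, compute per-worker gradients, average, update once."""
    size = len(batch) // n_workers
    shards = [batch[i * size:(i + 1) * size] for i in range(n_workers)]
    grads = [grad_on_shard(w, s) for s in shards]  # runs in parallel on real GPUs
    avg_grad = sum(grads) / n_workers              # the all-reduce step
    return w - lr * avg_grad

# Synthetic data generated by y = 3 * x, so training should drive w toward 3.
batch = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
w = 0.0
for _ in range(50):
    w = data_parallel_step(w, batch, n_workers=2)
```

Because the averaged gradient equals the full-batch gradient, splitting work across workers changes where the arithmetic happens, not the result of each update.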