Figure 1 Run logs of training jobs with GPU specifications (one compute node)
Figure 2 Run logs of training jobs with GPU specifications (two compute nodes)
After the upgrade starts, the nodes are isolated and new jobs cannot be delivered to them. The upgrade proceeds only after the existing jobs on the nodes are complete, so a secure upgrade may take a long time.
For details, see (Optional) Selecting a Training Mode.
Add tags if you want to manage training jobs by group. For details, see (Optional) Adding Tags.
Perform the follow-up procedure. For details, see Follow-Up Operations.
Users cannot add pay-per-use nodes (including in AutoScaler scenarios) to a yearly/monthly resource pool.
3 Preparing an Image Server
Obtain a Linux x86_64 server running Ubuntu 18.04.
mm:ss (UTC)

Parameter | Type | Description
----------|------|------------
node_label | String | Node label
os_type | String | OS type of a node
name | String | Name of an edge node
os_name | String | OS name of a node
arch | String | Node architecture
id | String | Edge node ID
instance_status | String | Running status of a model instance on the node
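A response carrying the node fields above could be parsed as follows. This is a minimal sketch: the field names come from the table, but the sample values are illustrative assumptions, not real API output.

```python
import json

# Hypothetical JSON payload using the documented field names;
# the values here are assumptions for illustration only.
payload = """
{
  "name": "edge-node-01",
  "id": "abc123",
  "arch": "x86_64",
  "os_type": "linux",
  "os_name": "Ubuntu",
  "node_label": "gpu",
  "instance_status": "running"
}
"""
node = json.loads(payload)
print(node["name"], node["instance_status"])
```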
npu_id: Ascend card ID, for example, davinci0 (deprecated)
device_id: Physical ID of an Ascend AI processor
device_type: Type of an Ascend AI processor
gpu_uuid: UUID of a node GPU
gpu_index: Index of a node GPU
gpu_type: Type of a node GPU
device_name: Device name of an InfiniBand or
import requests

# Function name reconstructed; the original definition line was truncated.
def infer(schema, ip, port, body):
    infer_url = "{}://{}:{}"
    url = infer_url.format(schema, ip, port)
    response = requests.post(url, data=body)
    print(response.content)

High-speed access does not support load balancing.
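The URL assembled by the snippet above can be checked locally before sending any request. This is a minimal sketch: the address, port, and body fields below are illustrative assumptions, not values from the service documentation.

```python
import json

# Illustrative values only; substitute the real service address and port.
schema, ip, port = "https", "192.168.0.1", 8443
url = "{}://{}:{}".format(schema, ip, port)

# Hypothetical request body; the field names are assumptions,
# not a documented inference schema.
body = json.dumps({"data": {"req_data": [{"feature_1": 1.0}]}})
print(url)
```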
The current version supports modelarts.vm.cpu.2u, modelarts.vm.gpu.pnt004 (must be requested), modelarts.vm.ai1.snt3 (must be requested), and custom (available only when the service is deployed in a dedicated resource pool).
Using PyTorch to Create a Training Job (New-Version Training)
This section describes how to train a model by calling ModelArts APIs.
For a single-node job (running on only one node), ModelArts starts a training container that exclusively uses the resources on the node. For a distributed job (running on more than one node), ModelArts starts a parameter server (PS) and a worker on the same node.
A model is deployed as a web service on an edge node through Intelligent EdgeFabric (IEF), which provides a real-time test UI and monitoring capabilities. The service keeps running. You need to create a node on IEF beforehand.
NOTE: Notebook instances with remote SSH enabled have VS Code plug-ins (such as Python and Jupyter) and the VS Code server package pre-installed, which occupy about 1 GB of persistent storage space.
Key Pair: Set a key pair after remote SSH is enabled.
If the model service (server) initiates a disconnection, but the connection is being used by ModelArts (client), a communication error occurs and this error code is returned.
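When such a server-side disconnection is transient, the client can simply retry the call. The following is a minimal sketch of client-side handling, an assumption for illustration rather than the ModelArts implementation:

```python
def call_with_retry(fn, retries=3):
    """Retry fn() when the peer resets the connection mid-call."""
    for attempt in range(retries):
        try:
            return fn()
        except ConnectionResetError:
            # Last attempt: propagate the error to the caller.
            if attempt == retries - 1:
                raise

# Demo: a callable that fails twice, then succeeds.
state = {"calls": 0}
def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionResetError
    return "ok"

print(call_with_retry(flaky))  # prints "ok" after two retried failures
```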
It is recommended that the Linux server have more than 8 GB of memory and more than 100 GB of disk space.
Table 4 Elastic node server
Application Scenario: Elastic node server lifecycle management
Dependent Service: ModelArts
Dependent Policy: modelarts:devserver:create, modelarts:devserver:listByUser, modelarts:devserver:list, modelarts:devserver:get, modelarts:devserver:
Why Can I Leave the IP Address of the Master Node Blank for DDP?
The --init_method parameter defined by parser.add_argument('--init_method', default=None, help='tcp_port') contains the IP address and port number of the master node, which are automatically input by the platform.
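The behavior can be reproduced with a minimal argparse sketch. The tcp:// address below is an illustrative assumption; in a real job the platform injects this value automatically.

```python
import argparse

parser = argparse.ArgumentParser()
# Same definition as in the FAQ: the platform fills in --init_method.
parser.add_argument('--init_method', default=None, help='tcp_port')

# Simulate the value the platform would pass automatically.
args = parser.parse_args(['--init_method', 'tcp://192.168.0.1:29500'])

# The master address and port can be recovered from the value.
host_port = args.init_method[len('tcp://'):]
host, port = host_port.rsplit(':', 1)
print(host, port)
```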
Only lowercase letters, digits, and hyphens (-) are allowed. The value must start with a lowercase letter and cannot end with a hyphen (-).
Resource Pool Type: You can select Physical or Logical.
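The naming rules above can be expressed as a single regular expression. This is a sketch for local validation, not an official validator:

```python
import re

# Lowercase letters, digits, and hyphens; must start with a lowercase
# letter and must not end with a hyphen.
NAME_RE = re.compile(r'^[a-z](?:[a-z0-9-]*[a-z0-9])?$')

def is_valid_name(name):
    return NAME_RE.fullmatch(name) is not None

print(is_valid_name('pool-1'))   # True
print(is_valid_name('1pool'))    # False: starts with a digit
print(is_valid_name('pool-'))    # False: ends with a hyphen
```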
MapReduce Service (MRS): MRS Administrator
GaussDB(DWS): DWS Administrator
Cloud Trace Service (CTS): CTS Administrator
ModelArts: ModelArts CommonOperations, ModelArts Dependency Access
Development environment (notebook/image management/elastic node server): OBS: OBS Administrator, Cloud