AI Development Platform ModelArts - Creating a Training Job with the TensorFlow Framework (Old-Version Training): Procedure
Procedure
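- Call the authentication interface to obtain a user token.
- Request message:
URI format: POST https://{iam_endpoint}/v3/auth/tokens
Request header: Content-Type → application/json
Request body: { "auth": { "identity": { "methods": ["password"], "password": { "user": { "name": "user_name", "password": "user_password", "domain": { "name": "domain_name" } } } }, "scope": { "project": { "name": "cn-north-1" } } }}
Replace the placeholder fields with actual values:
- iam_endpoint is the IAM endpoint.
- user_name is the IAM user name.
- user_password is the user's login password.
- domain_name is the name of the account to which the user belongs.
- cn-north-1 is the project name, which identifies the region where the service is deployed.
- The status code "201 Created" is returned. The value of "X-Subject-Token" in the response header is the token, for example:
x-subject-token → MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...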
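The token request above can be sketched in Python with only the standard library. This is a minimal sketch, not an official client: the function names `build_token_request` and `fetch_token` are hypothetical, and the endpoint and credentials are passed in by the caller.

```python
import json
import urllib.request


def build_token_request(iam_endpoint, user_name, user_password, domain_name, project_name):
    """Build (but do not send) the POST /v3/auth/tokens request described above."""
    body = {
        "auth": {
            "identity": {
                "methods": ["password"],
                "password": {
                    "user": {
                        "name": user_name,
                        "password": user_password,
                        "domain": {"name": domain_name},
                    }
                },
            },
            "scope": {"project": {"name": project_name}},
        }
    }
    return urllib.request.Request(
        url=f"https://{iam_endpoint}/v3/auth/tokens",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_token(request):
    """Send the request and return the X-Subject-Token response header.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses, so reaching
    the return statement means the call succeeded (201 Created is expected).
    """
    with urllib.request.urlopen(request) as response:
        return response.headers["X-Subject-Token"]
```

Building the request separately from sending it makes it possible to inspect the URL and body before any credentials leave the machine.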
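- Call the interface for querying job resource specifications to obtain the resource specifications that training jobs support.
- Request message:
URI format: GET https://{ma_endpoint}/v1/{project_id}/job/resource-specs?job_type=train
Request header: X-auth-Token → MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...
Replace the placeholder fields with actual values:
- ma_endpoint is the ModelArts endpoint.
- project_id is the user's project ID.
- The value of "X-auth-Token" is the token obtained in the previous step.
- The status code "200 OK" is returned. The response body is as follows: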
{ "specs": [ ...... { "spec_id": 7, "core": "2", "cpu": "8", "gpu_num": 0, "gpu_type": "", "spec_code": "modelarts.vm.cpu.2u", "unit_num": 1, "max_num": 1, "storage": "", "interface_type": 1, "no_resource": false }, { "spec_id": 27, "core": "8", "cpu": "32", "gpu_num": 0, "gpu_type": "", "spec_code": "modelarts.vm.cpu.8u", "unit_num": 1, "max_num": 1, "storage": "", "interface_type": 1, "no_resource": false } ], "is_success": true, "spec_total_count": 5}
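- Use the "spec_code" field to select and record the specification needed to create the training job. This section uses "modelarts.vm.cpu.8u" as an example and records its "max_num" value, "1".
- The "no_resource" field indicates whether resources for the specification are sufficient. "false" means resources are available.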
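Selecting a specification from the response can be sketched as a small filter over the parsed JSON. The helper name `select_spec` is hypothetical, and the sample data is a trimmed copy of the response shown above.

```python
def select_spec(specs_response, spec_code):
    """Return the spec entry matching spec_code if it has resources, else None."""
    for spec in specs_response.get("specs", []):
        # "no_resource": false means the specification currently has resources.
        if spec.get("spec_code") == spec_code and not spec.get("no_resource", True):
            return spec
    return None


# Trimmed copy of the sample response body from this step.
sample_specs = {
    "specs": [
        {"spec_id": 7, "core": "2", "cpu": "8", "spec_code": "modelarts.vm.cpu.2u",
         "max_num": 1, "no_resource": False},
        {"spec_id": 27, "core": "8", "cpu": "32", "spec_code": "modelarts.vm.cpu.8u",
         "max_num": 1, "no_resource": False},
    ],
    "is_success": True,
}

chosen_spec = select_spec(sample_specs, "modelarts.vm.cpu.8u")
```

The "max_num" value of the chosen entry is the figure this step asks you to record alongside the spec code.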
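- Call the interface for querying job engine specifications to view the engine types and versions available for training jobs.
- Request message:
URI format: GET https://{ma_endpoint}/v1/{project_id}/job/ai-engines?job_type=train
Request header: X-auth-Token → MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...
Replace the placeholder fields with actual values.
- The status code "200 OK" is returned. The response body is as follows: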
{ "engines": [ { "engine_type": 13, "engine_name": "Ascend-Powered-Engine", "engine_id": 130, "engine_version": "TF-1.15-python3.7-aarch64" }, ...... { "engine_type": 1, "engine_name": "TensorFlow", "engine_id": 3, "engine_version": "TF-1.8.0-python2.7" }, { "engine_type": 1, "engine_name": "TensorFlow", "engine_id": 4, "engine_version": "TF-1.8.0-python3.6" }, ...... { "engine_type": 9, "engine_name": "XGBoost-Sklearn", "engine_id": 100, "engine_version": "XGBoost-0.80-Sklearn-0.18.1-python3.6" } ], "is_success": true}
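Use the "engine_name" and "engine_version" fields to select the engine needed to create the training job, and record the corresponding "engine_id". This section creates the job with the TensorFlow engine and records the "engine_id" value "4".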
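Looking up the "engine_id" for a given engine name and version can be sketched as follows. The helper name `find_engine_id` is hypothetical, and the sample data is a trimmed copy of the response shown above.

```python
def find_engine_id(engines_response, engine_name, engine_version):
    """Return the engine_id whose name and version both match, else None."""
    for engine in engines_response.get("engines", []):
        if (engine.get("engine_name") == engine_name
                and engine.get("engine_version") == engine_version):
            return engine["engine_id"]
    return None


# Trimmed copy of the sample response body from this step.
sample_engines = {
    "engines": [
        {"engine_type": 1, "engine_name": "TensorFlow", "engine_id": 3,
         "engine_version": "TF-1.8.0-python2.7"},
        {"engine_type": 1, "engine_name": "TensorFlow", "engine_id": 4,
         "engine_version": "TF-1.8.0-python3.6"},
    ],
    "is_success": True,
}

engine_id = find_engine_id(sample_engines, "TensorFlow", "TF-1.8.0-python3.6")
```

Matching on both fields matters because the same engine name appears with several versions, each with its own ID.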
- Call the interface for creating a training job to create a TensorFlow-based training job named "jobtest_TF".
- Request message:
URI format: POST https://{ma_endpoint}/v1/{project_id}/training-jobs
Request headers:
- X-auth-Token → MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...
- Content-Type → application/json
Request body: { "job_name": "jobtest_TF", "job_desc": "TF识别手写数字", "config": { "worker_server_num": 1, "parameter": [], "flavor": { "code": "modelarts.vm.cpu.8u" }, "train_url": "/test-modelarts/mnist-model/output/", "engine_id": 4, "app_url": "/test-modelarts/mnist-tensorflow-code/", "boot_file_url": "/test-modelarts/mnist-tensorflow-code/train_mnist_tf.py", "data_source": [ { "type": "obs", "data_url": "/test-modelarts/dataset-mnist/" } ] }, "notification": { "topic_urn": "", "events": [] }, "workspace_id": "0"}
- The status code "200 OK" is returned, indicating that the training job was created. The response body is as follows:
{ "version_name": "V0001", "job_name": "jobtest_TF", "create_time": 1609121837000, "job_id": 567524, "resource_id": "jobaedef089", "version_id": 1108482, "is_success": true, "status": 1}
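- Record the values of "job_id" (the training job ID) and "version_id" (the training job version ID) for use in later steps.
- A "status" of "1" indicates that the training job is initializing.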
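The request body for creating the job can be assembled from the values recorded in the earlier steps. This is a minimal sketch: the function name `build_training_job_body` and its parameter names are hypothetical, while the JSON field names mirror the sample request body above.

```python
import json


def build_training_job_body(job_name, job_desc, spec_code, engine_id,
                            app_url, boot_file_url, data_url, train_url,
                            worker_server_num=1):
    """Assemble the JSON body for POST /v1/{project_id}/training-jobs."""
    return {
        "job_name": job_name,
        "job_desc": job_desc,
        "config": {
            "worker_server_num": worker_server_num,
            "parameter": [],
            "flavor": {"code": spec_code},       # spec_code recorded in step 2
            "train_url": train_url,              # OBS path for training output
            "engine_id": engine_id,              # engine_id recorded in step 3
            "app_url": app_url,                  # OBS directory holding the code
            "boot_file_url": boot_file_url,      # entry script inside app_url
            "data_source": [{"type": "obs", "data_url": data_url}],
        },
        # Empty notification settings and workspace "0", as in the sample body.
        "notification": {"topic_urn": "", "events": []},
        "workspace_id": "0",
    }


job_body = build_training_job_body(
    job_name="jobtest_TF",
    job_desc="TF handwritten digit recognition",
    spec_code="modelarts.vm.cpu.8u",
    engine_id=4,
    app_url="/test-modelarts/mnist-tensorflow-code/",
    boot_file_url="/test-modelarts/mnist-tensorflow-code/train_mnist_tf.py",
    data_url="/test-modelarts/dataset-mnist/",
    train_url="/test-modelarts/mnist-model/output/",
)
encoded_body = json.dumps(job_body).encode("utf-8")  # bytes ready to send
```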
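- Call the interface for querying training job version details to look up the details of the created training job by its ID.
- Request message:
URI format: GET https://{ma_endpoint}/v1/{project_id}/training-jobs/{job_id}/versions/{version_id}
Request header: X-auth-Token → MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...
- The status code "200 OK" is returned. The response body is as follows: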
{ "dataset_name": null, "duration": 1326, "spec_code": "modelarts.vm.cpu.8u", "parameter": [], "start_time": 1609121913000, "model_outputs": [], "engine_name": "TensorFlow", "error_result": null, "gpu_type": "", "user_frame_image": null, "gpu": null, "dataset_id": null, "nas_mount_path": null, "task_summary": {}, "max_num": 1, "model_metric_list": "{}", "is_zombie": null, "flavor_code": "modelarts.vm.cpu.8u", "gpu_num": 0, "train_url": "/test-modelarts/mnist-model/output/", "engine_type": 1, "job_name": "jobtest_TF", "nas_type": "efs", "outputs": null, "job_id": 567524, "data_url": "/test-modelarts/dataset-mnist/", "log_url": null, "boot_file_url": "/test-modelarts/mnist-tensorflow-code/train_mnist_tf.py", "volumes": null, "dataset_version_id": null, "algorithm_id": null, "worker_server_num": 1, "pool_type": "SYSTEM_DEFINED", "autosearch_config": null, "job_desc": "TF识别手写数字", "inputs": null, "model_id": null, "dataset_version_name": null, "pool_name": "hec-train-pub-cpu", "engine_version": "TF-1.8.0-python3.6", "system_metric_list": { "recvBytesRate": [ "0", "0" ], "cpuUsage": [ "0", "0" ], "sendBytesRate": [ "0", "0" ], "memUsage": [ "0", "0" ], "gpuUtil": [ "0", "0" ], "gpuMemUsage": [ "0", "0" ], "interval": 1, "diskWriteRate": [ "0", "0" ], "diskReadRate": [ "0", "0" ] }, "retrain_model_id": null, "version_name": "V0001", "pod_version": "1.8.0-cp36", "engine_id": 4, "status": 10, "cpu": "32", "user_image_url": null, "spec_id": 27, "is_success": true, "storage": "", "nas_share_addr": null, "version_id": 1108482, "no_resource": false, "user_command": null, "resource_id": "jobaedef089", "core": "8", "npu_info": null, "app_url": "/test-modelarts/mnist-tensorflow-code/", "data_source": [ { "type": "obs", "data_url": "/test-modelarts/dataset-mnist/" } ], "pre_version_id": null, "create_time": 1609121837000, "job_type": 1, "pool_id": "pool7d1e384a"}
The response shows the details of the training job version. A "status" of "10" indicates that the training job has run successfully.
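Because a job starts in status "1" and reaches "10" only after it finishes, a client typically polls the version-details interface. The sketch below is one hedged way to do that: `wait_until_succeeded` is a hypothetical helper, and `fetch_status` is a caller-supplied function that performs the GET request and returns the "status" field. Only the two status codes mentioned in this document are used, so any other state simply keeps the loop polling until the attempt limit is reached.

```python
import time

STATUS_SUCCEEDED = 10  # "status": 10 means the job ran successfully


def wait_until_succeeded(fetch_status, max_polls=60, interval_seconds=10.0,
                         sleep=time.sleep):
    """Poll fetch_status() until it returns STATUS_SUCCEEDED.

    Returns True on success, or False if max_polls attempts pass without
    reaching the succeeded status.
    """
    for attempt in range(max_polls):
        if fetch_status() == STATUS_SUCCEEDED:
            return True
        if attempt < max_polls - 1:
            sleep(interval_seconds)  # wait before the next query
    return False
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real delays, for example `wait_until_succeeded(my_fetch, sleep=lambda seconds: None)`.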
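- Call the interface for obtaining training job log file names to get the names of the training job's log files.
- Request message:
URI format: GET https://{ma_endpoint}/v1/{project_id}/training-jobs/{job_id}/versions/{version_id}/log/file-names
Request header: X-auth-Token → MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...
Replace the placeholder fields with actual values.
- The status code "200 OK" is returned. The response body is as follows: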
{ "is_success": true, "log_file_list": [ "job-jobtest-tf.0" ]}
The response indicates that there is only one log file, named "job-jobtest-tf.0".
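- Call the interface for querying training job logs to query downward and retrieve 8 lines of the training job log file.
- Request message:
URI format: GET https://{ma_endpoint}/v1/{project_id}/training-jobs/{job_id}/versions/{version_id}/aom-log?log_file=job-jobtest-tf.0&lines=8&order=desc
Request header: X-auth-Token → MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...
Replace the placeholder fields with actual values:
- "log_file" is the log file name obtained in step 6.
- "lines" is the number of log lines to retrieve.
- "order" is the direction of the log query.
- The status code "200 OK" is returned. The response body is as follows: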
{ "start_line": "1609121886518240330", "lines": 8, "is_success": true, "end_line": "1609121900042593083", "content": "Done exporting!\n\n[Modelarts Service Log]Training completed.\n\n[ModelArts Service Log]modelarts-pipe: will create log file /tmp/log/jobtest_TF.log\n\n[ModelArts Service Log]modelarts-pipe: will create log file /tmp/log/jobtest_TF.log\n\n[ModelArts Service Log]modelarts-pipe: will write log file /tmp/log/jobtest_TF.log\n\n[ModelArts Service Log]modelarts-pipe: param for max log length: 1073741824\n\n[ModelArts Service Log]modelarts-pipe: param for whether exit on overflow: 0\n\n[ModelArts Service Log]modelarts-pipe: total length: 23303\n"}
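The log query URL and the "content" field of the response can be handled as sketched below. The helper names `build_log_url` and `split_log_content` are hypothetical, and the host name in the usage line is a placeholder rather than a real endpoint.

```python
from urllib.parse import urlencode


def build_log_url(ma_endpoint, project_id, job_id, version_id,
                  log_file, lines, order):
    """Build the aom-log query URL with properly encoded query parameters."""
    query = urlencode({"log_file": log_file, "lines": lines, "order": order})
    return (f"https://{ma_endpoint}/v1/{project_id}/training-jobs/"
            f"{job_id}/versions/{version_id}/aom-log?{query}")


def split_log_content(log_response):
    """Split the "content" string into its non-empty log lines."""
    return [line for line in log_response.get("content", "").split("\n") if line]


# Placeholder endpoint and project ID; job and version IDs come from step 4.
log_url = build_log_url("modelarts.example.com", "my-project-id", 567524, 1108482,
                        log_file="job-jobtest-tf.0", lines=8, order="desc")
```

Using `urlencode` rather than string concatenation keeps the URL valid if a log file name ever contains characters that need escaping.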
- When the training job has finished or is no longer needed, call the interface for deleting a training job to delete it.
- Request message:
URI format: DELETE https://{ma_endpoint}/v1/{project_id}/training-jobs/{job_id}
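Request header: X-auth-Token → MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...
Replace the placeholder fields with actual values.
- The status code "200 OK" is returned, indicating that the job was deleted. A sample response is as follows: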
{ "is_success": true}
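The delete call can be sketched with the standard library as follows. This assumes the delete interface is invoked with the HTTP DELETE method on the training job URI; the helper name `build_delete_job_request` is hypothetical, and the request is built without being sent.

```python
import urllib.request


def build_delete_job_request(ma_endpoint, project_id, job_id, token):
    """Build (but do not send) the request that deletes a training job."""
    return urllib.request.Request(
        url=f"https://{ma_endpoint}/v1/{project_id}/training-jobs/{job_id}",
        headers={"X-Auth-Token": token},  # token obtained in step 1
        method="DELETE",                  # assumption: deletion uses HTTP DELETE
    )
```

Sending it with `urllib.request.urlopen(request)` and parsing the body as JSON should yield the "is_success" flag shown in the sample response.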