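AI Development Platform ModelArts - Preparing the Image: Step 6: Writing the config.yaml File
Time: 2025-06-24 10:36:33
Step 6: Writing the config.yaml File
The config.yaml template for single-node training, which configures the pod, is given first. Before training, replace each parameter value written as ${} according to the parameter descriptions. This template uses the SFS Turbo mount scheme.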
apiVersion: v1
kind: ConfigMap
metadata:
  name: configmap1980-vcjob            # Keep the "configmap1980-" prefix, followed by the vcjob name
  namespace: default                   # Any namespace; must be the same namespace as the vcjob below
  labels:
    ring-controller.cce: ascend-1980   # Do not change
data:                                  # Do not change; the volcano plugin rewrites this after initialization
  jobstart_hccl.json: |
    {
        "status":"initializing"
    }
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: vcjob                          # Job name; must correspond to the name in the ConfigMap
  namespace: default                   # Same as the ConfigMap
  labels:
    ring-controller.cce: ascend-1980   # Do not change
    fault-scheduling: "force"
spec:
  minAvailable: 1
  schedulerName: volcano               # Do not change
  policies:
    - event: PodEvicted
      action: RestartJob
  plugins:
    configmap1980:
      - --rank-table-version=v2        # Do not change; generates a v2 rank table file
    env: []
    svc:
      - --publish-not-ready-addresses=true
  maxRetry: 5
  queue: default
  tasks:
    - name: main
      replicas: 1
      template:
        metadata:
          name: training
          labels:
            app: ascendspeed
            ring-controller.cce: ascend-1980   # Do not change
        spec:
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                      - key: volcano.sh/job-name
                        operator: In
                        values:
                          - vcjob
                  topologyKey: kubernetes.io/hostname
          hostNetwork: true            # Use the host network mode
          containers:
            - image: ${image_name}     # Image address
              # IfNotPresent: pull only when the image is absent on the host (the default for tags other than :latest)
              # Always: pull the image every time a Pod is created
              # Never: the Pod never pulls the image on its own
              imagePullPolicy: IfNotPresent
              name: ${container_name}  # Container name
              securityContext:
                allowPrivilegeEscalation: false
                runAsUser: 0           # User inside the container [0: root, 1000: ma-user]
              env:
                - name: name
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.name
                - name: ip
                  valueFrom:
                    fieldRef:
                      fieldPath: status.hostIP
                - name: framework
                  value: "PyTorch"
              command: ["/bin/sh", "-c"]
              args:
                - ${command}
              resources:
                requests:
                  huawei.com/ascend-1980: "8"   # Number of cards requested; keep the key unchanged
                  memory: ${requests_memory}    # Minimum memory requested by the container
                  cpu: ${requests_cpu}          # Minimum CPU requested by the container
                limits:
                  huawei.com/ascend-1980: "8"   # Card limit; keep the key unchanged
                  memory: ${limits_memory}      # Maximum memory the container may use
                  cpu: ${limits_cpu}            # Maximum CPU the container may use
              volumeMounts:                     # Mount paths inside the container
                - name: shared-memory-volume
                  mountPath: /dev/shm
                - name: ascend-driver           # Driver mount; do not change
                  mountPath: /usr/local/Ascend/driver
                - name: ascend-add-ons          # Driver mount; do not change
                  mountPath: /usr/local/Ascend/add-ons
                - name: localtime
                  mountPath: /etc/localtime
                - name: hccn                    # Driver hccn configuration; do not change
                  mountPath: /etc/hccn.conf
                - name: npu-smi                 # npu-smi
                  mountPath: /usr/local/sbin/npu-smi
                - name: ascend-install
                  mountPath: /etc/ascend_install.info
                - name: log
                  mountPath: /var/log/npu/
                - name: sfs-volume
                  mountPath: /mnt/sfs_turbo
          nodeSelector:
            accelerator/huawei-npu: ascend-1980
          volumes:                              # Paths on the physical host
            - name: shared-memory-volume        # Shared memory
              emptyDir:
                medium: Memory
                sizeLimit: "200Gi"
            - name: ascend-driver
              hostPath:
                path: /usr/local/Ascend/driver
            - name: ascend-add-ons
              hostPath:
                path: /usr/local/Ascend/add-ons
            - name: localtime
              hostPath:
                path: /etc/localtime
            - name: hccn
              hostPath:
                path: /etc/hccn.conf
            - name: npu-smi
              hostPath:
                path: /usr/local/sbin/npu-smi
            - name: ascend-install
              hostPath:
                path: /etc/ascend_install.info
            - name: log
              hostPath:
                path: /usr/slog
            - name: sfs-volume
              persistentVolumeClaim:
                claimName: ${pvc_name}          # Name of the PVC already created
          restartPolicy: OnFailure
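The ${} placeholders in the template can be filled in by hand, or with a small script. The following is a minimal sketch, not part of the original procedure, that uses Python's standard `string.Template`, whose `${name}` syntax happens to match the template's placeholders. Every value in `params` (the image address, the training command, the PVC name) is a hypothetical stand-in, and `render_config` is a helper name introduced here for illustration.

```python
from string import Template

# Hypothetical values for illustration only; replace them with your own.
params = {
    "image_name": "swr.example.com/my-org/my-image:1.0",  # hypothetical SWR address
    "container_name": "ascendspeed",
    "command": "sh /mnt/sfs_turbo/run_train.sh",          # hypothetical command
    "requests_memory": "3200Mi",
    "requests_cpu": "2650m",
    "limits_memory": "1500Gi",
    "limits_cpu": "192",
    "pvc_name": "my-sfs-pvc",                             # hypothetical PVC name
}


def render_config(template_text: str, values: dict) -> str:
    """Replace every ${name} placeholder; raises KeyError if a value is missing."""
    return Template(template_text).substitute(values)


# A short excerpt of the template, enough to show the substitution.
snippet = """\
containers:
  - image: ${image_name}
    name: ${container_name}
    args:
      - ${command}
"""

rendered = render_config(snippet, params)
print(rendered)
```

Because the values are inserted as plain text, a command that contains YAML-significant characters (for example `: ` or ` #`) would need to be quoted in the template before substitution.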
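The config.yaml template for two-node or multi-node training enables distributed training across machines. Compared with the single-node template, the tasks block contains one or more additional task entries, each starting with its own name field.
The two-node config.yaml template is as follows: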
apiVersion: v1
kind: ConfigMap
metadata:
  name: configmap1980-vcjob            # Keep the "configmap1980-" prefix, followed by the vcjob name
  namespace: default                   # Any namespace; must be the same namespace as the vcjob below
  labels:
    ring-controller.cce: ascend-1980   # Do not change
data:                                  # Do not change; the volcano plugin rewrites this after initialization
  jobstart_hccl.json: |
    {
        "status":"initializing"
    }
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: vcjob                          # Job name; must correspond to the name in the ConfigMap
  namespace: default                   # Same as the ConfigMap
  labels:
    ring-controller.cce: ascend-1980   # Do not change
    fault-scheduling: "force"
spec:
  minAvailable: 1
  schedulerName: volcano               # Do not change
  policies:
    - event: PodEvicted
      action: RestartJob
  plugins:
    configmap1980:
      - --rank-table-version=v2        # Do not change; generates a v2 rank table file
    env: []
    svc:
      - --publish-not-ready-addresses=true
  maxRetry: 5
  queue: default
  tasks:
    - name: main
      replicas: 1
      template:
        metadata:
          name: training
          labels:
            app: ascendspeed
            ring-controller.cce: ascend-1980   # Do not change
        spec:
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                      - key: volcano.sh/job-name
                        operator: In
                        values:
                          - vcjob
                  topologyKey: kubernetes.io/hostname
          hostNetwork: true            # Use the host network mode
          containers:
            - image: ${image_name}     # Image address
              # IfNotPresent: pull only when the image is absent on the host (the default for tags other than :latest)
              # Always: pull the image every time a Pod is created
              # Never: the Pod never pulls the image on its own
              imagePullPolicy: IfNotPresent
              name: ${container_name}
              securityContext:         # Run as root inside the container
                allowPrivilegeEscalation: false
                runAsUser: 0
              env:
                - name: name
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.name
                - name: ip
                  valueFrom:
                    fieldRef:
                      fieldPath: status.hostIP
                - name: framework
                  value: "PyTorch"
              command: ["/bin/sh", "-c"]
              args:
                - ${command}
              resources:
                requests:
                  huawei.com/ascend-1980: "8"   # Number of cards requested; keep the key unchanged
                  memory: ${requests_memory}    # Minimum memory requested by the container
                  cpu: ${requests_cpu}          # Minimum CPU requested by the container
                limits:
                  huawei.com/ascend-1980: "8"   # Card limit; keep the key unchanged
                  memory: ${limits_memory}      # Maximum memory the container may use
                  cpu: ${limits_cpu}            # Maximum CPU the container may use
              volumeMounts:                     # Mount paths inside the container
                - name: shared-memory-volume
                  mountPath: /dev/shm
                - name: ascend-driver           # Driver mount; do not change
                  mountPath: /usr/local/Ascend/driver
                - name: ascend-add-ons          # Driver mount; do not change
                  mountPath: /usr/local/Ascend/add-ons
                - name: localtime
                  mountPath: /etc/localtime
                - name: hccn                    # Driver hccn configuration; do not change
                  mountPath: /etc/hccn.conf
                - name: npu-smi                 # npu-smi
                  mountPath: /usr/local/sbin/npu-smi
                - name: ascend-install
                  mountPath: /etc/ascend_install.info
                - name: log
                  mountPath: /var/log/npu/
                - name: sfs-volume
                  mountPath: /mnt/sfs_turbo
          nodeSelector:
            accelerator/huawei-npu: ascend-1980
          volumes:                              # Paths on the physical host
            - name: shared-memory-volume        # Shared memory
              emptyDir:
                medium: Memory
                sizeLimit: "200Gi"
            - name: ascend-driver
              hostPath:
                path: /usr/local/Ascend/driver
            - name: ascend-add-ons
              hostPath:
                path: /usr/local/Ascend/add-ons
            - name: localtime
              hostPath:
                path: /etc/localtime
            - name: hccn
              hostPath:
                path: /etc/hccn.conf
            - name: npu-smi
              hostPath:
                path: /usr/local/sbin/npu-smi
            - name: ascend-install
              hostPath:
                path: /etc/ascend_install.info
            - name: log
              hostPath:
                path: /usr/slog
            - name: sfs-volume
              persistentVolumeClaim:
                claimName: ${pvc_name}          # Name of the PVC already created
          restartPolicy: OnFailure
    - name: work
      replicas: 1
      template:
        metadata:
          name: training
          labels:
            app: ascendspeed
            ring-controller.cce: ascend-1980   # Do not change
        spec:
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                      - key: volcano.sh/job-name
                        operator: In
                        values:
                          - vcjob
                  topologyKey: kubernetes.io/hostname
          hostNetwork: true            # Use the host network mode
          containers:
            - image: ${image_name}     # Image address
              # IfNotPresent: pull only when the image is absent on the host (the default for tags other than :latest)
              # Always: pull the image every time a Pod is created
              # Never: the Pod never pulls the image on its own
              imagePullPolicy: IfNotPresent
              name: ${container_name}
              securityContext:         # Run as root inside the container
                allowPrivilegeEscalation: false
                runAsUser: 0
              env:
                - name: name
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.name
                - name: ip
                  valueFrom:
                    fieldRef:
                      fieldPath: status.hostIP
                - name: framework
                  value: "PyTorch"
              command: ["/bin/sh", "-c"]
              args:
                - ${command}
              resources:
                requests:
                  huawei.com/ascend-1980: "8"   # Number of cards requested; keep the key unchanged
                  memory: ${requests_memory}    # Minimum memory requested by the container
                  cpu: ${requests_cpu}          # Minimum CPU requested by the container
                limits:
                  huawei.com/ascend-1980: "8"   # Card limit; keep the key unchanged
                  memory: ${limits_memory}      # Maximum memory the container may use
                  cpu: ${limits_cpu}            # Maximum CPU the container may use
              volumeMounts:                     # Mount paths inside the container
                - name: shared-memory-volume
                  mountPath: /dev/shm
                - name: ascend-driver           # Driver mount; do not change
                  mountPath: /usr/local/Ascend/driver
                - name: ascend-add-ons          # Driver mount; do not change
                  mountPath: /usr/local/Ascend/add-ons
                - name: localtime
                  mountPath: /etc/localtime
                - name: hccn                    # Driver hccn configuration; do not change
                  mountPath: /etc/hccn.conf
                - name: npu-smi                 # npu-smi
                  mountPath: /usr/local/sbin/npu-smi
                - name: ascend-install
                  mountPath: /etc/ascend_install.info
                - name: log
                  mountPath: /var/log/npu/
                - name: sfs-volume
                  mountPath: /mnt/sfs_turbo
          nodeSelector:
            accelerator/huawei-npu: ascend-1980
          volumes:                              # Paths on the physical host
            - name: shared-memory-volume        # Shared memory
              emptyDir:
                medium: Memory
                sizeLimit: "200Gi"
            - name: ascend-driver
              hostPath:
                path: /usr/local/Ascend/driver
            - name: ascend-add-ons
              hostPath:
                path: /usr/local/Ascend/add-ons
            - name: localtime
              hostPath:
                path: /etc/localtime
            - name: hccn
              hostPath:
                path: /etc/hccn.conf
            - name: npu-smi
              hostPath:
                path: /usr/local/sbin/npu-smi
            - name: ascend-install
              hostPath:
                path: /etc/ascend_install.info
            - name: log
              hostPath:
                path: /usr/slog
            - name: sfs-volume
              persistentVolumeClaim:
                claimName: ${pvc_name}          # Name of the PVC already created
          restartPolicy: OnFailure
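Since the two task entries differ only in their name, the tasks list for a multi-node job could in principle be generated rather than copied by hand. The sketch below illustrates that idea with plain Python dictionaries. `build_tasks` is a hypothetical helper; the names "main" and "work" come from the template above, while the names used from the third node onward ("work-2", "work-3", ...) are an assumption made only to keep task names unique. The pod template here is abbreviated to two fields for brevity.

```python
import copy
import json


def build_tasks(pod_template: dict, num_nodes: int) -> list:
    """Return one task entry per node, each with replicas: 1 and its own template copy."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    tasks = []
    for i in range(num_nodes):
        if i == 0:
            name = "main"          # first task, as in the template
        elif i == 1:
            name = "work"          # second task, as in the two-node template
        else:
            name = f"work-{i}"     # assumed naming for further nodes
        tasks.append({
            "name": name,
            "replicas": 1,
            # Deep copy so that later edits to one task do not affect the others.
            "template": copy.deepcopy(pod_template),
        })
    return tasks


# Abbreviated pod template; the real one carries the full spec shown above.
pod_template = {"metadata": {"name": "training"}, "spec": {"hostNetwork": True}}
tasks = build_tasks(pod_template, 3)
print(json.dumps(tasks, indent=2))
```

The result is printed as JSON, which YAML parsers also accept, so the output can be compared against the tasks block of the template.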
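Parameter description:
- ${container_name}: the container name. You can choose any name, for example ascendspeed.
- ${image_name}: the link to the image uploaded to SWR in Step 5: Modifying and Uploading the Image.
- ${command}: the command that runs automatically in the container after the pod is created from config.yaml. The command to substitute is given in the section on running the training task.
- /mnt/sfs_turbo: the default working directory on the host where SFS Turbo is mounted. It holds the code, data, and other files needed for training.
- Likewise, /mnt/sfs_turbo can be mapped into the container as the directory through which the container accesses the host directory. The host and the container use different file systems; the two paths may be identical for ease of access.
- ${pvc_name}: the name of the PVC created in the step that associates SFS Turbo with the CCE cluster.
- When setting the CPU and memory sizes for the container, run the following command to view the actual CPU and memory information of the node you have been allocated.
kubectl describe node
- ${requests_cpu}: the minimum number of CPU cores requested for the container. You can use the value shown under Requests, for example 2650m.
- ${requests_memory}: the minimum amount of memory requested for the container. You can use the value shown under Requests, for example 3200Mi.
- ${limits_cpu}: the maximum number of CPU cores the container may use, for example 192.
- ${limits_memory}: the maximum amount of memory the container may use, for example 1500Gi after unit conversion.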
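The CPU and memory values above use Kubernetes quantity notation, where the suffix m means thousandths of a core and Ki, Mi, Gi, Ti are powers of 1024 bytes. The following minimal sketch shows how such values can be converted and compared, for instance to confirm that the requests do not exceed the limits. `cpu_to_cores` and `memory_to_bytes` are helper names introduced here for illustration, and only the suffixes listed are handled; other notations (decimal suffixes such as M or G, or exponents) are deliberately out of scope.

```python
from decimal import Decimal

# Binary memory suffixes and their size in bytes.
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def cpu_to_cores(quantity: str) -> Decimal:
    """Convert a CPU quantity such as '2650m' or '192' to a number of cores."""
    if quantity.endswith("m"):
        return Decimal(quantity[:-1]) / 1000   # millicores to cores
    return Decimal(quantity)


def memory_to_bytes(quantity: str) -> int:
    """Convert a memory quantity such as '3200Mi' or '1500Gi' to bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[:-len(suffix)]) * factor)
    return int(quantity)                       # no suffix: plain bytes


# Example values taken from the parameter descriptions above.
requests_cpu, limits_cpu = "2650m", "192"
requests_memory, limits_memory = "3200Mi", "1500Gi"

cpu_ok = cpu_to_cores(requests_cpu) <= cpu_to_cores(limits_cpu)
memory_ok = memory_to_bytes(requests_memory) <= memory_to_bytes(limits_memory)
print("CPU request within limit:", cpu_ok)
print("Memory request within limit:", memory_ok)
```

A value that the node reports in Ki can be converted to whole Gi with `memory_to_bytes(value) // 2**30`.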
support.huaweicloud.com/bestpractice-modelarts/modelarts_llm_train_5905038.html