主流开源大模型基于Lite Cluster适配PyTorch NPU推理指导（6.3.911）-华为云

AI开发平台MODELARTS-Eagle投机小模型训练:步骤五：训练生成权重转换成可以支持vLLM推理的格式

步骤五：训练生成权重转换成可以支持vLLM推理的格式将训练完成后的权重文件（.bin文件或. safetensors文件），移动到下载好的开源权重目录下（即步骤4中，config文件所在目录）。然后在llm_tools/spec_decode/EAGLE文件夹，执行 python convert_eagle_ckpt_to_vllm_compatible.py --base-path 大模型权重地址 --draft-path 小模型权重地址 --base-weight-name 大模型包含lm_head的权重文件名 --draft-weight-name 小模型权重文件名 --base-path：为大模型权重地址，例如 ./llama2-7b-chat --draft-path：小模型权重地址，即步骤四中config文件所在目录，例如 ./eagle_llama2-7b-chat --base-weight-name：为大模型包含lm_head的权重文件名，可以在base-path目录下的 model.safetensors.index.json 文件获取，例如llama2-7b-chat的权重名为pytorch_model-00001-of-00002.bin 图3 权重文件名 --draft-weight-name 为小模型权重文件名，即刚才移动的.bin文件或者.safetensors文件。

AI开发平台MODELARTS 主流开源大模型基于Lite Cluster适配PyTorch NPU推理指导（6.3.911）

AI开发平台MODELARTS-Eagle投机小模型训练:步骤二：非sharegpt格式数据集转换（可选）

步骤二：非sharegpt格式数据集转换（可选）如果数据集json文件不是sharegpt格式，而是常见的如下格式，则需要执行convert_to_sharegpt.py 文件将数据集转换为share gpt格式。 { "prefix": "AAA" "input": "BBB", "output": "CCC" } 执行convert_to_sharegpt.py 文件。 python convert_to_sharegpt.py \ --input_file_path data_test.json \ --out_file_name ./data_for_sharegpt.json \ --prefix_name instruction \ --input_name input \ --output_name output \ --code_type utf-8 其中： input_file_path：预训练json文件地址。 out_file_name：输出的sharegpt格式文件地址。 prefix_name：预训练json文件的前缀字段名称，例如：您是一个xxx专家，您需要回答下面问题。prefix_name可设置为None，此时预训练数据集只有input和output两段输入。 input_name：预训练json文件的指令输入字段名称，例如：请问苹果是什么颜色。 output_name output：预训练json文件的output字段名称，例如：苹果是红色的。 code_type：预训练json文件编码，默认utf-8。当转换为sharegpt格式时，prefix和input会拼接成一段文字，作为human字段，提出问题，而output字段会作为gpt字段，做出回答。

AI开发平台MODELARTS 主流开源大模型基于Lite Cluster适配PyTorch NPU推理指导（6.3.911）

AI开发平台MODELARTS-Eagle投机小模型训练:步骤四：执行训练

步骤四：执行训练安装完成后，执行： accelerate launch -m --mixed_precision=bf16 eagle.train.main \ --tmpdir [path of data] \ --cpdir [path of checkpoints] \ --configpath [path of config file] \ --basepath [path of base_model] --bs [batch size] tmpdir：即为步骤三中的outdir，训练data地址 cpdir：为训练生成权重的地址 configpath：为模型config文件的地址 basepath：为大模型权重地址 bs：为batch大小其中，要获取模型config文件，首先到https://github.com/SafeAILab/EAGLE/页找到对应eagle模型地址。图1 EAGLE Weights 以llama2-chat-7B为例，单击进入后，如下图所示config文件，即为对应模型的eagle config文件。图2 eagle config文件

AI开发平台MODELARTS 主流开源大模型基于Lite Cluster适配PyTorch NPU推理指导（6.3.911）

AI开发平台MODELARTS-部署推理服务:Step4 创建pod

Step4 创建pod 在节点自定义目录${node_path}下执行如下命令创建pod。 kubectl apply -f config.yaml 检查pod启动情况，执行下述命令。如果显示“1/1 running”状态代表启动成功。 kubectl get pod -A 图1 启动pod成功执行如下命令查看pod日志，如果打印类似下图信息表示服务启动成功。 kubectl logs -f ${pod_name} 参数说明： ${pod_name}：pod名，例如图1${pod_name}为yourapp-87d9b5b46-c46bk。图2 启动服务成功

AI开发平台MODELARTS 主流开源大模型基于Lite Cluster适配PyTorch NPU推理指导（6.3.911）

AI开发平台MODELARTS-部署推理服务:Step2 配置pod

Step2 配置pod 在节点自定义目录${node_path}下创建config.yaml文件 apiVersion: apps/v1 kind: Deployment metadata: name: yourapp labels: app: infers spec: replicas: 1 selector: matchLabels: app: infers template: metadata: labels: app: infers spec: schedulerName: volcano nodeSelector: accelerator/huawei-npu: ascend-1980 containers: - image: ${image_name} # 推理镜像名称 imagePullPolicy: IfNotPresent name: ${container_name} securityContext: runAsUser: 0 ports: - containerPort: 8080 command: ["/bin/bash", "-c"] args: ["${node-path}/run_vllm.sh"] # 节点自定义目录，该目录下包含pod配置文件config.yaml和推理服务启动脚本run_vllm.sh resources: requests: huawei.com/ascend-1980: "8" # 需求卡数，key保持不变。 limits: huawei.com/ascend-1980: "8" # 限制卡数，key保持不变。 volumeMounts: # 容器内部映射路径 - name: ascend-driver #驱动挂载，保持不动 mountPath: /usr/local/Ascend/driver - name: ascend-add-ons #驱动挂载，保持不动 mountPath: /usr/local/Ascend/add-ons - name: hccn #驱动hccn配置，保持不动 mountPath: /etc/hccn.conf - name: localtime mountPath: /etc/localtime - name: npu-smi # npu-smi mountPath: /usr/local/sbin/npu-smi - name: model-path # 模型权重路径 mountPath: ${model-path} - name: node-path mountPath: ${node-path} volumes: # 物理机外部路径 - name: ascend-driver hostPath: path: /usr/local/Ascend/driver - name: ascend-add-ons hostPath: path: /usr/local/Ascend/add-ons - name: hccn hostPath: path: /etc/hccn.conf - name: localtime hostPath: path: /etc/localtime - name: npu-smi hostPath: path: /usr/local/sbin/npu-smi - name: model-path hostPath: path: ${model-path} - name: node-path hostPath: path: ${node-path} 参数说明： ${container_name}：容器名称，此处可以自己定义一个容器名称，例如ascend-vllm。 ${image_name}：Step3 制作推理镜像构建的推理镜像名称。 ${node-path}：节点自定义目录，该目录下包含pod配置文件config.yaml和推理服务启动脚本run_vllm.sh，run_vllm.sh内容见Step3 创建服务启动脚本。 ${model-path}：Step1 上传权重文件中上传的模型权重路径。

AI开发平台MODELARTS 主流开源大模型基于Lite Cluster适配PyTorch NPU推理指导（6.3.911）

AI开发平台MODELARTS-附录：大模型推理常见问题:问题2：在推理预测过程中遇到ValueError:User-specified max_model_len is greater than the drived max_model_len

问题2：在推理预测过程中遇到ValueError:User-specified max_model_len is greater than the drived max_model_len 解决方法：修改config.json文件中的"seq_length"的值，"seq_length"需要大于等于 --max-model-len的值。config.json存在模型对应的路径下，例如：/data/nfs/benchmark/tokenizer/chatglm3-6b/config.json

AI开发平台MODELARTS 主流开源大模型基于Lite Cluster适配PyTorch NPU推理指导（6.3.911）

AI开发平台MODELARTS-附录：大模型推理常见问题:问题8：使用benchmark-tools对GLM系列模型进行性能测试报错

问题8：使用benchmark-tools对GLM系列模型进行性能测试报错使用benchmark-tools对GLM系列模型进行性能测试报错TypeError: _pad() got an unexpected keyword argument 'padding_side' 解决方法： 1、下载最新的tokenization_chatglm.py，替换原来权重里的tokenization_chatglm.py。 https://huggingface.co/THUDM/glm-4-9b-chat/blob/main/tokenization_chatglm.py https://huggingface.co/THUDM/chatglm3-6b/blob/main/tokenization_chatglm.py 或者2、修改tokenization_chatglm.py，在266行增加padding_side: str = "left"，如图1所示。图1 tokenization_chatglm.py

AI开发平台MODELARTS 主流开源大模型基于Lite Cluster适配PyTorch NPU推理指导（6.3.911）

AI开发平台MODELARTS-附录：大模型推理常见问题:问题9：使用benchmark-tools访问推理服务返回报错

问题9：使用benchmark-tools访问推理服务返回报错使用benchmark-tools访问推理服务时，输入输出的token和大于max_model_len，服务端返回报错Response payload is not completed，见图2。再次设置输入输出的token和小于max_model_len访问推理服务，服务端响应200，见图3。客户端仍返回报错Response payload is not completed，见图4。图2 服务端返回报错Response payload is not completed 图3 服务端响应200 图4 仍返回报错Response payload is not completed 解决方法：安装brotlipy后返回正确报错 pip install brotlipy

AI开发平台MODELARTS 主流开源大模型基于Lite Cluster适配PyTorch NPU推理指导（6.3.911）

AI开发平台MODELARTS-附录：大模型推理常见问题:问题13：使用SmoothQuant做权重转换时报错

问题13：使用SmoothQuant做权重转换时报错图8 权重转换报错涉及模型：qwen2-1.5b, qwen2-0.5b 解决方法：修改AscendCloud/AscendCloud-LLM/llm_tools/AutoSmoothQuant/autosmoothquant/examples/smoothquant_model.py中的main函数，保存模型时将safe_serialization指定为False int8_model.save_pretrained(output_path,safe_serialization=False)

AI开发平台MODELARTS 主流开源大模型基于Lite Cluster适配PyTorch NPU推理指导（6.3.911）

AI开发平台MODELARTS-附录：大模型推理常见问题:问题3：使用llama3.1系列模型进行推理时报错

问题3：使用llama3.1系列模型进行推理时报错使用llama3.1系模型进行推理时报错：ValueError: 'rope_scaling' must be a dictionary with two fields, 'type' and 'factor', got {'factor': 8.0, 'low_freq_factor': 1.0, 'high_freq_factor': 4.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}

AI开发平台MODELARTS 主流开源大模型基于Lite Cluster适配PyTorch NPU推理指导（6.3.911）

AI开发平台MODELARTS-附录：大模型推理常见问题:问题4：使用SmoothQuant进行W8A8进行模型量化时报错

问题4：使用SmoothQuant进行W8A8进行模型量化时报错使用SmoothQuant进行W8A8进行模型量化时报错：AttributeError: type object 'LlamaAttention' has no attribute '_init_rope' 解决方法：降低transformers版本到4.42 pip install transformers==4.42 --upgrade

AI开发平台MODELARTS 主流开源大模型基于Lite Cluster适配PyTorch NPU推理指导（6.3.911）

AI开发平台MODELARTS-附录：大模型推理常见问题:问题5：使用AWQ转换llama3.1系列模型权重出现报错

问题5：使用AWQ转换llama3.1系列模型权重出现报错使用AWQ转换llama3.1系列模型权重出现报错：ValueError: 'rope_scaling' must be a dictionary with two fields, 'type' and 'factor' 解决方法：该问题通过将transformers升级到4.44.0，修改对应transformers中的transformers/models/llama/modeling_llama.py，在class LlamaRotaryEmbedding中的forward函数中增加self.inv_freq = self.inv_freq.npu()

AI开发平台MODELARTS 主流开源大模型基于Lite Cluster适配PyTorch NPU推理指导（6.3.911）

AI开发平台MODELARTS-附录：大模型推理常见问题:问题12：使用SmoothQuant做权重转换时，scale显示为nan或推理时精度异常

问题12：使用SmoothQuant做权重转换时，scale显示为nan或推理时精度异常图7 权重转换scale显示为nan 涉及模型：qwen2-1.5b, qwen2-7b 解决方法：修改AscendCloud/AscendCloud-LLM/llm_tools/AutoSmoothQuant/autosmoothquant/utils/utils.py中的build_model_and_tokenizer函数，将torch_dtype类型从torch.float16改成torch.bfloat16 kwargs = {"torch_dtype": torch.bfloat16, "device_map": "auto"}

AI开发平台MODELARTS 主流开源大模型基于Lite Cluster适配PyTorch NPU推理指导（6.3.911）

AI开发平台MODELARTS-推理精度测试:步骤一：配置精度测试环境

步骤一：配置精度测试环境精度评测可以在原先conda环境，进入到一个固定目录下，执行如下命令。 rm -rf lm-evaluation-harness/ git clone https://github.com/EleutherAI/lm-evaluation-harness.git cd lm-evaluation-harness git checkout 383bbd54bc621086e05aa1b030d8d4d5635b25e6 pip install -e . 执行如下精度测试命令，可以根据参数说明修改参数。 lm_eval --model vllm --model_args pretrained=${vllm_path},dtype=auto,tensor_parallel_size=${tensor_parallel_size},gpu_memory_utilization=${gpu_memory_utilization},add_bos_token=True,max_model_len=${max_model_len},quantization=${quantization} \ --tasks ${task} --batch_size ${batch_size} --log_samples --cache_requests true --trust_remote_code --output_path ${output_path} 参数说明: model_args：标志向模型构造函数提供额外参数，比如指定运行模型的数据类型； vllm_path是模型权重路径； max_model_len 是最大模型长度，默认设置为4096； gpu_memory_utilization是gpu利用率，如果模型出现oom报错，调小参数； tensor_parallel_size是使用的卡数； quantization是量化参数，使用非量化权重，去掉quantization参数；如果使用awq、smoothquant或者gptq加载的量化权重，根据量化方式选择对应参数，可选awq，smoothquant，gptq。 model：模型启动模式，可选vllm，openai或hf，hf代表huggingface。 tasks：评测数据集任务，比如openllm。 batch_size：输入的batch_size大小，不影响精度，只影响得到结果速度，默认使用auto，代表自动选择batch大小。 output_path：结果保存路径。使用lm-eval，比如加载非量化或者awq量化，llama3.2-1b模型的权重，参考命令： lm_eval --model vllm --model_args pretrained="/data/nfs/benchmark/tokenizer/Llama-3.2-1B-Instruct/",dtype=auto,tensor_parallel_size=1,gpu_memory_utilization=0.7,add_bos_token=True,max_model_len=4096 \ --tasks openllm --batch_size auto --log_samples --cache_requests true --trust_remote_code --output_path ./ 使用lm-eval，比如smoothquant量化，llama3.1-70b模型的权重，参考命令： lm_eval --model vllm --model_args pretrained="/data/nfs/benchmark/tokenizer_w8a8/llama3.1-70b/",dtype=auto,tensor_parallel_size=4,gpu_memory_utilization=0.7,add_bos_token=True,max_model_len=4096,quantization="smoothquant" \ --tasks openllm --batch_size auto --log_samples --cache_requests true --trust_remote_code --output_path ./

AI开发平台MODELARTS 主流开源大模型基于Lite Cluster适配PyTorch NPU推理指导（6.3.911）

AI开发平台MODELARTS-推理精度测试:约束限制

约束限制确保容器可以访问公网。当前的精度测试仅适用于语言模型精度验证，不适用于多模态模型的精度验证。多模态模型的精度验证，建议使用开源MME数据集和工具（GitHub - BradyFU/Awesome-Multimodal-Large-Language-Models at Evaluation）。配置需要使用的NPU卡，例如：实际使用的是第1张和第2张卡，此处填写为“0,1”，以此类推。 export ASCEND_RT_VISIBLE_DEVI CES =0,1

AI开发平台MODELARTS 主流开源大模型基于Lite Cluster适配PyTorch NPU推理指导（6.3.911）

云服务器内容精选

主流开源大模型基于Lite Cluster适配PyTorch NPU推理指导（6.3.911）