主流开源大模型基于Lite Server适配PyTorch NPU推理指导（6.3.911）-华为云

AI开发平台MODELARTS-Eagle投机小模型训练:步骤五：训练生成权重转换成可以支持vLLM推理的格式

步骤五：训练生成权重转换成可以支持vLLM推理的格式将训练完成后的权重文件（.bin文件或. safetensors文件），移动到下载好的开源权重目录下（即步骤4中，config文件所在目录）。然后在llm_tools/spec_decode/EAGLE文件夹，执行 python convert_eagle_ckpt_to_vllm_compatible.py --base-path 大模型权重地址 --draft-path 小模型权重地址 --base-weight-name 大模型包含lm_head的权重文件名 --draft-weight-name 小模型权重文件名 --base-path：为大模型权重地址，例如 ./llama2-7b-chat --draft-path：小模型权重地址，即步骤四中config文件所在目录，例如 ./eagle_llama2-7b-chat --base-weight-name：为大模型包含lm_head的权重文件名，可以在base-path目录下的 model.safetensors.index.json 文件获取，例如llama2-7b-chat的权重名为pytorch_model-00001-of-00002.bin 图3 权重文件名 --draft-weight-name 为小模型权重文件名，即刚才移动的.bin文件或者.safetensors文件。

AI开发平台MODELARTS 主流开源大模型基于Lite Server适配PyTorch NPU推理指导（6.3.911）

AI开发平台MODELARTS-Eagle投机小模型训练:步骤二：非sharegpt格式数据集转换（可选）

步骤二：非sharegpt格式数据集转换（可选）如果数据集json文件不是sharegpt格式，而是常见的如下格式，则需要执行convert_to_sharegpt.py 文件将数据集转换为share gpt格式。 { "prefix": "AAA" "input": "BBB", "output": "CCC" } 执行convert_to_sharegpt.py 文件。 python convert_to_sharegpt.py \ --input_file_path data_test.json \ --out_file_name ./data_for_sharegpt.json \ --prefix_name instruction \ --input_name input \ --output_name output \ --code_type utf-8 其中： input_file_path：预训练json文件地址。 out_file_name：输出的sharegpt格式文件地址。 prefix_name：预训练json文件的前缀字段名称，例如：您是一个xxx专家，您需要回答下面问题。prefix_name可设置为None，此时预训练数据集只有input和output两段输入。 input_name：预训练json文件的指令输入字段名称，例如：请问苹果是什么颜色。 output_name output：预训练json文件的output字段名称，例如：苹果是红色的。 code_type：预训练json文件编码，默认utf-8。当转换为sharegpt格式时，prefix和input会拼接成一段文字，作为human字段，提出问题，而output字段会作为gpt字段，做出回答。

AI开发平台MODELARTS 主流开源大模型基于Lite Server适配PyTorch NPU推理指导（6.3.911）

AI开发平台MODELARTS-Eagle投机小模型训练:步骤四：执行训练

步骤四：执行训练安装完成后，执行： accelerate launch -m --mixed_precision=bf16 eagle.train.main \ --tmpdir [path of data] \ --cpdir [path of checkpoints] \ --configpath [path of config file] \ --basepath [path of base_model] --bs [batch size] tmpdir：即为步骤三中的outdir，训练data地址 cpdir：为训练生成权重的地址 configpath：为模型config文件的地址 basepath：为大模型权重地址 bs：为batch大小其中，要获取模型config文件，首先到https://github.com/SafeAILab/EAGLE/页找到对应eagle模型地址。图1 EAGLE Weights 以llama2-chat-7B为例，单击进入后，如下图所示config文件，即为对应模型的eagle config文件。图2 eagle config文件

AI开发平台MODELARTS 主流开源大模型基于Lite Server适配PyTorch NPU推理指导（6.3.911）

AI开发平台MODELARTS-推理场景介绍:模型软件包结构说明

模型软件包结构说明本教程需要使用到的AscendCloud-6.3.911中的AscendCloud-LLM-xxx.zip软件包和算子包AscendCloud-OPP，AscendCloud-LLM关键文件介绍如下。 |——AscendCloud-LLM ├──llm_inference # 推理代码 ├──ascend_vllm ├── vllm_npu # 推理源码 ├── ascend_vllm-0.6.3-py3-none-any.whl # 推理安装包 ├── build.sh # 推理构建脚本 ├── vllm_install.patch # 社区昇腾适配的补丁包 ├── Dockerfile # 推理构建镜像dockerfile ├── build_image.sh # 推理构建镜像启动脚本 ├──llm_tools # 推理工具包 ├──AutoSmoothQuant # W8A8量化工具 ├── ascend_autosmoothquant_adapter # 昇腾量化使用的算子模块 ├── autosmoothquant_ascend # 量化代码 ├── build.sh # 安装量化模块的脚本 ├──AutoAWQ # W4A16量化工具 ├──convert_awq_to_npu.py # awq权重转换脚本 ├──quantize.py # 昇腾适配的量化转换脚本 ├──build.sh # 安装量化模块的脚本 ├──llm_evaluation # 推理评测代码包 ├──benchmark_tools #性能评测 ├── benchmark.py # 可以基于默认的参数跑完静态benchmark和动态benchmark ├── benchmark_parallel.py # 评测静态性能脚本 ├── benchmark_serving.py # 评测动态性能脚本 ├── benchmark_utils.py # 抽离的工具集 ├── generate_datasets.py # 生成自定义数据集的脚本 ├── requirements.txt # 第三方依赖 ├──benchmark_eval #精度评测 ├──opencompass.sh #运行opencompass脚本 ├──install.sh #安装opencompass脚本 ├──vllm_api.py #启动vllm api服务器 ├──vllm.py #构造vllm评测配置脚本名字

AI开发平台MODELARTS 主流开源大模型基于Lite Server适配PyTorch NPU推理指导（6.3.911）

AI开发平台MODELARTS-推理场景介绍:支持的模型列表和权重文件

支持的模型列表和权重文件本方案支持vLLM的v0.6.3版本。不同vLLM版本支持的模型列表有差异，具体如表3所示。表3 支持的模型列表和权重获取地址序号模型名称是否支持fp16/bf16推理是否支持W4A16量化是否支持W8A8量化是否支持W8A16量化是否支持 kv-cache-int8量化开源权重获取地址 1 llama-7b √ √ √ √ √ https://huggingface.co/huggyllama/llama-7b 2 llama-13b √ √ √ √ √ https://huggingface.co/huggyllama/llama-13b 3 llama-65b √ √ √ √ √ https://huggingface.co/huggyllama/llama-65b 4 llama2-7b √ √ √ √ √ https://huggingface.co/meta-llama/Llama-2-7b-chat-hf 5 llama2-13b √ √ √ √ √ https://huggingface.co/meta-llama/Llama-2-13b-chat-hf 6 llama2-70b √ √ √ √ √ https://huggingface.co/meta-llama/Llama-2-70b-hf https://huggingface.co/meta-llama/Llama-2-70b-chat-hf (推荐) 7 llama3-8b √ √ √ √ √ https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct 8 llama3-70b √ √ √ √ √ https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct 9 yi-6b √ √ √ √ √ https://huggingface.co/01-ai/Yi-6B-Chat 10 yi-9b √ √ √ √ √ https://huggingface.co/01-ai/Yi-9B 11 yi-34b √ √ √ √ √ https://huggingface.co/01-ai/Yi-34B-Chat 12 deepseek-llm-7b √ x x x x https://huggingface.co/deepseek-ai/deepseek-llm-7b-chat 13 deepseek-coder-33b-instruct √ x x x x https://huggingface.co/deepseek-ai/deepseek-coder-33b-instruct 14 deepseek-llm-67b √ x x x x https://huggingface.co/deepseek-ai/deepseek-llm-67b-chat 15 qwen-7b √ √ √ √ x https://huggingface.co/Qwen/Qwen-7B-Chat 16 qwen-14b √ √ √ √ x https://huggingface.co/Qwen/Qwen-14B-Chat 17 qwen-72b √ √ √ √ x https://huggingface.co/Qwen/Qwen-72B-Chat 18 qwen1.5-0.5b √ √ √ √ x https://huggingface.co/Qwen/Qwen1.5-0.5B-Chat 19 qwen1.5-7b √ √ √ √ x https://huggingface.co/Qwen/Qwen1.5-7B-Chat 20 qwen1.5-1.8b √ √ √ √ x https://huggingface.co/Qwen/Qwen1.5-1.8B-Chat 21 qwen1.5-14b √ √ √ √ x https://huggingface.co/Qwen/Qwen1.5-14B-Chat 22 qwen1.5-32b √ √ √ √ x https://huggingface.co/Qwen/Qwen1.5-32B/tree/main 23 qwen1.5-72b √ √ √ √ x https://huggingface.co/Qwen/Qwen1.5-72B-Chat 24 qwen1.5-110b √ √ √ √ x https://huggingface.co/Qwen/Qwen1.5-110B-Chat 25 qwen2-0.5b √ √ √ √ x https://huggingface.co/Qwen/Qwen2-0.5B-Instruct 26 qwen2-1.5b √ √ √ √ x https://huggingface.co/Qwen/Qwen2-1.5B-Instruct 27 qwen2-7b √ √ x √ x https://huggingface.co/Qwen/Qwen2-7B-Instruct 28 qwen2-72b √ √ √ √ x https://huggingface.co/Qwen/Qwen2-72B-Instruct 29 qwen2.5-0.5b √ √ √ √ x https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct 30 qwen2.5-1.5b √ √ √ √ x https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct 31 qwen2.5-3b √ √ √ √ x https://huggingface.co/Qwen/Qwen2.5-3B-Instruct 32 qwen2.5-7b √ √ x √ x https://huggingface.co/Qwen/Qwen2.5-7B-Instruct 33 qwen2.5-14b √ √ √ √ x https://huggingface.co/Qwen/Qwen2.5-14B-Instruct 34 qwen2.5-32b √ √ √ √ x https://huggingface.co/Qwen/Qwen2.5-32B-Instruct 35 qwen2.5-72b √ √ √ √ x https://huggingface.co/Qwen/Qwen2.5-72B-Instruct 36 baichuan2-7b √ x x √ x https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat 37 baichuan2-13b √ x x √ x https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat 38 gemma-2b √ x x x x https://huggingface.co/google/gemma-2b 39 gemma-7b √ x x x x https://huggingface.co/google/gemma-7b 40 chatglm2-6b √ x x x x https://huggingface.co/THUDM/chatglm2-6b 41 chatglm3-6b √ x x x x https://huggingface.co/THUDM/chatglm3-6b 42 glm-4-9b √ x x x x https://huggingface.co/THUDM/glm-4-9b-chat 43 mistral-7b √ x x x x https://huggingface.co/mistralai/Mistral-7B-v0.1 44 mixtral-8x7b √ x x x x https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1 45 falcon-11b √ x x x x https://huggingface.co/tiiuae/falcon-11B/tree/main 46 qwen2-57b-a14b √ x x x x https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct 47 llama3.1-8b √ √ √ √ x https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct 48 llama3.1-70b √ √ √ √ x https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct 49 llama-3.1-405B √ √ x x x https://huggingface.co/hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4 50 llama-3.2-1B √ x x x x Llama-3.2-1B-Instruct · 模型库 (modelscope.cn) 51 llama-3.2-3B √ x x x x Llama-3.2-3B-Instruct · 模型库 (modelscope.cn) 52 llava-1.5-7b √ x x x x https://huggingface.co/llava-hf/llava-1.5-7b-hf/tree/main 53 llava-1.5-13b √ x x x x https://huggingface.co/llava-hf/llava-1.5-13b-hf/tree/main 54 llava-v1.6-7b √ x x x x https://huggingface.co/llava-hf/llava-v1.6-vicuna-7b-hf/tree/main 55 llava-v1.6-13b √ x x x x https://huggingface.co/llava-hf/llava-v1.6-vicuna-13b-hf/tree/main 56 llava-v1.6-34b √ x x x x https://huggingface.co/llava-hf/llava-v1.6-34b-hf/tree/main 57 internvl2-8B √ x x x x https://huggingface.co/OpenGVLab/InternVL2-8B/tree/main 58 internvl2-26B √ x x x x https://huggingface.co/OpenGVLab/InternVL2-26B/tree/main 59 internvl2-40B √ x x x x https://huggingface.co/OpenGVLab/InternVL2-40B/tree/main 60 internVL2-Llama3-76B √ x x x x https://huggingface.co/OpenGVLab/InternVL2-Llama3-76B/tree/main 61 MiniCPM-v2.6 √ x x x x https://huggingface.co/openbmb/MiniCPM-V-2_6/tree/main 62 deepseek-v2-236b x x √ x x https://huggingface.co/deepseek-ai/DeepSeek-V2 63 deepseek-v2-lite-16b √ x √ x x https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite 64 qwen2-vl-2B √ x x x x https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct/tree/main 65 qwen2-vl-7B √ x x x x https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct/tree/main 66 qwen2-vl-72B √ x x x x https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct/tree/main 67 qwen-vl √ x x x x https://huggingface.co/Qwen/Qwen-VL 68 qwen-vl-chat √ x x x x https://huggingface.co/Qwen/Qwen-VL-Chat 69 MiniCPM-v2 √ x x x x https://huggingface.co/HwwwH/MiniCPM-V-2 注意：需要修改源文件site-packages/timm/layers/pos_embed.py，在第46行上面新增一行代码，如下： posemb = posemb.contiguous() #新增 posemb = F.interpolate(posemb, size=new_size, mode=interpolation, antialias=antialias) 各模型支持的卡数请参见附录：基于vLLM不同模型推理支持最小卡数和最大序列说明章节。

AI开发平台MODELARTS 主流开源大模型基于Lite Server适配PyTorch NPU推理指导（6.3.911）

AI开发平台MODELARTS-推理场景介绍:软件配套版本

软件配套版本本方案支持的软件配套版本和依赖包获取地址如表2所示。表2 软件配套版本和获取地址软件名称说明下载地址 AscendCloud-6.3.911-xxx.zip 说明：软件包名称中的xxx表示时间戳。包含了本教程中使用到的推理部署代码和推理评测代码、推理依赖的算子包。代码包具体说明请参见模型软件包结构说明。获取路径：Support-E，在此路径中查找下载ModelArts 6.3.911 版本。说明：如果上述软件获取路径打开后未显示相应的软件信息，说明您没有下载权限，请联系您所在企业的华为方技术支持下载获取。

AI开发平台MODELARTS 主流开源大模型基于Lite Server适配PyTorch NPU推理指导（6.3.911）

AI开发平台MODELARTS-推理场景介绍:约束限制

约束限制本方案目前仅适用于部分企业客户。本文档适配昇腾云ModelArts 6.3.911版本，请参考软件配套版本获取配套版本的软件包，请严格遵照版本配套关系使用本文档。资源规格推荐使用“西南-贵阳一”Region上的DevServer和昇腾Snt9B资源。推理部署使用的服务框架是vLLM。vLLM支持v0.6.3版本。支持FP16和BF16数据类型推理。适配的CANN版本是cann_8.0.rc3。 DevServer驱动版本要求23.0.6。

AI开发平台MODELARTS 主流开源大模型基于Lite Server适配PyTorch NPU推理指导（6.3.911）

AI开发平台MODELARTS-推理场景介绍:镜像版本

镜像版本本教程中用到基础镜像地址和配套版本关系如下表所示，请提前了解。表1 基础容器镜像地址镜像用途镜像地址配套版本基础镜像 swr.cn-southwest-2.myhuaweicloud.com/atelier/pytorch_2_1_ascend:pytorch_2.1.0-cann_8.0.rc3-py_3.9-hce_2.0.2409-aarch64-snt9b-20241112192643-c45ac6b cann_8.0.rc3

AI开发平台MODELARTS 主流开源大模型基于Lite Server适配PyTorch NPU推理指导（6.3.911）

AI开发平台MODELARTS-推理场景介绍:资源规格要求

资源规格要求本文档中的模型运行环境是ModelArts Lite的DevServer。推荐使用“西南-贵阳一”Region上的资源和Ascend Snt9B。如果使用DevServer资源，请参考DevServer资源开通，购买DevServer资源，并确保机器已开通，密码已获取，能通过SSH登录，不同机器之间网络互通。当容器需要提供服务给多个用户，或者多个用户共享使用该容器时，应限制容器访问Openstack的管理地址（169.254.169.254），以防止容器获取宿主机的元数据。具体操作请参见禁止容器获取宿主机元数据。

AI开发平台MODELARTS 主流开源大模型基于Lite Server适配PyTorch NPU推理指导（6.3.911）

AI开发平台MODELARTS-附录：大模型推理常见问题:问题12：使用SmoothQuant做权重转换时，scale显示为nan或推理时精度异常

问题12：使用SmoothQuant做权重转换时，scale显示为nan或推理时精度异常图7 权重转换scale显示为nan 涉及模型：qwen2-1.5b, qwen2-7b 解决方法：修改AscendCloud/AscendCloud-LLM/llm_tools/AutoSmoothQuant/autosmoothquant/utils/utils.py中的build_model_and_tokenizer函数，将torch_dtype类型从torch.float16改成torch.bfloat16 kwargs = {"torch_dtype": torch.bfloat16, "device_map": "auto"}

AI开发平台MODELARTS 主流开源大模型基于Lite Server适配PyTorch NPU推理指导（6.3.911）

AI开发平台MODELARTS-附录：大模型推理常见问题:问题13：使用SmoothQuant做权重转换时报错

问题13：使用SmoothQuant做权重转换时报错图8 权重转换报错涉及模型：qwen2-1.5b, qwen2-0.5b 解决方法：修改AscendCloud/AscendCloud-LLM/llm_tools/AutoSmoothQuant/autosmoothquant/examples/smoothquant_model.py中的main函数，保存模型时将safe_serialization指定为False int8_model.save_pretrained(output_path,safe_serialization=False)

AI开发平台MODELARTS 主流开源大模型基于Lite Server适配PyTorch NPU推理指导（6.3.911）

AI开发平台MODELARTS-附录：大模型推理常见问题:问题4：使用SmoothQuant进行W8A8进行模型量化时报错

问题4：使用SmoothQuant进行W8A8进行模型量化时报错使用SmoothQuant进行W8A8进行模型量化时报错：AttributeError: type object 'LlamaAttention' has no attribute '_init_rope' 解决方法：降低transformers版本到4.42 pip install transformers==4.42 --upgrade

AI开发平台MODELARTS 主流开源大模型基于Lite Server适配PyTorch NPU推理指导（6.3.911）

AI开发平台MODELARTS-附录：大模型推理常见问题:问题9：使用benchmark-tools访问推理服务返回报错

问题9：使用benchmark-tools访问推理服务返回报错使用benchmark-tools访问推理服务时，输入输出的token和大于max_model_len，服务端返回报错Response payload is not completed，见图2。再次设置输入输出的token和小于max_model_len访问推理服务，服务端响应200，见图3。客户端仍返回报错Response payload is not completed，见图4。图2 服务端返回报错Response payload is not completed 图3 服务端响应200 图4 仍返回报错Response payload is not completed 解决方法：安装brotlipy后返回正确报错 pip install brotlipy

AI开发平台MODELARTS 主流开源大模型基于Lite Server适配PyTorch NPU推理指导（6.3.911）

AI开发平台MODELARTS-附录：大模型推理常见问题:问题3：使用llama3.1系列模型进行推理时报错

问题3：使用llama3.1系列模型进行推理时报错使用llama3.1系模型进行推理时报错：ValueError: 'rope_scaling' must be a dictionary with two fields, 'type' and 'factor', got {'factor': 8.0, 'low_freq_factor': 1.0, 'high_freq_factor': 4.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}

AI开发平台MODELARTS 主流开源大模型基于Lite Server适配PyTorch NPU推理指导（6.3.911）

AI开发平台MODELARTS-附录：大模型推理常见问题:问题2：在推理预测过程中遇到ValueError:User-specified max_model_len is greater than the drived max_model_len

问题2：在推理预测过程中遇到ValueError:User-specified max_model_len is greater than the drived max_model_len 解决方法：修改config.json文件中的"seq_length"的值，"seq_length"需要大于等于 --max-model-len的值。config.json存在模型对应的路径下，例如：/data/nfs/benchmark/tokenizer/chatglm3-6b/config.json

AI开发平台MODELARTS 主流开源大模型基于Lite Server适配PyTorch NPU推理指导（6.3.911）

云服务器内容精选

主流开源大模型基于Lite Server适配PyTorch NPU推理指导（6.3.911）