vLLM Operation Guide

1. Pull the base image

docker pull vllm/vllm-openai:v0.5.3

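2. Download the model weights

If your machine can reach the external internet, use pycrawlers to download the model weights from Hugging Face; the download is stable.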

from pycrawlers import huggingface

token = 'YOUR_HUGGINGFACE_TOKEN'  # replace with your own access token; never commit a real token
hg = huggingface(token=token)

url = 'https://huggingface.co/Qwen/Qwen2-7B-Instruct/tree/main'
hg.get_data(url)

If the machine cannot reach the external internet, use the SDK to download the model weights from ModelScope, which is relatively stable.

from modelscope import snapshot_download 
model_dir = snapshot_download('qwen/Qwen2-7B-Instruct')
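
Before mounting the downloaded directory into the container, it can help to confirm that the download is complete. The following is a minimal sketch, not part of either download tool: the helper name missing_model_files is hypothetical, and the file names it checks (config.json plus *.safetensors or *.bin weights) are an assumption based on the usual Hugging Face repository layout.

```python
from pathlib import Path


def missing_model_files(model_dir: str) -> list[str]:
    """List the expected entries that are absent from model_dir.

    Assumption: the directory follows the common Hugging Face layout,
    with a config.json and weight shards stored as *.safetensors or *.bin.
    """
    root = Path(model_dir)
    missing = []
    if not (root / "config.json").is_file():
        missing.append("config.json")
    # Accept either weight format; only report when neither is present.
    if not any(root.glob("*.safetensors")) and not any(root.glob("*.bin")):
        missing.append("weight files (*.safetensors or *.bin)")
    return missing
```

Calling missing_model_files(model_dir) on the path returned by snapshot_download should give an empty list when the expected files are present.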

3. Start the API service

It is recommended to put frequently used commands in shell scripts.

run.sh

docker run -d --gpus '"device=0"' -p 11434:8000 --name Qwen2-7B-Instruct -v E:\models\Qwen2-7B-Instruct:/models/main vllm/vllm-openai:v0.5.3 --served-model-name Qwen2-7B-Instruct --model /models/main --max-model-len 4096 --gpu-memory-utilization 0.95 --port 8000 --trust-remote-code

rm.sh

docker rm -f Qwen2-7B-Instruct

log.sh

docker logs -f Qwen2-7B-Instruct
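
The container starts in the background, and the model takes a while to load before requests are accepted. Instead of watching the logs, readiness can be polled from the host. The following is a minimal sketch under stated assumptions: the helper names http_probe and wait_until_ready are hypothetical, and the URL in the usage comment assumes the -p 11434:8000 mapping from run.sh together with the /v1/models endpoint of the OpenAI-compatible server.

```python
import time
import urllib.error
import urllib.request


def http_probe(url: str, timeout: float = 3.0) -> bool:
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: not ready yet.
        return False


def wait_until_ready(probe, attempts: int = 60, interval: float = 5.0) -> bool:
    """Call probe() until it returns True or the attempts run out."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(interval)
    return False


# Usage (assumes the host port 11434 published in run.sh):
#   ok = wait_until_ready(lambda: http_probe("http://localhost:11434/v1/models"))
#   print("server ready" if ok else "server did not come up in time")
```

Passing the probe as a callable keeps the retry loop independent of the network call, so the same loop can wrap any other check.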

4. Call the service

Shell command (run on the host; run.sh publishes container port 8000 on host port 11434)

curl -X POST http://localhost:11434/v1/chat/completions \
-H "User-Agent: Apifox/1.0.0 (https://apifox.com)" \
-H "Content-Type: application/json" \
-d '{
    "model": "Qwen2-7B-Instruct",
    "messages": [
        {
            "role": "system",
            "content": "You are a considerate little butler"
        },
        {
            "role": "user",
            "content": "How should I plan a trip to Yinchuan?"
        }
    ],
    "temperature": 0
}'

Python script

import requests

# Host port 11434 is mapped to container port 8000 by run.sh
url = 'http://localhost:11434/v1/chat/completions'
# Request headers
headers = {
    'User-Agent': 'Apifox/1.0.0 (https://apifox.com)',
    'Content-Type': 'application/json'
}
# Request body
data = {
    "model": "Qwen2-7B-Instruct",
    "messages": [
        {
            "role": "system",
            "content": "You are a considerate little butler"
        },
        {
            "role": "user",
            "content": "How should I plan a trip to Yinchuan?"
        }
        }
    ],
    "temperature": 0
}

# Send the POST request
response = requests.post(url, headers=headers, json=data)
# Print the status code and the raw response body
print(response.status_code)
print(response.text)
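
The script above prints the raw JSON body. To get only the model's answer, the reply text can be read from the first entry of choices, following the OpenAI chat-completions response format. The following is a minimal sketch: the helper name extract_reply is hypothetical, and treating a body without choices as an error is an assumption made here.

```python
def extract_reply(payload: dict) -> str:
    """Return the assistant text from a chat-completions response body."""
    choices = payload.get("choices")
    if not choices:
        # Assumption: a body without "choices" is an error response;
        # include it in the exception so it can be inspected.
        raise ValueError(f"no choices in response: {payload}")
    return choices[0]["message"]["content"]


# Usage with the script above:
#   print(extract_reply(response.json()))
```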