在线服务¶
先按照xllm启动文档启动xllm服务。下面给出LLM和VLM的客户端调用示例,需要根据实际情况修改其中的参数。
LLM 客户端调用¶
HTTP 调用¶
chat模式:
curl http://localhost:9977/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen2-7B-Instruct",
"max_tokens": 10,
"temperature": 0,
"stream": true,
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "hello xllm"
}
]
}'
completions模式:
curl http://127.0.0.1:9977/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen2-7B-Instruct",
"prompt": "hello xllm",
"max_tokens": 10,
"temperature": 0,
"stream": true
}'
sample模式:
curl http://127.0.0.1:9977/v1/sample \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen2-7B-Instruct",
"prompt": "问题:<emb_0> 是否命中。结论:<emb_0>",
"selector": {
"type": "literal",
"value": "<emb_0>"
},
"logprobs": 5,
"request_id": "sample-demo-001"
}'
典型响应:
{
"id": "sample-demo-001",
"object": "sample_completion",
"created": 1773369600,
"model": "Qwen2-7B-Instruct",
"choices": [
{
"index": 0,
"text": "True",
"logprobs": {
"tokens": ["True", "False"],
"token_ids": [3456, 7890],
"token_logprobs": [-0.12, -2.31]
},
"finish_reason": "selector_match"
},
{
"index": 1,
"text": "",
"logprobs": {
"tokens": [],
"token_ids": [],
"token_logprobs": []
},
"finish_reason": "empty_logprobs"
}
],
"usage": {
"prompt_tokens": 20,
"completion_tokens": 2,
"total_tokens": 22
}
}
/v1/sample 使用说明:
- 仅支持
--backend=llm,当前不支持 VLM/DiT/Rec。 selector.type当前固定为literal,selector.value按 prompt 文本顺序全文匹配。logprobs默认值为5,允许范围为[1, 5]。choices[i].index即该命中的sample_id,与 prompt 中命中顺序一一对应。- selector 无命中时返回
200且choices=[];某命中位点无可用 logprobs 时返回finish_reason="empty_logprobs"。 - 服务日志只记录
request_id、sample_id、match_count、model等摘要字段,不记录完整 prompt。
/v1/sample 常见错误语义:
- 缺少
model/prompt/selector/selector.value、selector.type != literal或logprobs越界时返回INVALID_ARGUMENT。 - 模型不存在或后端不是
llm时返回UNKNOWN。 - 并发达到上限时返回
RESOURCE_EXHAUSTED。 - 模型处于 sleep 状态时返回
UNAVAILABLE。
Python调用¶
import requests
import json
url = f"http://localhost:9977/v1/chat/completions"
messages = [
{'role': 'user', 'content': "列出三个国家和他的首都。"}
]
request_data = {
"model": "Qwen2-7B-Instruct",
"messages": messages,
"stream": False,
"temperature": 0.6,
"max_tokens": 2048,
}
response = requests.post(url, json=request_data)
if response.status_code != 200:
print(response.status_code, response.text)
else:
ans = json.loads(response.text)["choices"]
print(ans[0]['message'])
VLM 客户端调用¶
HTTP API¶
import base64
import requests
api_url = "http://localhost:12345/v1/chat/completions"
image_url = ""
def encode_image(url: str) -> str:
with requests.get(url) as response:
response.raise_for_status()
result = base64.b64encode(response.content).decode("utf-8")
return result
image_base64 = encode_image(image_url)
payload = {
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "介绍下这张图片"},
{
"type": "image_url",
"image_url": {"url": f"data:image/jpeg;base64,{image_base64}"},
},
],
}
],
"model": "Qwen2.5-VL-7B-Instruct",
"max_completion_tokens": 128,
}
response = requests.post(
api_url,
json=payload,
headers={"Content-Type": "application/json"}
)
print(response.json())
OpenAI API¶
from openai import OpenAI
import base64
import requests
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:12345/v1"
image_url = ""
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
def encode_image(url: str) -> str:
with requests.get(url) as response:
response.raise_for_status()
result = base64.b64encode(response.content).decode("utf-8")
return result
image_base64 = encode_image(image_url)
chat_completion = client.chat.completions.create(
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "介绍下这张图片"},
{
"type": "image_url",
"image_url": {"url": f"data:image/jpeg;base64,{image_base64}"},
},
],
}
],
model="Qwen2.5-VL-7B-Instruct",
max_completion_tokens=128,
)
result = chat_completion.choices[0].message.content
print("Chat completion output:", result)