Output on 2nd request
by ryanstout - opened
First, thanks so much for uploading an AWQ version of Qwen3-VL (what a great model also!)
So this is a weird one, and I'm guessing it's a bug in vLLM, but I'm posting here in case anyone else runs into it. The first time I make a request, I get a valid response; from the second request onward I get what seems like random tokens (once I change the prompt so it's not hitting the prefix cache).
I'm running with pipeline parallelism (which may be related):
```shell
uv run vllm serve \
  QuantTrio/Qwen3-VL-235B-A22B-Thinking-AWQ \
  --served-model-name My_Model \
  --enable-expert-parallel \
  --swap-space 16 \
  --max-num-seqs 1 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.9 \
  --pipeline-parallel-size 7 \
  --trust-remote-code \
  --disable-log-requests \
  --host 0.0.0.0 \
  --port 8000
```
Here's an example of a broken response (the prompt was "Describe this image"):
```json
{"id":"chatcmpl-7d3ffdece5914c5db967af956950b1ea","object":"chat.completion","created":1759244626,"model":"My_Model","choices":[{"index":0,"message":{"role":"assistant","content":"```沙滩```","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":151643,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":2770,"total_tokens":2774,"completion_tokens":4,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
```
The response content is 沙滩 ("beach" in Chinese).
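For anyone wanting to poke at the same fields, here's a minimal sketch that parses a trimmed-down copy of the broken response above (only the fields relevant here; the `stop_reason` comment is my assumption about Qwen's tokenizer, not something vLLM reports):

```python
import json

# Trimmed copy of the broken response body (full JSON is pasted earlier
# in this post; only the fields discussed here are kept).
raw = '''{"choices": [{"message": {"content": "```沙滩```"},
                       "finish_reason": "stop",
                       "stop_reason": 151643}],
          "usage": {"prompt_tokens": 2770, "completion_tokens": 4}}'''

resp = json.loads(raw)
choice = resp["choices"][0]

print(choice["message"]["content"])        # the garbled content, fenced in backticks
print(choice["stop_reason"])               # 151643 -- I believe this is Qwen's <|endoftext|> id
print(resp["usage"]["completion_tokens"])  # only 4 tokens generated before stopping
```

So the model isn't rambling; it emits a few junk tokens and then immediately hits end-of-text.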
If I restart vLLM and run the request again, it works until I send a new prompt.
Any ideas on this one? Thanks!