Garbled output on 2nd request

#1
by ryanstout - opened

First, thanks so much for uploading an AWQ version of Qwen3-VL (and what a great model it is!)

So this is a weird one, and I'm guessing a bug in vLLM, but I'm posting here in case anyone else runs into it. The first time I make a request I get a valid response; from the second request onward (once I change the prompt so it isn't hitting the prefix cache) I get what looks like random tokens.

I'm running with pipeline parallelism (which may be related):

uv run vllm serve \
    QuantTrio/Qwen3-VL-235B-A22B-Thinking-AWQ \
    --served-model-name My_Model \
    --enable-expert-parallel \
    --swap-space 16 \
    --max-num-seqs 1 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.9 \
    --pipeline-parallel-size 7 \
    --trust-remote-code \
    --disable-log-requests \
    --host 0.0.0.0 \
    --port 8000
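
For reference, here's roughly how I'm hitting the server. This is a simplified sketch (the image URL and prompts below are placeholders, not my real ones); the point is just that the two requests use different prompts, so the second one can't be served from the prefix cache:

# Simplified repro sketch: IMAGE_URL and both prompts are placeholders
IMAGE_URL="https://example.com/some-image.jpg"

for PROMPT in "Describe this image" "List the objects in this image"; do
  curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{
      \"model\": \"My_Model\",
      \"messages\": [{
        \"role\": \"user\",
        \"content\": [
          {\"type\": \"image_url\", \"image_url\": {\"url\": \"$IMAGE_URL\"}},
          {\"type\": \"text\", \"text\": \"$PROMPT\"}
        ]
      }]
    }"
  echo
done

The first request comes back fine; the second one is where the garbage shows up.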

Here's an example of a broken response (the prompt was "Describe this image"):

{"id":"chatcmpl-7d3ffdece5914c5db967af956950b1ea","object":"chat.completion","created":1759244626,"model":"My_Model","choices":[{"index":0,"message":{"role":"assistant","content":"```沙滩```","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":151643,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":2770,"total_tokens":2774,"completion_tokens":4,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}

The response content is:
沙滩
(which is Chinese for "beach")

If I restart vLLM and run the same request again, it works until I send a new prompt.

Any ideas on this one? Thanks!
