Instructions to use Qwen/Qwen3.5-27B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Qwen/Qwen3.5-27B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Qwen/Qwen3.5-27B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("Qwen/Qwen3.5-27B")
model = AutoModelForImageTextToText.from_pretrained("Qwen/Qwen3.5-27B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Inference
HuggingChat
Notebooks
Google Colab
Kaggle
Local Apps

vLLM

How to use Qwen/Qwen3.5-27B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Qwen/Qwen3.5-27B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Qwen/Qwen3.5-27B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/Qwen/Qwen3.5-27B

SGLang

How to use Qwen/Qwen3.5-27B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Qwen/Qwen3.5-27B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Qwen/Qwen3.5-27B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Qwen/Qwen3.5-27B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Qwen/Qwen3.5-27B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use Qwen/Qwen3.5-27B with Docker Model Runner:
```
docker model run hf.co/Qwen/Qwen3.5-27B
```

what the best engine to run this model.

#40

by Shimon324 - opened Mar 24

Discussion

Shimon324

Mar 24

Hi everyone,
I am looking to deploy the Qwen 3.5 27B model and my primary optimization goal is maximizing generation throughput (TPS) while keeping latency reasonable.
Hardware Context:
2x NVIDIA RTX A5000 (Ampere architecture, 24GB VRAM each -> 48GB total).
High-end local workstation.
The Constraints:
Since the unquantized 27B model (~54GB in BF16/FP16) exceeds my 48GB VRAM limit, I must use a quantized version with Tensor Parallelism (TP=2). I am aiming for a purely text-based use case (the vision pipeline will be disabled).
My Questions:
Engine Performance: Between SGLang and vLLM, which engine currently has the best optimized kernels and lowest overhead for the Qwen 3.5 architecture on an Ampere multi-GPU setup? Are there any benchmark differences regarding RadixAttention (SGLang) vs PagedAttention (vLLM) for this specific family?
TensorRT-LLM: Should I be looking into TensorRT-LLM instead for bare-metal maximum speed, or is the compilation overhead not worth the TPS gain compared to SGLang for a 27B model?
Quantization Synergy: Which quantization format (FP8, AWQ, EXL2, or GPTQ) pairs best with your recommended engine to avoid compute bottlenecks and maximize tokens per second on Ampere?
Any insights, recent benchmark experiences, or specific CLI flag recommendations (like multi-token prediction) would be highly appreciated.
Thanks

CompactAI

Mar 31

Vllm

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment