AWQ 4-bit version of this Opus-Distilled-v2 model?
Hi,
Thank you for your excellent AWQ quantizations.
I'm using Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2 (the v2 version with 14k Opus samples). It's currently the best reasoning model I have for coding and agent tasks - shorter CoT, better efficiency than base Qwen3.5-27B.
However, I'm on a single RTX 5090 and really want to run it with vLLM + FlashInfer to get MTP, continuous batching and higher speed.
Would you consider making an AWQ 4-bit version of this Opus-Distilled-v2 model?
The distillation dataset is public, so the data is already available. Many users with 40/50-series cards are waiting for a good AWQ quant of this specific model.
Thanks in advance!
Best regards
let me see
Some of the quant repos here (mainly the Qwen3.5 AWQ series so far) use a data-free quantization technique.
We can give it a try.
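For readers unfamiliar with the term, "data-free" means the weights are quantized without running any calibration samples through the model. The snippet below is only a generic illustration of that idea using plain round-to-nearest 4-bit quantization of one weight group; it is an assumption for illustration and not necessarily the method used in these repos. The function names and the sample values are hypothetical.

```python
def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of weights.

    Only the weights themselves are used: no activations, no calibration data.
    Returns integer codes plus the scale and offset needed to reconstruct.
    """
    qmax = (1 << bits) - 1              # 15 for 4-bit codes
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0     # fall back to 1.0 for a constant group
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, offset):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale + offset for c in codes]


# Hypothetical weight group, just to exercise the functions.
group = [-0.42, -0.10, 0.0, 0.07, 0.31, 0.55]
codes, scale, offset = quantize_group(group)
restored = dequantize_group(codes, scale, offset)
print(codes)
print([round(r, 3) for r in restored])
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the basic trade-off of any 4-bit scheme.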
I see the description mentions requiring CUDA 12.8.
I'm using vllm in docker with "vllm/vllm-openai:cu130-nightly".
QuantTrio/Qwen3.5-27B-AWQ works perfectly
but with QuantTrio/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ I get:
(APIServer pid=1) INFO 03-30 10:50:29 [utils.py:299]
(APIServer pid=1) INFO 03-30 10:50:29 [utils.py:299] █ █ █▄ ▄█
(APIServer pid=1) INFO 03-30 10:50:29 [utils.py:299] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.18.1rc1.dev227+gc133f3374
(APIServer pid=1) INFO 03-30 10:50:29 [utils.py:299] █▄█▀ █ █ █ █ model QuantTrio/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ
(APIServer pid=1) INFO 03-30 10:50:29 [utils.py:299] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=1) INFO 03-30 10:50:29 [utils.py:299]
(APIServer pid=1) INFO 03-30 10:50:29 [utils.py:233] non-default args: {'model_tag': 'QuantTrio/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'host': '0.0.0.0', 'model': 'QuantTrio/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ', 'trust_remote_code': True, 'max_model_len': 196608, 'served_model_name': ['Qwen3.5'], 'reasoning_parser': 'qwen3', 'tensor_parallel_size': 2, 'max_num_seqs': 32, 'enable_chunked_prefill': True, 'speculative_config': {'method': 'qwen3_next_mtp', 'num_speculative_tokens': 2}}
(APIServer pid=1) WARNING 03-30 10:50:29 [envs.py:1733] Unknown vLLM environment variable detected: VLLM_ATTENTION_BACKEND
(APIServer pid=1) INFO 03-30 10:50:34 [model.py:549] Resolved architecture: Qwen3_5ForConditionalGeneration
(APIServer pid=1) INFO 03-30 10:50:34 [model.py:1678] Using max model len 196608
(APIServer pid=1) INFO 03-30 10:50:35 [awq_marlin.py:245] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
(APIServer pid=1) WARNING 03-30 10:50:35 [speculative.py:368] method `qwen3_next_mtp` is deprecated and replaced with mtp.
(APIServer pid=1) INFO 03-30 10:50:39 [model.py:549] Resolved architecture: Qwen3_5MTP
(APIServer pid=1) INFO 03-30 10:50:39 [model.py:1678] Using max model len 262144
(APIServer pid=1) INFO 03-30 10:50:39 [awq_marlin.py:245] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
(APIServer pid=1) WARNING 03-30 10:50:39 [speculative.py:512] Enabling num_speculative_tokens > 1 will run multiple times of forward on same MTP layer,which may result in lower acceptance rate
(APIServer pid=1) INFO 03-30 10:50:39 [config.py:228] Setting attention block size to 800 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=1) INFO 03-30 10:50:39 [config.py:259] Padding mamba page size by 0.88% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=1) INFO 03-30 10:50:39 [vllm.py:786] Asynchronous scheduling is enabled.
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1) File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=1) sys.exit(main())
(APIServer pid=1) ^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=1) args.dispatch_function(args)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=1) uvloop.run(run_server(args))
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1) return __asyncio.run(
(APIServer pid=1) ^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1) return runner.run(main)
(APIServer pid=1) ^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1) return self._loop.run_until_complete(task)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1) return await main
(APIServer pid=1) ^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 670, in run_server
(APIServer pid=1) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 684, in run_server_worker
(APIServer pid=1) async with build_async_engine_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=1) async with build_async_engine_client_from_engine_args(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=1) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=1) return cls(
(APIServer pid=1) ^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 135, in __init__
(APIServer pid=1) self.renderer = renderer = renderer_from_config(self.vllm_config)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/renderers/registry.py", line 83, in renderer_from_config
(APIServer pid=1) tokenizer = cached_tokenizer_from_config(model_config, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/tokenizers/registry.py", line 227, in cached_tokenizer_from_config
(APIServer pid=1) return cached_get_tokenizer(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/tokenizers/registry.py", line 210, in get_tokenizer
(APIServer pid=1) tokenizer = tokenizer_cls_.from_pretrained(tokenizer_name, *args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/tokenizers/hf.py", line 110, in from_pretrained
(APIServer pid=1) raise e
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/tokenizers/hf.py", line 85, in from_pretrained
(APIServer pid=1) tokenizer = AutoTokenizer.from_pretrained(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/transformers/models/auto/tokenization_auto.py", line 1153, in from_pretrained
(APIServer pid=1) raise ValueError(
(APIServer pid=1) ValueError: Tokenizer class TokenizersBackend does not exist or is not currently imported
Could you try installing the official vLLM release in a clean venv/image? Your environment is not recognizing the tokenizer class. This is not a CUDA issue, but rather a higher-level vLLM issue.
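When comparing a working and a failing environment, it can help to record exactly which interpreter and package versions are in use. The following is a minimal sketch using only the Python standard library; the helper name `installed_versions` and the choice of packages to inspect are assumptions made for illustration.

```python
from importlib import metadata


def installed_versions(packages):
    """Return a dict mapping each package name to its installed version.

    Packages that are not installed in the current environment map to None.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    import sys

    # The interpreter path shows which venv or image is actually being used.
    print("interpreter:", sys.executable)
    for name, version in installed_versions(["vllm", "transformers", "tokenizers"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Running it inside the same container or venv that launches `vllm serve` (for example through `uv run python`) shows the versions that process would import, which makes reports from different setups directly comparable.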
I get the same error as before
Created a new workspace:
uv init
uv add vllm
CUDA_VISIBLE_DEVICES=0,1 VLLM_ATTENTION_BACKEND=FLASH_ATTN uv run vllm serve \
--model QuantTrio/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ \
--served-model-name Qwen3.5 \
--tensor-parallel-size 2 \
--max-model-len 196608 \
--max-num-seqs 32 \
--gpu-memory-utilization 0.9 \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--enable-chunked-prefill \
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' \
--host 0.0.0.0 \
--port 8000
...
(APIServer pid=990732) return renderer_cls.from_config(config, tokenizer_kwargs)
(APIServer pid=990732) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=990732) File "/home/ai-server/qwen/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/renderers/hf.py", line 625, in from_config
(APIServer pid=990732) cached_get_tokenizer(
(APIServer pid=990732) File "/home/ai-server/qwen/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/tokenizers/registry.py", line 210, in get_tokenizer
(APIServer pid=990732) tokenizer = tokenizer_cls_.from_pretrained(tokenizer_name, *args, **kwargs)
(APIServer pid=990732) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=990732) File "/home/ai-server/qwen/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/tokenizers/hf.py", line 110, in from_pretrained
(APIServer pid=990732) raise e
(APIServer pid=990732) File "/home/ai-server/qwen/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/tokenizers/hf.py", line 85, in from_pretrained
(APIServer pid=990732) tokenizer = AutoTokenizer.from_pretrained(
(APIServer pid=990732) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=990732) File "/home/ai-server/qwen/vllm-workspace/.venv/lib/python3.12/site-packages/transformers/models/auto/tokenization_auto.py", line 1153, in from_pretrained
(APIServer pid=990732) raise ValueError(
(APIServer pid=990732) ValueError: Tokenizer class TokenizersBackend does not exist or is not currently imported.
Could you try pip install vllm==0.18.0 instead of uv and see if it works? My guess is that uv add is just reusing an existing module; if you want to stay with uv, uv pip install is probably more appropriate here.
This repo is literally just a Qwen3.5 dense model in AWQ format. Your Python/vLLM environment should have recognized it.
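The ValueError in the traceback is raised by `AutoTokenizer.from_pretrained` when the class name it was asked to load is not exported by the installed `transformers` package. A rough way to check this by hand is sketched below. It assumes the class name comes from the `tokenizer_class` field of the repo's `tokenizer_config.json`; the helper names and the stand-in config and module are hypothetical and exist only so the snippet runs without any third-party package.

```python
import json
import types


def declared_tokenizer_class(tokenizer_config_text):
    """Return the "tokenizer_class" field of a tokenizer_config.json document, or None."""
    return json.loads(tokenizer_config_text).get("tokenizer_class")


def class_is_exported(module, class_name):
    """Report whether the given module exposes an attribute with that class name."""
    return class_name is not None and hasattr(module, class_name)


# Stand-ins for illustration: a config naming the class from the error message,
# and a fake library that only exports an older tokenizer class.
sample_config = '{"tokenizer_class": "TokenizersBackend"}'
fake_library = types.SimpleNamespace(PreTrainedTokenizerFast=object)

name = declared_tokenizer_class(sample_config)
print(name, class_is_exported(fake_library, name))  # prints: TokenizersBackend False
```

In a real environment, `fake_library` would be replaced with `import transformers` and `sample_config` with the contents of the model's `tokenizer_config.json`. A result of `False` there would suggest that the installed `transformers` version does not match the one the tokenizer files were saved with.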
let me see
Can you also quantize the 4B and 9B models? Thank you!
Did you ever resolve this? I'm getting the exact same thing :(