No way to run it?

#6
by RaulBSM - opened

Hi, so like other users pointed out, I've tried to run it in vLLM but got the same problem as in issue #3 (CUDA OOM on a 48 GB VRAM card, and then on an A100).

I also tried text-generation-webui by oobabooga but hit the same issue: it runs out of memory and then errors out while trying to offload to CPU. I tried downgrading transformers and installing the AutoAWQ loader, but nothing worked for me.

So I wanted to ask: what is the best way to run this model?

You need to modify the model's config.json:

   76      "torch_dtype": "float16",
   77      "transformers_version": "4.55.1",
   78      "use_cache": true,
   79 -    "vocab_size": 201088
   79 +    "vocab_size": 201088,
   80 +    "quantization_config": {
   81 +      "quant_method": "awq",
   82 +      "zero_point": true,
   83 +      "group_size": 128,
   84 +      "bits": 4,
   85 +      "version": "gemm"
   86 +    }

Here is my command:
docker run --gpus all \
  -p 8080:8000 \
  --ipc=host \
  -e CUDA_VISIBLE_DEVICES=0,1 \
  -e TORCH_CUDA_ARCH_LIST="8.9" \
  -e VLLM_GPT_OSS_HARMONY_SYSTEM_INSTRUCTIONS=1 \
  -e VLLM_TOOL_JSON_ERROR_AUTOMATIC_RETRY=1 \
  -e NCCL_P2P_DISABLE=1 \
  -e VLLM_USE_V1=0 \
  -v /root/huggingface_cache:/root/.cache/huggingface \
  -v /var/log/vllm:/var/log/vllm \
  vllm/vllm-openai:nightly \
  --model twhitworth/gpt-oss-120b-awq-w4a16 \
  --quantization awq \
  --gpu-memory-utilization 0.94 \
  --max-num-batched-tokens 8192 \
  --max-model-len 131072 \
  --enforce-eager \
  --enable-auto-tool-choice \
  --tool-call-parser openai \
  --load-format safetensors \
  --dtype float16 \
  --enable-chunked-prefill \
  --tensor-parallel-size 2 \
  --reasoning-parser openai_gptoss \
  --enable-expert-parallel \
  --trust-remote-code \
  --enable-prefix-caching
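If the server does come up, a quick sanity check against the OpenAI-compatible API looks like this (host port 8080 per the -p 8080:8000 mapping above; the model name has to match the --model value):

# list the served models
curl http://localhost:8080/v1/models

# minimal chat completion request
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "twhitworth/gpt-oss-120b-awq-w4a16", "messages": [{"role": "user", "content": "Say hi"}], "max_tokens": 32}'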

I'm still getting some warnings and errors, like:
(Worker_TP1_EP1 pid=93) WARNING 11-05 11:13:56 [awq.py:114] Layer 'model.layers.35.mlp.experts' is not supported by AWQMoeMarlin. Falling back to Moe WNA16 kernels.

I still couldn't make it work, so until there is an official fix, I wouldn't use this model.

I did some digging on the author, and I think the model is meant to be run with his fork of vLLM: https://github.com/moreWax/tiny-vllm

I'll give it a try this week, but if somebody can do it sooner I would appreciate that - this seems to be the way to run this model.
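For anyone who wants to try the fork before I get to it, here is roughly what I have in mind (this assumes the fork installs and serves models the same way as upstream vLLM, which I haven't verified):

# sketch only: build the fork from source
git clone https://github.com/moreWax/tiny-vllm
cd tiny-vllm
pip install -e .  # needs a CUDA-capable PyTorch environment, same as upstream vLLM

# then serve the model with the same kind of flags as the docker command above
python -m vllm.entrypoints.openai.api_server \
  --model twhitworth/gpt-oss-120b-awq-w4a16 \
  --quantization awq \
  --dtype float16 \
  --tensor-parallel-size 2 \
  --trust-remote-code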

Hi, were you able to run it?
