No way to run it?

#6
by RaulBSM - opened

Hi, so like other users pointed out, I've tried to run it in vLLM but got the same problem as in issue #3 (CUDA OOM on a 48 GB VRAM card, and then on an A100).

I also tried text-generation-webui by oobabooga but hit the same issue: it runs out of memory and then errors out while trying to offload to CPU. I tried downgrading transformers and installing the AutoAWQ loader, but nothing worked for me.

So I wanted to ask: what is the best way to run this model?

You need to modify the model's config.json:

   76      "torch_dtype": "float16",
   77      "transformers_version": "4.55.1",
   78      "use_cache": true,
   79 -    "vocab_size": 201088
   79 +    "vocab_size": 201088,
   80 +    "quantization_config": {
   81 +      "quant_method": "awq",
   82 +      "zero_point": true,
   83 +      "group_size": 128,
   84 +      "bits": 4,
   85 +      "version": "gemm"
   86 +    }

Here is my command:
docker run --gpus all \
  -p 8080:8000 \
  --ipc=host \
  -e CUDA_VISIBLE_DEVICES=0,1 \
  -e TORCH_CUDA_ARCH_LIST="8.9" \
  -e VLLM_GPT_OSS_HARMONY_SYSTEM_INSTRUCTIONS=1 \
  -e VLLM_TOOL_JSON_ERROR_AUTOMATIC_RETRY=1 \
  -e NCCL_P2P_DISABLE=1 \
  -e VLLM_USE_V1=0 \
  -v /root/huggingface_cache:/root/.cache/huggingface \
  -v /var/log/vllm:/var/log/vllm \
  vllm/vllm-openai:nightly \
  --model twhitworth/gpt-oss-120b-awq-w4a16 \
  --quantization awq \
  --gpu-memory-utilization 0.94 \
  --max-num-batched-tokens 8192 \
  --max-model-len 131072 \
  --enforce-eager \
  --enable-auto-tool-choice \
  --tool-call-parser openai \
  --load-format safetensors \
  --dtype float16 \
  --enable-chunked-prefill \
  --tensor-parallel-size 2 \
  --reasoning-parser openai_gptoss \
  --enable-expert-parallel \
  --trust-remote-code \
  --enable-prefix-caching
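If the server does come up, a quick sanity check against the OpenAI-compatible API looks like this (host port 8080 per the -p 8080:8000 mapping above; the model name has to match the --model value):

# list the served models
curl http://localhost:8080/v1/models

# minimal chat completion request
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "twhitworth/gpt-oss-120b-awq-w4a16", "messages": [{"role": "user", "content": "Say hi"}], "max_tokens": 32}'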

I'm still getting some warnings and errors, like:
(Worker_TP1_EP1 pid=93) WARNING 11-05 11:13:56 [awq.py:114] Layer 'model.layers.35.mlp.experts' is not supported by AWQMoeMarlin. Falling back to Moe WNA16 kernels.

I still couldn't make it work, so until there is an official fix, I wouldn't use this model.

I did some digging on the author, and I think the model is meant to be run with his fork of vLLM: https://github.com/moreWax/tiny-vllm

I'll give it a try this week, but if somebody can do it sooner I would appreciate that - this seems to be the way to run this model.
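For anyone who wants to try the fork before I get to it, here is roughly what I have in mind (this assumes the fork installs and serves models the same way as upstream vLLM, which I haven't verified):

# sketch only: build the fork from source
git clone https://github.com/moreWax/tiny-vllm
cd tiny-vllm
pip install -e .  # needs a CUDA-capable PyTorch environment, same as upstream vLLM

# then serve the model with the same kind of flags as the docker command above
python -m vllm.entrypoints.openai.api_server \
  --model twhitworth/gpt-oss-120b-awq-w4a16 \
  --quantization awq \
  --dtype float16 \
  --tensor-parallel-size 2 \
  --trust-remote-code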

Hi, were you able to run it?
