No way to run it?
Hi, like other users pointed out, I've tried to run it in vLLM but hit the same problem as issue #3 (CUDA OOM on a 48 GB VRAM card, then on an A100).
I also tried text-generation-webui by oobabooga, but got the same out-of-memory issue; it tried to offload to CPU and then errored out. I tried downgrading transformers and installing the AutoAWQ loader, but nothing worked for me.
So what is the best way to run this model?
You need to modify the config
76 "torch_dtype": "float16",
77 "transformers_version": "4.55.1",
78 "use_cache": true,
79 - "vocab_size": 201088
79 + "vocab_size": 201088,
80 + "quantization_config": {
81 + "quant_method": "awq",
82 + "zero_point": true,
83 + "group_size": 128,
84 + "bits": 4,
85 + "version": "gemm"
86 + }
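If you're letting vLLM pull the model straight from the Hub, you'll need a local copy of the weights to edit. A minimal sketch, assuming huggingface-cli is available (the local directory is just an example):

huggingface-cli download twhitworth/gpt-oss-120b-awq-w4a16 \
  --local-dir /root/huggingface_cache/gpt-oss-120b-awq-w4a16
# edit config.json in that directory and add the quantization_config block shown above,
# then point vLLM's --model at the local path instead of the Hub id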
Here is my command:
docker run --gpus all \
  -p 8080:8000 \
  --ipc=host \
  -e CUDA_VISIBLE_DEVICES=0,1 \
  -e TORCH_CUDA_ARCH_LIST="8.9" \
  -e VLLM_GPT_OSS_HARMONY_SYSTEM_INSTRUCTIONS=1 \
  -e VLLM_TOOL_JSON_ERROR_AUTOMATIC_RETRY=1 \
  -e NCCL_P2P_DISABLE=1 \
  -e VLLM_USE_V1=0 \
  -v /root/huggingface_cache:/root/.cache/huggingface \
  -v /var/log/vllm:/var/log/vllm \
  vllm/vllm-openai:nightly \
  --model twhitworth/gpt-oss-120b-awq-w4a16 \
  --quantization awq \
  --gpu-memory-utilization 0.94 \
  --max-num-batched-tokens 8192 \
  --max-model-len 131072 \
  --enforce-eager \
  --enable-auto-tool-choice \
  --tool-call-parser openai \
  --load-format safetensors \
  --dtype float16 \
  --enable-chunked-prefill \
  --tensor-parallel-size 2 \
  --reasoning-parser openai_gptoss \
  --enable-expert-parallel \
  --trust-remote-code \
  --enable-prefix-caching
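Once the container is up, a quick smoke test against the OpenAI-compatible endpoint (assuming the port mapping above, so the server is reachable on localhost:8080):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "twhitworth/gpt-oss-120b-awq-w4a16",
        "messages": [{"role": "user", "content": "Say hello"}],
        "max_tokens": 32
      }'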
I'm still getting warnings like this:
(Worker_TP1_EP1 pid=93) WARNING 11-05 11:13:56 [awq.py:114] Layer 'model.layers.35.mlp.experts' is not supported by AWQMoeMarlin. Falling back to Moe WNA16 kernels.
I still couldn't make it work, so until there is an official fix, I wouldn't use this model.
I did some digging on the author and I think the model is meant to be run with his fork of vLLM: https://github.com/moreWax/tiny-vllm
I'll give it a try this week, but if somebody can do it sooner I'd appreciate that; this seems to be the way to run this model.
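If anyone wants to try it before me, a rough sketch, assuming the fork installs from source the same way as upstream vLLM (I haven't verified this, so check its README first):

git clone https://github.com/moreWax/tiny-vllm
cd tiny-vllm
pip install -e .   # assumption: standard editable source install
# then serve the model the same way as upstream vLLM (e.g. `vllm serve twhitworth/gpt-oss-120b-awq-w4a16`),
# if the fork keeps that CLI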
Hi, were you able to run it?