Updated the README. There is no need for the PR, since the base nightly already works well on Blackwell (the PR is merged now).
I verified this myself on an RTX 6000 Pro.
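For anyone reproducing this, the `non-default args` line in the startup log below maps onto a `vllm serve` invocation roughly as follows. This is a reconstruction from the logged values, not necessarily the exact command used: how the container was started and which environment variables were set are not shown in the log.

```python
import shlex

# Values copied from the 'non-default args' line of the startup log.
model_path = "/app/efs/models/minimax-m3-nvfp4"

argv = [
    "vllm", "serve", model_path,
    "--served-model-name", "minimax-m3",
    "--trust-remote-code",
    "--tensor-parallel-size", "4",
    "--max-model-len", "-1",  # logged as -1; the log later auto-fits it to 78976
    "--block-size", "128",
    "--gpu-memory-utilization", "0.95",
    "--enable-auto-tool-choice",
    "--tool-call-parser", "minimax_m3",
    "--reasoning-parser", "minimax_m3",
]

# Join into a single shell-safe command line for copy and paste.
command = shlex.join(argv)
print(command)
```

The model path is specific to this machine and would need to be replaced with a local checkpoint location.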
(APIServer pid=1) INFO 06-26 06:55:48 [api_utils.py:339] version 0.23.1rc1.dev471+ge312c5cb2
(APIServer pid=1) INFO 06-26 06:55:48 [api_utils.py:339] model /app/efs/models/minimax-m3-nvfp4
(APIServer pid=1) INFO 06-26 06:55:48 [api_utils.py:273] non-default args: {'model_tag': '/app/efs/models/minimax-m3-nvfp4', 'enable_auto_tool_choice': True, 'tool_call_parser': 'minimax_m3', 'model': '/app/efs/models/minimax-m3-nvfp4', 'trust_remote_code': True, 'max_model_len': -1, 'served_model_name': ['minimax-m3'], 'reasoning_parser': 'minimax_m3', 'tensor_parallel_size': 4, 'block_size': 128, 'gpu_memory_utilization': 0.95}
(APIServer pid=1) WARNING 06-26 06:55:48 [envs.py:2019] Unknown vLLM environment variable detected: VLLM_BUILD_URL
(APIServer pid=1) WARNING 06-26 06:55:48 [envs.py:2019] Unknown vLLM environment variable detected: VLLM_IMAGE_TAG
(APIServer pid=1) WARNING 06-26 06:55:48 [envs.py:2019] Unknown vLLM environment variable detected: VLLM_BUILD_PIPELINE
(APIServer pid=1) WARNING 06-26 06:55:48 [envs.py:2019] Unknown vLLM environment variable detected: VLLM_BUILD_COMMIT
(APIServer pid=1) INFO 06-26 06:56:00 [model.py:598] Resolved architecture: MiniMaxM3SparseForConditionalGeneration
(APIServer pid=1) INFO 06-26 06:56:00 [model.py:1725] Using max model len 1048576
(APIServer pid=1) INFO 06-26 06:56:01 [scheduler.py:252] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=1) WARNING 06-26 06:56:01 [modelopt.py:384] Detected ModelOpt fp8 checkpoint (quant_algo=FP8). Please note that the format is experimental and could change.
(APIServer pid=1) WARNING 06-26 06:56:01 [modelopt.py:1028] Detected ModelOpt NVFP4 checkpoint (quant_algo=NVFP4). Please note that the format is experimental and could change in future.
(APIServer pid=1) WARNING 06-26 06:56:01 [modelopt.py:1028] Detected ModelOpt NVFP4 checkpoint (quant_algo=W4A16_NVFP4). Please note that the format is experimental and could change in future.
(APIServer pid=1) INFO 06-26 06:56:01 [vllm.py:1006] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 06-26 06:56:01 [vllm.py:1094] Auto-enabling VLLM_USE_BREAKABLE_CUDAGRAPH=1. Set VLLM_USE_BREAKABLE_CUDAGRAPH=0 to opt out.
(APIServer pid=1) WARNING 06-26 06:56:01 [vllm.py:1100] VLLM_USE_BREAKABLE_CUDAGRAPH is set, disabling vLLM's torch.compile pipeline. Equivalent to -cc.mode=none.
(APIServer pid=1) WARNING 06-26 06:56:01 [vllm.py:1110] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(APIServer pid=1) INFO 06-26 06:56:01 [kernel.py:276] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
(APIServer pid=1) INFO 06-26 06:56:02 [compilation.py:310] Enabled custom fusions: norm_quant, act_quant
(EngineCore pid=1654) INFO 06-26 06:56:13 [core.py:114] Initializing a V1 LLM engine (v0.23.1rc1.dev471+ge312c5cb2) with config: model='/app/efs/models/minimax-m3-nvfp4', speculative_config=None, tokenizer='/app/efs/models/minimax-m3-nvfp4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=1048576, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=modelopt_mixed, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='minimax_m3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False, jit_monitor_mode='warn', jit_monitor_verbose=False), seed=0, served_model_name=minimax-m3, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'ir_enable_torch_wrap': False, 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 
'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_rope_kvcache_cat_mla': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native']), enable_flashinfer_autotune=True, moe_backend='auto', linear_backend='auto')
(EngineCore pid=1654) WARNING 06-26 06:56:13 [multiproc_executor.py:1063] Reducing Torch parallelism from 48 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore pid=1654) INFO 06-26 06:56:13 [multiproc_executor.py:140] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=10.0.21.126 (local), world_size=4, local_world_size=4
(Worker pid=1830) INFO 06-26 06:56:21 [parallel_state.py:1588] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:60377 backend=nccl
(Worker pid=1836) INFO 06-26 06:56:26 [parallel_state.py:1588] world_size=4 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:60377 backend=nccl
(Worker pid=1848) INFO 06-26 06:56:31 [parallel_state.py:1588] world_size=4 rank=2 local_rank=2 distributed_init_method=tcp://127.0.0.1:60377 backend=nccl
(Worker pid=1863) INFO 06-26 06:56:35 [parallel_state.py:1588] world_size=4 rank=3 local_rank=3 distributed_init_method=tcp://127.0.0.1:60377 backend=nccl
(Worker pid=1830) INFO 06-26 06:56:35 [pynccl.py:113] vLLM is using nccl==2.28.9
(Worker pid=1830) WARNING 06-26 06:56:36 [symm_mem.py:66] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available.
(Worker pid=1836) WARNING 06-26 06:56:36 [symm_mem.py:66] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available.
(Worker pid=1848) WARNING 06-26 06:56:36 [symm_mem.py:66] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available.
(Worker pid=1863) WARNING 06-26 06:56:36 [symm_mem.py:66] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available.
(Worker pid=1836) WARNING 06-26 06:56:36 [custom_all_reduce.py:151] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(Worker pid=1863) WARNING 06-26 06:56:36 [custom_all_reduce.py:151] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(Worker pid=1848) WARNING 06-26 06:56:36 [custom_all_reduce.py:151] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(Worker pid=1830) WARNING 06-26 06:56:36 [custom_all_reduce.py:151] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(Worker pid=1830) INFO 06-26 06:56:36 [cuda_communicator.py:245] Using ['PYNCCL'] all-reduce backends (in dispatch order) for group 'tp:0' out of potential backends: ['NCCL_SYMM_MEM', 'QUICK_REDUCE', 'FLASHINFER', 'CUSTOM', 'SYMM_MEM', 'PYNCCL'].
(Worker pid=1830) INFO 06-26 06:56:36 [cuda_communicator.py:245] Using ['PYNCCL'] all-reduce backends (in dispatch order) for group 'ep:0' out of potential backends: ['NCCL_SYMM_MEM', 'QUICK_REDUCE', 'FLASHINFER', 'CUSTOM', 'SYMM_MEM', 'PYNCCL'].
(Worker pid=1830) INFO 06-26 06:56:36 [parallel_state.py:1923] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(Worker pid=1830) INFO 06-26 06:56:37 [topk_topp_sampler.py:55] Using FlashInfer for top-p & top-k sampling.
(Worker_TP0 pid=1830) INFO 06-26 06:56:42 [gpu_model_runner.py:5160] Starting to load model /app/efs/models/minimax-m3-nvfp4...
(Worker_TP0 pid=1830) INFO 06-26 06:56:42 [cuda.py:542] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(Worker_TP0 pid=1830) INFO 06-26 06:56:42 [mm_encoder_attention.py:373] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(Worker_TP1 pid=1836) WARNING 06-26 06:56:43 [vllm.py:1110] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(Worker_TP1 pid=1836) INFO 06-26 06:56:43 [kernel.py:276] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
(Worker_TP0 pid=1830) INFO 06-26 06:56:43 [vllm.py:1006] Asynchronous scheduling is enabled.
(Worker_TP0 pid=1830) WARNING 06-26 06:56:43 [vllm.py:1100] VLLM_USE_BREAKABLE_CUDAGRAPH is set, disabling vLLM's torch.compile pipeline. Equivalent to -cc.mode=none.
(Worker_TP0 pid=1830) WARNING 06-26 06:56:43 [vllm.py:1110] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(Worker_TP0 pid=1830) INFO 06-26 06:56:43 [kernel.py:276] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
(Worker_TP0 pid=1830) INFO 06-26 06:56:43 [compilation.py:310] Enabled custom fusions: norm_quant, act_quant
(Worker_TP2 pid=1848) WARNING 06-26 06:56:43 [vllm.py:1110] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(Worker_TP2 pid=1848) INFO 06-26 06:56:43 [kernel.py:276] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
(Worker_TP0 pid=1830) INFO 06-26 06:56:43 [cuda.py:483] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(Worker_TP0 pid=1830) INFO 06-26 06:56:43 [flash_attn.py:670] Using FlashAttention version 2
(Worker_TP3 pid=1863) WARNING 06-26 06:56:43 [vllm.py:1110] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(Worker_TP3 pid=1863) INFO 06-26 06:56:43 [kernel.py:276] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
(Worker_TP0 pid=1830) INFO 06-26 06:56:43 [sparse_attention.py:419] MiniMax M3 sparse attention selected Triton (kv_cache_dtype=auto, topk_blocks=16)
(Worker_TP0 pid=1830) INFO 06-26 06:56:43 [indexer.py:509] MiniMax M3 indexer: selected Triton (no fmha_sm100) [topk_blocks=16, indexer_kv_dtype=bf16, sm100=False]
(Worker_TP0 pid=1830) INFO 06-26 06:56:43 [nvfp4.py:270] Using 'MARLIN' NvFp4 MoE backend out of potential backends: ['FLASHINFER_TRTLLM', 'FLASHINFER_CUTLASS', 'MARLIN'].
(Worker_TP2 pid=1848) INFO 06-26 06:56:44 [weight_utils.py:811] Prefetching checkpoint files into page cache started (in background, num_threads=8, block_size=16777216 bytes)
(Worker_TP1 pid=1836) INFO 06-26 06:56:44 [weight_utils.py:811] Prefetching checkpoint files into page cache started (in background, num_threads=8, block_size=16777216 bytes)
(Worker_TP0 pid=1830) INFO 06-26 06:56:44 [weight_utils.py:849] Filesystem type for checkpoints: NFS4. Checkpoint size: 232.93 GiB. Available RAM: 953.98 GiB.
(Worker_TP0 pid=1830) INFO 06-26 06:56:44 [weight_utils.py:811] Prefetching checkpoint files into page cache started (in background, num_threads=8, block_size=16777216 bytes)
Loading safetensors checkpoint shards: 0% Completed | 0/88 [00:00<?, ?it/s]
(Worker_TP3 pid=1863) INFO 06-26 06:56:44 [weight_utils.py:811] Prefetching checkpoint files into page cache started (in background, num_threads=8, block_size=16777216 bytes)
(Worker_TP1 pid=1836) INFO 06-26 06:56:55 [weight_utils.py:783] Prefetching checkpoint files: 10% (3/22)
(Worker_TP2 pid=1848) INFO 06-26 06:56:55 [weight_utils.py:783] Prefetching checkpoint files: 10% (3/22)
(Worker_TP0 pid=1830) INFO 06-26 06:56:55 [weight_utils.py:783] Prefetching checkpoint files: 10% (3/22)
(Worker_TP0 pid=1830) INFO 06-26 06:56:55 [weight_utils.py:783] Prefetching checkpoint files: 20% (5/22)
(Worker_TP2 pid=1848) INFO 06-26 06:56:55 [weight_utils.py:783] Prefetching checkpoint files: 20% (5/22)
(Worker_TP3 pid=1863) INFO 06-26 06:56:55 [weight_utils.py:783] Prefetching checkpoint files: 10% (3/22)
(Worker_TP2 pid=1848) INFO 06-26 06:56:55 [weight_utils.py:783] Prefetching checkpoint files: 30% (7/22)
(Worker_TP1 pid=1836) INFO 06-26 06:56:55 [weight_utils.py:783] Prefetching checkpoint files: 20% (5/22)
(Worker_TP1 pid=1836) INFO 06-26 06:56:56 [weight_utils.py:783] Prefetching checkpoint files: 30% (7/22)
(Worker_TP3 pid=1863) INFO 06-26 06:56:56 [weight_utils.py:783] Prefetching checkpoint files: 20% (5/22)
(Worker_TP3 pid=1863) INFO 06-26 06:56:56 [weight_utils.py:783] Prefetching checkpoint files: 30% (7/22)
Loading safetensors checkpoint shards: 1% Completed | 1/88 [00:36<53:22, 36.81s/it]
(Worker_TP0 pid=1830) INFO 06-26 06:57:21 [weight_utils.py:783] Prefetching checkpoint files: 30% (7/22)
Loading safetensors checkpoint shards: 2% Completed | 2/88 [00:38<22:46, 15.89s/it]
Loading safetensors checkpoint shards: 3% Completed | 3/88 [00:39<12:58, 9.16s/it]
Loading safetensors checkpoint shards: 5% Completed | 4/88 [00:40<08:19, 5.95s/it]
Loading safetensors checkpoint shards: 6% Completed | 5/88 [00:41<05:52, 4.24s/it]
Loading safetensors checkpoint shards: 7% Completed | 6/88 [00:42<04:19, 3.17s/it]
Loading safetensors checkpoint shards: 8% Completed | 7/88 [00:43<03:22, 2.50s/it]
Loading safetensors checkpoint shards: 9% Completed | 8/88 [00:44<02:47, 2.10s/it]
Loading safetensors checkpoint shards: 10% Completed | 9/88 [00:45<02:18, 1.75s/it]
Loading safetensors checkpoint shards: 11% Completed | 10/88 [00:47<02:03, 1.58s/it]
Loading safetensors checkpoint shards: 12% Completed | 11/88 [00:48<01:57, 1.53s/it]
Loading safetensors checkpoint shards: 14% Completed | 12/88 [00:50<01:58, 1.57s/it]
Loading safetensors checkpoint shards: 15% Completed | 13/88 [00:52<02:21, 1.89s/it]
Loading safetensors checkpoint shards: 16% Completed | 14/88 [00:54<02:05, 1.69s/it]
Loading safetensors checkpoint shards: 17% Completed | 15/88 [00:55<01:52, 1.54s/it]
Loading safetensors checkpoint shards: 18% Completed | 16/88 [00:56<01:43, 1.43s/it]
Loading safetensors checkpoint shards: 19% Completed | 17/88 [00:57<01:43, 1.46s/it]
Loading safetensors checkpoint shards: 20% Completed | 18/88 [00:59<01:40, 1.44s/it]
Loading safetensors checkpoint shards: 22% Completed | 19/88 [01:00<01:35, 1.39s/it]
Loading safetensors checkpoint shards: 23% Completed | 20/88 [01:01<01:35, 1.40s/it]
Loading safetensors checkpoint shards: 24% Completed | 21/88 [01:03<01:38, 1.47s/it]
Loading safetensors checkpoint shards: 25% Completed | 22/88 [01:05<01:35, 1.45s/it]
Loading safetensors checkpoint shards: 26% Completed | 23/88 [01:06<01:32, 1.42s/it]
Loading safetensors checkpoint shards: 27% Completed | 24/88 [01:07<01:30, 1.41s/it]
Loading safetensors checkpoint shards: 28% Completed | 25/88 [01:09<01:28, 1.40s/it]
Loading safetensors checkpoint shards: 30% Completed | 26/88 [01:10<01:28, 1.43s/it]
Loading safetensors checkpoint shards: 31% Completed | 27/88 [01:12<01:26, 1.42s/it]
Loading safetensors checkpoint shards: 32% Completed | 28/88 [01:13<01:22, 1.38s/it]
Loading safetensors checkpoint shards: 33% Completed | 29/88 [01:14<01:25, 1.45s/it]
Loading safetensors checkpoint shards: 34% Completed | 30/88 [01:16<01:20, 1.39s/it]
Loading safetensors checkpoint shards: 35% Completed | 31/88 [01:17<01:14, 1.30s/it]
Loading safetensors checkpoint shards: 36% Completed | 32/88 [01:34<05:34, 5.98s/it]
Loading safetensors checkpoint shards: 38% Completed | 33/88 [01:36<04:33, 4.98s/it]
(Worker_TP0 pid=1830) INFO 06-26 06:58:21 [weight_utils.py:783] Prefetching checkpoint files: 40% (9/22)
(Worker_TP3 pid=1863) INFO 06-26 06:58:22 [weight_utils.py:783] Prefetching checkpoint files: 40% (9/22)
Loading safetensors checkpoint shards: 39% Completed | 34/88 [01:38<03:40, 4.08s/it]
(Worker_TP1 pid=1836) INFO 06-26 06:58:23 [weight_utils.py:783] Prefetching checkpoint files: 40% (9/22)
(Worker_TP0 pid=1830) INFO 06-26 06:58:24 [weight_utils.py:783] Prefetching checkpoint files: 50% (11/22)
Loading safetensors checkpoint shards: 40% Completed | 35/88 [01:39<02:48, 3.17s/it]
(Worker_TP0 pid=1830) INFO 06-26 06:58:24 [weight_utils.py:783] Prefetching checkpoint files: 60% (14/22)
Loading safetensors checkpoint shards: 41% Completed | 36/88 [01:40<02:04, 2.39s/it]
(Worker_TP2 pid=1848) INFO 06-26 06:58:24 [weight_utils.py:783] Prefetching checkpoint files: 40% (9/22)
(Worker_TP1 pid=1836) INFO 06-26 06:58:24 [weight_utils.py:783] Prefetching checkpoint files: 50% (11/22)
(Worker_TP3 pid=1863) INFO 06-26 06:58:25 [weight_utils.py:783] Prefetching checkpoint files: 50% (11/22)
(Worker_TP1 pid=1836) INFO 06-26 06:58:25 [weight_utils.py:783] Prefetching checkpoint files: 60% (14/22)
Loading safetensors checkpoint shards: 42% Completed | 37/88 [01:41<01:38, 1.93s/it]
(Worker_TP3 pid=1863) INFO 06-26 06:58:25 [weight_utils.py:783] Prefetching checkpoint files: 60% (14/22)
(Worker_TP2 pid=1848) INFO 06-26 06:58:25 [weight_utils.py:783] Prefetching checkpoint files: 50% (11/22)
Loading safetensors checkpoint shards: 43% Completed | 38/88 [01:42<01:20, 1.60s/it]
(Worker_TP2 pid=1848) INFO 06-26 06:58:27 [weight_utils.py:783] Prefetching checkpoint files: 60% (14/22)
Loading safetensors checkpoint shards: 44% Completed | 39/88 [01:43<01:09, 1.43s/it]
(Worker_TP1 pid=1836) INFO 06-26 06:58:28 [weight_utils.py:783] Prefetching checkpoint files: 70% (16/22)
(Worker_TP2 pid=1848) INFO 06-26 06:58:28 [weight_utils.py:783] Prefetching checkpoint files: 70% (16/22)
Loading safetensors checkpoint shards: 45% Completed | 40/88 [01:44<01:01, 1.28s/it]
Loading safetensors checkpoint shards: 47% Completed | 41/88 [01:44<00:55, 1.17s/it]
Loading safetensors checkpoint shards: 48% Completed | 42/88 [01:45<00:50, 1.10s/it]
Loading safetensors checkpoint shards: 49% Completed | 43/88 [01:46<00:47, 1.05s/it]
Loading safetensors checkpoint shards: 50% Completed | 44/88 [01:47<00:44, 1.01s/it]
Loading safetensors checkpoint shards: 51% Completed | 45/88 [01:48<00:42, 1.02it/s]
Loading safetensors checkpoint shards: 52% Completed | 46/88 [01:49<00:40, 1.04it/s]
Loading safetensors checkpoint shards: 53% Completed | 47/88 [01:50<00:39, 1.05it/s]
Loading safetensors checkpoint shards: 55% Completed | 48/88 [01:51<00:38, 1.04it/s]
Loading safetensors checkpoint shards: 56% Completed | 49/88 [01:52<00:36, 1.06it/s]
Loading safetensors checkpoint shards: 57% Completed | 50/88 [01:53<00:35, 1.08it/s]
Loading safetensors checkpoint shards: 58% Completed | 51/88 [01:54<00:37, 1.02s/it]
Loading safetensors checkpoint shards: 59% Completed | 52/88 [01:55<00:35, 1.02it/s]
Loading safetensors checkpoint shards: 60% Completed | 53/88 [01:56<00:35, 1.02s/it]
Loading safetensors checkpoint shards: 61% Completed | 54/88 [01:57<00:34, 1.02s/it]
Loading safetensors checkpoint shards: 62% Completed | 55/88 [01:58<00:32, 1.03it/s]
Loading safetensors checkpoint shards: 64% Completed | 56/88 [01:59<00:30, 1.04it/s]
Loading safetensors checkpoint shards: 65% Completed | 57/88 [02:02<00:46, 1.51s/it]
Loading safetensors checkpoint shards: 66% Completed | 58/88 [02:03<00:39, 1.32s/it]
Loading safetensors checkpoint shards: 67% Completed | 59/88 [02:03<00:34, 1.18s/it]
Loading safetensors checkpoint shards: 68% Completed | 60/88 [02:04<00:30, 1.10s/it]
(Worker_TP0 pid=1830) INFO 06-26 06:58:55 [weight_utils.py:783] Prefetching checkpoint files: 70% (16/22)
Loading safetensors checkpoint shards: 69% Completed | 61/88 [02:11<01:15, 2.80s/it]
Loading safetensors checkpoint shards: 70% Completed | 62/88 [02:12<00:57, 2.20s/it]
Loading safetensors checkpoint shards: 72% Completed | 63/88 [02:13<00:44, 1.78s/it]
Loading safetensors checkpoint shards: 73% Completed | 64/88 [02:43<04:05, 10.24s/it]
(Worker_TP3 pid=1863) INFO 06-26 06:59:27 [weight_utils.py:783] Prefetching checkpoint files: 70% (16/22)
(Worker_TP2 pid=1848) INFO 06-26 06:59:28 [weight_utils.py:783] Prefetching checkpoint files: 80% (18/22)
(Worker_TP2 pid=1848) INFO 06-26 06:59:28 [weight_utils.py:783] Prefetching checkpoint files: 90% (20/22)
(Worker_TP2 pid=1848) INFO 06-26 06:59:29 [weight_utils.py:783] Prefetching checkpoint files: 100% (22/22)
(Worker_TP2 pid=1848) INFO 06-26 06:59:29 [weight_utils.py:806] Prefetching checkpoint files into page cache finished in 164.89s
(Worker_TP1 pid=1836) INFO 06-26 06:59:29 [weight_utils.py:783] Prefetching checkpoint files: 80% (18/22)
(Worker_TP1 pid=1836) INFO 06-26 06:59:29 [weight_utils.py:783] Prefetching checkpoint files: 90% (20/22)
Loading safetensors checkpoint shards: 74% Completed | 65/88 [02:45<02:59, 7.81s/it]
(Worker_TP1 pid=1836) INFO 06-26 06:59:29 [weight_utils.py:783] Prefetching checkpoint files: 100% (22/22)
(Worker_TP1 pid=1836) INFO 06-26 06:59:29 [weight_utils.py:806] Prefetching checkpoint files into page cache finished in 165.47s
Loading safetensors checkpoint shards: 75% Completed | 66/88 [02:45<02:03, 5.63s/it]
Loading safetensors checkpoint shards: 76% Completed | 67/88 [02:46<01:26, 4.10s/it]
Loading safetensors checkpoint shards: 77% Completed | 68/88 [02:46<01:00, 3.02s/it]
(Worker_TP0 pid=1830) INFO 06-26 06:59:31 [weight_utils.py:783] Prefetching checkpoint files: 80% (18/22)
(Worker_TP0 pid=1830) INFO 06-26 06:59:31 [weight_utils.py:783] Prefetching checkpoint files: 90% (20/22)
Loading safetensors checkpoint shards: 78% Completed | 69/88 [02:47<00:43, 2.28s/it]
(Worker_TP0 pid=1830) INFO 06-26 06:59:31 [weight_utils.py:783] Prefetching checkpoint files: 100% (22/22)
(Worker_TP0 pid=1830) INFO 06-26 06:59:31 [weight_utils.py:806] Prefetching checkpoint files into page cache finished in 167.43s
(Worker_TP3 pid=1863) INFO 06-26 06:59:31 [weight_utils.py:783] Prefetching checkpoint files: 80% (18/22)
(Worker_TP3 pid=1863) INFO 06-26 06:59:31 [weight_utils.py:783] Prefetching checkpoint files: 90% (20/22)
Loading safetensors checkpoint shards: 80% Completed | 70/88 [02:47<00:30, 1.70s/it]
(Worker_TP3 pid=1863) INFO 06-26 06:59:31 [weight_utils.py:783] Prefetching checkpoint files: 100% (22/22)
(Worker_TP3 pid=1863) INFO 06-26 06:59:31 [weight_utils.py:806] Prefetching checkpoint files into page cache finished in 167.73s
Loading safetensors checkpoint shards: 81% Completed | 71/88 [02:48<00:21, 1.29s/it]
Loading safetensors checkpoint shards: 82% Completed | 72/88 [02:48<00:16, 1.00s/it]
Loading safetensors checkpoint shards: 83% Completed | 73/88 [02:48<00:12, 1.25it/s]
Loading safetensors checkpoint shards: 84% Completed | 74/88 [02:49<00:09, 1.52it/s]
Loading safetensors checkpoint shards: 85% Completed | 75/88 [02:49<00:07, 1.78it/s]
Loading safetensors checkpoint shards: 86% Completed | 76/88 [02:49<00:05, 2.04it/s]
Loading safetensors checkpoint shards: 88% Completed | 77/88 [02:50<00:04, 2.27it/s]
Loading safetensors checkpoint shards: 89% Completed | 78/88 [02:50<00:04, 2.46it/s]
Loading safetensors checkpoint shards: 90% Completed | 79/88 [02:50<00:03, 2.61it/s]
Loading safetensors checkpoint shards: 91% Completed | 80/88 [02:51<00:02, 2.73it/s]
Loading safetensors checkpoint shards: 92% Completed | 81/88 [02:51<00:02, 2.83it/s]
Loading safetensors checkpoint shards: 93% Completed | 82/88 [02:51<00:02, 2.90it/s]
Loading safetensors checkpoint shards: 94% Completed | 83/88 [02:52<00:01, 2.95it/s]
Loading safetensors checkpoint shards: 95% Completed | 84/88 [02:52<00:01, 2.98it/s]
Loading safetensors checkpoint shards: 97% Completed | 85/88 [02:52<00:00, 3.00it/s]
Loading safetensors checkpoint shards: 98% Completed | 86/88 [02:52<00:00, 3.02it/s]
Loading safetensors checkpoint shards: 99% Completed | 87/88 [02:53<00:00, 3.03it/s]
Loading safetensors checkpoint shards: 100% Completed | 88/88 [02:53<00:00, 3.04it/s]
Loading safetensors checkpoint shards: 100% Completed | 88/88 [02:53<00:00, 1.97s/it]
(Worker_TP0 pid=1830)
(Worker_TP0 pid=1830) INFO 06-26 06:59:37 [default_loader.py:430] Loading weights took 173.67 seconds
(Worker_TP0 pid=1830) WARNING 06-26 06:59:37 [marlin_utils_fp4.py:321] Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
(Worker_TP0 pid=1830) INFO 06-26 06:59:38 [nvfp4.py:482] Using MoEPrepareAndFinalizeNoDPEPModular
(Worker_TP0 pid=1830) INFO 06-26 06:59:43 [gpu_model_runner.py:5255] Model loading took 60.92 GiB memory and 179.987373 seconds
(Worker_TP0 pid=1830) INFO 06-26 06:59:43 [breakable_cudagraph.py:288] Breakable CUDA graph enabled
(Worker_TP0 pid=1830) INFO 06-26 06:59:43 [gpu_model_runner.py:6271] Encoder cache will be initialized with a budget of 192000 tokens, and profiled with 1 video items of the maximum feature size.
(Worker_TP1 pid=1836) INFO 06-26 07:00:23 [gpu_model_runner.py:6483] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=51 (largest=512)
(Worker_TP0 pid=1830) INFO 06-26 07:00:23 [gpu_model_runner.py:6483] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=51 (largest=512)
(Worker_TP2 pid=1848) INFO 06-26 07:00:23 [gpu_model_runner.py:6483] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=51 (largest=512)
(Worker_TP3 pid=1863) INFO 06-26 07:00:23 [gpu_model_runner.py:6483] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=51 (largest=512)
(EngineCore pid=1654) INFO 06-26 07:00:44 [shm_broadcast.py:705] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(Worker_TP1 pid=1836) INFO 06-26 07:01:39 [gpu_model_runner.py:6588] Estimated CUDA graph memory: 2.47 GiB total
(Worker_TP3 pid=1863) INFO 06-26 07:01:39 [gpu_model_runner.py:6588] Estimated CUDA graph memory: 2.47 GiB total
(Worker_TP2 pid=1848) INFO 06-26 07:01:39 [gpu_model_runner.py:6588] Estimated CUDA graph memory: 2.47 GiB total
(Worker_TP0 pid=1830) INFO 06-26 07:01:39 [gpu_model_runner.py:6588] Estimated CUDA graph memory: 2.47 GiB total
(Worker_TP3 pid=1863) INFO 06-26 07:01:39 [gpu_worker.py:523] CUDA graph memory profiling is enabled (default since v0.21.0). The current --gpu-memory-utilization=0.9500 is equivalent to --gpu-memory-utilization=0.9239 without CUDA graph memory profiling. To maintain the same effective KV cache size as before, increase --gpu-memory-utilization to 0.9761. To disable, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0.
(Worker_TP1 pid=1836) INFO 06-26 07:01:39 [gpu_worker.py:523] CUDA graph memory profiling is enabled (default since v0.21.0). The current --gpu-memory-utilization=0.9500 is equivalent to --gpu-memory-utilization=0.9239 without CUDA graph memory profiling. To maintain the same effective KV cache size as before, increase --gpu-memory-utilization to 0.9761. To disable, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0.
(Worker_TP2 pid=1848) INFO 06-26 07:01:39 [gpu_worker.py:523] CUDA graph memory profiling is enabled (default since v0.21.0). The current --gpu-memory-utilization=0.9500 is equivalent to --gpu-memory-utilization=0.9239 without CUDA graph memory profiling. To maintain the same effective KV cache size as before, increase --gpu-memory-utilization to 0.9761. To disable, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0.
(Worker_TP0 pid=1830) INFO 06-26 07:01:39 [gpu_worker.py:508] Available KV cache memory: 3.37 GiB
(Worker_TP0 pid=1830) INFO 06-26 07:01:39 [gpu_worker.py:523] CUDA graph memory profiling is enabled (default since v0.21.0). The current --gpu-memory-utilization=0.9500 is equivalent to --gpu-memory-utilization=0.9239 without CUDA graph memory profiling. To maintain the same effective KV cache size as before, increase --gpu-memory-utilization to 0.9761. To disable, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0.
(EngineCore pid=1654) INFO 06-26 07:01:39 [kv_cache_utils.py:1954] Auto-fit max_model_len: reduced from 1048576 to 78976 to fit in available GPU memory (3.34 GiB available for KV cache)
(EngineCore pid=1654) INFO 06-26 07:01:39 [kv_cache_utils.py:2146] GPU KV cache size: 78,976 tokens
(EngineCore pid=1654) INFO 06-26 07:01:39 [kv_cache_utils.py:2147] Maximum concurrency for 78,976 tokens per request: 1.00x
(Worker_TP0 pid=1830) INFO 06-26 07:01:39 [deep_gemm.py:175] deep_gemm not found in site-packages, trying vendored vllm.third_party.deep_gemm
(Worker_TP0 pid=1830) INFO 06-26 07:01:39 [deep_gemm.py:202] DeepGEMM PDL enabled on vllm.third_party.deep_gemm.
(Worker_TP1 pid=1836) 2026-06-26 07:01:39,906 - INFO - autotuner.py:622 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker_TP2 pid=1848) 2026-06-26 07:01:39,906 - INFO - autotuner.py:622 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker_TP0 pid=1830) 2026-06-26 07:01:39,906 - INFO - autotuner.py:622 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker_TP3 pid=1863) 2026-06-26 07:01:39,906 - INFO - autotuner.py:622 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker_TP1 pid=1836) 2026-06-26 07:01:40,265 - INFO - autotuner.py:641 - flashinfer.jit: [Autotuner]: Autotuning process ends
(Worker_TP0 pid=1830) 2026-06-26 07:01:40,265 - INFO - autotuner.py:641 - flashinfer.jit: [Autotuner]: Autotuning process ends
(Worker_TP3 pid=1863) 2026-06-26 07:01:40,265 - INFO - autotuner.py:641 - flashinfer.jit: [Autotuner]: Autotuning process ends
(Worker_TP2 pid=1848) 2026-06-26 07:01:40,265 - INFO - autotuner.py:641 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 51/51 [00:07<00:00, 6.91it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 51/51 [00:18<00:00, 2.69it/s]
(Worker_TP3 pid=1863) INFO 06-26 07:02:07 [gpu_worker.py:667] CUDA graph pool memory: 3.14 GiB (actual), 2.47 GiB (estimated), difference: 0.66 GiB (21.2%).
(Worker_TP1 pid=1836) INFO 06-26 07:02:07 [gpu_worker.py:667] CUDA graph pool memory: 3.14 GiB (actual), 2.47 GiB (estimated), difference: 0.66 GiB (21.2%).
(Worker_TP2 pid=1848) INFO 06-26 07:02:07 [gpu_worker.py:667] CUDA graph pool memory: 3.14 GiB (actual), 2.47 GiB (estimated), difference: 0.66 GiB (21.2%).
(Worker_TP0 pid=1830) INFO 06-26 07:02:07 [gpu_model_runner.py:6656] Graph capturing finished in 27 secs, took 3.14 GiB
(Worker_TP0 pid=1830) INFO 06-26 07:02:07 [gpu_worker.py:667] CUDA graph pool memory: 3.14 GiB (actual), 2.47 GiB (estimated), difference: 0.66 GiB (21.2%).
(Worker_TP3 pid=1863) INFO 06-26 07:02:07 [jit_monitor.py:71] Kernel JIT monitor activated; monitored JIT compilations during inference will use mode=warn.
(Worker_TP2 pid=1848) INFO 06-26 07:02:07 [jit_monitor.py:71] Kernel JIT monitor activated; monitored JIT compilations during inference will use mode=warn.
(Worker_TP1 pid=1836) INFO 06-26 07:02:07 [jit_monitor.py:71] Kernel JIT monitor activated; monitored JIT compilations during inference will use mode=warn.
(Worker_TP0 pid=1830) INFO 06-26 07:02:07 [jit_monitor.py:71] Kernel JIT monitor activated; monitored JIT compilations during inference will use mode=warn.
(EngineCore pid=1654) INFO 06-26 07:02:07 [core.py:344] init engine (profile, create kv cache, warmup model) took 144.49 s
(EngineCore pid=1654) INFO 06-26 07:02:13 [vllm.py:1006] Asynchronous scheduling is enabled.
(EngineCore pid=1654) WARNING 06-26 07:02:13 [vllm.py:1100] VLLM_USE_BREAKABLE_CUDAGRAPH is set, disabling vLLM's torch.compile pipeline. Equivalent to -cc.mode=none.
(EngineCore pid=1654) WARNING 06-26 07:02:13 [vllm.py:1110] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore pid=1654) INFO 06-26 07:02:13 [kernel.py:276] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
(EngineCore pid=1654) INFO 06-26 07:02:13 [compilation.py:310] Enabled custom fusions: norm_quant, act_quant
(APIServer pid=1) INFO 06-26 07:02:13 [api_server.py:619] Supported tasks: ['generate']
(APIServer pid=1) INFO 06-26 07:02:13 [parser_manager.py:37] "auto" tool choice has been enabled.
(APIServer pid=1) WARNING 06-26 07:02:13 [model.py:1477] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 1.0, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=1) INFO 06-26 07:02:14 [hf.py:548] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1) INFO 06-26 07:02:20 [base.py:223] Multi-modal warmup completed in 5.555s
(APIServer pid=1) INFO 06-26 07:02:21 [base.py:223] Readonly multi-modal warmup completed in 0.773s
(APIServer pid=1) INFO 06-26 07:02:21 [api_server.py:623] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=1) INFO 06-26 07:02:21 [launcher.py:37] Available routes are:
(APIServer pid=1) INFO 06-26 07:02:21 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=1) INFO 06-26 07:02:21 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=1) INFO 06-26 07:02:21 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=1) INFO 06-26 07:02:21 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=1) INFO 06-26 07:02:21 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1) INFO 06-26 07:02:21 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1) INFO 06-26 07:02:21 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1) INFO 06-26 07:02:21 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1) INFO 06-26 07:02:21 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 06-26 07:02:21 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 06-26 07:02:21 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 06-26 07:02:21 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1) INFO 06-26 07:02:21 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1) INFO 06-26 07:02:21 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 06-26 07:02:21 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 06-26 07:02:21 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=1) INFO 06-26 07:02:21 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 06-26 07:02:21 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 06-26 07:02:21 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 06-26 07:02:21 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 06-26 07:02:21 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1) INFO 06-26 07:02:21 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=1) INFO 06-26 07:02:21 [launcher.py:46] Route: /generative_scoring, Methods: POST
(APIServer pid=1) INFO 06-26 07:02:21 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1) INFO 06-26 07:02:21 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 06-26 07:02:21 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 06-26 07:02:21 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=1) INFO 06-26 07:02:21 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=1) INFO 06-26 07:02:21 [launcher.py:46] Route: /v1/chat/completions/derender, Methods: POST
(APIServer pid=1) INFO 06-26 07:02:21 [launcher.py:46] Route: /v1/completions/derender, Methods: POST
(APIServer pid=1) INFO: Started server process [1]
(APIServer pid=1) INFO: Waiting for application startup.
(APIServer pid=1) INFO: Application startup complete.
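A note on the `--gpu-memory-utilization` message in the log above: the three numbers it prints (0.9500, 0.9239, 0.9761) are consistent with a simple offset, where the CUDA graph estimate costs a fixed fraction of GPU memory and the suggested value adds that fraction back. The sketch below only reproduces that arithmetic from the logged values; it is my reading of the message, not vLLM's actual profiling code, and the variable names are made up for illustration.

```python
# Values copied from the gpu_worker.py log line; names are illustrative only.
requested = 0.9500          # --gpu-memory-utilization passed at launch
equivalent_legacy = 0.9239  # what 0.95 corresponds to without CUDA graph profiling

# Assumption: the gap is the fraction of GPU memory reserved for CUDA graphs.
cudagraph_fraction = requested - equivalent_legacy

# Adding the same fraction on top of the requested value gives the
# setting the log recommends to keep the previous effective KV cache size.
compensated = requested + cudagraph_fraction

print(f"estimated CUDA graph fraction: {cudagraph_fraction:.4f}")
print(f"compensated utilization:       {compensated:.4f}")
```

The `0.97` used in the args further down the thread sits just below the compensated value from this run.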
Can you share your docker-compose or docker run command, please? Which exact version of vLLM are you using, and what prefill and output speeds should we expect?
What about the KV cache usage?
@asher9972
I was testing this on 4x Blackwell RTX 6000 Pro.
KV cache usage is pretty heavy because it first allocates space for 192000 encoder tokens for the vision tower. It runs at around 30 tok/s. I have not load tested prefill yet. This is vLLM nightly 0.23.1rc1.dev471+ge312c5cb2. You can fit in some more tokens by disabling multimodal and using the --enforce-eager flag (hurts throughput, though).
My current args are:
- "models/minimax-m3-nvfp4"
- "--served-model-name"
- "minimax-m3"
- "--block-size"
- "128"
- "--max-model-len"
- "auto"
- "--gpu-memory-utilization"
- "0.97"
- "--tensor-parallel-size"
- "4"
- "--tool-call-parser"
- "minimax_m3"
- "--reasoning-parser"
- "minimax_m3"
- "--enable-auto-tool-choice"
- "--trust-remote-code"
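Since a full docker command was asked for: the thread does not give the image name, volume mounts, or port mapping, so the sketch below only joins the args listed above into a single shell line. It assumes the list is handed to `vllm serve` as the container command, which matches the `model_tag` entry in the startup log but is not stated explicitly here.

```python
import shlex

# Args copied verbatim from the list above; the first entry is the model path.
serve_args = [
    "models/minimax-m3-nvfp4",
    "--served-model-name", "minimax-m3",
    "--block-size", "128",
    "--max-model-len", "auto",
    "--gpu-memory-utilization", "0.97",
    "--tensor-parallel-size", "4",
    "--tool-call-parser", "minimax_m3",
    "--reasoning-parser", "minimax_m3",
    "--enable-auto-tool-choice",
    "--trust-remote-code",
]

# Assumption: the container entrypoint is equivalent to `vllm serve <args>`.
command = shlex.join(["vllm", "serve", *serve_args])
print(command)
```

To try the two memory-saving options mentioned above, `--enforce-eager` can be appended to the same list; the exact flag for disabling multimodal input is not given in this thread, so it is left out.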
Thanks for sharing this. We currently only have 2 RTX 6000s running StepFun 2.7, but we plan to upgrade to 4 once there's a compelling reason to do so.
256k context length with 5-10 parallel sessions and multimodality would be enough for us at the moment. If you have time, benchmarking this would be awesome.
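As a rough sanity check on that target, the startup log reports 78,976 tokens fitting in a 3.34 GiB KV cache budget. The sketch below scales that ratio to 256k context and 1, 5, and 10 sessions. It rests on loud assumptions: KV memory grows linearly with token count (which may not hold exactly for a sparse-attention model), "256k" means 256 × 1024 tokens, and the budget is the same kind of figure the log reports. Treat the result as a back-of-envelope estimate, not a benchmark.

```python
GIB = 1024 ** 3

def kv_bytes_per_token(kv_budget_gib: float, tokens: int) -> float:
    """Bytes of KV cache per token, derived from a logged budget and capacity."""
    return kv_budget_gib * GIB / tokens

def kv_budget_needed_gib(context_tokens: int, sessions: int,
                         bytes_per_token: float) -> float:
    """KV budget in GiB for `sessions` full-length contexts, assuming linear scaling."""
    return context_tokens * sessions * bytes_per_token / GIB

# Figures from the kv_cache_utils.py log lines in the first post.
per_token = kv_bytes_per_token(kv_budget_gib=3.34, tokens=78_976)

# Assumption: "256k" is read as 256 * 1024 tokens.
target_context = 256 * 1024

for sessions in (1, 5, 10):
    needed = kv_budget_needed_gib(target_context, sessions, per_token)
    print(f"{sessions:>2} session(s) at {target_context} tokens: {needed:.1f} GiB of KV budget")
```

Under these assumptions, even a single 256k session needs roughly three times the KV budget shown in the log, so the encoder-token reservation and the `--enforce-eager` trade-off mentioned above matter a lot for this use case.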