Qwen3.6-27B-Omnimerge-v4 — MLX 4-bit (Vision-Language)

Full multimodal 4-bit MLX quantization of ManniX-ITA/Qwen3.6-27B-Omnimerge-v4: text + image + video, runnable natively on Apple Silicon via mlx-vlm.

This is the VL build. There is also a text-only MLX 4-bit release for language-only inference via mlx-lm, with a slightly smaller footprint.

The base model is a same-base DARE-TIES (Omnimerge_v2 method) merge of Qwen/Qwen3.6-27B with three Qwen3.6 fine-tunes (rico03, Esper3.1, kai-os Opus-Reasoning-anchor), plus an MLP-passthrough surgery that fixes Qwen3.6's reasoning-tag-emission fragility. Method, benchmark numbers, and forensic write-up live on the base model card.

Quantization

  • Type: MLX 4-bit (-q --q-bits 4 --q-group-size 64) via mlx_vlm.convert
  • Group size: 64
  • Effective bits/weight: 4.695 (slightly higher than the text-only 4.501 — mlx_vlm keeps the vision tower in higher precision by default; only the LM weights are 4-bit quantized)
  • Shape on disk: 3 safetensors shards, ~16 GB total
  • What is preserved:
    • vision_tower.* weights — full vision encoder
    • multi_modal_projector.* weights — vision → LM connector
    • preprocessor_config.json — image preprocessing
    • video_preprocessor_config.json — video preprocessing
    • processor_config.json — chat-time processor wiring
    • chat_template.jinja — Qwen3.5/3.6 chat template with image/video roles
  • Build env (verified 2026-05-11 on Linux + RTX 3090 + CUDA 12.1):
    mlx==0.30.0
    mlx-cuda==0.30.0      ← ABI-coupled, must match mlx
    mlx-lm==0.30.7        ← Qwen3.5/3.6 model_type support
    mlx-vlm==0.3.12 (--no-deps)   ← last version that doesn't transitively bump mlx
    torch==2.11.0+cpu     ← satisfies Qwen3VLVideoProcessor's torchvision dep
                              without disturbing mlx-cuda's nvidia-cublas pin
    
    CUDA backend used only for the conversion step on this Linux box; end users on Apple Silicon use the native mlx runtime, which has no CUDA dependency.

Conversion recipe: omnimergekit/scripts/mlx_convert.sh (auto-detects vision_config and routes through mlx_vlm.convert). See MLX_CONVERT.md for the full pin rationale.
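
If you want to reproduce the conversion without the script, it reduces to one call. A minimal sketch, assuming mlx_vlm.utils exposes a convert() mirroring mlx_lm's (check your pinned mlx-vlm; the entry point has moved between releases). The quantization arguments match the flags quoted above:

# Hedged sketch: signature assumed to mirror mlx_lm's converter; verify
# against your installed mlx-vlm before relying on it.
from mlx_vlm.utils import convert

convert(
    "ManniX-ITA/Qwen3.6-27B-Omnimerge-v4",            # BF16 source weights
    mlx_path="Qwen3.6-27B-Omnimerge-v4-MLX-VL-4bit",  # output directory
    quantize=True,                                    # -q
    q_bits=4,                                         # --q-bits 4
    q_group_size=64,                                  # --q-group-size 64
)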

Usage

pip install -U mlx-vlm

Then, in Python:
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

repo = "ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MLX-VL-4bit"
model, processor = load(repo)
config = load_config(repo)

# Pure-text generation
prompt = apply_chat_template(processor, config,
    "Write a Rust function that returns the n-th Fibonacci number iteratively.")
print(generate(model, processor, prompt, max_tokens=512, verbose=True))

# Vision (with an image)
prompt = apply_chat_template(processor, config,
    "Describe the image in detail, then state what's likely happening.",
    num_images=1)
print(generate(model, processor, prompt,
    max_tokens=512, verbose=True, image=["path/to/image.png"]))

# Video (with a clip)
prompt = apply_chat_template(processor, config,
    "Summarize what happens in this video.", num_videos=1)
print(generate(model, processor, prompt,
    max_tokens=512, verbose=True, video=["path/to/clip.mp4"]))

The base model emits Qwen3.6 reasoning tags (<think>...</think>). Strip them in post-processing or use a chat template wrapper that handles them appropriately.
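
If you only need the final answer, a regex pass is enough. A minimal sketch (the <think> tag name comes from the note above; the second pattern catches a generation truncated mid-reasoning):

import re

def strip_reasoning(text: str) -> str:
    # Drop closed <think>...</think> blocks first, then any unclosed
    # trailing block left behind when max_tokens cut the generation short.
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    return re.sub(r"<think>.*", "", text, flags=re.DOTALL).strip()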

Memory & speed

Empirically (M-series, 32 GB+ recommended):

  • Resident memory: ~17–18 GB (vs ~16–17 GB for the text-only build — vision tower adds ~1 GB at higher precision)
  • Speed: comparable to other Qwen3-VL 27B 4-bit MLX builds; depends on chip generation
  • Context length: inherits the base model's 256k context (RAM permitting)
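
To check the resident-memory numbers on your own machine, MLX exposes its allocator counters. A quick sketch (mx.get_active_memory() and mx.get_peak_memory() are present in the mlx 0.30.x line pinned above; older releases namespaced them under mx.metal):

import mlx.core as mx
from mlx_vlm import load

model, processor = load("ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MLX-VL-4bit")
mx.eval(model.parameters())  # materialize the lazily-loaded weights first

print(f"active: {mx.get_active_memory() / 2**30:.1f} GiB")
print(f"peak:   {mx.get_peak_memory() / 2**30:.1f} GiB")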

Choosing between the two MLX builds

                              text-only             VL (this build)
  Loader                      mlx_lm.load           mlx_vlm.load
  Image / video input         no                    yes
  Disk size                   ~15 GB                ~16 GB
  Resident RAM                ~16–17 GB             ~17–18 GB
  Quality on text-only tasks  identical LM weights  identical LM weights

Pick text-only if you don't need vision and want a marginally smaller download. Pick VL for anything multimodal — same language model behind it.
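
For completeness, the text-only path in mlx-lm looks like this. A hedged sketch: the repo id below is hypothetical (the text-only release is not named on this card), and the calls are mlx_lm's standard load/generate:

from mlx_lm import load, generate

# Hypothetical repo id for the text-only build; check the author's profile.
model, tokenizer = load("ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MLX-4bit")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain DARE-TIES merging in two sentences."}],
    add_generation_prompt=True,
    tokenize=False,
)
print(generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True))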

Related

  • Base model: ManniX-ITA/Qwen3.6-27B-Omnimerge-v4 (merge method, benchmarks, forensic write-up)
  • Text-only MLX 4-bit build (language-only inference via mlx-lm)

License

Apache 2.0 — inherits from Qwen3.6 base. See the base model card for the full attribution list (Qwen team, rico03, ValiantLabs, kai-os, mergekit community).
