Qwen3.6-27B-Omnimerge-v4 — MLX 4-bit (Vision-Language)

Full multimodal 4-bit MLX quantization of ManniX-ITA/Qwen3.6-27B-Omnimerge-v4: text + image + video, runnable natively on Apple Silicon via mlx-vlm.

This is the VL build. There is also a text-only MLX 4-bit release for language-only inference via mlx-lm, with a slightly smaller footprint.

The base model is a same-base DARE-TIES (Omnimerge_v2 method) merge of Qwen/Qwen3.6-27B with three Qwen3.6 fine-tunes (rico03, Esper3.1, kai-os Opus-Reasoning-anchor), plus an MLP-passthrough surgery that fixes Qwen3.6's reasoning-tag-emission fragility. Method, benchmark numbers, and forensic write-up live on the base model card.

Quantization

  • Type: MLX 4-bit (-q --q-bits 4 --q-group-size 64) via mlx_vlm.convert
  • Group size: 64
  • Effective bits/weight: 4.695 (slightly higher than the text-only 4.501 — mlx_vlm keeps the vision tower in higher precision by default; only the LM weights are 4-bit quantized)
  • Shape on disk: 3 safetensors shards, ~16 GB total
  • What is preserved:
    • vision_tower.* weights — full vision encoder
    • multi_modal_projector.* weights — vision → LM connector
    • preprocessor_config.json — image preprocessing
    • video_preprocessor_config.json — video preprocessing
    • processor_config.json — chat-time processor wiring
    • chat_template.jinja — Qwen3.5/3.6 chat template with image/video roles
  • Build env (verified 2026-05-11 on Linux + RTX 3090 + CUDA 12.1):
    mlx==0.30.0
    mlx-cuda==0.30.0      ← ABI-coupled, must match mlx
    mlx-lm==0.30.7        ← Qwen3.5/3.6 model_type support
    mlx-vlm==0.3.12 (--no-deps)   ← last version that doesn't transitively bump mlx
    torch==2.11.0+cpu     ← satisfies Qwen3VLVideoProcessor's torchvision dep
                              without disturbing mlx-cuda's nvidia-cublas pin
    
    CUDA backend used only for the conversion step on this Linux box; end users on Apple Silicon use the native mlx runtime, which has no CUDA dependency.

Conversion recipe: omnimergekit/scripts/mlx_convert.sh (auto-detects vision_config and routes through mlx_vlm.convert). See MLX_CONVERT.md for the full pin rationale.
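
If you want to reproduce the conversion without the script, it reduces to one call. A minimal sketch, assuming mlx_vlm.utils exposes a convert() mirroring mlx_lm's (check your pinned mlx-vlm; the entry point has moved between releases). The quantization arguments match the flags quoted above:

# Hedged sketch: signature assumed to mirror mlx_lm's converter; verify
# against your installed mlx-vlm before relying on it.
from mlx_vlm.utils import convert

convert(
    "ManniX-ITA/Qwen3.6-27B-Omnimerge-v4",            # BF16 source weights
    mlx_path="Qwen3.6-27B-Omnimerge-v4-MLX-VL-4bit",  # output directory
    quantize=True,                                    # -q
    q_bits=4,                                         # --q-bits 4
    q_group_size=64,                                  # --q-group-size 64
)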

Usage

pip install -U mlx-vlm

Then, in Python:
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

repo = "ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MLX-VL-4bit"
model, processor = load(repo)
config = load_config(repo)

# Pure-text generation
prompt = apply_chat_template(processor, config,
    "Write a Rust function that returns the n-th Fibonacci number iteratively.")
print(generate(model, processor, prompt, max_tokens=512, verbose=True))

# Vision (with an image)
prompt = apply_chat_template(processor, config,
    "Describe the image in detail, then state what's likely happening.",
    num_images=1)
print(generate(model, processor, prompt,
    max_tokens=512, verbose=True, image=["path/to/image.png"]))

# Video (with a clip)
prompt = apply_chat_template(processor, config,
    "Summarize what happens in this video.", num_videos=1)
print(generate(model, processor, prompt,
    max_tokens=512, verbose=True, video=["path/to/clip.mp4"]))

The base model emits Qwen3.6 reasoning tags (<think>...</think>). Strip them in post-processing or use a chat template wrapper that handles them appropriately.
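
If you only need the final answer, a regex pass is enough. A minimal sketch (the <think> tag name comes from the note above; the second pattern catches a generation truncated mid-reasoning):

import re

def strip_reasoning(text: str) -> str:
    # Drop closed <think>...</think> blocks first, then any unclosed
    # trailing block left behind when max_tokens cut the generation short.
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    return re.sub(r"<think>.*", "", text, flags=re.DOTALL).strip()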

Memory & speed

Empirically (M-series, 32 GB+ recommended):

  • Resident memory: ~17–18 GB (vs ~16–17 GB for the text-only build — vision tower adds ~1 GB at higher precision)
  • Speed: comparable to other Qwen3-VL 27B 4-bit MLX builds; depends on chip generation
  • Context length: inherits the base model's 256k context (RAM permitting)
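
To check the resident-memory numbers on your own machine, MLX exposes its allocator counters. A quick sketch (mx.get_active_memory() and mx.get_peak_memory() are present in the mlx 0.30.x line pinned above; older releases namespaced them under mx.metal):

import mlx.core as mx
from mlx_vlm import load

model, processor = load("ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MLX-VL-4bit")
mx.eval(model.parameters())  # materialize the lazily-loaded weights first

print(f"active: {mx.get_active_memory() / 2**30:.1f} GiB")
print(f"peak:   {mx.get_peak_memory() / 2**30:.1f} GiB")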

Choosing between the two MLX builds

                              text-only             VL (this build)
  Loader                      mlx_lm.load           mlx_vlm.load
  Image / video input         no                    yes
  Disk size                   ~15 GB                ~16 GB
  Resident RAM                ~16–17 GB             ~17–18 GB
  Quality on text-only tasks  identical LM weights  identical LM weights

Pick text-only if you don't need vision and want a marginally smaller download. Pick VL for anything multimodal — same language model behind it.
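
For completeness, the text-only path in mlx-lm looks like this. A hedged sketch: the repo id below is hypothetical (the text-only release is not named on this card), and the calls are mlx_lm's standard load/generate:

from mlx_lm import load, generate

# Hypothetical repo id for the text-only build; check the author's profile.
model, tokenizer = load("ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MLX-4bit")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain DARE-TIES merging in two sentences."}],
    add_generation_prompt=True,
    tokenize=False,
)
print(generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True))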

Related

  • Base model: ManniX-ITA/Qwen3.6-27B-Omnimerge-v4 (merge method, benchmarks, forensic write-up)
  • Text-only MLX 4-bit build (language-only inference via mlx-lm)

License

Apache 2.0 — inherits from Qwen3.6 base. See the base model card for the full attribution list (Qwen team, rico03, ValiantLabs, kai-os, mergekit community).
