Nemotron-3-Nano-Omni-30B-A3B-Reasoning - GGUF

This repository contains high-fidelity GGUF quantizations of NVIDIA's Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16.

These quantizations were explicitly compiled to maximize logic, reasoning, and narrative consistency for local deployments, making them highly suitable for text-based RPG engines and structured JSON output.

🧠 High-Fidelity Quantization Strategy: FP16 Output Head

Unlike standard GGUF conversions, these models were quantized using the --leave-output-tensor flag.

What does this mean? The final projection layer (lm_head / output.weight), which maps the model's internal states to the vocabulary, has been preserved in pristine FP16 precision (~2.1 GB). While this slightly increases the overall file size and initial disk load time, it completely eliminates the "numerical noise" introduced when the output head is crushed down to 4 or 5 bits.
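
For reference, this is roughly how such a conversion is produced with llama.cpp's llama-quantize tool (a minimal sketch, invoked from Python here; the file names are placeholders):

```python
# Sketch of the quantization step using llama.cpp's llama-quantize tool.
# File names are placeholders; --leave-output-tensor is the key detail,
# telling the tool to leave output.weight (lm_head) unquantized.
import subprocess

subprocess.run(
    [
        "./llama-quantize",
        "--leave-output-tensor",  # keep the output head at full precision
        "Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16.gguf",
        "Nemotron-3-Nano-Omni-30B-A3B-Reasoning-Q4_K_M.gguf",
        "Q4_K_M",                 # target quantization type
    ],
    check=True,
)
```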

The Result: Smaller quantizations (like Q3_K_M or Q4_K_M) retain the sharp logic, precise tool-calling, and chain-of-thought (<think>) capabilities of the massive uncompressed model, all while fitting comfortably into local RAM constraints.

⚙️ Model Architecture & Hardware Requirements

  • Architecture: Mamba2-Transformer Hybrid Mixture of Experts (MoE)
  • Parameters: 30 Billion total
  • Active Parameters: ~3 Billion per token (A3B)
  • Context Length: Up to 256k tokens

Because this is a sparse MoE model, it requires significantly less memory bandwidth and compute per token than a dense 30B model. A Q4_K_M variant will run on machines with 8GB to 16GB of system RAM, with the primary bottleneck being the initial model loading time.
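
As a minimal loading sketch, here is one way to open these files with the llama-cpp-python bindings (an assumption; any GGUF-compatible runtime works, and the parameters below are illustrative, not prescriptive):

```python
# Minimal loading sketch using llama-cpp-python (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(
    model_path="Nemotron-3-Nano-Omni-30B-A3B-Reasoning-Q4_K_M.gguf",
    n_ctx=8192,       # raise toward the 256k maximum only if RAM allows the KV cache
    n_gpu_layers=-1,  # offload all layers that fit to the GPU; set 0 for CPU-only
)
```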

📂 Available Quantizations

| Quantization | Bits / Weight | Use Case / Notes |
|---|---|---|
| Q8_0 | 8.5 | Extreme fidelity. Best if you have high RAM but limited VRAM. |
| Q6_K | 6.5 | Excellent balance for 16GB+ systems. Near-perfect F16 parity. |
| Q5_K_M | 5.5 | High quality, slightly faster inference than Q6. |
| Q4_K_M | 4.8 | [RECOMMENDED] The sweet spot for performance vs. intelligence. The FP16 head keeps reasoning intact. |
| Q4_K_S | 4.5 | Slightly smaller than K_M, minimal quality loss. |
| Q3_K_M | 3.5 | Maximum compression. Great for severely resource-constrained setups (8GB RAM). |
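
As a back-of-the-envelope check, approximate file sizes follow from the bits-per-weight column above (a rough sketch; real GGUF files also carry metadata, and the FP16 output head adds ~2.1 GB on top):

```python
# Rough on-disk size estimate: total parameters * bits-per-weight / 8.
PARAMS = 30e9  # 30B total parameters

for name, bpw in [("Q8_0", 8.5), ("Q6_K", 6.5), ("Q5_K_M", 5.5),
                  ("Q4_K_M", 4.8), ("Q4_K_S", 4.5), ("Q3_K_M", 3.5)]:
    print(f"{name}: ~{PARAMS * bpw / 8 / 1e9:.0f} GB")
```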

💬 Prompt Format (Reasoning Mode)

This model is trained to utilize a chain-of-thought reasoning budget: it natively emits a <think> block before generating its final response.

Chat Template Example:

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is 2+2?<|im_end|>
<|im_start|>assistant
<think>
1. The user is asking for a simple arithmetic operation.
2. The operation is addition: 2 + 2.
3. The result of 2 + 2 is 4.
</think>
The answer is 4.<|im_end|>
```
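
To run this template programmatically and separate the reasoning trace from the final answer, here is a sketch reusing the `llm` object from the loading example above (the regex parser is a simple illustration, not part of the model's API):

```python
# Send the example conversation through the chat API and split out the
# <think> block. Assumes `llm` from the loading sketch above.
import re

result = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 2+2?"},
    ],
)
text = result["choices"][0]["message"]["content"]

match = re.search(r"<think>(.*?)</think>\s*(.*)", text, re.DOTALL)
reasoning = match.group(1).strip() if match else ""
answer = match.group(2).strip() if match else text.strip()
print(answer)  # -> "The answer is 4."
```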