Nemotron-3-Nano-Omni-30B-A3B-Reasoning - GGUF
This repository contains high-fidelity GGUF quantizations of NVIDIA's Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16.
These quantizations were built to preserve logic, reasoning, and narrative consistency in local deployments, which makes them well suited to text-based RPG engines and structured JSON output.
🧠 High-Fidelity Quantization Strategy: FP16 Output Head
Unlike standard GGUF conversions, these models were quantized using the --leave-output-tensor flag.
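As a minimal sketch of what such a quantization run might look like, the snippet below assembles a command line for llama.cpp's llama-quantize tool, which is where the --leave-output-tensor option lives. The file names and the helper function are hypothetical and only illustrate the argument order; the exact commands used for this repository are not stated here.

```python
def build_quantize_command(src: str, dst: str, quant_type: str,
                           binary: str = "llama-quantize") -> list[str]:
    """Return an argv list that asks llama-quantize to skip output.weight."""
    # Options come first, then the positional arguments: input, output, type.
    return [binary, "--leave-output-tensor", src, dst, quant_type]


# Hypothetical file names, for illustration only.
cmd = build_quantize_command("model-f16.gguf", "model-Q4_K_M.gguf", "Q4_K_M")
print(" ".join(cmd))
# To actually run it (assuming the binary is on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than one shell string avoids quoting problems if a path contains spaces.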
What does this mean?
The final projection layer (lm_head / output.weight), which maps the model's internal states to the vocabulary, is kept in FP16 precision (~2.1 GB) instead of being quantized. This slightly increases the overall file size and the initial load time from disk, but it avoids the quantization error ("numerical noise") that is introduced when the output head is compressed to 4-bit or 5-bit.
The Result: Smaller quantizations (like Q3_K_M or Q4_K_M) are intended to retain more of the logic, tool-calling precision, and chain-of-thought (<think>) behavior of the uncompressed model while still fitting within local memory constraints.
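To give a feel for the "numerical noise" mentioned above, here is a toy round-trip experiment. It uses a simple symmetric per-block scheme, not the actual K-quant formats used in GGUF files, and the weights are random numbers rather than real model tensors, so treat it only as an illustration of how error grows as the bit width shrinks.

```python
import random


def quantize_roundtrip(values, bits, block_size=32):
    """Toy symmetric per-block quantization followed by dequantization."""
    levels = 2 ** (bits - 1) - 1  # e.g. 7 representable steps per sign at 4-bit
    restored = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        # One scale per block; fall back to 1.0 if the block is all zeros.
        scale = (max(abs(v) for v in block) / levels) or 1.0
        restored.extend(round(v / scale) * scale for v in block)
    return restored


def rms_error(original, restored):
    """Root-mean-square difference between two equally long sequences."""
    squared = sum((x - y) ** 2 for x, y in zip(original, restored))
    return (squared / len(original)) ** 0.5


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]  # synthetic weights

err4 = rms_error(weights, quantize_roundtrip(weights, bits=4))
err8 = rms_error(weights, quantize_roundtrip(weights, bits=8))
print(f"4-bit RMS error: {err4:.6f}")
print(f"8-bit RMS error: {err8:.6f}")
```

In this toy setup the 4-bit error is noticeably larger than the 8-bit error, which is the effect that leaving the output head unquantized is meant to sidestep for that one tensor.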
⚙️ Model Architecture & Hardware Requirements
- Architecture: Mamba2-Transformer Hybrid Mixture of Experts (MoE)
- Parameters: 30 Billion total
- Active Parameters: ~3 Billion per token (A3B)
- Context Length: Up to 256k tokens
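Because this is a sparse MoE model, each token touches only about 3B of the 30B parameters, so it needs significantly less memory bandwidth and compute than a dense 30B model. Sparsity does not shrink the weights themselves, however: at roughly 4.8 bits per weight, a Q4_K_M file is on the order of 18 GB before the FP16 head is counted. Machines with 8GB to 16GB of system RAM therefore depend on the runtime memory-mapping the file from disk (as llama.cpp does by default) rather than holding the whole model in RAM, and on such systems the primary bottlenecks will be the initial model loading time and disk speed.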
📂 Available Quantizations
| Quantization | Bits / Weight | Use Case / Notes |
|---|---|---|
| Q8_0 | 8.5 | Extreme fidelity. Best if you have high RAM but limited VRAM. |
| Q6_K | 6.5 | Excellent balance for 16GB+ systems. Near-perfect F16 parity. |
| Q5_K_M | 5.5 | High quality, slightly faster inference than Q6. |
| Q4_K_M | 4.8 | [RECOMMENDED] The sweet spot for performance vs. intelligence. FP16 head ensures reasoning stays intact. |
| Q4_K_S | 4.5 | Slightly smaller than K_M, minimal quality loss. |
| Q3_K_M | 3.5 | Maximum compression. Great for severely resource-constrained setups (8GB RAM). |
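For a rough sense of scale, the bits-per-weight column can be turned into an approximate weight size with simple arithmetic: parameters × bits ÷ 8. The sketch below assumes the 30 billion total parameter count listed earlier and ignores the FP16 output head, file metadata, the per-tensor mix inside each quantization type, and runtime memory such as the KV cache, so the real files will differ.

```python
TOTAL_PARAMS = 30e9  # total parameter count stated in this model card

# Bits per weight as listed in the table above.
BITS_PER_WEIGHT = {
    "Q8_0": 8.5,
    "Q6_K": 6.5,
    "Q5_K_M": 5.5,
    "Q4_K_M": 4.8,
    "Q4_K_S": 4.5,
    "Q3_K_M": 3.5,
}


def estimated_size_gb(bits_per_weight, params=TOTAL_PARAMS):
    """Approximate weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


for name, bpw in BITS_PER_WEIGHT.items():
    print(f"{name:7s} ~{estimated_size_gb(bpw):.1f} GB")
```

By this estimate even the smallest variant is larger than 8 GB, so low-RAM systems will be paging weights from disk during inference rather than keeping everything resident.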
💬 Prompt Format (Reasoning Mode)
This model is trained to use a chain-of-thought reasoning budget. It writes its reasoning inside <think> ... </think> tags before generating its final response.
Chat Template Example:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is 2+2?<|im_end|>
<|im_start|>assistant
<think>
1. The user is asking for a simple arithmetic operation.
2. The operation is addition: 2 + 2.
3. The result of 2 + 2 is 4.
</think>
The answer is 4.<|im_end|>
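A client usually needs to do two things with this format: assemble the prompt up to the open assistant turn, and separate the reasoning from the final answer in what comes back. The following is a minimal sketch based only on the template shown above; the function names are hypothetical, and a real application would more likely rely on the chat template embedded in the GGUF file.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
END_TOKEN = "<|im_end|>"


def build_prompt(system: str, user: str) -> str:
    """Format one system + user exchange and leave the assistant turn open."""
    return (
        f"<|im_start|>system\n{system}{END_TOKEN}\n"
        f"<|im_start|>user\n{user}{END_TOKEN}\n"
        f"<|im_start|>assistant\n"
    )


def split_reasoning(completion: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no <think> block exists."""
    match = THINK_RE.search(completion)
    if match is None:
        return "", completion.replace(END_TOKEN, "").strip()
    reasoning = match.group(1).strip()
    # Everything after the closing tag is the user-facing answer.
    answer = completion[match.end():].replace(END_TOKEN, "").strip()
    return reasoning, answer


prompt = build_prompt("You are a helpful assistant.", "What is 2+2?")
sample = "<think>\nThe result of 2 + 2 is 4.\n</think>\nThe answer is 4.<|im_end|>"
reasoning, answer = split_reasoning(sample)
print(answer)
```

Keeping the reasoning separate makes it easy to hide it from end users, or to log it for debugging, while passing only the answer to downstream code such as a JSON parser.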