Transformers
ONNX
English
mla-attention
multi-head-latent-attention
flow-matching
rectified-flow
on-device
efficient-attention
smol-scale
research
proof-of-concept
⚠️ Research Proof of Concept: NOT Production Ready
This is a research artifact demonstrating Multi-Head Latent Attention (MLA) in a multimodal VLM below 1B parameters. The architecture is novel and validated. The quality is not production-grade.
What Actually Works
| Component | Status | Evidence |
|---|---|---|
| Architecture | ✅ Validated | 245.7M params; MLA reduces the KV cache by 38.9% |
| Text understanding | ✅ Functional | Forward pass produces coherent logits |
| Image generation pipeline | ⚠️ Runs but quality poor | CLIP score 0.11 (random ~0.15, good ~0.30) |
| VQA / Vision understanding | ❌ Not trained | No VQA instruction tuning performed |
| TTS mel generation | ⚠️ Architecture sound | MSE ~2.0, trained on only 8 samples |
| ASR (Audio → Text) | ❌ Broken | 100% CER; the model never learned ASR |
| Speed | ✅ Fast | 17,140 tok/s autoregressive (AR); 0.30 s per 50 flow steps |
| Memory | ✅ Efficient | 510 MB peak VRAM, 109 MB NF4 quantized |
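The ASR row is reported as character error rate (CER). For readers who want to reproduce that kind of number, here is a minimal sketch of the usual definition: character-level edit distance divided by reference length. The function names are illustrative and are not part of the Toolkit.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and substitutions
    needed to turn `hyp` into `ref`."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # character missing from the hypothesis
                curr[j - 1] + 1,     # extra character in the hypothesis
                prev[j - 1] + cost,  # substitution (or match when cost == 0)
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return levenshtein(ref, hyp) / len(ref)
```

Under this definition an empty hypothesis scores exactly 1.0, and CER can exceed 1.0 when the hypothesis is much longer than the reference.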
Production Readiness: What's Missing & Cost
To make image generation good:
| Gap | What's needed | Estimated cost |
|---|---|---|
| Training data | LAION-5B or COYO-700M (50M+ image-text pairs) | $0 (download) |
| Training steps | 20,000–100,000 (vs. our 2,000) | $500–$5,000 on A100 |
| Latent normalization | Fix VAE scaling in generation code | $0 (code fix) |
| CFG + null conditioning | Add classifier-free guidance training | $100–$500 |
| Total for good images | | $600–$5,500 |
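To make the "CFG + null conditioning" row concrete, here is a minimal sketch of classifier-free guidance inside a fixed-step Euler sampler for a rectified-flow model. It is written under several assumptions: the sampler integrates from t=0 (noise) to t=1 (data), the velocity function takes `cond=None` for the null condition, and `toy_velocity` is a stand-in for the trained network. None of these names come from the Toolkit, and the actual sampler may differ.

```python
from typing import Callable, List, Optional

Vector = List[float]
# Hypothetical interface: velocity(x, t, cond) returns dx/dt.
# cond=None stands for the null (unconditional) input.
VelocityFn = Callable[[Vector, float, Optional[Vector]], Vector]


def euler_sample(
    velocity: VelocityFn,
    x0: Vector,
    cond: Vector,
    steps: int = 50,
    guidance_scale: float = 1.0,
) -> Vector:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler,
    mixing conditional and null predictions at every step."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v_cond = velocity(x, t, cond)
        v_null = velocity(x, t, None)
        # Classifier-free guidance: extrapolate from the null prediction
        # towards the conditional one.
        v = [u + guidance_scale * (c - u) for c, u in zip(v_cond, v_null)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x: Vector, t: float, cond: Optional[Vector]) -> Vector:
    """Stand-in for a trained model: the exact straight-line velocity
    towards `cond` (or towards the origin for the null condition)."""
    target = cond if cond is not None else [0.0] * len(x)
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]


sample = euler_sample(toy_velocity, [0.3, -0.7], cond=[1.0, -1.0], steps=50)
```

With `guidance_scale=1.0` the null prediction cancels out and the toy sampler lands on `cond`. Larger scales push the result further along the conditional direction, which only helps if the model was trained with the condition randomly dropped, as the table row suggests.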
To make VQA work:
| Gap | What's needed | Estimated cost |
|---|---|---|
| Instruction data | ChartQA train (28K) + VQAv2 (200K) + DocVQA | $0 (download) |
| Training | 2–5 epochs VQA instruction tuning | $5–$20 on L4 |
| Evaluation | VQAv2, TextVQA, ChartQA benchmarks | $0 (CPU after training) |
| Total for working VQA | | $5–$20 |
To make ASR work (native):
| Gap | What's needed | Estimated cost |
|---|---|---|
| Speech encoder | Whisper-small encoder (frozen, 87M) | $0 (download) |
| Projector | Linear/MLP adapter (~20M trainable) | $0 (code) |
| Data | LibriSpeech-960 (960h) + optionally GigaSpeech (10Kh) | $0 (download) |
| Training | 5–10 epochs, projector only, LLM frozen | $10–$50 on A100 |
| Total for native ASR | | $10–$50 |
Easier alternative (already available):
Use the Moonshine ASR integration in the Toolkit: 27M params, 3.5% WER, 0.1 s latency. No training needed.
What This Checkpoint Actually Does
| Capability | Status | Quality | Notes |
|---|---|---|---|
| Text → Text | ✅ Functional | Coherent | SVD-initialized from SmolVLM |
| Image → Text understanding | ⚠️ Pipeline runs | Untested | Needs VQA instruction tuning |
| Text → Image | ⚠️ Pipeline runs | Poor (CLIP 0.11) | Architecture valid, needs scale |
| Text → Mel | ⚠️ Architecture sound | Undertrained | Only 8 samples in training |
| Audio → Text | ❌ Broken | N/A | Deleted, use Moonshine instead |
Files
| File | Size | Purpose |
|---|---|---|
| `stage4_v2/final/model.pt` | 938 MB | Full checkpoint (text + image gen + TTS mel) |
| `model_nf4.safetensors` | 109 MB | 4-bit quantized (recommended for downloading) |
| `*.onnx` | ~1.3 GB | ONNX exports for understanding + generation |
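As a rough sanity check on these sizes, the sketch below estimates raw weight storage from the 245.7M parameter count at a given precision. It assumes the full checkpoint stores 32-bit floats, which the card does not state, and it ignores file metadata and any non-weight tensors.

```python
PARAMS = 245.7e6  # total parameter count reported on this card


def size_bytes(num_params: float, bits_per_param: int) -> float:
    """Raw weight storage, ignoring metadata and buffers."""
    return num_params * bits_per_param / 8


def to_mb(num_bytes: float) -> float:
    return num_bytes / 1e6    # decimal megabytes


def to_mib(num_bytes: float) -> float:
    return num_bytes / 2**20  # binary mebibytes


fp32 = size_bytes(PARAMS, 32)  # assumed precision of model.pt
bf16 = size_bytes(PARAMS, 16)  # precision used when loading with torch.bfloat16

print(f"fp32: {to_mb(fp32):.1f} MB ({to_mib(fp32):.1f} MiB)")
print(f"bf16: {to_mb(bf16):.1f} MB ({to_mib(bf16):.1f} MiB)")
```

Under these assumptions the fp32 estimate comes to about 937 MiB, in line with the 938 MB listed for `model.pt` if that figure is in binary units. The bf16 estimate of about 491 MB sits just under the reported 510 MB peak VRAM.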
Download
```shell
pip install git+https://huggingface.co/TinmanLabSL/SmolOmni-MLA-Toolkit
```
```python
import torch
from smolomni import SmolOmni
from transformers import AutoTokenizer

# Load the full checkpoint (text + image gen + TTS mel) from the Hub.
model = SmolOmni.from_hub(
    "TinmanLabSL/SmolOmni-MLA-256M",
    checkpoint="stage4_v2/final/model.pt",
    config="mla-hybrid-ar-flow-256M",
    device="cuda",
    dtype=torch.bfloat16,
)

# The tokenizer is taken from the base model.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolVLM-256M-Instruct")
```
Architecture
- Base: SmolVLM-256M-Instruct
- GQA layers: 0–9 (preserves vision knowledge)
- MLA layers: 10–29 (KV cache compression)
- NoPE: every 4th layer
- SVD init: 294 of 464 weights copied from pretrained GQA via X-EcoMLA
- Parameters: 245.7M total
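The layer split above is enough to reconstruct the KV-cache figure reported in the benchmarks. The sketch below is a back-of-the-envelope check, not the Toolkit's accounting. It assumes the text backbone uses 3 KV heads with head dimension 64 (values recalled from the SmolLM2-135M configuration, not stated on this card) and that each MLA layer caches 160 floats per token. That last number is back-solved so the total matches 7,040, so treat it as hypothetical.

```python
NUM_LAYERS = 30
GQA_LAYERS = range(0, 10)   # layers 0-9 keep grouped-query attention
MLA_LAYERS = range(10, 30)  # layers 10-29 use multi-head latent attention

# Assumptions, not taken from this card:
NUM_KV_HEADS = 3         # assumed KV heads in the GQA backbone
HEAD_DIM = 64            # assumed per-head dimension
MLA_CACHE_PER_LAYER = 160  # hypothetical latent + positional cache per token


def gqa_cache_per_layer() -> int:
    """Floats cached per token in one GQA layer: keys plus values."""
    return 2 * NUM_KV_HEADS * HEAD_DIM


def baseline_cache() -> int:
    """All layers GQA, as in the base model."""
    return NUM_LAYERS * gqa_cache_per_layer()


def hybrid_cache() -> int:
    """GQA for the first block of layers, MLA for the rest."""
    return (len(GQA_LAYERS) * gqa_cache_per_layer()
            + len(MLA_LAYERS) * MLA_CACHE_PER_LAYER)


reduction = 1 - hybrid_cache() / baseline_cache()
print(f"baseline: {baseline_cache()} floats/token")
print(f"hybrid:   {hybrid_cache()} floats/token ({reduction:.1%} smaller)")
```

Under these assumptions the all-GQA baseline is 11,520 floats per token and the hybrid layout is 7,040, which reproduces the 38.9% reduction quoted on this card.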
Benchmarks (Verified on L4)
| Metric | Value |
|---|---|
| KV Cache / token | 7,040 floats (-38.9%) |
| AR Throughput | 17,140 tok/s |
| Peak VRAM | 510 MB |
| Image Gen (50 steps) | 0.30 s |
| NF4 Size | 109 MB |
Links
- Toolkit (code, Moonshine integration, training scripts): TinmanLabSL/SmolOmni-MLA-Toolkit
- 500M Model: TinmanLabSL/SmolOmni-MLA-500M
- Base: SmolVLM-256M-Instruct
Citation
@software{smolomni256m2025,
title = {Tinman-SmolOmni-MLA-256M: Research POC - MLA in Multimodal VLM},
author = {TinmanLabSL},
year = {2025},
url = {https://huggingface.co/TinmanLabSL/SmolOmni-MLA-256M},
note = {Research proof of concept. Not production ready.}
}
License
Apache 2.0