⚠️ Research Proof of Concept — NOT Production Ready

This is a research artifact demonstrating Multi-Head Latent Attention (MLA) in a multimodal vision-language model (VLM) with fewer than 1B parameters. The architecture is novel and has been validated, but output quality is not production-grade.

What Actually Works

Component	Status	Evidence
Architecture	✅ Validated	245.7M params; MLA reduces the KV cache by 38.9%
Text understanding	✅ Functional	Forward pass produces coherent logits
Image generation pipeline	⚠️ Runs but quality poor	CLIP score 0.11 (random ~0.15, good ~0.30)
VQA / Vision understanding	❌ Not trained	No VQA instruction tuning performed
TTS mel generation	⚠️ Architecture sound	MSE ~2.0, trained on only 8 samples
ASR (Audio → Text)	❌ Broken	100% CER — model never learned ASR
Speed	✅ Fast	17,140 tok/s autoregressive (AR); 0.30 s for 50 flow steps
Memory	✅ Efficient	510 MB peak VRAM, 109 MB NF4 quantized
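
For context on the ASR row: character error rate (CER) is the character-level edit distance between the model output and the reference transcript, divided by the reference length. A minimal sketch in plain Python follows; the function names are illustrative and are not part of the Toolkit.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of hypothesis `hyp` against reference `ref`."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this definition an empty transcript scores exactly 100%, and a hypothesis longer than the reference can exceed 100%.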

Production Readiness: What's Missing & Cost

To make image generation good:

Gap	What's needed	Estimated cost
Training data	LAION-5B or COYO-700M (50M+ image-text pairs)	$0 (download)
Training steps	20,000–100,000 (vs our 2,000)	$500–$5,000 on A100
Latent normalization	Fix VAE scaling in generation code	$0 (code fix)
CFG + null conditioning	Add classifier-free guidance training	$100–$500
Total for good images		$600–$5,500
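
The CFG row refers to classifier-free guidance: during training the text condition is randomly swapped for a null condition, and at sampling time the conditional and unconditional predictions are blended. A minimal sketch, assuming the flow model predicts a velocity vector; `drop_prob`, `guidance_scale`, and both helper names are hypothetical and not taken from the Toolkit.

```python
import random
from typing import List, Optional, Sequence


def maybe_drop_condition(
    cond: Sequence[float],
    null_cond: Sequence[float],
    drop_prob: float = 0.1,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Training side: replace the condition with the null one with probability drop_prob."""
    rng = rng or random.Random()
    return list(null_cond) if rng.random() < drop_prob else list(cond)


def cfg_combine(
    v_cond: Sequence[float],
    v_uncond: Sequence[float],
    guidance_scale: float,
) -> List[float]:
    """Sampling side: v = v_uncond + scale * (v_cond - v_uncond)."""
    return [u + guidance_scale * (c - u) for c, u in zip(v_cond, v_uncond)]
```

With `guidance_scale=1.0` the result equals the conditional prediction, and with `0.0` it equals the unconditional one; values above 1 push the sample further toward the prompt.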

To make VQA work:

Gap	What's needed	Estimated cost
Instruction data	ChartQA train (28K) + VQAv2 (200K) + DocVQA	$0 (download)
Training	2–5 epochs VQA instruction tuning	$5–$20 on L4
Evaluation	VQAv2, TextVQA, ChartQA benchmarks	$0 (CPU after training)
Total for working VQA		$5–$20

To make ASR work (native):

Gap	What's needed	Estimated cost
Speech encoder	Whisper-small encoder (frozen, 87M)	$0 (download)
Projector	Linear/MLP adapter (~20M trainable)	$0 (code)
Data	LibriSpeech-960 (960h) + optionally GigaSpeech (10Kh)	$0 (download)
Training	5–10 epochs, projector only, LLM frozen	$10–$50 on A100
Total for native ASR		$10–$50

Easier alternative (already available):

Use the Moonshine ASR integration in the Toolkit: 27M params, 3.5% WER, 0.1 s latency. No training needed.

What This Checkpoint Actually Does

Capability	Status	Quality	Notes
Text → Text	✅ Functional	Coherent	SVD-initialized from SmolVLM
Image → Text understanding	⚠️ Pipeline runs	Untested	Needs VQA instruction tuning
Text → Image	⚠️ Pipeline runs	Poor (CLIP 0.11)	Architecture valid, needs scale
Text → Mel	⚠️ Architecture sound	Undertrained	Only 8 samples in training
Audio → Text	❌ Broken	N/A	Deleted, use Moonshine instead

📁 Files

File	Size	Purpose
`stage4_v2/final/model.pt`	938 MB	Full checkpoint (text + image gen + TTS mel)
`model_nf4.safetensors`	109 MB	4-bit quantized (recommended for downloading)
`*.onnx`	~1.3 GB	ONNX exports for understanding + generation

📥 Download

Install the Toolkit:

pip install git+https://huggingface.co/TinmanLabSL/SmolOmni-MLA-Toolkit

Then load the checkpoint in Python:

import torch
from smolomni import SmolOmni
from transformers import AutoTokenizer

model = SmolOmni.from_hub(
    "TinmanLabSL/SmolOmni-MLA-256M",
    checkpoint="stage4_v2/final/model.pt",
    config="mla-hybrid-ar-flow-256M",
    device="cuda",
    dtype=torch.bfloat16,
)

# The tokenizer is reused from the base model.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolVLM-256M-Instruct")

🏗️ Architecture

Base: SmolVLM-256M-Instruct
GQA layers: 0–9 (preserves vision knowledge)
MLA layers: 10–29 (KV cache compression)
NoPE: every 4th layer
SVD init: 294 of 464 weights copied from pretrained GQA via X-EcoMLA
Parameters: 245.7M total

📊 Benchmarks (Verified on L4)

Metric	Value
KV Cache / token	7,040 floats (-38.9%)
AR Throughput	17,140 tok/s
Peak VRAM	510 MB
Image Gen (50 steps)	0.30 s
NF4 Size	109 MB
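
The KV cache figure can be reproduced from the layer split in the Architecture section. The sketch below assumes the base decoder's GQA shape (3 KV heads of dimension 64, taken from the SmolLM2-135M backbone config and not from this card) and infers a 160-float cached latent per MLA layer, which is the value that makes the totals match. Neither number is stated by the authors.

```python
GQA_LAYERS = 10      # layers 0-9, per the Architecture section
MLA_LAYERS = 20      # layers 10-29, per the Architecture section
KV_HEADS = 3         # assumption: base decoder config
HEAD_DIM = 64        # assumption: base decoder config
MLA_CACHE_DIM = 160  # inferred: cached floats per token per MLA layer


def kv_floats_per_token(gqa_layers: int, mla_layers: int) -> int:
    """Floats cached per token: K and V for GQA layers, one latent for MLA layers."""
    gqa_per_layer = 2 * KV_HEADS * HEAD_DIM  # keys + values
    return gqa_layers * gqa_per_layer + mla_layers * MLA_CACHE_DIM


baseline = kv_floats_per_token(GQA_LAYERS + MLA_LAYERS, 0)  # all-GQA model
hybrid = kv_floats_per_token(GQA_LAYERS, MLA_LAYERS)        # this model
reduction = 1 - hybrid / baseline

print(f"baseline: {baseline} floats/token")
print(f"hybrid:   {hybrid} floats/token ({-reduction:.1%})")
```

Under these assumptions the all-GQA baseline is 11,520 floats per token and the hybrid is 7,040, which matches the table. At bfloat16 (2 bytes per value) that is about 14 KB of cache per token.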

🔗 Links

Toolkit (code, Moonshine integration, training scripts): TinmanLabSL/SmolOmni-MLA-Toolkit
500M Model: TinmanLabSL/SmolOmni-MLA-500M
Base: SmolVLM-256M-Instruct

📝 Citation

@software{smolomni256m2025,
  title = {Tinman-SmolOmni-MLA-256M: Research POC — MLA in Multimodal VLM},
  author = {TinmanLabSL},
  year = {2025},
  url = {https://huggingface.co/TinmanLabSL/SmolOmni-MLA-256M},
  note = {Research proof of concept. Not production ready.}
}

License

Apache 2.0
