olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models Paper • 2502.18443 • Published Feb 25, 2025 • 12
OlmoEarth: Stable Latent Image Modeling for Multimodal Earth Observation Paper • 2511.13655 • Published Nov 17, 2025 • 12
Bolmo: Byteifying the Next Generation of Language Models Paper • 2512.15586 • Published Dec 17, 2025 • 18
How2Everything: Mining the Web for How-To Procedures to Evaluate and Improve LLMs Paper • 2602.08808 • Published Feb 9 • 10
Unified Spatio-Temporal Token Scoring for Efficient Video VLMs Paper • 2603.18004 • Published Mar 18 • 14
SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion Paper • 2503.11576 • Published Mar 14, 2025 • 159
Hy-MT2: A Family of Fast, Efficient and Powerful Multilingual Translation Models in the Wild Paper • 2605.22064 • Published 2 days ago • 3
Moshi: a speech-text foundation model for real-time dialogue Paper • 2410.00037 • Published Sep 17, 2024 • 17
DINOv3 Collection DINOv3: foundation models producing excellent dense features, outperforming SotA w/o fine-tuning - https://arxiv.org/abs/2508.10104 • 15 items • Updated Mar 10 • 645