Towards Pixel-Level VLM Perception via Simple Points Prediction Paper • 2601.19228 • Published Jan 27
Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision Paper • 2601.19798 • Published Jan 27
OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models Paper • 2601.21639 • Published Jan 29
PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing Paper • 2601.21957 • Published Jan 29
CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding Paper • 2602.01785 • Published Feb 2
WorldVQA: Measuring Atomic World Knowledge in Multimodal Large Language Models Paper • 2602.02537 • Published Jan 28
SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild? Paper • 2602.03916 • Published Feb 3
EvoCUA: Evolving Computer Use Agents via Learning from Scalable Synthetic Experience Paper • 2601.15876 • Published Jan 22
MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods Paper • 2601.21821 • Published Jan 29
P1-VL: Bridging Visual Perception and Scientific Reasoning in Physics Olympiads Paper • 2602.09443 • Published Feb 10
MolmoSpaces: A Large-Scale Open Ecosystem for Robot Navigation and Manipulation Paper • 2602.11337 • Published Feb
Code2World: A GUI World Model via Renderable Code Generation Paper • 2602.09856 • Published Feb 10