-
MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels
Paper • 2405.07526 • Published • 21 -
Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach
Paper • 2405.15613 • Published • 17 -
A Touch, Vision, and Language Dataset for Multimodal Alignment
Paper • 2402.13232 • Published • 17 -
How Do Large Language Models Acquire Factual Knowledge During Pretraining?
Paper • 2406.11813 • Published • 31
Collections
Discover the best community collections!
Collections including paper arxiv:2507.04009
-
Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents
Paper • 2507.04009 • Published • 55 -
FineVision: Open Data Is All You Need
Paper • 2510.17269 • Published • 81 -
Kronos: A Foundation Model for the Language of Financial Markets
Paper • 2508.02739 • Published • 35 -
VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images
Paper • 2604.09531 • Published • 8
-
A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code
Paper • 2508.18106 • Published • 350 -
HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems
Paper • 2411.02959 • Published • 71 -
Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents
Paper • 2507.04009 • Published • 55 -
MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm
Paper • 2506.05218 • Published • 3
-
SAM 3: Segment Anything with Concepts
Paper • 2511.16719 • Published • 137 -
Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance
Paper • 2512.08765 • Published • 134 -
Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length
Paper • 2512.04677 • Published • 178 -
LongCat-Image Technical Report
Paper • 2512.07584 • Published • 25
-
Step-Audio-R1 Technical Report
Paper • 2511.15848 • Published • 59 -
OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation
Paper • 2410.17799 • Published • 13 -
Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents
Paper • 2507.04009 • Published • 55
-
Paper2Agent: Reimagining Research Papers As Interactive and Reliable AI Agents
Paper • 2509.06917 • Published • 44 -
Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents
Paper • 2507.04009 • Published • 55 -
WebDancer: Towards Autonomous Information Seeking Agency
Paper • 2505.22648 • Published • 33
-
DeepSeek-R1 Thoughtology: Let's <think> about LLM Reasoning
Paper • 2504.07128 • Published • 87 -
BM25S: Orders of magnitude faster lexical search via eager sparse scoring
Paper • 2407.03618 • Published • 14 -
Deep Think with Confidence
Paper • 2508.15260 • Published • 91 -
R-Zero: Self-Evolving Reasoning LLM from Zero Data
Paper • 2508.05004 • Published • 132