Reuben Multi Model Dataset Lab
This work was produced for the Uncharted Data Challenge hosted by Adaption Labs. Big credit to Adaptive Data by Adaption for putting the challenge together.
About the project
The goal here was to build an open source dataset that developers can actually do something cool with. I picked music, street view imagery, and documents in low-resource languages because these areas are badly underserved. Most of the existing data is private or locked behind paywalls, and I want open source LLMs to do better on this kind of content. That gap is what these datasets are trying to fill.
I scraped data from places like Free Music Archive and archive.org. For OCR I used vLLM running Gemma 4 31B, and for audio understanding I used Gemini. Once the raw data was collected I cut down the row counts to produce a curated version that I could hand off to Adaption Labs for further processing.
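The row-reduction step can be sketched roughly like this. The README does not say how rows were chosen, so this assumes plain uniform random sampling with a fixed seed; `curate_rows`, `max_rows`, and the `raw_rows` records are hypothetical names made up for illustration, not the actual pipeline.

```python
import random


def curate_rows(rows, max_rows, seed=0):
    """Return at most max_rows rows, sampled reproducibly, in original order.

    Assumption: uniform random sampling. The real curation may have used
    quality filters or per-language quotas instead.
    """
    if len(rows) <= max_rows:
        return list(rows)
    rng = random.Random(seed)  # fixed seed so the subset can be regenerated
    keep = sorted(rng.sample(range(len(rows)), max_rows))
    return [rows[i] for i in keep]


# Hypothetical raw rows standing in for a scraped dataset.
raw_rows = [{"id": i, "text": f"example {i}"} for i in range(1000)]
curated = curate_rows(raw_rows, max_rows=100, seed=42)
print(len(curated))  # 100
```

Sorting the sampled indices keeps the curated rows in the same order as the raw data, which makes the two versions easier to compare side by side.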
The datasets
Curated subsets handed to Adaption Labs (14 datasets)
These are the row-reduced versions provided to Adaption Labs. They live in the Proper Adaption collection.
| Dataset | Size |
|---|---|
| adaption-music-style-prompts | 9.95k |
| Adaption-video-qa-diverse-topics | 86 |
| Adaption-low-resource-doc-qa | 9.72k |
| Adaption-low-resource-audio | 3.7k |
| adaption-multilingual-image-captions | 462 |
| adaption-multilingual-doc-qa | 8.8k |
| current-affairs-2023 | 4.67k |
| current-affairs-2024 | 5.19k |
| current-affairs-2025 | 5.39k |
| current-affairs-2026 | 5.34k |
| frontend-html-tailwind-js | 145 |
| adaption-street-scene-descriptions | 10.1k |
| Adaption-multilingual-sentences | 10k |
| Adaption-multilingual-speech | 10.3k |
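The sizes in the table are rounded display values (`k` for thousand, `M` for million), so any total computed from them is approximate. As a small sketch, here is one way to turn them into row counts and add them up; `parse_size` and `CURATED_SIZES` are hypothetical helper names, not part of the project.

```python
from decimal import Decimal

# Multipliers for the suffixes used in the size columns.
SUFFIXES = {"k": 1_000, "M": 1_000_000}


def parse_size(text: str) -> int:
    """Convert a display size such as '9.95k' or '86' to an approximate row count."""
    text = text.strip()
    if text[-1] in SUFFIXES:
        # Decimal avoids float rounding surprises such as 9.95 * 1000.
        return int(Decimal(text[:-1]) * SUFFIXES[text[-1]])
    return int(text)


# Sizes copied from the curated table above, in the same order.
CURATED_SIZES = [
    "9.95k", "86", "9.72k", "3.7k", "462", "8.8k", "4.67k",
    "5.19k", "5.39k", "5.34k", "145", "10.1k", "10k", "10.3k",
]

curated_total = sum(parse_size(s) for s in CURATED_SIZES)
print(curated_total)  # 83853, i.e. roughly 83.9k rows across the 14 subsets
```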
Raw scraped datasets (10 datasets across 4 sub-collections)
These are the full datasets I scraped from the web. They live in ReubenDataLab and are split across four sub-collections.
Audio
| Dataset | Size |
|---|---|
| fma-labeled | 29.3k |
| multilingual-synthetic-tts | 68.7k |
| PolyglotAudio | 1.16M |
Text
| Dataset | Size |
|---|---|
| PolyglotText | 13.4M |
| current-affairs-2023 | 4.67k |
| current-affairs-2024 | 5.19k |
| current-affairs-2025 | 5.39k |
Images
| Dataset | Size |
|---|---|
| streetview-global | 10.2k |
| magazines-multilingual-vqa | 29k |
Coding
| Dataset | Size |
|---|---|
| frontend-coding | 87 |
Where to find everything
The full raw datasets live on ReubenDataLab grouped into four collections. The Adaption versions are at Proper Adaption.
For an interactive view of the graphs, head over to https://reubencf-dataset-explorer.static.hf.space/index.html
Visualizations
Over two weeks I managed to scrape data across more than 130 languages. The treemap shows the top 90 of those.
This chart compares the original raw dataset with the version provided to adaptionlabs.ai after curation.
And here is the modality split of the dataset.
Get in touch
- Hugging Face: @Reubencf
- Datasets home: ReubenDataLab



