Reuben Multi Model Dataset Lab

Community Article Published April 25, 2026

Reubensdataset

This work was produced for the Uncharted Data Challenge hosted by Adaption Labs. Big credit to Adaptive Data by Adaption for putting the challenge together.

About the project

The goal here was to build open-source datasets that developers can actually do something cool with. I picked music, street-view imagery, and documents in low-resource languages because these areas are badly underserved: most of the existing data is private or locked behind paywalls, and I want open-source LLMs to do better on this kind of content. That gap is what these datasets are trying to fill.

I scraped data from sources such as Free Music Archive and archive.org. For OCR I ran Gemma 4 31B on vLLM, and for audio understanding I used Gemini. Once the raw data was collected, I cut down the row counts to produce a curated version that I could hand off to Adaption Labs for further processing.
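The post does not say how the row counts were cut down. As a minimal sketch, assuming a seeded random sample with a cap per language (the `cap_rows_per_group` helper and the `lang` field are hypothetical, not taken from the actual pipeline), the reduction step could look like this:

```python
import random
from collections import defaultdict


def cap_rows_per_group(rows, key, max_per_group, seed=0):
    """Keep at most `max_per_group` rows for each value of `key`.

    Hypothetical helper: the sampling strategy is an assumption,
    not the method used for the published datasets.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    kept = []
    for group_key in sorted(groups):  # sorted for a stable output order
        group = groups[group_key]
        if len(group) > max_per_group:
            group = rng.sample(group, max_per_group)
        kept.extend(group)
    return kept


# Toy data standing in for scraped rows.
raw_rows = (
    [{"lang": "sw", "text": f"sw-{i}"} for i in range(50)]
    + [{"lang": "yo", "text": f"yo-{i}"} for i in range(3)]
)
curated = cap_rows_per_group(raw_rows, key="lang", max_per_group=10)
print(len(curated))  # 13: 10 Swahili rows plus all 3 Yoruba rows
```

Capping per group rather than sampling globally leaves small languages intact, which seems like the right trade-off when low-resource coverage is the point.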

The datasets

Curated subsets handed to Adaption Labs (14 datasets)

These are the row-reduced versions provided to Adaption Labs. They live in the Proper Adaption collection.

Dataset Size (rows)
adaption-music-style-prompts 9.95k
Adaption-video-qa-diverse-topics 86
Adaption-low-resource-doc-qa 9.72k
Adaption-low-resource-audio 3.7k
adaption-multilingual-image-captions 462
adaption-multilingual-doc-qa 8.8k
current-affairs-2023 4.67k
current-affairs-2024 5.19k
current-affairs-2025 5.39k
current-affairs-2026 5.34k
frontend-html-tailwind-js 145
adaption-street-scene-descriptions 10.1k
Adaption-multilingual-sentences 10k
Adaption-multilingual-speech 10.3k

Raw scraped datasets (10 datasets across 4 sub-collections)

These are the full datasets I scraped from the web. They live in ReubenDataLab and are split across four sub-collections.

Audio

Dataset Size (rows)
fma-labeled 29.3k
multilingual-synthetic-tts 68.7k
PolyglotAudio 1.16M

Text

Dataset Size (rows)
PolyglotText 13.4M
current-affairs-2023 4.67k
current-affairs-2024 5.19k
current-affairs-2025 5.39k

Images

Dataset Size (rows)
streetview-global 10.2k
magazines-multilingual-vqa 29k

Coding

Dataset Size (rows)
frontend-coding 87

Where to find everything

The full raw datasets live in ReubenDataLab, grouped into four sub-collections. The curated Adaption versions are in the Proper Adaption collection.

For an interactive view of the graphs, head over to https://reubencf-dataset-explorer.static.hf.space/index.html

Visualizations

[Figure: language treemap]

Over two weeks I managed to scrape data across more than 130 languages. The treemap above shows the top 90 of those.
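A treemap like this only needs a row count per language. Here is a minimal sketch of that tally, assuming each row carries a language code in a hypothetical `lang` field (the real column names may differ):

```python
from collections import Counter


def top_languages(rows, n, key="lang"):
    """Return the `n` most frequent languages as (language, count) pairs."""
    counts = Counter(row[key] for row in rows)
    return counts.most_common(n)


# Toy rows; the real data covers more than 130 languages.
rows = [{"lang": "sw"}] * 5 + [{"lang": "yo"}] * 3 + [{"lang": "am"}] * 1
print(top_languages(rows, n=2))  # [('sw', 5), ('yo', 3)]
```

Calling `top_languages(rows, n=90)` on the full data would give the kind of list a top-90 treemap is drawn from.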

[Figure: raw vs. Adaption]

This chart compares the original raw datasets with the versions provided to Adaption Labs (adaptionlabs.ai) after curation.

[Figure: modality split]

And here is the modality split of the dataset.
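The split can be approximated from the sizes listed in the raw dataset tables. This sketch assumes the split is measured in rows and uses the rounded figures from this post, so the totals are approximate. `parse_size` is a hypothetical helper.

```python
def parse_size(text):
    """Convert a size string such as '9.95k' or '1.16M' to a row count."""
    multipliers = {"k": 1_000, "M": 1_000_000}
    suffix = text[-1]
    if suffix in multipliers:
        return round(float(text[:-1]) * multipliers[suffix])
    return int(text)


# Sizes copied from the raw dataset tables earlier in this post.
raw_sizes = {
    "Audio": ["29.3k", "68.7k", "1.16M"],
    "Text": ["13.4M", "4.67k", "5.19k", "5.39k"],
    "Images": ["10.2k", "29k"],
    "Coding": ["87"],
}

totals = {m: sum(parse_size(s) for s in sizes) for m, sizes in raw_sizes.items()}
grand_total = sum(totals.values())

for modality, rows in totals.items():
    print(f"{modality}: {rows:,} rows ({rows / grand_total:.1%})")
```

On these figures, text accounts for roughly 91% of the raw rows, almost all of it from PolyglotText.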

Get in touch
