Reuben Multi Model Dataset Lab

Community Article Published April 25, 2026

This work was produced for the Uncharted Data Challenge hosted by Adaption Labs. Big credit to Adaptive Data by Adaption for putting the challenge together.

About the project

The goal here was to build open-source datasets that developers can actually do something cool with. I picked music, street view imagery, and documents in low-resource languages because these areas are badly underserved. Most of the existing data is private or locked behind paywalls, and I want open-source LLMs to do better on this kind of content. That gap is what these datasets are trying to fill.

I scraped data from sources such as the Free Music Archive and archive.org. For OCR I used vLLM running Gemma 4 31B, and for audio understanding I used Gemini. Once the raw data was collected, I cut down the row counts to produce a curated version that I could hand off to Adaption Labs for further processing.
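The article does not say how rows were chosen during the cut-down step, so the following is only a minimal sketch of one plausible approach: cap the number of rows per group (here, per language) with a seeded random sample so that large groups do not crowd out small ones. The function name `reduce_rows`, the `language` field, the cap of 20, and the toy rows are all hypothetical and exist only for illustration.

```python
import random
from collections import defaultdict


def reduce_rows(rows, key, max_per_group, seed=0):
    """Keep at most `max_per_group` randomly chosen rows for each value of `key`.

    Hypothetical helper: the article does not describe the real selection rule.
    """
    rng = random.Random(seed)  # fixed seed so the reduction is reproducible
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    kept = []
    for group_key in sorted(groups):  # sorted for a deterministic output order
        members = groups[group_key]
        if len(members) > max_per_group:
            members = rng.sample(members, max_per_group)
        kept.extend(members)
    return kept


# Toy stand-in for a raw scraped table (the language codes are illustrative only).
raw = [
    {"id": i, "language": lang}
    for i, lang in enumerate(["kok"] * 50 + ["mr"] * 5 + ["hi"] * 200)
]

curated = reduce_rows(raw, key="language", max_per_group=20, seed=42)
print(len(raw), "->", len(curated))
```

Capping per group rather than sampling uniformly keeps every small group intact, which matters when the point of the dataset is coverage of low-resource languages.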

The datasets

Curated subsets handed to Adaption Labs (14 datasets)

These are the row-reduced versions provided to Adaption Labs. They live in the Proper Adaption collection.

Dataset	Size (rows)
adaption-music-style-prompts	9.95k
Adaption-video-qa-diverse-topics	86
Adaption-low-resource-doc-qa	9.72k
Adaption-low-resource-audio	3.7k
adaption-multilingual-image-captions	462
adaption-multilingual-doc-qa	8.8k
current-affairs-2023	4.67k
current-affairs-2024	5.19k
current-affairs-2025	5.39k
current-affairs-2026	5.34k
frontend-html-tailwind-js	145
adaption-street-scene-descriptions	10.1k
Adaption-multilingual-sentences	10k
Adaption-multilingual-speech	10.3k

Raw scraped datasets (10 datasets across 4 sub-collections)

These are the full datasets I scraped from the web. They live in ReubenDataLab and are split across four sub-collections.

Audio

Dataset	Size (rows)
fma-labeled	29.3k
multilingual-synthetic-tts	68.7k
PolyglotAudio	1.16M

Text

Dataset	Size (rows)
PolyglotText	13.4M
current-affairs-2023	4.67k
current-affairs-2024	5.19k
current-affairs-2025	5.39k

Images

Dataset	Size (rows)
streetview-global	10.2k
magazines-multilingual-vqa	29k

Coding

Dataset	Size (rows)
frontend-coding	87
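To get a rough sense of scale, the abbreviated sizes in the tables above can be converted to integers and summed. This is a small sketch, assuming the size column counts rows and that "k" and "M" mean thousand and million. The helper `parse_size` is hypothetical, and because the listed sizes are already rounded, the totals are approximate.

```python
from decimal import Decimal

# Multipliers for the abbreviated sizes ("k" = thousand, "M" = million).
SUFFIXES = {"k": 1_000, "M": 1_000_000}


def parse_size(text):
    """Convert an abbreviated size such as '9.95k' or '1.16M' to an integer."""
    text = text.strip()
    if text[-1] in SUFFIXES:
        # Decimal avoids binary floating-point error in values like 9.95 * 1000.
        return int(Decimal(text[:-1]) * SUFFIXES[text[-1]])
    return int(text)


# Sizes copied from the curated table (14 datasets).
curated_sizes = [
    "9.95k", "86", "9.72k", "3.7k", "462", "8.8k", "4.67k",
    "5.19k", "5.39k", "5.34k", "145", "10.1k", "10k", "10.3k",
]

# Sizes copied from the raw tables (Audio, Text, Images, Coding).
raw_sizes = [
    "29.3k", "68.7k", "1.16M",
    "13.4M", "4.67k", "5.19k", "5.39k",
    "10.2k", "29k",
    "87",
]

curated_total = sum(parse_size(s) for s in curated_sizes)
raw_total = sum(parse_size(s) for s in raw_sizes)

print(f"curated: ~{curated_total:,} rows")
print(f"raw:     ~{raw_total:,} rows")
print(f"curated / raw: {curated_total / raw_total:.2%}")
```

With the listed figures this comes to roughly 84k curated rows against roughly 14.7M raw rows. PolyglotText and PolyglotAudio account for almost all of the raw total.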

Where to find everything

The full raw datasets live on ReubenDataLab, grouped into four collections. The Adaption versions are at Proper Adaption.

For an interactive view of the graphs, head over to https://reubencf-dataset-explorer.static.hf.space/index.html

Visualizations

Over two weeks I managed to scrape data across more than 130 languages. The treemap shows the top 90 of those.

This chart compares the original raw dataset with the version provided to adaptionlabs.ai after curation.

And here is the modality split of the dataset.

Get in touch

Hugging Face: @Reubencf
Datasets home: ReubenDataLab
