CHSER: A Dataset and Case Study on Generative Speech Error Correction for Child ASR
A fine-tuned Llama 3.2 1B model designed to clean and correct noisy speech-to-text (STT) transcriptions by removing filler words, fixing recognition errors, and improving overall text quality.
The model corrects common STT errors, including filler words, false starts, repetitions, and grammatical mistakes (see the system prompt below for the full list). Training loss over three epochs:
| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 3.937 | 3.787 |
| 2 | 3.430 | 3.397 |
| 3 | 3.255 | 3.163 |
Final validation loss: 3.163
The suggested system prompt is as follows:
```
You are a professional text editor. Transform raw speech transcriptions into polished written text.
Apply these transformations:
- Remove filler words (um, uh, ah, like, you know, I mean, sort of, kind of, basically, actually, literally)
- Eliminate false starts and self-corrections (keep only the final intended phrase)
- Fix grammar, punctuation, and sentence structure
- Remove repetitions and redundant phrases
- Convert spoken patterns to written prose
- Preserve original meaning, tone, and technical terms
Output only the corrected text with no preamble, labels, or explanations.
```
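The prompt above can be wired into the standard `transformers` chat interface. This is a minimal sketch: the model repository ID is a placeholder (the card does not state it), so substitute the actual repo name before running inference.

```python
# Sketch of prompting this model through the transformers chat format.
# The model repo ID below is a placeholder (assumption), not the real name.

SYSTEM_PROMPT = """You are a professional text editor. Transform raw speech transcriptions into polished written text.
Apply these transformations:
- Remove filler words (um, uh, ah, like, you know, I mean, sort of, kind of, basically, actually, literally)
- Eliminate false starts and self-corrections (keep only the final intended phrase)
- Fix grammar, punctuation, and sentence structure
- Remove repetitions and redundant phrases
- Convert spoken patterns to written prose
- Preserve original meaning, tone, and technical terms
Output only the corrected text with no preamble, labels, or explanations."""

def build_messages(transcript: str) -> list[dict]:
    """Wrap a raw STT transcript in the Llama 3.2 chat message format."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": transcript},
    ]

# Actual inference requires downloading the model, so it is shown commented out:
# from transformers import pipeline
# generator = pipeline("text-generation", model="<model-repo-id>")
# out = generator(build_messages("so um basically the uh model works pretty well"),
#                 max_new_tokens=256)
# print(out[0]["generated_text"][-1]["content"])
```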
The model was trained on the `aldigobbler/stt-correction` dataset, which is based on the CHSER dataset methodology for speech error correction.
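A training example from such a dataset would typically be rendered as a three-turn conversation (system, user, assistant). The sketch below illustrates that shape; the `noisy`/`clean` field names and the abridged system prompt are assumptions, so check the actual column names when loading the dataset (e.g. with `datasets.load_dataset`).

```python
# Sketch of turning a (noisy transcript, clean text) pair into a chat-style
# supervised fine-tuning example. Field names here are assumptions -- inspect
# the real dataset columns, e.g.:
#   from datasets import load_dataset
#   ds = load_dataset("aldigobbler/stt-correction")

# Abridged system prompt (full version shown earlier in this card).
SYSTEM_PROMPT = ("You are a professional text editor. "
                 "Transform raw speech transcriptions into polished written text.")

def to_chat_example(noisy: str, clean: str) -> list[dict]:
    """Build one training conversation: noisy transcript in, clean text out."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": noisy},
        {"role": "assistant", "content": clean},
    ]

example = to_chat_example(
    "so um the uh loss went down each epoch",
    "The loss went down each epoch.",
)
```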
Dataset methodology based on:
```bibtex
@misc{shankar2025chser,
  title={CHSER: A Dataset and Case Study on Generative Speech Error Correction for Child ASR},
  author={Natarajan Balaji Shankar and Zilai Wang and Kaiyuan Zhang and Mohan Shi and Abeer Alwan},
  year={2025},
  eprint={2505.18463},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2505.18463},
}
```
License: MIT