CHSER: A Dataset and Case Study on Generative Speech Error Correction for Child ASR
A fine-tuned Llama 3.2 1B model designed to clean and correct noisy speech-to-text (STT) transcriptions by removing filler words, fixing recognition errors, and improving overall text quality.
The model corrects common STT errors, including filler words, false starts, repetitions, and grammatical mistakes (see the system prompt below for the full list). Training loss over three epochs:
| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 3.937 | 3.787 |
| 2 | 3.430 | 3.397 |
| 3 | 3.255 | 3.163 |
Final validation loss: 3.163
The suggested system prompt is as follows:
```
You are a professional text editor. Transform raw speech transcriptions into polished written text.
Apply these transformations:
- Remove filler words (um, uh, ah, like, you know, I mean, sort of, kind of, basically, actually, literally)
- Eliminate false starts and self-corrections (keep only the final intended phrase)
- Fix grammar, punctuation, and sentence structure
- Remove repetitions and redundant phrases
- Convert spoken patterns to written prose
- Preserve original meaning, tone, and technical terms
Output only the corrected text with no preamble, labels, or explanations.
```
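The prompt above can be wired into the standard `transformers` chat interface. This is a minimal sketch: the model repository ID is a placeholder (the card does not state it), so substitute the actual repo name before running inference.

```python
# Sketch of prompting this model through the transformers chat format.
# The model repo ID below is a placeholder (assumption), not the real name.

SYSTEM_PROMPT = """You are a professional text editor. Transform raw speech transcriptions into polished written text.
Apply these transformations:
- Remove filler words (um, uh, ah, like, you know, I mean, sort of, kind of, basically, actually, literally)
- Eliminate false starts and self-corrections (keep only the final intended phrase)
- Fix grammar, punctuation, and sentence structure
- Remove repetitions and redundant phrases
- Convert spoken patterns to written prose
- Preserve original meaning, tone, and technical terms
Output only the corrected text with no preamble, labels, or explanations."""

def build_messages(transcript: str) -> list[dict]:
    """Wrap a raw STT transcript in the Llama 3.2 chat message format."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": transcript},
    ]

# Actual inference requires downloading the model, so it is shown commented out:
# from transformers import pipeline
# generator = pipeline("text-generation", model="<model-repo-id>")
# out = generator(build_messages("so um basically the uh model works pretty well"),
#                 max_new_tokens=256)
# print(out[0]["generated_text"][-1]["content"])
```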
The model was trained on the `aldigobbler/stt-correction` dataset, which is based on the CHSER dataset methodology for speech error correction.
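A training example from such a dataset would typically be rendered as a three-turn conversation (system, user, assistant). The sketch below illustrates that shape; the `noisy`/`clean` field names and the abridged system prompt are assumptions, so check the actual column names when loading the dataset (e.g. with `datasets.load_dataset`).

```python
# Sketch of turning a (noisy transcript, clean text) pair into a chat-style
# supervised fine-tuning example. Field names here are assumptions -- inspect
# the real dataset columns, e.g.:
#   from datasets import load_dataset
#   ds = load_dataset("aldigobbler/stt-correction")

# Abridged system prompt (full version shown earlier in this card).
SYSTEM_PROMPT = ("You are a professional text editor. "
                 "Transform raw speech transcriptions into polished written text.")

def to_chat_example(noisy: str, clean: str) -> list[dict]:
    """Build one training conversation: noisy transcript in, clean text out."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": noisy},
        {"role": "assistant", "content": clean},
    ]

example = to_chat_example(
    "so um the uh loss went down each epoch",
    "The loss went down each epoch.",
)
```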
Dataset methodology based on:
```bibtex
@misc{shankar2025chser,
  title={CHSER: A Dataset and Case Study on Generative Speech Error Correction for Child ASR},
  author={Natarajan Balaji Shankar and Zilai Wang and Kaiyuan Zhang and Mohan Shi and Abeer Alwan},
  year={2025},
  eprint={2505.18463},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2505.18463},
}
```
License: MIT