---
language:
- en
library_name: nemo
datasets:
- cdli/ugandan_english_nonstandard_speech_v1.0
tags:
- automatic-speech-recognition
- speech
- audio
- Transducer
- TDT
- FastConformer
- Conformer
- pytorch
- NeMo
- atypical-speech
- dysarthria
license: cc-by-sa-4.0
widget:
- example_title: LibriSpeech sample 1
src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: LibriSpeech sample 2
src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
model-index:
- name: cdli-parakeet-11b-en-finetune
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: CDLI Ugandan English Non-Standard Speech v1.0
      type: cdli/ugandan_english_nonstandard_speech_v1.0
      split: test
      args:
        language: en
    metrics:
    - name: Test WER (raw)
      type: wer
      value: 31.57
    - name: Test CER (raw)
      type: cer
      value: 15.09
    - name: Test WER (normalized)
      type: wer
      value: 21.2
    - name: Test CER (normalized)
      type: cer
      value: 12.56
metrics:
- wer
- cer
pipeline_tag: automatic-speech-recognition
---

# CDLI Parakeet TDT 1.1B English Fine-Tune (lr=5e-5)
This repository contains a NeMo ASR model fine-tuned from
`nvidia/parakeet-tdt-1.1b` on the gated
`cdli/ugandan_english_nonstandard_speech_v1.0` dataset.

This card documents the stronger 1.1B recovery run, which used a lower
learning rate (5e-5) after the earlier 1e-4 run plateaued early.
## Model Details

- Base model: `nvidia/parakeet-tdt-1.1b`
- Fine-tuning framework: NVIDIA NeMo
- Language: English
- Acoustic model family: FastConformer-TDT / RNNT-BPE
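As a minimal usage sketch (assuming the exported `.nemo` file from the Files
section below has been downloaded next to the script), the checkpoint can be
restored and used for transcription:

```python
# Minimal sketch: restore the exported checkpoint and run transcription.
# "sample.wav" is a placeholder path, not a file shipped with this repo.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.restore_from("EN-PARAKEET-TDT-F1tdt-1-1b.nemo")

# transcribe() takes a list of audio file paths and returns one hypothesis per file
transcripts = asr_model.transcribe(["sample.wav"])
print(transcripts[0])
```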
## Dataset

- Dataset: `cdli/ugandan_english_nonstandard_speech_v1.0`
- License: cc-by-sa-4.0
- Split sizes used by the source dataset card:
  - train: 5176
  - validation: 638
  - test: 1017

The evaluation artifacts in this run contain 1016 scored rows.
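Because the dataset is gated, loading it requires prior access approval and an
authenticated Hugging Face session. A minimal sketch with the `datasets`
library:

```python
# Hedged sketch: load the gated test split after access has been granted.
# Requires an authenticated session (e.g. via `huggingface-cli login`).
from datasets import load_dataset

test_split = load_dataset(
    "cdli/ugandan_english_nonstandard_speech_v1.0",
    split="test",
)
print(len(test_split))  # the source dataset card lists 1017 test rows
```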
## Training Configuration

- Work root: /jupyter_kernel/parakeet_cdli_en_5e5
- Base checkpoint: `nvidia/parakeet-tdt-1.1b`
- Max manifest audio length: 40.0 s
- Max training audio length: 30.0 s
- Min audio length: 0.2 s
- Train batch size: 4
- Eval batch size: 8
- Gradient accumulation steps: 8
- Effective train batch size: 32
- Learning rate: 5e-5
- Weight decay: 1e-3
- Warmup steps: 100
- Scheduler: CosineAnnealing
- Max steps configured: 20000
- Early stopping patience: 10
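As an illustrative sketch, the learning-rate, weight-decay, warmup, and
scheduler settings above map onto a NeMo optimizer config roughly as follows;
the optimizer name (`adamw`) is an assumption, since this card does not state
which optimizer was used:

```python
# Hedged sketch of the listed hyperparameters as a NeMo optimizer config.
# Only lr, weight decay, warmup steps, scheduler, and max steps come from
# this card; the optimizer name is an assumption.
from omegaconf import OmegaConf

optim_cfg = OmegaConf.create({
    "name": "adamw",  # assumption: not documented above
    "lr": 5e-5,
    "weight_decay": 1e-3,
    "sched": {
        "name": "CosineAnnealing",
        "warmup_steps": 100,
        "max_steps": 20000,
    },
})

# Applied to a restored model before training, e.g.:
# asr_model.setup_optimization(optim_config=optim_cfg)
```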
## Evaluation

Evaluation was run on the held-out test split using both raw and normalized
transcript comparison.

### Corpus Metrics

- Raw WER: 31.57%
- Raw CER: 15.09%
- Normalized WER: 21.20%
- Normalized CER: 12.56%

### Average Utterance Metrics

- Average normalized utterance WER (capped at 1.0): 20.70%
- Average normalized utterance CER (capped at 1.0): 12.58%
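A minimal sketch of this scoring scheme using the `jiwer` library; the
`normalize()` rules shown (lowercasing and punctuation stripping) are
assumptions, since the exact normalizer used for this run is not documented
here:

```python
# Hedged sketch: corpus-level and capped per-utterance WER/CER scoring.
import string

import jiwer

def normalize(text: str) -> str:
    # assumed normalization: lowercase and strip punctuation
    text = text.lower()
    return text.translate(str.maketrans("", "", string.punctuation)).strip()

# toy reference/hypothesis pairs standing in for the scored test rows
refs = ["Hello, world.", "Speech recognition is hard."]
hyps = ["hello word", "speech recognition is hard"]

# Corpus metrics: errors are pooled across all utterances before dividing.
raw_wer = jiwer.wer(refs, hyps)
norm_wer = jiwer.wer([normalize(r) for r in refs], [normalize(h) for h in hyps])
norm_cer = jiwer.cer([normalize(r) for r in refs], [normalize(h) for h in hyps])

# Average utterance metrics: score each utterance, cap at 1.0, then average.
per_utt_wer = [
    min(jiwer.wer(normalize(r), normalize(h)), 1.0) for r, h in zip(refs, hyps)
]
avg_utt_wer = sum(per_utt_wer) / len(per_utt_wer)
print(raw_wer, norm_wer, norm_cer, avg_utt_wer)
```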
## Files

- `EN-PARAKEET-TDT-F1tdt-1-1b.nemo`: exported NeMo checkpoint
- `checkpoints/`: intermediate training checkpoints
- `test_predictions.csv`
- `test_predictions.jsonl`
- `test_predictions_scored.csv`
- `test_predictions_scored.jsonl`
- `test_predictions_grouped_analysis.csv`
## Notes

- This 5e-5 run improved substantially over the earlier 1.1B 1e-4 run.
- Access to the source dataset is gated. Review the dataset terms before requesting access.