---
language:
  - en
library_name: nemo
datasets:
  - cdli/ugandan_english_nonstandard_speech_v1.0
tags:
  - automatic-speech-recognition
  - speech
  - audio
  - Transducer
  - TDT
  - FastConformer
  - Conformer
  - pytorch
  - NeMo
  - atypical-speech
  - dysarthria
license: cc-by-sa-4.0
widget:
  - example_title: LibriSpeech sample 1
    src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
  - example_title: LibriSpeech sample 2
    src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
model-index:
  - name: cdli-parakeet-11b-en-finetune
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: CDLI Ugandan English Non-Standard Speech v1.0
          type: cdli/ugandan_english_nonstandard_speech_v1.0
          split: test
          args:
            language: en
        metrics:
          - name: Test WER (raw)
            type: wer
            value: 31.57
          - name: Test CER (raw)
            type: cer
            value: 15.09
          - name: Test WER (normalized)
            type: wer
            value: 21.2
          - name: Test CER (normalized)
            type: cer
            value: 12.56
metrics:
  - wer
  - cer
pipeline_tag: automatic-speech-recognition
---

CDLI Parakeet TDT 1.1B English Fine-Tune (lr=5e-5)

This repository contains a NeMo ASR model fine-tuned from nvidia/parakeet-tdt-1.1b on the gated cdli/ugandan_english_nonstandard_speech_v1.0 dataset.

This card documents the stronger 1.1B recovery run, which used a lower learning rate (5e-5) after the earlier 1e-4 run plateaued early.

Model Details

  • Base model: nvidia/parakeet-tdt-1.1b
  • Fine-tuning framework: NVIDIA NeMo
  • Language: English
  • Acoustic model family: FastConformer-TDT / RNNT-BPE
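
The checkpoint can be loaded directly with the NeMo toolkit. The sketch below is a minimal example, assuming `nemo_toolkit[asr]` is installed and the exported `EN-PARAKEET-TDT-F1tdt-1-1b.nemo` file has been downloaded from this repository; the audio path is a placeholder, and the exact return type of `transcribe()` varies slightly between NeMo releases.

```python
# Minimal transcription sketch (assumes nemo_toolkit[asr] is installed and the
# exported checkpoint from this repo is available locally).
import nemo.collections.asr as nemo_asr

# Restore the fine-tuned checkpoint exported by this run.
asr_model = nemo_asr.models.ASRModel.restore_from("EN-PARAKEET-TDT-F1tdt-1-1b.nemo")

# Transcribe one or more 16 kHz mono WAV files; "sample.wav" is a placeholder path.
# Depending on the NeMo version, entries may be plain strings or Hypothesis objects.
predictions = asr_model.transcribe(["sample.wav"])
print(predictions[0])
```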

Dataset

  • Dataset: cdli/ugandan_english_nonstandard_speech_v1.0
  • License: cc-by-sa-4.0
  • Split sizes used by the source dataset card:
    • train: 5176
    • validation: 638
    • test: 1017

The evaluation artifacts in this run contain 1016 scored rows, one fewer than the 1017-row test split.
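
Because the dataset is gated, loading it requires approved access and an authenticated Hugging Face session. A minimal sketch with the `datasets` library, assuming access has already been granted:

```python
# Sketch of loading the gated dataset with the Hugging Face datasets library.
# Assumes access has been approved and you are logged in (e.g. via `huggingface-cli login`).
from datasets import load_dataset

test_split = load_dataset(
    "cdli/ugandan_english_nonstandard_speech_v1.0",
    split="test",
)
print(len(test_split))  # expected to be 1017 per the dataset card
```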

Training Configuration

  • Work root: /jupyter_kernel/parakeet_cdli_en_5e5
  • Base checkpoint: nvidia/parakeet-tdt-1.1b
  • Max manifest audio length: 40.0 s
  • Max training audio length: 30.0 s
  • Min audio length: 0.2 s
  • Train batch size: 4
  • Eval batch size: 8
  • Gradient accumulation steps: 8
  • Effective train batch size: 32
  • Learning rate: 5e-5
  • Weight decay: 1e-3
  • Warmup steps: 100
  • Scheduler: CosineAnnealing
  • Max steps configured: 20000
  • Early stopping patience: 10
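
The training script itself is not published in this repository. The sketch below only illustrates how the hyperparameters above could be wired into a NeMo / PyTorch Lightning fine-tune; the monitored metric name, data setup calls, and manifest handling are assumptions, not the exact recipe used for this run.

```python
# Illustrative wiring of the hyperparameters above into a NeMo fine-tune.
# Paths, manifests, and the monitored metric are placeholders/assumptions.
import pytorch_lightning as pl
import nemo.collections.asr as nemo_asr
from omegaconf import OmegaConf

# Start from the public base checkpoint.
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-1.1b")

# Optimizer and scheduler settings matching the values listed on this card.
optim_cfg = OmegaConf.create({
    "name": "adamw",
    "lr": 5e-5,
    "weight_decay": 1e-3,
    "sched": {"name": "CosineAnnealing", "warmup_steps": 100, "max_steps": 20000},
})

trainer = pl.Trainer(
    max_steps=20000,
    accumulate_grad_batches=8,  # effective train batch size 4 x 8 = 32
    callbacks=[pl.callbacks.EarlyStopping(monitor="val_wer", patience=10, mode="min")],
)

model.set_trainer(trainer)
model.setup_optimization(optim_cfg)
# model.setup_training_data(...) / model.setup_validation_data(...) would point at
# manifests filtered to 0.2-30 s utterances before calling trainer.fit(model).
```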

Evaluation

Evaluation was run on the held-out test split using both raw and normalized transcript comparison.

Corpus Metrics

  • Raw WER: 31.57%
  • Raw CER: 15.09%
  • Normalized WER: 21.20%
  • Normalized CER: 12.56%

Average Utterance Metrics

  • Average normalized utterance WER (capped at 1.0): 20.70%
  • Average normalized utterance CER (capped at 1.0): 12.58%
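
Corpus metrics pool edit distances over the whole test set, while the average utterance metrics compute an error rate per utterance (capped at 1.0) and then average across utterances. A sketch of both computations with `jiwer` is shown below; the normalization function and the prediction-file column names are assumptions, since neither is documented on this card.

```python
# Sketch reproducing corpus-level and average-utterance metrics with jiwer.
# The normalizer and the column names ("reference", "prediction") are assumptions.
import re

import jiwer
import pandas as pd

def normalize(text: str) -> str:
    # Assumed normalization: lowercase, strip punctuation, collapse whitespace.
    text = re.sub(r"[^a-z0-9' ]+", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

df = pd.read_csv("test_predictions_scored.csv")
refs = [normalize(t) for t in df["reference"].astype(str)]
hyps = [normalize(t) for t in df["prediction"].astype(str)]

# Corpus metrics: edit distances pooled over all scored utterances.
print("corpus WER:", jiwer.wer(refs, hyps))
print("corpus CER:", jiwer.cer(refs, hyps))

# Average utterance metrics: per-utterance error rate capped at 1.0, then averaged.
utt_wer = [min(jiwer.wer(r, h), 1.0) for r, h in zip(refs, hyps)]
utt_cer = [min(jiwer.cer(r, h), 1.0) for r, h in zip(refs, hyps)]
print("avg utterance WER:", sum(utt_wer) / len(utt_wer))
print("avg utterance CER:", sum(utt_cer) / len(utt_cer))
```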

Files

  • EN-PARAKEET-TDT-F1tdt-1-1b.nemo: exported NeMo checkpoint
  • checkpoints/: intermediate training checkpoints
  • test_predictions.csv
  • test_predictions.jsonl
  • test_predictions_scored.csv
  • test_predictions_scored.jsonl
  • test_predictions_grouped_analysis.csv
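
For a quick look at the scored predictions, the artifacts can be opened with pandas; the column layout is whatever the export script produced and is not documented on this card, so the snippet below only inspects it.

```python
# Inspect the scored predictions (JSONL variant); column names are not documented here.
import pandas as pd

scored = pd.read_json("test_predictions_scored.jsonl", lines=True)
print(scored.shape)             # expected to be (1016, n_columns)
print(scored.columns.tolist())  # discover the actual column names
print(scored.head())
```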

Notes

  • This 5e-5 run improved substantially over the earlier 1.1B 1e-4 run.
  • Access to the source dataset is gated. Review the dataset terms before requesting access.