---
language:
  - en
library_name: nemo
datasets:
  - cdli/ugandan_english_nonstandard_speech_v1.0
tags:
  - automatic-speech-recognition
  - speech
  - audio
  - Transducer
  - TDT
  - FastConformer
  - Conformer
  - pytorch
  - NeMo
  - atypical-speech
  - dysarthria
license: cc-by-sa-4.0
widget:
  - example_title: LibriSpeech sample 1
    src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
  - example_title: LibriSpeech sample 2
    src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
model-index:
  - name: cdli-parakeet-11b-en-finetune
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: CDLI Ugandan English Non-Standard Speech v1.0
          type: cdli/ugandan_english_nonstandard_speech_v1.0
          split: test
          args:
            language: en
        metrics:
          - name: Test WER (raw)
            type: wer
            value: 31.57
          - name: Test CER (raw)
            type: cer
            value: 15.09
          - name: Test WER (normalized)
            type: wer
            value: 21.2
          - name: Test CER (normalized)
            type: cer
            value: 12.56
metrics:
  - wer
  - cer
pipeline_tag: automatic-speech-recognition
---

CDLI Parakeet TDT 1.1B English Fine-Tune (lr=5e-5)

This repository contains a NeMo ASR model fine-tuned from nvidia/parakeet-tdt-1.1b on the gated cdli/ugandan_english_nonstandard_speech_v1.0 dataset.

This card documents the stronger 1.1B recovery run, which used a lower learning rate (5e-5) after the earlier 1e-4 run plateaued early.

Model Details

  • Base model: nvidia/parakeet-tdt-1.1b
  • Fine-tuning framework: NVIDIA NeMo
  • Language: English
  • Acoustic model family: FastConformer-TDT / RNNT-BPE
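
The checkpoint can be loaded directly with the NeMo toolkit. The sketch below is a minimal example, assuming `nemo_toolkit[asr]` is installed and the exported `EN-PARAKEET-TDT-F1tdt-1-1b.nemo` file has been downloaded from this repository; the audio path is a placeholder, and the exact return type of `transcribe()` varies slightly between NeMo releases.

```python
# Minimal transcription sketch (assumes nemo_toolkit[asr] is installed and the
# exported checkpoint from this repo is available locally).
import nemo.collections.asr as nemo_asr

# Restore the fine-tuned checkpoint exported by this run.
asr_model = nemo_asr.models.ASRModel.restore_from("EN-PARAKEET-TDT-F1tdt-1-1b.nemo")

# Transcribe one or more 16 kHz mono WAV files; "sample.wav" is a placeholder path.
# Depending on the NeMo version, entries may be plain strings or Hypothesis objects.
predictions = asr_model.transcribe(["sample.wav"])
print(predictions[0])
```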

Dataset

  • Dataset: cdli/ugandan_english_nonstandard_speech_v1.0
  • License: cc-by-sa-4.0
  • Split sizes used by the source dataset card:
    • train: 5176
    • validation: 638
    • test: 1017

The evaluation artifacts in this run contain 1016 scored rows, one fewer than the 1017-row test split.
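
Because the dataset is gated, loading it requires approved access and an authenticated Hugging Face session. A minimal sketch with the `datasets` library, assuming access has already been granted:

```python
# Sketch of loading the gated dataset with the Hugging Face datasets library.
# Assumes access has been approved and you are logged in (e.g. via `huggingface-cli login`).
from datasets import load_dataset

test_split = load_dataset(
    "cdli/ugandan_english_nonstandard_speech_v1.0",
    split="test",
)
print(len(test_split))  # expected to be 1017 per the dataset card
```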

Training Configuration

  • Work root: /jupyter_kernel/parakeet_cdli_en_5e5
  • Base checkpoint: nvidia/parakeet-tdt-1.1b
  • Max manifest audio length: 40.0 s
  • Max training audio length: 30.0 s
  • Min audio length: 0.2 s
  • Train batch size: 4
  • Eval batch size: 8
  • Gradient accumulation steps: 8
  • Effective train batch size: 32
  • Learning rate: 5e-5
  • Weight decay: 1e-3
  • Warmup steps: 100
  • Scheduler: CosineAnnealing
  • Max steps configured: 20000
  • Early stopping patience: 10
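
The training script itself is not published in this repository. The sketch below only illustrates how the hyperparameters above could be wired into a NeMo / PyTorch Lightning fine-tune; the monitored metric name, data setup calls, and manifest handling are assumptions, not the exact recipe used for this run.

```python
# Illustrative wiring of the hyperparameters above into a NeMo fine-tune.
# Paths, manifests, and the monitored metric are placeholders/assumptions.
import pytorch_lightning as pl
import nemo.collections.asr as nemo_asr
from omegaconf import OmegaConf

# Start from the public base checkpoint.
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-1.1b")

# Optimizer and scheduler settings matching the values listed on this card.
optim_cfg = OmegaConf.create({
    "name": "adamw",
    "lr": 5e-5,
    "weight_decay": 1e-3,
    "sched": {"name": "CosineAnnealing", "warmup_steps": 100, "max_steps": 20000},
})

trainer = pl.Trainer(
    max_steps=20000,
    accumulate_grad_batches=8,  # effective train batch size 4 x 8 = 32
    callbacks=[pl.callbacks.EarlyStopping(monitor="val_wer", patience=10, mode="min")],
)

model.set_trainer(trainer)
model.setup_optimization(optim_cfg)
# model.setup_training_data(...) / model.setup_validation_data(...) would point at
# manifests filtered to 0.2-30 s utterances before calling trainer.fit(model).
```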

Evaluation

Evaluation was run on the held-out test split using both raw and normalized transcript comparison.

Corpus Metrics

  • Raw WER: 31.57%
  • Raw CER: 15.09%
  • Normalized WER: 21.20%
  • Normalized CER: 12.56%

Average Utterance Metrics

  • Average normalized utterance WER (capped at 1.0): 20.70%
  • Average normalized utterance CER (capped at 1.0): 12.58%
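
Corpus metrics pool edit distances over the whole test set, while the average utterance metrics compute an error rate per utterance (capped at 1.0) and then average across utterances. A sketch of both computations with `jiwer` is shown below; the normalization function and the prediction-file column names are assumptions, since neither is documented on this card.

```python
# Sketch reproducing corpus-level and average-utterance metrics with jiwer.
# The normalizer and the column names ("reference", "prediction") are assumptions.
import re

import jiwer
import pandas as pd

def normalize(text: str) -> str:
    # Assumed normalization: lowercase, strip punctuation, collapse whitespace.
    text = re.sub(r"[^a-z0-9' ]+", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

df = pd.read_csv("test_predictions_scored.csv")
refs = [normalize(t) for t in df["reference"].astype(str)]
hyps = [normalize(t) for t in df["prediction"].astype(str)]

# Corpus metrics: edit distances pooled over all scored utterances.
print("corpus WER:", jiwer.wer(refs, hyps))
print("corpus CER:", jiwer.cer(refs, hyps))

# Average utterance metrics: per-utterance error rate capped at 1.0, then averaged.
utt_wer = [min(jiwer.wer(r, h), 1.0) for r, h in zip(refs, hyps)]
utt_cer = [min(jiwer.cer(r, h), 1.0) for r, h in zip(refs, hyps)]
print("avg utterance WER:", sum(utt_wer) / len(utt_wer))
print("avg utterance CER:", sum(utt_cer) / len(utt_cer))
```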

Files

  • EN-PARAKEET-TDT-F1tdt-1-1b.nemo: exported NeMo checkpoint
  • checkpoints/: intermediate training checkpoints
  • test_predictions.csv
  • test_predictions.jsonl
  • test_predictions_scored.csv
  • test_predictions_scored.jsonl
  • test_predictions_grouped_analysis.csv
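
For a quick look at the scored predictions, the artifacts can be opened with pandas; the column layout is whatever the export script produced and is not documented on this card, so the snippet below only inspects it.

```python
# Inspect the scored predictions (JSONL variant); column names are not documented here.
import pandas as pd

scored = pd.read_json("test_predictions_scored.jsonl", lines=True)
print(scored.shape)             # expected to be (1016, n_columns)
print(scored.columns.tolist())  # discover the actual column names
print(scored.head())
```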

Notes

  • This 5e-5 run improved substantially over the earlier 1.1B 1e-4 run.
  • Access to the source dataset is gated. Review the dataset terms before requesting access.