---
license: apache-2.0
pipeline_tag: automatic-speech-recognition
tags:
- pytorch
- audio
- speech
- automatic-speech-recognition
- whisper
- wav2vec2
model-index:
- name: whisper_medium_fp16_transformers
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      type: librispeech_asr
      name: LibriSpeech (clean)
      config: clean
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 0
      name: Test WER
      description: Word Error Rate
    - type: mer
      value: 0
      name: Test MER
      description: Match Error Rate
    - type: wil
      value: 0
      name: Test WIL
      description: Word Information Lost
    - type: wip
      value: 0
      name: Test WIP
      description: Word Information Preserved
    - type: cer
      value: 0
      name: Test CER
      description: Character Error Rate
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      type: librispeech_asr
      name: LibriSpeech (other)
      config: other
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 0
      name: Test WER
      description: Word Error Rate
    - type: mer
      value: 0
      name: Test MER
      description: Match Error Rate
    - type: wil
      value: 0
      name: Test WIL
      description: Word Information Lost
    - type: wip
      value: 0
      name: Test WIP
      description: Word Information Preserved
    - type: cer
      value: 0
      name: Test CER
      description: Character Error Rate
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      type: mozilla-foundation/common_voice_14_0
      name: Common Voice (14.0) (Hindi)
      config: hi
      split: test
      args:
        language: hi
    metrics:
    - type: wer
      value: 54.97
      name: Test WER
      description: Word Error Rate
    - type: mer
      value: 47.86
      name: Test MER
      description: Match Error Rate
    - type: wil
      value: 66.83
      name: Test WIL
      description: Word Information Lost
    - type: wip
      value: 33.16
      name: Test WIP
      description: Word Information Preserved
    - type: cer
      value: 30.23
      name: Test CER
      description: Character Error Rate
widget:
- example_title: Hinglish Sample
  src: https://huggingface.co/devasheeshG/whisper_medium_fp16_transformers/resolve/main/test.wav
- example_title: Librispeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: Librispeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
language:
- en
- zh
- de
- es
- ru
- ko
- fr
- ja
- pt
- tr
- pl
- ca
- nl
- ar
- sv
- it
- id
- hi
- fi
- vi
- he
- uk
- el
- ms
- cs
- ro
- da
- hu
- ta
- 'no'
- th
- ur
- hr
- bg
- lt
- la
- mi
- ml
- cy
- sk
- te
- fa
- lv
- bn
- sr
- az
- sl
- kn
- et
- mk
- br
- eu
- is
- hy
- ne
- mn
- bs
- kk
- sq
- sw
- gl
- mr
- pa
- si
- km
- sn
- yo
- so
- af
- oc
- ka
- be
- tg
- sd
- gu
- am
- yi
- lo
- uz
- fo
- ht
- ps
- tk
- nn
- mt
- sa
- lb
- my
- bo
- tl
- mg
- as
- tt
- haw
- ln
- ha
- ba
- jw
- su
---
## Versions:
- CUDA: 12.1
- cuDNN Version: 8.9.2.26_1.0-1_amd64
- tensorflow Version: 2.12.0
- torch Version: 2.1.0.dev20230606+cu12135
- transformers Version: 4.30.2
- accelerate Version: 0.20.3
## Model Benchmarks:
- RAM: 2.8 GB (Original_Model: 5.5 GB)
- VRAM: 1812 MB (Original_Model: 6 GB)
- test.wav: 23 s (multilingual speech, i.e. English + Hindi)
- **Time in seconds for processing on each device**

| Device Name       | float32 (Original) | float16 | CudaCores | TensorCores |
| ----------------- | ------------------ | ------- | --------- | ----------- |
| RTX 3060          | 1.7                | 1.1     | 3,584     | 112         |
| GTX 1660 Super    | OOM                | 3.3     | 1,408     | N/A         |
| Colab (Tesla T4)  | 2.8                | 2.2     | 2,560     | 320         |
| Colab (CPU)       | 35                 | N/A     | N/A       | N/A         |
| M1 (CPU)          | -                  | -       | -         | -           |
| M1 (GPU -> 'mps') | -                  | -       | -         | -           |

- **NOTE: TensorCores are efficient in mixed-precision calculations**
- **CPU: torch.float16 is not supported on CPU (AMD Ryzen 5 3600 or Colab CPU)**
- Punctuation: True
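The RAM and VRAM savings listed above are roughly what halving the bytes per weight would predict. The following is a back-of-the-envelope sketch, not a measurement. It assumes about 769 M parameters for Whisper medium (the figure listed in OpenAI's Whisper README, not stated in this card) and counts weights only, ignoring activations, buffers, and framework overhead.

```python
# Rough weight-memory estimate; counts parameters only, not activations,
# buffers, or framework overhead.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}

# Assumption: parameter count of Whisper medium as listed by OpenAI (~769 M).
WHISPER_MEDIUM_PARAMS = 769_000_000


def weight_memory_mib(num_params: int, dtype: str) -> float:
    """Return the memory needed to hold the weights, in MiB."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 2)


if __name__ == "__main__":
    for dtype in ("float32", "float16"):
        mib = weight_memory_mib(WHISPER_MEDIUM_PARAMS, dtype)
        print(f"{dtype}: ~{mib:.0f} MiB")
```

Under these assumptions the fp16 weights come out somewhat below the 1812 MB VRAM figure reported above; the difference is plausibly runtime overhead, though that has not been verified here.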
## Model Error Benchmarks:
- **WER: Word Error Rate**
- **MER: Match Error Rate**
- **WIL: Word Information Lost**
- **WIP: Word Information Preserved**
- **CER: Character Error Rate**

### Hindi to Hindi (test.tsv) [Common Voice 14.0](https://commonvoice.mozilla.org/en/datasets)
**Test done on RTX 3060 on 2557 Samples**

|                         | WER   | MER   | WIL   | WIP   | CER   |
| ----------------------- | ----- | ----- | ----- | ----- | ----- |
| Original_Model (54 min) | 52.02 | 47.86 | 66.82 | 33.17 | 23.76 |
| This_Model (38 min)     | 54.97 | 47.86 | 66.83 | 33.16 | 30.23 |

### Hindi to English (test.csv) [Custom Dataset](https://huggingface.co/datasets/devasheeshG/common_voices_14_0_hi2en_hi2hi)
**Test done on RTX 3060 on 1000 Samples**

|                         | WER | MER | WIL | WIP | CER |
| ----------------------- | --- | --- | --- | --- | --- |
| Original_Model (30 min) | -   | -   | -   | -   | -   |
| This_Model (20 min)     | -   | -   | -   | -   | -   |

### English ([LibriSpeech](https://huggingface.co/datasets/librispeech_asr) -> test-clean)
**Test done on RTX 3060 on __ Samples**

|                | WER | MER | WIL | WIP | CER |
| -------------- | --- | --- | --- | --- | --- |
| Original_Model | -   | -   | -   | -   | -   |
| This_Model     | -   | -   | -   | -   | -   |

### English ([LibriSpeech](https://huggingface.co/datasets/librispeech_asr) -> test-other)
**Test done on RTX 3060 on __ Samples**

|                | WER | MER | WIL | WIP | CER |
| -------------- | --- | --- | --- | --- | --- |
| Original_Model | -   | -   | -   | -   | -   |
| This_Model     | -   | -   | -   | -   | -   |

- **The 'jiwer' library is used for calculations**
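All five metrics can be derived from one alignment between reference and hypothesis that counts hits (H), substitutions (S), deletions (D), and insertions (I). The sketch below is a standard-library reimplementation for illustration only; it is not the jiwer code that produced the numbers above, and the function names `align_counts` and `error_metrics` are hypothetical. It uses the usual definitions: WER = (S+D+I)/(H+S+D), MER = (S+D+I)/(H+S+D+I), WIP = H/(H+S+D) × H/(H+S+I), WIL = 1 − WIP, and CER as WER computed over characters. It assumes a non-empty reference and returns fractions, whereas the tables above report percentages.

```python
from typing import Dict, Sequence, Tuple


def align_counts(ref: Sequence[str], hyp: Sequence[str]) -> Tuple[int, int, int, int]:
    """Return (hits, substitutions, deletions, insertions) of a minimum-edit alignment."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # hit or substitution
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
            )
    # Walk back from the bottom-right corner and count each operation.
    hits = subs = dels = ins = 0
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            if ref[i - 1] == hyp[j - 1]:
                hits += 1
            else:
                subs += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return hits, subs, dels, ins


def error_metrics(reference: str, hypothesis: str) -> Dict[str, float]:
    """Compute WER, MER, WIL, WIP, and CER as fractions (reference must be non-empty)."""
    h, s, d, ins = align_counts(reference.split(), hypothesis.split())
    wip = (h / (h + s + d)) * (h / (h + s + ins)) if h else 0.0
    ch, cs, cd, ci = align_counts(list(reference), list(hypothesis))
    return {
        "wer": (s + d + ins) / (h + s + d),
        "mer": (s + d + ins) / (h + s + d + ins),
        "wil": 1.0 - wip,
        "wip": wip,
        "cer": (cs + cd + ci) / (ch + cs + cd),
    }


if __name__ == "__main__":
    print(error_metrics("the cat sat on the mat", "the cat sit on mat"))
```

Because WIL is defined as 1 − WIP, the two always sum to 1, which is consistent with the WIL and WIP columns above adding up to roughly 100.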
## Code for conversion:
- [Will soon be uploaded to GitHub](https://github.com/devasheeshG)
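Since the conversion script is not published here, the following is only a generic sketch of how an fp16 copy of a Transformers checkpoint is commonly produced: load the original weights, cast them with `.half()`, and save the result. It is an assumption, not the author's actual conversion code; the function name `convert_to_fp16` and the output directory name are hypothetical.

```python
def convert_to_fp16(source_id: str = "openai/whisper-medium",
                    output_dir: str = "whisper_medium_fp16") -> str:
    """Download `source_id`, cast its weights to float16, and save to `output_dir`."""
    # Heavy dependencies are imported lazily so that importing this module is cheap.
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

    model = AutoModelForSpeechSeq2Seq.from_pretrained(source_id)
    model = model.half()  # cast all floating-point parameters to torch.float16
    model.save_pretrained(output_dir)

    # Save the processor (feature extractor + tokenizer) alongside the weights
    # so the output directory can be loaded on its own.
    AutoProcessor.from_pretrained(source_id).save_pretrained(output_dir)
    return output_dir
```

Calling `convert_to_fp16()` downloads the full original checkpoint, so it needs network access and enough RAM to hold the float32 weights before casting.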
## Usage
This repo contains an ``__init__.py`` file with all the code needed to use this model.

First, clone this repo and place all the files inside a folder.
### Make sure you have git-lfs installed (https://git-lfs.com)
```bash
git lfs install
git clone https://huggingface.co/devasheeshG/whisper_medium_fp16_transformers
```
**Please try this in a Jupyter notebook**
```python
# Import the model
from whisper_medium_fp16_transformers import Model, load_audio, pad_or_trim
```
```python
# Initialize the model
model = Model(
    model_name_or_path='whisper_medium_fp16_transformers',
    cuda_visible_device="0",
    device='cuda',
)
```
```python
# Load audio
audio = load_audio('whisper_medium_fp16_transformers/test.wav')
audio = pad_or_trim(audio)
```
```python
# Transcribe (the first transcription takes longer)
model.transcribe(audio)
```
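The `pad_or_trim` helper presumably mirrors the function of the same name in OpenAI's Whisper package, which zero-pads or truncates a waveform to one 30-second window at 16 kHz (480,000 samples). That is an assumption about this repo's ``__init__.py``, not something the card states. A minimal standard-library sketch of that behaviour, using a plain list instead of an array and the hypothetical name `pad_or_trim_sketch`:

```python
from typing import List

SAMPLE_RATE = 16_000      # Hz; the input rate Whisper expects
CHUNK_SECONDS = 30        # length of one Whisper input window
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples


def pad_or_trim_sketch(samples: List[float], length: int = N_SAMPLES) -> List[float]:
    """Return a copy of `samples` that is exactly `length` long."""
    if len(samples) >= length:
        return samples[:length]                       # trim the tail
    return samples + [0.0] * (length - len(samples))  # pad with silence


if __name__ == "__main__":
    clip = [0.5] * (23 * SAMPLE_RATE)  # stand-in for a 23 s clip at 16 kHz
    print(len(pad_or_trim_sketch(clip)))
```

If the repo's helper behaves this way, a 23 s clip such as test.wav is padded with 7 s of silence, and anything beyond 30 s is dropped unless the audio is split into windows first.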
## Credits
This is an fp16 version of ``openai/whisper-medium``.