	---
	license: apache-2.0
	pipeline_tag: automatic-speech-recognition
	tags:
	- pytorch
	- audio
	- speech
	- automatic-speech-recognition
	- whisper
	- wav2vec2

model-index:
- name: whisper_medium_fp16_transformers
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      type: librispeech_asr
      name: LibriSpeech (clean)
      config: clean
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 0
      name: Test WER
      description: Word Error Rate
    - type: mer
      value: 0
      name: Test MER
      description: Match Error Rate
    - type: wil
      value: 0
      name: Test WIL
      description: Word Information Lost
    - type: wip
      value: 0
      name: Test WIP
      description: Word Information Preserved
    - type: cer
      value: 0
      name: Test CER
      description: Character Error Rate

  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      type: librispeech_asr
      name: LibriSpeech (other)
      config: other
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 0
      name: Test WER
      description: Word Error Rate
    - type: mer
      value: 0
      name: Test MER
      description: Match Error Rate
    - type: wil
      value: 0
      name: Test WIL
      description: Word Information Lost
    - type: wip
      value: 0
      name: Test WIP
      description: Word Information Preserved
    - type: cer
      value: 0
      name: Test CER
      description: Character Error Rate

  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      type: mozilla-foundation/common_voice_14_0
      name: Common Voice (14.0) (Hindi)
      config: hi
      split: test
      args:
        language: hi
    metrics:
    - type: wer
      value: 54.97
      name: Test WER
      description: Word Error Rate
    - type: mer
      value: 47.86
      name: Test MER
      description: Match Error Rate
    - type: wil
      value: 66.83
      name: Test WIL
      description: Word Information Lost
    - type: wip
      value: 33.16
      name: Test WIP
      description: Word Information Preserved
    - type: cer
      value: 30.23
      name: Test CER
      description: Character Error Rate

widget:
- example_title: Hinglish Sample
  src: https://huggingface.co/devasheeshG/whisper_medium_fp16_transformers/resolve/main/test.wav
- example_title: Librispeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: Librispeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac

	language:
	- en
	- zh
	- de
	- es
	- ru
	- ko
	- fr
	- ja
	- pt
	- tr
	- pl
	- ca
	- nl
	- ar
	- sv
	- it
	- id
	- hi
	- fi
	- vi
	- he
	- uk
	- el
	- ms
	- cs
	- ro
	- da
	- hu
	- ta
	- 'no'
	- th
	- ur
	- hr
	- bg
	- lt
	- la
	- mi
	- ml
	- cy
	- sk
	- te
	- fa
	- lv
	- bn
	- sr
	- az
	- sl
	- kn
	- et
	- mk
	- br
	- eu
	- is
	- hy
	- ne
	- mn
	- bs
	- kk
	- sq
	- sw
	- gl
	- mr
	- pa
	- si
	- km
	- sn
	- yo
	- so
	- af
	- oc
	- ka
	- be
	- tg
	- sd
	- gu
	- am
	- yi
	- lo
	- uz
	- fo
	- ht
	- ps
	- tk
	- nn
	- mt
	- sa
	- lb
	- my
	- bo
	- tl
	- mg
	- as
	- tt
	- haw
	- ln
	- ha
	- ba
	- jw
	- su
	---
	## Versions:

	- CUDA: 12.1
	- cuDNN Version: 8.9.2.26_1.0-1_amd64

	* tensorflow Version: 2.12.0
	* torch Version: 2.1.0.dev20230606+cu12135
	* transformers Version: 4.30.2
	* accelerate Version: 0.20.3

	## Model Benchmarks:

	- RAM: 2.8 GB (Original_Model: 5.5GB)
	- VRAM: 1812 MB (Original_Model: 6GB)
	- test.wav: 23 s (Multilingual Speech i.e. English+Hindi)
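
As a rough sanity check on the memory figures above, the size of the weights alone can be estimated from the parameter count. This is a back-of-the-envelope sketch, not the method used to obtain the measurements: it assumes the roughly 769 M parameters that the upstream Whisper documentation lists for the medium model, and it ignores activations, the CUDA context, and framework overhead, so measured RAM and VRAM will be higher.

```python
# Approximate parameter count of Whisper medium (assumption: upstream figure of ~769 M)
PARAMS = 769_000_000

# Storage cost per parameter for each precision
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_size_gb(num_params, dtype):
    """Size of the weights alone in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(dtype, round(weight_size_gb(PARAMS, dtype), 2), "GB")
```

Halving the bytes per parameter halves the weight storage, which is in line with the roughly 2x RAM reduction reported above.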

- Processing time in seconds on each device:

| Device Name       | float32 (Original) | float16 | CUDA Cores | Tensor Cores |
| ----------------- | ------------------ | ------- | ---------- | ------------ |
| 3060              | 1.7                | 1.1     | 3,584      | 112          |
| 1660 Super        | OOM                | 3.3     | 1,408      | N/A          |
| Colab (Tesla T4)  | 2.8                | 2.2     | 2,560      | 320          |
| Colab (CPU)       | 35                 | N/A     | N/A        | N/A          |
| M1 (CPU)          | -                  | -       | -          | -            |
| M1 (GPU -> 'mps') | -                  | -       | -          | -            |


- NOTE: Tensor Cores are efficient at mixed-precision calculations.
- CPU: torch.float16 is not supported on CPU (AMD Ryzen 5 3600 or Colab CPU).
- Punctuation: True
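
Timings like those in the table can be collected with a small harness. The sketch below is one possible way to do it, not necessarily how the numbers above were obtained: `time_call` is a hypothetical helper, and the `sum` call is only a stand-in workload. With the model loaded, one would pass `model.transcribe` and the audio array instead.

```python
import time
from statistics import mean


def time_call(fn, *args, warmup=1, repeats=3):
    """Return the mean wall-clock seconds of fn(*args) over `repeats` runs."""
    for _ in range(warmup):
        fn(*args)  # discarded: the first call may include one-off setup cost
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        durations.append(time.perf_counter() - start)
    return mean(durations)


# Stand-in workload (assumption: replace with model.transcribe and the loaded audio)
elapsed = time_call(sum, range(100_000))
print(f"{elapsed:.4f} s")
```

The warm-up run is kept out of the average because the first transcription is noticeably slower than later ones.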

	## Model Error Benchmarks:

	- WER: Word Error Rate
	- MER: Match Error Rate
	- WIL: Word Information Lost
	- WIP: Word Information Preserved
	- CER: Character Error Rate
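
To make the five metrics concrete, here is a from-scratch sketch of their usual definitions in terms of hits (H), substitutions (S), deletions (D), and insertions (I) from a minimum-edit alignment. It is an illustration only, not the code behind the numbers below; `align_counts` and `error_metrics` are hypothetical names, and text normalisation (casing, punctuation) is left out.

```python
def align_counts(ref, hyp):
    """Return (hits, subs, dels, ins) for a minimum-edit alignment of two sequences."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # dp[i][j] = (cost, hits, subs, dels, ins) for ref[:i] versus hyp[:j]
    dp = [[None] * cols for _ in range(rows)]
    dp[0][0] = (0, 0, 0, 0, 0)
    for i in range(1, rows):
        dp[i][0] = (i, 0, 0, i, 0)  # only deletions
    for j in range(1, cols):
        dp[0][j] = (j, 0, 0, 0, j)  # only insertions
    for i in range(1, rows):
        for j in range(1, cols):
            c, h, s, d, n = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (c, h + 1, s, d, n)      # hit
            else:
                diag = (c + 1, h, s + 1, d, n)  # substitution
            c, h, s, d, n = dp[i - 1][j]
            delete = (c + 1, h, s, d + 1, n)
            c, h, s, d, n = dp[i][j - 1]
            insert = (c + 1, h, s, d, n + 1)
            dp[i][j] = min(diag, delete, insert, key=lambda t: t[0])
    return dp[-1][-1][1:]


def error_metrics(reference, hypothesis):
    """Compute WER, MER, WIL, WIP, and CER as fractions (multiply by 100 for percent)."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    h, s, d, i = align_counts(ref_words, hyp_words)
    wer = (s + d + i) / len(ref_words)
    mer = (s + d + i) / (h + s + d + i)
    wip = (h / len(ref_words)) * (h / len(hyp_words)) if hyp_words else 0.0
    # Character level: spaces count as characters in this sketch
    _, cs, cd, ci = align_counts(list(reference), list(hypothesis))
    cer = (cs + cd + ci) / len(reference)
    return {"wer": wer, "mer": mer, "wil": 1 - wip, "wip": wip, "cer": cer}


metrics = error_metrics("the cat sat on the mat", "the cat sit on mat")
print({name: round(value, 4) for name, value in metrics.items()})
```

In the example, one word is substituted and one is deleted out of six reference words, so WER is 2/6. WIL and WIP always sum to 1, which matches the tables in this section (for example 66.83 and 33.16, up to rounding).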

	### Hindi to Hindi (test.tsv) [Common Voice 14.0](https://commonvoice.mozilla.org/en/datasets)

Tested on an RTX 3060 with 2557 samples.

|                         | WER   | MER   | WIL   | WIP   | CER   |
| ----------------------- | ----- | ----- | ----- | ----- | ----- |
| Original_Model (54 min) | 52.02 | 47.86 | 66.82 | 33.17 | 23.76 |
| This_Model (38 min)     | 54.97 | 47.86 | 66.83 | 33.16 | 30.23 |

	### Hindi to English (test.csv) [Custom Dataset](https://huggingface.co/datasets/devasheeshG/common_voices_14_0_hi2en_hi2hi)

Tested on an RTX 3060 with 1000 samples.

|                         | WER | MER | WIL | WIP | CER |
| ----------------------- | --- | --- | --- | --- | --- |
| Original_Model (30 min) | -   | -   | -   | -   | -   |
| This_Model (20 min)     | -   | -   | -   | -   | -   |

	### English ([LibriSpeech](https://huggingface.co/datasets/librispeech_asr) -> test-clean)

Tested on an RTX 3060 with __ samples.

|                | WER | MER | WIL | WIP | CER |
| -------------- | --- | --- | --- | --- | --- |
| Original_Model | -   | -   | -   | -   | -   |
| This_Model     | -   | -   | -   | -   | -   |

	### English ([LibriSpeech](https://huggingface.co/datasets/librispeech_asr) -> test-other)

Tested on an RTX 3060 with __ samples.

|                | WER | MER | WIL | WIP | CER |
| -------------- | --- | --- | --- | --- | --- |
| Original_Model | -   | -   | -   | -   | -   |
| This_Model     | -   | -   | -   | -   | -   |

- All metrics are calculated with the `jiwer` library.

	## Code for conversion:

- ### [Will soon be uploaded to GitHub](https://github.com/devasheeshG)

	## Usage

This repo contains an ``__init__.py`` file with all the code needed to use this model.

First, clone this repo so that all of its files sit inside one folder.

	### Make sure you have git-lfs installed (https://git-lfs.com)

	```bash
	git lfs install
	git clone https://huggingface.co/devasheeshG/whisper_medium_fp16_transformers
	```

Then run the following steps, preferably in a Jupyter notebook.

	```python
	# Import the Model
	from whisper_medium_fp16_transformers import Model, load_audio, pad_or_trim
	```

	```python
# Initialise the model
model = Model(
    model_name_or_path='whisper_medium_fp16_transformers',
    cuda_visible_device="0",
    device='cuda',
)
	```

	```python
	# Load Audio
	audio = load_audio('whisper_medium_fp16_transformers/test.wav')
	audio = pad_or_trim(audio)
	```

	```python
# Transcribe (the first transcription takes longer than later ones)
	model.transcribe(audio)
	```

	## Credits

This is an fp16 version of ``openai/whisper-medium``.