Model Card — Wav2Vec2 Large XLSR-53 Icelandic (Spjallrómur Fine-tune)
Model Details
Model Description
This model is a fine-tuned version of
language-and-voice-lab/wav2vec2-large-xlsr-53-icelandic-ep30-967h
for Automatic Speech Recognition (ASR) in Icelandic, with a focus on
conversational and spontaneous speech. It was further fine-tuned on the
Spjallrómur 26.03 corpus — an
Icelandic conversational speech dataset — to improve robustness on informal,
dialogue-style audio.
The base model is itself a fine-tune of
facebook/wav2vec2-large-xlsr-53,
trained for 30 epochs on 967 hours of Icelandic read speech from
Samrómur Milljón.
- Developed by: Páll Rúnarsson, Research Associate
- Funded by: Almannarómur (Language Technology Programme for Icelandic)
- Shared by: Language and Voice Laboratory, Reykjavík University
- Model type: Automatic Speech Recognition (Transformer CTC)
- Language(s): Icelandic (is)
- License: CC BY-SA 4.0. Free to use, share, and adapt, provided you give appropriate credit and distribute any derivatives under the same license.
- Fine-tuned from:
language-and-voice-lab/wav2vec2-large-xlsr-53-icelandic-ep30-967h
Model Sources
- Repository: (add link)
- Demo: (add link)
Uses
Direct Use
This model is intended for transcription of Icelandic speech, particularly
conversational and spontaneous speech. It can be used directly via the
🤗 Transformers pipeline API or integrated into larger ASR pipelines.
Downstream Use
The model may be fine-tuned further for domain-specific Icelandic ASR tasks (e.g., legal, broadcast, or medical transcription) where spontaneous speech patterns are common. Any derivative models must be released under the same CC BY-SA 4.0 license with appropriate attribution.
Out-of-Scope Use
- Languages other than Icelandic
- Highly technical or domain-specific jargon without additional fine-tuning
- Real-time streaming inference without appropriate latency optimisations
Bias, Risks, and Limitations
- Performance may degrade on heavily accented, dialectal, or code-switched speech (e.g., Icelandic/English or Icelandic/Danish mixing).
- The Spjallrómur training data reflects particular speaker demographics; underrepresented groups may see higher error rates.
- As with all ASR systems, proper nouns, rare words, and domain-specific terminology present challenges.
Recommendations
Users should evaluate the model on their target domain before deployment. For sensitive applications (legal, medical), human review of transcripts is strongly recommended.
How to Get Started with the Model
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="language-and-voice-lab/<your-model-id>",
)
result = asr("audio.wav")
print(result["text"])
For more control:
import librosa
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

MODEL_NAME = "language-and-voice-lab/<your-model-id>"
# Fall back to CPU when no GPU is available.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

processor = Wav2Vec2Processor.from_pretrained(MODEL_NAME)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_NAME).to(DEVICE)

def transcribe(audio_path):
    # Load the audio as mono and resample it to 16 kHz.
    audio, sr = librosa.load(audio_path, sr=16_000)
    input_values = processor(
        audio, sampling_rate=sr, return_tensors="pt"
    ).input_values.to(DEVICE)
    with torch.no_grad():
        logits = model(input_values).logits
    # Greedy CTC decoding: take the most likely token per frame;
    # decode() then merges repeated tokens and drops blanks.
    pred_ids = torch.argmax(logits, dim=-1)
    return processor.decode(pred_ids[0])

print(transcribe("audio.wav"))
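The argmax-then-decode step above is greedy CTC decoding. As a rough illustration of what the decode call does with the per-frame predictions, the sketch below merges repeated ids and then removes blanks. The function name `ctc_greedy_collapse`, the toy vocabulary, and the blank id of 0 are assumptions made for this example; the real vocabulary and blank (padding) token id come from the model's tokenizer.

```python
from itertools import groupby

def ctc_greedy_collapse(ids, blank_id=0):
    """Collapse a frame-level CTC path: merge repeated ids, then drop blanks."""
    merged = [token for token, _ in groupby(ids)]
    return [token for token in merged if token != blank_id]

# Hypothetical toy vocabulary; the real one ships with the model's tokenizer.
vocab = {0: "<pad>", 1: "h", 2: "a", 3: "l", 4: "ó"}

# One predicted id per audio frame. The blank between the two runs of "l"
# is what lets the double letter survive the merge step.
frame_ids = [1, 1, 0, 2, 2, 0, 3, 0, 3, 3, 4, 0]

text = "".join(vocab[i] for i in ctc_greedy_collapse(frame_ids))
print(text)  # halló
```

Greedy decoding is fast but ignores language context; a beam search with an external language model is a common alternative, which this card does not describe.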
Training Details
Training Data
Fine-tuned on Spjallrómur 26.03, an Icelandic conversational speech corpus published via CLARIN-IS. The corpus is 21 hours and 20 minutes long, covering 54 conversations with 102 speakers, collected by Reykjavík University between September 2020 and September 2021. This version includes additional manual transcriptions with turn-level timestamps and speaker labels for 21 full conversations.
The base model checkpoint used as the starting point was trained for 30 epochs on 967 hours of Icelandic read speech from Samrómur Milljón.
Training Procedure
Starting checkpoint:
language-and-voice-lab/wav2vec2-large-xlsr-53-icelandic-ep30-967h
Training Hyperparameters
- Training regime: (add: fp16/bf16, batch size, learning rate, epochs/steps, warmup, etc.)
Speeds, Sizes, Times
- (add: GPU type, training duration, model size)
Evaluation
Testing Data
The model was evaluated on the Spjallrómur test set — a held-out partition of conversational Icelandic speech not seen during fine-tuning.
Metrics
Word Error Rate (WER) is used as the primary evaluation metric, computed after normalising both reference and hypothesis transcripts.
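WER is conventionally defined as (S + D + I) / N, where S, D, and I are the word-level substitutions, deletions, and insertions needed to turn the hypothesis into the reference, and N is the number of reference words. The sketch below shows one way to compute it. The `normalise` function (lowercasing and punctuation removal) is an assumption for illustration only; the exact normalisation used for the reported results is not specified in this card.

```python
import re

def normalise(text):
    """Assumed normalisation: lowercase, strip punctuation, split on whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return text.split()

def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalise(reference), normalise(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("daginn" -> "dag") over five reference words.
print(word_error_rate("Góðan daginn, hvað segir þú?", "góðan dag hvað segir þú"))  # 0.2
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.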
Results
| Model | Test WER |
|---|---|
| Base model (no Spjallrómur fine-tuning) | 65.29% |
| This model (Spjallrómur fine-tuned) | 29.5% |
Fine-tuning on Spjallrómur reduces test WER by 35.8 percentage points in absolute terms, a 55% relative reduction, on conversational Icelandic speech.
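The absolute and relative figures follow directly from the two WER values in the table; the short check below reproduces them. The variable names are chosen for this example only.

```python
# Test WER values taken from the results table, in percent.
base_wer = 65.29
finetuned_wer = 29.5

absolute = base_wer - finetuned_wer   # reduction in percentage points
relative = absolute / base_wer * 100  # reduction relative to the base model

print(f"{absolute:.1f} pp absolute, {relative:.0f}% relative")
# 35.8 pp absolute, 55% relative
```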
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: (add)
- Hours used: (add)
- Cloud Provider: (add, or "on-premise")
- Compute Region: Iceland
- Carbon Emitted: (add)
Technical Specifications
Model Architecture and Objective
Wav2Vec2 Large XLSR-53 is a Transformer-based model pre-trained with self-supervision on multilingual speech and fine-tuned here with a CTC (Connectionist Temporal Classification) objective for ASR. The architecture is unchanged from the base checkpoint; only the weights are adapted by fine-tuning on Icelandic conversational data.
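For reference, the CTC objective in its standard textbook form is shown below. The notation is an assumption of this sketch rather than something taken from the training code: x is the input audio, y the target character sequence, π a frame-level path of length T over the vocabulary plus the blank symbol, and B the collapsing function that merges repeated symbols and removes blanks.

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{y})} \prod_{t=1}^{T} p(\pi_t \mid \mathbf{x}),
\qquad
\mathcal{L}_{\mathrm{CTC}} = -\log p(\mathbf{y} \mid \mathbf{x})
```

The sum runs over every frame-level path that collapses to the target transcript, so the model can be trained without frame-level alignments between audio and text.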
Compute Infrastructure
Training was conducted at the Language and Voice Laboratory (lvl.ru.is), Reykjavík University, Iceland.
Citation
If you use this model, please cite this work, the Spjallrómur corpus, and the base model:
@misc{runarsson2026wav2vec2,
author = {Rúnarsson, Páll},
title = {Wav2Vec2 Large XLSR-53 Icelandic Fine-tuned on Spjallrómur},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/language-and-voice-lab/<your-model-id>}}
}
@misc{fong2026spjallromur,
author = {Fong, Judy Y. and Borsky, Michal and Runarsson, Pall
and Hedström, Staffan and Jónsson, Ólafur Helgi
and Hólmfriðardóttir, Lára Margrét H. and Þorsteinsdóttir, Sunneva
and Eiríksdóttir, Málfriður Anna and Mollberg, David Erik
and Magnúsdóttir, Eydís Huld and Þórhallsdóttir, Ragnheiður
and Gudnason, Jon},
title = {Spjallromur 26.03 -- Icelandic Conversational Speech},
year = {2026},
publisher = {CLARIN-IS / Reykjavík University},
howpublished = {\url{http://hdl.handle.net/20.500.12537/379}}
}
@inproceedings{mena2024samromur,
title = {Samr{\'o}mur Millj{\'o}n: An ASR Corpus of One Million Verified
Read Prompts in Icelandic},
author = {Mena, Carlos Daniel Hernandez and Gunnarsson, {\TH}orsteinn
Da{\dh}i and Gu{\dh}nason, J{\'o}n},
booktitle = {Proceedings of the 2024 Joint International Conference on
Computational Linguistics, Language Resources and Evaluation
(LREC-COLING 2024)},
pages = {14305--14312},
year = {2024}
}
Acknowledgements
This work was carried out at the Language and Voice Laboratory (lvl.ru.is) at Reykjavík University, Iceland, under the supervision of Jón Guðnason and Michal Borsky.
This work was funded through the Language Technology Programme for Icelandic, which is managed and coordinated by Almannarómur and funded by the Icelandic Ministry of Education, Science and Culture.
Model Card Author
Páll Rúnarsson, Language and Voice Laboratory, Reykjavík University
Model Card Contact
(add contact email or HF profile link)