Model Card — Wav2Vec2 Large XLSR-53 Icelandic (Spjallrómur Fine-tune)

Model Details

Model Description

This model adapts language-and-voice-lab/wav2vec2-large-xlsr-53-icelandic-ep30-967h to Automatic Speech Recognition (ASR) of conversational and spontaneous Icelandic speech. It was fine-tuned on the Spjallrómur 26.03 corpus, an Icelandic conversational speech dataset, to improve robustness on informal, dialogue-style audio.

The base model is itself a fine-tune of facebook/wav2vec2-large-xlsr-53, trained for 30 epochs on 967 hours of Icelandic read speech from Samrómur Milljón.

Developed by: Páll Rúnarsson, Research Associate
Funded by: Almannarómur (Language Technology Programme for Icelandic)
Shared by: Language and Voice Laboratory, Reykjavík University
Model type: Automatic Speech Recognition (Transformer CTC)
Language(s): Icelandic (is)
License: CC BY-SA 4.0 — free to use, share, and adapt, provided you give appropriate credit and distribute any derivatives under the same license.
Fine-tuned from: language-and-voice-lab/wav2vec2-large-xlsr-53-icelandic-ep30-967h

Model Sources

Repository: (add link)
Demo: (add link)

Uses

Direct Use

This model is intended for transcription of Icelandic speech, particularly conversational and spontaneous speech. It can be used directly via the 🤗 Transformers pipeline API or integrated into larger ASR pipelines.

Downstream Use

The model may be fine-tuned further for domain-specific Icelandic ASR tasks (e.g., legal, broadcast, or medical transcription) where spontaneous speech patterns are common. Any derivative models must be released under the same CC BY-SA 4.0 license with appropriate attribution.

Out-of-Scope Use

Languages other than Icelandic
Highly technical or domain-specific jargon without additional fine-tuning
Real-time streaming inference without appropriate latency optimisations

Bias, Risks, and Limitations

Performance may degrade on heavily accented, dialectal, or code-switched speech (e.g., Icelandic/English or Icelandic/Danish mixing).
The Spjallrómur training data reflects particular speaker demographics; underrepresented groups may see higher error rates.
As with all ASR systems, proper nouns, rare words, and domain-specific terminology present challenges.

Recommendations

Users should evaluate the model on their target domain before deployment. For sensitive applications (legal, medical), human review of transcripts is strongly recommended.

How to Get Started with the Model

from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="language-and-voice-lab/<your-model-id>"
)

result = asr("audio.wav")
print(result["text"])

For more control:

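import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL_NAME = "language-and-voice-lab/<your-model-id>"

# Use the GPU when one is available; otherwise fall back to CPU.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

processor = Wav2Vec2Processor.from_pretrained(MODEL_NAME)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_NAME).to(DEVICE)

def transcribe(audio_path):
    # The model expects 16 kHz mono audio; librosa resamples on load.
    audio, sr = librosa.load(audio_path, sr=16_000)
    input_values = processor(
        audio, sampling_rate=sr, return_tensors="pt"
    ).input_values.to(DEVICE)

    # Inference only, so gradients are not needed.
    with torch.no_grad():
        logits = model(input_values).logits

    # Greedy CTC decoding: most likely token per frame, collapsed by the processor.
    pred_ids = torch.argmax(logits, dim=-1)
    return processor.decode(pred_ids[0])

print(transcribe("audio.wav"))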

Training Details

Training Data

Fine-tuned on Spjallrómur 26.03, an Icelandic conversational speech corpus published via CLARIN-IS. The corpus is 21 hours and 20 minutes long, covering 54 conversations with 102 speakers, collected by Reykjavík University between September 2020 and September 2021. This version includes additional manual transcriptions with turn-level timestamps and speaker labels for 21 full conversations.

The base model checkpoint used as the starting point was trained for 30 epochs on 967 hours of Icelandic read speech from Samrómur Milljón.

Training Procedure

Starting checkpoint: language-and-voice-lab/wav2vec2-large-xlsr-53-icelandic-ep30-967h

Training Hyperparameters

Training regime: (add: fp16/bf16, batch size, learning rate, epochs/steps, warmup, etc.)

Speeds, Sizes, Times

(add: GPU type, training duration, model size)

Evaluation

Testing Data

The model was evaluated on the Spjallrómur test set — a held-out partition of conversational Icelandic speech not seen during fine-tuning.

Metrics

Word Error Rate (WER) is used as the primary evaluation metric, computed after normalising both reference and hypothesis transcripts.
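The card does not spell out the normalisation rules, so the following is only a minimal sketch of how a normalised corpus-level WER could be computed. The rules in `normalise` (lowercasing, replacing punctuation with spaces, collapsing whitespace) are assumptions made for illustration, and the example sentences are invented rather than taken from Spjallrómur.

```python
import re


def normalise(text: str) -> str:
    """Assumed normalisation: lowercase, drop punctuation, collapse whitespace."""
    text = text.lower()
    # \w is Unicode-aware for str patterns, so Icelandic letters (þ, ð, æ, ö) are kept.
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref = normalise(reference).split()
    hyp = normalise(hypothesis).split()
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1], len(ref)


def corpus_wer(pairs) -> float:
    """Total word errors divided by total reference words over all pairs."""
    errors = 0
    words = 0
    for reference, hypothesis in pairs:
        e, n = word_errors(reference, hypothesis)
        errors += e
        words += n
    if words == 0:
        raise ValueError("reference transcripts contain no words")
    return errors / words


# Invented example pairs (reference, hypothesis), for illustration only.
pairs = [
    ("Góðan daginn, hvað segirðu?", "góðan dag hvað segirðu"),
    ("Allt gott.", "allt gott"),
]
print(f"WER: {corpus_wer(pairs):.2%}")
```

Summing errors and reference words over the whole test set, rather than averaging per-utterance WER, keeps short turns from dominating the score; whether the reported figures were aggregated this way is not stated in the card.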

Results

Model	Test WER
Base model (no Spjallrómur fine-tuning)	65.29%
This model (Spjallrómur fine-tuned)	29.5%

This is an absolute reduction of about 35.8 percentage points (from 65.29% to 29.5%), or roughly a 55% relative reduction in WER, on conversational Icelandic speech.
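The two improvement figures follow directly from the WER values in the results table. The short sketch below reproduces the arithmetic; the function name `wer_improvement` is illustrative only.

```python
def wer_improvement(base_wer: float, new_wer: float) -> tuple[float, float]:
    """Return (absolute reduction in percentage points, relative reduction in %)."""
    absolute = base_wer - new_wer
    relative = absolute / base_wer * 100
    return absolute, relative


# WER values taken from the results table above.
absolute, relative = wer_improvement(base_wer=65.29, new_wer=29.5)
print(f"{absolute:.2f} percentage points absolute, {relative:.1f}% relative")
# prints "35.79 percentage points absolute, 54.8% relative"
```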

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Hardware Type: (add)
Hours used: (add)
Cloud Provider: (add, or "on-premise")
Compute Region: Iceland
Carbon Emitted: (add)
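Once the fields above are filled in, a rough estimate can be obtained by multiplying energy use by the carbon intensity of the local grid. The sketch below is a simplified approximation of that idea, not the calculator's exact method. The function name and all input values are hypothetical placeholders, not measurements from this training run.

```python
def estimate_co2e_kg(avg_power_watts: float, hours: float,
                     grid_kg_co2e_per_kwh: float) -> float:
    """Rough estimate: energy used (kWh) times grid carbon intensity (kg CO2e/kWh)."""
    if min(avg_power_watts, hours, grid_kg_co2e_per_kwh) < 0:
        raise ValueError("inputs must be non-negative")
    energy_kwh = avg_power_watts / 1000 * hours
    return energy_kwh * grid_kg_co2e_per_kwh


# Hypothetical placeholder inputs; replace with the real hardware power draw,
# training duration, and carbon intensity for the compute region.
estimate = estimate_co2e_kg(avg_power_watts=300, hours=10, grid_kg_co2e_per_kwh=0.5)
print(f"Estimated emissions: {estimate:.2f} kg CO2e")
```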

Technical Specifications

Model Architecture and Objective

Wav2Vec2 Large XLSR-53 — a transformer-based self-supervised model fine-tuned with a CTC (Connectionist Temporal Classification) objective for ASR. The architecture is unchanged from the base checkpoint; only the weights are adapted via fine-tuning on Icelandic conversational data.
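To make the CTC output format concrete, here is a minimal sketch of greedy CTC decoding: the model emits one token per audio frame, consecutive repeats are merged, and blank tokens are dropped. The three-entry vocabulary and the blank id of 0 are hypothetical and chosen for illustration; the real tokenizer has its own vocabulary and blank token.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id):
    """Merge consecutive repeated ids, then drop the blank id."""
    merged = [token for token, _ in groupby(frame_ids)]
    return [token for token in merged if token != blank_id]


# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank.
vocab = {0: "<blank>", 1: "j", 2: "á"}

# Per-frame argmax ids, as they might come out of the model for a short word.
frames = [0, 1, 1, 0, 0, 2, 2, 2, 0]
ids = ctc_greedy_collapse(frames, blank_id=0)
print("".join(vocab[i] for i in ids))  # prints "já"
```

The blank token is what allows genuinely doubled letters to survive: the frame sequence `[1, 0, 1]` collapses to `[1, 1]`, while `[1, 1]` collapses to `[1]`.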

Compute Infrastructure

Training was conducted at the Language and Voice Laboratory (lvl.ru.is), Reykjavík University, Iceland.

Citation

If you use this model, please cite this work, the Spjallrómur corpus, and the Samrómur Milljón corpus on which the base model was trained:

@misc{runarsson2026wav2vec2,
  author       = {Rúnarsson, Páll},
  title        = {Wav2Vec2 Large XLSR-53 Icelandic Fine-tuned on Spjallrómur},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/language-and-voice-lab/<your-model-id>}}
}

@misc{fong2026spjallromur,
  author       = {Fong, Judy Y. and Borsky, Michal and Runarsson, Pall
                  and Hedström, Staffan and Jónsson, Ólafur Helgi
                  and Hólmfriðardóttir, Lára Margrét H. and Þorsteinsdóttir, Sunneva
                  and Eiríksdóttir, Málfriður Anna and Mollberg, David Erik
                  and Magnúsdóttir, Eydís Huld and Þórhallsdóttir, Ragnheiður
                  and Gudnason, Jon},
  title        = {Spjallromur 26.03 -- Icelandic Conversational Speech},
  year         = {2026},
  publisher    = {CLARIN-IS / Reykjavík University},
  howpublished = {\url{http://hdl.handle.net/20.500.12537/379}}
}

@inproceedings{mena2024samromur,
  title     = {Samr{\'o}mur Millj{\'o}n: An ASR Corpus of One Million Verified
               Read Prompts in Icelandic},
  author    = {Mena, Carlos Daniel Hernandez and Gunnarsson, {\TH}orsteinn
               Da{\dh}i and Gu{\dh}nason, J{\'o}n},
  booktitle = {Proceedings of the 2024 Joint International Conference on
               Computational Linguistics, Language Resources and Evaluation
               (LREC-COLING 2024)},
  pages     = {14305--14312},
  year      = {2024}
}

Acknowledgements

This work was carried out at the Language and Voice Laboratory (lvl.ru.is) at Reykjavík University, Iceland, under the supervision of Jón Guðnason and Michal Borsky.

Funded through the Language Technology Programme for Icelandic, which is managed and coordinated by Almannarómur and funded by the Icelandic Ministry of Education, Science and Culture.

Model Card Author

Páll Rúnarsson, Language and Voice Laboratory, Reykjavík University

Model Card Contact

(add contact email or HF profile link)
