mt-en-so-helpline-v1

A fine-tuned version of Helsinki-NLP/opus-mt-synthetic-en-so for English → Somali machine translation, developed as part of the OpenCHS child helpline AI pipeline.

Model Description

This model adapts the MarianMT checkpoint opus-mt-synthetic-en-so for English-to-Somali translation. It was fine-tuned on a combination of 14 multilingual datasets, with a focus on improving coverage of the everyday and domain-specific Somali vocabulary that arises in child helpline contexts.

  • Base model: Helsinki-NLP/opus-mt-synthetic-en-so
  • Architecture: MarianMT (Transformer encoder-decoder, 6 layers, d_model=512)
  • Language pair: English → Somali (en → so)
  • Task: Sequence-to-sequence translation

Intended Use

  • English-to-Somali translation for child helpline call transcriptions
  • Supporting Somali-speaking communities through AI-assisted helpline operations
  • Downstream component in the OpenCHS multilingual NLP pipeline

Training Details

Parameter           Value
Base model          Helsinki-NLP/opus-mt-synthetic-en-so
Epochs              5
Optimizer           AdamW
Learning rate       3e-6
Batch size          16
Train samples/sec   386.8
Training time       ~3.9 hours
Final train loss    3.651
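
The throughput and wall-clock figures above imply a rough training-set size. The following back-of-the-envelope sketch assumes that throughput was constant, that the ~3.9 hours cover training only, and that each optimizer step used a single batch of 16 (no gradient accumulation). None of these is stated in this card, so the results are estimates only.

```python
# Figures copied from the Training Details table.
SAMPLES_PER_SEC = 386.8
TRAINING_HOURS = 3.9      # reported as "~3.9", so everything below is approximate
EPOCHS = 5
BATCH_SIZE = 16

def estimate_training_volume(samples_per_sec, hours, epochs, batch_size):
    """Estimate samples seen in total, samples per epoch, and steps per epoch."""
    total_samples = samples_per_sec * hours * 3600
    samples_per_epoch = total_samples / epochs
    steps_per_epoch = samples_per_epoch / batch_size
    return total_samples, samples_per_epoch, steps_per_epoch

total, per_epoch, steps = estimate_training_volume(
    SAMPLES_PER_SEC, TRAINING_HOURS, EPOCHS, BATCH_SIZE
)
print(f"total samples seen: {total:,.0f}")
print(f"samples per epoch:  {per_epoch:,.0f}")
print(f"steps per epoch:    {steps:,.0f}")
```

Under these assumptions the run processed roughly 5.4 million samples in total, or about 1.1 million sentence pairs per epoch.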

Training Datasets

Dataset                      Description
NLLB                         Meta's No Language Left Behind (up to 500K samples)
CCAligned                    Web-crawled aligned corpus
CCMatrix                     Cross-lingual web mining
XLENT                        Cross-lingual entity data
Bible UEDIN                  Biblical text parallel corpus
Infopankki                   Public information texts
ParaCrawl / ParaCrawl Bonus  Web-crawled parallel data
TED2020                      TED talk transcriptions
Tanzil                       Quran translations
Wikimedia                    Wikipedia and related content
GNOME                        Software localization strings
OpenSubtitles                Subtitle parallel data
TICO-19 (dev)                COVID-19 domain data
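
The table mentions a cap of up to 500K samples for NLLB. This card does not describe how the datasets were actually combined, so the following is only a sketch of one way a capped mix could be assembled. The function names, the fixed seed, and the uniform random sampling are all assumptions made for illustration.

```python
import random

def cap_corpus(pairs, max_samples, seed=0):
    """Return at most max_samples (source, target) pairs, sampled without replacement."""
    if len(pairs) <= max_samples:
        return list(pairs)
    # A fixed seed keeps the subsample reproducible (an assumption, not from the card).
    return random.Random(seed).sample(pairs, max_samples)

def combine(corpora, caps, default_cap=None):
    """Concatenate corpora, applying a per-corpus cap where one is given."""
    mixed = []
    for name, pairs in corpora.items():
        cap = caps.get(name, default_cap)
        mixed.extend(pairs if cap is None else cap_corpus(pairs, cap))
    return mixed

# Toy stand-ins for real parallel corpora (placeholder strings, not real data).
corpora = {
    "NLLB": [(f"en sentence {i}", f"so sentence {i}") for i in range(10)],
    "TED2020": [("en sentence a", "so sentence a")],
}
mixed = combine(corpora, caps={"NLLB": 4})
print(len(mixed))
```

In a real run the cap for NLLB would be 500_000, as in the table, and uncapped corpora would be passed through whole.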

Evaluation Results

FLORES-200 devtest (en → so)

Metric Score
BLEU 11.60
chrF 45.87

TICO-19 test (en → so)

Metric Score
BLEU 7.45
chrF 33.07

Note: BLEU scores for low-resource languages such as Somali are typically lower than those for high-resource languages. chrF is generally considered a more reliable metric for morphologically rich languages because it matches character n-grams rather than whole words. The gap between the FLORES-200 and TICO-19 scores likely reflects the specialized medical and crisis vocabulary in TICO-19.
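
Since chrF carries much of the interpretation above, a simplified sketch of how it is computed may help. Character n-gram precision and recall are averaged over n = 1..6 and combined into an F-score with beta = 2, which weights recall more heavily than precision. This is an illustrative sentence-level approximation, not the scoring code behind the reported numbers. The function names are hypothetical, and a standard tool such as sacreBLEU should be used when scores need to be comparable.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count character n-grams, ignoring spaces (a common chrF convention)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level chrF on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short to contain n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)

# Placeholder strings: a near match scores high even though one word differs.
print(chrf("the caller is safe now", "the caller was safe now"))
```

Because partial word matches still share many character n-grams, a hypothesis with the right stem but a different inflection is penalized less than it would be under word-level BLEU.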

How to Use

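from transformers import MarianTokenizer, MarianMTModel

# Hub repository ID (note that it reads "so-en", unlike the card title)
model_name = "Rogendo/mt-so-en-helpline-v1"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def translate(texts):
    """Translate a list of English strings into Somali."""
    # Pad to the longest input in the batch; truncate inputs that exceed the maximum length
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    translated = model.generate(**inputs)
    # Decode generated token IDs back to text, dropping special tokens such as padding
    return tokenizer.batch_decode(translated, skip_special_tokens=True)

# Example
result = translate(["How can I help you today?", "Are you safe right now?"])
print(result)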
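
The tokenizer call above uses truncation=True, so any input longer than the tokenizer's maximum length is cut off, which matters for long call transcriptions. One possible workaround, sketched below, is to split a transcript into sentences and translate them in batches. The helper names are hypothetical, the regex-based sentence splitter is deliberately naive, and translate_fn stands in for a function like translate above.

```python
import re

def split_sentences(text):
    """Naively split on '.', '?' or '!' followed by whitespace (not a real segmenter)."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p for p in parts if p]

def batches(items, size=16):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def translate_transcript(text, translate_fn, batch_size=16):
    """Translate a long transcript sentence by sentence, one batch at a time."""
    translated = []
    for batch in batches(split_sentences(text), batch_size):
        translated.extend(translate_fn(batch))
    return translated

# Demonstration with a stand-in translate function that just upper-cases its input.
demo = translate_transcript(
    "Hello. Are you safe right now? Please stay on the line!",
    translate_fn=lambda sentences: [s.upper() for s in sentences],
    batch_size=2,
)
print(demo)
```

Translating sentence by sentence loses cross-sentence context, so this trades some fluency for complete coverage of the input.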

Limitations

  • BLEU scores (11.60 on FLORES-200, 7.45 on TICO-19) leave considerable room for improvement, especially on domain-specific crisis vocabulary
  • The model performs better on general text (FLORES-200) than on specialized crisis/medical terms (TICO-19)
  • Somali is a low-resource language with limited parallel data available for training
  • Not yet evaluated on actual helpline transcription samples

Next Steps

  • Evaluate on real helpline call transcriptions
  • Domain adaptation with helpline-specific Somali vocabulary
  • Explore larger MarianMT variants or NLLB-200 as base models

Citation

@misc{openchs-mt-so-v1,
  author       = {OpenCHS AI Team},
  title        = {mt-en-so-helpline-v1: English-Somali Translation for Child Helplines},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Rogendo/mt-so-en-helpline-v1}}
}