mt-en-so-helpline-v1
Fine-tuned Helsinki-NLP/opus-mt-synthetic-en-so for English → Somali machine translation, developed as part of the OpenCHS child helpline AI pipeline.
Model Description
This model adapts the Helsinki-NLP/opus-mt-synthetic-en-so checkpoint, which uses the MarianMT architecture, for English-to-Somali translation. It was fine-tuned on a combination of 14 multilingual datasets, with a focus on improving coverage of everyday and domain-specific Somali vocabulary relevant to child helpline contexts.
- Base model: Helsinki-NLP/opus-mt-synthetic-en-so
- Architecture: MarianMT (Transformer encoder-decoder, 6 layers, d_model=512)
- Language pair: English → Somali (en → so)
- Task: Sequence-to-sequence translation
Intended Use
- English-to-Somali translation for child helpline call transcriptions
- Supporting Somali-speaking communities through AI-assisted helpline operations
- Downstream component in the OpenCHS multilingual NLP pipeline
Training Details
| Parameter | Value |
|---|---|
| Base model | Helsinki-NLP/opus-mt-synthetic-en-so |
| Epochs | 5 |
| Optimizer | AdamW |
| Learning rate | 3e-6 |
| Batch size | 16 |
| Train samples/sec | 386.8 |
| Training time | ~3.9 hours |
| Final train loss | 3.651 |
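The throughput, training time, and epoch count in the table can be combined into a rough estimate of how much data the run processed. The sketch below is only back-of-the-envelope arithmetic: it assumes the reported samples/sec is an average over the whole run, that "~3.9 hours" is close to the true wall-clock time, and that the batch size of 16 is the effective batch size (no gradient accumulation). None of these assumptions are confirmed by the card.

```python
# Back-of-the-envelope estimate from the reported training figures.
# Assumption: the reported throughput is an average over the whole run.
samples_per_sec = 386.8
training_hours = 3.9   # reported as approximate
epochs = 5
batch_size = 16        # assumed to be the effective batch size

total_samples = samples_per_sec * training_hours * 3600
samples_per_epoch = total_samples / epochs
steps_per_epoch = samples_per_epoch / batch_size

print(f"samples processed: {total_samples:,.0f}")
print(f"samples per epoch: {samples_per_epoch:,.0f}")
print(f"optimizer steps per epoch: {steps_per_epoch:,.0f}")
```

Under these assumptions the run processed about 5.4 million samples in total, or roughly 1.09 million per epoch (around 68,000 optimizer steps per epoch). Treat this as an order-of-magnitude figure, not a statement of the actual training set size.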
Training Datasets
| Dataset | Description |
|---|---|
| NLLB | Meta's No Language Left Behind (up to 500K samples) |
| CCAligned | Web-crawled aligned corpus |
| CCMatrix | Cross-lingual web mining |
| XLENT | Cross-lingual named-entity pairs mined from web data |
| Bible UEDIN | Biblical text parallel corpus |
| Infopankki | Public information texts |
| ParaCrawl / ParaCrawl Bonus | Web-crawled parallel data |
| TED2020 | TED talk transcriptions |
| Tanzil | Quran translations |
| Wikimedia | Wikipedia and related content |
| GNOME | Software localization strings |
| OpenSubtitles | Subtitle parallel data |
| TICO-19 (dev) | COVID-19 domain data |
Evaluation Results
FLORES-200 devtest (en → so)
| Metric | Score |
|---|---|
| BLEU | 11.60 |
| chrF | 45.87 |
TICO-19 test (en → so)
| Metric | Score |
|---|---|
| BLEU | 7.45 |
| chrF | 33.07 |
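Note: BLEU scores for low-resource languages like Somali are typically lower than those for high-resource languages. chrF, which scores character n-gram overlap, is generally considered more reliable for morphologically rich languages because it gives partial credit when a word differs from the reference only in its inflection. The gap between the FLORES-200 and TICO-19 scores likely reflects the specialized medical and crisis vocabulary in TICO-19.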
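To make the difference between word-level and character-level scoring concrete, here is a minimal, simplified sketch of a sentence-level chrF-style score. It is not the implementation used to produce the numbers above (the card does not say which tool was used), and standard implementations such as sacreBLEU differ in details, so its output will not match reported scores exactly. The function names `char_ngrams` and `simple_chrf` are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of length n, ignoring whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level chrF on a 0-100 scale.

    Precision and recall are averaged over n-gram orders 1..max_n,
    then combined into an F-score that weights recall beta times
    as heavily as precision.
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


# A word-level metric would treat these as a complete mismatch;
# the character-level score still rewards the shared stem.
print(simple_chrf("helped", "helping"))
```

The English pair "helped" / "helping" stands in for the kind of inflectional variation that is common in Somali: an exact word match fails, but the shared characters still contribute to the score.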
How to Use
from transformers import MarianMTModel, MarianTokenizer

model_name = "Rogendo/mt-so-en-helpline-v1"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def translate(texts):
    # Tokenize the batch, padding to the longest sentence and truncating overlong inputs
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    # Generate Somali token IDs, then decode them back to text
    translated = model.generate(**inputs)
    return tokenizer.batch_decode(translated, skip_special_tokens=True)

# Example
result = translate(["How can I help you today?", "Are you safe right now?"])
print(result)
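The translate() function above tokenizes and generates for the whole input list in one batch, so memory use grows with the number of sentences. For longer inputs, such as a call transcription split into many segments, one option is to process the list in fixed-size chunks. The sketch below is an assumption-laden illustration, not part of the released code: `translate_in_batches` and `fake_translate` are hypothetical names, the default chunk size of 16 is arbitrary, and a stand-in function replaces the model so the example runs without downloading weights.

```python
from typing import Callable, List


def translate_in_batches(
    texts: List[str],
    translate_fn: Callable[[List[str]], List[str]],
    batch_size: int = 16,
) -> List[str]:
    """Translate texts in fixed-size chunks, preserving input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    results: List[str] = []
    for start in range(0, len(texts), batch_size):
        results.extend(translate_fn(texts[start:start + batch_size]))
    return results


# Hypothetical stand-in for the model-backed translate() function.
def fake_translate(batch: List[str]) -> List[str]:
    return [f"<so>{text}</so>" for text in batch]


segments = [f"segment {i}" for i in range(40)]
outputs = translate_in_batches(segments, fake_translate, batch_size=16)
print(len(outputs))
```

In real use, the model-backed translate function would be passed in place of `fake_translate`; a suitable chunk size depends on available memory and sentence length.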
Limitations
- BLEU remains low (11.60 on FLORES-200, 7.45 on TICO-19), leaving clear room for improvement, especially on domain-specific crisis vocabulary
- Model performs better on general text (FLORES-200, chrF 45.87) than on specialized crisis/medical terms (TICO-19, chrF 33.07)
- Low-resource language: Somali has limited parallel data available for training
- Not evaluated on actual helpline transcription samples yet
Next Steps
- Evaluate on real helpline call transcriptions
- Domain adaptation with helpline-specific Somali vocabulary
- Explore larger MarianMT variants or NLLB-200 as base models
Citation
@misc{openchs-mt-so-v1,
author = {OpenCHS AI Team},
title = {mt-en-so-helpline-v1: English-Somali Translation for Child Helplines},
year = {2026},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/Rogendo/mt-so-en-helpline-v1}}
}