mt-en-so-helpline-v1

A fine-tuned version of Helsinki-NLP/opus-mt-synthetic-en-so for English → Somali machine translation, developed as part of the OpenCHS child helpline AI pipeline.

Model Description

This model fine-tunes the MarianMT checkpoint opus-mt-synthetic-en-so for English-to-Somali translation. It was trained on a combination of 14 multilingual datasets, with a focus on improving coverage of everyday and domain-specific Somali vocabulary relevant to child helpline contexts.

Base model: Helsinki-NLP/opus-mt-synthetic-en-so
Architecture: MarianMT (Transformer encoder-decoder, 6 layers, d_model=512)
Language pair: English → Somali (en → so)
Task: Sequence-to-sequence translation

Intended Use

English-to-Somali translation for child helpline call transcriptions
Supporting Somali-speaking communities through AI-assisted helpline operations
Downstream component in the OpenCHS multilingual NLP pipeline

Training Details

Parameter	Value
Base model	`Helsinki-NLP/opus-mt-synthetic-en-so`
Epochs	5
Optimizer	AdamW
Learning rate	3e-6
Batch size	16
Train samples/sec	386.8
Training time	~3.9 hours
Final train loss	3.651
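
The throughput and training time in the table give a rough idea of how much data the run processed. The arithmetic below is a back-of-the-envelope sketch, not a reported figure: it assumes the samples/sec value is an average over the whole run, that "~3.9 hours" is close to exact, and that the batch size of 16 is the effective (global) batch size.

```python
# Figures copied from the Training Details table.
EPOCHS = 5
BATCH_SIZE = 16
SAMPLES_PER_SEC = 386.8
TRAIN_HOURS = 3.9  # reported as approximate

# Assumption: throughput is the average over the whole run.
total_samples = SAMPLES_PER_SEC * TRAIN_HOURS * 3600
samples_per_epoch = total_samples / EPOCHS

# Assumption: 16 is the effective (global) batch size, with no gradient accumulation.
steps_per_epoch = samples_per_epoch / BATCH_SIZE

print(f"samples seen in total: {total_samples:,.0f}")
print(f"samples per epoch:     {samples_per_epoch:,.0f}")
print(f"optimizer steps/epoch: {steps_per_epoch:,.0f}")
```

Under these assumptions the run saw roughly 5.4 million sentence pairs in total, or about 1.1 million per epoch.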

Training Datasets

Dataset	Description
NLLB	Meta's No Language Left Behind (up to 500K samples)
CCAligned	Web-crawled aligned corpus
CCMatrix	Cross-lingual web mining
XLEnt	Cross-lingual entity pairs
Bible UEDIN	Biblical text parallel corpus
Infopankki	Public information texts
ParaCrawl / ParaCrawl Bonus	Web-crawled parallel data
TED2020	TED talk transcriptions
Tanzil	Quran translations
Wikimedia	Wikipedia and related content
GNOME	Software localization strings
OpenSubtitles	Subtitle parallel data
TICO-19 (dev)	COVID-19 domain data
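
The NLLB row mentions a cap of up to 500K samples. How the cap was applied is not stated, so the following is only one plausible sketch: a uniform random sample without replacement. The function name `cap_pairs`, the `seed` parameter, and the toy data are hypothetical and chosen for illustration.

```python
import random

def cap_pairs(pairs, limit, seed=0):
    """Return at most `limit` sentence pairs, sampled uniformly without replacement."""
    if len(pairs) <= limit:
        return list(pairs)  # small corpora are kept whole
    # A fixed seed keeps the subset reproducible across runs.
    return random.Random(seed).sample(pairs, limit)

# Toy (English, Somali) placeholders standing in for a real parallel corpus.
toy = [(f"en sentence {i}", f"so sentence {i}") for i in range(10)]
capped = cap_pairs(toy, limit=4)
print(len(capped))
```

Sampling rather than taking the first N lines avoids inheriting any ordering bias in a web-mined corpus, at the cost of having to load the full list first.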

Evaluation Results

FLORES-200 devtest (en → so)

Metric	Score
BLEU	11.60
chrF	45.87

TICO-19 test (en → so)

Metric	Score
BLEU	7.45
chrF	33.07

Note: BLEU scores for low-resource languages like Somali are typically lower than for high-resource languages. chrF, which scores character n-gram overlap rather than exact word matches, is generally considered more reliable for morphologically rich languages. The gap between FLORES-200 and TICO-19 scores likely reflects the specialized medical/crisis domain vocabulary in TICO-19.
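
To illustrate why a character-level metric is more forgiving of inflectional differences, here is a simplified character n-gram F-score in plain Python. It is a sketch in the spirit of chrF (n-gram orders 1 to 6, beta = 2, whitespace ignored), not the implementation used to produce the scores above, and its numbers will not match a standard toolkit exactly. The example sentences are English placeholders.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count character n-grams of order n, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: F-beta over precision/recall averaged across n-gram orders."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # text too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)

reference = "the helpers answered"
hypothesis = "the helper answers"  # same stems, different endings
print(round(simple_chrf(hypothesis, reference), 2))
```

Only one of the three words in the hypothesis matches the reference exactly, yet most character n-grams are shared, so the character-level score stays well above zero.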

How to Use

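from transformers import MarianMTModel, MarianTokenizer

# Repository ID on the Hugging Face Hub. Note the "so-en" ordering in the ID;
# the model itself translates English → Somali.
model_name = "Rogendo/mt-so-en-helpline-v1"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def translate(texts):
    """Translate a list of English sentences into Somali."""
    # Pad to the longest sentence in the batch and truncate over-long inputs
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    # Generate Somali token IDs with the model's default generation settings
    translated = model.generate(**inputs)
    # Convert token IDs back to text, dropping padding and end-of-sequence markers
    return tokenizer.batch_decode(translated, skip_special_tokens=True)

# Example
result = translate(["How can I help you today?", "Are you safe right now?"])
print(result)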
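
A full call transcription can contain many more sentences than fit comfortably in one padded batch. The helper below is a minimal sketch of one way to feed sentences to a `translate`-style function in fixed-size chunks. The names `chunked` and `translate_in_batches`, and the default batch size of 16, are assumptions for illustration; `fake_translate` is a stand-in so the snippet runs without downloading the model.

```python
from typing import Callable, Iterator, List

def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def translate_in_batches(
    sentences: List[str],
    translate_fn: Callable[[List[str]], List[str]],
    batch_size: int = 16,
) -> List[str]:
    """Apply `translate_fn` batch by batch, preserving sentence order."""
    results: List[str] = []
    for batch in chunked(sentences, batch_size):
        results.extend(translate_fn(batch))
    return results

# Stand-in for the real `translate` function (hypothetical, for demonstration only).
def fake_translate(texts: List[str]) -> List[str]:
    return [t.upper() for t in texts]

out = translate_in_batches(["a", "b", "c", "d", "e"], fake_translate, batch_size=2)
print(out)  # ['A', 'B', 'C', 'D', 'E']
```

In practice, `translate_fn` would be the `translate` function defined above, and the batch size would be tuned to the available memory.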

Limitations

BLEU of 11.60 on FLORES-200 and 7.45 on TICO-19 leaves substantial room for improvement, especially on domain-specific crisis vocabulary
Model performs better on general text (FLORES-200) than on specialized crisis/medical terminology (TICO-19)
Somali is a low-resource language with limited parallel data available for training
Not yet evaluated on actual helpline transcription samples

Next Steps

Evaluate on real helpline call transcriptions
Domain adaptation with helpline-specific Somali vocabulary
Explore larger MarianMT variants or NLLB-200 as base models

Citation

@misc{openchs-mt-so-v1,
  author       = {OpenCHS AI Team},
  title        = {mt-en-so-helpline-v1: English-Somali Translation for Child Helplines},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Rogendo/mt-so-en-helpline-v1}}
}

Model size: 60.6M params (Safetensors, tensor type F32)
