YehaTranslate
YehaTranslate is a fine-tuned translation model for Ethiopian languages ↔ English, developed by Hasab AI. It is based on google/translategemma-4b-it and fine-tuned primarily on paired Amharic–English data.
The primary training focus is Amharic (am) ↔ English (en). The training data also contains Tigrinya (ti) and Oromo (om) samples, so translation to and from English is supported for these languages, though coverage is less extensive than for Amharic. We plan to expand the Tigrinya and Oromo training data in future releases.
Model Details
| Field | Value |
|---|---|
| Base model | google/translategemma-4b-it |
| Architecture | Gemma3ForConditionalGeneration (4B) |
| Languages | Amharic (am), Tigrinya (ti), Oromo (om), English (en) |
| Framework | PyTorch + Hugging Face Transformers, distributed with torchrun |
Benchmark Results
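Evaluated on 960 held-out sentence pairs per direction. The primary metric is chrF2 (character n-gram F-score with β=2, which weights recall more heavily than precision). It is more reliable than BLEU for morphologically rich Amharic because character-level matching gives partial credit when a word's stem is right but its affixes differ.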
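To make the metric concrete, here is a minimal sketch of a sentence-level chrF-style score. It is a simplified illustration, not the evaluation code behind the tables: the function names `char_ngrams` and `simple_chrf` are hypothetical, and whitespace stripping and n-gram orders 1 to 6 are assumed defaults. Reference implementations such as sacreBLEU differ in details, so the numbers will not match exactly.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_order: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF on a 0-100 scale (illustrative only)."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        hyp_total, ref_total = sum(hyp.values()), sum(ref.values())
        if hyp_total == 0 or ref_total == 0:
            continue  # this n-gram order is longer than one of the strings
        matches = sum((hyp & ref).values())  # clipped overlap counts
        precisions.append(matches / hyp_total)
        recalls.append(matches / ref_total)
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: with beta=2, missing reference content costs more than extra output.
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

For example, `simple_chrf("ቤት", "ቤቶች")` ("house" against "houses") is above zero because the two words share a character, whereas a word-level metric would count them as a complete mismatch.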
am → en
| Model | BLEU | chrF2 | Len Ratio |
|---|---|---|---|
| Baseline Gemma (no fine-tuning) | 7.00 | 27.69 | 1.22 |
| YehaTranslate (ours) | 37.58 | 64.60 | 1.07 |
| Gemini 2.5 Flash | 36.17 | 57.78 | 0.69 |
| GPT-5.4 | 26.37 | 51.36 | 0.67 |
en → am
| Model | BLEU | chrF2 | Len Ratio |
|---|---|---|---|
| Baseline Gemma (no fine-tuning) | 1.00 | 7.69 | 1.79 |
| YehaTranslate (ours) | 8.12 | 25.93 | 1.13 |
| Gemini 2.5 Flash | 4.43 | 23.05 | 0.43 |
| GPT-5.4 | 0.15 | 9.51 | 0.18 |
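The tables do not define Len Ratio. The sketch below assumes it is the total hypothesis length divided by the total reference length, which is how BLEU implementations such as sacreBLEU report their length ratio. The function name `length_ratio` and the whitespace tokenization are assumptions made for illustration; the actual evaluation may have tokenized differently.

```python
def length_ratio(hypotheses: list[str], references: list[str]) -> float:
    """Corpus-level hypothesis/reference length ratio on whitespace tokens.

    Assumed definition: the model card does not state how Len Ratio was computed.
    """
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must be aligned one-to-one")
    hyp_len = sum(len(h.split()) for h in hypotheses)
    ref_len = sum(len(r.split()) for r in references)
    if ref_len == 0:
        raise ValueError("references contain no tokens")
    return hyp_len / ref_len
```

Under this reading, a value near 1.0 means outputs are about as long as the references, values above 1.0 indicate longer outputs, and values well below 1.0 (such as 0.18) indicate outputs much shorter than the references.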
Usage
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "hasab-ai/YehaTranslate"
processor = AutoProcessor.from_pretrained(model_id, use_fast=True)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.generation_config.pad_token_id = processor.tokenizer.eos_token_id
model.eval()

# Supported directions as (source, target) language codes.
DIRECTIONS = {
    "Amharic → English": ("am", "en"),
    "English → Amharic": ("en", "am"),
    "Oromo → English": ("om", "en"),
    "English → Oromo": ("en", "om"),
    "Tigrinya → English": ("ti", "en"),
    "English → Tigrinya": ("en", "ti"),
}

def translate(text: str, src_lang: str, tgt_lang: str, max_new_tokens: int = 512) -> str:
    # The chat template expects the language codes alongside the text.
    messages = [{"role": "user", "content": [
        {"type": "text", "source_lang_code": src_lang,
         "target_lang_code": tgt_lang, "text": text.strip()}
    ]}]
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)  # keep inputs on the same device as the model
    input_len = inputs["input_ids"].shape[1]
    with torch.inference_mode():
        out = model.generate(**inputs, do_sample=False, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt.
    return processor.decode(out[0][input_len:], skip_special_tokens=True).strip()

# Amharic → English
print(translate("የኢትዮጵያ ሕዝቦች ታሪካዊ ትብብር እና አንድነት ዛሬም ቀጥሏል።", "am", "en"))

# English → Amharic
print(translate("The coffee ceremony is an important part of Ethiopian culture.", "en", "am"))

# Oromo → English
print(translate("Biyyoonni Itoophiyaa beekamtii addunyaa qabdi.", "om", "en"))
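When translating many sentences, it can help to check the language pair up front and keep outputs aligned with inputs. The following is a small sketch of such a wrapper; `SUPPORTED_PAIRS` and `translate_many` are hypothetical names introduced here, and the function takes the translation callable as a parameter so it can be exercised without loading the model.

```python
from typing import Callable, Iterable

# Language pairs listed in DIRECTIONS above: each Ethiopian language paired with English.
SUPPORTED_PAIRS = {
    ("am", "en"), ("en", "am"),
    ("om", "en"), ("en", "om"),
    ("ti", "en"), ("en", "ti"),
}


def translate_many(
    texts: Iterable[str],
    src_lang: str,
    tgt_lang: str,
    translate_fn: Callable[[str, str, str], str],
) -> list[str]:
    """Translate texts one at a time, rejecting unsupported pairs."""
    if (src_lang, tgt_lang) not in SUPPORTED_PAIRS:
        raise ValueError(f"Unsupported direction: {src_lang} -> {tgt_lang}")
    results = []
    for text in texts:
        # Blank lines map to empty output so results stay aligned with inputs.
        results.append(translate_fn(text, src_lang, tgt_lang) if text.strip() else "")
    return results
```

With the `translate` function defined above, a call would look like `translate_many(lines, "am", "en", translate)`. This processes sentences sequentially; true batched generation would need padding-aware changes to `translate` that are not shown here.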
Training Details
Training used torchrun for distributed data-parallel training, with the Hugging Face Trainer API and bfloat16 activations.
The model was trained on Amharic–English sentence pairs covering news, general web, and social media domains.
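The training script is not part of this card, so the following is only a sketch of how a script launched with torchrun can see its place in the job. torchrun starts one process per device and sets the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables, which the Trainer API reads on its own. The helper name `distributed_context` is hypothetical.

```python
import os


def distributed_context() -> dict[str, int]:
    """Read the process layout that torchrun exposes through environment variables.

    Falls back to a single-process layout when the variables are not set.
    """
    return {
        "rank": int(os.environ.get("RANK", 0)),              # global process index
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),  # process index on this machine
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),  # total number of processes
    }
```

A launch would look something like `torchrun --nproc_per_node=4 train.py`, where `train.py` and the process count are placeholders. With the Trainer API, bfloat16 mixed precision is commonly requested through `TrainingArguments(bf16=True)`; whether this exact setting was used for YehaTranslate is an assumption.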
Limitations
- Performance is best on general-domain text; quality may degrade on domain-specific content (legal, medical).
- en→am is harder than am→en: every model in the benchmark tables scores lower on en→am (YehaTranslate: 25.93 vs. 64.60 chrF2).
- Amharic is morphologically rich, so single-reference automatic metrics (BLEU, chrF2) can underestimate translation quality when a sentence has many valid paraphrases.
Citation
@misc{oriontranslate2026,
  title        = {YehaTranslate: Fine-tuned Amharic-English Translation},
  author       = {Hasab AI},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/hasab-ai/YehaTranslate}},
}