Embedding-Amharic-Medium

This is a sentence-transformers model finetuned from rasyosef/roberta-medium-amharic. It maps sentences & paragraphs to a 512-dimensional dense vector space and can be used for semantic textual similarity, semantic search, and information retrieval.

It was introduced in the paper The Multilingual Curse at the Retrieval Layer: Evidence from Amharic.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: rasyosef/roberta-medium-amharic
  • Maximum Sequence Length: 510 tokens
  • Output Dimensionality: 512 dimensions
  • Similarity Function: Cosine Similarity
  • Language: am
  • License: mit

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 510, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 512, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("rasyosef/embedding-amharic-medium")
# Run inference
sentences = [
    'ለውጭ ገበያ በሚቀርበው የኢትዮጵያ ቡና ላይ የተጋረጠው ፈተና',
    'የኢትዮጵያ ዋነኛ የውጭ ምንዛሬ ምንጭ የሆነው ወደ ውጭ የሚላክ ቡና ዘርፍ በአሁኑ ጊዜ ከፍተኛ ውጥረት ውስጥ ገብቷል። በዚህ የተነሳም የኢትዮጵያ ቡናና ሻይ ባለሥልጣንን ጨምሮ የሚመላካታቸው ሁሉ ቡና ላኪዎችና አምራቾች ያከማቹትን ቡና በፍጥነት ወደ ዓለም ገበያ እንዲያወጡ ጥሪ እያቀረቡ ነው ።',
    'የቻይናው ፕሬዝዳንት ዚ ጂንፒንግ ከትራምፕ ጋር ባደረጉት ጉባኤ ትኩረታቸው በሁለቱ ሀገራት መካከል ለወራት ከተፈጠረ ውጥረት እና የንግድ ጦርነት በኋላ የተረገጋጋ ግንኙነትን ማስቀጠል ነበር። ከፑቲን ጋር ደግሞ ዢ ለሁለቱ አገራት ስልታዊም ሆነ ኢኮኖሚያዊ ጠቀሜታ ረጅም ጊዜ የዘለቀውን አጋርነትን ይበልጥ ማጠናከር ላይ ነበር ትኩረታቸው።',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 512]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]

Evaluation

Metrics

Information Retrieval (dim 512)

Metric Value
cosine_recall@5 0.8430
cosine_recall@10 0.8883
cosine_ndcg@10 0.7794
cosine_mrr@10 0.7444

Information Retrieval (dim 256)

Metric Value
cosine_recall@5 0.8409
cosine_recall@10 0.8831
cosine_ndcg@10 0.7736
cosine_mrr@10 0.7382

Training Details

Training Dataset

  • Size: 122,938 training samples
  • Columns: anchor, positive, negative_1, and negative_2
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            512,
            256
        ],
        "matryoshka_weights": [
            1,
            1
        ],
        "n_dims_per_step": -1
    }
    

Training Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • learning_rate: 6e-05
  • num_train_epochs: 6
  • lr_scheduler_type: cosine
  • fp16: True

Framework Versions

  • Python: 3.11.13
  • Sentence Transformers: 4.1.0
  • Transformers: 4.52.4
  • PyTorch: 2.7.1+cu126

Citation

@inproceedings{alemneh2026amharicir,
  title     = {The Multilingual Curse at the Retrieval Layer: Evidence from Amharic},
  author    = {Alemneh, Yosef Worku and Mekonnen, Kidist Amde and de Rijke, Maarten},
  booktitle = {Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM), ACL 2026},
  year      = {2026},
}
Downloads last month
177
Safetensors
Model size
42.1M params
Tensor type
F32
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for rasyosef/embedding-amharic-medium

Finetuned
(11)
this model

Dataset used to train rasyosef/embedding-amharic-medium

Collection including rasyosef/embedding-amharic-medium

Paper for rasyosef/embedding-amharic-medium

Evaluation results