ayymen/Pontoon-Translations
A fine-tuned mT5 model for Ossetian machine translation. The model was trained for 10 epochs (1,130 training steps in total) with the Hugging Face Trainer. Convergence was stable: validation loss plateaued at about 2.47 from epoch 8 onward, while training loss kept falling slowly and reached about 2.19 by epoch 10.
Starting training...
[1130/1130 27:40, Epoch 10/10]
Epoch  Training Loss  Validation Loss
1      4.760233       3.660645
2      3.665481       3.144127
3      3.284355       2.871792
4      2.910443       2.698959
5      2.716799       2.608596
6      2.567515       2.554176
7      2.334619       2.507996
8      2.326849       2.465583
9      2.173708       2.468309
10     2.190471       2.469198
Writing model shards: 100% 1/1 [00:25<00:00, 25.28s/it]
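The figures in the log above can be checked with a few lines of Python. The sketch below assumes nothing beyond the logged numbers; the helper names (`steps_per_epoch`, `best_epoch`, `validation_deltas`) are illustrative and not part of the training code.

```python
# Loss values copied from the training log: (epoch, training loss, validation loss).
LOG = [
    (1, 4.760233, 3.660645),
    (2, 3.665481, 3.144127),
    (3, 3.284355, 2.871792),
    (4, 2.910443, 2.698959),
    (5, 2.716799, 2.608596),
    (6, 2.567515, 2.554176),
    (7, 2.334619, 2.507996),
    (8, 2.326849, 2.465583),
    (9, 2.173708, 2.468309),
    (10, 2.190471, 2.469198),
]
TOTAL_STEPS = 1130  # from the "[1130/1130 ...]" progress line
EPOCHS = 10


def steps_per_epoch(total_steps: int, epochs: int) -> int:
    """Number of training steps in one epoch, assuming equal-sized epochs."""
    return total_steps // epochs


def best_epoch(log) -> int:
    """Epoch with the lowest validation loss."""
    return min(log, key=lambda row: row[2])[0]


def validation_deltas(log):
    """Change in validation loss relative to the previous epoch (negative = improvement)."""
    return [(cur[0], cur[2] - prev[2]) for prev, cur in zip(log, log[1:])]


if __name__ == "__main__":
    print("steps per epoch:", steps_per_epoch(TOTAL_STEPS, EPOCHS))
    print("best epoch by validation loss:", best_epoch(LOG))
    for epoch, delta in validation_deltas(LOG):
        print(f"epoch {epoch}: validation loss change {delta:+.6f}")
```

The validation loss improves through epoch 8 and then drifts up by less than 0.004 over the last two epochs, which is why the checkpoint from epoch 8 is a reasonable candidate if an earlier stopping point is wanted.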
Convergence Notes:
- Training loss: 4.76 → 2.19
- Validation loss: plateaued at 2.46–2.47, indicating effective learning without severe overfitting.
- Total steps: 1,130 (113 steps/epoch × 10 epochs)
- Base model: google/mt5-small
- Task: Translation ([Source Language] → Ossetian)
- Framework: transformers + Trainer
- Hardware: [e.g., 1× NVIDIA A100 40GB / Colab Pro / etc.]
- Dataset: [Link or name of your translation dataset]

import os
import json
import torch
from pathlib import Path
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, T5Tokenizer
MODEL_ID = "ajsbsd/mt5-ossetian-translator"

print("Downloading model (with auth)...")
# resume_download is omitted: recent huggingface_hub releases resume downloads automatically.
local_path = snapshot_download(
    repo_id=MODEL_ID,
    ignore_patterns=["optimizer.pt", "*.pt"],  # Skip training artifacts
)
print(f"Model downloaded to: {local_path}")

# Find the SentencePiece model file
spm_candidates = ["spiece.model", "sentencepiece.bpe.model", "tokenizer.model"]
spm_file = None
for candidate in spm_candidates:
    path = os.path.join(local_path, candidate)
    if os.path.exists(path):
        spm_file = path
        print(f"Found SentencePiece model: {spm_file}")
        break

if not spm_file:
    print("No SentencePiece model found; falling back to base mT5 tokenizer")
    tokenizer = AutoTokenizer.from_pretrained("google/mt5-small", use_fast=False)
else:
    # Load tokenizer with an EXPLICIT string path (critical!)
    print("Loading tokenizer with explicit spm path...")
    tokenizer = T5Tokenizer(
        vocab_file=str(spm_file),  # Must be a string, not a Path object
        eos_token="</s>",
        unk_token="<unk>",
        pad_token="<pad>",
        extra_ids=100,  # mT5 uses 100 sentinel tokens
        legacy=True,
    )

# Patch config to avoid future issues
config_file = os.path.join(local_path, "tokenizer_config.json")
if os.path.exists(config_file):
    with open(config_file, "r", encoding="utf-8") as f:
        config = json.load(f)
    # Fix known issues
    if "extra_special_tokens" in config and isinstance(config["extra_special_tokens"], list):
        config["extra_special_tokens"] = {}
    if "vocab_file" in config and config["vocab_file"] is None:
        config["vocab_file"] = str(spm_file) if spm_file else "spiece.model"
    if "spm_model_file" in config and config["spm_model_file"] is None:
        config["spm_model_file"] = str(spm_file) if spm_file else "spiece.model"
    with open(config_file, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2, ensure_ascii=False)
    print("Patched tokenizer_config.json")

# Load model
print("Loading model weights...")
model = AutoModelForSeq2SeqLM.from_pretrained(local_path)
if torch.cuda.is_available():
    model = model.to("cuda")
    print("Model moved to CUDA")

# Run translation test
prompt = "translate english to ossetian: Hello, how are you?"
print(f"\nTranslating: '{prompt}'")
inputs = tokenizer(prompt, return_tensors="pt")
if torch.cuda.is_available():
    inputs = inputs.to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    num_beams=4,
    early_stopping=True,
    no_repeat_ngram_size=2,
)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Output: {result}")
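
The `tokenizer_config.json` patch in the loading script can also be written as a pure function that works on the parsed JSON, which makes it easy to test without downloading the model. This is a minimal sketch that mirrors the three fixes in the script; the function name `patch_tokenizer_config` and the default `spm_path` value are illustrative assumptions, not part of the model repository.

```python
def patch_tokenizer_config(config: dict, spm_path: str = "spiece.model") -> dict:
    """Return a patched copy of a parsed tokenizer_config.json.

    Mirrors the fixes in the loading script:
    - a list-valued "extra_special_tokens" is replaced with an empty dict
    - a null "vocab_file" or "spm_model_file" is pointed at spm_path
    Keys that are absent are left absent; the input dict is not modified.
    """
    patched = dict(config)
    if isinstance(patched.get("extra_special_tokens"), list):
        patched["extra_special_tokens"] = {}
    for key in ("vocab_file", "spm_model_file"):
        if key in patched and patched[key] is None:
            patched[key] = spm_path
    return patched


if __name__ == "__main__":
    # Hypothetical config content, for illustration only.
    broken = {
        "extra_special_tokens": [],
        "vocab_file": None,
        "tokenizer_class": "T5Tokenizer",
    }
    print(patch_tokenizer_config(broken, spm_path="/tmp/model/spiece.model"))
```

Returning a copy instead of editing in place keeps the original config available for comparison before anything is written back to disk.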

from transformers import pipeline

model_id = "ajsbsd/mt5-ossetian-translator"
translator = pipeline("translation", model=model_id, tokenizer=model_id)
# Replace [source_lang] with the source language name, e.g. "english"
result = translator("translate [source_lang] to ossetian: Your text here.")
print(result[0]["translation_text"])
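
Both examples above rely on a task prefix of the form `translate <source> to ossetian: <text>`. The helper below is a small sketch for building such prompts consistently and splitting a list of sentences into batches. The names `build_prompt` and `batched` are hypothetical, and it is an assumption that the model responds to this prefix for source languages other than English, since only the English prompt appears in the test script.

```python
from typing import Iterator, List, Sequence


def build_prompt(text: str, source_lang: str = "english", target_lang: str = "ossetian") -> str:
    """Build a prompt in the 'translate X to Y: text' format used in the examples."""
    return f"translate {source_lang.strip().lower()} to {target_lang.strip().lower()}: {text.strip()}"


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of items with at most batch_size elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


if __name__ == "__main__":
    sentences = ["Hello, how are you?", "Good morning.", "Thank you."]
    prompts = [build_prompt(s) for s in sentences]
    for batch in batched(prompts, batch_size=2):
        # Each batch could be passed to tokenizer(batch, return_tensors="pt", padding=True)
        # and then to model.generate(...), as in the loading script.
        print(batch)
```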
@misc{mt5_ossetian_translator,
title={mt5-ossetian-translator},
author={ajsbsd},
year={2026},
url={https://huggingface.co/ajsbsd/mt5-ossetian-translator}
}
Base model
google/mt5-small