Moxhi MT 30 zh-vi
Fast Chinese-to-Vietnamese Marian-style machine translation model, trained for web-novel and xianxia content.
Intended Use
- Chinese -> Vietnamese web novel / fiction translation (xianxia, modern, cross-domain).
- Fast local or server inference where a small model is preferred.
- Strong general / cross-domain coverage (sci-fi, historical, modern, mystery) while keeping xianxia / wuxia / classical register sharp.
- Experimental release; review output for high-stakes or publication use.
Model Details
- Architecture: Marian seq2seq (asymmetric: 8 encoder layers + 2 decoder layers)
- Parameters: ~37M
- Tokenizer: SentencePiece joint source/target, 24k
- Suggested decoding: num_beams=4, max_length=512
Versions
| Tag | Notes |
|---|---|
| v4.0 (current main) | New version. |
| v3.0 | Pin with revision="v3.0". |
| v2.2 | Pin with revision="v2.2". |
Pin a specific version:
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("DanVP/MoxhiMT-30", revision="v3.0")
Quick Start
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_id = "DanVP/MoxhiMT-30"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
text = "他抬头看向远处的山门。"
inputs = tok(text, return_tensors="pt", truncation=True, max_length=512)
out = model.generate(**inputs, max_length=512, num_beams=4)
print(tok.decode(out[0], skip_special_tokens=True))
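Web-novel chapters can easily exceed the 512-token limit used above, so longer input needs to be split before translation. The following is a minimal sketch, not part of the model's own tooling: `split_sentences`, `pack`, `translate_long`, and `make_hf_translate` are hypothetical helper names, and the 200-character chunk size and batch size of 16 are assumptions to tune for your text.

```python
import re
from typing import Callable, List

# A sentence is any run of text ending in Chinese/ASCII terminators plus
# optional closing quotes; the second alternative catches an unterminated tail.
_SENTENCE = re.compile(r'[^。！？!?]*[。！？!?]+[”’」』]*|[^。！？!?]+')


def split_sentences(line: str) -> List[str]:
    """Split one line of Chinese text into sentences, keeping punctuation."""
    return _SENTENCE.findall(line)


def pack(sentences: List[str], max_chars: int = 200) -> List[str]:
    """Greedily join consecutive sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept as its own chunk.
    """
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current += sentence
    if current:
        chunks.append(current)
    return chunks


def translate_long(text: str,
                   translate_batch: Callable[[List[str]], List[str]],
                   max_chars: int = 200,
                   batch_size: int = 16) -> str:
    """Translate multi-line text chunk by chunk, preserving line breaks."""
    out_lines = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            out_lines.append("")  # keep blank lines between paragraphs
            continue
        chunks = pack(split_sentences(line), max_chars)
        translated: List[str] = []
        for i in range(0, len(chunks), batch_size):
            translated.extend(translate_batch(chunks[i:i + batch_size]))
        out_lines.append(" ".join(translated))
    return "\n".join(out_lines)


def make_hf_translate(tok, model) -> Callable[[List[str]], List[str]]:
    """Wrap a loaded tokenizer/model pair as a batch translation callable."""
    def translate_batch(texts: List[str]) -> List[str]:
        inputs = tok(texts, return_tensors="pt", padding=True,
                     truncation=True, max_length=512)
        out = model.generate(**inputs, max_length=512, num_beams=4)
        return tok.batch_decode(out, skip_special_tokens=True)
    return translate_batch
```

With `tok` and `model` loaded as in the Quick Start, a chapter held in a string `chapter_text` could be translated with `translate_long(chapter_text, make_hf_translate(tok, model))`.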
Fast CPU Runtime
A CTranslate2 INT8 export is in ct2-int8_float32/ for ~3-5x faster CPU inference.
import ctranslate2
from pathlib import Path
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer
model_id = "DanVP/MoxhiMT-30"
model_path = Path(snapshot_download(model_id, allow_patterns=[
    "config.json", "source.spm", "target.spm", "vocab.json",
    "tokenizer_config.json", "ct2-int8_float32/*",
]))
tokenizer = AutoTokenizer.from_pretrained(model_path)
translator = ctranslate2.Translator(
    str(model_path / "ct2-int8_float32"),
    device="cpu", compute_type="int8_float32",
)
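The snippet above stops once the translator is constructed. Below is a minimal sketch of the translation step, following the token-level pattern that CTranslate2 documents for Marian models. `translate_ct2` is a hypothetical helper name, and `beam_size=4` and the 512 limits mirror the suggested decoding settings; they are assumptions, not values stated for the CTranslate2 export.

```python
from typing import List


def translate_ct2(translator, tokenizer, texts: List[str],
                  beam_size: int = 4) -> List[str]:
    """Translate a batch of strings with a ctranslate2.Translator.

    CTranslate2 consumes and produces token strings rather than ids, so the
    Hugging Face tokenizer is used to convert in both directions.
    """
    source = [
        tokenizer.convert_ids_to_tokens(
            tokenizer.encode(text, truncation=True, max_length=512)
        )
        for text in texts
    ]
    results = translator.translate_batch(
        source, beam_size=beam_size, max_decoding_length=512
    )
    outputs = []
    for result in results:
        target_tokens = result.hypotheses[0]  # best-scoring hypothesis
        target_ids = tokenizer.convert_tokens_to_ids(target_tokens)
        outputs.append(tokenizer.decode(target_ids, skip_special_tokens=True))
    return outputs
```

With `translator` and `tokenizer` created as above, `translate_ct2(translator, tokenizer, ["他抬头看向远处的山门。"])` returns a list with one Vietnamese string.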
Training Data
Trained from scratch on a curated Chinese-Vietnamese parallel corpus covering xianxia, modern fiction, historical, sci-fi, and cross-domain web-novel content. The corpus includes a research-grounded layer for idioms and classical-Chinese grammar. Training finished with a light preference-tuning (DPO) pass for xianxia/idiom sharpness.
Notes
- Prioritizes speed and small footprint.
- Known hard cases include rare proper nouns and highly domain-specific, out-of-distribution (OOD) terminology.
- For production usage, pair with reviewed glossary/guard layers where appropriate.
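One possible shape for such a glossary guard is a post-translation check that flags output where a reviewed term was not rendered as expected. This is a sketch under assumptions: `glossary_violations` is a hypothetical helper, the matching is plain substring search, and the glossary entries are illustrative examples, not terms shipped with the model.

```python
import unicodedata
from typing import Dict, List, Tuple


def _norm(text: str) -> str:
    """Normalize Unicode composition and case so Vietnamese diacritics compare reliably."""
    return unicodedata.normalize("NFC", text).casefold()


def glossary_violations(source: str, translation: str,
                        glossary: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (source_term, expected_rendering) pairs that appear in the
    source but whose expected rendering is missing from the translation."""
    normalized = _norm(translation)
    return [
        (zh, vi)
        for zh, vi in glossary.items()
        if zh in source and _norm(vi) not in normalized
    ]


# Illustrative glossary (hypothetical entries for demonstration only).
GLOSSARY = {"筑基": "Trúc Cơ", "金丹": "Kim Đan"}
```

A non-empty result can be used to route the segment to human review or to retry with a different decoding setting; the check does not modify the translation itself.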
License
CC-BY-NC-4.0 (research / non-commercial use).