MoxhiMT Pronoun Classifier (speaker attribution)
Small extractive speaker-attribution classifier for Chinese web-novel dialogue, used downstream of DanVP/MoxhiMT-30 to make Vietnamese pronouns consistent per speaker–addressee pair (tôi/cậu/ngươi…).
- Task: given 3 sentences of context + the current line + a candidate name, score whether that candidate is the speaker (listwise softmax over candidates).
- Arch: 2-layer Transformer encoder, d=256, ~7.8M params. Tokenizer = MoxhiMT source SentencePiece (24k ZH).
- Candidates: dictionary lookup over 347 known character names (longest-match), plus the special labels 旁白 (narration) / 自语 (soliloquy).
- Speed: ~3 ms/sentence (CPU), ~3% overhead on top of MT decode.
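A minimal sketch of the longest-match candidate lookup described above, not the code shipped in pronoun_infer.py. The function name find_candidates, the example names, and the choice to append 旁白/自语 as always-available candidates are assumptions made for illustration.

```python
def find_candidates(text, known_names, specials=("旁白", "自语")):
    """Greedy longest-match scan of `text` against a name dictionary.

    Hypothetical helper: the real lookup may differ in tie-breaking,
    normalization, and how the special labels are handled.
    """
    name_set = set(known_names)
    max_len = max((len(n) for n in name_set), default=0)
    found = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest possible span first so "林风云" wins over "林风".
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in name_set:
                match = piece
                break
        if match is None:
            i += 1
            continue
        if match not in found:
            found.append(match)
        i += len(match)
    # Assumption: narration / soliloquy are always offered as candidates.
    return found + list(specials)


# Hypothetical names, used only to show the longest-match behavior.
names = ["林风", "林风云", "苏晴"]
candidates = find_candidates("林风云看着苏晴,笑了。", names)
```

Scanning longest-first keeps a short name from shadowing a longer name that contains it, which matters when the dictionary holds both a given name and a full name.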
Why
MT models translate sentence-by-sentence, so the same 你/我 flips between ngươi/cậu, tôi/ta across lines for the same speaker — especially in modern settings. This classifier identifies the speaker so a renderer can normalize pronouns per relationship pair.
Honest scope
- Speaker accuracy is ~50% on a multi-genre validation set (10× a regex baseline) and roughly even across genres (modern: 63%). This is a proof of concept, not production-ready.
- Trained on ~3.8k lines labeled by DeepSeek-V4-Flash, external agents, and manual annotation.
- Best on genres present in the training pool; new novels (new names) degrade.
- The renderer is intentionally conservative: it skips xianxia, skips plural 我们/咱们/你们, multi-你 quotes, and raw Vietnamese quote pairs like ta/ngươi when it cannot rewrite the whole pair. It only rewrites self/you when the matching Chinese pronoun is present, protects fixed Vietnamese phrases, and caps edits per quote segment.
- Direct family address such as 妈,...你... / 儿子,...你... is an exception: the addressee is explicit, so multiple Vietnamese second-person pronouns in the same quoted segment may be normalized together.
- render(...) returns an info string with guard status plus speaker score metadata (p, margin, candidate count) so downstream apps can surface abstentions instead of silently over-editing.
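To illustrate how a listwise softmax can yield the p, margin, and candidate-count metadata mentioned above, here is a hedged sketch. The function names, the thresholds min_p and min_margin, and the exact info string layout are assumptions, not the format that render(...) actually returns.

```python
import math


def listwise_softmax(logits):
    """Numerically stable softmax over one line's candidate logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_speaker(candidates, logits, min_p=0.6, min_margin=0.2):
    """Return (speaker or None, info string).

    Hypothetical guard: abstain when the top probability or its lead over
    the runner-up is too small. Thresholds are illustrative only.
    """
    if not candidates:
        return None, "guard=abstain p=0.00 margin=0.00 n=0"
    probs = listwise_softmax(logits)
    ranked = sorted(zip(candidates, probs), key=lambda cp: cp[1], reverse=True)
    top_name, p = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    margin = p - runner_up
    abstain = p < min_p or margin < min_margin
    status = "abstain" if abstain else "ok"
    info = f"guard={status} p={p:.2f} margin={margin:.2f} n={len(candidates)}"
    return (None if abstain else top_name), info


# Hypothetical candidates and logits for one dialogue line.
speaker, info = pick_speaker(["林风", "苏晴", "旁白"], [2.0, 0.0, 0.0])
unsure, unsure_info = pick_speaker(["林风", "苏晴", "旁白"], [1.0, 1.0, 1.0])
```

Returning None together with the info string lets a caller leave the Vietnamese line untouched while still logging why the classifier abstained.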
Usage
See pronoun_infer.py — PronounRenderer(clf_dir, spm_path).render(zh, vi, ctx_zh, genre_class).
Files: speaker_clf.pt (weights), known_names.json, source.spm, pronoun_infer.py.