MoxhiMT Pronoun Classifier (speaker attribution)

Small extractive speaker-attribution classifier for Chinese web-novel dialogue, used downstream of DanVP/MoxhiMT-30 to make Vietnamese pronouns consistent per speaker–addressee pair (tôi/cậu/ngươi…).

Task: given 3 sentences of context + the current line + a candidate name, score whether that candidate is the speaker (listwise softmax over candidates).
Arch: 2-layer Transformer encoder, d=256, ~7.8M params. Tokenizer = MoxhiMT source SentencePiece (24k ZH).
Candidates: dictionary lookup over 347 known character names (longest-match), plus 旁白 (narration) / 自语 (soliloquy).
Speed: ~3 ms/sentence (CPU), ~3% overhead on top of MT decode.
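
As a rough illustration of the candidate step and the listwise scoring described above, here is a minimal sketch in plain Python. It is not the code in pronoun_infer.py: the function names `find_candidates` and `listwise_softmax` are hypothetical, and the raw scores are assumed to come from the encoder, one per candidate.

```python
import math

# The two non-character labels named in the card: narration and soliloquy.
SPECIAL = ["旁白", "自语"]

def find_candidates(text, known_names):
    """Greedy longest-match scan: at each position take the longest known name."""
    # Drop empty strings so the scan always advances.
    names = sorted({n for n in known_names if n}, key=len, reverse=True)
    found, i = [], 0
    while i < len(text):
        match = next((n for n in names if text.startswith(n, i)), None)
        if match:
            if match not in found:
                found.append(match)
            i += len(match)
        else:
            i += 1
    return found + SPECIAL

def listwise_softmax(scores):
    """Turn one raw score per candidate into a distribution over the whole list."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Longest-match matters when one name is a prefix of another: with both 林 and 林小雨 in the dictionary, the scan picks 林小雨 rather than the shorter name.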

Why

MT models translate sentence-by-sentence, so the same 你/我 flips between ngươi/cậu, tôi/ta across lines for the same speaker — especially in modern settings. This classifier identifies the speaker so a renderer can normalize pronouns per relationship pair.
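
One way to picture per-pair normalization is a small memory keyed by (speaker, addressee). The sketch below is an assumption for illustration only: the class `PairMemory` and its "first seen wins" policy are hypothetical, and the card does not say how the actual renderer chooses the canonical pronoun pair.

```python
class PairMemory:
    """Remembers one (first-person, second-person) pronoun pair per relationship."""

    def __init__(self):
        self._pairs = {}

    def resolve(self, speaker, addressee, observed):
        # Hypothetical policy: the first pair observed for a relationship is
        # kept, and later lines by the same speaker to the same addressee reuse it.
        return self._pairs.setdefault((speaker, addressee), observed)
```

The key is directional, so A addressing B and B addressing A can keep different pronoun pairs.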

Honest scope

Speaker accuracy is ~50% on a multi-genre validation set (10× a regex baseline) and is fairly even across genres (modern: 63%). This is a proof of concept, not production.
Trained on ~3.8k lines labeled by DeepSeek-V4-Flash, external agents, and manual annotation.
Works best on genres present in the training pool; accuracy degrades on new novels with unseen names.
The renderer is intentionally conservative: it skips xianxia, plural 我们/咱们/你们, quotes containing more than one 你, and raw Vietnamese quote pairs like ta/ngươi when it cannot rewrite the whole pair. It only rewrites self/you when the matching Chinese pronoun is present, protects fixed Vietnamese phrases, and caps edits per quote segment.
Direct family address such as 妈，...你... / 儿子，...你... is an exception: the addressee is explicit, so multiple Vietnamese second-person pronouns in the same quoted segment may be normalized together.
render(...) returns an info string with guard status plus speaker score metadata (p, margin, candidate count) so downstream apps can surface abstentions instead of silently over-editing.
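
The guards and score metadata above could look roughly like the following sketch. Everything named here is an assumption for illustration: the `guard` and `score_meta` functions, the reason strings, the `"xianxia"` value for `genre_class`, the two-entry kinship pattern, and the definition of margin as top-1 minus top-2 probability. The real logic lives in pronoun_infer.py and covers more cases (fixed phrases, edit caps, Vietnamese-side checks).

```python
import re

PLURALS = ("我们", "咱们", "你们")
# Hypothetical pattern built only from the two kinship examples in the card.
FAMILY_VOCATIVE = re.compile(r"^(妈|儿子)[，,]")

def guard(zh_quote, genre_class):
    """Return a reason to abstain, or None if rewriting may proceed."""
    if genre_class == "xianxia":
        return "skip:xianxia"
    if any(p in zh_quote for p in PLURALS):
        return "skip:plural"
    # More than one 你 is ambiguous unless the addressee is named up front.
    if zh_quote.count("你") > 1 and not FAMILY_VOCATIVE.match(zh_quote):
        return "skip:multi-ni"
    return None

def score_meta(probs):
    """Summarize a candidate distribution as p, margin, and candidate count."""
    ranked = sorted(probs, reverse=True)
    p = ranked[0]
    # Assumed definition: margin is the gap between the top two candidates.
    margin = p - ranked[1] if len(ranked) > 1 else p
    return {"p": p, "margin": margin, "n_candidates": len(probs)}
```

A downstream app could, for example, show the original translation unchanged whenever `guard` returns a reason or the margin falls below a threshold of its choosing.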

Usage

See pronoun_infer.py — PronounRenderer(clf_dir, spm_path).render(zh, vi, ctx_zh, genre_class).

Files: speaker_clf.pt (weights), known_names.json, source.spm, pronoun_infer.py.
