MoxhiMT Pronoun Classifier (speaker attribution)

Small extractive speaker-attribution classifier for Chinese web-novel dialogue, used downstream of DanVP/MoxhiMT-30 to make Vietnamese pronouns consistent per speaker–addressee pair (tôi/cậu/ngươi…).

  • Task: given 3 sentences of context + the current line + a candidate name, score whether that candidate is the speaker (listwise softmax over candidates).
  • Arch: 2-layer Transformer encoder, d=256, ~7.8M params. Tokenizer = MoxhiMT source SentencePiece (24k ZH).
  • Candidates: dictionary lookup over 347 known character names (longest-match), plus the special labels 旁白 (narration) / 自语 (soliloquy).
  • Speed: ~3 ms/sentence (CPU), ~3% overhead on top of MT decode.
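
A minimal sketch of the candidate lookup and listwise scoring described above, not the shipped code. The three names, the helper names find_candidates and listwise_softmax, and the choice to always append the two special labels are assumptions for illustration; the actual logic lives in pronoun_infer.py and may differ.

```python
import math

# Hypothetical name list; the real model reads its 347 names from known_names.json.
KNOWN_NAMES = ["林晓", "林晓雨", "陈默"]
SPECIAL_LABELS = ["旁白", "自语"]  # narration / soliloquy


def find_candidates(text, names=KNOWN_NAMES):
    """Scan left to right, taking the longest known name at each position."""
    by_length = sorted(names, key=len, reverse=True)
    found, i = [], 0
    while i < len(text):
        match = next((n for n in by_length if text.startswith(n, i)), None)
        if match is None:
            i += 1
            continue
        if match not in found:
            found.append(match)
        i += len(match)  # jump past the match so "林晓" is not re-found inside "林晓雨"
    # Assumption: the special labels are always offered as candidates.
    return found + SPECIAL_LABELS


def listwise_softmax(scores):
    """Turn one raw score per candidate into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

In this sketch the encoder would produce one raw score per (context, current line, candidate) input, and listwise_softmax would normalize those scores across the candidates found for that line.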

Why

MT models translate sentence by sentence, so for the same speaker 你 flips between ngươi and cậu, and 我 between tôi and ta, from one line to the next, especially in modern settings. This classifier identifies the speaker so a renderer can normalize pronouns per relationship pair.

Honest scope

  • Speaker accuracy is ~50% on a multi-genre validation set (10× a regex baseline) and fairly even across genres (modern: 63%). This is a PoC, not production-ready.
  • Trained on ~3.8k lines labeled by DeepSeek-V4-Flash, external agents, and manual annotation.
  • Works best on genres present in the training pool; accuracy degrades on new novels with unseen names.
  • The renderer is intentionally conservative: it skips xianxia, plural 我们/咱们/你们, quotes with multiple second-person pronouns, and raw Vietnamese quote pairs like ta/ngươi when it cannot rewrite the whole pair. It rewrites self/you only when the matching Chinese pronoun is present, protects fixed Vietnamese phrases, and caps edits per quote segment.
  • Direct family address such as 妈,...你... / 儿子,...你... is an exception: the addressee is explicit, so multiple Vietnamese second-person pronouns in the same quoted segment may be normalized together.
  • render(...) returns an info string with guard status plus speaker score metadata (p, margin, candidate count) so downstream apps can surface abstentions instead of silently over-editing.
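
The guard behaviour listed above could look roughly like the following Python sketch. It is an illustration under stated assumptions, not the code in pronoun_infer.py: the function name render_sketch, the (text, info) return shape, the "xianxia" genre string, the thresholds, the variant lists, the protected prefix, and the info-string layout are all hypothetical. The sketch abstains when the edit cap is exceeded and does not model the family-address exception.

```python
import re
import unicodedata


def nfc(s):
    """Normalize to NFC so accented Vietnamese letters are single code points."""
    return unicodedata.normalize("NFC", s)


# All constants below are hypothetical placeholders, not values from pronoun_infer.py.
PLURALS = ("我们", "咱们", "你们")        # plural pronouns named in the scope notes
SELF_VARIANTS = tuple(map(nfc, ("tôi", "ta")))
YOU_VARIANTS = tuple(map(nfc, ("cậu", "ngươi")))
PROTECTED_PREFIXES = (nfc("chúng "),)     # keeps the fixed phrase "chúng ta" intact
MIN_P, MIN_MARGIN, MAX_EDITS = 0.60, 0.20, 2


def render_sketch(zh, vi, genre_class, probs, self_word="tôi", you_word="cậu"):
    """Return (text, info); text is unchanged whenever a guard fires.

    probs is the non-empty list of speaker probabilities for this line.
    """
    ranked = sorted(probs, reverse=True)
    p = ranked[0]
    margin = p - ranked[1] if len(ranked) > 1 else p
    meta = f"p={p:.2f} margin={margin:.2f} candidates={len(probs)}"

    if genre_class == "xianxia":
        return vi, f"skip:genre {meta}"
    if any(w in zh for w in PLURALS):
        return vi, f"skip:plural {meta}"
    if p < MIN_P or margin < MIN_MARGIN:
        return vi, f"skip:low_confidence {meta}"

    # Only rewrite a role when the matching Chinese pronoun is present.
    plans = []
    if "我" in zh:
        plans.append((SELF_VARIANTS, nfc(self_word)))
    if "你" in zh:
        plans.append((YOU_VARIANTS, nfc(you_word)))

    edits = 0

    def rewrite(match, target):
        nonlocal edits
        word = match.group(0)
        before = match.string[:match.start()].lower()
        if word.lower() == target or before.endswith(PROTECTED_PREFIXES):
            return word  # already consistent, or part of a protected phrase
        edits += 1
        return target.capitalize() if word[0].isupper() else target

    out = nfc(vi)
    for variants, target in plans:
        pattern = re.compile(r"\b(?:" + "|".join(variants) + r")\b", re.IGNORECASE)
        out = pattern.sub(lambda m, t=target: rewrite(m, t), out)

    if edits > MAX_EDITS:
        return vi, f"skip:too_many_edits {meta}"
    return out, f"ok edits={edits} {meta}"
```

Returning the original text together with a skip reason lets a downstream app show the abstention instead of silently over-editing.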

Usage

See pronoun_infer.py: PronounRenderer(clf_dir, spm_path).render(zh, vi, ctx_zh, genre_class).

Files: speaker_clf.pt (weights), known_names.json, source.spm, pronoun_infer.py.
