Personal-Facts Classifier — EmbeddingGemma-300M (multi-head)

A seven-head text classifier over the adugeen/personal-facts-msc annotation scheme. One CLS embedding from a fine-tuned unsloth/embeddinggemma-300m encoder is routed through seven independent classification heads — one per annotation dimension.

Dimension       # classes   Captures
main_category   9           Topical bucket: Preferences, Experience, Goals and plans, Routine activities, Possessions, Characteristics, Relationships, Demographics, None
time            4           Past · Present · Future · None
referent        3           Self · Other · None
duration        3           Short-term · Long-term · None
broken          2           Validity flag (No / Yes)
broken_reason   6           Failure mode when invalid: Opinion · Multiple facts · No fact · Not about self/known people · Context Insufficient · None
followup        3           Followup potential: Yes · Maybe · None

Test-set metrics

Evaluated on the 556-example test split of adugeen/personal-facts-msc.

Seed 42 (this checkpoint):

Dimension               F1       Accuracy
main_category           0.7964   0.8345
time                    0.8562   0.8921
referent                0.8427   0.9245
duration                0.8218   0.8831
broken                  0.8659   0.9371
broken_reason           0.6622   0.9137
followup                0.8134   0.9514
Macro F1 across dims    0.8267   -
Mean per-dim F1         0.8084   -
Mean per-dim accuracy   -        0.9052
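The two "mean per-dim" rows can be reproduced from the per-dimension rows. The sketch below copies the values from the table; the variable names are illustrative. The "Macro F1 across dims" row is not the plain average of these seven F1 values, so it is not recomputed here.

```python
# Per-dimension test metrics for the seed-42 checkpoint, copied from the table.
f1 = {
    "main_category": 0.7964, "time": 0.8562, "referent": 0.8427,
    "duration": 0.8218, "broken": 0.8659, "broken_reason": 0.6622,
    "followup": 0.8134,
}
accuracy = {
    "main_category": 0.8345, "time": 0.8921, "referent": 0.9245,
    "duration": 0.8831, "broken": 0.9371, "broken_reason": 0.9137,
    "followup": 0.9514,
}

# Unweighted averages over the seven dimensions.
mean_f1 = sum(f1.values()) / len(f1)
mean_acc = sum(accuracy.values()) / len(accuracy)

print(f"mean per-dim F1:       {mean_f1:.4f}")   # 0.8084
print(f"mean per-dim accuracy: {mean_acc:.4f}")  # 0.9052
```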

Five-seed mean ± standard deviation (paper-1 headline number): macro F1 = 0.8159 ± 0.0256, mean per-dim F1 = 0.7889 ± 0.0169.

This single-seed checkpoint reaches a macro F1 of 0.8267, about 6 points above GPT-5-mini (76.18% overall F1), while running on a 300M-parameter encoder.

Quick start

from transformers import AutoModel, AutoTokenizer

repo = "adugeen/personal-facts-classifier-embeddinggemma-300m"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModel.from_pretrained(repo, trust_remote_code=True).eval()

facts = [
    "I am a software engineer.",
    "My golden retriever Maggie loves the beach.",
    "She likes Indian food.",                  # invalid: third-person
    "I'm 30 and I live in Boston.",            # invalid: multiple facts
    "I want to open a jiu-jitsu studio.",
]
preds = model.predict(facts, tokenizer=tokenizer)
for fact, p in zip(facts, preds):
    print(fact)
    print({k: v for k, v in p.items() if not k.endswith("_confidence")})

predict(...) returns one dict per input with the predicted label for each of the seven dimensions plus a <dim>_confidence softmax probability.
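A prediction dict can be split into labels and confidences, for example to flag dimensions that deserve a manual check. This is a minimal sketch under two assumptions: each label is stored under the dimension name itself, and each probability under `<dim>_confidence`. The example dict is hand-written for illustration (it is not real model output), and the 0.6 threshold and the helper name `low_confidence_dims` are hypothetical.

```python
DIMENSIONS = ["main_category", "time", "referent", "duration",
              "broken", "broken_reason", "followup"]

def low_confidence_dims(pred, threshold=0.6):
    """Return the dimensions whose <dim>_confidence is below the threshold."""
    return [d for d in DIMENSIONS if pred[f"{d}_confidence"] < threshold]

# Hypothetical prediction for "She likes Indian food." (illustrative values only).
example = {
    "main_category": "Preferences", "main_category_confidence": 0.91,
    "time": "Present", "time_confidence": 0.88,
    "referent": "Other", "referent_confidence": 0.55,
    "duration": "Long-term", "duration_confidence": 0.74,
    "broken": "Yes", "broken_confidence": 0.83,
    "broken_reason": "Not about self/known people", "broken_reason_confidence": 0.41,
    "followup": "None", "followup_confidence": 0.79,
}

labels = {d: example[d] for d in DIMENSIONS}   # labels only, confidences dropped
uncertain = low_confidence_dims(example)

print(labels)
print(uncertain)  # ['referent', 'broken_reason']
```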

For raw logits, call the model directly:

import torch

inputs = tokenizer(facts, padding="max_length", truncation=True,
                   max_length=128, return_tensors="pt").to(model.device)
with torch.no_grad():         # inference only, so skip gradient tracking
    out = model(**inputs)
out.logits["main_category"]   # (batch, 9); likewise for each other dimension
out.all_logits                # (batch, 7, max_classes); padded with -inf
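Padding with -inf matters because a softmax or argmax over the padded class axis then never selects a padding slot. The sketch below shows this in plain Python on made-up logits for a 3-class head. It assumes `max_classes` is 9 (the size of the largest head, main_category); the `softmax` helper is illustrative and not part of the model's API.

```python
import math

def softmax(row):
    """Numerically stable softmax; -inf entries come out as probability 0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 3-class head, padded to 9 slots with -inf.
padded = [2.0, 0.5, -1.0] + [float("-inf")] * 6

probs = softmax(padded)
predicted = probs.index(max(probs))

print([round(p, 3) for p in probs])
print(predicted)  # index of the largest real logit, never a padded slot
```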

Training

  • Encoder: unsloth/embeddinggemma-300m (loaded via AutoModel).
  • Architecture: shared CLS-token pooling → 7× Dense(dim,dim) → tanh → Dropout → Linear(dim, n_classes) heads.
  • Loss: mean of per-head CrossEntropy (ignoring index -1), weighted equally across dimensions.
  • Data: 2,223 manually annotated personal facts from adugeen/personal-facts-msc (train split).
  • Optimizer / schedule: AdamW · 10 epochs · max sequence length 128 · batch-size 32 · 5 random seeds; this checkpoint is seed 42.
  • Architecture/encoder ablations: the multi-head architecture beat hierarchical and conditional variants on RoBERTa-large; EmbeddingGemma-300M beat ModernBERT-large, RoBERTa-large, and DeBERTa-v3-large at this size.
  • Hardware: single A100 (40 GB), ≈3 minutes per seed.
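The head layout and loss listed above can be sketched in PyTorch as follows. This is an illustrative reconstruction from the bullet points, not the released training code: the class and function names, the dropout rate of 0.1, and the toy hidden size of 16 are assumptions, and random tensors stand in for the pooled encoder output.

```python
import torch
from torch import nn

# Number of classes per annotation dimension, taken from the table above.
NUM_CLASSES = {"main_category": 9, "time": 4, "referent": 3, "duration": 3,
               "broken": 2, "broken_reason": 6, "followup": 3}

class MultiHeadClassifier(nn.Module):
    """Seven independent heads on one shared pooled embedding."""
    def __init__(self, dim, num_classes, dropout=0.1):  # dropout rate is assumed
        super().__init__()
        self.heads = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(dim, dim), nn.Tanh(),
                                nn.Dropout(dropout), nn.Linear(dim, n))
            for name, n in num_classes.items()
        })

    def forward(self, pooled):
        # pooled: (batch, dim) -> dict of (batch, n_classes) logits per head
        return {name: head(pooled) for name, head in self.heads.items()}

def multi_head_loss(logits, labels):
    """Equal-weight mean of per-head cross-entropy; label -1 is ignored.
    Note: a head whose labels are all -1 in a batch yields NaN."""
    losses = [nn.functional.cross_entropy(logits[name], labels[name], ignore_index=-1)
              for name in logits]
    return torch.stack(losses).mean()

torch.manual_seed(0)
clf = MultiHeadClassifier(dim=16, num_classes=NUM_CLASSES)
pooled = torch.randn(4, 16)                      # stand-in for encoder output
labels = {name: torch.zeros(4, dtype=torch.long) for name in NUM_CLASSES}
labels["broken_reason"][0] = -1                  # one unlabeled example
logits = clf(pooled)
loss = multi_head_loss(logits, labels)
print({name: tuple(t.shape) for name, t in logits.items()}, loss.item())
```

Keeping the heads in an `nn.ModuleDict` registers their parameters with the optimizer while still letting the loss loop address each head by dimension name.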

Limitations

  • broken_reason is the noisiest dimension (F1 0.66): the underlying annotation has the lowest inter-annotator agreement (Fleiss' κ = 0.458) among the seven dimensions.
  • Possession-attribution bias: the model occasionally labels attributes of named pets as Self rather than Other (e.g. "My golden retriever is named Maggie and is nine years old").
  • Pragmatic reasoning gaps: long-term aspirations are sometimes labeled short-term; followup potential is under-estimated for concrete goals.
  • English / dialogue domain only. The training set comes from Multi-Session Chat — an English crowd-authored corpus. Distributions and category boundaries are unlikely to transfer to clinical, legal, or non-English text.
  • Single-seed checkpoint. For more robust deployments, ensemble the five released seeds; macro F1 std-dev across seeds is ≈2.6%.
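One common way to ensemble several seeds is to average their per-class probabilities and take the argmax. The sketch below shows that approach for a single head; it is not confirmed to be the procedure used in the paper. The probabilities are made up, the function name is hypothetical, and the class order is assumed to follow the table.

```python
def ensemble_predict(per_seed_probs, class_names):
    """Average class probabilities across seeds; return (label, mean probability)."""
    n_seeds = len(per_seed_probs)
    mean_probs = [sum(p[i] for p in per_seed_probs) / n_seeds
                  for i in range(len(class_names))]
    best = max(range(len(class_names)), key=mean_probs.__getitem__)
    return class_names[best], mean_probs[best]

DURATION_CLASSES = ["Short-term", "Long-term", "None"]  # assumed class order

# Hypothetical softmax outputs of the duration head from five seeds for one fact.
per_seed = [
    [0.55, 0.40, 0.05],
    [0.30, 0.65, 0.05],
    [0.45, 0.50, 0.05],
    [0.35, 0.60, 0.05],
    [0.60, 0.35, 0.05],
]

label, prob = ensemble_predict(per_seed, DURATION_CLASSES)
print(label, round(prob, 2))  # Long-term 0.5
```

Two of the five seeds prefer Short-term here, yet the averaged distribution settles on Long-term, which is the kind of seed-to-seed disagreement an ensemble smooths out.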

Intended use

  • Quality filtering / curation of persona corpora — drop or flag candidates predicted broken=Yes.
  • Memory policy in dialogue agents — duration and followup heads drive long-term-memory eviction and topic-continuation.
  • Distribution audits of persona datasets — the same model was used in the accompanying paper to characterize PersonaChat (6,126 facts) and the full MSC corpus (55,795 facts).
  • Retrieval filtering — predicted main_category is used as a category-conditioned filter in downstream learned-retrieval experiments.
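The first two uses above can be sketched as small functions over prediction dicts. This is a toy policy under stated assumptions: labels are stored under the dimension names with the label strings from the table, the action names are invented for illustration, and the example predictions are hand-written rather than real model output.

```python
def keep_for_corpus(pred):
    """Curation filter: keep only facts predicted as valid (broken == "No")."""
    return pred["broken"] == "No"

def memory_action(pred):
    """Toy memory policy driven by the duration and followup heads."""
    if pred["broken"] == "Yes":
        return "discard"
    if pred["duration"] == "Long-term":
        return "store_long_term"
    if pred["followup"] in ("Yes", "Maybe"):
        return "store_for_followup"
    return "store_short_term"

# Hypothetical predictions (only the fields used here are shown).
preds = [
    {"fact": "I am a software engineer.",
     "broken": "No", "duration": "Long-term", "followup": "None"},
    {"fact": "She likes Indian food.",
     "broken": "Yes", "duration": "Long-term", "followup": "None"},
    {"fact": "I am running a 10k this weekend.",
     "broken": "No", "duration": "Short-term", "followup": "Yes"},
]

kept = [p["fact"] for p in preds if keep_for_corpus(p)]
actions = [memory_action(p) for p in preds]
print(kept)
print(actions)  # ['store_long_term', 'discard', 'store_for_followup']
```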

License & citation

Released under CC BY-SA 4.0, matching the underlying dataset.

If you use this model, please cite the following (a placeholder until the proceedings are published):

@misc{personal-facts-classifier-2026,
  title  = {An Extended Annotation Scheme for Personal-Fact Classification in Dialogue},
  author = {Zaitsev, Konstantin},
  year   = {2026},
  note   = {Model: https://huggingface.co/adugeen/personal-facts-classifier-embeddinggemma-300m;
            Dataset: https://huggingface.co/datasets/adugeen/personal-facts-msc}
}