Personal-Facts Classifier — EmbeddingGemma-300M (multi-head)

A seven-head text classifier over the adugeen/personal-facts-msc annotation scheme. One CLS embedding from a fine-tuned unsloth/embeddinggemma-300m encoder is routed through seven independent classification heads — one per annotation dimension.

Dimension	# classes	Captures
`main_category`	9	Topical bucket: Preferences, Experience, Goals and plans, Routine activities, Possessions, Characteristics, Relationships, Demographics, None
`time`	4	Past · Present · Future · None
`referent`	3	Self · Other · None
`duration`	3	Short-term · Long-term · None
`broken`	2	Validity flag (No / Yes)
`broken_reason`	6	Failure mode when invalid: Opinion · Multiple facts · No fact · Not about self/known people · Context Insufficient · None
`followup`	3	Followup potential: Yes · Maybe · None

Test-set metrics

Evaluated on the 556-example test split of adugeen/personal-facts-msc.

Seed 42 (this checkpoint):

Dimension	F1	Accuracy
main_category	0.7964	0.8345
time	0.8562	0.8921
referent	0.8427	0.9245
duration	0.8218	0.8831
broken	0.8659	0.9371
broken_reason	0.6622	0.9137
followup	0.8134	0.9514
Macro F1 across dims	0.8267	—
Mean per-dim F1	0.8084	—
Mean per-dim accuracy	—	0.9052
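The two mean rows can be reproduced from the per-dimension rows above. The sketch below is plain Python and only transcribes the table; the separately reported "Macro F1 across dims" (0.8267) is a different aggregate whose exact definition is not given here, so it is not recomputed.

```python
# Per-dimension (F1, accuracy) pairs transcribed from the seed-42 table above.
scores = {
    "main_category": (0.7964, 0.8345),
    "time": (0.8562, 0.8921),
    "referent": (0.8427, 0.9245),
    "duration": (0.8218, 0.8831),
    "broken": (0.8659, 0.9371),
    "broken_reason": (0.6622, 0.9137),
    "followup": (0.8134, 0.9514),
}

# Unweighted means over the seven dimensions.
mean_f1 = sum(f1 for f1, _ in scores.values()) / len(scores)
mean_acc = sum(acc for _, acc in scores.values()) / len(scores)

print(f"mean per-dim F1:       {mean_f1:.4f}")   # 0.8084
print(f"mean per-dim accuracy: {mean_acc:.4f}")  # 0.9052
```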

Across five seeds (paper-1 headline number; mean ± std. dev.): macro F1 = 0.8159 ± 0.0256, mean per-dim F1 = 0.7889 ± 0.0169.

With a macro F1 of 82.67%, this single-seed checkpoint outperforms GPT-5-mini (76.18% overall F1) by ≈6 points while running on a 300M-parameter encoder.

Quick start

from transformers import AutoModel, AutoTokenizer

repo = "adugeen/personal-facts-classifier-embeddinggemma-300m"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModel.from_pretrained(repo, trust_remote_code=True).eval()

facts = [
    "I am a software engineer.",
    "My golden retriever Maggie loves the beach.",
    "She likes Indian food.",                  # invalid: third-person
    "I'm 30 and I live in Boston.",            # invalid: multiple facts
    "I want to open a jiu-jitsu studio.",
]
preds = model.predict(facts, tokenizer=tokenizer)
for fact, p in zip(facts, preds):
    print(fact)
    print({k: v for k, v in p.items() if not k.endswith("_confidence")})

predict(...) returns one dict per input with the predicted label for each of the seven dimensions plus a <dim>_confidence softmax probability.
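The per-dimension confidences can be used to route uncertain predictions to manual review. The following is a minimal sketch, assuming each returned dict holds a `<dim>` key and a matching `<dim>_confidence` key as described above; the helper name `low_confidence_dims`, the 0.6 threshold, and the example dict are hypothetical and hand-written for illustration.

```python
DIMENSIONS = ["main_category", "time", "referent", "duration",
              "broken", "broken_reason", "followup"]


def low_confidence_dims(pred, threshold=0.6):
    """Return the dimensions whose softmax confidence falls below `threshold`."""
    return [d for d in DIMENSIONS if pred[f"{d}_confidence"] < threshold]


# Hand-written stand-in for one predict(...) result (not real model output).
example = {
    "main_category": "Characteristics", "main_category_confidence": 0.91,
    "time": "Present", "time_confidence": 0.88,
    "referent": "Self", "referent_confidence": 0.97,
    "duration": "Long-term", "duration_confidence": 0.55,
    "broken": "No", "broken_confidence": 0.93,
    "broken_reason": "None", "broken_reason_confidence": 0.58,
    "followup": "Maybe", "followup_confidence": 0.71,
}

print(low_confidence_dims(example))  # -> ['duration', 'broken_reason']
```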

For raw logits, call the model directly:

import torch

inputs = tokenizer(facts, padding="max_length", truncation=True,
                   max_length=128, return_tensors="pt").to(model.device)
with torch.no_grad():         # inference only, so skip gradient tracking
    out = model(**inputs)
out.logits["main_category"]   # (batch, 9); likewise for each other dimension
out.all_logits                # (batch, 7, max_classes), padded with -inf
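Because out.all_logits pads the smaller heads with -inf, a softmax over the class axis gives the padded slots exactly zero probability, so the padded tensor can be decoded without slicing each head to its true width. A minimal plain-Python sketch of that property, using made-up logits for a 3-class head padded to 4 slots:

```python
import math


def softmax(logits):
    """Numerically stable softmax; -inf entries receive probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


NEG_INF = float("-inf")

# Made-up logits: three real classes followed by one padding slot.
padded = [2.0, 0.5, -1.0, NEG_INF]
probs = softmax(padded)

# Argmax can never land on a padded slot because its probability is 0.
pred = max(range(len(probs)), key=probs.__getitem__)
print(probs[-1], pred)  # -> 0.0 0
```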

Training

Encoder: unsloth/embeddinggemma-300m (loaded via AutoModel).
Architecture: shared CLS-token pooling → 7× Dense(dim,dim) → tanh → Dropout → Linear(dim, n_classes) heads.
Loss: mean of per-head CrossEntropy (ignoring index -1), weighted equally across dimensions.
Data: 2,223 manually annotated personal facts from adugeen/personal-facts-msc (train split).
Optimizer / schedule: AdamW · 10 epochs · max sequence length 128 · batch-size 32 · 5 random seeds; this checkpoint is seed 42.
Architecture/encoder ablations: the multi-head architecture beat hierarchical and conditional variants on RoBERTa-large; EmbeddingGemma-300M beat ModernBERT-large, RoBERTa-large, and DeBERTa-v3-large at this size.
Hardware: single A100 (40 GB), ≈3 minutes per seed.
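The head layout and loss described above can be sketched in PyTorch as follows. This is an illustrative reconstruction from the description, not the released training code: the class name `MultiHeadClassifier`, the function `multi_head_loss`, the dropout rate, and the toy width of 32 are all assumptions, and a random tensor stands in for the encoder's pooled output.

```python
import torch
from torch import nn

N_CLASSES = {"main_category": 9, "time": 4, "referent": 3, "duration": 3,
             "broken": 2, "broken_reason": 6, "followup": 3}


class MultiHeadClassifier(nn.Module):
    """One shared pooled embedding feeding an independent MLP head per dimension."""

    def __init__(self, dim, n_classes, dropout=0.1):  # dropout rate is assumed
        super().__init__()
        self.heads = nn.ModuleDict({
            name: nn.Sequential(
                nn.Linear(dim, dim),   # Dense(dim, dim)
                nn.Tanh(),
                nn.Dropout(dropout),
                nn.Linear(dim, n),     # Linear(dim, n_classes)
            )
            for name, n in n_classes.items()
        })

    def forward(self, pooled):
        return {name: head(pooled) for name, head in self.heads.items()}


def multi_head_loss(logits, labels):
    """Equal-weight mean of per-head cross-entropy; label -1 is ignored."""
    losses = [
        nn.functional.cross_entropy(logits[name], labels[name], ignore_index=-1)
        for name in logits
    ]
    return torch.stack(losses).mean()


torch.manual_seed(0)
heads = MultiHeadClassifier(dim=32, n_classes=N_CLASSES)  # toy width, not the encoder's
pooled = torch.randn(4, 32)                               # stand-in for encoder output
labels = {name: torch.randint(0, n, (4,)) for name, n in N_CLASSES.items()}
labels["broken_reason"][0] = -1                           # one unlabeled example

logits = heads(pooled)
loss = multi_head_loss(logits, labels)
loss.backward()
print({name: tuple(t.shape) for name, t in logits.items()})
```

Note that if every label in a head's batch is -1, that head's cross-entropy is undefined (NaN), so a real training loop would need to skip such heads.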

Limitations

broken_reason is the noisiest dimension (F1 0.66): the underlying annotation has the lowest inter-annotator agreement (Fleiss' κ = 0.458) among the seven dimensions.
Possession-attribution bias: the model occasionally labels attributes of named pets as Self rather than Other (e.g. "My golden retriever is named Maggie and is nine years old").
Pragmatic reasoning gaps: long-term aspirations are sometimes labeled short-term; followup potential is underestimated for concrete goals.
English / dialogue domain only. The training set comes from Multi-Session Chat — an English crowd-authored corpus. Distributions and category boundaries are unlikely to transfer to clinical, legal, or non-English text.
Single-seed checkpoint. For more robust deployments, ensemble the five released seeds; the standard deviation of macro F1 across seeds is ≈0.026 (2.6 points).
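One simple way to ensemble several seeds is soft voting: average the per-class probabilities from each seed and take the argmax. The sketch below assumes that approach (the source does not say which ensembling scheme is intended); the function name `ensemble_predict`, the label order, and the probabilities are hypothetical.

```python
def ensemble_predict(per_seed_probs, labels):
    """Average class probabilities across seeds, then pick the top class."""
    n_seeds = len(per_seed_probs)
    n_classes = len(labels)
    avg = [sum(p[c] for p in per_seed_probs) / n_seeds for c in range(n_classes)]
    best = max(range(n_classes), key=avg.__getitem__)
    return labels[best], avg[best]


# Assumed label order for the `duration` head (illustrative only).
DURATION_LABELS = ["Short-term", "Long-term", "None"]

# Made-up softmax outputs from five seeds for a single input.
probs = [
    [0.55, 0.40, 0.05],
    [0.30, 0.65, 0.05],
    [0.35, 0.60, 0.05],
    [0.52, 0.43, 0.05],
    [0.25, 0.70, 0.05],
]

label, conf = ensemble_predict(probs, DURATION_LABELS)
print(label, round(conf, 3))  # -> Long-term 0.556
```

Two of the five seeds prefer Short-term on their own, but the averaged distribution favors Long-term, which is the kind of seed-level disagreement that ensembling smooths out.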

Intended use

Quality filtering / curation of persona corpora — drop or flag candidates predicted broken=Yes.
Memory policy in dialogue agents — duration and followup heads drive long-term-memory eviction and topic-continuation.
Distribution audits of persona datasets — the same model was used in the accompanying paper to characterize PersonaChat (6,126 facts) and the full MSC corpus (55,795 facts).
Retrieval filtering — predicted main_category is used as a category-conditioned filter in downstream learned-retrieval experiments.
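The first two uses above can be sketched on top of predict(...) output. The example below is a hypothetical policy, not part of the released model: `curate` and `memory_store` are illustrative names, the prediction dicts are hand-written, and the label strings follow the dimension table at the top of this card.

```python
def curate(facts, preds):
    """Split (fact, prediction) pairs into kept and flagged by the `broken` head."""
    kept, flagged = [], []
    for fact, p in zip(facts, preds):
        (flagged if p["broken"] == "Yes" else kept).append((fact, p))
    return kept, flagged


def memory_store(pred):
    """Hypothetical policy: long-term facts persist, everything else is per-session."""
    return "persistent" if pred["duration"] == "Long-term" else "session"


facts = [
    "I am a software engineer.",
    "She likes Indian food.",
    "I have a cold this week.",
]
# Hand-written stand-ins for predict(...) results (not real model output).
preds = [
    {"broken": "No", "duration": "Long-term"},
    {"broken": "Yes", "duration": "Long-term"},
    {"broken": "No", "duration": "Short-term"},
]

kept, flagged = curate(facts, preds)
for fact, p in kept:
    print(memory_store(p), "|", fact)
print("flagged:", [fact for fact, _ in flagged])
```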

License & citation

Released under CC BY-SA 4.0, matching the underlying dataset.

If you use this model, please cite the following (a placeholder until the proceedings are published):

@misc{personal-facts-classifier-2026,
  title  = {An Extended Annotation Scheme for Personal-Fact Classification in Dialogue},
  author = {Zaitsev, Konstantin},
  year   = {2026},
  note   = {Model: https://huggingface.co/adugeen/personal-facts-classifier-embeddinggemma-300m;
            Dataset: https://huggingface.co/datasets/adugeen/personal-facts-msc}
}
