Personal-Facts Classifier — EmbeddingGemma-300M (multi-head)
A seven-head text classifier over the `adugeen/personal-facts-msc` annotation scheme. One CLS embedding from a fine-tuned `unsloth/embeddinggemma-300m` encoder is routed through seven independent classification heads, one per annotation dimension.
| Dimension | # classes | Captures |
|---|---|---|
| main_category | 9 | Topical bucket: Preferences, Experience, Goals and plans, Routine activities, Possessions, Characteristics, Relationships, Demographics, None |
| time | 4 | Past · Present · Future · None |
| referent | 3 | Self · Other · None |
| duration | 3 | Short-term · Long-term · None |
| broken | 2 | Validity flag (No / Yes) |
| broken_reason | 6 | Failure mode when invalid: Opinion · Multiple facts · No fact · Not about self/known people · Context Insufficient · None |
| followup | 3 | Followup potential: Yes · Maybe · None |
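The label inventory in the table can be written down as a plain mapping, which makes the per-head output sizes easy to check. This is a minimal sketch: the name `LABELS` is hypothetical, and the order of labels inside each list is illustrative only, since this card does not state the checkpoint's index-to-label mapping.

```python
# Label inventory transcribed from the dimension table.
# ASSUMPTION: the order within each list is illustrative; the checkpoint's
# real index-to-label mapping is not given in this card.
LABELS = {
    "main_category": [
        "Preferences", "Experience", "Goals and plans", "Routine activities",
        "Possessions", "Characteristics", "Relationships", "Demographics", "None",
    ],
    "time": ["Past", "Present", "Future", "None"],
    "referent": ["Self", "Other", "None"],
    "duration": ["Short-term", "Long-term", "None"],
    "broken": ["No", "Yes"],
    "broken_reason": [
        "Opinion", "Multiple facts", "No fact",
        "Not about self/known people", "Context Insufficient", "None",
    ],
    "followup": ["Yes", "Maybe", "None"],
}

# Number of classes per head, the widest head, and the total output units.
class_counts = {dim: len(names) for dim, names in LABELS.items()}
max_classes = max(class_counts.values())
total_outputs = sum(class_counts.values())

print(class_counts)
print(max_classes, total_outputs)
```

The widest head (`main_category`, 9 classes) sets the padded class axis when the seven heads are stacked into a single tensor.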
Test-set metrics
Evaluated on the 556-example test split of adugeen/personal-facts-msc.
Seed 42 (this checkpoint):
| Dimension | F1 | Accuracy |
|---|---|---|
| main_category | 0.7964 | 0.8345 |
| time | 0.8562 | 0.8921 |
| referent | 0.8427 | 0.9245 |
| duration | 0.8218 | 0.8831 |
| broken | 0.8659 | 0.9371 |
| broken_reason | 0.6622 | 0.9137 |
| followup | 0.8134 | 0.9514 |
| Macro F1 across dims | 0.8267 | — |
| Mean per-dim F1 | 0.8084 | — |
| Mean per-dim accuracy | — | 0.9052 |
Five-seed ensemble (paper-1 headline number): macro F1 = 0.8159 ± 0.0256, mean per-dim F1 = 0.7889 ± 0.0169.
This single-seed checkpoint outperforms GPT-5-mini (76.18% overall F1) by ≈6 points while running on a 300M-parameter encoder.
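The two "mean per-dim" rows can be reproduced directly from the seed-42 table. The sketch below does that arithmetic. It does not try to reproduce the 0.8267 "macro F1 across dims" figure, because its exact definition is not given here, and the gap to GPT-5-mini is computed on the assumption that the comparison uses that 0.8267 figure.

```python
# Seed-42 per-dimension scores, copied from the table above.
f1 = {
    "main_category": 0.7964, "time": 0.8562, "referent": 0.8427,
    "duration": 0.8218, "broken": 0.8659, "broken_reason": 0.6622,
    "followup": 0.8134,
}
accuracy = {
    "main_category": 0.8345, "time": 0.8921, "referent": 0.9245,
    "duration": 0.8831, "broken": 0.9371, "broken_reason": 0.9137,
    "followup": 0.9514,
}

# Unweighted means over the seven dimensions.
mean_f1 = round(sum(f1.values()) / len(f1), 4)
mean_acc = round(sum(accuracy.values()) / len(accuracy), 4)

# ASSUMPTION: the GPT-5-mini comparison is made against the 0.8267 macro F1.
gap_points = round(0.8267 * 100 - 76.18, 2)

print(mean_f1, mean_acc, gap_points)
```

Under these assumptions the script prints `0.8084 0.9052 6.49`, matching the two mean rows and the "≈6 points" margin.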
Quick start
```python
from transformers import AutoModel, AutoTokenizer

repo = "adugeen/personal-facts-classifier-embeddinggemma-300m"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModel.from_pretrained(repo, trust_remote_code=True).eval()

facts = [
    "I am a software engineer.",
    "My golden retriever Maggie loves the beach.",
    "She likes Indian food.",        # invalid: third-person
    "I'm 30 and I live in Boston.",  # invalid: multiple facts
    "I want to open a jiu-jitsu studio.",
]

preds = model.predict(facts, tokenizer=tokenizer)
for fact, p in zip(facts, preds):
    print(fact)
    print({k: v for k, v in p.items() if not k.endswith("_confidence")})
```
`predict(...)` returns one dict per input with the predicted label for each
of the seven dimensions plus a `<dim>_confidence` softmax probability.
For raw logits, call the model directly:
```python
inputs = tokenizer(facts, padding="max_length", truncation=True,
                   max_length=128, return_tensors="pt").to(model.device)
out = model(**inputs)
out.logits["main_category"]  # (batch, 9); likewise for each dimension
out.all_logits               # (batch, 7, max_classes), padded with -inf
```
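Because `all_logits` is padded with `-inf`, a softmax over the class axis gives the padded slots exactly zero probability. The pure-Python sketch below illustrates this on one made-up row for a 3-class head padded to 9 slots. The logit values are invented, and treating the confidence as the maximum softmax probability is an assumption about how `<dim>_confidence` is derived.

```python
import math

NEG_INF = float("-inf")

def softmax(row):
    """Numerically stable softmax; -inf entries come out as exactly 0.0."""
    peak = max(x for x in row if x != NEG_INF)
    exps = [math.exp(x - peak) for x in row]  # math.exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 3-class head, padded to max_classes = 9.
padded_row = [2.0, 0.5, -1.0] + [NEG_INF] * 6

probs = softmax(padded_row)
predicted_index = max(range(len(probs)), key=probs.__getitem__)
confidence = probs[predicted_index]  # ASSUMPTION: confidence = max softmax prob

print(predicted_index, round(confidence, 4))
```

The padding therefore never affects the argmax or the probabilities of the real classes, so a single softmax can be applied to the stacked tensor without masking each head separately.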
Training
- Encoder: `unsloth/embeddinggemma-300m` (loaded via `AutoModel`).
- Architecture: shared CLS-token pooling → 7× `Dense(dim, dim) → tanh → Dropout → Linear(dim, n_classes)` heads.
- Loss: mean of per-head CrossEntropy (ignoring index `-1`), weighted equally across dimensions.
- Data: 2,223 manually annotated personal facts from `adugeen/personal-facts-msc` (train split).
- Optimizer / schedule: AdamW · 10 epochs · max sequence length 128 · batch size 32 · 5 random seeds; this checkpoint is seed 42.
- Architecture/encoder ablations: the multi-head architecture beat hierarchical and conditional variants on RoBERTa-large; EmbeddingGemma-300M beat ModernBERT-large, RoBERTa-large, and DeBERTa-v3-large at this size.
- Hardware: single A100 (40 GB), ≈3 minutes per seed.
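The loss described in the list above can be sketched in plain Python to make the averaging explicit. This is not the training code. The function names are hypothetical, the toy logits are invented, and skipping a head that has no labelled examples in a batch is an assumption (in PyTorch one would more likely use `torch.nn.CrossEntropyLoss(ignore_index=-1)` per head).

```python
import math

IGNORE_INDEX = -1  # examples with this target do not contribute to a head's loss

def cross_entropy(logits, target):
    """Negative log-softmax of the target class, computed stably."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]

def head_loss(batch_logits, targets):
    """Mean cross-entropy over the examples whose target is not ignored."""
    losses = [
        cross_entropy(logits, target)
        for logits, target in zip(batch_logits, targets)
        if target != IGNORE_INDEX
    ]
    return sum(losses) / len(losses) if losses else None

def multi_head_loss(logits_by_head, targets_by_head):
    """Equal-weight mean of the per-head losses.

    ASSUMPTION: a head with no labelled example in the batch is skipped.
    """
    per_head = [
        head_loss(logits_by_head[name], targets_by_head[name])
        for name in logits_by_head
    ]
    per_head = [loss for loss in per_head if loss is not None]
    return sum(per_head) / len(per_head)

# Toy batch with two heads and uniform (all-zero) logits.
logits_by_head = {
    "broken": [[0.0, 0.0], [0.0, 0.0]],  # second example is ignored below
    "time": [[0.0, 0.0, 0.0, 0.0]],
}
targets_by_head = {"broken": [0, IGNORE_INDEX], "time": [2]}

loss = multi_head_loss(logits_by_head, targets_by_head)
print(round(loss, 4))
```

With uniform logits each head's loss is the log of its class count, so the toy result is (ln 2 + ln 4) / 2. This shows that every dimension carries the same weight regardless of how many classes it has.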
Limitations
- `broken_reason` is the noisiest dimension (F1 0.66): the underlying annotation has the lowest inter-annotator agreement (Fleiss' κ = 0.458) among the seven dimensions.
- Possession-attribution bias: the model occasionally labels attributes of named pets as `Self` rather than `Other` (e.g. "My golden retriever is named Maggie and is nine years old").
- Pragmatic reasoning gaps: long-term aspirations are sometimes labeled short-term; followup potential is under-estimated for concrete goals.
- English / dialogue domain only. The training set comes from Multi-Session Chat — an English crowd-authored corpus. Distributions and category boundaries are unlikely to transfer to clinical, legal, or non-English text.
- Single-seed checkpoint. For more robust deployments, ensemble the five released seeds; macro F1 std-dev across seeds is ≈2.6%.
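The card does not say how the five seeds should be combined. One common option, shown here purely as an assumption, is to average the per-class softmax probabilities from each seed and take the argmax. The probabilities below are made up for a single 3-class head and do not come from the released checkpoints.

```python
def ensemble_probs(per_seed_probs):
    """Average class probabilities across seeds (simple soft voting)."""
    n_seeds = len(per_seed_probs)
    n_classes = len(per_seed_probs[0])
    return [
        sum(seed[c] for seed in per_seed_probs) / n_seeds
        for c in range(n_classes)
    ]

# Hypothetical softmax outputs of five seeds for one fact on a 3-class head.
per_seed_probs = [
    [0.60, 0.30, 0.10],
    [0.40, 0.50, 0.10],
    [0.70, 0.20, 0.10],
    [0.45, 0.45, 0.10],
    [0.50, 0.40, 0.10],
]

avg = ensemble_probs(per_seed_probs)
best = max(range(len(avg)), key=avg.__getitem__)
print(best, [round(p, 2) for p in avg])
```

Soft voting keeps the averaged vector a valid probability distribution, so it can still be used as a confidence score. Majority voting over the per-seed labels would be an equally reasonable alternative.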
Intended use
- Quality filtering / curation of persona corpora: drop or flag candidates predicted `broken=Yes`.
- Memory policy in dialogue agents: the `duration` and `followup` heads drive long-term-memory eviction and topic continuation.
- Distribution audits of persona datasets: the same model was used in the accompanying paper to characterize PersonaChat (6,126 facts) and the full MSC corpus (55,795 facts).
- Retrieval filtering: predicted `main_category` is used as a category-conditioned filter in downstream learned-retrieval experiments.
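The first two use cases can be sketched over dicts shaped like the output of `predict(...)`. Everything here is illustrative. The functions `keep_fact` and `memory_action`, the 0.5 threshold, and the three action names are hypothetical. The label strings are assumed to match the dimension table, and the sample predictions are invented.

```python
def keep_fact(pred, min_confidence=0.5):
    """Drop a fact only if it is predicted broken with enough confidence."""
    is_broken = pred["broken"] == "Yes"
    return not (is_broken and pred["broken_confidence"] >= min_confidence)

def memory_action(pred):
    """Hypothetical memory policy built on the duration and followup heads."""
    if pred["duration"] == "Long-term":
        return "store"
    if pred["followup"] in ("Yes", "Maybe"):
        return "keep_for_session"
    return "evict"

# Invented predictions in the assumed predict() output shape.
preds = [
    {"broken": "No", "broken_confidence": 0.97,
     "duration": "Long-term", "followup": "Yes"},
    {"broken": "Yes", "broken_confidence": 0.88,
     "duration": "None", "followup": "None"},
    {"broken": "No", "broken_confidence": 0.91,
     "duration": "Short-term", "followup": "Maybe"},
]

kept = [p for p in preds if keep_fact(p)]
actions = [memory_action(p) for p in kept]
print(len(kept), actions)
```

Gating on `broken_confidence` rather than the label alone lets a pipeline flag borderline cases for review instead of discarding them.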
License & citation
Released under CC BY-SA 4.0, matching the underlying dataset.
If you use this model, please cite (placeholder until proceedings publish):
@misc{personal-facts-classifier-2026,
title = {An Extended Annotation Scheme for Personal-Fact Classification in Dialogue},
author = {Zaitsev, Konstantin},
year = {2026},
note = {Model: https://huggingface.co/adugeen/personal-facts-classifier-embeddinggemma-300m;
Dataset: https://huggingface.co/datasets/adugeen/personal-facts-msc}
}