Tiny Specialized Encoder Models Beat Popular LLMs at PII Entity Extraction
kalyan-ks
Lightweight PII Detection Model | Open Source | 32M Parameters | 95.73 F1 Score
Ettin-32m-nemotron-pii is based on the ettin-encoder-32m model and fine-tuned on the Nemotron PII dataset. It detects 55 PII entity types in both structured and unstructured text across domains such as healthcare, finance, legal, and cybersecurity. With just 32M parameters, the model achieves an F1 score of 95.73.
The model can detect the following 55 PII entity types:
| Entity | Description |
|---|---|
| account_number | Account Number |
| age | Age |
| api_key | API Key |
| bank_routing_number | Bank Routing Number |
| biometric_identifier | Biometric Identifier |
| blood_type | Blood Type |
| certificate_license_number | Certificate or License Number |
| city | City |
| company_name | Company Name |
| coordinate | Geographic Coordinate |
| country | Country |
| county | County |
| credit_debit_card | Credit or Debit Card Number |
| customer_id | Customer ID |
| cvv | Card Verification Value (CVV) |
| date | Date |
| date_of_birth | Date of Birth |
| date_time | Date and Time |
| device_identifier | Device Identifier |
| education_level | Education Level |
| email | Email Address |
| employee_id | Employee ID |
| employment_status | Employment Status |
| fax_number | Fax Number |
| first_name | First Name |
| gender | Gender |
| health_plan_beneficiary_number | Health Plan Beneficiary Number |
| http_cookie | HTTP Cookie |
| ipv4 | IPv4 Address |
| ipv6 | IPv6 Address |
| language | Language |
| last_name | Last Name |
| license_plate | Vehicle License Plate |
| mac_address | MAC Address |
| medical_record_number | Medical Record Number |
| national_id | National Identification Number |
| occupation | Occupation |
| password | Password |
| phone_number | Phone Number |
| pin | Personal Identification Number (PIN) |
| political_view | Political View |
| postcode | Postcode / Zip Code |
| race_ethnicity | Race or Ethnicity |
| religious_belief | Religious Belief |
| sexuality | Sexuality / Sexual Orientation |
| ssn | Social Security Number |
| state | State |
| street_address | Street Address |
| swift_bic | SWIFT / BIC Code |
| tax_id | Tax Identification Number |
| time | Time |
| unique_id | Unique Identifier |
| url | URL / Web Address |
| user_name | Username |
| vehicle_identifier | Vehicle Identification Number (VIN) |
```python
# Install the Hugging Face Transformers library first (notebook cell syntax)
!pip install transformers

# Initialize and run the PII detection pipeline to extract PII entities
from transformers import pipeline

# Initialize the PII detection pipeline
ner = pipeline("ner", model="kalyan-ks/ettin-32m-nemotron-pii", aggregation_strategy="simple")

input_text = "Kalyan KS is from India. His email id is kalyan.ks@yahoo.com"

# Run the PII detection pipeline
pii_entities = ner(input_text)

# Display the extracted PII entities
for entity in pii_entities:
    print(f"{entity['entity_group']}: {entity['word']} (Score: {entity['score']:.2f})")
```
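A common next step after extraction is masking the detected spans. The sketch below assumes each pipeline result carries `start` and `end` character offsets alongside `entity_group`, which is what the Transformers token-classification pipeline returns when an aggregation strategy is set. The `redact` and `span` helpers are hypothetical, and the sample entity list is a hand-written stand-in with illustrative labels, not actual model output.

```python
def redact(text, entities):
    """Replace each detected span with a [LABEL] placeholder."""
    redacted = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group'].upper()}]"
        redacted = redacted[:ent["start"]] + placeholder + redacted[ent["end"]:]
    return redacted


text = "Kalyan KS is from India. His email id is kalyan.ks@yahoo.com"


def span(label, value):
    """Build a pipeline-style result dict for the first occurrence of value (illustration only)."""
    start = text.index(value)
    return {"entity_group": label, "word": value, "start": start, "end": start + len(value)}


# Hand-written stand-in for the pipeline output (hypothetical labels).
sample_entities = [
    span("first_name", "Kalyan"),
    span("country", "India"),
    span("email", "kalyan.ks@yahoo.com"),
]

print(redact(text, sample_entities))
```

With the real model, the same helper would be called as `redact(input_text, pii_entities)`.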
The model was evaluated on a 10k-sample test set from the Nemotron PII dataset and achieved the following results:
| Metric | Score |
|---|---|
| F1 | 95.73 |
| Precision | 95.96 |
| Recall | 95.49 |
| Accuracy | 99.17 |
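As a quick sanity check on these numbers, the sketch below recomputes F1 from the reported precision and recall, assuming F1 here is the usual harmonic mean of the two. The `f1_score` helper is written for illustration. The recomputed value lands within 0.01 of the reported 95.73, a gap that is consistent with precision and recall having been rounded to two decimals before publication.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall scores from the table above, on a 0-100 scale.
overall_f1 = f1_score(95.96, 95.49)
print(f"Recomputed F1: {overall_f1:.2f} (reported: 95.73)")
```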

Best-performing entity types, sorted by F1 (scores on a 0-1 scale):

| Entity | Precision | Recall | F1 |
|---|---|---|---|
| mac_address | 0.9965 | 1.0000 | 0.9982 |
| biometric_identifier | 0.9957 | 0.9978 | 0.9967 |
| date_of_birth | 0.9937 | 0.9976 | 0.9956 |
| email | 0.9937 | 0.9940 | 0.9939 |
| api_key | 0.9921 | 0.9949 | 0.9935 |
| coordinate | 0.9910 | 0.9955 | 0.9932 |
| vehicle_identifier | 0.9863 | 0.9981 | 0.9922 |
| medical_record_number | 0.9960 | 0.9881 | 0.9921 |
| employee_id | 0.9909 | 0.9915 | 0.9912 |
| credit_debit_card | 0.9935 | 0.9870 | 0.9902 |

Lowest-performing entity types, sorted by F1 (scores on a 0-1 scale):

| Entity | Precision | Recall | F1 |
|---|---|---|---|
| occupation | 0.7110 | 0.5115 | 0.5949 |
| time | 0.8614 | 0.7784 | 0.8178 |
| political_view | 0.8348 | 0.8842 | 0.8588 |
| age | 0.8220 | 0.9184 | 0.8676 |
| state | 0.8941 | 0.8570 | 0.8751 |
| national_id | 0.8671 | 0.8999 | 0.8832 |
| company_name | 0.8860 | 0.8860 | 0.8860 |
| fax_number | 0.9070 | 0.8742 | 0.8903 |
| race_ethnicity | 0.8611 | 0.9299 | 0.8942 |
| education_level | 0.9232 | 0.8874 | 0.9049 |
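
Because precision differs by entity type, one option when post-processing pipeline output is to require a higher confidence score for the weaker types. The sketch below is an illustration only: the `filter_by_threshold` helper, the threshold values, and the sample results are all hypothetical and not tuned on any data.

```python
# Hypothetical thresholds: stricter for entity types with lower reported precision.
DEFAULT_THRESHOLD = 0.50
PER_ENTITY_THRESHOLD = {"occupation": 0.90, "time": 0.80}


def filter_by_threshold(entities, per_entity=PER_ENTITY_THRESHOLD, default=DEFAULT_THRESHOLD):
    """Keep only entities whose score meets the threshold for their type."""
    return [
        ent for ent in entities
        if ent["score"] >= per_entity.get(ent["entity_group"], default)
    ]


# Hand-written stand-in for pipeline output (not actual model predictions).
sample = [
    {"entity_group": "occupation", "word": "engineer", "score": 0.72},
    {"entity_group": "occupation", "word": "nurse", "score": 0.95},
    {"entity_group": "mac_address", "word": "00:1A:2B:3C:4D:5E", "score": 0.61},
]

kept = filter_by_threshold(sample)
for ent in kept:
    print(f"{ent['entity_group']}: {ent['word']} ({ent['score']:.2f})")
```

Raising a threshold trades recall for precision, so for a type such as occupation, whose recall is already low, this only makes sense when false positives cost more than misses.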

The occupation entity type has the lowest F1 score (0.5949), with recall (0.5115) notably lower than precision (0.7110).

```bibtex
@misc{ettin-32m-pii-2026,
  title = {ettin-32m-nemotron-pii-2026: PII Detection Model},
  author = {Kalyan KS},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/kalyan-ks/ettin-32m-nemotron-pii}
}
```
Base model
jhu-clsp/ettin-encoder-32m