Capitalized BERT Base Uncased MLM

This is the best checkpoint from the CapitalizationEmbeddings project.

The model starts from bert-base-uncased and adds a tiny learned capitalization embedding channel. The lexical token ID still follows the uncased BERT vocabulary, while a parallel capitalization_ids tensor carries case information:

0 = lowercase / no capitalization feature / punctuation / special token
1 = first-cap, e.g. Tom
2 = all-caps, e.g. NASA
3 = mixed-case, e.g. iPhone
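
As a rough illustration of the four classes above, the sketch below assigns a class to whitespace-separated words and lowercases the text alongside. The function names and the edge-case choices (a single uppercase letter counts as first-cap; characters without case are ignored) are assumptions made for illustration. This is not the project's implementation, which works on BERT subword tokens through tokenize_with_capitalization.

```python
def capitalization_class(word: str) -> int:
    """Map a word to one of the four capitalization classes (illustrative only)."""
    # Keep only characters that actually have case; digits and punctuation are ignored.
    cased = [c for c in word if c.isupper() or c.islower()]
    if not cased or all(c.islower() for c in cased):
        return 0  # lowercase, punctuation, or nothing with case
    if all(c.isupper() for c in cased):
        # Assumption: a lone uppercase letter ("I", "A") counts as first-cap.
        return 2 if len(cased) > 1 else 1
    if cased[0].isupper() and all(c.islower() for c in cased[1:]):
        return 1  # first-cap, e.g. Tom
    return 3  # mixed-case, e.g. iPhone


def lowercase_with_capitalization(text: str):
    """Return lowercased words and a parallel list of capitalization classes."""
    words = text.split()
    return [w.lower() for w in words], [capitalization_class(w) for w in words]


words, cap_ids = lowercase_with_capitalization("Tom uses an iPhone at NASA .")
print(list(zip(words, cap_ids)))
```

The point of the parallel list is that the lexical channel sees only lowercased text, while the case signal travels separately and can be embedded on its own.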

This is a masked-language-model (MLM) checkpoint, not a downstream fine-tuned classifier.

Checkpoint

Best project checkpoint:

mixed_case_dropout/capitalized_from_3class_steps3000_lr2e5_drop01/final

Training recipe:

base model: bert-base-uncased
capitalization vocab size: 4
capitalization loss weight: 0.25
capitalization class weights: [1, 2, 8, 4]
capitalization embedding dropout: 0.1
continued pretraining: 3,000 steps on the real-acronym mix
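
The recipe lists a loss weight and per-class weights without spelling out how they enter the objective. One plausible reading, sketched below in plain Python, is a class-weighted cross-entropy on the capitalization head, added to the MLM loss with weight 0.25. The weighted-mean reduction (the behavior of PyTorch's cross_entropy when a weight argument is given) and the additive combination are assumptions. The repository is the authority on the actual objective.

```python
import math

CLASS_WEIGHTS = [1.0, 2.0, 8.0, 4.0]  # from the recipe: classes 0, 1, 2, 3
CAP_LOSS_WEIGHT = 0.25                # from the recipe


def weighted_cross_entropy(logits, labels, class_weights):
    """Class-weighted cross-entropy with a weighted-mean reduction (assumed)."""
    total, norm = 0.0, 0.0
    for row, label in zip(logits, labels):
        # Numerically stable log-sum-exp over the class logits.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        nll = log_z - row[label]
        w = class_weights[label]
        total += w * nll
        norm += w
    return total / norm if norm else 0.0


def combined_loss(mlm_loss, cap_logits, cap_labels):
    """Hypothetical total loss: MLM loss plus the down-weighted capitalization loss."""
    cap_loss = weighted_cross_entropy(cap_logits, cap_labels, CLASS_WEIGHTS)
    return mlm_loss + CAP_LOSS_WEIGHT * cap_loss


# Two tokens with uninformative logits: one lowercase (0), one all-caps (2).
print(combined_loss(1.0, [[0.0] * 4, [0.0] * 4], [0, 2]))
```

Under this reading, the larger weights on classes 2 and 3 would make errors on all-caps and mixed-case tokens count more than errors on lowercase tokens.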

Loading

This model uses a custom architecture, so it is not directly loadable with plain AutoModelForMaskedLM. Install the project package first:

pip install git+https://github.com/Santosh-Gupta/CapitalizationEmbeddings.git

Then load:

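import torch
from transformers import AutoTokenizer

from capitalization_embeddings import (
    CapitalizedBertForMaskedLM,
    tokenize_with_capitalization,
)

repo_id = "Santosh-Gupta/capitalized-bert-base-uncased-mlm"

# Load the tokenizer and the custom MLM class from the project package.
tokenizer = AutoTokenizer.from_pretrained(repo_id, use_fast=True)
model = CapitalizedBertForMaskedLM.from_pretrained(repo_id)
model.eval()

# Tokenize and build the parallel capitalization_ids tensor described above.
encoding = tokenize_with_capitalization(
    tokenizer,
    "Tom works at NASA.",
    return_tensors="pt",
    use_mixed_case=True,
)

# Inference only, so skip gradient tracking.
with torch.no_grad():
    outputs = model(**encoding)

print(outputs.logits.shape)  # MLM logits
print(outputs.capitalization_logits.shape)  # capitalization-head logits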

Intended Use

This checkpoint is mainly intended for research and continued experimentation:

continued MLM pretraining;
probing capitalization-aware BERT representations;
initializing token/entity classification fine-tuning experiments;
reproducing the blog-post results in the repository.

It is not intended as a general replacement for bert-base-cased.

Results Summary

The project found that capitalization embeddings are most useful on token- and entity-heavy tasks where case behaves like a reusable feature. Sequence-task results were mixed.

Headline examples from the repository:

Benchmark	Metric	Uncased	Cased	Capitalized
CoNLL-2003 NER	entity F1	0.9040 +/- 0.0025	0.9119 +/- 0.0035	0.9165 +/- 0.0018
WNUT-17 NER	entity F1	0.4424 +/- 0.0152	0.4426 +/- 0.0100	0.4495 +/- 0.0103
SST-5	accuracy	0.5410 +/- 0.0039	0.5283 +/- 0.0084	0.5407 +/- 0.0068
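
To make the comparison easier to read, the snippet below recomputes the gap between the capitalized model and each baseline from the table's mean scores. It only restates the numbers above: it ignores the +/- spreads and makes no significance claim.

```python
# Mean scores copied from the table above: (uncased, cased, capitalized).
results = {
    "CoNLL-2003 NER": (0.9040, 0.9119, 0.9165),
    "WNUT-17 NER": (0.4424, 0.4426, 0.4495),
    "SST-5": (0.5410, 0.5283, 0.5407),
}


def deltas(scores):
    """Return (capitalized - uncased, capitalized - cased), rounded to 4 places."""
    uncased, cased, capitalized = scores
    return round(capitalized - uncased, 4), round(capitalized - cased, 4)


for name, scores in results.items():
    d_uncased, d_cased = deltas(scores)
    print(f"{name}: {d_uncased:+.4f} vs uncased, {d_cased:+.4f} vs cased")
```

On these means, the capitalized model is ahead of both baselines on the two NER benchmarks, while on SST-5 it is 0.0003 behind uncased and 0.0124 ahead of cased.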

See the repository README for the full writeup and limitations.

Limitations

This is a custom BERT architecture requiring the project package.
This is an MLM checkpoint, not an instruction model or text generator.
The method did not universally dominate cased or uncased BERT.
Some benchmark families favored cased or uncased BERT, and a small case embedding did not fully close the gap.

Citation

If you use this checkpoint, cite the GitHub repository:

Santosh Gupta. CapitalizationEmbeddings.
https://github.com/Santosh-Gupta/CapitalizationEmbeddings
