Capitalized BERT Base Uncased MLM

This is the best checkpoint from the CapitalizationEmbeddings project.

The model starts from bert-base-uncased and adds a tiny learned capitalization embedding channel. The lexical token ID still follows the uncased BERT vocabulary, while a parallel capitalization_ids tensor carries case information:

0 = lowercase / no capitalization feature / punctuation / special token
1 = first-cap, e.g. Tom
2 = all-caps, e.g. NASA
3 = mixed-case, e.g. iPhone
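
As a rough illustration of the scheme above, the sketch below assigns one of the four class IDs to each whitespace-level word and lowercases the word for the uncased vocabulary. The function names capitalization_id and encode_words are hypothetical, and treating a single capital letter as first-cap is an assumption. The project's own tokenize_with_capitalization runs alongside the BERT tokenizer and may resolve edge cases differently.

```python
from typing import List, Tuple


def capitalization_id(word: str) -> int:
    """Map a word to a capitalization class (hypothetical helper, not the project API)."""
    letters = [ch for ch in word if ch.isalpha()]
    if not letters or not any(ch.isupper() for ch in letters):
        return 0  # lowercase, punctuation, digits, or no case feature
    if all(ch.isupper() for ch in letters):
        # Assumption: a lone capital such as "I" is treated as first-cap.
        return 1 if len(letters) == 1 else 2
    if letters[0].isupper() and all(ch.islower() for ch in letters[1:]):
        return 1  # first-cap, e.g. Tom
    return 3  # mixed-case, e.g. iPhone


def encode_words(words: List[str]) -> Tuple[List[str], List[int]]:
    """Return lowercased words plus a parallel list of capitalization IDs."""
    lowered = [w.lower() for w in words]
    cap_ids = [capitalization_id(w) for w in words]
    return lowered, cap_ids


tokens, cap_ids = encode_words(["Tom", "works", "at", "NASA", ".", "iPhone"])
print(tokens)   # ['tom', 'works', 'at', 'nasa', '.', 'iphone']
print(cap_ids)  # [1, 0, 0, 2, 0, 3]
```

The lexical stream stays fully lowercase, so the case signal lives only in the parallel ID list.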

This is a masked-language-model (MLM) checkpoint, not a fine-tuned downstream classifier.

Checkpoint

Best project checkpoint:

mixed_case_dropout/capitalized_from_3class_steps3000_lr2e5_drop01/final

Training recipe:

base model: bert-base-uncased
capitalization vocab size: 4
capitalization loss weight: 0.25
capitalization class weights: [1, 2, 8, 4]
capitalization embedding dropout: 0.1
continued pretraining: 3,000 steps on the real-acronym mix
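
One plausible reading of the loss weight and class weights above is that training adds a class-weighted cross-entropy on the capitalization head, scaled by 0.25, to the usual MLM loss. The sketch below follows that reading in plain Python. The function names, the way the two terms are combined, and the normalization by the summed weights are all assumptions, not taken from the project code.

```python
import math
from typing import List, Sequence

CLASS_WEIGHTS = [1.0, 2.0, 8.0, 4.0]  # from the recipe: classes 0..3
CAP_LOSS_WEIGHT = 0.25                # from the recipe


def weighted_cross_entropy(
    logits: Sequence[Sequence[float]],
    labels: Sequence[int],
    class_weights: Sequence[float],
) -> float:
    """Class-weighted cross-entropy, normalized by the sum of applied weights."""
    total, weight_sum = 0.0, 0.0
    for scores, label in zip(logits, labels):
        peak = max(scores)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
        w = class_weights[label]
        total += w * (log_z - scores[label])
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0


def combined_loss(
    mlm_loss: float,
    cap_logits: List[List[float]],
    cap_labels: List[int],
) -> float:
    """Assumed objective: MLM loss plus the scaled capitalization loss."""
    cap_loss = weighted_cross_entropy(cap_logits, cap_labels, CLASS_WEIGHTS)
    return mlm_loss + CAP_LOSS_WEIGHT * cap_loss


# Uniform logits give a capitalization loss of log(4) regardless of the label.
print(combined_loss(2.0, [[0.0, 0.0, 0.0, 0.0]], [2]))
```

Under this reading, the larger weights on classes 2 and 3 make errors on all-caps and mixed-case tokens cost more than errors on the far more common lowercase class.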

Loading

This model uses a custom architecture, so it is not directly loadable with plain AutoModelForMaskedLM. Install the project package first:

pip install git+https://github.com/Santosh-Gupta/CapitalizationEmbeddings.git

Then load:

import torch
from transformers import AutoTokenizer

from capitalization_embeddings import (
    CapitalizedBertForMaskedLM,
    tokenize_with_capitalization,
)

repo_id = "Santosh-Gupta/capitalized-bert-base-uncased-mlm"

tokenizer = AutoTokenizer.from_pretrained(repo_id, use_fast=True)
model = CapitalizedBertForMaskedLM.from_pretrained(repo_id)

# Builds the usual uncased inputs plus the parallel capitalization_ids tensor.
encoding = tokenize_with_capitalization(
    tokenizer,
    "Tom works at NASA.",
    return_tensors="pt",
    use_mixed_case=True,
)

# Inference only, so skip gradient tracking.
with torch.no_grad():
    outputs = model(**encoding)

print(outputs.logits.shape)  # MLM logits over the uncased vocabulary
print(outputs.capitalization_logits.shape)  # logits from the capitalization head

Intended Use

This checkpoint is mainly intended for research and continued experimentation:

  • continued MLM pretraining;
  • probing capitalization-aware BERT representations;
  • initializing token/entity classification fine-tuning experiments;
  • reproducing the blog-post results in the repository.

It is not intended as a general replacement for bert-base-cased.

Results Summary

The project found that capitalization embeddings are most useful on token-level and entity-heavy tasks such as NER, where case behaves like a reusable feature. Results on sequence-level tasks were mixed.

Headline examples from the repository:

Benchmark        Metric      Uncased              Cased                Capitalized
CoNLL-2003 NER   entity F1   0.9040 +/- 0.0025    0.9119 +/- 0.0035    0.9165 +/- 0.0018
WNUT-17 NER      entity F1   0.4424 +/- 0.0152    0.4426 +/- 0.0100    0.4495 +/- 0.0103
SST-5            accuracy    0.5410 +/- 0.0039    0.5283 +/- 0.0084    0.5407 +/- 0.0068
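
To make the gaps in the headline numbers easier to read, the short sketch below subtracts each baseline mean from the capitalized model's mean. The values are copied from the results above; the deltas helper is only illustrative, and the differences should be read against the reported +/- spreads, which this snippet ignores.

```python
from typing import Dict, Tuple

# Mean scores from the results summary: (uncased, cased, capitalized).
RESULTS: Dict[str, Tuple[float, float, float]] = {
    "CoNLL-2003 NER": (0.9040, 0.9119, 0.9165),
    "WNUT-17 NER": (0.4424, 0.4426, 0.4495),
    "SST-5": (0.5410, 0.5283, 0.5407),
}


def deltas(results: Dict[str, Tuple[float, float, float]]) -> Dict[str, Dict[str, float]]:
    """Difference between the capitalized model and each baseline, per benchmark."""
    out = {}
    for name, (uncased, cased, capitalized) in results.items():
        out[name] = {
            "vs_uncased": round(capitalized - uncased, 4),
            "vs_cased": round(capitalized - cased, 4),
        }
    return out


for name, d in deltas(RESULTS).items():
    print(f"{name}: {d['vs_uncased']:+.4f} vs uncased, {d['vs_cased']:+.4f} vs cased")
# CoNLL-2003 NER: +0.0125 vs uncased, +0.0046 vs cased
# WNUT-17 NER: +0.0071 vs uncased, +0.0069 vs cased
# SST-5: -0.0003 vs uncased, +0.0124 vs cased
```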

See the repository README for the full writeup and limitations.

Limitations

  • This is a custom BERT architecture requiring the project package.
  • This is an MLM checkpoint, not an instruction model or text generator.
  • The method did not consistently outperform cased or uncased BERT.
  • On some benchmark families, cased or uncased BERT kept an advantage that a small case embedding did not fully recover.

Citation

If you use this checkpoint, cite the GitHub repository:

Santosh Gupta. CapitalizationEmbeddings.
https://github.com/Santosh-Gupta/CapitalizationEmbeddings