Capitalized BERT Base Uncased MLM
This is the best checkpoint from the CapitalizationEmbeddings project.
The model starts from bert-base-uncased and adds a tiny learned
capitalization embedding channel. The lexical token ID still follows the
uncased BERT vocabulary, while a parallel capitalization_ids tensor carries
case information:
- 0 = lowercase / no capitalization feature / punctuation / special token
- 1 = first-cap, e.g. Tom
- 2 = all-caps, e.g. NASA
- 3 = mixed-case, e.g. iPhone
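The class scheme above can be illustrated with a small word-level classifier. This is a minimal sketch, not the project's implementation: the function name `capitalization_class` is hypothetical, and the handling of edge cases (for example, treating a single uppercase letter such as "I" as first-cap) is an assumption the repository may resolve differently.

```python
def capitalization_class(word: str) -> int:
    """Map a surface word to one of the four capitalization IDs (illustrative only)."""
    # Only characters that actually have a case are considered.
    letters = [ch for ch in word if ch.isupper() or ch.islower()]
    if not letters or all(ch.islower() for ch in letters):
        return 0  # lowercase, punctuation, digits, or anything without case
    if len(letters) > 1 and all(ch.isupper() for ch in letters):
        return 2  # all-caps, e.g. NASA
    if letters[0].isupper() and all(ch.islower() for ch in letters[1:]):
        return 1  # first-cap, e.g. Tom (assumption: a lone capital also lands here)
    return 3  # mixed-case, e.g. iPhone


for word in ["Tom", "works", "at", "NASA", ".", "iPhone"]:
    print(word, capitalization_class(word))
```

In the real pipeline these IDs are aligned to WordPiece tokens by `tokenize_with_capitalization`; how subword pieces inherit a word's class is not shown here.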
This checkpoint is a masked-language-model checkpoint, not a downstream fine-tuned classifier.
Checkpoint
Best project checkpoint:
mixed_case_dropout/capitalized_from_3class_steps3000_lr2e5_drop01/final
Training recipe:
- base model: bert-base-uncased
- capitalization vocab size: 4
- capitalization loss weight: 0.25
- capitalization class weights: [1, 2, 8, 4]
- capitalization embedding dropout: 0.1
- continued pretraining: 3,000 steps on the real-acronym mix
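To make the loss weight and class weights concrete, the sketch below shows one plausible way they could combine. It assumes the capitalization objective is a class-weighted cross-entropy that is scaled by 0.25 and added to the MLM loss; the card does not spell out the exact formulation, so the function names and the combination rule are assumptions. The code is plain Python for readability; in PyTorch the weighted term would correspond to `torch.nn.functional.cross_entropy` with its `weight` argument.

```python
import math

CAP_CLASS_WEIGHTS = [1.0, 2.0, 8.0, 4.0]  # from the recipe: classes 0, 1, 2, 3
CAP_LOSS_WEIGHT = 0.25                    # from the recipe


def weighted_cross_entropy(logits, labels, class_weights):
    """Class-weighted mean negative log-likelihood over a batch of tokens."""
    total, weight_sum = 0.0, 0.0
    for row, label in zip(logits, labels):
        # Numerically stable log-sum-exp for the softmax normalizer.
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        weight = class_weights[label]
        total += weight * (log_z - row[label])
        weight_sum += weight
    return total / weight_sum


def combined_loss(mlm_loss, cap_logits, cap_labels):
    """Assumed combination: MLM loss plus the down-weighted capitalization loss."""
    cap_loss = weighted_cross_entropy(cap_logits, cap_labels, CAP_CLASS_WEIGHTS)
    return mlm_loss + CAP_LOSS_WEIGHT * cap_loss


# Two tokens: one lowercase (class 0), one all-caps (class 2).
print(combined_loss(2.0, [[3.0, 0.1, 0.0, 0.2], [0.5, 0.2, 1.5, 0.1]], [0, 2]))
```

Under this reading, the larger weights on classes 2 and 3 make errors on the rarer all-caps and mixed-case tokens count more than errors on lowercase tokens.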
Loading
This model uses a custom architecture, so it is not directly loadable with plain
AutoModelForMaskedLM. Install the project package first:
```bash
pip install git+https://github.com/Santosh-Gupta/CapitalizationEmbeddings.git
```
Then load:
```python
from transformers import AutoTokenizer
from capitalization_embeddings import (
    CapitalizedBertForMaskedLM,
    tokenize_with_capitalization,
)

repo_id = "Santosh-Gupta/capitalized-bert-base-uncased-mlm"
tokenizer = AutoTokenizer.from_pretrained(repo_id, use_fast=True)
model = CapitalizedBertForMaskedLM.from_pretrained(repo_id)

encoding = tokenize_with_capitalization(
    tokenizer,
    "Tom works at NASA.",
    return_tensors="pt",
    use_mixed_case=True,
)
outputs = model(**encoding)
print(outputs.logits.shape)
print(outputs.capitalization_logits.shape)
```
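Once per-token capitalization classes are available (for instance by taking an argmax over `outputs.capitalization_logits`, assuming its last dimension indexes the four classes), they can be applied back to uncased tokens. The helper below is a hypothetical illustration and not part of the project package. It works on whole words rather than WordPiece pieces, and it leaves mixed-case tokens unchanged because class 3 alone does not say which letters are capitalized.

```python
def apply_capitalization(token: str, cap_class: int) -> str:
    """Restore surface casing for an uncased token given its capitalization ID."""
    if cap_class == 1:
        return token[:1].upper() + token[1:]  # first-cap: tom -> Tom
    if cap_class == 2:
        return token.upper()                  # all-caps: nasa -> NASA
    # Class 0 needs no change; class 3 (mixed-case) cannot be recovered from the ID alone.
    return token


tokens = ["tom", "works", "at", "nasa", "."]
cap_ids = [1, 0, 0, 2, 0]
print(" ".join(apply_capitalization(t, c) for t, c in zip(tokens, cap_ids)))
```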
Intended Use
This checkpoint is mainly intended for research and continued experimentation:
- continued MLM pretraining;
- probing capitalization-aware BERT representations;
- initializing token/entity classification fine-tuning experiments;
- reproducing the blog-post results in the repository.
It is not intended as a general replacement for bert-base-cased.
Results Summary
The project found that capitalization embeddings are most useful on token-level and entity-heavy tasks, where case behaves like a reusable feature. Results on sequence-level tasks were mixed.
Headline examples from the repository:
| Benchmark | Metric | Uncased | Cased | Capitalized |
|---|---|---|---|---|
| CoNLL-2003 NER | entity F1 | 0.9040 +/- 0.0025 | 0.9119 +/- 0.0035 | 0.9165 +/- 0.0018 |
| WNUT-17 NER | entity F1 | 0.4424 +/- 0.0152 | 0.4426 +/- 0.0100 | 0.4495 +/- 0.0103 |
| SST-5 | accuracy | 0.5410 +/- 0.0039 | 0.5283 +/- 0.0084 | 0.5407 +/- 0.0068 |
See the repository README for the full writeup and limitations.
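For a quick read of the headline table, the snippet below computes the difference between the capitalized model and each baseline, using only the mean scores reported above. The variable and function names are illustrative; the reported spreads are left out, so the differences should be read alongside the +/- values in the table.

```python
# Mean scores copied from the headline table.
results = {
    "CoNLL-2003 NER": {"uncased": 0.9040, "cased": 0.9119, "capitalized": 0.9165},
    "WNUT-17 NER": {"uncased": 0.4424, "cased": 0.4426, "capitalized": 0.4495},
    "SST-5": {"uncased": 0.5410, "cased": 0.5283, "capitalized": 0.5407},
}


def deltas(scores):
    """Capitalized score minus each baseline, rounded to four decimals."""
    return {
        baseline: round(scores["capitalized"] - scores[baseline], 4)
        for baseline in ("uncased", "cased")
    }


for benchmark, scores in results.items():
    print(benchmark, deltas(scores))
```

On WNUT-17 the differences are smaller than the reported spread of the runs, and on SST-5 the capitalized model is effectively level with uncased BERT.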
Limitations
- This is a custom BERT architecture requiring the project package.
- This is an MLM checkpoint, not an instruction model or text generator.
- The method did not universally dominate cased or uncased BERT.
- Some benchmark families favored cased BERT or uncased BERT for reasons that a small case embedding did not fully recover.
Citation
If you use this checkpoint, cite the GitHub repository:
Santosh Gupta. CapitalizationEmbeddings.
https://github.com/Santosh-Gupta/CapitalizationEmbeddings