metadata
license: apache-2.0
language:
- ru
- en
library_name: transformers
pipeline_tag: fill-mask
RoBERTa-base
Pretrained bidirectional encoder for the Russian language.
The model was trained with the standard masked language modeling (MLM) objective on large text corpora, including open social data.
See the Training Details section for more information.
- Developed by: deepvk
- Model type: RoBERTa
- Languages: Mostly Russian, with a small fraction of other languages
- License: Apache 2.0
How to Get Started with the Model
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the pretrained encoder (no task head)
tokenizer = AutoTokenizer.from_pretrained("deepvk/roberta-base")
model = AutoModel.from_pretrained("deepvk/roberta-base")

text = "Привет, мир!"
inputs = tokenizer(text, return_tensors='pt')
outputs = model(**inputs)  # outputs.last_hidden_state contains the token embeddings
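Since the model was trained with an MLM objective, it can also be queried directly for masked-token prediction. The sketch below is a minimal example using the fill-mask pipeline; it takes the mask token from the loaded tokenizer rather than hard-coding it:

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="deepvk/roberta-base")
# Build a Russian sentence containing the tokenizer's own mask token
masked_text = f"Привет, {fill_mask.tokenizer.mask_token}!"
print(fill_mask(masked_text))  # candidate tokens with their scores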
Training Details
Training Data
500 GB of raw text in total, drawn from a mix of the following sources: Wikipedia, books, Twitter comments, Pikabu, Proza.ru, film subtitles, news websites, and a social corpus.
Training Hyperparameters
| Argument | Value |
|---|---|
| Training regime | fp16 mixed precision |
| Training framework | Fairseq |
| Optimizer | Adam |
| Adam betas | (0.9, 0.98) |
| Adam eps | 1e-6 |
| Num training steps | 500k |
The model was trained on a machine with 8×A100 GPUs for approximately 22 days.
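For reference, the optimizer settings from the table map onto a plain PyTorch Adam instance as sketched below. Training was actually done in Fairseq, and the learning rate and schedule are not reported above, so the lr value here is a hypothetical placeholder only:

import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("deepvk/roberta-base")
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-4,            # hypothetical placeholder; the actual learning rate is not reported
    betas=(0.9, 0.98),  # Adam betas from the table above
    eps=1e-6,           # Adam eps from the table above
)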
Architecture Details
| Argument | Value |
|---|---|
| Encoder layers | 12 |
| Encoder attention heads | 12 |
| Encoder embed dim | 768 |
| Encoder ffn embed dim | 3,072 |
| Activation function | GeLU |
| Attention dropout | 0.1 |
| Dropout | 0.1 |
| Max positions | 512 |
| Vocab size | 50266 |
| Tokenizer type | Byte-level BPE |
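These architecture values can be cross-checked against the configuration shipped with the checkpoint, as in the sketch below. Note that for RoBERTa-style checkpoints the Hugging Face max_position_embeddings field is typically the usable maximum positions plus two offset positions:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("deepvk/roberta-base")
print(config.num_hidden_layers)        # encoder layers
print(config.num_attention_heads)      # encoder attention heads
print(config.hidden_size)              # encoder embed dim
print(config.intermediate_size)        # encoder ffn embed dim
print(config.hidden_act)               # activation function
print(config.max_position_embeddings)  # typically max positions + 2 for RoBERTa-style models
print(config.vocab_size)               # vocab size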
Evaluation
We evaluated the model on the Russian SuperGLUE dev set. The best result in each task is marked in bold. All models are the same size except the distilled version of DeBERTa.
| Model | RCB | PARus | MuSeRC | TERRa | RUSSE | RWSD | DaNetQA | Score |
|---|---|---|---|---|---|---|---|---|
| vk-deberta-distill | 0.433 | 0.56 | 0.625 | 0.59 | 0.943 | 0.569 | 0.726 | 0.635 |
| vk-roberta-base | 0.46 | 0.56 | 0.679 | **0.769** | 0.960 | 0.569 | 0.658 | 0.665 |
| vk-deberta-base | 0.450 | **0.61** | **0.722** | 0.704 | 0.948 | 0.578 | **0.76** | **0.682** |
| vk-bert-base | 0.467 | 0.57 | 0.587 | 0.704 | 0.953 | **0.583** | 0.737 | 0.657 |
| sber-bert-base | **0.491** | **0.61** | 0.663 | **0.769** | **0.962** | 0.574 | 0.678 | 0.678 |