Text Classification
Transformers
PyTorch
TensorFlow
Safetensors
Russian
bert
toxic comments classification
How to use s-nlp/russian_toxicity_classifier with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="s-nlp/russian_toxicity_classifier")
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("s-nlp/russian_toxicity_classifier")
model = AutoModelForSequenceClassification.from_pretrained("s-nlp/russian_toxicity_classifier")
```
---
language:
- ru
tags:
- toxic comments classification
licenses:
- cc-by-nc-sa
license: openrail++
base_model:
- DeepPavlov/rubert-base-cased-conversational
---

BERT-based classifier (fine-tuned from [Conversational RuBERT](https://huggingface.co/DeepPavlov/rubert-base-cased-conversational)) trained on a merge of the Russian Language Toxic Comments [dataset](https://www.kaggle.com/blackmoon/russian-language-toxic-comments/metadata), collected from 2ch.hk, and the Toxic Russian Comments [dataset](https://www.kaggle.com/alexandersemiletov/toxic-russian-comments), collected from ok.ru.

The datasets were merged, shuffled, and split into train, dev, and test sets in an 80-10-10 proportion.
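
The merge, shuffle, and split step described above could look roughly like the sketch below. It is only an illustration: the function name `shuffle_and_split`, the fixed seed, and the toy `merged` list are assumptions, and the procedure actually used to build the splits is not documented here.

```python
import random


def shuffle_and_split(examples, seed=0):
    """Shuffle examples and cut them into 80/10/10 train/dev/test parts."""
    shuffled = list(examples)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (assumed)
    n_train = len(shuffled) * 8 // 10
    n_dev = len(shuffled) // 10
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]      # remainder, so no example is dropped
    return train, dev, test


# Toy stand-in for the two merged datasets: (text, label) pairs.
merged = [(f"comment {i}", i % 2) for i in range(1000)]
train, dev, test = shuffle_and_split(merged)
print(len(train), len(dev), len(test))  # 800 100 100
```

Taking the test set as the remainder, rather than another fixed fraction, keeps every example in exactly one split even when the dataset size is not divisible by ten.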

The metrics obtained on the test set are as follows:

|              | precision | recall | f1-score | support |
|:------------:|:---------:|:------:|:--------:|:-------:|
| 0            | 0.98      | 0.99   | 0.98     | 21384   |
| 1            | 0.94      | 0.92   | 0.93     | 4886    |
| accuracy     |           |        | 0.97     | 26270   |
| macro avg    | 0.96      | 0.96   | 0.96     | 26270   |
| weighted avg | 0.97      | 0.97   | 0.97     | 26270   |
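
As a reading aid, the f1-score column can be related to the precision and recall columns: F1 is their harmonic mean, the macro average weights both classes equally, and the weighted average weights each class by its support. The sketch below recomputes these from the rounded table values, so it is only a consistency check under that assumption, not the original evaluation code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, support) per class, copied from the table above.
reported = {0: (0.98, 0.99, 21384), 1: (0.94, 0.92, 4886)}

f1 = {label: f1_score(p, r) for label, (p, r, _) in reported.items()}
total_support = sum(s for _, _, s in reported.values())

macro_f1 = sum(f1.values()) / len(f1)
weighted_f1 = sum(f1[label] * s for label, (_, _, s) in reported.items()) / total_support

print({label: round(v, 2) for label, v in f1.items()})  # {0: 0.98, 1: 0.93}
print(total_support)                                    # 26270
print(round(macro_f1, 2), round(weighted_f1, 2))        # 0.96 0.97
```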

## How to use

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# load tokenizer and model weights
tokenizer = BertTokenizer.from_pretrained('s-nlp/russian_toxicity_classifier')
model = BertForSequenceClassification.from_pretrained('s-nlp/russian_toxicity_classifier')

# prepare the input ('ты супер' means "you are great")
batch = tokenizer.encode('ты супер', return_tensors='pt')

# inference: skip gradient tracking and keep the raw class logits
with torch.no_grad():
    logits = model(batch).logits
```
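
The `model(batch)` call above returns raw logits rather than probabilities. A minimal sketch of turning a pair of logits into class probabilities with a softmax follows. The logit values are hypothetical, not real model output, and the mapping of index 0 to non-toxic and index 1 to toxic is an assumption based on the 0/1 rows of the metrics table; `model.config.id2label` should confirm the actual label names.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [2.0, -1.0]   # hypothetical logits for one input, not real model output
probs = softmax(example_logits)

# Index of the most probable class (assumed: 0 = non-toxic, 1 = toxic).
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted, [round(p, 3) for p in probs])
```

With PyTorch tensors, the same step is `torch.softmax(logits, dim=-1)` followed by `argmax(dim=-1)`.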

## Citation

To acknowledge our work, please use the following citation:

```
@article{dementieva2022russe,
  title={RUSSE-2022: Findings of the First Russian Detoxification Shared Task Based on Parallel Corpora},
  author={Dementieva, Daryna and Logacheva, Varvara and Nikishina, Irina and Fenogenova, Alena and Dale, David and Krotova, Irina and Semenov, Nikita and Shavrina, Tatiana and Panchenko, Alexander}
}
```

## Licensing Information

This model is licensed under the OpenRAIL++ License, which supports the development of various technologies, both industrial and academic, that serve the public good.