s-nlp/russian_toxicity_classifier

	---
	language:
	- ru
	tags:
	- toxic comments classification
	licenses:
	- cc-by-nc-sa
	license: openrail++
	base_model:
	- DeepPavlov/rubert-base-cased-conversational
	---

	BERT-based classifier (fine-tuned from [Conversational RuBERT](https://huggingface.co/DeepPavlov/rubert-base-cased-conversational)) trained on a merge of two corpora: the Russian Language Toxic Comments [dataset](https://www.kaggle.com/blackmoon/russian-language-toxic-comments/metadata), collected from 2ch.hk, and the Toxic Russian Comments [dataset](https://www.kaggle.com/alexandersemiletov/toxic-russian-comments), collected from ok.ru.

	The datasets were merged, shuffled, and split into train, dev, and test sets in an 80-10-10 proportion.
	The metrics obtained on the test set are as follows:

	|              | precision | recall | f1-score | support |
	|:------------:|:---------:|:------:|:--------:|:-------:|
	| 0            | 0.98      | 0.99   | 0.98     | 21384   |
	| 1            | 0.94      | 0.92   | 0.93     | 4886    |
	| accuracy     |           |        | 0.97     | 26270   |
	| macro avg    | 0.96      | 0.96   | 0.96     | 26270   |
	| weighted avg | 0.97      | 0.97   | 0.97     | 26270   |
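
As a sanity check on the table, the averaged rows can be recomputed from the two per-class rows. The sketch below assumes the usual definitions (macro average = unweighted mean over classes, weighted average = mean weighted by support). Because the per-class values in the table are already rounded to two decimals, the recomputed numbers only match the reported ones approximately. The helper names are illustrative, not part of the model's code.

```python
# Per-class rows copied from the table: (precision, recall, f1, support).
per_class = {
    0: (0.98, 0.99, 0.98, 21384),
    1: (0.94, 0.92, 0.93, 4886),
}

def macro_average(rows, column):
    """Unweighted mean of one metric column across classes."""
    return sum(row[column] for row in rows.values()) / len(rows)

def weighted_average(rows, column):
    """Mean of one metric column, weighted by each class's support."""
    total = sum(row[3] for row in rows.values())
    return sum(row[column] * row[3] for row in rows.values()) / total

total_support = sum(row[3] for row in per_class.values())
macro_f1 = macro_average(per_class, 2)
weighted_f1 = weighted_average(per_class, 2)

# Compare with the table: support 26270, macro avg f1 0.96, weighted avg f1 0.97.
print(total_support, round(macro_f1, 3), round(weighted_f1, 3))
```

The support column also shows that class `1` makes up roughly 19% of the test set (4886 of 26270), so the macro average is the more informative summary for the minority class.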

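The merge, shuffle, and 80-10-10 split described above could look roughly like the following. This is a minimal sketch, not the authors' preprocessing script: the function name `split_80_10_10`, the seed, and the toy `(text, label)` pairs standing in for the two Kaggle corpora are all assumptions made for illustration.

```python
import random

def split_80_10_10(examples, seed=0):
    """Shuffle a list of examples and cut it into train/dev/test parts."""
    shuffled = list(examples)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = len(shuffled) * 8 // 10      # integer arithmetic avoids float rounding
    n_dev = len(shuffled) // 10
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]      # remainder, so no example is dropped
    return train, dev, test

# Toy stand-ins for the two corpora: (text, label) pairs.
corpus_2ch = [(f"2ch comment {i}", i % 2) for i in range(60)]
corpus_ok = [(f"ok.ru comment {i}", i % 2) for i in range(40)]

train, dev, test = split_80_10_10(corpus_2ch + corpus_ok)
print(len(train), len(dev), len(test))  # prints: 80 10 10
```

Shuffling after merging matters here: without it, the test set would be drawn almost entirely from whichever corpus was appended last.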

	## How to use
	```python
	import torch
	from transformers import BertTokenizer, BertForSequenceClassification

	# load tokenizer and model weights
	tokenizer = BertTokenizer.from_pretrained('s-nlp/russian_toxicity_classifier')
	model = BertForSequenceClassification.from_pretrained('s-nlp/russian_toxicity_classifier')

	# prepare the input ('ты супер' means "you are great")
	batch = tokenizer.encode('ты супер', return_tensors='pt')

	# inference: gradients are not needed; logits has shape (1, num_labels)
	with torch.no_grad():
	    logits = model(batch).logits
	```
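
The model call returns raw logits rather than probabilities. A softmax turns them into a toxicity score, sketched below in plain Python so it runs without downloading the model. Two assumptions are made here: the logits are made-up numbers rather than real model output, and index `1` is taken to be the toxic class (it is the minority class in the metrics table, but the card does not state the mapping; `model.config.id2label` should confirm it).

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(logits)                           # subtract max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toxic_probability(logits, toxic_index=1):
    """Probability of the assumed toxic class (index 1 by assumption)."""
    return softmax(logits)[toxic_index]

example_logits = [2.0, -1.0]  # hypothetical values, not real model output
p_toxic = toxic_probability(example_logits)
print(f"P(toxic) = {p_toxic:.3f}")
```

With real output, the same function would be applied to a row of the logits tensor converted to a list, for example `logits[0].tolist()`.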

	## Citation

	To acknowledge our work, please use the following citation:

	```
	@article{dementieva2022russe,
	title={RUSSE-2022: Findings of the First Russian Detoxification Shared Task Based on Parallel Corpora},
	author={Dementieva, Daryna and Logacheva, Varvara and Nikishina, Irina and Fenogenova, Alena and Dale, David and Krotova, Irina and Semenov, Nikita and Shavrina, Tatiana and Panchenko, Alexander}
	}
	```


	## Licensing Information

	This model is licensed under the OpenRAIL++ License, which supports the development of technologies, both industrial and academic, that serve the public good.