| --- |
| language: |
| - ar |
| - fr |
| license: mit |
| pipeline_tag: text-classification |
| tags: |
| - misinformation-detection |
| - fake-news |
| - text-classification |
| - algerian-darija |
| - arabic |
| - mbert |
| model_name: mBERT-Algerian-Darija |
| base_model: bert-base-multilingual-cased |
| --- |
| |
| # mBERT — Algerian Darija Misinformation Detection |
|
|
| This model is **BERT-base-multilingual-cased** (mBERT) fine-tuned to detect misinformation in **Algerian Darija** text. |
|
|
| - **Base model**: `bert-base-multilingual-cased` (approximately 178M parameters) |
| - **Task**: Multi-class text classification (5 classes) |
| - **Classes**: F (Factual), R (Reporting), N (Non-factual), M (Misleading), S (Satire) |
|
|
| --- |
|
|
| ## Performance (Test set: 3,344 samples) |
|
|
| - **Accuracy**: 75.42% |
| - **Macro F1**: 64.48% |
| - **Weighted F1**: 75.70% |
|
|
| **Per-class F1**: |
| - Factual (F): 83.72% |
| - Reporting (R): 76.35% |
| - Non-factual (N): 81.01% |
| - Misleading (M): 61.46% |
| - Satire (S): 19.86% |
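The macro F1 reported above is consistent with the unweighted mean of the five per-class F1 scores. A minimal check, using only the numbers listed in this card:

```python
# Per-class F1 scores (in percent), copied from this model card.
per_class_f1 = {
    "F": 83.72,  # Factual
    "R": 76.35,  # Reporting
    "N": 81.01,  # Non-factual
    "M": 61.46,  # Misleading
    "S": 19.86,  # Satire
}

# Macro F1 is the plain average: every class counts equally,
# regardless of how many test samples it has.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(f"Macro F1: {macro_f1:.2f}%")  # prints "Macro F1: 64.48%"
```

Because macro F1 weights every class equally, the low Satire score pulls it well below the weighted F1 of 75.70%. This gap suggests, though the card does not state it, that Satire is a comparatively small class in the test set.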
|
|
| --- |
|
|
| ## Training Summary |
|
|
| - **Max sequence length**: 128 |
| - **Epochs**: 3 (with early stopping) |
| - **Batch size**: 16 |
| - **Learning rate**: 2e-5 |
| - **Loss**: Weighted cross-entropy |
| - **Seed**: 42 (for reproducibility) |
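The card does not say how the class weights for the weighted cross-entropy loss were chosen. The sketch below shows one common option, inverse-frequency ("balanced") weights, together with a pure-Python weighted cross-entropy that uses the same weighted-mean reduction as `torch.nn.CrossEntropyLoss` when its `weight` argument is set. The class counts and the helper names `balanced_class_weights` and `weighted_cross_entropy` are hypothetical, for illustration only.

```python
import math

# HYPOTHETICAL training counts per class; the real distribution is not reported.
train_counts = {"F": 5000, "R": 3000, "N": 4000, "M": 2000, "S": 500}


def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (num_classes * class_count)."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


def weighted_cross_entropy(logits, targets, weights):
    """Weighted mean of per-sample negative log-likelihoods.

    logits:  list of per-class score lists, one per sample
    targets: list of gold class indices
    weights: list of per-class weights, in class-index order
    """
    weighted_sum, weight_total = 0.0, 0.0
    for scores, y in zip(logits, targets):
        # Numerically stable log-sum-exp for the softmax normaliser.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        nll = log_z - scores[y]
        weighted_sum += weights[y] * nll
        weight_total += weights[y]
    return weighted_sum / weight_total


weights = balanced_class_weights(train_counts)
weight_list = [weights[label] for label in ["F", "R", "N", "M", "S"]]

# Two toy samples: one Factual (index 0), one Satire (index 4).
toy_logits = [[2.0, 0.1, 0.1, 0.1, 0.1], [1.0, 0.2, 0.2, 0.2, 0.5]]
loss = weighted_cross_entropy(toy_logits, [0, 4], weight_list)
print(weights)
print(f"Weighted loss: {loss:.4f}")
```

With weights like these, a mistake on a rare class such as Satire contributes more to the loss than a mistake on a frequent class. In PyTorch the same list could be passed as a tensor to the `weight` argument of `torch.nn.CrossEntropyLoss`.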
|
|
| --- |
|
|
| ## Usage |
|
|
| ```python |
| import torch |
| from transformers import AutoTokenizer, AutoModelForSequenceClassification |
| |
| MODEL_ID = "Rahilgh/model4_1" |
| tokenizer = AutoTokenizer.from_pretrained(MODEL_ID) |
| model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID) |
| |
| device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
| model.to(device).eval() |
| |
| LABEL_MAP = {0: "F", 1: "R", 2: "N", 3: "M", 4: "S"} |
| LABEL_NAMES = { |
|     "F": "Factual", |
|     "R": "Reporting", |
|     "N": "Non-factual", |
|     "M": "Misleading", |
|     "S": "Satire", |
| } |
| |
| texts = [ |
|     "قالك بلي رايحين ينحو الباك هذا العام", |
| ] |
| |
| for text in texts: |
|     # Tokenize with the same maximum length used during training (128). |
|     inputs = tokenizer( |
|         text, |
|         return_tensors="pt", |
|         max_length=128, |
|         truncation=True, |
|         padding=True, |
|     ).to(device) |
| |
|     # Inference only: no gradients needed. |
|     with torch.no_grad(): |
|         outputs = model(**inputs) |
|         probs = torch.softmax(outputs.logits, dim=1)[0] |
|         pred_id = probs.argmax().item() |
|         confidence = probs[pred_id].item() |
| |
|     label = LABEL_MAP[pred_id] |
|     print(f"Text: {text}") |
|     print(f"Prediction: {LABEL_NAMES[label]} ({label}), confidence {confidence:.2%}\n") |