Scam detection — NLP (ML project)

Python project layout for training and serving a text classifier that detects scam, phishing, and coercion messages. Multilingual support can be added later through the choice of model and dataset.

Layout

scam-nlp-ml/
├── data/
│   ├── raw/          # Original CSVs, dumps, exports (gitignored contents — keep samples elsewhere)
│   └── processed/    # Train/val splits, tokenized cache
├── models/           # Checkpoints, exported ONNX/Torch artifacts
├── src/              # Training, evaluation, data pipeline code
├── api/              # Optional FastAPI inference service
├── notebooks/        # EDA and experiments
├── requirements.txt
├── .env.example
└── README.md

Quick start

Create a virtual environment (Python 3.10+ recommended):
```
python -m venv .venv
```
Activate:
- Windows: .venv\Scripts\activate
- macOS/Linux: source .venv/bin/activate

Install dependencies:

pip install -U pip
pip install -r requirements.txt

Environment:

- Windows: copy .env.example .env
- macOS/Linux: cp .env.example .env

Then edit .env with your paths and hyperparameters.
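
As a rough illustration of how training code might read those settings, here is a minimal standard-library sketch. The function name `load_env` and the keys `DATA_DIR` and `LEARNING_RATE` are assumptions for this example; the scaffold does not define which keys .env.example contains.

```python
from pathlib import Path


def load_env(path: str = ".env") -> dict[str, str]:
    """Parse KEY=VALUE lines from a .env file, skipping blanks and comments."""
    settings: dict[str, str] = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Ignore empty lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        settings[key.strip()] = value.strip().strip("'\"")
    return settings


if __name__ == "__main__":
    config = load_env()
    # DATA_DIR and LEARNING_RATE are hypothetical keys, with fallbacks.
    data_dir = config.get("DATA_DIR", "data/raw")
    learning_rate = float(config.get("LEARNING_RATE", "2e-5"))
    print(data_dir, learning_rate)
```

All values come back as strings, so numeric hyperparameters need an explicit conversion as shown in the usage lines.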

Place raw datasets under data/raw/, then implement preprocessing in src/ (add modules as you build).
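
One possible starting point for that preprocessing is sketched below, using only the standard library. It assumes a CSV with `text` and `label` columns, which the scaffold does not prescribe; the function names are likewise placeholders, not part of this repository.

```python
import csv
import random
from collections import defaultdict


def load_rows(csv_path):
    """Read (text, label) pairs; the column names are an assumption."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [(row["text"], row["label"]) for row in csv.DictReader(f)]


def clean(rows):
    """Collapse whitespace, then drop empty texts and repeated texts."""
    seen = set()
    cleaned = []
    for text, label in rows:
        text = " ".join(text.split())
        if text and text not in seen:
            seen.add(text)
            cleaned.append((text, label))
    return cleaned


def stratified_split(rows, val_fraction=0.2, seed=42):
    """Split per label so train and val keep a similar class balance."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    rng = random.Random(seed)
    train, val = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        # Keep at least one validation example per label when possible.
        n_val = max(1, round(len(group) * val_fraction)) if len(group) > 1 else 0
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val
```

Splitting per label matters here because scam messages are often a minority class; a plain random split can leave the validation set with too few of them to evaluate on.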

Notes

Do not commit secrets or large raw datasets. Keep secrets in .env (which stays out of version control), and add .gitignore rules for data/raw/* and models/* if they are not already ignored.
For India-focused scams (e.g. digital-arrest SMS), ensure your labels and evaluation reflect those patterns; consider a multilingual encoder (e.g. xlm-roberta-base) when you expand languages.

Next steps (implementation)

src/data.py — load, clean, split
src/train.py — fine-tune transformers
src/eval.py — metrics (precision/recall on scam class)
api/main.py — POST /predict with text body
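
To make the src/eval.py item concrete, the following is a minimal sketch of precision, recall, and F1 computed for the scam class only. The function name and the label strings `"scam"` and `"ham"` are assumptions for illustration; the actual label scheme depends on your dataset.

```python
def scam_precision_recall(y_true, y_pred, positive="scam"):
    """Compute precision, recall, and F1 with `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a count is empty.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    truth = ["scam", "scam", "scam", "ham", "ham"]
    preds = ["scam", "scam", "ham", "scam", "ham"]
    print(scam_precision_recall(truth, preds))
```

Reporting these per-class numbers alongside accuracy is useful because accuracy alone can look high on imbalanced data even when many scam messages are missed.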

This repository scaffold only creates the folders and baseline config; add those modules as you iterate.


Model details

Format: Safetensors
Model size: 0.2B params
Tensor type: F32

Evaluation results (self-reported)

accuracy: 0.976
f1: 0.957