Multilingual Quality Classifier
A high-throughput, production-grade text quality classification system for large-scale corpora across multiple languages.
This repository provides:
- Language-specific quality classifiers (one model per language)
- Streaming JSONL inference (handles multi-TB corpora without RAM blowup)
- Single-GPU and Multi-GPU (DDP) support
- Corruption-safe pipeline (skips bad JSON, logs errors, never crashes)
- Per-class sharded outputs
- Automatic logging of progress and failures
What is this?
This system classifies text into 5 quality buckets: 0, 1, 2, 3, and 4.
Each language has its own model; all languages share a single label mapping.
Repository Structure
Quality-Classifier/
├── models/
│   ├── en/
│   ├── bn/
│   ├── hi/
│   ├── ...
│   └── label_to_id.json
└── qc_infer.py
models/<language>/ → HuggingFace-style model directory
models/label_to_id.json → Shared class mapping
qc_infer.py → Production inference script
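The shared mapping can be inverted once at startup so that predicted class ids map back to labels. The sketch below assumes `label_to_id.json` is a flat JSON object of label to integer id (for example `{"0": 0, "1": 1, ...}`); the actual file layout is not documented here, and `load_id_to_label` is a hypothetical helper, not a function from `qc_infer.py`.

```python
import json


def load_id_to_label(path):
    """Read a label -> id JSON mapping and return the inverse id -> label dict.

    Assumption: the file is a flat object such as {"0": 0, "1": 1}.
    """
    with open(path, encoding="utf-8") as f:
        label_to_id = json.load(f)
    # Invert the mapping so a predicted class id can be turned back into its label.
    return {int(idx): label for label, idx in label_to_id.items()}


# Hypothetical usage:
# id_to_label = load_id_to_label("models/label_to_id.json")
```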
Input Format
Input can be:
- A single .jsonl file, or
- A directory containing many .jsonl files (searched recursively)
Each line must be a JSON object containing at least one of the text keys:
{"text": "This is a sample sentence"}
{"content": "Another example"}
{"body": "Yet another example"}
You can control which keys are checked using:
--text_key text content body
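One plausible way to resolve the text field is to try the keys in the order given and take the first one that holds a non-empty string. This is only a sketch of that idea: `extract_text` is a hypothetical helper, and the first-match-wins rule is an assumption rather than documented behaviour of `qc_infer.py`.

```python
def extract_text(record, text_keys=("text", "content", "body")):
    """Return the first non-empty string found under text_keys, else None."""
    if not isinstance(record, dict):
        return None
    for key in text_keys:
        value = record.get(key)
        # Skip missing keys, non-string values, and blank strings.
        if isinstance(value, str) and value.strip():
            return value
    return None


print(extract_text({"content": "Another example"}))
print(extract_text({"title": "no usable key"}))
```

A record for which this returns `None` would be counted as a missing-text-key error and skipped.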
How to Run
For multi-GPU inference (DDP)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=8 qc_infer.py \
--input_path /data/jsonl_inputs \
--output_path /data/qc_outputs \
--language en \
--text_key text content generated_text
For single-GPU inference
CUDA_VISIBLE_DEVICES=2 torchrun --nproc_per_node=1 qc_infer.py \
--input_path /data/jsonl_inputs \
--output_path /data/qc_outputs \
--language en \
--text_key text content generated_text
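`torchrun` exports `RANK` and `WORLD_SIZE` to every worker process, which is enough to divide the input between GPUs. The sketch below shows a simple round-robin split; how `qc_infer.py` actually divides files or lines between ranks is not stated here, and both function names are hypothetical.

```python
import os


def get_rank_and_world_size():
    """Read the worker identity that torchrun places in the environment."""
    # Fall back to a single-process setup when the variables are absent.
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    return rank, world_size


def shard_for_rank(items, rank, world_size):
    """Round-robin split: rank r takes items r, r + world_size, r + 2 * world_size, ..."""
    return items[rank::world_size]


rank, world_size = get_rank_and_world_size()
files = ["a.jsonl", "b.jsonl", "c.jsonl", "d.jsonl", "e.jsonl"]
print(shard_for_rank(files, rank, world_size))
```

With this scheme every item is handled by exactly one rank, so the per-rank outputs can be concatenated afterwards without deduplication.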
Output Format
The script writes class-wise sharded JSONL files:
output_dir/
├── 0.rank*.jsonl
├── 1.rank*.jsonl
├── 2.rank*.jsonl
├── 3.rank*.jsonl
├── 4.rank*.jsonl
└── qc_infer.log
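A per-class, per-rank writer along these lines would produce the layout above. It is a minimal sketch: `ShardedWriter` is a hypothetical class, and it assumes the `*` in the file names stands for the worker's rank number.

```python
import json
import os


class ShardedWriter:
    """Append records to one JSONL file per quality class for a given rank."""

    def __init__(self, output_dir, rank, num_classes=5):
        os.makedirs(output_dir, exist_ok=True)
        # Assumed naming scheme: <class>.rank<rank>.jsonl
        self.files = {
            c: open(
                os.path.join(output_dir, f"{c}.rank{rank}.jsonl"),
                "a",
                encoding="utf-8",
            )
            for c in range(num_classes)
        }

    def write(self, record, predicted_class):
        # One JSON object per line, keeping non-ASCII text readable.
        line = json.dumps(record, ensure_ascii=False)
        self.files[predicted_class].write(line + "\n")

    def close(self):
        for f in self.files.values():
            f.close()
```

Because each rank writes only to its own files, no locking between worker processes is needed.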
Logging
A full log is written to:
output_dir/qc_infer.log
It contains:
- Number of files discovered
- Number of lines indexed
- Corrupted JSON lines
- Missing text key errors
- Batch failures (with stack traces)
- Progress info
The script:
- Skips corrupted JSON
- Skips invalid samples
- Never crashes due to bad data
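The skip-and-log behaviour can be sketched as a generator that reads one line at a time, so memory use stays flat regardless of file size. `iter_valid_records` and the logger name are hypothetical, and the real script's log messages may differ.

```python
import json
import logging

logger = logging.getLogger("qc_infer_sketch")


def iter_valid_records(path):
    """Yield (line_number, record) for each parseable JSON object in a JSONL file."""
    # errors="replace" keeps invalid UTF-8 bytes from aborting the read.
    with open(path, encoding="utf-8", errors="replace") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                logger.warning("%s:%d corrupted JSON skipped (%s)", path, line_number, exc)
                continue
            if not isinstance(record, dict):
                logger.warning("%s:%d not a JSON object, skipped", path, line_number)
                continue
            yield line_number, record
```

Keeping the line number with each record makes it possible to trace a logged failure back to its exact position in the input file.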