📊 Multilingual Quality Classifier

A high-throughput, production-grade text quality classification system for large-scale corpora across multiple languages.

This repository provides:

  • ✅ Language-specific quality classifiers (one model per language)
  • ✅ Streaming JSONL inference (handles multi-TB corpora without exhausting RAM)
  • ✅ Single-GPU and Multi-GPU (DDP) support
  • ✅ Corruption-safe pipeline (skips bad JSON, logs errors, never crashes)
  • ✅ Per-class sharded outputs
  • ✅ Automatic logging of progress and failures

🧠 What is this?

This system classifies text into five quality buckets: 0, 1, 2, 3, and 4.

Each language has its own model; all models share a single label mapping.
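The shared mapping lives in models/label_to_id.json. A minimal sketch of loading it and building the inverse id → label table — the example contents below are an assumption, so consult the shipped file for the real mapping:

```python
import json
import os
import tempfile

# Hypothetical contents -- the real file ships as models/label_to_id.json.
EXAMPLE_MAPPING = {"0": 0, "1": 1, "2": 2, "3": 3, "4": 4}

def load_label_mapping(path):
    """Load the shared label mapping and build the inverse id -> label table."""
    with open(path, encoding="utf-8") as f:
        label_to_id = json.load(f)
    id_to_label = {v: k for k, v in label_to_id.items()}
    return label_to_id, id_to_label

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "label_to_id.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(EXAMPLE_MAPPING, f)
    label_to_id, id_to_label = load_label_mapping(path)
```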

πŸ“ Repository Structure

Quality-Classifier/
├── models/
│   ├── en/
│   ├── bn/
│   ├── hi/
│   ├── ...
│   └── label_to_id.json
└── qc_infer.py

models/<language>/ β†’ HuggingFace-style model directory

models/label_to_id.json β†’ Shared class mapping

qc_infer.py β†’ Production inference script

📂 Input Format

Input can be either:

  • ✅ A single .jsonl file, or
  • ✅ A directory containing many .jsonl files (searched recursively)
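The recursive discovery can be sketched as follows — `discover_jsonl` is an illustrative helper, not necessarily the function name used inside qc_infer.py:

```python
from pathlib import Path

def discover_jsonl(input_path):
    """Return every .jsonl file under input_path (a single file or a directory)."""
    p = Path(input_path)
    if p.is_file():
        return [p]
    # rglob walks the directory tree recursively; sorting keeps order stable.
    return sorted(p.rglob("*.jsonl"))
```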

Each line must be a JSON object that contains at least one of the configured text keys:

{"text": "This is a sample sentence"}
{"content": "Another example"}
{"body": "Yet another example"}

You can control which keys are checked using:

--text_key text content body
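The keys are tried in the order given, and the first one holding a non-empty string wins. A sketch of that fallback logic (`extract_text` is an illustrative name, not necessarily the script's):

```python
def extract_text(record, text_keys=("text", "content", "body")):
    """Return the first non-empty string found under the configured keys."""
    for key in text_keys:
        value = record.get(key)
        if isinstance(value, str) and value.strip():
            return value
    return None  # no usable text key -> sample is skipped downstream
```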

How to Run

For Multi-GPU (DDP) inference

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=8 qc_infer.py \
  --input_path /data/jsonl_inputs \
  --output_path /data/qc_outputs \
  --language en \
  --text_key text content generated_text

For Single-GPU inference

CUDA_VISIBLE_DEVICES=2 torchrun --nproc_per_node=1 qc_infer.py \
  --input_path /data/jsonl_inputs \
  --output_path /data/qc_outputs \
  --language en \
  --text_key text content generated_text

📤 Output Format

The script writes class-wise sharded JSONL files:

output_dir/
 ├── 0.rank*.jsonl
 ├── 1.rank*.jsonl
 ├── 2.rank*.jsonl
 ├── 3.rank*.jsonl
 ├── 4.rank*.jsonl
 └── qc_infer.log
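Since each DDP rank writes its own shard, collecting all records predicted as one class means globbing across ranks. A minimal sketch, assuming the `<class>.rank<N>.jsonl` naming shown above:

```python
import glob
import json
import os

def read_class_shards(output_dir, label):
    """Yield records from every rank's shard for one predicted class."""
    pattern = os.path.join(output_dir, f"{label}.rank*.jsonl")
    for shard in sorted(glob.glob(pattern)):
        with open(shard, encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)
```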

🧾 Logging

A full log is written to:

output_dir/qc_infer.log

It contains:

  • Number of files discovered
  • Number of lines indexed
  • Corrupted JSON lines
  • Missing text key errors
  • Batch failures (with stack traces)
  • Progress info

The script:

  • ✅ Skips corrupted JSON
  • ✅ Skips invalid samples
  • ✅ Never crashes due to bad data
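The skip-don't-crash behaviour can be sketched like this — function name and counter keys are illustrative, not taken from qc_infer.py:

```python
import json

def safe_parse(lines, text_keys=("text", "content", "body")):
    """Parse JSONL lines, dropping corrupted JSON and records with no text key."""
    records = []
    stats = {"ok": 0, "bad_json": 0, "missing_key": 0}
    for line in lines:
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            stats["bad_json"] += 1  # corrupted line: count it, keep going
            continue
        if not isinstance(obj, dict) or not any(
            isinstance(obj.get(k), str) for k in text_keys
        ):
            stats["missing_key"] += 1  # valid JSON but no usable text key
            continue
        stats["ok"] += 1
        records.append(obj)
    return records, stats
```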