---
license: mit
datasets:
- Helsinki-NLP/open_subtitles
language:
- zh
base_model:
- hfl/chinese-macbert-base
pipeline_tag: text-classification
tags:
- agent
- nlp
- chinese
- sentiment-analysis
- emotion
- regression
- vad
- valence-arousal-dominance
- transformers
- bert
- macbert
---

<div align="center">
<h1>vad-macbert</h1>
<p>Chinese VAD (valence/arousal/dominance) regression on top of chinese-macbert-base.</p>
<p>
<a href="https://huggingface.co/Pectics/vad-macbert">
<img alt="HF Model" src="https://img.shields.io/badge/Hugging%20Face-Model-yellow">
</a>
<img alt="Task" src="https://img.shields.io/badge/task-VAD%20regression-1f6feb">
<img alt="Backbone" src="https://img.shields.io/badge/backbone-chinese--macbert--base-4b8bbe">
</p>
</div>

The model predicts three continuous values (valence, arousal, dominance) on the
VAD scale produced by the teacher model, `RobroKools/vad-bert`.

## Quickstart

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_path = "Pectics/vad-macbert"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()  # disable dropout for inference

text = "这部电影让我很感动。"  # "This movie really moved me."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)

# logits has shape (1, 3); squeeze() turns it into a flat list of 3 floats
vad = outputs.logits.squeeze().tolist()
print("VAD:", vad)
```

## Model Details

- Base model: `hfl/chinese-macbert-base`
- Task: VAD regression (3 outputs: valence, arousal, dominance)
- Head: `AutoModelForSequenceClassification` with `num_labels=3`, `problem_type="regression"`
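
The three regression outputs come back as a bare list, so it can help to attach names to them. The sketch below assumes the output order matches the order in the Task line above (valence, arousal, dominance); the helper `label_vad` and the example numbers are hypothetical and not part of this repo.

```python
from typing import Dict, Sequence

# Assumed output order, taken from the "Task" line of the model details.
VAD_NAMES = ("valence", "arousal", "dominance")


def label_vad(values: Sequence[float]) -> Dict[str, float]:
    """Pair each of the 3 regression outputs with its dimension name."""
    if len(values) != len(VAD_NAMES):
        raise ValueError(f"expected {len(VAD_NAMES)} values, got {len(values)}")
    return {name: float(value) for name, value in zip(VAD_NAMES, values)}


# Made-up numbers for illustration, not real model output.
scores = label_vad([0.62, 0.41, 0.55])
print(scores)
```

In practice the input would be the `vad` list from the Quickstart snippet.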

## Data Sources & Labeling

### en-zh_cn_vad_clean.csv

- Source: OpenSubtitles EN-ZH parallel corpus.
- Labeling: the English side was fed into `RobroKools/vad-bert` to obtain VAD values,
  which were then assigned to the paired Chinese text.

### en-zh_cn_vad_long.csv

- Derived from `en-zh_cn_vad_clean.csv` by keeping only longer texts. The original
  length threshold was not recorded.
- Inferred from statistics: the minimum length is 32 characters, so the filter
  likely kept samples with length >= 32 characters.
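
A minimal sketch of such a length filter, under stated assumptions: the threshold of 32 is the inferred value (not a recorded one), the length is assumed to be measured on the Chinese text, and the column name `zh` and the helper `keep_long` are hypothetical.

```python
from typing import Dict, Iterable, List

MIN_CHARS = 32  # inferred from dataset statistics; the original threshold was not recorded


def keep_long(
    rows: Iterable[Dict[str, str]],
    min_chars: int = MIN_CHARS,
    text_key: str = "zh",  # hypothetical column name for the Chinese text
) -> List[Dict[str, str]]:
    """Keep only rows whose text has at least `min_chars` characters."""
    return [row for row in rows if len(row[text_key]) >= min_chars]


rows = [{"zh": "短句"}, {"zh": "长" * 31}, {"zh": "长" * 32}]
long_rows = keep_long(rows)
print(len(long_rows))
```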

### en-zh_cn_vad_long_clean.csv

- Cleaned from `en-zh_cn_vad_long.csv` by removing subtitle formatting noise:
  - ASS/SSA tag blocks like `{\\fs..\\pos(..)}` (including broken `{` blocks)
  - HTML-like tags (e.g. `<i>...</i>`)
  - Escape codes like `\\N`, `\\n`, `\\h`, `\\t`
  - Extra whitespace normalization
- Non-CJK rows were dropped.
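
The cleaning steps above can be sketched with a few regular expressions. This is an illustration, not the script used to build the dataset: the patterns, the handling of unclosed `{` blocks (stripped up to the next whitespace), and the CJK test (at least one character in the basic CJK Unified Ideographs block) are all assumptions.

```python
import re

# Complete ASS/SSA override blocks such as {\fs20\pos(10,20)}, or an unclosed
# "{\tag" run, which is stripped up to the next whitespace (assumption).
ASS_BLOCK = re.compile(r"\{\\[^{}]*\}|\{\\\S*")
HTML_TAG = re.compile(r"</?[A-Za-z][^<>]*>")   # e.g. <i> and </i>
ESCAPES = re.compile(r"\\+[Nnht]")             # literal backslash codes like \N or \h
WHITESPACE = re.compile(r"\s+")
CJK = re.compile(r"[\u4e00-\u9fff]")           # basic CJK Unified Ideographs block


def clean_subtitle(text: str) -> str:
    """Strip subtitle markup and normalize whitespace."""
    text = ASS_BLOCK.sub("", text)
    text = HTML_TAG.sub("", text)
    text = ESCAPES.sub(" ", text)  # escape codes act as line breaks or spaces
    return WHITESPACE.sub(" ", text).strip()


def has_cjk(text: str) -> bool:
    """True if the text contains at least one CJK ideograph."""
    return CJK.search(text) is not None


raw = r"{\fs20\pos(10,20)}<i>你好</i>\N世界  !"
cleaned = clean_subtitle(raw)
print(cleaned, has_cjk(cleaned))
```

Rows for which `has_cjk` returns `False` after cleaning would be the ones to drop.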

### en-zh_cn_vad_mix.csv

- Mixed dataset created for replay training:
  - 200k samples from `en-zh_cn_vad_clean.csv`
  - 200k samples from `en-zh_cn_vad_long_clean.csv`
  - Shuffled after sampling
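
One way to build such a mix is sketched below. It assumes sampling without replacement and a fixed random seed; neither detail is stated above, and `build_replay_mix` is a hypothetical helper. The demo uses small integer lists in place of CSV rows.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def build_replay_mix(
    base_rows: Sequence[T],
    long_rows: Sequence[T],
    per_source: int = 200_000,
    seed: int = 0,  # arbitrary seed for reproducibility (assumption)
) -> List[T]:
    """Sample `per_source` rows from each dataset, then shuffle the union."""
    rng = random.Random(seed)
    mixed = rng.sample(base_rows, per_source) + rng.sample(long_rows, per_source)
    rng.shuffle(mixed)
    return mixed


# Tiny stand-in data: values below 100 play the role of the base dataset.
demo = build_replay_mix(list(range(100)), list(range(1000, 1100)), per_source=10)
print(len(demo))
```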

## Training Summary

The final model (`vad-macbert-mix/best`) was obtained in three stages:

1. **Base training** on `en-zh_cn_vad_clean.csv`
2. **Long-text adaptation** on `en-zh_cn_vad_long_clean.csv`
3. **Replay mix** on `en-zh_cn_vad_mix.csv` (resumed from the stage 2 checkpoint)

### Final-stage Command (Replay Mix)

```
--model_name hfl/chinese-macbert-base
--output_dir train/vad-macbert-mix
--data_path train/en-zh_cn_vad_mix.csv
--epochs 4
--batch_size 32
--grad_accum_steps 4
--learning_rate 0.00001
--weight_decay 0.01
--warmup_ratio 0.1
--warmup_steps 0
--max_length 512
--eval_ratio 0.01
--eval_every 100
--eval_batches 200
--loss huber
--huber_delta 1.0
--shuffle_buffer 4096
--min_chars 2
--save_every 100
--log_every 1
--max_steps 5000
--seed 42
--dtype fp16
--num_rows 400000
--resume_from train/vad-macbert-long/best
--encoding utf-8
```
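
For reference, `--loss huber` with `--huber_delta 1.0` presumably refers to the standard Huber loss, which is quadratic for errors up to `delta` and linear beyond it. The training script itself is not shown here, so the pure-Python sketch below follows the textbook definition (the same one `torch.nn.HuberLoss` documents) rather than the author's code; the function names are hypothetical.

```python
from typing import Sequence


def huber(error: float, delta: float = 1.0) -> float:
    """Standard Huber loss for a single prediction error."""
    abs_error = abs(error)
    if abs_error <= delta:
        return 0.5 * error * error               # quadratic region
    return delta * (abs_error - 0.5 * delta)     # linear region


def huber_mean(preds: Sequence[float], targets: Sequence[float], delta: float = 1.0) -> float:
    """Mean Huber loss over paired predictions and targets."""
    losses = [huber(p - t, delta) for p, t in zip(preds, targets)]
    return sum(losses) / len(losses)


print(huber(0.5), huber(2.0))
```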

Training environment (conda environment `llm`):

- Python 3.10.19
- torch 2.9.1+cu130
- transformers 4.57.6

## Evaluation

Benchmark script: `train/vad_benchmark.py`

- Evaluation uses a fixed stride derived from `eval_ratio=0.01`
  (roughly 1 out of every 100 samples).
- Length buckets by character count: 0–20, 20–40, 40–80, 80–120, 120–200,
  200–400, 400+
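
The stride sampling and length bucketing can be sketched as follows. This is not the code in `train/vad_benchmark.py`: the stride is assumed to be `round(1 / eval_ratio)`, and because the bucket edges listed above overlap at their boundaries, the sketch assumes half-open ranges (a 20-character text falls into `20-40`). Both function names are hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence, TypeVar

T = TypeVar("T")

BUCKET_EDGES = [20, 40, 80, 120, 200, 400]
BUCKET_LABELS = ["0-20", "20-40", "40-80", "80-120", "120-200", "200-400", "400+"]


def stride_sample(rows: Sequence[T], eval_ratio: float = 0.01) -> List[T]:
    """Take every k-th row, where k = round(1 / eval_ratio)."""
    stride = max(1, round(1 / eval_ratio))
    return list(rows[::stride])


def length_bucket(text: str) -> str:
    """Map a text to its character-length bucket, using half-open [lo, hi) ranges."""
    return BUCKET_LABELS[bisect_right(BUCKET_EDGES, len(text))]


eval_rows = stride_sample(list(range(1000)))
print(len(eval_rows), length_bucket("这部电影让我很感动。"))
```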

### Results (vad-macbert-mix/best)

**en-zh_cn_vad_clean.csv**

- mse_mean=0.043734
- mae_mean=0.149322
- pearson_mean=0.7335

**en-zh_cn_vad_long_clean.csv**

- mse_mean=0.031895
- mae_mean=0.131320
- pearson_mean=0.7565

Notes:

- The Pearson correlation for the `400+` bucket is unstable due to the small sample size; interpret it with care.
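
The metric names suggest per-dimension scores averaged over valence, arousal, and dominance. That reading is an assumption, since the benchmark script is not shown here. Under it, the three numbers could be computed as in this pure-Python sketch, where `pearson` and `vad_metrics` are hypothetical helpers:

```python
from math import sqrt
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def vad_metrics(
    preds: Sequence[Sequence[float]], targets: Sequence[Sequence[float]]
) -> Dict[str, float]:
    """Compute MSE, MAE and Pearson per VAD dimension, then average over the 3 dimensions."""
    mse: List[float] = []
    mae: List[float] = []
    corr: List[float] = []
    for dim in range(3):
        p = [row[dim] for row in preds]
        t = [row[dim] for row in targets]
        mse.append(sum((a - b) ** 2 for a, b in zip(p, t)) / len(p))
        mae.append(sum(abs(a - b) for a, b in zip(p, t)) / len(p))
        corr.append(pearson(p, t))
    return {
        "mse_mean": sum(mse) / 3,
        "mae_mean": sum(mae) / 3,
        "pearson_mean": sum(corr) / 3,
    }


# Toy data: every prediction is offset from its target by 0.5.
toy_targets = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
toy_preds = [[v + 0.5 for v in row] for row in toy_targets]
metrics = vad_metrics(toy_preds, toy_targets)
print(metrics)
```

With only a handful of samples in a bucket, the variance terms in `pearson` are estimated from very few points, which is one reason a small bucket such as `400+` can give unstable correlations.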

## Limitations

- Labels are derived from an English VAD teacher and transferred via parallel
  alignment, so they reflect the teacher’s bias and may not match human Chinese
  annotations.
- Subtitle corpora include translation artifacts and formatting noise; the cleaned
  versions mitigate this but do not fully remove it.
- Extreme-length sentences are under-represented; performance on texts of 400+
  characters is not reliable.

## Files in This Repo

- `config.json`
- `model.safetensors`
- `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, `vocab.txt`
- `training_args.json`