TASTE2-8B-ZH-SFT
A fine-tuned Spoken Language Model (SLM) for Taiwanese Mandarin dialogue.
Given a spoken question (WAV file), the model generates a spoken answer (MP3 file), so both the input and the output are audio.
Model Description
This model is based on TASTE2-8B, a spoken language model that processes audio input and produces audio output using:
- ASR frontend: Transcribes the input audio (Breeze-ASR-25)
- SLM (Qwen2-7B backbone): Generates a text+prosody response conditioned on the input speech
- Vocoder (Flow + HiFT): Synthesizes the final audio from generated tokens
The SLM was fine-tuned with LoRA (r=8, alpha=16) on a Taiwanese Mandarin conversational dataset. The adapter from epoch 20 was then merged into the full model weights.
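For readers new to LoRA, the sketch below shows what merging an adapter means under the standard LoRA rule, W' = W + (alpha / r) * B A, using the r=8 and alpha=16 values stated above. The matrices are tiny made-up values and the function names are illustrative; this is not code from this repository.

```python
# Toy illustration of the standard LoRA merge rule: W' = W + (alpha / r) * (B @ A).
# Shapes and values are made up; only r=8 and alpha=16 come from the model card.

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def merge_lora(weight, lora_a, lora_b, r, alpha):
    """Return weight + (alpha / r) * (lora_b @ lora_a) as a new matrix."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> (out x in)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

R, ALPHA = 8, 16
weight = [[1.0, 0.0], [0.0, 1.0]]          # hypothetical 2x2 base weight
lora_a = [[0.5, 0.0]] + [[0.0, 0.0]] * 7   # r x in  = 8 x 2
lora_b = [[1.0] + [0.0] * 7, [0.0] * 8]    # out x r = 2 x 8
merged = merge_lora(weight, lora_a, lora_b, R, ALPHA)
print(merged)
```

With r=8 and alpha=16 the low-rank update is scaled by 2 before it is added to the base weight. After merging, inference no longer needs the separate adapter matrices.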
Modification: No-Punctuation Inference (Method B)
Standard inference produces punctuation tokens (such as , 。 ?), which cause unnatural pauses in the generated audio.
This model applies post-generation token stripping before vocoder synthesis:
- Pure punctuation tokens → dropped (token + prosody embedding removed)
- Mixed BPE tokens (e.g. ,在 → 在) → punctuation stripped, token replaced, prosody embedding preserved
This produces more natural, fluent speech output without comma/period pauses.
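The two rules above can be sketched as follows. This is a minimal illustration that works on decoded token strings paired with per-token prosody embeddings. The punctuation set, the data layout, and the name `strip_punctuation` are assumptions made for the example; the actual logic lives in the repository's inference code and may differ.

```python
# Sketch of post-generation punctuation stripping, operating on decoded token
# strings. The punctuation set and the data layout are assumptions.
PUNCTUATION = set(",。?!、;:,.?!;:")

def strip_punctuation(tokens, prosody):
    """Drop pure-punctuation tokens and clean mixed ones.

    tokens  -- list of decoded token strings
    prosody -- list of per-token prosody embeddings (same length as tokens)
    """
    if len(tokens) != len(prosody):
        raise ValueError("tokens and prosody must be aligned one-to-one")
    kept_tokens, kept_prosody = [], []
    for token, embedding in zip(tokens, prosody):
        cleaned = "".join(ch for ch in token if ch not in PUNCTUATION)
        if not cleaned:
            # Pure punctuation: drop the token and its prosody embedding.
            continue
        # Mixed or plain token: keep the cleaned text and the original embedding.
        kept_tokens.append(cleaned)
        kept_prosody.append(embedding)
    return kept_tokens, kept_prosody

tokens = ["你好", ",", ",在", "這裡", "。"]
prosody = [[0.1], [0.2], [0.3], [0.4], [0.5]]
clean_tokens, clean_prosody = strip_punctuation(tokens, prosody)
print(clean_tokens)
print(clean_prosody)
```

In a real pipeline the cleaned string would presumably be mapped back to a token ID through the tokenizer before vocoder synthesis. That step is omitted here.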
Quick Start
1. Clone the repository
git lfs install
git clone https://huggingface.co/Jaylin0418/TASTE2-8B-ZH-SFT
cd TASTE2-8B-ZH-SFT
2. Install dependencies
pip install torch torchaudio transformers huggingface_hub modelscope \
hyperpyyaml omegaconf einx wetext
3. Run inference
python generate_audio.py \
--model_dir . \
--output_dir ./output \
--test_files your_audio.wav \
--asr_model_dir MediaTek-Research/Breeze-ASR-25 \
--stage sft
Output files will be saved to ./output/:
- *_stage_sft_reconstructed.mp3: generated spoken response
- *_results.json: ASR transcription + generated text
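To check what a run produced, a small helper like the one below can list the output files by the two name patterns given above and pretty-print each JSON file. It is a sketch only: the helper names are hypothetical, and because the JSON schema is not documented here, the code prints the content as-is without assuming any keys.

```python
import json
from pathlib import Path

def list_outputs(output_dir):
    """Return (audio_files, result_files) found in output_dir, sorted by name."""
    root = Path(output_dir)
    if not root.is_dir():
        return [], []
    audio = sorted(root.glob("*_stage_sft_reconstructed.mp3"))
    results = sorted(root.glob("*_results.json"))
    return audio, results

def show_results(path):
    """Pretty-print one results file without assuming its schema."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    # ensure_ascii=False keeps Chinese text readable in the terminal.
    print(json.dumps(data, ensure_ascii=False, indent=2))

if __name__ == "__main__":
    audio_files, result_files = list_outputs("./output")
    for audio_path in audio_files:
        print("audio:", audio_path)
    for result_path in result_files:
        show_results(result_path)
```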
Multiple input files
python generate_audio.py \
--model_dir . \
--output_dir ./output \
--test_files audio1.wav audio2.wav audio3.wav \
--asr_model_dir MediaTek-Research/Breeze-ASR-25 \
--stage sft
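For a whole folder of recordings, the same command can be assembled programmatically. The sketch below uses only the flags shown above. The `./inputs` folder and the `build_command` helper are hypothetical names chosen for the example.

```python
import subprocess
import sys
from pathlib import Path

def build_command(wav_paths, model_dir=".", output_dir="./output"):
    """Build the generate_audio.py argument list for a batch of WAV files."""
    if not wav_paths:
        raise ValueError("at least one WAV file is required")
    return [
        sys.executable, "generate_audio.py",
        "--model_dir", str(model_dir),
        "--output_dir", str(output_dir),
        "--test_files", *[str(p) for p in wav_paths],
        "--asr_model_dir", "MediaTek-Research/Breeze-ASR-25",
        "--stage", "sft",
    ]

if __name__ == "__main__":
    # Hypothetical input folder; replace with wherever your recordings live.
    wavs = sorted(Path("./inputs").glob("*.wav"))
    if wavs:
        subprocess.run(build_command(wavs), check=True)
    else:
        print("no WAV files found in ./inputs")
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.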
Example Output
| Input (ASR) | Generated Response |
|---|---|
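| 中秋節快到了我想帶家人去參加一些活動請問這附近有哪些燈會和烤肉活動呢 | 我了解你想和家人一起享受月亮和天氣你可以先上網查一下附近的燈會和烤肉活動找到幾個選擇後再跟我說 |
| English gloss: The Mid-Autumn Festival is coming up and I want to take my family to some events. What lantern festivals and barbecue events are there nearby? | English gloss: I understand you want to enjoy the moon and the weather with your family. You can first look up nearby lantern festivals and barbecue events online, and tell me once you have found a few options. |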
Repository Structure
TASTE2-8B-ZH-SFT/
├── generate_audio.py # Inference script (run this)
├── taste_speech/ # Model inference package
├── taste2_stagesft.yaml # Model config
├── slm.pt # Fine-tuned SLM (ep20, ~18GB)
├── llm.pt # Stage 1 LM weights
├── flow.pt # Flow matching vocoder
├── hift.pt # HiFT vocoder
├── campplus.onnx # Speaker encoder
├── speech_tokenizer_v2.onnx # Speech tokenizer
├── CosyVoice-BlankEN/ # Text tokenizer
├── distil-whisper/ # Audio feature extractor
└── qwen2-7b/ # SLM backbone weights
Requirements
- Python 3.10+
- CUDA GPU (the model is ~22GB in memory)
- ~50GB disk space
Input Audio Requirements
- Format: WAV at any sample rate (resampled to 16 kHz internally)
- Language: Taiwanese Mandarin (繁體中文)
- Content: Spoken questions or statements
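Before running inference, it can help to confirm that an input file is a readable WAV and to see whether it will be resampled. The sketch below uses only Python's standard `wave` module, which handles uncompressed PCM WAV files; other WAV encodings may need a different reader. The `describe_wav` helper is a hypothetical name, not part of this repository.

```python
import sys
import wave

TARGET_RATE = 16_000  # rate the model card says inputs are resampled to

def describe_wav(path):
    """Return basic properties of a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "duration_s": frames / rate if rate else 0.0,
            "needs_resampling": rate != TARGET_RATE,
        }

if __name__ == "__main__":
    # Usage: python check_wav.py audio1.wav audio2.wav
    for wav_path in sys.argv[1:]:
        print(wav_path, describe_wav(wav_path))
```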