TASTE2-8B-ZH-SFT

A fine-tuned Spoken Language Model (SLM) for Taiwanese Mandarin dialogue.
Given a spoken question (WAV file), the model generates a spoken answer (MP3 file): audio in, audio out.

Model Description

This model is based on TASTE2-8B, a spoken language model that takes audio input and produces audio output using three components:

ASR frontend (Breeze-ASR-25): Transcribes the input audio
SLM (Qwen2-7B backbone): Generates a text+prosody response conditioned on the input speech
Vocoder (Flow + HiFT): Synthesizes the final audio from the generated tokens

The SLM was fine-tuned with LoRA (r=8, alpha=16) on a Taiwanese Mandarin conversational dataset. The adapter from epoch 20 was then merged into the full model weights.
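
To make the merge step concrete, here is a minimal sketch of how a LoRA update is conventionally folded into a base weight matrix: W' = W + (alpha / r) * B @ A, which for r=8 and alpha=16 gives a scale of 2.0. This follows the standard LoRA formulation and is not code from this repository; the names `matmul` and `merge_lora` and the toy matrices are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_b, lora_a, r=8, alpha=16):
    """Return weight + (alpha / r) * (lora_b @ lora_a) without mutating the inputs."""
    scale = alpha / r  # 16 / 8 = 2.0 for the values quoted in this card
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example: the factors have rank 1 to stay readable, but the scale
# still uses r=8 and alpha=16 as stated above.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # shape (out_features, rank)
A = [[0.5, -0.5]]    # shape (rank, in_features)
merged = merge_lora(W, B, A)
```

After merging, the adapter matrices are no longer needed at inference time, which is why the repository can ship a single set of SLM weights.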

Modification: No-Punctuation Inference (Method B)

Standard inference produces punctuation tokens (， 。 ？), which cause unnatural pauses in the generated audio.
This model strips those tokens after generation and before vocoder synthesis:

Pure punctuation tokens → dropped (token + prosody embedding removed)
Mixed BPE tokens (e.g. ，在 → 在) → punctuation stripped, token replaced, prosody embedding preserved

This produces more natural, fluent speech output without comma/period pauses.
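
The two rules above can be sketched at the string level as follows. This is an illustration under stated assumptions, not the repository's implementation: the real pipeline works on token IDs and prosody tensors, whereas here tokens are plain strings, each prosody embedding is any Python object, and the helper names `is_punct` and `strip_punctuation` are hypothetical.

```python
import unicodedata


def is_punct(ch):
    """True for any Unicode punctuation character (for example ， 。 ？ , . ?)."""
    return unicodedata.category(ch).startswith("P")


def strip_punctuation(tokens, prosody):
    """Drop pure-punctuation tokens with their prosody; clean mixed tokens."""
    kept_tokens, kept_prosody = [], []
    for token, emb in zip(tokens, prosody):
        cleaned = "".join(ch for ch in token if not is_punct(ch))
        if cleaned != token and not cleaned.strip():
            # Pure punctuation: remove the token and its prosody embedding.
            continue
        kept_tokens.append(cleaned)  # mixed token: text is replaced
        kept_prosody.append(emb)     # its prosody embedding is preserved
    return kept_tokens, kept_prosody


tokens = ["你好", "，", "，在", "家", "。"]
prosody = [[0.1], [0.2], [0.3], [0.4], [0.5]]
clean_tokens, clean_prosody = strip_punctuation(tokens, prosody)
```

Keeping the two lists aligned is the important part: every surviving token must still be paired with the prosody embedding it was generated with.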

Quick Start

1. Clone the repository

git lfs install
git clone https://huggingface.co/Jaylin0418/TASTE2-8B-ZH-SFT
cd TASTE2-8B-ZH-SFT

2. Install dependencies

pip install torch torchaudio transformers huggingface_hub modelscope \
            hyperpyyaml omegaconf einx wetext

3. Run inference

python generate_audio.py \
    --model_dir . \
    --output_dir ./output \
    --test_files your_audio.wav \
    --asr_model_dir MediaTek-Research/Breeze-ASR-25 \
    --stage sft

Output files are saved to ./output/:

*_stage_sft_reconstructed.mp3: generated spoken response
*_results.json: ASR transcription + generated text

Multiple input files

python generate_audio.py \
    --model_dir . \
    --output_dir ./output \
    --test_files audio1.wav audio2.wav audio3.wav \
    --asr_model_dir MediaTek-Research/Breeze-ASR-25 \
    --stage sft
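
For a whole folder of recordings, the same command line can be assembled programmatically. The sketch below only uses the flags shown above; the helper names `build_command` and `run_batch` are hypothetical, and it assumes it is run from the repository root, next to generate_audio.py.

```python
import subprocess
import sys
from pathlib import Path


def build_command(wav_dir, output_dir="./output", model_dir="."):
    """Build the generate_audio.py argument list for every .wav file in wav_dir."""
    wav_files = sorted(str(p) for p in Path(wav_dir).glob("*.wav"))
    if not wav_files:
        raise FileNotFoundError(f"no .wav files found in {wav_dir}")
    return [
        sys.executable, "generate_audio.py",
        "--model_dir", model_dir,
        "--output_dir", output_dir,
        "--test_files", *wav_files,
        "--asr_model_dir", "MediaTek-Research/Breeze-ASR-25",
        "--stage", "sft",
    ]


def run_batch(wav_dir):
    """Run inference on every WAV in wav_dir; call this from the repository root."""
    subprocess.run(build_command(wav_dir), check=True)
```

Calling `run_batch("my_recordings")` (a placeholder folder name) would then process all files in one invocation, so the model is loaded only once.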

Example Output

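Input (ASR)	Generated Response
中秋節快到了我想帶家人去參加一些活動請問這附近有哪些燈會和烤肉活動呢	我了解你想和家人一起享受月亮和天氣你可以先上網查一下附近的燈會和烤肉活動找到幾個選擇後再跟我說
(English translation) The Mid-Autumn Festival is coming and I want to take my family to some events. Which lantern festivals and barbecue events are there nearby?	(English translation) I understand you want to enjoy the moon and the weather with your family. You can first look up nearby lantern festivals and barbecue events online, and once you have found a few options, tell me.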

Repository Structure

TASTE2-8B-ZH-SFT/
├── generate_audio.py          # Inference script (run this)
├── taste_speech/              # Model inference package
├── taste2_stagesft.yaml       # Model config
├── slm.pt                     # Fine-tuned SLM (ep20, ~18GB)
├── llm.pt                     # Stage 1 LM weights
├── flow.pt                    # Flow matching vocoder
├── hift.pt                    # HiFT vocoder
├── campplus.onnx              # Speaker encoder
├── speech_tokenizer_v2.onnx   # Speech tokenizer
├── CosyVoice-BlankEN/         # Text tokenizer
├── distil-whisper/            # Audio feature extractor
└── qwen2-7b/                  # SLM backbone weights

Requirements

Python 3.10+
CUDA GPU (the model takes ~22GB of memory once loaded)
~50GB disk space
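
Before cloning, the Python version and free disk space can be checked with the standard library alone. This is a rough sketch: the thresholds mirror the list above, the function name `check_requirements` is hypothetical, and GPU memory is not checked because that would require a GPU library such as torch.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)
MIN_FREE_DISK_GB = 50


def check_requirements(path=".", version=None, free_bytes=None):
    """Return a list of human-readable problems; an empty list means OK."""
    if version is None:
        version = tuple(sys.version_info[:2])
    if free_bytes is None:
        free_bytes = shutil.disk_usage(path).free
    problems = []
    if version < MIN_PYTHON:
        problems.append(
            f"Python {version[0]}.{version[1]} found, 3.10+ required"
        )
    if free_bytes < MIN_FREE_DISK_GB * 10**9:
        problems.append(
            f"only {free_bytes / 10**9:.1f} GB free, ~{MIN_FREE_DISK_GB} GB needed"
        )
    return problems
```

Running `print(check_requirements())` in the target directory lists anything that falls short; the `version` and `free_bytes` arguments exist only so the check can be exercised without touching the real system.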

Input Audio Requirements

Format: WAV (any sample rate; audio is resampled to 16 kHz internally)
Language: Taiwanese Mandarin (繁體中文)
Content: Spoken questions or statements
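
Since loading the model is slow, it can be worth confirming that a file is a readable WAV first. The sketch below uses Python's built-in `wave` module, which only handles uncompressed PCM WAV files and raises `wave.Error` otherwise; the function name `describe_wav` is hypothetical, and no resampling is done here because the inference script handles that internally.

```python
import wave


def describe_wav(path):
    """Read basic header information from a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": frames / rate,
        }
```

For example, `describe_wav("your_audio.wav")` returns the sample rate and duration, which makes it easy to spot an empty or truncated recording before running inference.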
