FroST — Qwen-based Conversational Speech Model

FroST is a conversational speech model (CSM) built on Qwen. Given a message, a single model produces both a text reply and spoken audio of that reply at 24 kHz. It is not plain TTS: the model composes the response and then speaks it.

Install

pip install "transformers>=4.50" torch peft soundfile torchaudio numpy

Usage (code ships with the weights — no extra package)

from transformers import AutoModel
m = AutoModel.from_pretrained("apxrv/frost-4b", trust_remote_code=True)   # revision="v1" or "v0"

# chat: writes a reply AND speaks it
out = m.chat("hi, how are you?", system="You are Pho — warm and playful.")
print(out.text)          # the text reply
out.save("reply.wav")    # the spoken audio

# say: speak a given line
m.say("Read this line out loud.").save("line.wav")

out.audio is a float waveform at out.sample_rate (24 kHz).
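
Because the output is a float waveform plus a sample rate, clip length and a 16-bit PCM export follow from simple arithmetic. The sketch below uses only the standard library. `duration_seconds` and `to_pcm16_wav_bytes` are hypothetical helper names, not part of the model's API, and the code assumes a mono waveform with samples in [-1.0, 1.0]. A synthetic sine wave stands in for out.audio.

```python
import io
import math
import struct
import wave


def duration_seconds(audio, sample_rate):
    """Length of a mono waveform in seconds."""
    return len(audio) / float(sample_rate)


def to_pcm16_wav_bytes(audio, sample_rate):
    """Encode a mono float waveform (assumed in [-1, 1]) as 16-bit PCM WAV bytes."""
    frames = bytearray()
    for sample in audio:
        # Clip first so out-of-range samples cannot overflow the int16 range.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)      # mono (assumption)
        wav.setsampwidth(2)      # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return buf.getvalue()


# Stand-in for out.audio / out.sample_rate: half a second of a 440 Hz tone.
sample_rate = 24000
audio = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(12000)]
wav_bytes = to_pcm16_wav_bytes(audio, sample_rate)
clip_length = duration_seconds(audio, sample_rate)
```

With a real result, the same helpers would be called as duration_seconds(out.audio, out.sample_rate), provided out.audio is one-dimensional.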

Multi-turn conversations

Pass the dialogue so far as history, a list of turns where each turn is a dict with "role" and "content" keys; the model conditions on the whole conversation when it replies and speaks:

m.chat("and then what happened?", history=[
    {"role": "user", "content": "tell me about your day"},
    {"role": "assistant", "content": "It was lovely, thanks for asking."},
], system="You are Pho — warm and playful.")

The model is trained on multi-turn dialogue: alternating user/assistant turns, where each assistant turn carries the spoken audio for that line and the earlier turns serve as conversational context.
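
For a running conversation, the history list has to grow by one user turn and one assistant turn after every reply. A minimal sketch of that bookkeeping follows. `extend_history` and `is_alternating` are hypothetical helpers, and the sketch assumes that m.chat does not append to history itself and that the current message is passed separately, as in the call above.

```python
def extend_history(history, user_message, assistant_text):
    """Return a new history list with one user/assistant exchange appended."""
    return list(history) + [
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": assistant_text},
    ]


def is_alternating(history):
    """True if turns strictly alternate user, assistant, user, assistant, ..."""
    expected = ("user", "assistant")
    return all(turn["role"] == expected[i % 2] for i, turn in enumerate(history))


# Intended loop with the real model (not executed here):
#   out = m.chat(message, history=history, system=persona)
#   history = extend_history(history, message, out.text)

# Demonstration with fixed strings standing in for out.text.
history = []
history = extend_history(history, "tell me about your day",
                         "It was lovely, thanks for asking.")
history = extend_history(history, "and then what happened?",
                         "Then I went for a walk.")
```

Returning a new list rather than mutating the argument keeps earlier snapshots of the conversation intact, which is convenient when retrying a turn.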

Versions

main / v2: latest; best instruction-following (KL-preserved text, base $\approx$ pristine Qwen3-4B)
v1: earlier focused single-voice model
v0: earlier base model
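
To keep results reproducible, a specific version can be pinned through the `revision` argument of `from_pretrained`. The sketch below is one way to do that. `load_kwargs` and `load_frost` are hypothetical helpers, and the set of accepted names is an assumption taken from the list above (in particular, that "v2" exists as a revision alongside "main").

```python
# Revision names assumed from the version list above.
KNOWN_REVISIONS = ("main", "v2", "v1", "v0")


def load_kwargs(revision="main"):
    """Build keyword arguments for AutoModel.from_pretrained with a pinned revision."""
    if revision not in KNOWN_REVISIONS:
        raise ValueError(f"unknown revision {revision!r}; expected one of {KNOWN_REVISIONS}")
    return {
        "pretrained_model_name_or_path": "apxrv/frost-4b",
        "revision": revision,
        "trust_remote_code": True,  # the modeling code ships with the weights
    }


def load_frost(revision="main"):
    """Download and load the model at the given revision (requires transformers)."""
    from transformers import AutoModel  # imported lazily so load_kwargs works without it
    return AutoModel.from_pretrained(**load_kwargs(revision))


pinned = load_kwargs("v1")
```

Calling load_frost("v1") would then fetch the earlier single-voice model instead of the latest one.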

Limitations

Single voice; English; synthetic training data. Reply text is produced by the language model — use a first-person persona system prompt for in-character replies.
