FroST: Qwen-based Conversational Speech Model
A Qwen-based conversational speech model (CSM). Given a message, it produces both a text reply and spoken audio in a single model. It is not plain TTS: the model writes a response and speaks it (24 kHz audio).
Install
pip install "transformers>=4.50" torch peft soundfile torchaudio numpy
Usage (code ships with the weights; no extra package)
from transformers import AutoModel
m = AutoModel.from_pretrained("apxrv/frost-4b", trust_remote_code=True) # revision="v1" or "v0"
# chat: writes a reply AND speaks it
out = m.chat("hi, how are you?", system="You are Pho, warm and playful.")
print(out.text) # the text reply
out.save("reply.wav") # the spoken audio
# say: speak a given line
m.say("Read this line out loud.").save("line.wav")
out.audio is a float waveform at out.sample_rate (24 kHz).
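Because the result exposes a raw waveform and a sample rate, the audio can also be written or post-processed without `out.save`. The following is a minimal standard-library sketch, not the model's own saving code. It assumes the waveform is mono, one-dimensional, and scaled to [-1, 1] (neither the range nor the channel layout is stated above), and `write_pcm16_wav` is a hypothetical helper name.

```python
import struct
import wave


def write_pcm16_wav(path, samples, sample_rate=24000):
    """Write a mono float waveform to a 16-bit PCM WAV file.

    Assumes `samples` is a 1-D iterable of floats in [-1.0, 1.0];
    values outside that range are clipped. Returns the duration in seconds.
    """
    pcm = []
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip to the valid range
        pcm.append(int(round(s * 32767)))  # scale to signed 16-bit
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono (assumption)
        wf.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(struct.pack("<%dh" % len(pcm), *pcm))
    return len(pcm) / sample_rate
```

A call might look like `write_pcm16_wav("reply_pcm16.wav", out.audio, out.sample_rate)`. If `out.audio` turns out to be a multi-dimensional tensor or array, it would need flattening to a 1-D sequence first.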
Multi-turn conversations
Pass the dialogue so far as history (a list of {"role","content"} turns); the
model conditions on the whole conversation when it replies and speaks:
m.chat("and then what happened?", history=[
{"role": "user", "content": "tell me about your day"},
{"role": "assistant", "content": "It was lovely, thanks for asking."},
], system="You are Pho, warm and playful.")
The model is trained on multi-turn dialogue: alternating user/assistant turns, where each assistant turn carries the spoken audio for that line and the earlier turns serve as conversational context.
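Since `history` is an ordinary list of turns, a small wrapper can accumulate it across calls. This is a sketch under assumptions, not part of the shipped code: `Conversation` is a hypothetical helper, and it relies only on the `chat(message, history=..., system=...)` call and the `.text` attribute shown above. It omits `history` and `system` when they are empty, because the accepted default values are not documented here.

```python
class Conversation:
    """Keep a running history for an object exposing
    chat(message, history=..., system=...) that returns an object with .text.
    """

    def __init__(self, model, system=None):
        self.model = model
        self.system = system
        self.history = []  # list of {"role", "content"} turns

    def send(self, message):
        kwargs = {}
        if self.history:
            # Pass a copy so the model call cannot mutate the stored history.
            kwargs["history"] = list(self.history)
        if self.system is not None:
            kwargs["system"] = self.system
        out = self.model.chat(message, **kwargs)
        # Record both sides of the exchange for the next turn.
        self.history.append({"role": "user", "content": message})
        self.history.append({"role": "assistant", "content": out.text})
        return out
```

With the model loaded as `m`, usage would be along the lines of `conv = Conversation(m, system="...")`, then `conv.send("tell me about your day")` followed by `conv.send("and then what happened?").save("turn2.wav")`.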
Versions
- main / v2: latest; best instruction-following (KL-preserved text, base ≈ pristine Qwen3-4B)
- v1: earlier focused single-voice model
- v0: earlier base model
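A specific version can be selected with the `revision` argument of `from_pretrained`, which also pins the remote code that gets executed. The helper below is an illustrative sketch: `load_frost` and `KNOWN_REVISIONS` are hypothetical names, and the list contains only the revisions named in this card. Whether `v2` exists as a separate tag in addition to `main` is not stated, so it is left out.

```python
KNOWN_REVISIONS = ("main", "v1", "v0")  # revisions named in this card


def load_frost(revision="main", repo_id="apxrv/frost-4b"):
    """Load one of the listed versions via the `revision` argument."""
    if revision not in KNOWN_REVISIONS:
        raise ValueError(
            f"unknown revision {revision!r}; expected one of {KNOWN_REVISIONS}"
        )
    # Imported lazily so the check above runs without transformers installed.
    from transformers import AutoModel

    return AutoModel.from_pretrained(
        repo_id, trust_remote_code=True, revision=revision
    )
```

For example, `m = load_frost("v1")` would fetch the earlier single-voice model instead of the latest one.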
Limitations
Single voice; English; synthetic training data. Reply text is produced by the language model; use a first-person persona system prompt for in-character replies.