This model is nothing but amazing
I spent five damn days trying to find a proper TTS for my local voice assistant. Piper was too robotic, XTTS hallucinated too much, and Fish and Chatterbox were too slow for my GPU. I ruled out all models without German support like Spark/Kyutai/Kokoro because I'm not the only one in this home.
Just when I was about to give up and settle for Piper, I found this model. The results are amazing. Thank you so much for the work you put in!
I've now built a server to properly integrate Spark/MiraTTS into Home Assistant and will use this model every day with a big smile on my face. :)
I attached a small demo sample, just randomly generated book text. The quality definitely exceeds any expectation I had.
Cheers!
One minor issue at ".txt" that can probably be fixed with a better LLM system prompt. Still an amazing result:
Which Reference Audio did you use? Just a random recording of yourself or are there best practices which should be taken into account?
Yeah, that sounds great. Do you sometimes have the issue that the end of the text (up to the last 10 characters or so) is cut off at the end of the audio? I have that sometimes. Do you use a specific TTS server for it? Maybe I used the wrong GitHub repo as the base for my project to get simple streaming working.
Besides that, it works really well. I use it for low-latency streaming by splitting the input text into chunks; with that, the first audio chunk is generated after ~500 ms for 80 characters. But only with a simple test client that sends a fixed text that I change sometimes, so no LLM input yet. I hope I can lower it even further, because with an LLM you will definitely have some extra latency, but I noticed Qwen 30B A3B Instruct is better than I expected even at 2-bit, and crazy fast.
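For anyone curious, the chunking idea above can be sketched roughly like this: split the LLM output on sentence boundaries and pack sentences up to a character budget, then synthesize each chunk as soon as it is complete. This is my own simplified illustration, not the exact code from the project; the TTS call at the end is a hypothetical placeholder.

```python
import re

def split_into_chunks(text: str, max_chars: int = 80) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars.
    A single over-long sentence still becomes its own chunk."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# Streaming idea: synthesize each chunk as soon as it is full, so
# playback of chunk 1 starts while chunk 2 is still being generated.
# for chunk in split_into_chunks(llm_output):
#     play(tts.synthesize(chunk))  # hypothetical TTS call
```

Shorter chunks lower time-to-first-audio but give the model less context per generation, so the 80-character budget is a trade-off.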
Which Reference Audio did you use? Just a random recording of yourself or are there best practices which should be taken into account?
I used a 30s audiobook recording of Klaus-Dieter Klebsch, the German voice of Dr. House.
Cleanly cut (not in the middle of a word), without background noise.
Afaik the sample language should match the language of the text you are synthesizing, but I found a German voice with English text works just as well.
Yeah, that sounds great. Do you sometimes have the issue that the end of the text (up to the last 10 characters or so) is cut off at the end of the audio? I have that sometimes. Do you use a specific TTS server for it? Maybe I used the wrong GitHub repo as the base for my project to get simple streaming working.
I wrote my own TTS server implementing the Wyoming protocol, so it works directly with Home Assistant. You probably can't make use of it unless you use Home Assistant: it does not expose an HTTP server; Wyoming works over TCP and needs a compatible endpoint. And no, I must say I did not face any cut-offs, just rare mispronunciations. I fix those by changing the system prompt of the LLM, for example letting it output "Drei Uhr zweiundzwanzig" instead of "3:22 Uhr". That solves a lot of issues for me.
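The system-prompt trick works, but the same kind of fix could also be done deterministically with a small text normalizer in front of the TTS. A rough Python sketch of that alternative (my own illustration, covering only "H:MM Uhr" patterns, not what the poster actually runs):

```python
import re

# Minimal German number words for 0-59, enough for hours and minutes.
UNITS = ["null", "ein", "zwei", "drei", "vier", "fünf", "sechs", "sieben",
         "acht", "neun", "zehn", "elf", "zwölf", "dreizehn", "vierzehn",
         "fünfzehn", "sechzehn", "siebzehn", "achtzehn", "neunzehn"]
TENS = {2: "zwanzig", 3: "dreißig", 4: "vierzig", 5: "fünfzig"}

def german_number(n: int) -> str:
    if n == 1:
        return "eins"  # standalone "one"
    if n < 20:
        return UNITS[n]
    tens, unit = divmod(n, 10)
    if unit == 0:
        return TENS[tens]
    return UNITS[unit] + "und" + TENS[tens]  # 22 -> "zweiundzwanzig"

def normalize_times(text: str) -> str:
    """Rewrite '3:22 Uhr' as 'drei Uhr zweiundzwanzig' before TTS."""
    def repl(m: re.Match) -> str:
        hour, minute = int(m.group(1)), int(m.group(2))
        spoken = ("ein" if hour == 1 else german_number(hour)) + " Uhr"
        if minute:
            spoken += " " + german_number(minute)
        return spoken
    return re.sub(r"\b(\d{1,2}):(\d{2})\s*Uhr\b", repl, text)
```

For example, `normalize_times("Es ist 3:22 Uhr.")` returns `"Es ist drei Uhr zweiundzwanzig."`. Unlike the LLM prompt approach, this never hallucinates, but it only catches patterns you wrote a rule for.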
Thanks for the response. I'm also thinking about writing my own server, since the chatbot backend, my smart home software, and the TTS all run on the same machine, so HTTP is not the best connection for that. Maybe it is an issue with the server I used for testing, or the way it generates audio output.
Hm, I don't think it sounds close enough to Klaus-Dieter Klebsch. His voice is so distinctive that I would have recognized it instantly (I loved Dr. House and have been a fan of his voice for years now), but not in these examples. He sounds a bit high, without the deep part of his voice. Did you change his voice a bit, or did he sound different in the audiobook?
I also ask out of general interest in the quality of voice cloning in this model. For the one voice I used, it worked very well. Since this model is itself already a finetune of MiraTTS, just with a larger dataset, a voice finetune should also be possible here if there is training software on GitHub. With that you could get much closer to the target voice. You can normally also train more than one voice if you want to use different ones. But I have no clue how much data you need here; on XTTSv2, 7 minutes of audio for one voice was already enough.
I wrote my own TTS server implementing the Wyoming protocol, so it works directly with Home Assistant. You probably can't make use of it unless you use Home Assistant: it does not expose an HTTP server; Wyoming works over TCP and needs a compatible endpoint. And no, I must say I did not face any cut-offs, just rare mispronunciations. I fix those by changing the system prompt of the LLM, for example letting it output "Drei Uhr zweiundzwanzig" instead of "3:22 Uhr". That solves a lot of issues for me.
I would also like to use this model for my Home Assistant. I already tried the MiraTTS from @maglat in combination with the wyoming_openai container from roryeckel.
But no luck so far. The container with MiraTTS and the model is running, but access through the Wyoming container produces errors. I am not sure which settings to use.
What kind of TTS server did you write? Would it be possible for you to share it?