Hacker News

amrrs•10m

A CC-BY Open-Source TTS Model with Voice Cloning (huggingface.co)

31 comments

sandreas•10m
BTW I was really impressed by the results of F5-TTS. The thing I liked best was the "Tagged" TTS, where you can specify a tag to use different tones of your own voice, like
```
  {Angry}What have you done?
  {Surprised}Me, I did nothing?
  {Shouting}Who else do you think I'm talking to?
  {Sad}Why are you always shouting at me?
```
I wonder if this would also work for "Character" tags, like
```
  {Susan}How was your day?
  {Peter}I had a great day.
```
That would open up great new ways of having audiobooks read by cloned voices: switching between characters with the same voice, as real narrators often do.
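A rough sketch of how such a tagged script could be split into per-character segments before synthesis. This is not F5-TTS's own API; the `{Tag}text` line syntax is taken from the examples above, and `parse_tagged_script` and the `Default` fallback tag are hypothetical names. Each resulting segment could then be handed to whatever TTS call you use, with the reference clip chosen by tag.

```python
import re
from typing import List, Tuple

# Assumed syntax: an optional-whitespace prefix, a {Tag}, then the text to speak.
TAG_LINE = re.compile(r"^\s*\{([^{}]+)\}(.*)$")


def parse_tagged_script(script: str, default_tag: str = "Default") -> List[Tuple[str, str]]:
    """Split a tagged script into (tag, text) pairs, one per non-empty line."""
    segments: List[Tuple[str, str]] = []
    for line in script.splitlines():
        if not line.strip():
            continue  # skip blank lines
        match = TAG_LINE.match(line)
        if match:
            tag, text = match.group(1).strip(), match.group(2).strip()
        else:
            # Untagged lines fall back to the default voice.
            tag, text = default_tag, line.strip()
        if text:
            segments.append((tag, text))
    return segments


script = """
  {Susan}How was your day?
  {Peter}I had a great day.
"""

for tag, text in parse_tagged_script(script):
    print(f"{tag}: {text}")
```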
- throwaway89201•10m
  This feature also greatly interests me, although I'm looking for a system that would let me slightly alter the pronunciation of individual words. Is anyone aware of such a system?
  Especially with TTS in a language other than English (but also with English), the pronunciation of certain words is sometimes jarringly wrong. Until TTS systems can compensate for this themselves, it would be great if humans could use such tags to hint the system toward a better pronunciation. Even if you can't specify the exact correction and the TTS just generates a 'different' sound, that could help.
  - willwade•10m
    Aren't you looking for SSML with the IPA phoneme tag? I think you might be. It’s part of all your standard OS TTS engines, including espeak-ng on Linux. It's also in Google Cloud, Azure, Watson, and Amazon Polly voices.
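    A minimal sketch of what that markup could look like, assuming the engine accepts a bare `<speak>` root (some require `version` and `xmlns` attributes on it). Only the `<phoneme alphabet="ipa" ph="...">` element comes from the SSML spec; the helper names `ipa_phoneme` and `speak` and the sample transcription are made up for illustration.

    ```python
    from xml.sax.saxutils import escape, quoteattr


    def ipa_phoneme(word: str, ipa: str) -> str:
        """Wrap a word in an SSML <phoneme> element carrying an IPA transcription."""
        return f"<phoneme alphabet=\"ipa\" ph={quoteattr(ipa)}>{escape(word)}</phoneme>"


    def speak(*parts: str) -> str:
        """Join already-escaped SSML fragments under a <speak> root."""
        return "<speak>" + "".join(parts) + "</speak>"


    # Plain text must be XML-escaped; the phoneme helper escapes its own input.
    ssml = speak(escape("I say "), ipa_phoneme("tomato", "təˈmɑːtəʊ"), escape("."))
    print(ssml)
    ```

    The resulting string is what you would pass to the engine's SSML input instead of plain text.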
    - sandreas•10m
      I didn't know it existed... Thank you very much
  - sandreas•10m
    Features like artificial breathing, slightly different pronunciation and other "features" are only available in commercial systems... unfortunately I don't remember the name or the video I saw about these, because I'm not interested in non-FOSS stuff for my personal projects.
- thorsten-voice•10m
  IMHO this should work (in English or Chinese). Here I show how it sounds with different tags (in this case emotions and not characters): https://youtu.be/ASFoTNpkM8o?t=27
  Here's how it's done: https://youtu.be/ASFoTNpkM8o?t=992
  - sandreas•10m
    Hey, thorsten-voice himself. Thank you for your contribution to the community. I'm a happy follower of your content.
    Can't wait for F5-TTS to support the German language. Do you know whether this is planned in the near future?
    - thorsten-voice•10m
      You're very welcome. There is an active discussion on the F5-TTS GitHub repo (I'm involved too) about supporting other languages, including German: https://github.com/SWivid/F5-TTS/issues/87#issuecomment-2418...
dmezzetti•10m
Good quality and easy-to-use open TTS models are hard to find. SpeechT5, while a bit old, made it relatively easy to clone voices using the Transformers library.
I've also found a couple of the ESPNet TTS models are decent. I've exported those models to ONNX to make them easier to use.
For what it's worth, here is a list of models that cover what I've worked on in the "Open models" TTS space.
https://huggingface.co/collections/NeuML/text-to-speech-tts-...
asaddhamani•10m
From a quick try results aren't good. Sounds bland, and the text I type isn't exactly equal to the text that is spoken. Didn't try with voice cloning though.
Why is good TTS so expensive and why are there no good open source options? Is it just from the need for high quality training data? I don't imagine these models are more expensive to run compared to SOTA LLMs, yet they cost so much more.
- miki123211•10m
  From what I'm seeing, most of the open source TTS models are trained on the same few voices, mostly at 16 kHz, mostly from Librivox books I think.
  Eleven Labs is most likely trained on stolen audiobooks. They published a few YouTube videos in Polish, now taken down, of AI renditions of famous Polish audiobook narrators. This was all before they became popular, and before their voice cloning models were publicly available I think.
  - generalizations•10m
    > mostly from Librivox books
    That probably explains a lot. I've tried listening to some of those audiobooks - very hit and miss, mostly miss. Definitely amateur hour and mostly bad quality.
- sandreas•10m
  I had pretty good results with coqui-tts and a VITS model that I trained myself, first with an open dataset and later with one (German) that I extracted from audiobooks / EPUBs and therefore can't publish.
  The dataset and video tutorials are all available and linked here (also in English):
  https://www.thorsten-voice.de/en/motivation-vision/
  - thorsten-voice•10m
    Thanks for mentioning my Thorsten-Voice project, dear sandreas :)
    - sandreas•10m
      You're very welcome.
- em-bee•10m
  a few weeks ago i used piper to create an acceptable translation of a book. i didn't listen to it all, but the result sounded better than anything i was able to listen to before. good enough to listen to a book if a human read one is not available. just a few years ago, this was not the case.
  in other words, while FOSS TTS lags behind commercial options, it does get better and i expect within a few years it will produce results that are at least as good as the commercial options today if not fully caught up.
  - asaddhamani•10m
    Piper seems roughly equivalent to old-school TTS output that sounds flat and jumpy, as with the concatenative approach. Listen to this first example I tried:
    https://rhasspy.github.io/piper-samples/samples/en/en_GB/ala...
    Of all the TTS APIs I have tried, I like the OpenAI voices the best. I haven't considered things like ElevenLabs because I find them ridiculously expensive.
    I love voice to voice interfaces, but only when they sound natural to my ears, and the current pricing for good ones is prohibitive for a huge number of use cases.
    - em-bee•10m
      well, i was comparing it to the free tools available a few years ago, and against that, this example is a marked improvement. it's the first that i could actually bear to listen to over a longer period of time. i expect just another few years and this will actually be good.
- modeless•10m
  There are a lot of options. StyleTTS2 is pretty good, XTTSv2 is pretty good, the new E2 TTS and F5 TTS also seem decent.
- amrrs•10m
  A commercially available, high-quality training dataset is the key. Open source libraries don't get the luxury of working with voice actors to record voices.
  - Aeolun•10m
    Would it be hard to create such a training dataset? Seems like you’d just need a lot of people to say a bunch of stuff for you?
    - wahnfrieden•10m
      needs a crowdsourced model
      - huggingmouth•10m
        Ideally, Mozilla would step up here given their mission statement, but they won't, probably because their CEO needs another bonus.
        IshKebab•10m
        Yeah there's no chance Mozilla would do anything like this:
        https://commonvoice.mozilla.org/
        mgkimsal•10m
        That's the first thing I thought of! I wonder how widely used these are. Are there any sources or data points indicating that this Common Voice data is being used, and if so, where/how? I think I may have contributed to this a few times years ago. Nice to see it's still going; it would be better to know it's being used.
        willwade•10m
        It was used quite a bit for speech to text, but for TTS it’s not that great.
        Aeolun•10m
        It costs a million dollars a year to host 32k hours of audio?
- sjnair96•10m
  Have you tried VoiceCraft?
  - asaddhamani•10m
    Yeah, all these seem hyper-focused on "voice cloning". On Replicate, VoiceCraft doesn't even let you try normal TTS unless you provide a reference voice, so I noped out.
DrPhish•10m
I’ve had great luck so far with GPT-SoVITS. With a custom trained Japanese model and clean reference audio the quality is outstanding. It is quite finicky to set up and use though.
https://github.com/RVC-Boss/GPT-SoVITS
xrd•10m
I have been having fun with this as well:
https://github.com/neonbjb/tortoise-tts
It supports voice cloning, but I am having trouble getting the Docker container working, and the command-line docs are not perfect:
https://github.com/neonbjb/tortoise-tts/blob/1e061bc6752f05b...