31 comments
  • sandreas10m

    BTW I was really impressed by the results of F5-TTS. The thing I liked best was the "Tagged" TTS, where you can specify a tag to use different tones of your own voice, like

      {Angry}What have you done?
      {Surprised}Me, I did nothing?
      {Shouting}Who else do you think I'm talking to?
      {Sad}Why are you always shouting at me?
    
    I wonder if this would also work for "Character" tags, like

      {Susan}How was your day?
      {Peter}I had a great day.
    
    That would open up great new ways of having audiobooks read by cloned voices: switching between characters with the same voice, as real narrators often do.
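
A minimal sketch of how a script tagged this way could be split into per-tag segments before synthesis. This is not F5-TTS's own parser; the `{Tag}text` grammar is inferred from the examples above, and `parse_tagged_script`, `Segment`, and the reference clip filenames are hypothetical names used only for illustration.

```python
import re
from typing import NamedTuple


class Segment(NamedTuple):
    tag: str   # e.g. "Angry" or "Susan"
    text: str  # text to be spoken with that tag's reference audio


# One "{Tag}" marker followed by everything up to the next "{" or end of input.
_SEGMENT_RE = re.compile(r"\{(\w+)\}([^{]*)")


def parse_tagged_script(script: str) -> list[Segment]:
    """Split '{Tag}text' markup into (tag, text) segments, in order."""
    return [
        Segment(tag, text.strip())
        for tag, text in _SEGMENT_RE.findall(script)
        if text.strip()  # skip tags that carry no text
    ]


script = """
{Susan}How was your day?
{Peter}I had a great day.
"""

# Hypothetical mapping from tag to a reference clip of the cloned voice.
reference_clips = {"Susan": "susan_ref.wav", "Peter": "peter_ref.wav"}

for seg in parse_tagged_script(script):
    print(f"{reference_clips[seg.tag]}: {seg.text}")
```

Each segment would then be synthesized with the reference clip for its tag, so character tags and emotion tags can share the same mechanism.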
    • throwaway8920110m

      This feature also greatly interests me, although I'm looking for a system that would let me slightly alter the pronunciation of individual words. Is anyone aware of such a system?

      Especially with TTS in a language other than English (but also with English), the pronunciation of certain words is sometimes jarringly wrong. Until TTS systems can compensate for this themselves, it would be great if humans could use such tags to hint the system toward a better pronunciation. Even if you can't specify the exact correction and the TTS just generates a 'different' sound, that could help.

      • willwade10m

        Are you not looking for SSML with its phoneme tag (IPA alphabet)? I think you might be. It’s part of all your standard OS TTS, including espeak-ng on Linux. Also in Google Cloud, Azure, Watson, and Amazon Polly voices.
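
A small sketch of what such a hint looks like, building an SSML string with the standard `<phoneme alphabet="ipa" ph="...">` element. The helper names `phoneme` and `speak` and the example IPA strings are assumptions for illustration; how faithfully a given engine honors the tag varies, so treat this as a starting point, not a guarantee.

```python
from xml.sax.saxutils import escape, quoteattr


def phoneme(word: str, ipa: str) -> str:
    """Wrap a word in an SSML <phoneme> element carrying an IPA pronunciation."""
    return f"<phoneme alphabet=\"ipa\" ph={quoteattr(ipa)}>{escape(word)}</phoneme>"


def speak(*fragments: str) -> str:
    """Join already-escaped SSML fragments into a <speak> document."""
    return "<speak>" + "".join(fragments) + "</speak>"


ssml = speak(
    escape("I say "),
    phoneme("tomato", "təˈmɑːtəʊ"),
    escape(", not "),
    phoneme("tomato", "təˈmeɪtoʊ"),
    escape("."),
)
print(ssml)
```

The resulting string is what you would pass to an engine's SSML input instead of plain text.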

        • sandreas10m

          I didn't know it existed... Thank you very much

      • sandreas10m

        Features like artificial breathing, slightly different pronunciation and other "features" are only available in commercial systems... unfortunately I don't remember the name or the video I saw about these, because I'm not interested in non-FOSS stuff for my personal projects.

    • thorsten-voice10m

      IMHO this should work (in English or Chinese). Here I show how it sounds with different tags (in this case emotions and not characters): https://youtu.be/ASFoTNpkM8o?t=27

      Here's how it's done: https://youtu.be/ASFoTNpkM8o?t=992

      • sandreas10m

        Hey, thorsten-voice himself. Thank you for your contribution to the community. I'm a happy follower of your content.

        Can't wait for F5-TTS to support the German language. Do you know whether this is planned in the near future?

  • dmezzetti10m

    Good quality and easy-to-use open TTS models are hard to find. SpeechT5, while a bit old, made it relatively easy to clone voices using the Transformers library.

    I've also found that a couple of the ESPnet TTS models are decent. I've exported those models to ONNX to make them easier to use.

    For what it's worth, here is a list of models that cover what I've worked on in the "Open models" TTS space.

    https://huggingface.co/collections/NeuML/text-to-speech-tts-...

  • asaddhamani10m

    From a quick try, the results aren't good. It sounds bland, and the spoken text doesn't exactly match the text I typed. Didn't try with voice cloning though.

    Why is good TTS so expensive and why are there no good open source options? Is it just from the need for high quality training data? I don't imagine these models are more expensive to run compared to SOTA LLMs, yet they cost so much more.

    • miki12321110m

      From what I'm seeing, most of the open source TTS models are trained on the same few voices, mostly at 16 kHz, mostly from LibriVox books I think.

      ElevenLabs is most likely trained on stolen audiobooks; they published a few YouTube videos in Polish, now taken down, of AI renditions of famous Polish audiobook narrators. This was all before they became popular, and before their voice cloning models were publicly available I think.

      • generalizations10m

        > mostly from Librivox books

        That probably explains a lot. I've tried listening to some of those audiobooks - very hit and miss, mostly miss. Definitely amateur hour and mostly bad quality.

    • sandreas10m

      I had pretty good results with coqui-tts and a VITS model that I trained myself, first with an open dataset and later with a German one that I extracted from audiobooks / EPUBs and therefore can't publish.

      The open dataset and video tutorials are all available and linked here (also in English):

      https://www.thorsten-voice.de/en/motivation-vision/

      • thorsten-voice10m

        Thanks for mentioning my Thorsten-Voice project, dear sandreas :)

    • em-bee10m

      a few weeks ago i used piper to create an acceptable translation of a book. i didn't listen to it all, but the result sounded better than anything i was able to listen to before. good enough to listen to a book if a human-read one is not available. just a few years ago, this was not the case.

      in other words, while FOSS TTS lags behind commercial options, it does get better and i expect within a few years it will produce results that are at least as good as the commercial options today if not fully caught up.

      • asaddhamani10m

        Piper seems roughly equivalent to old-school TTS output that sounds flat and jumpy, as with the concatenative approach. Listen to this first example I tried:

        https://rhasspy.github.io/piper-samples/samples/en/en_GB/ala...

        Of all the TTS APIs I have tried, I like the OpenAI voices the best. Haven't considered things like ElevenLabs because I find them ridiculously expensive.

        I love voice to voice interfaces, but only when they sound natural to my ears, and the current pricing for good ones is prohibitive for a huge number of use cases.

        • em-bee10m

          well, i was comparing it to the free tools available a few years ago, and against that, this example is a marked improvement. it's the first that i could actually bear to listen to over a longer period of time. i expect just another few years and this will actually be good.

    • modeless10m

      There are a lot of options. StyleTTS2 is pretty good, XTTSv2 is pretty good, the new E2 TTS and F5 TTS also seem decent.

    • amrrs10m

      A commercially available, high-quality training dataset is the key. Open source libraries don't get the luxury of working with voice actors to record voices.

      • Aeolun10m

        Would it be hard to create such a training dataset? Seems like you’d just need a lot of people to say a bunch of stuff for you?

        • wahnfrieden10m

          needs a crowdsourced model

          • huggingmouth10m

            Ideally, Mozilla would step up here given their mission statement, but they won't, probably because their CEO needs another bonus.

            • IshKebab10m

              Yeah there's no chance Mozilla would do anything like this:

              https://commonvoice.mozilla.org/

              • mgkimsal10m

                That's the first thing I thought of! I wonder how widely used these are. Are there any sources or data points indicating that this commonvoice data is being used, and if so, where/how? I think I may have contributed to this a few times years ago. Nice to see it's still going, would be better to know it's being used.

                • willwade10m

                  It was used quite a bit for speech-to-text, but for TTS it’s not that great.

              • Aeolun10m

                It costs a million dollars a year to host 32k hours of audio?
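
For a sense of scale, a back-of-the-envelope sketch of how much raw storage 32k hours of audio would need. The encoding is an assumption (uncompressed 16-bit mono PCM at two common sample rates), not the format the dataset is actually stored in, and the helper `pcm_bytes` is a hypothetical name. It estimates size only and says nothing about hosting prices.

```python
HOURS = 32_000  # figure quoted in the comment above


def pcm_bytes(hours: float, sample_rate_hz: int,
              bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Size in bytes of uncompressed PCM audio (assumed 16-bit mono by default)."""
    return int(hours * 3600 * sample_rate_hz * bytes_per_sample * channels)


for rate in (16_000, 48_000):
    terabytes = pcm_bytes(HOURS, rate) / 1e12
    print(f"{rate} Hz: {terabytes:.2f} TB")
# 16000 Hz: 3.69 TB
# 48000 Hz: 11.06 TB
```

A compressed format such as MP3 or Opus would be several times smaller than these uncompressed figures.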

    • sjnair9610m

      Have you tried VoiceCraft?

      • asaddhamani10m

        Yeah, all these seem hyper-focused on "voice cloning", so on Replicate VoiceCraft doesn't even let you try normal TTS unless you provide a reference voice, so I noped out.

  • DrPhish10m

    I’ve had great luck so far with GPT-SoVITS. With a custom trained Japanese model and clean reference audio the quality is outstanding. It is quite finicky to set up and use though.

    https://github.com/RVC-Boss/GPT-SoVITS

  • xrd10m

    I have been having fun with this as well:

    https://github.com/neonbjb/tortoise-tts

    It supports voice cloning, but I am indeed having trouble getting the Docker container working and the command line docs are not perfect:

    https://github.com/neonbjb/tortoise-tts/blob/1e061bc6752f05b...