Guided-TTS: Text-to-Speech With Untranscribed Speech

Neural text-to-speech (TTS) models can generate high-quality, human-like speech. However, most TTS models can be trained only if transcribed data from the desired speaker is available. As a result, long-form untranscribed data, such as podcasts, cannot be used to train existing models.


A recent paper on arXiv proposes Guided-TTS, which pairs an unconditional diffusion-based generative model with a phoneme classifier for text-to-speech synthesis. The diffusion model is trained on untranscribed data: it learns to generate mel-spectrograms of the speaker without any text context. At synthesis time, the separately trained phoneme classifier guides the generative process toward the given transcript.
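The guidance idea can be illustrated with a toy sketch. It is not the paper's implementation: it replaces mel-spectrograms with a single number, the DDPM with plain Langevin dynamics, and the phoneme classifier with a two-class logistic classifier. All names (`unconditional_score`, `classifier_log_prob_grad`, `guided_score`, `langevin_sample`, `scale`) are hypothetical. It shows only the underlying identity: the gradient of log p(x|y) equals the gradient of log p(x) plus the gradient of log p(y|x), so an unconditional model plus a classifier can sample from a conditional distribution.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unconditional_score(x):
    # d/dx log p(x) for the toy density p(x) = 0.5*N(-2, 1) + 0.5*N(2, 1).
    # This stands in for the unconditional generative model (assumption).
    w = sigmoid(4.0 * x)  # posterior weight of the component centered at +2
    return -(x + 2.0) * (1.0 - w) - (x - 2.0) * w

def classifier_log_prob_grad(x, y):
    # d/dx log p(y | x) for a toy classifier with p(y=1 | x) = sigmoid(4x).
    # This stands in for the phoneme classifier (assumption).
    w = sigmoid(4.0 * x)
    return 4.0 * (1.0 - w) if y == 1 else -4.0 * w

def guided_score(x, y, scale=1.0):
    # Classifier guidance: unconditional score plus the classifier gradient.
    # `scale` is a hypothetical guidance-strength knob; 1.0 gives exact Bayes.
    return unconditional_score(x) + scale * classifier_log_prob_grad(x, y)

def langevin_sample(y, n=2000, steps=500, eps=0.05, seed=0):
    # Unadjusted Langevin dynamics driven by the guided score.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    for _ in range(steps):
        noise = rng.standard_normal(n)
        x = x + eps * guided_score(x, y) + np.sqrt(2.0 * eps) * noise
    return x

samples = langevin_sample(y=1)
print(samples.mean())  # close to +2, the mean of the class-1 component
```

In this toy setting the unconditional model never sees a label, yet sampling with the guided score concentrates on the component selected by `y`. The same separation of roles is what lets the unconditional model be trained on untranscribed speech.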

The results show that the proposed method achieves performance comparable to existing models on LJSpeech. Because the phoneme classifier is trained on a separate multi-speaker paired dataset, this is achieved without using any LJSpeech transcript. It is therefore possible to build a high-quality TTS model without transcripts for the desired speaker.

Most neural text-to-speech (TTS) models require paired data from the desired speaker for high-quality speech synthesis, which limits the usage of large amounts of untranscribed data for training. In this work, we present Guided-TTS, a high-quality TTS model that learns to generate speech from untranscribed speech data. Guided-TTS combines an unconditional diffusion probabilistic model with a separately trained phoneme classifier for text-to-speech. By modeling the unconditional distribution for speech, our model can utilize the untranscribed data for training. For text-to-speech synthesis, we guide the generative process of the unconditional DDPM via phoneme classification to produce mel-spectrograms from the conditional distribution given transcript. We show that Guided-TTS achieves comparable performance with the existing methods without any transcript for LJSpeech. Our results further show that a single speaker-dependent phoneme classifier trained on multispeaker large-scale data can guide unconditional DDPMs for various speakers to perform TTS.

Research paper: Kim, H., Kim, S., and Yoon, S., “Guided-TTS: Text-to-Speech with Untranscribed Speech”, 2021. Link: https://arxiv.org/abs/2111.11755

