Neural text-to-speech (TTS) models can generate high-quality, human-like speech. However, most TTS models can be trained only if transcribed data from the desired speaker is available. As a result, long-form untranscribed data, such as podcasts, cannot be used to train existing models.
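A recent paper on arXiv proposes Guided-TTS, which pairs an unconditional diffusion-based generative model with a phoneme classifier for text-to-speech synthesis. The diffusion model is trained on untranscribed data: it learns to generate mel-spectrograms of the speaker without any context. At synthesis time, the separately trained phoneme classifier guides the generative process so that the output follows the given transcript.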
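The guidance idea described above can be illustrated with a toy sketch. This is not the paper's implementation: a one-dimensional standard normal stands in for the unconditional speech model, a logistic classifier stands in for the phoneme classifier, and plain Langevin dynamics is swapped in for the DDPM reverse process. All function names and the constant W are hypothetical. The only point it shows is that adding the classifier's gradient to the unconditional score steers samples toward the conditional distribution.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def unconditional_score(x):
    # Score of a standard normal: d/dx log N(x; 0, 1) = -x.
    # Toy stand-in for the unconditional generative model.
    return -x


def classifier_grad(x, w):
    # d/dx log p(y=1 | x) for a logistic classifier p(y=1 | x) = sigmoid(w * x).
    # Toy stand-in for the phoneme classifier's gradient.
    return (1.0 - sigmoid(w * x)) * w


def guided_score(x, w, scale=1.0):
    # Bayes' rule in score form:
    # d/dx log p(x | y) = d/dx log p(x) + d/dx log p(y | x)
    return unconditional_score(x) + scale * classifier_grad(x, w)


def langevin_sample(score_fn, n=1000, steps=300, step_size=0.01, seed=0):
    # Unadjusted Langevin dynamics, run independently for each sample.
    rng = random.Random(seed)
    noise_std = math.sqrt(2.0 * step_size)
    samples = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        for _ in range(steps):
            x += step_size * score_fn(x) + noise_std * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples


W = 2.0  # hypothetical classifier weight
guided = langevin_sample(lambda x: guided_score(x, W))
guided_mean = sum(guided) / len(guided)
print(f"mean of guided samples: {guided_mean:.3f}")
```

Without guidance the samples would average around zero. With the classifier term added, they shift toward the region the classifier labels as y = 1, which is the same mechanism, in miniature, that lets a classifier turn an unconditional generator into a conditional one.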
The results show that the proposed method achieves performance comparable to existing models on LJSpeech. Because the phoneme classifier is trained on a separate multi-speaker paired dataset, this result is obtained without using any LJSpeech transcript. It is therefore possible to build a high-quality TTS model without transcripts for the desired speaker.
Most neural text-to-speech (TTS) models require paired data from the desired speaker for high-quality speech synthesis, which limits the usage of large amounts of untranscribed data for training. In this work, we present Guided-TTS, a high-quality TTS model that learns to generate speech from untranscribed speech data. Guided-TTS combines an unconditional diffusion probabilistic model with a separately trained phoneme classifier for text-to-speech. By modeling the unconditional distribution for speech, our model can utilize the untranscribed data for training. For text-to-speech synthesis, we guide the generative process of the unconditional DDPM via phoneme classification to produce mel-spectrograms from the conditional distribution given transcript. We show that Guided-TTS achieves comparable performance with the existing methods without any transcript for LJSpeech. Our results further show that a single speaker-dependent phoneme classifier trained on multispeaker large-scale data can guide unconditional DDPMs for various speakers to perform TTS.
Research paper: Kim, H., Kim, S., and Yoon, S., “Guided-TTS: Text-to-Speech with Untranscribed Speech”, 2021. Link: https://arxiv.org/abs/2111.11755