Speech Synthesis, or Text-to-Speech (TTS), is the task of artificially producing human speech from raw transcripts. With deep learning today, synthesized waveforms can sound very natural, almost indistinguishable from how a human would speak. Such Text-to-Speech models are used, for example, when an interactive virtual assistant responds, or when a mobile device reads the text on a webpage aloud for accessibility.
In this collection, we will cover how TTS works, a sample use case, and where to get started.
TTS synthesis is a two-step process:
Text-to-Spectrogram Model: transforms the text into time-aligned acoustic features such as a spectrogram, a mel spectrogram, or F0 contours and other linguistic features. Architectures like Tacotron are used for this step.
Spectrogram-to-Audio Model: converts the generated time-aligned spectrogram representation into continuous, human-like audio; WaveGlow is one example. Both steps are chained in the sketch below.
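To make the two steps concrete, here is a minimal inference sketch using the PyTorch Hub entry points published alongside NVIDIA's Deep Learning Examples. It assumes a CUDA-capable GPU and network access for the first download; the entry-point names ('nvidia_tacotron2', 'nvidia_waveglow', 'nvidia_tts_utils') follow the PyTorch Hub listing and may change between releases, so treat this as a sketch rather than the repository's official recipe.

```python
import torch
from scipy.io.wavfile import write

# Step 1 model: Tacotron2 maps character sequences to mel spectrograms.
tacotron2 = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub',
                           'nvidia_tacotron2', model_math='fp16')
tacotron2 = tacotron2.to('cuda').eval()

# Step 2 model: WaveGlow vocodes mel spectrograms into raw waveforms.
waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub',
                          'nvidia_waveglow', model_math='fp16')
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.to('cuda').eval()

# Helper utilities that turn raw text into padded token tensors.
utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub',
                       'nvidia_tts_utils')

text = "Speech synthesis produces human speech from raw transcripts."
sequences, lengths = utils.prepare_input_sequence([text])

with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)  # text -> mel spectrogram
    audio = waveglow.infer(mel)                      # mel -> waveform

# The published checkpoints generate 22,050 Hz audio.
write("audio.wav", 22050, audio[0].data.cpu().numpy())
```

Because the two stages only communicate through the spectrogram, swapping in a different vocoder changes only the last step, which is what makes the two-stage design convenient.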
E-mail services have become very prevalent in this decade. However, it can be challenging to read important messages while abroad, where proper computer access may be lacking or security concerns may arise. With TTS technology, e-mail messages can be listened to quickly and efficiently on a smartphone, adding to productivity.
NVIDIA provides Deep Learning Examples for Speech Synthesis on its GitHub repository. These examples provide easy-to-consume, highly optimized scripts for both training and inference. The quick start guide in our GitHub repository will help you set up the environment using NGC Docker images, download pre-trained models from NGC, and adapt the model training and inference to your application or use case. Here are the examples relevant to speech synthesis, directly from Deep Learning Examples:
Tacotron2 and WaveGlow for Speech Synthesis using PyTorch
- Git repository
- Uses the PyTorch 20.03-py3 NGC container
FastPitch for text to mel-spectrogram generation using PyTorch
- Git repository
- Uses the PyTorch 20.03-py3 NGC container
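Since both FastPitch and Tacotron2 target a mel spectrogram as the intermediate representation, it can help to see what that representation looks like. The torchaudio sketch below computes a mel spectrogram from a waveform; the hyperparameters (22,050 Hz audio, 1024-point FFT, 256-sample hop, 80 mel bins) are assumptions chosen to resemble common TTS recipes, not values taken from these repositories.

```python
import torchaudio

# Load a waveform, e.g. the audio.wav produced by the sketch above.
wav, sr = torchaudio.load("audio.wav")

# Assumed settings resembling common TTS recipes (see lead-in).
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr,   # 22,050 Hz for the checkpoints above
    n_fft=1024,       # STFT window size
    hop_length=256,   # frame step, roughly 11.6 ms at 22,050 Hz
    n_mels=80,        # mel frequency bins
)
mel = mel_transform(wav)
print(mel.shape)  # (channels, 80 mel bins, time frames)
```

A first-stage model such as FastPitch learns to predict exactly this kind of time-aligned, mel-banded representation directly from text.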