The first stage of the TTS pipeline is text preparation. It consists of two parts: text analysis, which recognizes and interprets expressions such as dates, monetary amounts, and airport codes, and text normalization, which converts written text into its spoken form, for example by expanding “10 kg” to “ten kilograms”.
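To make the normalization step concrete, here is a minimal sketch that expands a number followed by a unit abbreviation into words. The `UNITS` table, the `number_to_words` helper, and the 0 to 99 range are illustrative assumptions, not part of any particular TTS toolkit; production normalizers cover far more cases (dates, currencies, ordinals, and so on).

```python
import re

# Hypothetical lookup table mapping unit abbreviations to singular spoken forms.
UNITS = {"kg": "kilogram", "km": "kilometer", "cm": "centimeter"}

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

# One or two digits, optional whitespace, then a known unit abbreviation.
PATTERN = re.compile(r"\b(\d{1,2})\s*(" + "|".join(UNITS) + r")\b")


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..99 (a deliberately small range)."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only handles 0..99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def normalize(text: str) -> str:
    """Replace '<number> <unit>' with its spoken form; leave other text as-is."""
    def expand(match: re.Match) -> str:
        value = int(match.group(1))
        unit = UNITS[match.group(2)]
        plural = "" if value == 1 else "s"
        return f"{number_to_words(value)} {unit}{plural}"

    return PATTERN.sub(expand, text)


print(normalize("The bag weighs 10 kg."))  # The bag weighs ten kilograms.
```

Numbers outside the supported range, such as “100 kg”, do not match the pattern and pass through unchanged, which keeps the sketch from producing wrong expansions.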
The next step is text encoding: each character is mapped to a numerical value, and the resulting sequence is transformed into an encoded vector that serves as input to the spectrogram generator.
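The character-to-number mapping can be sketched as a simple lookup table. The symbol inventory below (a padding symbol, lowercase letters, and a few punctuation marks) is a hypothetical choice for illustration; real systems define their own symbol sets, and the resulting IDs are typically passed through a learned embedding layer rather than used directly.

```python
import string

# Hypothetical symbol inventory: index 0 is reserved for padding.
PAD = "_"
SYMBOLS = [PAD] + list(string.ascii_lowercase) + list(" .,!?'")
SYMBOL_TO_ID = {symbol: index for index, symbol in enumerate(SYMBOLS)}


def encode(text: str) -> list[int]:
    """Map each known character to its integer ID, skipping unknown characters."""
    return [SYMBOL_TO_ID[ch] for ch in text.lower() if ch in SYMBOL_TO_ID]


print(encode("ten kg"))  # [20, 5, 14, 27, 11, 7]
```

Silently dropping unknown characters is a simplification made here; it assumes the text has already been normalized so that digits and unusual symbols no longer appear.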
After encoding, pitch and duration predictors estimate how long each phoneme should be held and the pitch at which it should be spoken, so that the generated speech has natural prosody. These predictions, along with the encoded text, are then fed into a spectrogram generator, which produces mel spectrograms.
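One way predicted durations and pitches can be combined with the encoded text, used in FastSpeech-style models, is to repeat each phoneme once per predicted frame so that the sequence matches the length of the output spectrogram. The sketch below assumes this approach; the function name `expand_to_frames` and the example values are hypothetical, and the source does not state which mechanism its spectrogram generator uses.

```python
def expand_to_frames(phonemes, durations, pitches):
    """Repeat each (phoneme, pitch) pair for its predicted number of frames.

    phonemes:  sequence of phoneme symbols or IDs
    durations: predicted frame count per phoneme (non-negative integers)
    pitches:   predicted pitch per phoneme, e.g. in Hz
    """
    if not (len(phonemes) == len(durations) == len(pitches)):
        raise ValueError("phonemes, durations, and pitches must align")
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative")

    frames = []
    for phoneme, duration, pitch in zip(phonemes, durations, pitches):
        # A phoneme held for `duration` frames contributes that many entries.
        frames.extend([(phoneme, pitch)] * duration)
    return frames


# Made-up predictions for the word "ten".
frames = expand_to_frames(["t", "eh", "n"], [2, 3, 1], [120.0, 130.0, 110.0])
print(len(frames))  # 6
```

The frame-level sequence has one entry per spectrogram frame, which is the shape a generator of this kind would consume.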
Finally, these spectrograms are passed through a vocoder model, which converts them into a natural-sounding speech waveform.