An in-depth explainer of ASR and TTS, the two main components of Speech AI.
Automatic Speech Recognition (ASR) and Text-To-Speech (TTS) are the two most essential Speech AI technologies. Each pipeline includes multiple stages, such as data preprocessing, deep learning models, and post-processing. This eBook details what happens in each component and how to evaluate the performance of these technologies.
ASR, also known as speech-to-text, is the process of automatically converting spoken audio into written form.
TTS, also known as speech synthesis, takes text as input and generates a human-like synthesized voice.
Metrics such as word error rate (WER) and mean opinion score (MOS) are used to assess the performance of ASR and TTS pipelines, respectively.
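To make the two metrics concrete, here is a minimal Python sketch. It assumes the standard definition of WER as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript, and treats MOS as the plain average of listener ratings, commonly collected on a 1 to 5 scale. The function names `word_error_rate` and `mean_opinion_score` are illustrative and are not taken from the eBook. Real evaluations usually also normalize text (casing, punctuation) before scoring, which this sketch leaves out.

```python
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (rolling-row dynamic programming).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words with an empty hypothesis: i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


def mean_opinion_score(ratings: Sequence[float]) -> float:
    """Average of human listener ratings (assumed here to be on a 1 to 5 scale)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return sum(ratings) / len(ratings)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    print(mean_opinion_score([4, 5, 3, 4]))
```

Lower WER is better, and because insertions count as errors it can exceed 1.0 when the hypothesis is much longer than the reference. Higher MOS is better, but since it comes from human judgments it is only meaningful alongside the number of raters and the listening conditions.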