Speech AI is a subset of conversational AI and includes automatic speech recognition (ASR) for converting spoken language into text and text-to-speech (TTS) for transforming written words into a natural-sounding voice.
A speech AI system includes two main components: an ASR pipeline and a TTS pipeline.
The first step in a typical ASR pipeline is extracting useful features from the input audio. This is often done using a Mel spectrogram, which represents the energy of different frequencies in the audio over time on the logarithmic Mel scale. The Mel spectrogram is then passed to an acoustic model that predicts the probability of each character.
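As a rough sketch of this feature-extraction step, the snippet below computes a log-Mel spectrogram with librosa; the specific settings (80 Mel bands, a 10 ms hop at 16 kHz) are common ASR choices, not requirements.

```python
import librosa

# Load a mono waveform at 16 kHz (librosa ships a sample clip;
# any speech recording works the same way).
audio, sr = librosa.load(librosa.ex("trumpet"), sr=16000)

# Short-time Fourier transform, binned onto the Mel frequency scale.
mel = librosa.feature.melspectrogram(
    y=audio, sr=sr, n_fft=512, hop_length=160, n_mels=80
)

# Convert power to decibels, i.e., a logarithmic amplitude scale.
log_mel = librosa.power_to_db(mel)
print(log_mel.shape)  # (80 Mel bands, number of time frames)
```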
The decoder then takes these character probabilities at each time step and converts them into a sequence of words.
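A minimal sketch of this decoding step, assuming a CTC-style acoustic model whose output vocabulary includes a "blank" symbol: take the most likely symbol at each frame, merge repeats, and drop blanks. Production decoders typically use beam search instead of this greedy approach.

```python
import numpy as np

# Toy alphabet: index 0 is the CTC "blank" symbol.
VOCAB = ["-", " ", "a", "c", "t"]

def greedy_ctc_decode(log_probs: np.ndarray) -> str:
    """Collapse per-frame character probabilities into a string:
    argmax at each time step, merge repeats, drop blanks."""
    best = log_probs.argmax(axis=1)      # most likely symbol per frame
    collapsed, prev = [], None
    for idx in best:
        if idx != prev and idx != 0:     # skip repeats and blanks
            collapsed.append(VOCAB[idx])
        prev = idx
    return "".join(collapsed)

# Frames x vocabulary matrix standing in for acoustic-model output.
frames = np.log(np.array([
    [.1, .1, .6, .1, .1],   # 'a'
    [.1, .1, .6, .1, .1],   # 'a' (repeat, merged away)
    [.6, .1, .1, .1, .1],   # blank
    [.1, .1, .1, .6, .1],   # 'c'
    [.1, .1, .1, .1, .6],   # 't'
]))
print(greedy_ctc_decode(frames))  # -> "act"
```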
To improve the accuracy of ASR models, a language model is employed to predict the likelihood of a sentence and correct errors made by the acoustic model.
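One common way to apply the language model is rescoring (often called shallow fusion): combine each acoustic hypothesis score with a weighted LM score and pick the best total. The sketch below hard-codes toy scores; lm_log_prob is a stand-in for a real n-gram or neural LM.

```python
# Hypothetical acoustic-model hypotheses with log-probabilities.
hypotheses = [
    ("I scream for ice cream", -4.1),
    ("ice cream for ice cream", -3.9),
]

def lm_log_prob(sentence: str) -> float:
    """Stub LM: assigns higher probability to the fluent sentence."""
    scores = {"I scream for ice cream": -8.0,
              "ice cream for ice cream": -14.0}
    return scores[sentence]

ALPHA = 0.5  # LM weight: how strongly the LM can override acoustics

def rescore(hyps):
    return max(hyps, key=lambda h: h[1] + ALPHA * lm_log_prob(h[0]))

print(rescore(hypotheses)[0])  # the LM corrects the acoustic best guess
```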
Finally, a punctuation and capitalization model enhances the readability of the text, and inverse text normalization rules are applied to format the text correctly (e.g., converting “ten o’clock” to “10:00”).
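Inverse text normalization is typically rule-based. The toy rule below handles the "ten o'clock" to "10:00" example with a regex; production systems use large weighted finite-state grammars covering dates, currency, addresses, and more.

```python
import re

# Map spoken hour words to digits (toy coverage).
HOURS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
         "eleven": 11, "twelve": 12}

def inverse_normalize(text: str) -> str:
    pattern = r"\b(" + "|".join(HOURS) + r") o['\u2019]clock\b"
    return re.sub(
        pattern,
        lambda m: f"{HOURS[m.group(1).lower()]}:00",
        text,
        flags=re.IGNORECASE,
    )

print(inverse_normalize("The meeting starts at ten o'clock"))
# -> "The meeting starts at 10:00"
```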
The first stage of the TTS pipeline involves text preparation. This includes text analysis, which recognizes and interprets expressions such as dates, monetary amounts, and airport codes, and text normalization, which converts written text into its spoken form, for example expanding abbreviations (“10 kg” becomes “ten kilograms”).
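A toy normalization rule for the "10 kg" example, assuming a small lookup table; real TTS front ends handle dates, currency, ordinals, and many more expression types.

```python
import re

NUMBERS = {"5": "five", "10": "ten", "100": "one hundred"}  # toy coverage
UNITS = {"kg": "kilograms", "km": "kilometers", "mg": "milligrams"}

def normalize(text: str) -> str:
    def expand(match: re.Match) -> str:
        number, unit = match.group(1), match.group(2)
        return f"{NUMBERS.get(number, number)} {UNITS[unit]}"
    return re.sub(r"\b(\d+)\s*(kg|km|mg)\b", expand, text)

print(normalize("The package weighs 10 kg"))
# -> "The package weighs ten kilograms"
```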
The next step is text encoding, where each character is converted into a numerical value, and the text is transformed into an encoded vector for input into a spectrogram generator.
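Encoding can be as simple as a character-to-index lookup; the vocabulary below is illustrative, and many TTS systems encode phonemes rather than raw characters.

```python
# Assign each supported character a unique integer ID (0 is reserved
# for padding). A neural encoder then maps these IDs to embeddings.
CHARS = " abcdefghijklmnopqrstuvwxyz'.,?!"
VOCAB = {ch: i for i, ch in enumerate(CHARS, start=1)}

def encode(text: str) -> list[int]:
    return [VOCAB[ch] for ch in text.lower() if ch in VOCAB]

print(encode("Hello, world!"))
```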
After encoding, pitch and duration predictors estimate how long each phoneme should be held and the pitch at which it should be spoken, ensuring natural prosody in the generated speech. This information, along with the encoded text, is then fed into a spectrogram generator, which converts the text into Mel spectrograms.
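The duration predictor's output is often applied with a "length regulator," as in FastSpeech-style models: each encoded symbol is repeated for its predicted number of frames so the sequence matches the spectrogram frame rate, and the predicted pitch is aligned to the same frames. A minimal numpy sketch:

```python
import numpy as np

def length_regulate(encoded: np.ndarray, durations: np.ndarray) -> np.ndarray:
    """Repeat each encoded symbol vector by its predicted frame count."""
    return np.repeat(encoded, durations, axis=0)

encoded = np.random.randn(4, 8)       # 4 phonemes, 8-dim encodings
durations = np.array([3, 5, 2, 6])    # predicted frames per phoneme
pitch = np.array([180.0, 200.0, 170.0, 160.0])  # predicted pitch (Hz)

frames = length_regulate(encoded, durations)
pitch_per_frame = np.repeat(pitch, durations)

print(frames.shape)           # (16, 8): one row per spectrogram frame
print(pitch_per_frame.shape)  # (16,): pitch aligned to the same frames
```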
Finally, these spectrograms are passed through a vocoder model that generates natural-sounding speech.
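For illustration, the classical Griffin-Lim algorithm can invert a Mel spectrogram back to audio, which librosa exposes directly. Modern TTS systems replace this with a neural vocoder such as HiFi-GAN, but the input/output contract is the same.

```python
import librosa
import soundfile as sf

# Round trip: waveform -> Mel spectrogram -> waveform.
audio, sr = librosa.load(librosa.ex("trumpet"), sr=22050)
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=80)

# Griffin-Lim phase reconstruction (a non-neural vocoder baseline).
reconstructed = librosa.feature.inverse.mel_to_audio(mel, sr=sr)
sf.write("reconstructed.wav", reconstructed, sr)
```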
Speech AI components typically form part of a larger voice-based conversational AI system, which combines technologies such as automatic speech recognition, a large language model (LLM) enhanced with retrieval-augmented generation (RAG), and text-to-speech to understand and respond to user interactions.
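At the system level, the components compose in a simple loop. The sketch below uses hypothetical stubs for each stage; none of these function names belong to a specific product API.

```python
def transcribe(audio: bytes) -> str:                    # ASR stage (stub)
    return "what are your store hours"

def retrieve_context(query: str) -> str:                # RAG retrieval (stub)
    return "Store hours: 9am-6pm Mon-Sat."

def generate_answer(query: str, context: str) -> str:   # LLM stage (stub)
    return f"Based on our records: {context}"

def synthesize(text: str) -> bytes:                     # TTS stage (stub)
    return text.encode()

def handle_voice_query(audio_in: bytes) -> bytes:
    text = transcribe(audio_in)              # speech -> text
    context = retrieve_context(text)         # fetch relevant documents
    answer = generate_answer(text, context)  # grounded response
    return synthesize(answer)                # text -> speech

print(handle_voice_query(b"...").decode())
```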
An example of speech AI and conversational AI in action can be found in AI-powered virtual assistants, such as those used in customer service applications. Speech AI enables the system to transcribe and interpret spoken language, allowing users to interact naturally through voice commands.
Conversational AI then engages in meaningful, context-aware conversations, understanding intent, responding to inquiries, and handling tasks like booking appointments, providing technical support, or guiding users through troubleshooting steps. Together, these technologies create a seamless interaction, improving both the efficiency and quality of customer service.
Speech AI is reshaping workflows across industries by automating communication tasks and enabling more efficient, intelligent interactions.
To enhance customer service experiences and strengthen customer relationships, businesses are building avatars with internal domain-specific knowledge and recognizable brand voices. With NIMs, RAG-enhanced LLMs, and world-class, fully customizable, multilingual speech and translation AI, they deliver personalized answers and recommendations in unique, high-quality, customized voices.
Virtual assistants are found in every industry, enhancing user experience. ASR transcribes a user's spoken query, and text-to-speech then generates the virtual assistant's synthetic voice response. Besides humanizing transactional interactions, virtual assistants also help the visually impaired interact with non-braille text, people with speech impairments communicate with others, and children learn how to read.
Consumers expect contact center agents to resolve their issues quickly and efficiently. To meet these expectations and deliver the best customer and agent experiences possible, enterprises across industries are implementing agent-assist technology powered by Riva speech and translation AI.
In the global economy, businesses hold millions of online meetings daily and serve customers with diverse linguistic backgrounds. Companies achieve accurate live captioning with real-time transcription and translation, accommodating worldwide accents and domain-specific vocabularies. They can use LLM NIMs for summarization and insights, ensuring effective communication and smooth global interactions.
Service robots are increasingly found in hospitals, airports, and retail stores worldwide. They aid frontline workers by handling daily repetitive tasks in restaurants and manufacturing facilities, assisting customers in locating store items, and supporting physicians and nurses in patient care.
Worldwide, about 10 million call center agents answer roughly 2 billion phone calls daily, and speech AI supports a wide range of call center use cases.
For example, automatic speech recognition transcribes live conversations between customers and call center agents for text analysis, which is then used to provide agents with real-time recommendations for quickly resolving customer queries.
In healthcare, speech AI applications improve patient access to medical professionals and claims representatives. ASR automates note-taking during patient-physician conversations and information extraction for claims agents.
Speech AI enables seamless content localization for global audiences. For example, a video originally produced in Japanese can be translated in real time and output in Portuguese or Spanish, facilitating broader access. AI voice generators are then used to dub the translated content, whether for entertainment, podcasts, or educational material, ensuring a smooth, natural-sounding experience.
Additionally, speech AI can produce accurate video transcripts, enhancing accessibility for individuals with hearing impairments. This integration of real-time translation, dubbing, and transcription streamlines video editing and content creation workflows, supporting multilingual engagement across various platforms.
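A localization workflow like the one described can be sketched as per-segment translation plus dubbing, keeping the source timestamps so captions and dubbed audio stay aligned; translate and dub below are hypothetical stubs, not a specific translation or TTS API.

```python
# Source captions with timestamps (Japanese, per the example above).
segments = [
    {"start": 0.0, "end": 2.5, "text": "ようこそ"},
    {"start": 2.5, "end": 5.0, "text": "始めましょう"},
]

def translate(text: str, target: str) -> str:
    """Stub machine-translation call (here: Japanese -> Spanish)."""
    return {"ようこそ": "Bienvenidos", "始めましょう": "Comencemos"}[text]

def dub(text: str) -> bytes:
    """Stub AI voice generator; returns synthetic audio for one segment."""
    return text.encode()

localized = []
for seg in segments:
    text_es = translate(seg["text"], "es")
    localized.append({
        "start": seg["start"], "end": seg["end"],
        "text": text_es,        # doubles as an accessible transcript
        "audio": dub(text_es),  # dubbed audio, aligned to timestamps
    })

print([seg["text"] for seg in localized])
```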