Automatic speech recognition (ASR), or speech-to-text, is the combination of processes and software that decode human speech and convert it to digitized text.
Automatic speech recognition (ASR) takes human voice as input and converts it into readable text. ASR helps us compose hands-free text messages and provides a framework for machine understanding. Human language becomes searchable and actionable, giving developers the ability to derive advanced analytics like sentiment analysis.
ASR is the first stage of a conversational AI application pipeline, in which natural language is used to communicate with machines. A typical conversational AI application uses three subsystems: one to process and transcribe the audio, one to understand the question asked (derive its meaning), and one to generate and speak the response. These steps are achieved by multiple deep learning solutions working together. First, ASR processes the raw audio signal and transcribes it to text. Second, natural language processing (NLP) derives meaning from the transcribed text (the ASR output). Last, speech synthesis, or text-to-speech (TTS), artificially produces human speech from the response text. Optimizing this multi-step process is complicated, as each of these steps requires building and using one or more deep learning models.
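To make the three-stage flow concrete, here is a minimal sketch of how such a pipeline hangs together. The function names are hypothetical placeholders standing in for the ASR, NLP, and TTS models described above, not a real API.

```python
# Minimal sketch of a conversational AI pipeline. The three stage functions
# are hypothetical placeholders, not a real API.

def transcribe(audio_bytes: bytes) -> str:
    """ASR: convert a raw audio signal to text (placeholder)."""
    ...

def understand(text: str) -> str:
    """NLP: derive meaning from the transcript and form a text response (placeholder)."""
    ...

def synthesize(text: str) -> bytes:
    """TTS: produce an audio waveform for the response text (placeholder)."""
    ...

def handle_utterance(audio_bytes: bytes) -> bytes:
    # 1. ASR transcribes the raw audio.
    transcript = transcribe(audio_bytes)
    # 2. NLP interprets the transcript and generates a text response.
    response_text = understand(transcript)
    # 3. TTS speaks the response back to the user.
    return synthesize(response_text)
```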
Applications of speech recognition and conversational AI are growing every day, from voice assistants and chatbots to question-answering systems that enable customer self-service. The range of industries adopting ASR and conversational AI into their solutions is wide, with domains extending from finance to healthcare. Several practical applications of speech-to-text are described later in this post.
ASR is a challenging natural language task: it consists of a series of subtasks, such as speech segmentation, acoustic modeling, and language modeling, that together form a prediction (a sequence of labels) from noisy, unsegmented input data. Deep learning has replaced traditional statistical methods for ASR, such as Hidden Markov Models and Gaussian Mixture Models, as it offers higher accuracy when identifying phonemes (the most basic sounds used to create speech). The introduction of Connectionist Temporal Classification (CTC) removed the need for pre-segmented data and allows the network to be trained end-to-end directly on sequence labeling tasks like ASR.
A typical CTC-based ASR pipeline involves the following steps:

1. Feature extraction: the raw audio waveform is converted into a sequence of acoustic features, such as spectrograms or mel-frequency cepstral coefficients (MFCCs).
2. Acoustic modeling: a neural network consumes the features and outputs, for each timestep, a probability distribution over the characters of the vocabulary plus a special CTC blank symbol.
3. Decoding: the per-timestep probabilities are converted into text, either greedily (sketched in the code below) or with a beam search, often rescored by a language model.
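To make the decoding step concrete, below is a minimal greedy CTC decoder: it picks the most likely label at each timestep, collapses consecutive repeats, and removes blanks. This is a sketch with a toy vocabulary and made-up probabilities; production systems typically use beam search combined with a language model.

```python
import numpy as np

# Greedy CTC decoding: a minimal sketch with a toy vocabulary.
VOCAB = ["<blank>", " ", "a", "b", "c"]  # index 0 is the CTC blank symbol

def ctc_greedy_decode(log_probs: np.ndarray, blank: int = 0) -> str:
    """log_probs: array of shape (time_steps, vocab_size) with per-timestep label scores."""
    best_path = log_probs.argmax(axis=1)  # most likely label at each timestep
    decoded, prev = [], blank
    for label in best_path:
        # CTC rule: collapse consecutive repeats, then drop blanks.
        if label != prev and label != blank:
            decoded.append(VOCAB[label])
        prev = label
    return "".join(decoded)

# Toy example: 6 timesteps over the 5-label vocabulary.
t = np.log(np.array([
    [0.1, 0.1, 0.6, 0.1, 0.1],    # "a"
    [0.1, 0.1, 0.6, 0.1, 0.1],    # "a" (repeat, collapsed)
    [0.7, 0.1, 0.1, 0.05, 0.05],  # blank (separator)
    [0.1, 0.1, 0.6, 0.1, 0.1],    # "a" again (kept: blank in between)
    [0.1, 0.1, 0.1, 0.6, 0.1],    # "b"
    [0.7, 0.1, 0.1, 0.05, 0.05],  # blank
]))
print(ctc_greedy_decode(t))  # -> "aab"
```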
Innovations like CTC have brought ASR squarely into the domain of deep learning. Popular deep learning models for ASR include wav2letter, DeepSpeech, LAS, and, more recently, Jasper from NVIDIA Research. Kaldi, a popular C++ toolkit for developing speech applications, supports traditional methods in addition to deep learning modules.
A GPU is composed of hundreds of cores that can handle thousands of threads in parallel. Because neural networks are built from large numbers of identical neurons, they are highly parallel by nature. This parallelism maps naturally to GPUs, providing a significant computational speed-up over CPU-only training. For example, GPU-accelerated Kaldi solutions can perform 3,500x faster than real-time audio and 10x faster than CPU-only options. This performance has made GPUs the platform of choice for training deep learning models and performing inference.
Healthcare
One of the difficulties facing healthcare is making it easily accessible. Calling your doctor's office and waiting on hold is a common occurrence, as is a delay in connecting with a claims representative. The use of conversational AI to train chatbots is an emerging technology within healthcare that addresses the shortage of healthcare professionals and opens the lines of communication with patients.
In the blog post Empowering Smart Hospitals with NVIDIA Clara Guardian from NGC, you can read how to build a Virtual Patient Assistant client application that takes input queries from the patient, interprets each query by extracting its intent and relevant slots, and computes a response in real time with a natural-sounding voice.
Financial Services
Conversational AI is powering better chatbots and AI assistants for financial services firms.
Retail
Chatbot technology is also commonly used for retail applications to accurately analyze customer queries and generate responses or recommendations. This streamlines the customer journey and improves efficiencies in store operations.
Deploying a service with conversational AI can seem daunting, but NVIDIA now has tools to make this process easier, including Neural Modules (NeMo for short) and a new technology called NVIDIA Riva. To save time, pretrained ASR models, training scripts, and performance results are also available on the NGC software hub.
NVIDIA NeMo is a PyTorch-based toolkit for developing conversational AI applications. Built around modular deep neural network development, NeMo enables fast experimentation by connecting, mixing, and matching components. NeMo modules typically represent data layers, encoders, decoders, language models, loss functions, or methods of combining activations. NeMo makes it easy to compose complex neural network architectures and systems using reusable components for each of ASR, NLP, and TTS. Additionally, on NVIDIA GPU Cloud (NGC) you can find NeMo resources for conversational AI, such as pretrained models, scripts for training and evaluation, and end-to-end NeMo applications that let developers experiment with different algorithms and perform transfer learning using their own datasets.
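As an illustration, the snippet below sketches how a pretrained NeMo ASR model can be pulled from NGC and used for transcription. It assumes the NeMo 1.x API, the QuartzNet15x5Base-En checkpoint name, and a local sample.wav file; check the NeMo documentation for the exact API of your installed version.

```python
# Minimal sketch using the NeMo 1.x ASR API. The checkpoint name, file path,
# and transcribe() signature are assumptions; verify against your NeMo version.
import nemo.collections.asr as nemo_asr

# Download a pretrained CTC acoustic model from NGC (assumed checkpoint name).
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(
    model_name="QuartzNet15x5Base-En"
)

# Transcribe a local recording (hypothetical file).
transcripts = asr_model.transcribe(["sample.wav"])
print(transcripts[0])
```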
To facilitate the implementation and domain adaptation of the complete ASR pipeline, NVIDIA created the Domain Specific – NeMo ASR Application. Developed with NeMo, this application lets you train or fine-tune pretrained (acoustic and language) ASR models with your own data, giving you the ability to progressively create better-performing ASR models built specifically for your data.
NVIDIA Riva is an application framework that provides several pipelines for accomplishing conversational AI tasks.
The NVIDIA RAPIDS™ suite of open-source software libraries, built on CUDA, gives you the ability to execute end-to-end data science and analytics pipelines entirely on GPUs, while still using familiar interfaces like the pandas and scikit-learn APIs.
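For instance, a pandas-style groupby can run entirely on the GPU with cuDF. A minimal sketch follows, assuming a CUDA-capable machine with RAPIDS (cudf) installed; the column names and data are made up for illustration.

```python
# Minimal cuDF sketch: the pandas-like API executes on the GPU.
# Assumes a CUDA-capable machine with RAPIDS (cudf) installed.
import cudf

# Hypothetical call-center data for illustration.
df = cudf.DataFrame({
    "call_center": ["east", "west", "east", "west"],
    "handle_time_s": [212.0, 180.5, 195.0, 240.0],
})

# The same groupby/aggregate expression you would write in pandas,
# executed entirely on the GPU.
print(df.groupby("call_center")["handle_time_s"].mean())
```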
GPU-accelerated deep learning frameworks offer flexibility to design and train custom deep neural networks and provide interfaces to commonly used programming languages such as Python and C/C++. Widely used deep learning frameworks such as MXNet, PyTorch, TensorFlow, and others rely on NVIDIA GPU-accelerated libraries to deliver high-performance, multi-GPU accelerated training.
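As an example of this interface, moving a PyTorch model and batch to the GPU is a one-line device change; the rest of the training step is identical on CPU and GPU. A minimal sketch with a toy model and random data:

```python
# Minimal PyTorch sketch: the same training step runs on GPU or CPU;
# only the device placement changes. Toy model and random data for illustration.
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(32, 64, device=device)           # batch of 32 feature vectors
targets = torch.randint(0, 10, (32,), device=device)  # random class labels

optimizer.zero_grad()
loss = loss_fn(model(inputs), targets)
loss.backward()  # gradients computed in parallel on the GPU when available
optimizer.step()
print(f"loss: {loss.item():.4f} (device: {device})")
```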
To learn more, refer to the blogs, code samples, and webinar listed below: