A virtual digital assistant is a program that understands natural language and can answer questions or complete tasks based on voice commands.
Virtual digital assistants like Siri, Alexa, Google Assistant, and Cortana use conversational AI to recognize and respond to voice commands and carry out electronic tasks. Conversational AI is the application of machine learning to language-based applications that let humans interact naturally with devices, machines, and computers using speech. You use conversational AI when your virtual assistant wakes you up in the morning: you speak in your normal voice, and the device understands you, finds the best answer, and replies with natural-sounding speech.
Virtual digital assistants are essentially voice-enabled front ends to cloud applications. The software is most often embedded in smartphones, tablets, and desktop computers and, in some cases, in dedicated devices. In most cases, the assistant connects to the Internet to reach the cloud-based back ends needed to recognize speech and perform queries. The technology behind conversational AI is complex: a multi-step process that requires a massive amount of computing power, with all computation completing in less than 300 milliseconds to deliver a great user experience.
Virtual personal assistants such as Amazon’s Alexa, Apple’s Siri, and Microsoft’s Cortana are tuned to respond to simple requests without carrying context from one conversation to the next. A more specialized version of the personal assistant is the virtual customer assistant, which understands context and can carry on a conversation from one interaction to the next. Another specialized form of conversational AI is the virtual employee assistant, which learns the context of an employee’s interactions with software applications and workflows and suggests improvements. Virtual employee assistants are widely used in the popular new software category of robotic process automation.
Demand for digital voice assistants is on the rise: research firm Juniper estimates there will be 8 billion digital voice assistants in use by 2023, more than triple the 2.5 billion in use at the end of 2018. The shift toward working from home, telemedicine, and remote learning has created a surge in demand for custom, language-based AI services, ranging from customer support to real-time transcription and summarization of video calls, to keep people productive and connected.
Applications of conversational AI are growing every day, from voice assistants to question-answering systems that enable customer self-service. The range of industries adopting conversational AI into their solutions is wide, with domains extending from finance to healthcare. The technology is especially useful in situations where using a screen or keyboard is inconvenient or unsafe, such as while driving a car. Virtual assistants are already ubiquitous in smartphones, and as applications become mainstream and are deployed through devices in the home, car, and office, research on this space from academia and industry has exploded.
Virtual assistants require massive amounts of data and incorporate several artificial intelligence capabilities. Algorithms enable the assistant to learn from requests and improve contextual responses, such as providing answers based upon previous queries.
A typical conversational AI application uses three subsystems: one to process and transcribe the audio, one to understand the question (derive meaning) and generate a response as text, and one to speak that response back to the human. These steps are achieved by multiple deep learning solutions working together. First, automatic speech recognition (ASR) processes the raw audio signal and transcribes text from it. Second, natural language processing (NLP), or natural language understanding (NLU), derives meaning from the transcribed text (the ASR output). Last, speech synthesis, or text-to-speech (TTS), artificially produces human speech from the response text. Optimizing this multi-step process is complicated, because each step requires building and using one or more deep learning models.
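To make the flow concrete, here is a minimal sketch of that three-stage pipeline in Python. The asr, nlu, and tts functions are hypothetical stand-ins, stubbed so the control flow runs end to end; a real system plugs one or more deep learning models into each step.

```python
# Minimal sketch of the ASR -> NLU -> TTS pipeline described above.
# The three stage functions are hypothetical stubs, not a real API.

def asr(audio: bytes) -> str:
    """Stand-in for automatic speech recognition: raw audio -> text."""
    return "what is the weather today"

def nlu(text: str) -> str:
    """Stand-in for understanding + response generation: text -> reply text."""
    return "It is sunny and 72 degrees."

def tts(text: str) -> bytes:
    """Stand-in for speech synthesis: reply text -> synthesized audio."""
    return text.encode("utf-8")

def handle_utterance(audio: bytes) -> bytes:
    text = asr(audio)    # 1. transcribe the raw audio signal
    reply = nlu(text)    # 2. derive meaning and generate a response
    return tts(reply)    # 3. speak the response back to the user

print(handle_utterance(b"<raw audio>"))
```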
Deep learning models are applied for NLU because of their ability to accurately generalize over a range of contexts and languages. Transformer deep learning models, such as BERT (Bidirectional Encoder Representations from Transformers), are an alternative to recurrent neural networks that applies an attention technique: parsing a sentence by focusing attention on the most relevant words that come before and after it. BERT revolutionized progress in NLU by offering accuracy comparable to human baselines on benchmarks for question answering (QA), entity recognition, intent recognition, sentiment analysis, and more.
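As an illustration, a BERT-style question-answering model can be exercised in a few lines with the Hugging Face transformers library. The library and checkpoint here are our own example choices for demonstration, not the models the platforms above are confirmed to use.

```python
# Question answering with a BERT-style model via Hugging Face transformers.
# The checkpoint is an illustrative public one, not NVIDIA's Megatron BERT.
from transformers import pipeline

qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")
result = qa(
    question="What does ASR stand for?",
    context="Automatic speech recognition (ASR) transcribes raw audio into text.",
)
print(result["answer"], round(result["score"], 3))
```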
Conversational AI requires a massive amount of computing power and must deliver results in less than 300 milliseconds.
A GPU is composed of hundreds of cores that can handle thousands of threads in parallel. GPUs have become the platform of choice to train deep learning models and perform inference because they can deliver 10X higher performance than CPU-only platforms.
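As a simple illustration of GPU-accelerated inference, the PyTorch sketch below moves a BERT encoder and its inputs onto a GPU when one is available. It assumes the torch and transformers packages are installed; the checkpoint name is illustrative.

```python
# Running BERT-style inference on a GPU with PyTorch, falling back to CPU.
import torch
from transformers import AutoModel, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").to(device).eval()

inputs = tokenizer("GPUs execute thousands of threads in parallel.",
                   return_tensors="pt").to(device)
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, tokens, hidden size)
```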
Deploying a conversational AI service can seem daunting, but NVIDIA has tools to make this process easier, including a new technology called NVIDIA Riva.
NVIDIA Riva is a GPU-accelerated application framework that allows companies to use video and speech data to build state-of-the-art conversational AI services customized for their own industry, products, and customers.
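As a hedged sketch, a transcription request against a running Riva server might look like the following with NVIDIA's Python client. The class and method names (Auth, ASRService, offline_recognize) follow NVIDIA's published Riva client examples but should be checked against the installed client version; the server address and audio file are placeholders.

```python
# Offline speech recognition against a Riva server, per NVIDIA's client examples.
# Assumes a Riva server at localhost:50051 and the nvidia-riva-client package;
# verify class and method names against your installed version.
import riva.client

auth = riva.client.Auth(uri="localhost:50051")
asr_service = riva.client.ASRService(auth)
config = riva.client.RecognitionConfig(language_code="en-US", max_alternatives=1)

with open("sample.wav", "rb") as fh:  # placeholder audio file
    response = asr_service.offline_recognize(fh.read(), config)
print(response.results[0].alternatives[0].transcript)
```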
This framework offers an end-to-end deep learning pipeline for conversational AI. It includes state-of-the-art deep learning models, such as NVIDIA’s Megatron BERT for natural language understanding. Enterprises can further fine-tune these models on their data using NVIDIA NeMo, optimize for inference using NVIDIA® TensorRT™, and deploy in the cloud and at the edge using Helm charts available on NVIDIA GPU Cloud™ (NGC), NVIDIA’s catalog of GPU-optimized software.
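For the fine-tuning starting point, NeMo exposes pretrained checkpoints directly. The snippet below loads a pretrained ASR model and transcribes a file; the model name and transcribe signature follow NeMo's documented examples and can vary across NeMo versions.

```python
# Loading a pretrained NeMo ASR checkpoint as a starting point for fine-tuning.
# Model name and API follow NeMo's documented examples; verify per version.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(
    model_name="QuartzNet15x5Base-En")
print(asr_model.transcribe(["sample.wav"]))  # placeholder audio path
```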
Applications built with Riva can take advantage of innovations in the new NVIDIA A100 Tensor Core GPU for AI computing and the latest optimizations in NVIDIA TensorRT for inference. This makes it possible to run an entire multimodal application, using the most powerful vision and speech models, faster than the 300-millisecond threshold for real-time interactions.
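A rough way to sanity-check any such pipeline against that 300-millisecond budget is plain wall-clock timing; run_pipeline below is a placeholder for an end-to-end call such as the handle_utterance sketch earlier in this section.

```python
# Wall-clock check of end-to-end latency against the ~300 ms budget.
import time

def run_pipeline(audio: bytes) -> bytes:
    # Placeholder: substitute a real ASR -> NLU -> TTS invocation here.
    return audio

start = time.perf_counter()
run_pipeline(b"<raw audio>")
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"end-to-end latency: {elapsed_ms:.1f} ms (budget: 300 ms)")
```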
Companies worldwide are using NVIDIA’s conversational AI platform to improve their services.
Voca’s AI virtual agents—which use NVIDIA’s platform for faster, more interactive, human-like engagements—are used by Toshiba, AT&T, and other world-leading companies. Voca uses AI to understand the full intent behind a customer’s speech, making it possible for the agents to automatically identify tones and vocal cues and discern between what a customer says and what a customer means. Additionally, they can use scalability features built into NVIDIA’s AI platform to dramatically reduce customer wait time.
Kensho, the Cambridge, Mass.-based innovation hub for S&P Global that deploys scalable machine learning and analytics systems, used NVIDIA’s conversational AI to develop Scribe, a speech-recognition solution for finance and business. With NVIDIA, Scribe outperforms other commercial solutions in accuracy on earnings calls and similar financial audio by up to 20 percent.
Square has created an AI virtual assistant that lets Square sellers automatically confirm, cancel, or change appointments with their customers, freeing them to focus on more strategic customer engagement. With GPUs, Square can train models 10X faster than with CPUs, delivering more accurate, human-like interactions.