Deploying a conversational AI service can seem daunting, but NVIDIA has tools to make the process easier, including Neural Modules (NeMo for short) and a new technology called NVIDIA Riva. To save time, pretrained models, training scripts, and performance results are available on the NVIDIA GPU Cloud (NGC) software hub.
NVIDIA Riva is a GPU-accelerated application framework that allows companies to use video and speech data to build state-of-the-art conversational AI services customized for their own industry, products, and customers.
Riva offers an end-to-end deep learning pipeline for conversational AI. It includes state-of-the-art deep learning models, such as NVIDIA’s Megatron BERT for natural language understanding. Enterprises can further fine-tune these models on their data using NVIDIA NeMo, optimize for inference using NVIDIA TensorRT™, and deploy in the cloud and at the edge using Helm charts available on NGC, NVIDIA’s catalog of GPU-optimized software.
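As a rough sketch of how the NeMo side of this pipeline looks in practice, the snippet below pulls one of NeMo’s published pretrained speech recognition checkpoints from NGC and runs it on a local audio file. The model name reflects a NeMo 1.x-style checkpoint and the audio path is a placeholder; fine-tuning on your own data then follows NeMo’s standard PyTorch Lightning training flow.

```python
# Minimal sketch (NeMo 1.x-style API): load a pretrained model from NGC and
# run it before any fine-tuning. "sample.wav" is a placeholder path.
import nemo.collections.asr as nemo_asr

# Download a pretrained speech recognition checkpoint from NGC.
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(
    model_name="QuartzNet15x5Base-En"
)

# Transcribe a local audio file with the pretrained weights.
transcripts = asr_model.transcribe(paths2audio_files=["sample.wav"])
print(transcripts)

# Fine-tuning on custom data follows the usual PyTorch Lightning flow:
# configure the model's train/validation datasets, then pass the model to a
# pytorch_lightning.Trainer and call trainer.fit(asr_model).
```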
Applications built with Riva can take advantage of innovations in the new NVIDIA A100 Tensor Core GPU for AI computing and the latest optimizations in NVIDIA TensorRT for inference. This makes it possible to run an entire multimodal application, using the most powerful vision and speech models, faster than the 300-millisecond threshold for real-time interactions.
NVIDIA GPU-Accelerated, End-to-End Data Science
The RAPIDS™ suite of open-source software libraries, built on CUDA, gives you the ability to execute end-to-end data science and analytics pipelines entirely on GPUs, while still using familiar interfaces like the pandas and scikit-learn APIs.
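For example, the short sketch below (with synthetic data) uses cuDF and cuML from RAPIDS; the DataFrame and clustering calls mirror their pandas and scikit-learn counterparts but execute on the GPU.

```python
# A small sketch of the RAPIDS workflow: cuDF and cuML mirror pandas and
# scikit-learn, but run on the GPU. The data here is synthetic.
import cudf
from cuml.cluster import KMeans

# Build a GPU DataFrame, just as you would with pandas.
df = cudf.DataFrame({
    "x": [0.1, 0.2, 0.9, 1.1, 5.0, 5.2],
    "y": [0.0, 0.3, 1.0, 0.8, 4.9, 5.1],
})

# Familiar pandas-style operations run on the GPU.
print(df.describe())

# scikit-learn-style estimators from cuML accept GPU DataFrames directly.
kmeans = KMeans(n_clusters=2)
kmeans.fit(df)
print(kmeans.labels_)
```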
NVIDIA GPU-Accelerated Deep Learning Frameworks
GPU-accelerated deep learning frameworks offer flexibility to design and train custom deep neural networks, and provide interfaces to commonly used programming languages such as Python and C/C++. Widely used deep learning frameworks such as MXNet, PyTorch, TensorFlow, and others rely on NVIDIA GPU-accelerated libraries to deliver high-performance, multi-GPU accelerated training.
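The sketch below shows the basic pattern in PyTorch: once the model and tensors are placed on the GPU, training dispatches to NVIDIA’s accelerated libraries (cuDNN, cuBLAS) automatically. The toy network and random data are purely illustrative.

```python
# Minimal sketch of GPU-accelerated training in PyTorch; the framework calls
# into NVIDIA's cuDNN/cuBLAS libraries once tensors and the model are on GPU.
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy model and synthetic data, moved to the GPU.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).to(device)
inputs = torch.randn(32, 128, device=device)
targets = torch.randint(0, 10, (32,), device=device)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# One GPU-accelerated training step.
optimizer.zero_grad()
loss = loss_fn(model(inputs), targets)
loss.backward()
optimizer.step()
print(loss.item())

# Multi-GPU training typically wraps the model in
# torch.nn.parallel.DistributedDataParallel and launches one process per GPU.
```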
The Future of Conversational AI on the NVIDIA Platform
What drives the massive performance requirements of Transformer-based language networks like BERT and GPT-2 8B is their sheer complexity as well as pre-training on enormous datasets. The combination demands a robust computing platform that can deliver both fast execution and accuracy. The fact that these models can learn from massive unlabeled datasets has made them a hub of innovation for modern NLP and, by extension, a strong choice for the coming wave of intelligent assistants with conversational AI applications across many use cases.
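A back-of-envelope calculation makes that scale concrete. The parameter count below matches the 8.3-billion-parameter GPT-2 8B model mentioned above; the per-parameter byte counts are common mixed-precision training assumptions, not measured values.

```python
# Back-of-envelope arithmetic (illustrative assumptions, not measured numbers):
# rough memory footprint of a GPT-2 8B-class model during mixed-precision training.
params = 8.3e9               # ~8.3 billion parameters
weight_bytes = params * 2    # FP16 weights: 2 bytes each

# A typical Adam-style mixed-precision setup also keeps FP32 master weights
# plus two FP32 optimizer moments, roughly 12 extra bytes per parameter.
optimizer_bytes = params * 12

total_gb = (weight_bytes + optimizer_bytes) / 1e9
print(f"~{total_gb:.0f} GB before activations and gradients")  # ~116 GB
```

Even before counting activations and gradients, the model state alone exceeds the memory of any single GPU, which is why training these networks spans many GPUs and nodes.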
The NVIDIA platform, with its Tensor Core architecture, provides the programmability to accelerate the full diversity of modern AI, including Transformer-based models. In addition, the data center-scale design and optimizations of the DGX SuperPOD™, combined with software libraries and direct support for leading AI frameworks, provide a seamless end-to-end platform for developers to take on the most daunting NLP tasks.
Continuous optimizations to accelerate training of BERT and Transformer models on GPUs across multiple frameworks are freely available on NGC, NVIDIA’s hub for accelerated software.
NVIDIA TensorRT includes optimizations for running real-time inference on BERT and large Transformer-based models. To learn more, check out our “Real-Time BERT Inference for Conversational AI” blog. NVIDIA’s BERT GitHub repository also has code today to reproduce the single-node training performance quoted in this blog, and in the near future it will be updated with the scripts necessary to reproduce the large-scale training performance numbers. For the NVIDIA research team’s NLP code on Project Megatron, head over to the Megatron Language Model GitHub repository.
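As a minimal sketch of the deployment end of that pipeline, the snippet below deserializes a prebuilt TensorRT engine for BERT and creates an execution context. It assumes an engine has already been built (for example, with the scripts in the repositories above); the engine file name is a placeholder, and tokenization and device-buffer management are omitted.

```python
# Minimal sketch: load a prebuilt TensorRT engine for BERT inference.
# "bert_base_384.engine" is a placeholder for an engine you have built.
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(TRT_LOGGER)

# Deserialize the optimized engine from disk.
with open("bert_base_384.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# Create an execution context for running inference.
context = engine.create_execution_context()

# At inference time, device buffers for the input IDs, segment IDs, and input
# mask are bound to the context and executed, e.g. with context.execute_v2(bindings).
```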