NVIDIA TENSOR CORES

The Next Generation of Deep Learning

NVIDIA® Tesla® GPUs are powered by Tensor Cores, a revolutionary technology that delivers groundbreaking AI performance. Tensor Cores accelerate the large matrix operations at the heart of AI, performing mixed-precision matrix multiply-and-accumulate calculations in a single operation. With hundreds of Tensor Cores operating in parallel in a single NVIDIA GPU, they enable massive increases in throughput and efficiency.
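
For readers who want to see what this looks like in code, here is a minimal CUDA sketch (our illustration, not part of the original page) using the public WMMA API: one warp multiplies two 16x16 FP16 tiles and accumulates the result in FP32 on Volta-class (sm_70 or later) GPUs.

```cuda
#include <mma.h>
#include <cuda_fp16.h>

using namespace nvcuda;

// One warp computes C = A * B + C on 16x16 tiles: FP16 inputs, FP32 accumulator.
// This is the mixed-precision multiply-accumulate that Tensor Cores execute.
__global__ void fp16_mma_16x16x16(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);       // zero the FP32 accumulator
    wmma::load_matrix_sync(a_frag, a, 16);   // leading dimension = 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // runs on Tensor Cores
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```

Launched with a single warp (for example, fp16_mma_16x16x16<<<1, 32>>>(a, b, c)) and compiled with nvcc -arch=sm_70, this one mma_sync call replaces dozens of scalar FMA instructions.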

NVIDIA T4 Powered by Turing Tensor Cores

BREAKTHROUGH INFERENCE PERFORMANCE


Tesla T4 introduces NVIDIA Turing Tensor Core technology with multi-precision computing for the world’s most efficient AI inference. Turing Tensor Cores provide a full range of precisions for inference, from FP32 to FP16 to INT8, as well as INT4, delivering giant leaps in performance over NVIDIA Pascal® GPUs.

THE MOST ADVANCED INFERENCE PLATFORM

T4 delivers breakthrough inference performance across FP32, FP16, INT8, INT4, and binary precisions. With 130 teraOPS (TOPS) of INT8 and 260 TOPS of INT4, T4 has the world’s highest inference efficiency, up to 40X higher performance than CPUs at just 60 percent of the power consumption. Drawing just 75 watts (W), it’s the ideal solution for scale-out servers at the edge.
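
As a sketch of how the INT8 path is reached from CUDA (again our illustration; kernel and variable names are ours), the same WMMA API accepts 8-bit integer fragments with 32-bit accumulation on Turing-class (sm_72 and later) GPUs; INT4 is exposed separately through the experimental sub-byte WMMA namespace.

```cuda
#include <mma.h>

using namespace nvcuda;

// One warp computes a 16x16 INT8 matrix product with INT32 accumulation,
// the precision mode behind T4's 130-TOPS INT8 figure. Requires sm_72+.
__global__ void int8_mma_16x16x16(const signed char *a,
                                  const signed char *b,
                                  int *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, signed char, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, signed char, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, int> c_frag;

    wmma::fill_fragment(c_frag, 0);          // zero the INT32 accumulator
    wmma::load_matrix_sync(a_frag, a, 16);   // leading dimension = 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // INT8 Tensor Core op
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```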

T4 INFERENCE PERFORMANCE

Network         Inference Speedup (T4 vs. CPU)
ResNet-50       27X
DeepSpeech 2    21X
GNMT            36X

NVIDIA V100 GPU Powered by Volta Tensor Cores

THE WORLD’S HIGHEST DEEP LEARNING THROUGHPUT


Designed specifically for deep learning, the first-generation Tensor Cores in Volta deliver groundbreaking performance with mixed-precision matrix multiply in FP16 and FP32—up to 12X higher peak teraflops (TFLOPS) for training and 6X higher peak TFLOPS for inference over the prior-generation NVIDIA Pascal™. This key capability enables Volta to deliver 3X performance speedups in training and inference over Pascal.

Each of Tesla V100's 640 Tensor Cores operates on 4x4 matrices, and their associated data paths are custom-designed to power the world’s fastest floating-point compute throughput with high energy efficiency.
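
Concretely, the NVIDIA Volta architecture whitepaper describes each Tensor Core as computing one fused matrix multiply-accumulate per clock:

\[
D = A \times B + C
\]

where A and B are 4x4 FP16 matrices, and the accumulators C and D are 4x4 matrices in either FP16 or FP32.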

A BREAKTHROUGH IN TRAINING AND INFERENCE

Deep Learning Training in Less Than a Workday

Volta is equipped with 640 Tensor Cores, each performing 64 floating-point fused multiply-add (FMA) operations per clock. That delivers up to 125 TFLOPS for training and inference applications. This means developers can run deep learning training using mixed precision, FP16 compute with FP32 accumulate, and achieve both a 3X speedup over the previous generation and convergence to a network’s expected accuracy levels.
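
The 125 TFLOPS figure checks out with back-of-the-envelope arithmetic, assuming the roughly 1.53 GHz boost clock of the SXM2 Tesla V100 (the clock value is our assumption; it is not stated above):

\[
640 \ \text{cores} \times 64 \ \tfrac{\text{FMA}}{\text{clock}} \times 2 \ \tfrac{\text{FLOPs}}{\text{FMA}} \times 1.53\,\text{GHz} \approx 125 \ \text{TFLOPS}
\]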

This 3X speedup is a key breakthrough of Tensor Core technology. Now, deep learning training can happen in mere hours.

47X Higher Throughput than CPU Server on Deep Learning Inference

For inference, Tesla V100 also achieves more than a 3X performance advantage over the previous generation and is 47X faster than a CPU-based server. These speedups, measured with the NVIDIA TensorRT Programmable Inference Accelerator, are due in large part to Tensor Cores accelerating inference work using mixed precision.
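
In practice, little application code is needed to reach these Tensor Core paths through TensorRT. The sketch below (our illustration, using the TensorRT 7-era C++ API; the logger and network contents are assumed to be provided elsewhere) shows that enabling the FP16 path amounts to a single builder flag:

```cuda
#include <NvInfer.h>

using namespace nvinfer1;

// Build a TensorRT engine with FP16 Tensor Core kernels permitted.
// 'logger' is an application-supplied ILogger; the network definition
// (layers and weights) is assumed to be populated elsewhere.
ICudaEngine* buildFp16Engine(ILogger& logger) {
    IBuilder* builder = createInferBuilder(logger);
    INetworkDefinition* network = builder->createNetworkV2(0U);
    // ... populate the network here, e.g. via the ONNX parser ...
    IBuilderConfig* config = builder->createBuilderConfig();
    config->setFlag(BuilderFlag::kFP16);  // allow FP16 / Tensor Core precision
    return builder->buildEngineWithConfig(*network, *config);
}
```

The INT8 path is enabled the same way with BuilderFlag::kINT8, but additionally requires calibration data or explicit dynamic ranges so TensorRT can choose quantization scales.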

A Major Boost in Computing Performance

Read the whitepaper about Tensor Cores and the NVIDIA Volta architecture.
