NVIDIA Tensor Cores

Unprecedented Acceleration for HPC and AI

Tensor Cores enable mixed-precision computing, dynamically adapting calculations to accelerate throughput while preserving accuracy. The latest generation expands these speedups to a full range of workloads. From 10X speedups in AI training with Tensor Float 32 (TF32), a revolutionary new precision, to 2.5X boosts for high-performance computing (HPC) with FP64, NVIDIA Tensor Cores deliver new capabilities to all workloads.

Revolutionary Deep Learning Training

AI models continue to explode in complexity as they take on next-level challenges such as accurate conversational AI and deep recommender systems. Conversational AI models like Megatron are hundreds of times larger and more complex than image classification models like ResNet-50. Training these massive models in FP32 precision can take days or even weeks. Tensor Cores in NVIDIA GPUs provide an order-of-magnitude higher performance with reduced precisions like TF32 and FP16. And with direct support in native frameworks via NVIDIA CUDA-X™ libraries, implementation is automatic, which dramatically slashes training-to-convergence times while maintaining accuracy.
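
Under the hood, libraries like cuBLAS expose this mixed-precision path through calls such as cublasGemmEx, which takes FP16 inputs and accumulates in FP32 on Tensor Cores. Below is a rough sketch of the pattern the frameworks automate; the wrapper function name and buffer names are illustrative, and handle creation and error checking are omitted.

    #include <cublas_v2.h>
    #include <cuda_fp16.h>

    // Mixed-precision GEMM: half-precision inputs, single-precision accumulation.
    // dA (m x k), dB (k x n), dC (m x n) are device buffers filled by the caller.
    void mixed_precision_gemm(cublasHandle_t handle, int m, int n, int k,
                              const __half *dA, const __half *dB, float *dC)
    {
        const float alpha = 1.0f, beta = 0.0f;
        // FP16 A and B, FP32 C, FP32 compute type: cuBLAS dispatches this
        // combination to Tensor Cores on Volta and later GPUs.
        cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                     &alpha,
                     dA, CUDA_R_16F, m,
                     dB, CUDA_R_16F, k,
                     &beta,
                     dC, CUDA_R_32F, m,
                     CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
    }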

Tensor Cores enabled NVIDIA to win MLPerf 0.6, the first industry-wide benchmark for AI training.

Breakthrough Deep Learning Inference

A great AI inference accelerator must deliver not only great performance but also the versatility to accelerate diverse neural networks, along with the programmability to let developers build new ones. Low latency at high throughput while maximizing utilization is the most important performance requirement for deploying inference reliably. NVIDIA Tensor Cores offer a full range of precisions (TF32, BFLOAT16, FP16, INT8, and INT4) to provide unmatched versatility and performance.

Tensor Cores enabled NVIDIA to win MLPerf Inference 0.5, the first industry-wide benchmark for AI inference.

Advanced High-Performance Computing

HPC is a fundamental pillar of modern science. To unlock next-generation discoveries, scientists use simulations to better understand complex molecules for drug discovery, physics for potential sources of energy, and atmospheric data to better predict and prepare for extreme weather patterns. NVIDIA Tensor Cores offer a full range of precisions, including FP64, to accelerate scientific computing at the highest accuracy needed.

A100 Tensor Cores

Third Generation

NVIDIA Tensor Core technology has brought dramatic speedups to AI, bringing down training times from weeks to hours and providing massive acceleration to inference. The NVIDIA Ampere architecture provides a huge performance boost and delivers new precisions to cover the full spectrum required by researchers—TF32, FP64, FP16, INT8, and INT4—accelerating and simplifying AI adoption and extending the power of NVIDIA Tensor Cores to HPC.

Tensor Float 32

As AI networks and datasets continue to expand exponentially, their computing appetite has grown with them. Lower-precision math has brought huge performance speedups, but it has historically required code changes. A100 brings a new precision, Tensor Float 32 (TF32), which works just like FP32 while delivering speedups of up to 10X for AI, without requiring any code change.
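
At the library level, the "no code change" claim corresponds to an opt-in math mode: the FP32 GEMM call keeps its pointers and signature, and cuBLAS rounds inputs to TF32 internally. A minimal sketch, assuming cuBLAS 11+ on an Ampere GPU (the wrapper name and buffer names are illustrative):

    #include <cublas_v2.h>

    // The same FP32 SGEMM call as before; only the math mode changes.
    // dA (m x k), dB (k x n), dC (m x n) are FP32 device buffers.
    void tf32_sgemm(cublasHandle_t handle, int m, int n, int k,
                    const float *dA, const float *dB, float *dC)
    {
        // Opt in to TF32 Tensor Core execution for this handle.
        cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);

        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                    &alpha, dA, m, dB, k, &beta, dC, m);
    }

Because only the math mode changes, an existing FP32 application can opt in (or back out, with CUBLAS_DEFAULT_MATH) without touching any of its numerics-facing code.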

FP64 Tensor Cores

A100 brings the power of Tensor Cores to HPC, marking the biggest milestone since the introduction of double-precision GPU computing. By enabling matrix operations in FP64 precision, a whole range of HPC applications that need double-precision math can now get a 2.5X boost in performance and efficiency compared to prior generations of GPUs.
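
For developers programming Tensor Cores directly, CUDA 11 exposes FP64 matrix multiply-accumulate through the same WMMA API used for the other precisions, at an 8x8x4 tile shape. A minimal single-tile sketch, in which one warp computes D = A*B + C entirely in double precision (illustrative only; a real GEMM tiles this across full matrices, and the kernel requires compute capability 8.0):

    #include <mma.h>
    using namespace nvcuda;

    // One warp multiplies an 8x4 FP64 tile by a 4x8 tile, accumulating in FP64.
    __global__ void fp64_wmma_tile(const double *a, const double *b,
                                   const double *c, double *d)
    {
        wmma::fragment<wmma::matrix_a, 8, 8, 4, double, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 8, 8, 4, double, wmma::row_major> b_frag;
        wmma::fragment<wmma::accumulator, 8, 8, 4, double> acc;

        wmma::load_matrix_sync(a_frag, a, 4);                    // A: 8x4, ld = 4
        wmma::load_matrix_sync(b_frag, b, 8);                    // B: 4x8, ld = 8
        wmma::load_matrix_sync(acc, c, 8, wmma::mem_row_major);  // C: 8x8
        wmma::mma_sync(acc, a_frag, b_frag, acc);                // FP64 Tensor Core op
        wmma::store_matrix_sync(d, acc, 8, wmma::mem_row_major);
    }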

FP16 Tensor Cores

A100 Tensor Cores enhance FP16 for deep learning, providing a 2X speedup compared to NVIDIA Volta for AI. This dramatically boosts throughput and cuts time to convergence.
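
In CUDA terms, this is the classic mixed-precision pattern: FP16 inputs with FP32 accumulation. A single-warp, single-tile WMMA sketch (16x16x16 fragments; the kernel name and layouts are illustrative):

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // One warp computes a 16x16 tile of D = A*B + C: FP16 in, FP32 out.
    __global__ void fp16_wmma_tile(const half *a, const half *b,
                                   const float *c, float *d)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

        wmma::load_matrix_sync(a_frag, a, 16);
        wmma::load_matrix_sync(b_frag, b, 16);
        wmma::load_matrix_sync(acc, c, 16, wmma::mem_row_major);
        wmma::mma_sync(acc, a_frag, b_frag, acc);   // runs on Tensor Cores
        wmma::store_matrix_sync(d, acc, 16, wmma::mem_row_major);
    }

Accumulating in FP32 is what preserves accuracy while the multiplies themselves run at FP16 Tensor Core rates.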

INT8 Precision

First introduced in NVIDIA Turing, INT8 Tensor Cores dramatically accelerate inference throughput and deliver huge boosts in efficiency. INT8 on the NVIDIA Ampere architecture delivers 10X the inference throughput of comparable Volta GPUs for production deployments. This versatility enables industry-leading performance for both high-batch and real-time workloads in core and edge data centers.
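
The same WMMA pattern extends to quantized inference: 8-bit integer inputs with 32-bit integer accumulation. A single-tile sketch (the quantization and dequantization steps that surround a real INT8 deployment are omitted):

    #include <mma.h>
    using namespace nvcuda;

    // One warp computes a 16x16 INT8 tile with INT32 accumulation.
    __global__ void int8_wmma_tile(const signed char *a, const signed char *b,
                                   const int *c, int *d)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, signed char, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, signed char, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, int> acc;

        wmma::load_matrix_sync(a_frag, a, 16);
        wmma::load_matrix_sync(b_frag, b, 16);
        wmma::load_matrix_sync(acc, c, 16, wmma::mem_row_major);
        wmma::mma_sync(acc, a_frag, b_frag, acc);   // INT8 Tensor Core path
        wmma::store_matrix_sync(d, acc, 16, wmma::mem_row_major);
    }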

Turing Tensor Cores

Second Generation

NVIDIA Turing Tensor Core technology features multi-precision computing for efficient AI inference. Turing Tensor Cores provide a range of precisions for deep learning training and inference, from FP32 to FP16 to INT8 and INT4, delivering giant leaps in performance over NVIDIA Pascal GPUs.

Volta Tensor Cores

First Generation

Designed specifically for deep learning, the first-generation Tensor Cores in NVIDIA Volta deliver groundbreaking performance with mixed-precision matrix multiply in FP16 and FP32—up to 12X higher peak teraFLOPS (TFLOPS) for training and 6X higher peak TFLOPS for inference over NVIDIA Pascal. This key capability enables Volta to deliver 3X performance speedups in training and inference over Pascal.

The Most Powerful End-to-End AI and HPC Data Center Platform

Tensor Cores are essential building blocks of the complete NVIDIA data center solution stack that incorporates hardware, networking, software, libraries, and optimized AI models and applications from NGC. The most powerful end-to-end AI and HPC platform, it allows researchers to deliver real-world results and deploy solutions into production at scale.

A100
  Supported Tensor Core Precisions: FP64, TF32, BFLOAT16, FP16, INT8, INT4, INT1
  Supported CUDA® Core Precisions: FP64, FP32, FP16, BFLOAT16, INT8

Turing
  Supported Tensor Core Precisions: FP16, INT8, INT4, INT1
  Supported CUDA Core Precisions: FP64, FP32, FP16, INT8

Volta
  Supported Tensor Core Precisions: FP16
  Supported CUDA Core Precisions: FP64, FP32, FP16, INT8

Get Started

Order NVIDIA DGX A100, the world’s most powerful AI system.