Developing Optimal CUDA Kernels on Hopper Tensor Cores
NVIDIA’s H100 introduces fourth-generation Tensor Cores to GPU computing, with over twice the peak performance of the previous generation. Realizing this performance requires writing CUDA kernels to optimize data movement through the memory hierarchy, exploit new architectural features for synchronizing concurrent data pipelines, and balance workloads. We’ll present these techniques in detail so that CUDA developers can implement their own fast kernels. We’ll also describe how these concepts are applied in CUTLASS 3.0, the next major release of NVIDIA’s open-source CUDA C++ template library for optimal matrix computations.
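To give a concrete flavor of the concurrent-data-pipeline idea, the sketch below overlaps asynchronous global-to-shared-memory copies with computation using the cuda::pipeline API from libcu++. This is a minimal illustration, not code from the session or from CUTLASS 3.0: the tile size, stage count, and placeholder scale-by-two "compute" step are assumptions made for brevity, and the pattern shown is the generic asynchronous-copy pipeline available since Ampere (Hopper adds further hardware support for staging data, which the session covers).

```cuda
#include <cooperative_groups.h>
#include <cuda/pipeline>

namespace cg = cooperative_groups;

// Illustrative constants (assumptions, not values from the talk).
constexpr int TILE   = 256;   // elements staged per pipeline stage
constexpr int STAGES = 2;     // double buffering in shared memory

__global__ void pipelined_scale(const float* __restrict__ in,
                                float* __restrict__ out,
                                int num_tiles)
{
    __shared__ float smem[STAGES][TILE];
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, STAGES> state;

    auto block = cg::this_thread_block();
    auto pipe  = cuda::make_pipeline(block, &state);

    // Prologue: fill the pipeline by staging the first STAGES tiles.
    int fetch = 0;
    for (; fetch < STAGES && fetch < num_tiles; ++fetch) {
        pipe.producer_acquire();
        cuda::memcpy_async(block, smem[fetch % STAGES], in + fetch * TILE,
                           sizeof(float) * TILE, pipe);
        pipe.producer_commit();
    }

    // Main loop: compute on one staged tile while the copy of a later tile is in flight.
    for (int t = 0; t < num_tiles; ++t) {
        pipe.consumer_wait();               // wait for tile t to land in shared memory
        block.sync();
        for (int i = threadIdx.x; i < TILE; i += blockDim.x)
            out[t * TILE + i] = 2.0f * smem[t % STAGES][i];   // placeholder "compute"
        block.sync();
        pipe.consumer_release();            // free this stage for reuse

        if (fetch < num_tiles) {            // stage the next tile asynchronously
            pipe.producer_acquire();
            cuda::memcpy_async(block, smem[fetch % STAGES], in + fetch * TILE,
                               sizeof(float) * TILE, pipe);
            pipe.producer_commit();
            ++fetch;
        }
    }
}
```

The producer/consumer split is the key design point: copies into shared memory are issued ahead of the tiles currently being consumed, so memory latency is hidden behind computation rather than serialized with it.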