Accelerating Backward Data Gradient by Increasing Tensor Core Utilization in CUTLASS
Senior Software Engineer, NVIDIA
Learn about improving backward data gradient (Dgrad) performance by increasing Tensor Core utilization for strided problems, i.e., stride >= 2. Many machine-learning tasks require an efficient implementation of convolutions, and training additionally needs efficient implementations of the forward pass, the backward data gradient (Dgrad), and the backward weight gradient (Wgrad). Implicit GEMM convolution is one way to implement convolutions efficiently on a GPU. However, a naive implicit GEMM implementation of Dgrad underutilizes Tensor Cores for strided problem sizes (stride >= 2, "Strided Dgrad"). This results in suboptimal performance and increased training times for popular workloads such as ResNet-50, ResNeXt, and Mask R-CNN. In this talk, we explore techniques that improve the performance of Strided Dgrad by up to 4x.
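
For context, the source of the underutilization can be seen in a plain reference formulation of Dgrad. The sketch below is a minimal host-side C++ reference, assuming NHWC activation layout, KRSC filter layout, and unit dilation; all names (dgrad_reference, parameter names) are illustrative and are not part of the CUTLASS API. With stride >= 2, only the filter taps whose offsets divide evenly by the stride contribute to a given input-gradient element, which is one way to see why a naive implicit GEMM mapping of strided Dgrad leaves much of the dense Tensor Core tile work wasted.

// Host-side reference for Dgrad, assuming NHWC activations and KRSC filters.
// For each input-gradient element dx[n][h][w][c]:
//   dx[n][h][w][c] = sum over (k, r, s) of dy[n][p][q][k] * filter[k][r][s][c]
// where p = (h + pad_h - r) / stride_h and q = (w + pad_w - s) / stride_w,
// and a (r, s) tap contributes only when those divisions are exact.
#include <cstdio>
#include <vector>

void dgrad_reference(const std::vector<float>& dy,      // N x P x Q x K
                     const std::vector<float>& filter,  // K x R x S x C
                     std::vector<float>& dx,            // N x H x W x C
                     int N, int H, int W, int C,
                     int K, int R, int S, int P, int Q,
                     int stride_h, int stride_w, int pad_h, int pad_w) {
  for (int n = 0; n < N; ++n)
  for (int h = 0; h < H; ++h)
  for (int w = 0; w < W; ++w)
  for (int c = 0; c < C; ++c) {
    float acc = 0.f;
    for (int r = 0; r < R; ++r)
    for (int s = 0; s < S; ++s) {
      int ph = h + pad_h - r;
      int qw = w + pad_w - s;
      // With stride >= 2, only taps where ph and qw are exact multiples of the
      // stride contribute; the remaining taps carry no useful work, which is
      // what a naive implicit GEMM mapping wastes Tensor Core throughput on.
      if (ph % stride_h != 0 || qw % stride_w != 0) continue;
      int p = ph / stride_h, q = qw / stride_w;
      if (p < 0 || p >= P || q < 0 || q >= Q) continue;
      for (int k = 0; k < K; ++k) {
        acc += dy[((n * P + p) * Q + q) * K + k] *
               filter[((k * R + r) * S + s) * C + c];
      }
    }
    dx[((n * H + h) * W + w) * C + c] = acc;
  }
}

int main() {
  // Toy strided problem: N=1, H=W=4, C=1, K=1, R=S=3, stride=2, pad=1 -> P=Q=2.
  int N = 1, H = 4, W = 4, C = 1, K = 1, R = 3, S = 3, P = 2, Q = 2;
  std::vector<float> dy(N * P * Q * K, 1.f);
  std::vector<float> filter(K * R * S * C, 1.f);
  std::vector<float> dx(N * H * W * C, 0.f);
  dgrad_reference(dy, filter, dx, N, H, W, C, K, R, S, P, Q, 2, 2, 1, 1);
  std::printf("dx[0] = %f\n", dx[0]);
  return 0;
}

In this toy example only one of the nine filter taps contributes to dx[0]; grouping the contributing taps together (rather than iterating the full R*S extent per element) is the kind of reorganization that lets strided Dgrad keep Tensor Cores busy.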