Accelerating Convolution with Tensor Cores in CUTLASS
NVIDIA
CUTLASS provides building blocks, in the form of C++ templates, to CUDA programmers who want to write their own CUDA kernels for deep learning computations. We'll focus on implementing 2-D and 3-D convolution kernels for NVIDIA's CUDA cores and Tensor Cores. We'll describe the Implicit GEMM algorithm, then cover the new CUTLASS components that form convolution matrices and compute their product using the highly optimized CUTLASS GEMM pipeline targeting CUDA cores and Tensor Cores. Finally, we'll discuss performance and how fused operations can be composed onto the output of convolution kernels. CUTLASS convolution supports a wide range of data types (FP16, TensorFloat-32 (TF32), BFloat16 (BF16), FP32, complex, Int32, Int8, and Int4) and tensor layouts (NHWC, NCxHWx). This talk is aimed at advanced kernel writers interested in using and extending convolutions for their custom use cases.
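To make the shape of the API concrete, here is a minimal sketch, assuming CUTLASS 2.x and an Ampere (SM80) GPU, that instantiates a forward-propagation 2-D convolution as an implicit GEMM on Tensor Cores, with FP16 tensors in NHWC layout and FP32 accumulation. The tile shapes, stage count, and epilogue chosen below are illustrative, not the only valid configuration:

```cpp
#include "cutlass/cutlass.h"
#include "cutlass/conv/kernel/default_conv2d_fprop.h"
#include "cutlass/conv/device/implicit_gemm_convolution.h"
#include "cutlass/epilogue/thread/linear_combination.h"

// Element types: FP16 activations/filters/output, FP32 accumulation.
using ElementA   = cutlass::half_t;   // input activations
using ElementB   = cutlass::half_t;   // filters
using ElementC   = cutlass::half_t;   // output
using ElementAcc = float;             // accumulator

// Kernel definition: implicit GEMM Fprop convolution on SM80 Tensor Cores,
// all tensors in NHWC layout. Tile sizes are illustrative choices.
using Conv2dFpropKernel = typename cutlass::conv::kernel::DefaultConv2dFprop<
    ElementA, cutlass::layout::TensorNHWC,
    ElementB, cutlass::layout::TensorNHWC,
    ElementC, cutlass::layout::TensorNHWC,
    ElementAcc,
    cutlass::arch::OpClassTensorOp,                   // run on Tensor Cores
    cutlass::arch::Sm80,
    cutlass::gemm::GemmShape<128, 128, 32>,           // threadblock tile
    cutlass::gemm::GemmShape<64, 64, 32>,             // warp tile
    cutlass::gemm::GemmShape<16, 8, 16>,              // Tensor Core MMA shape
    cutlass::epilogue::thread::LinearCombination<     // fused epilogue: alpha*acc + beta*C
        ElementC,
        128 / cutlass::sizeof_bits<ElementC>::value,  // vectorized access width
        ElementAcc, ElementAcc>,
    cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<>,
    3,                                                // pipeline stages
    cutlass::arch::OpMultiplyAdd,
    cutlass::conv::IteratorAlgorithm::kOptimized      // optimized implicit-GEMM iterators
>::Kernel;

// Device-level operator wrapping the kernel.
using Conv2dFprop =
    cutlass::conv::device::ImplicitGemmConvolution<Conv2dFpropKernel>;
```

At run time, one fills in a cutlass::conv::Conv2dProblemSize and the operator's Arguments (tensor references plus the epilogue's alpha/beta) and invokes the resulting functor; the session shows how the same pipeline extends to 3-D convolution, the other data types listed above, and more elaborate fused epilogues.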