CUTLASS: Python API, Enhancements, and NVIDIA Hopper
Senior Research Scientist, NVIDIA
Principal Compute Architect, NVIDIA
The latest release of CUTLASS delivers a new Python API for designing, JIT compiling, and launching optimized matrix computations from a Python environment. The functionality of CUTLASS has also been extended to include grouped and depthwise separable convolution, fused kernels for layer normalization and multi-head attention, and optimizations to grouped GEMM. Additionally, CUTLASS 2.11 takes advantage of new features of the NVIDIA Hopper architecture, including 2x faster FP64 Tensor Cores and FP8 numerical conversion. We'll describe implementation details of these computations and optimization techniques for achieving peak performance. We'll also provide a preview of CUTLASS 3.0, which offers an enhanced programming model for implementing tensor computations using CUDA.