CUTLASS: A Performant, Flexible, and Portable Way to Target Hopper Tensor Cores
, Senior Architect, NVIDIA
, Sr. Architect, NVIDIA
NVIDIA’s H100 introduced fourth-generation Tensor Cores to GPU computing, with over twice the peak performance of the previous generation. Building on our GTC’23 session, we’ll describe how the latest version of CUTLASS leverages Hopper features for peak performance and cover the major features added since last year’s release, including convolutions, fused epilogue visitors, and a Python interface. The session is aimed at those who want to implement custom kernels for machine learning and HPC applications that achieve peak performance.