How to Understand and Optimize Shared Memory Accesses using Nsight Compute
Senior Software Engineer, NVIDIA
To optimize your kernel's use of shared memory efficiently, the key ingredients are: (1) a basic mental model of how shared memory is implemented in hardware on modern NVIDIA GPUs, (2) a clear definition of the shared memory performance metrics available in Nsight Compute, and (3) a mapping from the hardware's behavior to the observed values of these metrics. We'll cover these three requirements and walk through the detailed information that Nsight Compute's profile reports provide for shared memory accesses. We'll discuss concepts such as shared memory requests, wavefronts, and bank conflicts using examples of common memory access patterns, including the asynchronous copies from global memory to shared memory introduced in the NVIDIA Ampere GPU architecture.
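As a concrete illustration of the kind of access pattern the session examines, the sketch below (not from the talk itself; kernel and variable names are illustrative) shows a classic shared-memory bank conflict: on current NVIDIA GPUs shared memory is divided into 32 four-byte-wide banks, so a warp reading down a column of a 32-wide `float` tile makes all 32 threads hit the same bank, serializing the access into many wavefronts. The commented-out padded declaration is the well-known fix.

```cuda
#include <cuda_runtime.h>

#define TILE 32

// Illustrative kernel: reads a 32x32 float tile transposed out of shared memory.
__global__ void transposeTile(const float* in, float* out)
{
    __shared__ float tile[TILE][TILE];        // conflict-prone layout
    // __shared__ float tile[TILE][TILE + 1]; // padded layout: shifts each row
                                              // by one bank, avoiding conflicts

    int x = threadIdx.x;
    int y = threadIdx.y;

    tile[y][x] = in[y * TILE + x];   // row-wise store: one bank per thread,
                                     // conflict-free
    __syncthreads();

    // Column-wise load: every thread in the warp (fixed y = threadIdx.y,
    // varying x across the warp) indexes tile[x][y], i.e. addresses that are
    // 32 floats apart -- all in the same bank -> up to a 32-way conflict.
    out[x * TILE + y] = tile[x][y];
}
```

Profiling such a kernel with, for example, `ncu --set full ./app` surfaces the serialization in the Memory Workload Analysis section of the report, where the bank-conflict and wavefront counts for shared loads and stores can be compared between the conflicting and padded variants.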