The Next Generation of GPU Performance in PyTorch with nvFuser
, Senior Engineering Manager, NVIDIA
nvFuser is a fully automated GPU code generation system designed and implemented in PyTorch. It consumes graph representations of operations and transforms them into optimized GPU code. Because the code it generates is largely shape agnostic, nvFuser can produce code specialized to a given network while keeping recompilation to a minimum under dynamic workloads. Rather than relying on high-level operation descriptions such as Softmax or LayerNorm, nvFuser works from low-level definitions. This lets it maintain its performance across a wide variety of workloads, including novel user-defined normalizations. nvFuser is already used throughout PyTorch's compiler projects and is integrated into Lazy Tensors, TorchDynamo, AOT Autograd, and TorchScript. Work on nvFuser is ongoing, including support for advanced deep learning operators, and PyTorch will rely on it increasingly for GPU performance going forward.
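To make two of the ideas above concrete, here is a minimal, pure-Python sketch. It is not nvFuser's API or implementation. It assumes a hypothetical set of primitives (`PRIMS`), a hypothetical three-address fusion format, and a hypothetical `KernelCache` class. Softmax and a user-defined RMS-style normalization are both written purely as compositions of the same low-level primitives. The cache "compiles" one kernel per fusion and element type, deliberately leaving the input length out of the key, so new input sizes reuse the existing kernel. A Python closure stands in for generated GPU code.

```python
import math
from typing import Callable, Dict, List, Tuple, Union

Value = Union[float, List[float]]
Node = Tuple[str, str, Tuple[str, ...]]  # (output name, primitive op, input names)


def _unary(fn: Callable[[float], float]) -> Callable[[Value], Value]:
    # Apply fn elementwise to a row, or directly to a scalar.
    return lambda a: [fn(x) for x in a] if isinstance(a, list) else fn(a)


def _binary(fn: Callable[[float, float], float]) -> Callable[[Value, Value], Value]:
    # Combine two rows elementwise, or broadcast a scalar against a row.
    def apply(a: Value, b: Value) -> Value:
        if isinstance(a, list) and isinstance(b, list):
            return [fn(x, y) for x, y in zip(a, b)]
        if isinstance(a, list):
            return [fn(x, b) for x in a]
        if isinstance(b, list):
            return [fn(a, y) for y in b]
        return fn(a, b)
    return apply


# Hypothetical low-level vocabulary: no "softmax" or "layernorm" entries.
PRIMS: Dict[str, Callable] = {
    "exp": _unary(math.exp),
    "sqrt": _unary(math.sqrt),
    "add": _binary(lambda x, y: x + y),
    "sub": _binary(lambda x, y: x - y),
    "mul": _binary(lambda x, y: x * y),
    "div": _binary(lambda x, y: x / y),
    "reduce_max": max,
    "reduce_sum": sum,
    "reduce_mean": lambda a: sum(a) / len(a),
}

# Softmax expressed as primitives: exp(x - max(x)) / sum(exp(x - max(x))).
SOFTMAX: Tuple[Node, ...] = (
    ("m", "reduce_max", ("x",)),
    ("s", "sub", ("x", "m")),
    ("e", "exp", ("s",)),
    ("z", "reduce_sum", ("e",)),
    ("out", "div", ("e", "z")),
)

# A "user-defined" normalization built from the same primitives:
# x / sqrt(mean(x * x) + eps).
RMS_NORM: Tuple[Node, ...] = (
    ("sq", "mul", ("x", "x")),
    ("ms", "reduce_mean", ("sq",)),
    ("v", "add", ("ms", "eps")),
    ("r", "sqrt", ("v",)),
    ("out", "div", ("x", "r")),
)


class KernelCache:
    """Caches one 'compiled' kernel per (fusion, element type), not per input length."""

    def __init__(self) -> None:
        self._kernels: Dict[tuple, Callable[..., Value]] = {}
        self.compile_count = 0

    def _compile(self, fusion: Tuple[Node, ...]) -> Callable[..., Value]:
        # Stand-in for code generation: build a closure that walks the nodes.
        self.compile_count += 1

        def kernel(**inputs: Value) -> Value:
            env: Dict[str, Value] = dict(inputs)
            for out, op, args in fusion:
                env[out] = PRIMS[op](*(env[a] for a in args))
            return env["out"]

        return kernel

    def run(self, fusion: Tuple[Node, ...], **inputs: Value) -> Value:
        x = inputs["x"]
        # The row length is deliberately NOT part of the key (shape agnostic).
        key = (fusion, type(x[0]).__name__)
        if key not in self._kernels:
            self._kernels[key] = self._compile(fusion)
        return self._kernels[key](**inputs)


if __name__ == "__main__":
    cache = KernelCache()
    cache.run(SOFTMAX, x=[1.0, 2.0, 3.0])
    cache.run(SOFTMAX, x=[0.5, 0.5, 0.5, 0.5, 0.5])  # new length, same kernel
    cache.run(RMS_NORM, x=[3.0, 4.0], eps=0.0)
    print(cache.compile_count)  # 2: one per fusion, not one per input length
```

The design choice being illustrated is that a new normalization needs no new hand-written kernel, only a new composition of existing primitives, and that changing sizes does not trigger recompilation. nvFuser's real cache keys, scheduling, and CUDA code generation are considerably more involved than this sketch.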