Learn how leading open models and software are continuously optimized and accelerated for peak performance by NVIDIA’s full-stack inference solutions.
Do you need to compute at a larger scale, or faster, than a single GPU allows, but no multi-GPU library provides the functionality you need? Learn how to scale your application to multiple GPUs and multiple nodes with the available multi-GPU communication libraries. We'll introduce CUDA-aware MPI, NVSHMEM, and NCCL using a real-world application example.
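To give a flavor of the first of these approaches, here is a minimal sketch of the CUDA-aware MPI pattern, in which device pointers are passed directly to MPI calls. It assumes an MPI build with CUDA support (e.g., Open MPI configured with --with-cuda) and one process per GPU; the buffer contents and sizes are illustrative, not taken from the session.

```c
// Minimal CUDA-aware MPI sketch: pass a device pointer directly to
// MPI_Allreduce. A CUDA-aware MPI stages the transfer (or uses
// GPUDirect RDMA) under the hood; no explicit host copy is needed.
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Bind each rank to a GPU (simple round-robin over local devices).
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(rank % ndev);

    // Allocate a buffer in device memory; size is arbitrary here.
    const int n = 1 << 20;
    float *d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));
    cudaMemset(d_buf, 0, n * sizeof(float));  // stand-in for real kernel output

    // The device pointer goes straight into the MPI call.
    MPI_Allreduce(MPI_IN_PLACE, d_buf, n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("allreduce across %d ranks done\n", size);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```

NVSHMEM and NCCL address the same scaling problem at different levels: NVSHMEM offers one-sided, GPU-initiated communication over a partitioned global address space, while NCCL provides topology-aware collectives (all-reduce, all-gather, etc.) tuned for NVLink and InfiniBand.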
Important: This session is near capacity; we highly suggest arriving early. Attendees are admitted on a first-come, first-served basis.