Optimize Generative AI inference with Quantization in TensorRT-LLM and TensorRT
Senior Deep Learning Engineer, NVIDIA
Manager, AI/ML, NVIDIA
Because running inference on AI models at large scale is computationally costly, optimization techniques are crucial for lowering inference cost. This tutorial presents the TensorRT Model Optimization toolkit, NVIDIA's gateway for algorithmic model optimization. The toolkit provides a set of state-of-the-art quantization methods, including FP8, INT8, INT4, and mixed precisions, as well as hardware-accelerated sparsity, and bridges those methods with the most advanced NVIDIA deployment solutions, such as TensorRT-LLM. The tutorial includes end-to-end optimization-to-deployment demos for language models with TensorRT-LLM and for Stable Diffusion models with TensorRT. You can download the notebooks here: nvidia_ammo-0.9.0.tar.gz.
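To make the quantization methods above concrete, here is a minimal, self-contained sketch of symmetric per-tensor INT8 quantization, the basic idea underlying INT8 inference. This is an illustration only, not the toolkit's implementation: the real toolkit derives scales from calibration data and runs hardware-accelerated kernels, and the function names below are hypothetical.

```python
# Illustrative symmetric per-tensor INT8 quantization (not the toolkit's API).

def quantize_int8(values):
    """Map floats to int8 codes in [-127, 127] with one per-tensor scale."""
    amax = max(abs(v) for v in values)        # absolute-max "calibration"
    scale = amax / 127.0 if amax else 1.0     # real-valued step size
    return [max(-127, min(127, round(v / scale))) for v in values], scale

def dequantize_int8(codes, scale):
    """Recover approximate floats from int8 codes."""
    return [c * scale for c in codes]

weights = [0.02, -1.6, 0.75, 3.0]
codes, scale = quantize_int8(weights)
approx = dequantize_int8(codes, scale)
# Per-element rounding error is bounded by scale / 2.
```

Storing 8-bit codes instead of 32-bit floats cuts weight memory roughly 4x; FP8 and INT4 follow the same quantize/dequantize pattern with different code ranges and formats.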