Accelerated LLM Model Alignment and Deployment in NeMo, TensorRT-LLM, and Triton Inference Server
, Deep Learning Solutions Architect, NVIDIA
, Deep Learning Solutions Architect, NVIDIA
The demand for accelerated large language models (LLMs) has surged with the growing popularity of generative models. These models, often comprising billions of parameters, hold immense potential but also pose challenges for large-scale deployment. Join us as we delve into accelerated LLM alignment with the NeMo Framework, and inference optimization and deployment with NVIDIA TensorRT-LLM and Triton Inference Server. We'll spotlight supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) with LoRA, key techniques for LLM alignment. We'll also unpack inference optimization with TensorRT-LLM, highlighting KV caching, paged attention, and in-flight batching, and the pivotal role these play in making LLMs faster and more cost-effective. Finally, we'll walk through the crucial steps of fine-tuning, optimizing, and deploying a LLaMA model in a production environment using Triton Inference Server.
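To make the LoRA idea concrete, here is a minimal, framework-agnostic sketch in plain PyTorch. This is not NeMo's actual implementation; the class name `LoRALinear` and the hyperparameters `r` and `lora_alpha` are illustrative. It shows the core mechanism: the pretrained weight is frozen, and only a low-rank update is trained.

```python
# Minimal LoRA sketch in plain PyTorch -- illustrative only, not NeMo's
# actual implementation. Names (LoRALinear, r, lora_alpha) are hypothetical.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update:
    h = W x + (alpha / r) * B A x, where A is (r x d_in) and B is (d_out x r)."""
    def __init__(self, base: nn.Linear, r: int = 8, lora_alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze pretrained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init => no change at step 0
        self.scaling = lora_alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

# Usage: only the low-rank A/B matrices receive gradients.
layer = LoRALinear(nn.Linear(4096, 4096), r=8, lora_alpha=16)
y = layer(torch.randn(2, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(y.shape, trainable)  # torch.Size([2, 4096]) 65536
```

With rank 8, the trainable parameters shrink from 4096 × 4096 (~16.8M) to 2 × 8 × 4096 (~65K) per layer, which is what makes PEFT fine-tuning so cheap relative to full SFT.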
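KV caching is the other idea worth seeing in miniature. The toy single-head decode loop below is a conceptual sketch, not TensorRT-LLM's implementation (which layers paged attention and in-flight batching on top of this idea): during autoregressive generation, keys and values for past tokens are cached, so each step computes K/V only for the new token instead of re-encoding the whole prefix.

```python
# Toy single-head decoder step illustrating KV caching -- a conceptual
# sketch, not TensorRT-LLM's implementation.
import math
import torch

d = 64
Wq, Wk, Wv = torch.randn(d, d), torch.randn(d, d), torch.randn(d, d)
k_cache, v_cache = [], []  # grows by one entry per generated token

def decode_step(x_t):
    """Attend the new token against all cached keys/values; per-step cost
    scales with sequence length instead of recomputing K/V for the prefix."""
    q = x_t @ Wq
    k_cache.append(x_t @ Wk)  # compute K/V for the new token only
    v_cache.append(x_t @ Wv)
    K = torch.stack(k_cache)  # (seq_len, d)
    V = torch.stack(v_cache)
    attn = torch.softmax(q @ K.T / math.sqrt(d), dim=-1)
    return attn @ V

for _ in range(5):                  # generate 5 tokens
    out = decode_step(torch.randn(d))
print(out.shape, len(k_cache))      # torch.Size([64]) 5
```

Paged attention extends this by storing the cache in fixed-size blocks (analogous to virtual-memory pages) so memory can be allocated on demand, and in-flight batching lets new requests join a running batch as earlier ones finish.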
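Once a model is deployed behind Triton, clients query it over HTTP or gRPC. Below is a hedged sketch using Triton's Python HTTP client; the tensor names (`text_input`, `max_tokens`, `text_output`) and the model name `ensemble` follow the tensorrtllm_backend ensemble examples and may differ in your model repository.

```python
# Hedged sketch of querying a Triton-deployed LLM via the HTTP client.
# Tensor and model names are assumptions based on the tensorrtllm_backend
# ensemble examples; adjust them to match your model configuration.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

prompt = np.array([["Explain KV caching in one sentence."]], dtype=object)
max_tokens = np.array([[64]], dtype=np.int32)

inputs = [
    httpclient.InferInput("text_input", prompt.shape, "BYTES"),
    httpclient.InferInput("max_tokens", max_tokens.shape, "INT32"),
]
inputs[0].set_data_from_numpy(prompt)
inputs[1].set_data_from_numpy(max_tokens)

# "ensemble" is the conventional model name in the tensorrtllm_backend
# examples; replace it with the name in your model repository.
result = client.infer(model_name="ensemble", inputs=inputs)
print(result.as_numpy("text_output"))
```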
Prerequisite(s):
Familiarity with Python, large language models, and deep learning frameworks