Optimizing and Scaling LLMs With TensorRT-LLM for Text Generation
Senior Solution Architect, NVIDIA
Machine Learning Engineer, Grammarly
Machine Learning Engineer, Grammarly
The landscape of large language models (LLMs) is evolving quickly. As parameter counts and model sizes grow, optimizing and deploying LLMs for inference becomes increasingly complex. This calls for a framework with a well-designed, easily extensible API that frees developers from low-level memory management and hand-written CUDA calls. Learn how we used NVIDIA's suite of solutions to optimize LLMs and deploy them in multi-GPU environments.
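As a minimal sketch of that workflow, the snippet below uses TensorRT-LLM's high-level Python `LLM` API to build an engine and run text generation sharded across GPUs. The model name, the `tensor_parallel_size=2` setting, and the sampling parameters are illustrative assumptions, not the configuration discussed in the session.

```python
# A minimal sketch, assuming TensorRT-LLM is installed and the listed
# checkpoint is accessible; all values below are illustrative.
from tensorrt_llm import LLM, SamplingParams

# Build (or load a cached) TensorRT engine for the model.
# tensor_parallel_size shards the weights across GPUs for multi-GPU inference.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # hypothetical example checkpoint
    tensor_parallel_size=2,
)

prompts = ["The future of LLM inference is"]
params = SamplingParams(max_tokens=64, temperature=0.8)

# Generate completions; each result carries the decoded text.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```

The high-level API handles engine building, KV-cache memory management, and kernel selection internally, which is the kind of abstraction the abstract argues for over manual CUDA-level work.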