Training Optimization for LLMs with NVIDIA NeMo and AWS
, Senior Applied Scientist, Amazon Store Foundational AI
, Senior Manager, Amazon Store Foundational AI
Training a large language model at scale while ensuring efficiency and reliability poses numerous challenges. In this presentation, we'll share our experience training LLMs at Amazon Search using the NVIDIA NeMo Framework in collaboration with AWS. We'll discuss selecting the appropriate training framework, establishing the training infrastructure by harnessing the power of NeMo and AWS, and implementing zero-touch training through automated job monitoring and recovery mechanisms. We'll also share practical insights into tuning hyperparameters and selecting model architectures to optimize training efficiency. Finally, we'll examine potential paths to further streamline the training process for large language models.