The PyTorch distributed team shares best practices for Large Scale Training on Google Cloud (Presented by Google Cloud)
Software Engineering, Meta
Customer Engineer - Machine Learning, Google Cloud
Distributed training at a large scale involves not only machine learning but also systems considerations to achieve optimal performance on a given cluster. We'll discuss best practices for PyTorch large language model distributed training using NVIDIA A100 GPUs on Google Cloud Platform. We'll present methods for improving NVIDIA Collective Communication Library (NCCL) performance, selecting distribution strategies (data and model parallel) based on model and batch sizes, and guidance on combining these with techniques such as activation and parameter offloading.
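As an illustration of the NCCL-tuning theme, a few knobs are commonly set through environment variables before the process group is created. This is a minimal sketch, not the presenters' recommended configuration: the variable names are real NCCL settings, but the values shown (interface name, socket counts) are assumptions that depend on the cluster's NIC layout and should be validated with a benchmark such as nccl-tests.

```python
import os

import torch.distributed as dist

# Illustrative NCCL settings; the right values are cluster-dependent.
os.environ.setdefault("NCCL_DEBUG", "INFO")          # surface topology/transport logs
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # pin NCCL to the host NIC (assumed name)
os.environ.setdefault("NCCL_NSOCKS_PERTHREAD", "4")  # sockets per helper thread (assumed value)
os.environ.setdefault("NCCL_SOCKET_NTHREADS", "2")   # helper threads per connection (assumed value)

# NCCL reads these at process-group creation, so they must be set first.
# Assumes the job was launched with torchrun so RANK/WORLD_SIZE are present.
dist.init_process_group(backend="nccl")
```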
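The strategy-selection point can be framed as a memory heuristic. The sketch below uses PyTorch FSDP's real ShardingStrategy options; the pick_strategy helper and its thresholds are hypothetical illustrations (they ignore activations and mixed-precision master weights), not the presenters' actual decision rule.

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

def pick_strategy(params_bytes: int, gpu_mem_bytes: int) -> ShardingStrategy:
    """Hypothetical heuristic: small models replicate like DDP (NO_SHARD),
    mid-size models shard gradients and optimizer state (SHARD_GRAD_OP,
    ZeRO-2-like), large models shard parameters too (FULL_SHARD, ZeRO-3-like)."""
    if params_bytes * 4 < gpu_mem_bytes:   # weights + grads + Adam moments fit (rough)
        return ShardingStrategy.NO_SHARD
    if params_bytes * 2 < gpu_mem_bytes:   # weights + grads fit; shard the rest
        return ShardingStrategy.SHARD_GRAD_OP
    return ShardingStrategy.FULL_SHARD

# Assumes dist.init_process_group(backend="nccl") has already run.
model = torch.nn.Linear(4096, 4096).cuda()  # stand-in for a transformer block
params_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
strategy = pick_strategy(params_bytes, torch.cuda.get_device_properties(0).total_memory)
sharded = FSDP(model, sharding_strategy=strategy)
```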
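Finally, a sketch of how the two offloading techniques can combine: FSDP's CPUOffload keeps sharded parameters in host memory between uses, while torch.utils.checkpoint recomputes activations during backward instead of storing them. The Block module and its dimensions are hypothetical stand-ins for a transformer layer.

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, CPUOffload
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """Hypothetical feed-forward block standing in for a transformer layer."""
    def __init__(self, dim: int = 4096):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(dim, dim), torch.nn.GELU(), torch.nn.Linear(dim, dim)
        )

    def forward(self, x):
        # Activation offloading via checkpointing: discard intermediate
        # activations in forward and recompute them in backward,
        # trading extra FLOPs for memory headroom.
        return checkpoint(self.ff, x, use_reentrant=False)

# Parameter offloading: FSDP keeps sharded parameters in host RAM and
# moves them to the GPU only around their compute.
# Assumes a NCCL process group is already initialized.
model = FSDP(Block().cuda(), cpu_offload=CPUOffload(offload_params=True))
```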