Optimizing Parallelization and Overlap to Increase Training Efficiency using Megatron-Core
, Solution Architect, NVIDIA
, Foundation Model Training Director, Kuaishou Technology (pre-recorded)
We’ll first discuss how to analyze profiling results collected from thousands of GPUs. Next, we’ll show our optimization solutions for each parallelism strategy, based on Megatron-Core. We built a performance model to tune SM resources and maximize the overlap of communication and computation; both tensor parallelism and data parallelism can employ this model. For pipeline parallelism, we summarize the regions that can be overlapped. We then explore efficient CUDA kernels, especially on Hopper. Finally, we investigate dynamically adaptive parallelism and pipelining solutions for MoE.
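To make the SM-tuning idea concrete, below is a minimal, illustrative sketch of such a performance model: it picks how many SMs to reserve for communication kernels so that, when communication and computation run concurrently, the overlapped region finishes as early as possible. The constants, the linear-scaling assumption, and the function names are hypothetical placeholders for illustration, not measured Megatron-Core numbers or APIs; a real model would be calibrated from profiling data.

```python
from dataclasses import dataclass


@dataclass
class OverlapConfig:
    # All values below are hypothetical examples.
    total_sms: int = 132                    # e.g. an H100 GPU exposes 132 SMs
    compute_time_full_sm_ms: float = 10.0   # compute time if it owned every SM
    comm_time_full_sm_ms: float = 4.0       # comm time if it owned every SM


def region_time_ms(cfg: OverlapConfig, comm_sms: int) -> float:
    """Estimated duration of one overlapped region for a given SM split.

    Assumes each kernel's runtime scales inversely with its SM share,
    which is a simplification used only for this sketch.
    """
    compute_sms = cfg.total_sms - comm_sms
    compute_ms = cfg.compute_time_full_sm_ms * cfg.total_sms / compute_sms
    comm_ms = cfg.comm_time_full_sm_ms * cfg.total_sms / comm_sms
    # Running concurrently, the region takes as long as the slower of the two.
    return max(compute_ms, comm_ms)


def tune_comm_sms(cfg: OverlapConfig) -> int:
    """Pick the SM reservation for comm that minimizes the region's time."""
    return min(range(1, cfg.total_sms), key=lambda sms: region_time_ms(cfg, sms))


if __name__ == "__main__":
    cfg = OverlapConfig()
    best = tune_comm_sms(cfg)
    print(f"reserve {best} SMs for communication -> "
          f"region time {region_time_ms(cfg, best):.2f} ms")
```

The same search applies whether the overlapped communication is a tensor-parallel or a data-parallel collective: only the calibrated compute and communication times change.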
We applied all of these solutions to train our large language model with hundreds of billions of parameters. Compared with the baseline, the overlapped communication time increased 2.6x; the remaining communication lies on the critical path and can’t be overlapped. End-to-end performance improved by more than 25%. These analysis and optimization techniques can be widely applied to various models and training scales.