Setting HPC and Deep-learning Records in the Cloud with Azure
Solutions Architect and Data Scientist, NVIDIA
HPC/AI Benchmarking Team, Principal PM Manager, Azure Compute, Microsoft
Learn how Microsoft Azure, using NVIDIA A100-based virtual machine (VM) instances, has positioned itself as one of the top cloud providers for machine learning, deep learning, and high-performance computing (HPC). Azure has previously shown strong scaling on AI workloads such as BERT and on HPC workloads such as HPL. We'll demonstrate how Azure took this a step further, using the newly announced NDm_A100_v4 VM instances (each powered by eight NVIDIA A100 80 GB GPUs and InfiniBand HCAs) to create a competitive submission for MLPerf Training v1.1 and to scale HPL to hundreds of nodes, achieving over 30 petaflops. A short demonstration will show how we used Azure CycleCloud to provision a Slurm cluster in the cloud capable of near on-premises performance for workloads ranging from machine learning and deep learning to HPC. We'll also highlight some of the results from MLPerf Training v1.1 and the Top500 submission.