Scaling ML Workloads with PyTorch FSDP on Amazon EC2 (Presented by Amazon Web Services)
Principal Solutions Architect for Compute and HPC, Amazon Web Services
Software Engineer, Meta
Software Engineer, Meta
How can users developing machine learning (ML) workloads start small and scale big on Amazon EC2? We'll review the architectures and software stacks best suited to scaling ML model training with PyTorch. AWS services such as Amazon EC2, Amazon EKS, Elastic Fabric Adapter, and AWS ParallelCluster are key ingredients of a scalable ML architecture. Learn from the success stories of customers who have scaled ML training with PyTorch across thousands of GPUs on EC2.