A Developer’s Guide to Improving GPU Utilization and Reducing Deep Learning Costs (Presented by Amazon Web Services, Inc.)
Improving GPU and system resource utilization can dramatically improve deep learning training performance and reduce training costs. However, data scientists find it challenging to monitor system resource utilization, identify bottlenecks, and correlate them with their training scripts and models. In many cases, data scientists may not even be aware that their resource utilization is suboptimal, whether due to overprovisioned resources or to bottlenecks, and this can substantially drive up training costs. I'll show how you can use Amazon SageMaker Debugger to collect metrics, without any code changes, when training with frameworks such as TensorFlow and PyTorch, and to monitor CPU, GPU, network, and memory utilization in real time during training. You'll get all the information you need to identify bottlenecks, maximize resource utilization, improve training performance, and reduce costs.
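As a rough illustration of the "no code changes" claim: SageMaker Debugger profiling is enabled on the estimator that launches the job, not inside the training script. The sketch below uses the SageMaker Python SDK's `ProfilerConfig` and `FrameworkProfile`; the entry point, IAM role, instance type, and framework versions are placeholders, not values from this talk.

```python
# Hedged sketch: turning on SageMaker Debugger system/framework profiling
# for a PyTorch training job. train.py and the role ARN are placeholders.
from sagemaker.debugger import ProfilerConfig, FrameworkProfile
from sagemaker.pytorch import PyTorch

profiler_config = ProfilerConfig(
    # Sample CPU, GPU, network, and memory utilization every 500 ms.
    system_monitor_interval_millis=500,
    # Also collect framework-level metrics (data loading, ops, steps).
    framework_profile_params=FrameworkProfile(),
)

estimator = PyTorch(
    entry_point="train.py",           # your existing script, unchanged
    role="<your-sagemaker-role-arn>", # placeholder IAM role
    instance_type="ml.p3.2xlarge",    # illustrative GPU instance
    instance_count=1,
    framework_version="1.8",          # illustrative versions
    py_version="py36",
    profiler_config=profiler_config,
)

# estimator.fit() would start training; Debugger collects utilization
# metrics alongside the job, viewable in SageMaker Studio or via the SDK.
```

Because profiling is attached to the job configuration, the same training script runs with or without it, and the collected metrics can then be correlated with training steps to locate bottlenecks.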