Achieving K8S and Public Cloud Operational Efficiency using a New Checkpoint/Restart Feature for GPUs
Vice President Strategic Partnerships, MemVerge
CTO, MemVerge
Enhancements to the CUDA 12.x driver will enable the open-source CRIU (Checkpoint/Restore In Userspace) project to checkpoint and restart a GPU-based compute node. We'll provide a technical overview and demonstrate this new capability. This transparent checkpoint and hot-restart feature can, in turn, facilitate node maintenance, node rightsizing, and workload migration and bursting. The result is greater operational efficiency with minimal interruption to production workloads.