Merlin HugeCTR: Distributed Hierarchical Inference Parameter Server Using GPU Embedding Cache
Senior GPU Computing Engineer, NVIDIA
AI & Distributed Machine Learning Expert, NVIDIA
Developer Technology Engineer, NVIDIA
In advertising, recommendation, and search scenarios, the mainstream algorithms adopt a model structure that combines embedding tables with a deep neural network. For click-through-rate (CTR) tasks in particular, embedding lookup is a highly parallel, memory-intensive step that is well suited to GPUs. However, deep learning models in the online advertising industry may have terabyte-scale parameters that fit in neither the GPU memory nor the CPU main memory of a single compute node. In this talk we will cover two topics:

- Introduce a distributed hierarchical GPU-based inference parameter server, abbreviated as HPS, for massive-scale deep learning recommendation systems. We propose a hierarchical memory storage and model stream-updating mechanism that uses the GPU embedding cache, CPU memory, and SSD as a three-layer storage hierarchy, while all neural network computations are processed on GPUs.
- As the most important data structure of the HPS architecture, the GPU embedding cache is implemented on the NVIDIA GPU and accelerates the embedding table lookup process. By exploiting the locality of embeddings that have already been queried, the cache introduced in this talk offloads the majority of the embedding table lookup workload to the GPU, which reduces the latency of the CTR pipeline and improves throughput (see the sketch after this list).
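To make the three-tier lookup idea concrete, here is a minimal sketch in Python. It is not the HugeCTR implementation or its API: the class and parameter names (HierarchicalStore, gpu_cache_capacity, and so on) are invented for illustration, the "GPU" cache is an in-process LRU OrderedDict, and the CPU and SSD tiers are plain dictionaries. It only illustrates the hit/miss, promotion, and eviction flow that keeps hot embeddings in the fastest tier.

```python
# Illustrative sketch only (not HugeCTR/HPS code): a three-tier embedding
# store where an LRU cache stands in for the GPU embedding cache, one dict
# stands in for CPU main memory, and another dict stands in for the SSD tier.
from collections import OrderedDict
import numpy as np

EMB_DIM = 8  # illustrative embedding width

class HierarchicalStore:  # hypothetical name, for illustration
    def __init__(self, gpu_cache_capacity, cpu_table, ssd_table):
        self.gpu_cache = OrderedDict()  # "GPU" tier: small, fastest, LRU-evicted
        self.capacity = gpu_cache_capacity
        self.cpu_table = cpu_table      # "CPU memory" tier
        self.ssd_table = ssd_table      # "SSD" tier: holds the full table

    def lookup(self, key):
        # 1) Cache hit: refresh LRU position (exploits locality of queried keys).
        if key in self.gpu_cache:
            self.gpu_cache.move_to_end(key)
            return self.gpu_cache[key]
        # 2) Miss: fall back to CPU memory, then to the authoritative SSD copy,
        #    promoting the vector upward as it is found.
        vec = self.cpu_table.get(key)
        if vec is None:
            vec = self.ssd_table[key]
            self.cpu_table[key] = vec
        # 3) Insert into the cache, evicting the least-recently-used entry if full.
        if len(self.gpu_cache) >= self.capacity:
            self.gpu_cache.popitem(last=False)
        self.gpu_cache[key] = vec
        return vec

# Usage: the full table lives on "SSD"; hot keys migrate upward on access.
rng = np.random.default_rng(0)
ssd = {k: rng.standard_normal(EMB_DIM, dtype=np.float32) for k in range(1000)}
store = HierarchicalStore(gpu_cache_capacity=64, cpu_table={}, ssd_table=ssd)
batch = rng.zipf(1.5, size=256) % 1000  # skewed keys mimic CTR query traffic
vectors = np.stack([store.lookup(int(k)) for k in batch])
print(vectors.shape)  # (256, 8)
```

Because CTR query traffic is highly skewed, even a small top tier absorbs most lookups; in the architecture described above, that top tier is a GPU-resident embedding cache, so hit lookups never leave the device.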