Optimizing Inference Model Serving for Highest Performance at eBay
Senior Engineering Manager, eBay
GPUs are widely used across eBay to power deep learning model training and inference. High-performance model inference is critical to business success in many cases: GPU-hosted models must serve high-volume site traffic under tight latency budgets. To address these challenges, we built the unified inference platform (UIP) to support domain teams' model serving needs. UIP has two major design goals: (1) accelerate model development velocity, and (2) maximize GPU utilization to meet customer latency and throughput requirements. For the first goal, we built an automated continuous integration and continuous deployment (CI/CD) pipeline with Tekton, together with a model management system that tracks model versions and data lineage. For the second goal, we adopted Triton as the primary GPU model serving runtime and built platform capabilities on top of it to optimize GPU utilization, such as self-service model performance tuning and Triton-based online orchestration. We also optimized major models inside eBay with the DALI framework to accelerate preprocessing. These optimizations significantly improved latency and throughput, saving substantial GPU resources.
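To make the CI/CD side concrete, the sketch below outlines what a Tekton pipeline for model deployment could look like: validate a candidate model, package it into a serving image, then roll it out. This is a minimal illustration using standard Tekton Pipeline resources; the pipeline name and the referenced Tasks are hypothetical placeholders, not eBay's actual setup.

apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: model-release            # hypothetical pipeline name
spec:
  tasks:
    # Run offline checks against the candidate model artifact.
    - name: validate-model
      taskRef:
        name: run-model-tests    # hypothetical Task
    # Package the model and runtime into a deployable serving image.
    - name: build-serving-image
      runAfter: ["validate-model"]
      taskRef:
        name: build-image        # hypothetical Task
    # Roll the new version out to the serving fleet.
    - name: deploy-model
      runAfter: ["build-serving-image"]
      taskRef:
        name: deploy-to-serving  # hypothetical Task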
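On the GPU utilization side, two of Triton's main levers are dynamic batching and concurrent model instances. The sketch below shows a hypothetical Triton model configuration (config.pbtxt) that enables both; the model name, batch sizes, and queue delay are illustrative assumptions rather than eBay's production settings.

# Hypothetical Triton config.pbtxt; all values are illustrative.
name: "embedding_model"
platform: "tensorrt_plan"
max_batch_size: 64

# Batch individual requests server-side to raise GPU occupancy,
# capping the added queueing delay.
dynamic_batching {
  preferred_batch_size: [ 16, 32, 64 ]
  max_queue_delay_microseconds: 500
}

# Run two copies of the model concurrently on each GPU so compute
# can overlap with data transfer.
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]

Dynamic batching trades a small, bounded queueing delay for much higher GPU occupancy, which is typically how throughput targets are met without adding GPUs.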
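Preprocessing acceleration with DALI follows a similar pattern: move image decoding and transforms from the CPU onto the GPU so preprocessing no longer bottlenecks inference. The Python sketch below is a minimal DALI image pipeline assuming an image-model workload; the file path, image size, and normalization constants are assumptions for illustration.

# Minimal DALI preprocessing pipeline (sketch); paths, sizes, and
# normalization constants are illustrative assumptions.
from nvidia.dali import pipeline_def, fn, types

@pipeline_def(batch_size=64, num_threads=4, device_id=0)
def image_preprocess():
    # Read raw JPEG bytes from disk (hypothetical directory).
    jpegs, labels = fn.readers.file(file_root="/data/listing_images")
    # "mixed" decodes on the GPU via nvJPEG, offloading the CPU.
    images = fn.decoders.image(jpegs, device="mixed")
    # Resize and normalize entirely on the GPU.
    images = fn.resize(images, resize_x=224, resize_y=224)
    images = fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT,
        output_layout="CHW",
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
    )
    return images, labels

pipe = image_preprocess()
pipe.build()
images, labels = pipe.run()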