Inference of Large Language Models with NVIDIA Triton Inference Server (Presented by CoreWeave)
Product Marketing Manager, NVIDIA
CTO, CoreWeave
Leveraging the right infrastructure for serving inference can provide faster spin-up times and responsive auto-scaling, which are critical to user satisfaction and the ultimate success of your model. By using NVIDIA Triton Inference Server with the FasterTransformer backend, you can see up to 40% faster GPT-J inference than an implementation based on vanilla Hugging Face Transformers. We'll walk through benchmarks of this stack on EleutherAI's GPT-J and GPT-NeoX running on CoreWeave Cloud, and discuss how cloud computing can expand access to the GPUs and servers you need to serve inference more efficiently.
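For a sense of what serving a model this way looks like from the client side, here is a minimal sketch of querying a Triton-hosted GPT model over HTTP with the tritonclient library. The server address, the model name "gptj", and the tensor names ("input_ids", "input_lengths", "request_output_len", "output_ids") are assumptions for illustration; the actual names and shapes depend on your deployment's config.pbtxt, so verify them against your FasterTransformer model configuration.

```python
# Minimal sketch: client-side inference request to a Triton server.
# Assumptions (not from the session description): server at localhost:8000,
# model registered as "gptj", tensor names matching a typical
# FasterTransformer backend config -- check your config.pbtxt.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Token IDs for one already-tokenized prompt (batch size 1).
input_ids = np.array([[818, 257, 995, 810]], dtype=np.uint32)
input_lengths = np.array([[input_ids.shape[1]]], dtype=np.uint32)
request_output_len = np.array([[64]], dtype=np.uint32)  # tokens to generate

inputs = []
for name, arr in [("input_ids", input_ids),
                  ("input_lengths", input_lengths),
                  ("request_output_len", request_output_len)]:
    tensor = httpclient.InferInput(name, arr.shape, "UINT32")
    tensor.set_data_from_numpy(arr)
    inputs.append(tensor)

result = client.infer(model_name="gptj", inputs=inputs)
# Generated token IDs; detokenize with your model's tokenizer.
print(result.as_numpy("output_ids"))
```

Because Triton handles batching, scheduling, and model management on the server, the client stays the same whether the backend is FasterTransformer or another framework, which is part of what makes swapping in the faster backend straightforward.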