Perplexity harnesses NVIDIA hardware and software to solve this challenge. Serving results faster than most people can read, pplx-api achieves up to 3.1X lower latency and up to 4.3X lower first-token latency than other deployment platforms. Perplexity also cut its costs by a factor of four simply by switching its external inference-serving API references to call pplx-api, saving $600,000 per year.
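Because pplx-api exposes an OpenAI-compatible REST interface, first-token latency can be observed directly from the client side. The sketch below is illustrative only: it assumes the documented https://api.perplexity.ai base URL, uses a placeholder model id, and times the gap between sending a streaming request and receiving the first token.

```python
import os
import time

from openai import OpenAI  # pplx-api accepts the standard OpenAI client

client = OpenAI(
    api_key=os.environ["PERPLEXITY_API_KEY"],
    base_url="https://api.perplexity.ai",
)

start = time.perf_counter()
stream = client.chat.completions.create(
    model="example-model",  # placeholder model id; substitute a real one
    messages=[{"role": "user", "content": "Explain FP8 inference in one sentence."}],
    stream=True,
)

# The arrival of the first streamed chunk marks the first-token latency.
first_chunk = next(iter(stream))
print(f"first token after {time.perf_counter() - start:.3f}s")
```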
Perplexity achieves this by deploying pplx-api on Amazon P4d instances. At the hardware level, the underlying NVIDIA A100 GPUs are a cost-effective, reliable option for scaling out GPUs with incredible performance. Perplexity has also shown that by leveraging NVIDIA H100 GPUs and FP8 precision on Amazon P5 instances, it can cut latency in half and boost throughput by 200 percent compared to NVIDIA A100 GPUs in the same configuration.
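To give a concrete sense of what that precision switch involves, here is a minimal sketch using TensorRT-LLM's high-level LLM API. The model id is a placeholder, and the class names (QuantConfig, QuantAlgo) are assumptions based on recent TensorRT-LLM releases that may differ across versions.

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import QuantAlgo, QuantConfig

# FP8 post-training quantization; this requires Hopper-class GPUs such as
# the H100s in Amazon P5 instances (A100s lack FP8 Tensor Cores).
quant_config = QuantConfig(quant_algo=QuantAlgo.FP8)

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model id
    quant_config=quant_config,
)

outputs = llm.generate(
    ["Why does FP8 reduce inference latency on H100?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

Dropping from 16-bit to 8-bit floating point halves the memory traffic per weight, and since the token-generation phase is largely memory-bandwidth bound, this is the main reason the same model can respond in roughly half the time.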
Optimizing the software that runs on the GPU further maximizes performance. NVIDIA TensorRT-LLM, an open-source library that accelerates and optimizes LLM inference, provides optimized implementations such as FlashAttention and masked multi-head attention (MHA) for the context and generation phases of LLM execution. It also exposes a flexible layer of customization for key parameters such as batch size, quantization, and tensor parallelism. TensorRT-LLM is included as part of NVIDIA AI Enterprise, which provides a production-grade, secure, end-to-end software platform for enterprises building and deploying accelerated AI software.
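The tuning parameters called out above map to explicit arguments in the same API. The sketch below is again illustrative; the BuildConfig field names are assumptions based on recent TensorRT-LLM releases.

```python
from tensorrt_llm import LLM, BuildConfig, SamplingParams

# Engine-build knobs: how many requests are batched together and how long
# prompts and generations may be.
build_config = BuildConfig(
    max_batch_size=64,
    max_input_len=4096,
    max_seq_len=8192,
)

# tensor_parallel_size shards every layer's weights across GPUs, e.g. the
# eight A100s in a P4d instance or the eight H100s in a P5 instance.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model id
    build_config=build_config,
    tensor_parallel_size=8,
)

result = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(result[0].outputs[0].text)
```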
Finally, to tackle scalability, Perplexity uses AWS's robust integration with Kubernetes to scale elastically beyond hundreds of GPUs while minimizing downtime and network overhead.
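A minimal sketch of that elasticity, written against the official Kubernetes Python client, is shown below; the Deployment name and namespace are hypothetical, and real deployments would typically delegate this decision to a horizontal autoscaler rather than calling the API by hand.

```python
from kubernetes import client, config


def scale_inference_pods(replicas: int) -> None:
    """Scale a GPU-backed inference Deployment to `replicas` pods."""
    config.load_kube_config()  # use load_incluster_config() inside the cluster
    apps = client.AppsV1Api()
    apps.patch_namespaced_deployment_scale(
        name="pplx-api",        # hypothetical Deployment name
        namespace="inference",  # hypothetical namespace
        body={"spec": {"replicas": replicas}},
    )


# Example: scale out ahead of peak traffic, then back down afterward.
scale_inference_pods(100)
```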
NVIDIA’s full-stack AI inference approach plays a crucial role in meeting the stringent demands of real-time applications. From NVIDIA H100 and A100 GPUs to the optimizations of NVIDIA TensorRT-LLM, the underlying infrastructure powering Perplexity’s pplx-api unlocks both performance gains and cost savings for developers.
Explore more about Perplexity on AWS on Air, where the team discusses its product in depth.