Llama3 LLM inference: token-to-token latency (TTL) = 50 milliseconds (ms) real time, first token latency (FTL) = 2 seconds (s), input sequence length = 2,048, output sequence length = 128, 8x NVIDIA HGX™ H100 air-cooled vs. GB200 NVL2 air-cooled single node, per-GPU performance comparison
Vector database search performance within RAG pipeline using memory shared by NVIDIA Grace CPU and Blackwell GPU. 1x x86, 1x H100 GPU, and 1x GPU from GB200 NVL2 node.
Data processing: a database join and aggregation workload with Snappy/Deflate compression, derived from the TPC-H Q4 query. Custom query implementations for x86, a single H100 GPU, and a single GPU from a GB200 NVL2 node; GB200 vs. Intel Xeon 8480+
Projected performance subject to change.