Efficient Inference of Extremely Large Transformer Models
Member of Technical Staff, Cohere
Machine Learning Team Lead, Cohere
Transformer-based language models have grown explosively in size, since their performance scales remarkably well with parameter count. Making inference on these models efficient has therefore become increasingly critical, and increasingly challenging. We'll show how these behemoth multi-billion-parameter models are optimized for production and how an inference tech stack is built around them. We'll cover the key ingredients in making these models faster, smaller, and more cost-effective, including model compression, efficient attention, and optimal model parallelism on GPUs through NVIDIA's FasterTransformer and the Triton Inference Server ecosystem.
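As a taste of the first ingredient, model compression, here is a minimal sketch of post-training int8 dynamic quantization in PyTorch. The toy layer sizes are hypothetical, and production stacks such as FasterTransformer rely on custom fused int8 kernels rather than this API; the sketch only illustrates the idea of trading weight precision for size.

```python
import torch

# A toy stand-in for one transformer feed-forward block (hypothetical sizes);
# the models discussed in the session are multi-billion-parameter.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

# Post-training dynamic quantization: weights are stored in int8 and
# activations are quantized on the fly, shrinking the weights roughly 4x
# relative to fp32.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 1024)
print(quantized(x).shape)  # torch.Size([1, 1024])
```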
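For efficient attention, a sketch of the fused-kernel path exposed by PyTorch 2.x; the tensor shapes here are hypothetical, and FasterTransformer ships its own fused attention kernels rather than this call.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: (batch, heads, sequence, head_dim).
q, k, v = (torch.randn(1, 16, 2048, 64) for _ in range(3))

# PyTorch 2.x dispatches this to a fused memory-efficient or FlashAttention
# kernel when one is available, avoiding materializing the full
# (sequence x sequence) attention matrix.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 16, 2048, 64])
```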
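And for model parallelism, a single-device sketch of the column-parallel weight split behind tensor parallelism. The weight shape and shard count are hypothetical; in a real deployment each shard lives on its own GPU, the matmuls run concurrently, and the outputs are gathered with NCCL collectives.

```python
import torch

def column_parallel_linear(x, weight_shards):
    # Each shard holds a column slice of the full weight matrix, so the
    # per-shard outputs concatenate back into the full layer output.
    return torch.cat([x @ w for w in weight_shards], dim=-1)

# Split a (hypothetical) 1024x4096 weight into 4 column shards of 1024x1024,
# simulating a 4-way tensor-parallel split on one device.
full_weight = torch.randn(1024, 4096)
shards = list(full_weight.chunk(4, dim=1))

x = torch.randn(2, 1024)
assert torch.allclose(column_parallel_linear(x, shards), x @ full_weight, atol=1e-4)
```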