FlashAttention: Fast and Memory-Efficient Exact Attention With IO-Awareness
Tri Dao, Chief Scientist, Together.AI
Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. A missing principle is making attention algorithms IO-aware — accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high-bandwidth memory (HBM) and GPU on-chip SRAM. FlashAttention trains transformers faster than existing baselines, with a 2-4x speedup on the attention kernel. FlashAttention also enables longer context in transformers (4-16x longer than previously possible), yielding higher-quality models. We'll also describe recent improvements to FlashAttention: making use of new hardware features on A100 and H100 GPUs (another 2x speedup), as well as optimizations for long-context LLM inference (2-4x faster end-to-end inference time).
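To illustrate the tiling idea described above, here is a minimal NumPy sketch of exact attention computed block by block with the online-softmax rescaling trick, so the full N x N score matrix is never materialized. This is an assumption-laden illustration only (function name, block size, and single-loop-over-K/V structure are ours), not the fused CUDA kernel that FlashAttention actually runs in on-chip SRAM.

```python
# Illustrative sketch of tiled exact attention (not the actual FlashAttention kernel).
import numpy as np

def tiled_attention(Q, K, V, block_size=64):
    """Exact softmax attention computed over K/V tiles, using online-softmax
    rescaling so only one tile of scores is held at a time."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))           # running (unnormalized) output
    m = np.full(N, -np.inf)        # running row-wise max of scores
    l = np.zeros(N)                # running softmax denominator

    for start in range(0, N, block_size):
        Kb = K[start:start + block_size]          # load one K/V tile
        Vb = V[start:start + block_size]
        S = (Q @ Kb.T) * scale                    # scores against this tile only
        m_new = np.maximum(m, S.max(axis=1))      # updated row-wise max
        P = np.exp(S - m_new[:, None])            # tile probabilities (unnormalized)
        correction = np.exp(m - m_new)            # rescale earlier partial results
        l = l * correction + P.sum(axis=1)
        O = O * correction[:, None] + P @ Vb
        m = m_new

    return O / l[:, None]                         # final normalization

# Sanity check against the naive quadratic-memory reference.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
S = (Q @ K.T) / np.sqrt(64)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), ref, atol=1e-6)
```

The real algorithm additionally tiles over the query dimension and fuses all of these steps into one GPU kernel so intermediate tiles stay in SRAM rather than HBM, which is where the IO savings come from.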
NVIDIA technology: Cloud / Data Center GPU, CUDA, Hopper