FP8 is a natural progression for accelerating deep learning (DL) training beyond the 16-bit formats common in modern processors. DL applications use two 8-bit floating-point (FP8) binary interchange formats, both supported by the Hopper and Ada GPU architectures: E4M3 and E5M2. These formats double math throughput and halve bandwidth pressure; however, they require some care due to their narrower range and lower precision compared to the 16-bit formats. We'll cover three aspects of FP8 for deep learning:
- FP8 format details and their support in hardware and software
- Numerical aspects of FP8 mixed precision for neural network training and inference
- A review of FP8 training results over a wide range of neural network tasks and architectures
We'll conclude with some guidelines for training performance, and you'll leave with a solid grasp of FP8 DL training and inference fundamentals.
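As a quick hands-on illustration of the range and precision trade-off between the two formats, here is a minimal sketch assuming PyTorch 2.1 or later, which exposes torch.float8_e4m3fn and torch.float8_e5m2 as storage dtypes. The specific values are chosen only to make the rounding and range effects visible; this is not a training recipe.

```python
# Minimal sketch of the FP8 range/precision trade-off.
# Assumes PyTorch >= 2.1, which provides torch.float8_e4m3fn and torch.float8_e5m2.
import torch

x = torch.tensor([0.1, 1.0, 448.0, 512.0, 57344.0], dtype=torch.float32)

# E4M3: 3 mantissa bits (finer rounding), but max normal value is 448.
e4m3 = x.to(torch.float8_e4m3fn).to(torch.float32)

# E5M2: 5 exponent bits (max normal 57344), but only 2 mantissa bits.
e5m2 = x.to(torch.float8_e5m2).to(torch.float32)

print("fp32 :", x.tolist())
print("e4m3 :", e4m3.tolist())  # 512 and 57344 exceed E4M3's range; overflow handling
                                # (saturate vs. NaN) depends on the cast implementation
print("e5m2 :", e5m2.tolist())  # coarser rounding of small values, wider representable range
```

Casting back to float32 for printing makes the rounding visible: E4M3 represents small values like 0.1 more precisely, while E5M2 can represent much larger magnitudes, which is why the two formats are typically assigned to different tensors in mixed-precision training.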