Building Streaming End-to-end Speech Recognition Service with Triton Inference Server
, Manager of Speech Team, LINE Corp.
, Senior Solution Architect, NVIDIA
LINE has developed a fast and accurate speech recognition service using a state-of-the-art streaming end-to-end (E2E) model. The service must return recognition results with low latency for every speech request posted by many concurrent clients. However, it is hard to satisfy that requirement with an E2E model that has a huge number of parameters. To solve this, we used NVIDIA Triton Inference Server, a framework for building fast inference servers for machine learning models on GPUs. We'll give an overview of NVIDIA Triton Inference Server and explain how we applied it to our problem.
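As background for the session: streaming models like this are commonly served with Triton's sequence batcher, which routes all audio chunks belonging to one utterance to the same model instance and state. A minimal sketch of such a model configuration is shown below; the model name, backend, tensor names, and shapes are illustrative assumptions, not LINE's actual configuration.

```
# config.pbtxt -- hypothetical streaming ASR model configuration
name: "streaming_asr"        # illustrative model name
backend: "python"            # assumption; any Triton backend could be used
max_batch_size: 64

# Sequence batching keeps every chunk of one utterance on the same instance,
# so the model can carry decoder/encoder state across requests.
sequence_batching {
  max_sequence_idle_microseconds: 5000000
  oldest { max_candidate_sequences: 64 }
  control_input [
    {
      name: "START"
      control [ { kind: CONTROL_SEQUENCE_START, fp32_false_true: [ 0, 1 ] } ]
    },
    {
      name: "END"
      control [ { kind: CONTROL_SEQUENCE_END, fp32_false_true: [ 0, 1 ] } ]
    }
  ]
}

input [
  { name: "AUDIO_CHUNK", data_type: TYPE_FP32, dims: [ -1 ] }  # variable-length audio
]
output [
  { name: "TRANSCRIPT", data_type: TYPE_STRING, dims: [ 1 ] }  # partial/final text
]
```

The `START`/`END` control tensors let the backend know when a client opens or closes an utterance stream, which is how a streaming recognizer distinguishes partial chunks from the final one.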