Inference
Deploy, run, and scale AI for any application on any platform.
Run inference on trained machine learning or deep learning models from any framework on any processor—GPU, CPU, or other—with NVIDIA Triton™ Inference Server. Part of the NVIDIA AI platform and available with NVIDIA AI Enterprise, Triton Inference Server is open-source software that standardizes AI model deployment and execution across every workload.
Deploy AI models from any major framework with Triton Inference Server, including TensorFlow, PyTorch, Python, ONNX, NVIDIA® TensorRT™, RAPIDS™ cuML, XGBoost, scikit-learn RandomForest, OpenVINO, custom C++, and more.
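Whichever framework the model comes from, clients reach it through the same HTTP or gRPC inference API. The following is a minimal sketch using the tritonclient Python package; the model name ("resnet50_onnx"), tensor names ("input", "output"), and default HTTP port 8000 are placeholder assumptions, not values from this page.

```python
# Minimal sketch: send an inference request to a model already loaded in Triton
# via the tritonclient HTTP API. Model and tensor names below are placeholders.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build a single-image batch and describe it as the model's input tensor.
batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
inputs = [httpclient.InferInput("input", list(batch.shape), "FP32")]
inputs[0].set_data_from_numpy(batch)
outputs = [httpclient.InferRequestedOutput("output")]

result = client.infer(model_name="resnet50_onnx", inputs=inputs, outputs=outputs)
print(result.as_numpy("output").shape)
```

The same pattern works over gRPC by swapping in the tritonclient.grpc module.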
Maximize throughput and utilization with dynamic batching, concurrent execution, optimal configuration, and streaming audio and video. Triton Inference Server supports all NVIDIA GPUs, x86 and Arm CPUs, and AWS Inferentia.
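Dynamic batching and concurrent execution are enabled per model in its configuration file. Below is a hedged sketch that writes a config.pbtxt for a hypothetical ONNX image classifier, turning on dynamic batching and two model instances per GPU; the model name, backend, and tensor shapes are illustrative assumptions.

```python
# Sketch only: generate a Triton model configuration (config.pbtxt) that enables
# dynamic batching and two concurrent model instances per GPU. The model name,
# backend, and tensor shapes are placeholder assumptions.
from pathlib import Path

CONFIG = """
name: "resnet50_onnx"
backend: "onnxruntime"
max_batch_size: 32
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
dynamic_batching {
  max_queue_delay_microseconds: 100
}
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
"""

model_dir = Path("model_repository/resnet50_onnx")
(model_dir / "1").mkdir(parents=True, exist_ok=True)  # version directory holds model.onnx
(model_dir / "config.pbtxt").write_text(CONFIG.strip() + "\n")
```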
Integrate Triton Inference Server into DevOps and MLOps solutions such as Kubernetes for scaling and Prometheus for monitoring. It can also be used in all major cloud and on-premises AI and MLOps platforms.
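The monitoring side of that integration works because Triton exposes Prometheus-format metrics over HTTP. A minimal sketch, assuming a locally running server with metrics enabled on the default port 8002:

```python
# Minimal sketch: read Triton's Prometheus metrics endpoint directly.
# Assumes a local Triton instance with metrics on the default port 8002.
import requests

metrics_text = requests.get("http://localhost:8002/metrics").text

# Print the per-model inference success counters (a standard Triton metric family).
for line in metrics_text.splitlines():
    if line.startswith("nv_inference_request_success"):
        print(line)
```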
NVIDIA AI Enterprise, including NVIDIA Triton Inference Server, is a secure, production-ready AI software platform designed to accelerate time to value with support, security, and API stability.
Triton offers low latency and high throughput for large language model (LLM) inference. It supports TensorRT-LLM, an open-source library for defining, optimizing, and executing LLMs for inference in production.
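For a quick check against an LLM served this way, Triton's HTTP generate endpoint can be called directly. This is a hedged sketch: the model name ("ensemble") and the field names (text_input, max_tokens, text_output) follow a common TensorRT-LLM backend setup and depend on how the model was actually deployed.

```python
# Hedged sketch: query an LLM served by Triton through the generate endpoint.
# The model name and request/response field names are assumptions that depend
# on the deployed TensorRT-LLM model configuration.
import requests

response = requests.post(
    "http://localhost:8000/v2/models/ensemble/generate",
    json={"text_input": "What does Triton Inference Server do?", "max_tokens": 64},
)
response.raise_for_status()
print(response.json()["text_output"])
```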
Triton model ensembles let you execute AI workloads with multiple models, pipelines, and pre- and postprocessing steps. Different parts of an ensemble can run on CPU or GPU, and the ensemble can mix multiple frameworks.
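An ensemble is declared in a model configuration that maps tensors between the steps of the pipeline. The sketch below chains a hypothetical Python preprocessing model into an ONNX classifier; every model and tensor name is a placeholder assumption.

```python
# Sketch only: an ensemble configuration (config.pbtxt) that chains a Python
# preprocessing model into an ONNX classifier. All names are placeholders.
ENSEMBLE_CONFIG = """
name: "image_pipeline"
platform: "ensemble"
max_batch_size: 8
input [
  {
    name: "raw_image"
    data_type: TYPE_UINT8
    dims: [ -1 ]
  }
]
output [
  {
    name: "probabilities"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "preprocess"
      model_version: -1
      input_map { key: "raw_image" value: "raw_image" }
      output_map { key: "preprocessed" value: "image_tensor" }
    },
    {
      model_name: "resnet50_onnx"
      model_version: -1
      input_map { key: "input" value: "image_tensor" }
      output_map { key: "output" value: "probabilities" }
    }
  ]
}
"""
```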
PyTriton lets Python developers bring up Triton with a single line of code and use it to serve models, simple processing functions, or entire inference pipelines to accelerate prototyping and testing.
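A minimal PyTriton sketch, assuming the pytriton package is installed; the model name, tensor names, and the toy inference function are illustrative only.

```python
# Minimal PyTriton sketch: serve a trivial Python function through Triton.
# The model name, tensor names, and the doubling "inference" are placeholders.
import numpy as np
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton

@batch
def infer_fn(data):
    # Whatever Python code you want to serve goes here.
    return {"result": data * 2.0}

with Triton() as triton:
    triton.bind(
        model_name="Doubler",
        infer_func=infer_fn,
        inputs=[Tensor(name="data", dtype=np.float32, shape=(-1,))],
        outputs=[Tensor(name="result", dtype=np.float32, shape=(-1,))],
        config=ModelConfig(max_batch_size=64),
    )
    triton.serve()  # blocks; the function is now reachable via Triton's HTTP/gRPC endpoints
```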
Model Analyzer reduces the time needed to find the optimal model deployment configuration, such as batch size, precision, and number of concurrent execution instances, to meet application latency, throughput, and memory requirements.
For individuals looking to access Triton's open-source code and containers for development, there are two options to get started for free.
For enterprises looking to try Triton before purchasing NVIDIA AI Enterprise for production, there are two options to get started for free.
Use the right tools to deploy, run, and scale AI for any application on any platform, or explore more development resources.
Talk to an NVIDIA product specialist about moving from pilot to production with the security, API stability, and support of NVIDIA AI Enterprise.