Generative AI-Powered Visual AI Agents

Discover a collection of reference workflows that use Vision Language Models to deliver rich, interactive visual perception capabilities to a range of industries.

Workloads

Computer Vision / Video Analytics

Industries

Retail / Consumer Packaged Goods
Manufacturing
Smart Cities/Spaces
Healthcare and Life Sciences

Business Goal

Return on Investment
Innovation

Products

NVIDIA Metropolis
NVIDIA AI Enterprise

Power A New Wave Of Applications

Traditional video analytics applications and their development workflows are typically built on fixed-function, limited models designed to detect and identify only a select set of predefined objects. With generative AI and foundation models, you can now build applications that use fewer models yet offer far broader perception and richer contextual understanding. This newer generation of Vision Language Models (VLMs) is giving rise to smart, powerful visual AI agents.

What Is A Visual AI Agent?

A visual AI agent combines vision and language modalities to understand natural language prompts and perform visual question answering, for example, answering a broad range of questions posed in natural language about a recorded or live video stream. This deeper understanding of video content enables more accurate and meaningful interpretation of real-world scenarios and improves the functionality of video analytics applications. These agents promise to unlock entirely new industrial application possibilities.

Streamline Every Industrial Operation

Highly perceptive, accurate, and interactive visual AI agents will be deployed throughout our factories, warehouses, retail stores, airports, traffic intersections, and more. This will have a tremendous impact on operations teams looking to make better decisions using richer insights generated from natural interactions. Managers and operations teams will communicate with these agents in natural language, all powered by generative AI and large Vision Language Models with NVIDIA NIM™ microservices at their core.

Develop With NVIDIA NIM

NVIDIA NIM is a set of inference microservices that includes industry-standard APIs, domain-specific code, optimized inference engines, and an enterprise-grade runtime. It delivers multiple VLMs for building a visual AI agent that can process live or archived images and videos and extract actionable insights using natural language. We’ve created a reference workflow for a visual AI agent that you can try out to accelerate your development process.
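To see how a VLM NIM fits into an agent, the following minimal sketch sends a single image frame and a natural-language question to a NIM deployment through its OpenAI-compatible chat completions API. The endpoint URL, port, model name, and the frame.jpg file are placeholder assumptions; substitute the values for the VLM NIM you actually deploy.

```python
# Minimal sketch: query a deployed VLM NIM through its OpenAI-compatible
# chat completions API. The URL, port, and model name below are assumptions;
# replace them with the values for the NIM you deploy.
import base64
import requests

NIM_URL = "http://localhost:8000/v1/chat/completions"  # assumed local NIM endpoint
MODEL = "nvidia/vila"                                   # placeholder model name

# Encode one video frame (or image) as base64 for the request payload.
with open("frame.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": MODEL,
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe any safety hazards visible in this frame."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ],
    "max_tokens": 256,
}

response = requests.post(NIM_URL, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```

The same request pattern works against a cloud-hosted endpoint by swapping the URL and adding your API key to the request headers.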

Use NVIDIA VIA Microservices With NIM

NVIDIA VIA microservices are cloud-native building blocks that accelerate the development of visual AI agents powered by VLMs and NIM, whether deployed at the edge or in the cloud. One example is a summarization microservice used to build visual AI agents that process large volumes of video and produce curated summaries, as sketched below.
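As an illustration of how an application might drive such a summarization microservice, here is a minimal sketch that uploads a recorded video and requests a summary over REST. The base URL, endpoint paths, field names, and video filename are assumptions made for illustration only; the actual interface is defined by the VIA microservices API reference.

```python
# Minimal sketch of driving a VIA-style video summarization microservice over REST.
# The base URL, endpoint paths, field names, and filename are assumptions made
# for illustration; consult the VIA microservices API reference for the real interface.
import requests

VIA_BASE = "http://localhost:8100"  # assumed address of the summarization microservice

# 1) Upload a recorded video for processing (path and form field are assumptions).
with open("warehouse_cam_01.mp4", "rb") as f:
    upload = requests.post(f"{VIA_BASE}/files", files={"file": f})
upload.raise_for_status()
file_id = upload.json()["id"]  # assumed response field

# 2) Request a curated summary of the uploaded video (payload shape is an assumption).
summary = requests.post(
    f"{VIA_BASE}/summarize",
    json={
        "id": file_id,
        "prompt": "Summarize notable events and any anomalies in this footage.",
    },
)
summary.raise_for_status()
print(summary.json())
```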

These microservices are available for download, with more on the way to help you build new services.

Build Edge Agents with Jetson Platform Services

Developers can build visual AI agents on the NVIDIA Jetson™ edge AI platform using Jetson Platform Services, a new feature of NVIDIA JetPack™. The generative AI application runs entirely on an NVIDIA Jetson Orin™ device, detecting events to generate alerts and facilitating interactive Q&A sessions.

Build Visual AI Agents

Explore the reference workflow, powered by multiple Vision Language Models, to easily build your visual AI agent.