To understand the importance of VLMs, it’s helpful to know how past computer vision (CV) models work. Traditional convolutional neural network (CNN)-based CV models are trained for a specific task on a bounded set of classes. For example:
- A classification model that identifies whether an image contains a cat or a dog
- An optical character detection and recognition (OCR) model that reads the text in an image but doesn't interpret the format or any other visual information in the document
These models were built for a single purpose and could not go beyond the task or set of classes they were developed and trained for. If the use case changed or a new class needed to be added, a developer had to collect and label a large number of images and retrain the model, an expensive and time-consuming process. In addition, traditional CV models have no natural language understanding.
VLMs bring a new class of capabilities by combining vision foundation models, like CLIP, with LLMs, giving them both vision and language capabilities. Out of the box, VLMs show strong zero-shot performance on a variety of vision tasks, such as visual question-answering, classification, and optical character recognition. They are also extremely flexible: rather than being limited to a fixed set of classes, they can be applied to nearly any use case by simply changing a text prompt.
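For example, a CLIP-style model can act as an open-vocabulary classifier: the candidate classes are just text prompts, so adding or swapping a class requires no retraining. Below is a minimal sketch using the CLIP classes from Hugging Face Transformers; the checkpoint name, image path, and label prompts are placeholders, not a prescribed setup.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a pretrained CLIP checkpoint (placeholder model name).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate classes are just text prompts; change them without retraining.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]
image = Image.open("pet.jpg")  # placeholder image path

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Image-text similarity scores, converted to class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

Switching the task, say from pet breeds to vehicle types, is just a matter of editing the `labels` list.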
Using a VLM is very similar to interacting with an LLM. The user supplies text prompts that can be interleaved with images, and the model generates text output. The input prompts are open-ended, allowing the user to instruct the VLM to answer questions, summarize, explain the content, or reason about the image. Users can chat back and forth with the VLM, adding images into the context of the conversation as they go. VLMs can also be integrated into visual agents to autonomously perform vision tasks.
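As an illustration, the sketch below sends an image and a free-form question to an open-source VLM (LLaVA, via Hugging Face Transformers). The checkpoint, image file, and prompt are assumptions chosen for demonstration, and the prompt format varies by model.

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed open VLM checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

# Interleave an image with a free-form question; the <image> token marks
# where the image is inserted in the prompt (the format is model-specific).
image = Image.open("chart.png")  # placeholder image path
prompt = "USER: <image>\nSummarize what this chart shows. ASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

A multi-turn conversation follows the same pattern: append the model's reply and the next user message (with any new images) to the prompt and generate again.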