Multimodal large language models (MLLMs) are deep learning models that can understand and generate many forms of content, including text, images, video, and audio.
MLLMs expand the capabilities of traditional large language models (LLMs), which are primarily focused on processing and generating text. By integrating multiple types of data, MLLMs enable more complex and versatile applications that require the synthesis and interpretation of both textual and nontextual information.
This means that MLLMs can interpret a variety of data domains, including text, images, video, and audio.
The world is multimodal, and human interaction with digital content isn't limited to text.
MLLMs reflect this diversity with their ability to ingest, understand, and generate many data types, making AI interactions more natural and effective.
MLLMs are important because they help AI tools bridge the gap between humans and technology. The ability to understand and interpret different modalities of data gives rise to more impactful applications in everyday life. For example, health care providers can leverage MLLMs to help evaluate a patient’s X-rays and medical records, then suggest personalized treatments or possible diagnoses.
The value of these models reaches far beyond health care. MLLMs can ingest nuanced documents such as PDFs that mix diagrams, charts, images, and text. This opens up use cases in everything from education to the enterprise, where employees can leverage a chatbot to improve their workflows and productivity.
Similar to LLMs, MLLMs apply self-attention mechanisms, which compute attention scores reflecting how relevant different parts of the input are to one another. In an MLLM, self-attention lets the model relate tokens in the text (one modality) to regions of an image (another modality). Because self-attention doesn't capture sequence order, positional encodings are needed to preserve the meaning of ordered data (such as temporal sequences in video). Without them, the model would treat the input as an unordered set and lose that information.
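To make this concrete, here is a minimal, single-head sketch of scaled dot-product self-attention with sinusoidal positional encodings in PyTorch. In a real MLLM, queries, keys, and values come from learned projections, attention runs over many heads, and the inputs may be token or patch embeddings from any modality; the shapes and helper names here are illustrative only.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Standard sinusoidal positional encodings: one vector per position in the sequence."""
    position = torch.arange(seq_len).unsqueeze(1)                        # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

def self_attention(x: torch.Tensor) -> torch.Tensor:
    """Single-head scaled dot-product self-attention over a sequence x of shape (seq_len, d_model)."""
    d_model = x.size(-1)
    q, k, v = x, x, x                                      # queries, keys, and values all come from x
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_model)  # relevance of every position to every other
    weights = torch.softmax(scores, dim=-1)
    return weights @ v                                     # weighted mix of values

# Toy sequence of 10 "tokens" (could be text tokens or video-frame embeddings).
tokens = torch.randn(10, 64)
tokens = tokens + sinusoidal_positional_encoding(10, 64)   # inject order information
out = self_attention(tokens)
print(out.shape)  # torch.Size([10, 64])
```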
The MLLM training process can require extensive computational resources, because the underlying neural networks have billions of parameters. To manage this complexity, data and model parallelism split the computational workload across multiple GPUs, making training more efficient.
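As a rough illustration, the sketch below uses PyTorch's DistributedDataParallel for data parallelism, with a tiny linear layer standing in for a billion-parameter MLLM; it assumes the script is launched with torchrun, which sets the rank environment variables. Model parallelism, where the layers of a single model are split across GPUs, typically relies on dedicated frameworks and isn't shown here.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Launched with: torchrun --nproc_per_node=<num_gpus> train_ddp.py
dist.init_process_group(backend="nccl")            # torchrun provides rank/world-size env vars
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda()          # stand-in for a much larger MLLM
model = DDP(model, device_ids=[local_rank])         # each GPU holds a full copy of the model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(100):
    batch = torch.randn(32, 1024).cuda()            # in practice, a shard served by a DistributedSampler
    loss = model(batch).pow(2).mean()               # dummy objective for illustration
    optimizer.zero_grad()
    loss.backward()                                 # gradients are averaged (all-reduced) across GPUs here
    optimizer.step()

dist.destroy_process_group()
```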
Because an MLLM can process multiple modalities, there needs to be a way for all these modalities to be combined. Encoders help make this integration possible.
For each modality, a dedicated encoder transforms that type of input data (e.g., text, images, audio) into embeddings in a shared high-dimensional vector space. The embeddings from each modality are projected into this joint embedding space, so that representations of different modalities can be related and transformed into one another.
One example of an encoder-decoder pairing in an MLLM is an audio encoder combined with an image decoder. Put simply, the encoder captures contextual information and translates input data into embeddings, while the decoder takes those embedded representations and generates output in the target modality. The audio encoder takes a voice recording and translates it into a series of feature vectors in the shared vector space; the image decoder then takes embeddings from that joint space and generates the desired image.
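The toy sketch below traces that flow from an audio encoder, through the shared embedding space, to an image decoder. The AudioEncoder and ImageDecoder modules, the 512-dimensional joint space, and the mel-spectrogram input are illustrative assumptions; production MLLMs use large pretrained encoders and diffusion-based decoders.

```python
import torch
import torch.nn as nn

EMBED_DIM = 512  # dimensionality of the shared (joint) embedding space

class AudioEncoder(nn.Module):
    """Toy stand-in: maps a batch of audio feature frames to one joint-space embedding."""
    def __init__(self, n_mels: int = 80):
        super().__init__()
        self.rnn = nn.GRU(n_mels, 256, batch_first=True)
        self.proj = nn.Linear(256, EMBED_DIM)

    def forward(self, mel_frames: torch.Tensor) -> torch.Tensor:   # (batch, time, n_mels)
        _, hidden = self.rnn(mel_frames)
        return self.proj(hidden[-1])                               # (batch, EMBED_DIM)

class ImageDecoder(nn.Module):
    """Toy stand-in: maps a joint-space embedding to a small RGB image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(EMBED_DIM, 1024), nn.ReLU(), nn.Linear(1024, 3 * 32 * 32)
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z).view(-1, 3, 32, 32)                     # (batch, 3, 32, 32)

encoder, decoder = AudioEncoder(), ImageDecoder()
voice_recording = torch.randn(1, 200, 80)   # 200 frames of 80-bin mel features
z = encoder(voice_recording)                # embedding in the shared space
image = decoder(z)                          # output generated in the target modality
print(z.shape, image.shape)
```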
LLMs trained on large quantities of textual data powered the initial wave of generative AI tools. Their ability to map patterns learned during training onto novel queries allowed them to effectively generate articles, emails, and code snippets.
While LLMs serve as the brain, providing language and context understanding, MLLMs go beyond typical LLMs by powering the understanding and generation of a broad range of data modalities.
The shift from traditional LLMs to their more advanced multimodal counterparts involves changes not only in capabilities but also in underlying architecture, applications, and training and fine-tuning approaches.
| | Large Language Models | Multimodal Large Language Models |
|---|---|---|
| Data Processing | Text-only data encoded by text tokenizers | Multimodal data that requires separate encoders for each modality |
| Model Architecture | A single-transformer architecture | Separate encoders for each modality, followed by a fusion module that projects the encoded representations into a unified embedding space |
| Training Objective | Language modeling objectives such as next-token prediction | Often uses contrastive learning objectives that aim to align the representations of different modalities (see the sketch after this table) |
| Inference Computation Complexity | Quadratic in the input sequence length | The complexity of a text-only LLM, plus the added computation of encoding inputs and decoding outputs across multiple modalities |
| Modality Encoders | Not needed, since only text data is processed | Convert images, audio, and other nontext data into embeddings that reflect the content’s meaning |
| Input Projector | Processes textual embeddings directly, without alignment to other modalities | Aligns the encoded representations from various modalities with text data to create a unified input that the language model can process |
| LLM Backbone | Processes textual data | Processes aligned multimodal inputs, using pretrained knowledge to perform tasks such as reasoning, comprehension, and content generation |
| Output Projector | Not needed, since embeddings aren't converted back to other modalities | Maps the model’s output embeddings back to the target modality for generating nontextual outputs |
| Modality Generator | Not needed, since the model isn't designed to produce nontextual outputs | Produces outputs in individual modalities, typically using latent diffusion models (LDMs) |
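To illustrate the contrastive training objective referenced in the table, here is a sketch of a symmetric, CLIP-style InfoNCE loss over a batch of paired text and image embeddings that have already been projected into the joint space; the batch size, embedding dimension, and temperature below are arbitrary choices.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(text_emb: torch.Tensor,
                               image_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style loss: matching text/image pairs in a batch are pulled
    together in the joint embedding space, non-matching pairs are pushed apart."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(text_emb.size(0))          # the i-th text matches the i-th image
    loss_t2i = F.cross_entropy(logits, targets)
    loss_i2t = F.cross_entropy(logits.t(), targets)
    return (loss_t2i + loss_i2t) / 2

# Toy batch of 8 aligned text-image embedding pairs.
loss = contrastive_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```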
LLMs are typically built on a single-transformer architecture, optimized for processing sequential data and managing long-range text dependencies. This architecture enables LLMs to proficiently understand and generate language. In contrast, MLLMs employ a more complex design that includes separate encoders for each modality, such as transformers for text and convolutional neural networks (CNNs) for images. These separate encoders capture and encode information specific to each modality. Following this, a fusion module integrates these encoded representations into a unified embedding space. This architecture allows MLLMs to seamlessly incorporate features from diverse data types, facilitating a holistic understanding of multimodal input data.
LLMs and MLLMs have distinct training processes. As traditional text-based models, LLMs are trained using extensive datasets consisting of books, articles, and webpages. The goal is to teach the model to predict the next word in a sequence, enabling it to generate coherent and contextually relevant text. This process begins with data collection and preprocessing to clean and prepare the data for training. Transformers are the preferred architecture for LLMs due to their effectiveness in handling sequential data and long-range dependencies.
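As a concrete illustration of the next-token objective, the snippet below computes the standard shifted cross-entropy loss; the logits and token IDs are random stand-ins for real model outputs and tokenized training text.

```python
import torch
import torch.nn.functional as F

# Suppose `logits` are a language model's predictions for each position in a batch of
# token sequences, and `tokens` are the ground-truth token IDs (both stand-ins here).
vocab_size, batch, seq_len = 50_000, 4, 128
logits = torch.randn(batch, seq_len, vocab_size)         # model output
tokens = torch.randint(0, vocab_size, (batch, seq_len))  # tokenized training text

# Next-token prediction: the prediction at position t is scored against the token at t+1.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),              # predictions for positions 0..T-2
    tokens[:, 1:].reshape(-1),                           # targets are the tokens shifted by one
)
print(loss.item())
```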
On the other hand, MLLMs, such as GPT-4V, are designed to learn from multiple data types, like images and text. This training is more complex, as it involves linking different modalities, such as associating the image of a dog with the word “dog” or generating descriptive text for an economic chart. The training process for MLLMs integrates techniques like CNNs for image processing with transformers for text, ensuring the model can handle and integrate features from both modalities effectively.
Because of their architectural differences, the computational demands of training LLMs versus MLLMs also vary. LLMs require substantial GPU resources to manage the large-scale parallel processing needed for handling billions of parameters and large datasets. The self-attention mechanism in transformers, with its quadratic complexity, further increases these requirements. MLLMs, however, have even higher computational demands. Besides the challenges associated with transformers, they need additional processing power for CNNs used in image handling. The integration of different modalities often involves cross-attention techniques, adding to the computational load. Therefore, training MLLMs typically calls for more sophisticated hardware configurations and innovations in model architecture to optimize efficiency.
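The cross-attention step mentioned above can be sketched with PyTorch's built-in multi-head attention, where text tokens act as queries and image patch features supply the keys and values; the batch size, sequence lengths, and model width below are arbitrary.

```python
import torch
import torch.nn as nn

# Text tokens attend to image patch features: queries come from one modality,
# keys and values from the other.
d_model = 512
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

text_tokens = torch.randn(2, 32, d_model)     # (batch, text length, d_model)
image_patches = torch.randn(2, 196, d_model)  # (batch, 14x14 patches, d_model)

fused, attn_weights = cross_attn(query=text_tokens, key=image_patches, value=image_patches)
print(fused.shape)         # torch.Size([2, 32, 512]) -- text representations enriched with visual context
print(attn_weights.shape)  # torch.Size([2, 32, 196]) -- how much each token attends to each patch
```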
Exploring the complexities of MLLMs reveals a range of challenges, from architectural intricacies to the nuances of data management and computational demands.
Some of the key challenges include aligning high-quality data across modalities, the architectural complexity of combining separate encoders into a coherent model, and the substantial computational resources required for training and inference.
Because MLLMs can process multiple modalities, they enable compelling applications across many industries.
Selecting an appropriate development framework is essential for working with MLLMs. It’s important to choose a framework that not only supports the specific modalities relevant to your project but also fits well with your existing technology stack and development practices.
NVIDIA NeMo™ is an end-to-end platform for developing custom generative AI. The NeMo framework provides a comprehensive library designed to facilitate the creation and fine-tuning of MLLMs across various data modalities.
The effectiveness of an MLLM heavily depends on the quality and alignment of the multimodal data it’s trained on. This involves collecting datasets that include aligned pairs or groups of different modalities, such as text-image pairs or video with captions. Proper preprocessing and normalization of this data are crucial to ensure that it can be used effectively to train the model.
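A minimal sketch of such preprocessing might look like the dataset class below, which pairs captions with images, tokenizes the text, and resizes and normalizes the images. The TextImagePairDataset class and the tokenizer callable are hypothetical; real pipelines add filtering, deduplication, and modality-specific cleaning, plus a collate function to pad variable-length token sequences.

```python
import torch
from torch.utils.data import Dataset
from torchvision import transforms
from PIL import Image

class TextImagePairDataset(Dataset):
    """Hypothetical dataset of aligned (caption, image) pairs for multimodal training."""
    def __init__(self, pairs, tokenizer, max_length: int = 77):
        self.pairs = pairs                      # list of (caption, image_path) tuples
        self.tokenizer = tokenizer              # any callable that maps text -> list of token IDs
        self.max_length = max_length
        self.preprocess = transforms.Compose([  # normalize images to a fixed size and range
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
        ])

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        caption, image_path = self.pairs[idx]
        token_ids = self.tokenizer(caption)[: self.max_length]        # clean and truncate the text side
        image = self.preprocess(Image.open(image_path).convert("RGB"))
        return torch.tensor(token_ids), image
```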
Leveraging a pretrained model can significantly reduce the need for extensive computational resources and provide a shortcut to achieving effective results. Fine-tuning this model on a specific dataset allows it to adapt to the particular characteristics and requirements of an application, enhancing its performance and relevance.
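One common lightweight fine-tuning pattern, sketched below under illustrative assumptions, freezes a pretrained vision backbone (here a torchvision ResNet-50) and trains only a small projection layer that maps its features into the language model's embedding space; the 4096-dimensional target size and the placeholder loss are hypothetical.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

# Load a pretrained vision backbone and freeze it; only the projection layer is trained.
backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
backbone.fc = nn.Identity()                  # expose 2048-dim features instead of class logits
for p in backbone.parameters():
    p.requires_grad = False                  # frozen: no gradient updates, far less compute

projection = nn.Linear(2048, 4096)           # 4096 = hypothetical LLM hidden size
optimizer = torch.optim.AdamW(projection.parameters(), lr=1e-4)

images = torch.randn(8, 3, 224, 224)         # a batch from your fine-tuning dataset
with torch.no_grad():
    features = backbone(images)              # (8, 2048) frozen visual features
image_tokens = projection(features)          # embeddings the LLM backbone can consume

loss = image_tokens.pow(2).mean()            # placeholder for the real training objective
optimizer.zero_grad()
loss.backward()
optimizer.step()
```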
Once the model is set up, it’s important to test it extensively with real-world data and scenarios. This testing phase is critical to understanding how well the model performs and identifying any areas where it may need further refinement. Continuous iteration based on performance feedback is key to developing a robust MLLM that reliably meets your objectives.
Deploying an MLLM involves integrating it into a suitable operational environment where it can receive inputs and generate outputs as required. Post-deployment, it’s important to monitor the model’s performance continuously and adjust its configuration as needed to maintain its effectiveness and efficiency.
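As a rough sketch of that deployment step, the snippet below wraps a stand-in model behind a FastAPI endpoint and records per-request latency as a simple monitoring signal; the endpoint name, request schema, and model are all illustrative assumptions rather than a prescribed setup.

```python
import time
import torch
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = torch.nn.Linear(512, 512).eval()      # stand-in for the deployed MLLM

class GenerateRequest(BaseModel):
    embedding: list[float]                    # pre-encoded input, kept simple for illustration

@app.post("/generate")
def generate(req: GenerateRequest):
    start = time.perf_counter()
    with torch.no_grad():
        out = model(torch.tensor(req.embedding))
    latency_ms = (time.perf_counter() - start) * 1000
    # In production, export latency/throughput/quality metrics to a monitoring system.
    return {"output": out.tolist(), "latency_ms": latency_ms}
```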