In-Depth Guide to Visual Language Models

❗ To learn how to use Visual Language Models for medical applications, check out our blog on Building Medical AI assistants with Visual LLMs.

The evolution of Visual Large Language Models (VLLMs) began with the Transformer architecture introduced in the “Attention Is All You Need” paper by Vaswani et al., which revolutionized natural language processing by enabling efficient self-attention mechanisms. Building on this, OpenAI's CLIP model combined visual and textual data, demonstrating strong zero-shot learning capabilities. Models like ViLBERT and VisualBERT extended this by integrating visual and linguistic inputs, improving multimodal interactions. Recent advancements include LLaVA and SigLIP, which enhance visual grounding and fine-grained visual concept recognition. Techniques like fine-grained reward modeling in the ViGoR framework and benchmarks such as CODIS further enhance VLLMs' real-world applicability. This blog will explore these developments in detail, highlighting their impact and potential applications.

Foundational Concepts

Convolutional Neural Networks (CNNs)

Convolutional Neural Networks (CNNs) have revolutionized image processing and computer vision tasks. They are specifically designed to process and interpret visual data by mimicking the way humans perceive images. A CNN consists of multiple layers, including convolutional layers, pooling layers, and fully connected layers. The convolutional layers apply filters to the input image to create feature maps that highlight various aspects of the image, such as edges, textures, and shapes. Pooling layers then reduce the spatial dimensions of these feature maps, retaining the most important information while reducing computational complexity. Fully connected layers at the end of the network combine these features to make predictions.
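To make the convolution and pooling steps concrete, here is a minimal pure-Python sketch. The toy image and the hand-picked edge filter are assumptions for illustration; a real CNN learns its filters during training and runs on optimized tensor libraries.

```python
def conv2d(image, kernel):
    """Valid (no padding) 2D cross-correlation of a grayscale image with one filter."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[i * size + di][j * size + dj]
                for di in range(size) for dj in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Toy 5x5 image: dark columns (0) on the left, bright columns (1) on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-picked vertical-edge filter (hypothetical; trained networks learn these).
vertical_edge = [[-1, 1], [-1, 1]]

feature_map = conv2d(image, vertical_edge)  # each row is [0, 2, 0, 0]
pooled = max_pool2d(feature_map)            # [[2, 0], [2, 0]]
```

The feature map is non-zero only where dark meets bright, and pooling halves each spatial dimension while keeping that strongest response.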

Vision Transformers (ViTs)

Vision Transformers are a recent innovation in image processing that adapt the transformer architecture, originally designed for natural language processing, to handle visual data. Unlike CNNs, which process images through local receptive fields, ViTs divide an image into fixed-size patches and treat each patch as a token, similar to words in a sentence. These tokens are then processed by a standard transformer encoder, which captures global context through self-attention mechanisms. ViTs have shown state-of-the-art performance on various image classification tasks, particularly when trained on large datasets.
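The patch-splitting step can be sketched in a few lines of plain Python. This is an illustration only: the 4x4 single-channel image is made up, and a real ViT would go on to project each flattened patch linearly and add position embeddings, which is omitted here.

```python
def patchify(image, patch_size):
    """Split an H x W image (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[top + r][left + c]
                     for r in range(patch_size) for c in range(patch_size)]
            patches.append(patch)
    return patches


# Toy 4x4 image whose pixel values are simply 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = patchify(image, patch_size=2)  # 4 tokens, each a flattened 2x2 patch
```

Each entry of `tokens` plays the role that a word embedding plays in a text Transformer.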

Advantages of Vision Transformers (ViTs) over CNNs

Choosing Vision Transformer (ViT) over Convolutional Neural Networks (CNNs) for image processing tasks depends on the specific requirements and characteristics of the application. ViT is advantageous due to its ability to capture long-range dependencies and contextual information across the entire image through self-attention mechanisms, which are often more challenging for CNNs that rely on local receptive fields. This capability allows ViT to excel in tasks requiring a holistic understanding of the image, such as image classification, object detection, and segmentation. Additionally, ViT has shown impressive performance on large datasets and benefits significantly from extensive pre-training on vast amounts of data.

However, CNNs remain highly efficient for many real-time applications due to their optimized convolution operations, which are well-suited for current hardware accelerators like GPUs. CNNs' hierarchical feature extraction is particularly effective for tasks involving spatial hierarchies and local patterns, making them ideal for applications such as real-time video processing, edge computing, and mobile devices where computational efficiency and lower latency are critical.

Technical Concepts

Before we get into how these models work, we need to understand the technical terms and concepts in the Visual Large Language Model architecture. Let’s look at why these layers and concepts are important and what role each plays in the architecture.

Multi-Modal Embeddings

Multi-modal embeddings involve representing data from different modalities (e.g., text and images) in a unified embedding space. This allows the model to understand and relate information from multiple modalities simultaneously, leading to more accurate and efficient information retrieval and analysis across various domains. An implementation of this concept is the ImageBind model, which maps diverse data types like text, images, audio, and even sensor data into a single embedding space. This unified approach enhances the model's ability to perform tasks that require cross-modal understanding and integration, making it particularly powerful for applications such as cross-modal search, multimodal content generation, and comprehensive data analysis.

Techniques for Aligning Visual and Textual Data

Joint Embedding Space

Techniques like CLIP (Contrastive Language-Image Pre-training) project both visual and textual data into a common embedding space. This is achieved by training the model on pairs of images and corresponding text descriptions. Joint embedding space is a more efficient and flexible option compared to a cross-modal transformer for several reasons. First, it provides a unified representation where both visual and textual data are projected into a common space, enabling straightforward comparison and retrieval tasks. This approach simplifies the architecture and reduces computational complexity since it avoids the need for separate, intricate attention mechanisms for each modality as required in cross-modal transformers. Additionally, joint embedding spaces are highly effective in scenarios like image-text matching, where the goal is to find correspondences between different types of data. They facilitate quick and efficient retrieval of relevant information by leveraging learned associations in a shared latent space. This can lead to faster inference times and lower resource consumption, making joint embedding spaces more suitable for real-time applications and large-scale deployments.
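Once both modalities live in one space, retrieval reduces to a nearest-neighbour search. The sketch below assumes hand-made 3-dimensional embeddings and hypothetical file names; in practice the vectors would come from trained image and text encoders.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, candidates):
    """Return the candidate name whose embedding is closest to the query."""
    return max(candidates,
               key=lambda name: cosine_similarity(query_embedding, candidates[name]))


# Hypothetical image embeddings (stand-ins for image-encoder outputs).
image_embeddings = {
    "dog_photo.jpg": [0.9, 0.1, 0.0],
    "cat_photo.jpg": [0.1, 0.9, 0.1],
    "car_photo.jpg": [0.0, 0.1, 0.9],
}
# Hypothetical text embedding, standing in for the encoded caption "a photo of a dog".
text_embedding = [0.8, 0.2, 0.1]

best_match = retrieve(text_embedding, image_embeddings)  # "dog_photo.jpg"
```

Because image embeddings can be computed once and cached, only the query needs to be encoded at search time, which is where much of the efficiency described above comes from.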

Cross-Modal Transformers

Models such as ViLBERT and LXMERT use separate encoders for each modality and align the representations using transformer layers that allow for cross-modal interactions. ViLBERT, for example, consists of two parallel streams for visual and linguistic processing that interact through co-attentional transformer layers. This structure allows each modality to have a different depth and enables sparse interaction through co-attention.

Why the Embedding Space Is Important

Embedding spaces enable VLLMs to perform zero-shot learning, where the model recognizes and categorizes new, unseen data based on its relationship to known data. They also improve cross-modal retrieval, that is, retrieving relevant information across different modalities, which enhances applications like image captioning and visual question answering. Finally, fine-grained embeddings enable a nuanced and detailed understanding of visual concepts.

Attention Mechanisms in VLLMs


Self-Attention

Self-attention allows the model to weigh the importance of different parts of a single input (e.g., different words in a sentence or different regions in an image) relative to each other. This mechanism helps in capturing long-range dependencies and contextual relationships within the same modality. In transformer-based models like BERT and Vision Transformers (ViT), self-attention helps in understanding the contextual relevance of different tokens (words or image patches).

The self-attention mechanism in Transformers allows the model to dynamically determine the relevance of each word in a sentence by assigning attention scores, which are computed from parameters learned during training. Each word in the input sequence is first converted into an embedding and combined with positional encoding to incorporate information about the word's position. The self-attention mechanism then computes three vectors for each word: Query, Key, and Value. The attention score between each pair of words is calculated using the dot product of their Query and Key vectors, scaled by the square root of the Key dimension and normalized via a softmax function to produce attention weights.

These weights dictate the focus each word should have on other words in the sequence. Multiple attention heads are used to capture different aspects of the relationships between words. The refined representations produced by these heads are then passed through several layers, allowing the model to understand complex word relationships and contexts better. For example, in the sentence "The animal didn't cross the street because it was too tired," the self-attention mechanism would likely assign higher attention scores between "it" and "animal" rather than "it" and "street," based on their contextual relevance. This iterative refinement through multiple layers enables the model to perform tasks like coreference resolution effectively, focusing more on contextually relevant words to minimize the loss and produce accurate outputs.
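The computation described above can be sketched for a single attention head in plain Python. The token embeddings and the identity projection matrices are assumptions chosen so the result is easy to check by hand; real models learn the projections and use many heads and layers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over token embeddings x."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    weights = []
    for q_i in q:
        # Score of this token's Query against every token's Key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d_k) for k_j in k]
        weights.append(softmax(scores))
    # Each output row is the attention-weighted sum of the Value vectors.
    return matmul(weights, v), weights


# Three toy tokens; the first and third share the same embedding.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned Q/K/V projections

output, weights = self_attention(x, identity, identity, identity)
```

In this toy case the first token attends equally to itself and to the third token, and less to the second, because similar embeddings produce larger Query-Key dot products.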


Cross-Attention

Cross-attention enables the model to align and integrate information from two different modalities. It helps in establishing correspondences between visual features and textual descriptions. In models like ViLBERT, cross-attention layers allow for the fusion of visual and textual information, enabling the model to generate coherent and contextually relevant outputs.

In cross-attention, the Query (Q) vectors are derived from one modality, such as text, while the Key (K) and Value (V) vectors come from another modality, such as images. For the sentence "The animal didn't cross the street because it was too tired," the word "it" would generate a Query vector from its textual embedding. The image associated with the sentence is divided into regions, each represented by feature vectors serving as the Key and Value vectors. The cross-attention mechanism calculates attention scores by taking the dot product of the Query vector (text) and the Key vectors (image regions), which is then scaled and passed through a softmax function to produce normalized attention weights. 

These weights indicate the relevance of each image region to the word "it." The Value vectors are then weighted by these attention scores and summed, producing a context vector that integrates relevant visual features into the textual context. This process helps the model correctly associate "it" with "animal" rather than "street" by focusing on the visual regions related to the animal. The output from the cross-attention block is a combined representation that enhances the model's understanding of the sentence by incorporating both visual and textual information, allowing for better performance on tasks like visual question answering and image captioning.
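Here is a minimal sketch of that flow for one text token attending over two image regions. All vectors are made up for illustration: the Query stands in for the embedding of "it", and the two regions stand in for detector features of the animal and the street.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Attend from one text Query vector to a set of image-region Keys and Values."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the region Value vectors.
    context = [sum(w * value[d] for w, value in zip(weights, values))
               for d in range(len(values[0]))]
    return context, weights


query_it = [1.0, 0.0]                     # hypothetical Query for the word "it"
region_keys = [[2.0, 0.0], [0.0, 2.0]]    # region 0: animal, region 1: street
region_values = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical visual features per region

context, weights = cross_attention(query_it, region_keys, region_values)
```

With these toy numbers the animal region receives most of the attention weight, so the resulting context vector is dominated by the animal's visual features.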

Contrastive Learning

Contrastive learning is a self-supervised learning technique that focuses on learning representations by distinguishing between similar (positive) and dissimilar (negative) pairs. This method is particularly effective in scenarios where it is crucial to differentiate between closely related data points. In models like CLIP (Contrastive Language-Image Pre-training) developed by OpenAI, contrastive learning is applied to large datasets of image-text pairs. CLIP uses separate encoders for images and text, projecting both into a shared embedding space. The training objective, driven by a contrastive loss function, aims to maximize the similarity between the embeddings of positive pairs (correct image-text pairs) while minimizing the similarity between negative pairs (incorrect image-text pairs). This approach enables the model to perform well on various tasks without task-specific fine-tuning, demonstrating strong zero-shot learning capabilities compared with other state-of-the-art deep learning models across different datasets.
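A minimal sketch of such a contrastive objective, in the symmetric cross-entropy form commonly associated with CLIP, is shown below. It assumes the embeddings are already L2-normalized and uses a fixed temperature of 0.07 (CLIP treats the temperature as a learnable parameter); the 2-dimensional embeddings are toy values.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(row)
    log_total = peak + math.log(sum(math.exp(v - peak) for v in row))
    return [v - log_total for v in row]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over image-text similarities.

    Row i of image_embs and row i of text_embs are assumed to be a matching pair.
    """
    n = len(image_embs)
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_embs]
              for img in image_embs]
    # Image-to-text direction: image i should select text i.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: text j should select image j.
    columns = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


images = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])   # matching pairs
shuffled = clip_contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])  # mismatched pairs
```

The loss is close to zero when each image is most similar to its own caption and grows large when the pairs are mismatched, which is the signal that pulls positives together and pushes negatives apart.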

The benefits of contrastive learning in VLLMs include the development of robust and generalizable representations, the ability to handle zero-shot learning scenarios, and scalability with large datasets. However, it also presents challenges such as the need for efficient negative sampling, high computational costs, and the reliance on high-quality data. The implementation of contrastive learning in CLIP, as detailed in the paper "Learning Transferable Visual Models From Natural Language Supervision" by Alec Radford et al., highlights these strengths and challenges, showcasing the potential and limitations of this powerful technique. For more detailed insights, refer to the original CLIP paper.

How Does a Visual Large Language Model Work?


Backbone

The Backbone in a Visual Large Language Model (VLLM) is the fundamental neural network tasked with extracting essential features from input images. Typically, this backbone is a convolutional neural network (CNN) like ResNet or a Vision Transformer (ViT). These networks process the image, converting it into a high-dimensional tensor of visual features (Fv), which encapsulates critical spatial and semantic details. These extracted features form the basis for subsequent stages, facilitating the model's ability to interpret and manipulate the visual data effectively for diverse tasks.

Language-Guided Image Tokenizer

The Language-Guided Image Tokenizer is crucial for integrating visual and textual information within the VLLM framework. This component operates by initially receiving visual features (Fv) from the Backbone and textual features (Ft) from a text encoder, often a Transformer model like BERT. Using a cross-attention mechanism, it aligns and combines these modalities, producing language-guided image tokens (T). These tokens are enriched with both visual and contextual data, enabling the model to understand and respond accurately to the tasks specified by the accompanying language instructions.

Random Query

The Random Query component represents the VLLM's capability to handle a wide array of tasks flexibly. This feature allows the model to process various vision-only and vision-language tasks dynamically. By introducing randomness, the model can adapt to different inputs and instructions, showcasing its robustness and versatility in generating appropriate outputs. This adaptability is key to the model's performance across diverse applications, enabling it to handle novel and unexpected scenarios effectively.

Language Instructions (<text>)

Language Instructions are the natural language prompts that guide the VLLM on what specific tasks to perform. These instructions provide detailed descriptions of the tasks, such as "Describe the image <image> in detail" for vision-language tasks or "For each object in the image <image> that belongs to the class set <class>, output a tuple with the class label and coordinates" for vision-only tasks. The instructions are parsed into a machine-readable format, directing the model on how to interpret the visual data and generate the required outputs.

Open-Ended Task Decoder with LLM

The Open-Ended Task Decoder with LLM is the component that interprets the language-guided image tokens (T) and generates the final output based on the provided instructions. This decoder uses a large language model (LLM) such as GPT to process the integrated tokens, leveraging its extensive language understanding to produce meaningful results. Whether classifying tokens for object detection or generating sequences for tasks like image captioning, this decoder can adapt its outputs to the specified formats, ensuring flexibility and accuracy in addressing a variety of vision-centric tasks.

Desired Output

The Desired Output is the end result produced by the VLLM, tailored to the task defined by the language instructions. This output can take various forms depending on the task, such as class labels and bounding box coordinates for object detection, descriptive text for image captioning, or text-based answers for visual question answering. The ability to generate such a wide range of outputs demonstrates the VLLM's versatility and effectiveness in integrating and processing both visual and textual information to meet diverse application needs.

How to Train a Visual LLM

Meta released an amazing guide on training Visual Language Models.

Training a Vision-Language Large Model (VLLM) involves several crucial steps to ensure the model effectively associates textual descriptions with visual elements (grounding) while managing computational resources efficiently. Grounding can be enhanced by using bounding box annotations to teach the model where objects are located in the images, employing contrastive learning techniques with negative captioning to distinguish between correct and incorrect text-image pairs, and ensuring high-quality, diverse datasets. Optimizing data quality by pruning low-quality or duplicate entries and improving caption quality with synthetic data generation techniques are also essential steps.

Managing GPU resources is critical due to the significant computational requirements for training VLLMs. High-quality datasets reduce the need for extensive compute power, and efficient training techniques like masking and optimized data loading can speed up the process. Leveraging pre-trained models for fine-tuning instead of training from scratch can also help manage costs. Tools like torch.compile and libraries like xformers speed up model execution and attention computation, while Fast Forward Computer Vision (FFCV) helps in creating faster-loading data files.

Several considerations are worth keeping in mind when training VLMs. Data is one of the most important aspects: a diverse and balanced dataset is needed to learn good world models that span enough concepts. It is also important to remove duplicates, which are common in large-scale datasets, because doing so saves a lot of compute time and mitigates the risk of memorization. Pruning the data is another important component, since we want to be sure that the captions are indeed related to the image content. Improving caption quality is likewise crucial for VLM performance. Grounding is another important step to ensure that VLMs correctly associate words with specific concepts; two common grounding methods leverage either bounding boxes or negative captions. Lastly, alignment is a much-needed step to ensure that the model produces answers that are expected from a human point of view.

How Many GPUs Are Required for Training

The compute resources required for training a VLLM significantly influence the budget needed for such projects. Models like CLIP and OpenCLIP have utilized more than 500 GPUs, which equates to costs in the hundreds of thousands of dollars—often inaccessible for most companies or academic labs. However, by using high-quality datasets and leveraging efficient techniques like masking strategies, training a contrastive model like CLIP on hundreds of millions of images from scratch can be done with as few as 64 GPUs, costing around $10K in compute. If the VLM leverages existing pre-trained image or text encoders, or LLMs, the cost of learning a mapping should be much lower.
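As a rough back-of-the-envelope check, the budget above can be turned into wall-clock training time. The hourly price below is a hypothetical assumption, not a figure from this article or from any provider; substitute your own rate.

```python
def training_cost(num_gpus, hours, usd_per_gpu_hour):
    """Total compute cost in USD for a run that keeps every GPU busy."""
    return num_gpus * hours * usd_per_gpu_hour


def hours_within_budget(budget_usd, num_gpus, usd_per_gpu_hour):
    """Wall-clock hours a fixed budget buys on a cluster of num_gpus GPUs."""
    return budget_usd / (num_gpus * usd_per_gpu_hour)


ASSUMED_RATE = 2.0  # hypothetical USD per GPU-hour (assumption for illustration)

hours = hours_within_budget(10_000, 64, ASSUMED_RATE)  # 78.125 hours
days = hours / 24
```

Under that assumed rate, a $10K budget on 64 GPUs corresponds to roughly three days of continuous training; a higher or lower hourly price scales the figure proportionally.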

Steps for Training VLLM

Data Preparation

Collect and preprocess a diverse set of image-text pairs, ensuring that the dataset is both extensive and varied to cover a wide range of concepts. This includes ensuring high-quality captions using synthetic data generation techniques if necessary. Removing duplicates and low-quality samples from the dataset is crucial to save computational resources and prevent the model from memorizing redundant information, which can degrade its performance and efficiency. Proper data curation and preparation form the foundation for successful VLLM training.
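A minimal sketch of the deduplication and pruning steps is shown below. It is deliberately simple: it only catches byte-identical images with captions that match after whitespace and case normalization, and the three-word caption threshold is an arbitrary assumption. Production pipelines typically add near-duplicate detection and image-text relevance filtering on top.

```python
import hashlib


def deduplicate(pairs):
    """Keep the first occurrence of each (image bytes, normalized caption) pair."""
    seen, unique = set(), []
    for image_bytes, caption in pairs:
        normalized = " ".join(caption.lower().split())
        key = (hashlib.sha256(image_bytes).hexdigest(), normalized)
        if key not in seen:
            seen.add(key)
            unique.append((image_bytes, caption))
    return unique


def drop_short_captions(pairs, min_words=3):
    """Discard pairs whose caption is too short to describe the image."""
    return [(image, caption) for image, caption in pairs
            if len(caption.split()) >= min_words]


# Toy dataset; the byte strings stand in for real image files.
raw_pairs = [
    (b"image-a", "A dog running on the beach"),
    (b"image-a", "a dog  running on the beach"),  # duplicate after normalization
    (b"image-b", "IMG_0042"),                      # uninformative caption
    (b"image-c", "Two cats sleeping on a sofa"),
]

clean_pairs = drop_short_captions(deduplicate(raw_pairs))  # 2 pairs remain
```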

Model Architecture: 

Choose an appropriate model architecture based on the specific requirements of your task, whether that is a contrastive, masking-based, or generative model. For effective grounding, consider models that leverage bounding boxes to explicitly indicate object locations or those that use negative samples to teach the model to distinguish between correct and incorrect text-image pairs. The choice of architecture should align with the end goals of the VLLM, such as image retrieval, caption generation, or both.

Training Process: 

Implement contrastive learning techniques to align text and image representations effectively. This involves training the model to push the representations of matching image-text pairs closer together while pushing non-matching pairs further apart. Additionally, implement masking strategies to improve training efficiency and model performance by randomly masking parts of the input data and training the model to predict the masked content. Fine-tune pre-trained models to reduce computational costs, leveraging existing knowledge to expedite the training process and achieve better initial performance.
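The masking step can be sketched as follows. The 75% mask ratio and the 16 placeholder patch tokens are assumptions for illustration; the point is that the encoder only needs to process the visible tokens, which is where the efficiency gain comes from.

```python
import random


def random_mask(tokens, mask_ratio, rng):
    """Randomly hide a fraction of tokens; return visible tokens and masked indices."""
    num_masked = int(len(tokens) * mask_ratio)
    masked = set(rng.sample(range(len(tokens)), num_masked))
    visible = [token for index, token in enumerate(tokens) if index not in masked]
    return visible, sorted(masked)


rng = random.Random(0)  # seeded so the sketch is reproducible
patch_tokens = [f"patch_{i}" for i in range(16)]

visible, masked_idx = random_mask(patch_tokens, mask_ratio=0.75, rng=rng)
# The encoder would see only the 4 visible patches; the 12 masked positions
# become prediction targets (or are simply skipped to save compute).
```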

Optimization Techniques: 

Apply efficient attention mechanisms and data loading optimizations to ensure that the training process is as fast and effective as possible. Utilize tools like torch.compile and xformers, which can offer significant speed improvements for model training. Regularly evaluate the model's performance and adjust hyperparameters as needed to ensure optimal results. Optimizing these aspects can greatly reduce the overall training time and computational costs while maintaining or improving model performance.

Fine-Tuning and Evaluation:

Fine-tune the model on specific downstream tasks to ensure it performs well in practical applications. This involves adjusting the model parameters based on specific task requirements, such as image classification, caption generation, or retrieval tasks. Evaluate the model using benchmarks like zero-shot and retrieval tasks to ensure it generalizes well across different scenarios. Regular performance evaluations help in identifying and addressing potential issues early, ensuring the model is robust and reliable for real-world applications.

Improving Grounding

Grounding in a Visual Large Language Model (VLLM) refers to the process of associating textual descriptions with specific visual elements within an image. This involves identifying and linking parts of the text to corresponding objects or regions in the visual data, enabling the model to understand and interpret images in the context of the provided language. Grounding helps the model make accurate predictions and generate relevant outputs by ensuring that the visual and textual components are meaningfully connected, enhancing tasks like object detection, image captioning, and visual question answering.

Grounding is a significant challenge in the VLM and generative model literature. It primarily addresses the issue of models not fully understanding text prompts, which can result in ignoring parts of the prompt or generating hallucinated content not present in the prompt. Challenges in grounding include understanding spatial relations (e.g., left or right of an object), handling negations, counting, and recognizing attributes like colors or textures. Although no single method can completely solve grounding issues, several techniques can improve grounding performance.

Using Bounding Box Annotations:

Models like X-VLM leverage bounding box annotations, incorporating box regression and Intersection over Union (IoU) loss to accurately locate and align visual concepts with their corresponding textual descriptions. By knowing where objects are in images and the associated captions, the model can better associate text with visual clues, improving grounding. X-VLM is trained on datasets like COCO, Visual Genome, SBU, and Conceptual Captions, using up to 16 million images. This extensive training data with bounding box annotations enables X-VLM to excel in tasks like image-text retrieval, visual reasoning, visual grounding, and image captioning.
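Intersection over Union itself is simple to compute. The sketch below uses plain IoU on two made-up boxes in (x_min, y_min, x_max, y_max) format and derives a loss as 1 minus IoU; grounding models may use variants such as generalized IoU, so treat this as an illustration of the idea rather than X-VLM's exact loss.

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


predicted = (0.0, 0.0, 2.0, 2.0)     # hypothetical predicted box
ground_truth = (1.0, 1.0, 3.0, 3.0)  # hypothetical annotated box

overlap = iou(predicted, ground_truth)  # 1/7: intersection area 1, union area 7
iou_loss = 1.0 - overlap                # smaller when the boxes agree more
```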

Negative Captioning:

Contrastive objectives use negative samples to mitigate collapse, enhance generalization, and improve discriminative feature learning. By contrasting positive pairs (similar or related samples) with negative pairs (dissimilar or unrelated samples), models develop a nuanced understanding of data, grasping underlying patterns that distinguish different classes or categories. Recent works have shown that using negative samples can mitigate various problems in VLMs. For instance, the ARO benchmark evaluates VLMs on their ability to correctly associate images with captions, using negative samples to test the model's understanding of incorrect pairings. This approach has shown that VLMs significantly benefit from the differentiation capabilities fostered by exposure to negative samples, leading to more accurate and contextually aware models. 
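One simple way to build a negative caption is to swap two words so that the sentence keeps the same vocabulary but changes meaning. The helper below is a hypothetical sketch of that idea, not the procedure of any specific benchmark.

```python
def swap_words(caption, first, second):
    """Create a hard negative by exchanging two words in the caption."""
    words = caption.split()
    i, j = words.index(first), words.index(second)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)


positive = "the horse is eating the grass"
negative = swap_words(positive, "horse", "grass")
# negative == "the grass is eating the horse"
```

A model that treats captions as a bag of words scores both sentences the same against the image; training or evaluating with such pairs rewards models that actually track who does what.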

Recent Research in Visual LLM 

Llama 3-V

Llama 3-V leverages the SigLIP model to embed visual information. SigLIP distinguishes itself by employing a pairwise sigmoid loss instead of the softmax-based contrastive loss used by CLIP. Input images are transformed into patch embeddings, which are aligned with textual tokens via a projection block using self-attention mechanisms. This joint representation, combining visual and textual tokens, is processed through Llama 3. Unlike models such as LLaVA that utilize a single linear layer for image embeddings, Llama 3-V uses two self-attention blocks in its projection, which are intended to capture more intricate patterns in the data and improve multimodal understanding and benchmark performance. This architecture is particularly optimized for cost-effective training and inference, maintaining high performance with significantly lower computational resources.
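To illustrate how a pairwise sigmoid loss differs from a softmax-based one, here is a minimal sketch. Every image-text pair is scored independently as a binary decision (match or no match), so no normalization across the batch is needed. The temperature `t=10.0` and bias `b=-10.0` are treated here as fixed assumptions; in SigLIP they are learnable, and the toy embeddings are made up.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def sigmoid_pairwise_loss(image_embs, text_embs, t=10.0, b=-10.0):
    """Pairwise sigmoid loss: each (image, text) pair is an independent binary task."""
    n = len(image_embs)
    total = 0.0
    for i, img in enumerate(image_embs):
        for j, txt in enumerate(text_embs):
            logit = t * sum(a * c for a, c in zip(img, txt)) + b
            label = 1.0 if i == j else -1.0  # +1 for matching pairs, -1 otherwise
            total += log_sigmoid(label * logit)
    return -total / n


images = [[1.0, 0.0], [0.0, 1.0]]
matched = sigmoid_pairwise_loss(images, [[1.0, 0.0], [0.0, 1.0]])
mismatched = sigmoid_pairwise_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

As with the softmax version, correctly paired embeddings yield a much lower loss than mismatched ones.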


VisualBERT

VisualBERT processes image regions, extracted using an object detector, and treats these regions as visual tokens alongside text tokens. Both text and visual features are embedded into a shared space using token, segment, and position embeddings. These embeddings are then passed through multiple Transformer layers, allowing self-attention mechanisms to align elements of text with corresponding image regions. The model is pre-trained with masked language modeling, predicting masked words using visual context, and sentence-image prediction, determining if a text matches an image. This joint processing enables VisualBERT to capture rich interactions between visual and textual data, making it effective for tasks like Visual Question Answering and Visual Commonsense Reasoning. Unlike CLIP, which uses separate encoders for images and text aligned through contrastive learning, VisualBERT uses a single Transformer for richer interaction and grounding of visual and textual information.


LLaVA

The LLaVA (Large Language and Vision Assistant) model combines a pre-trained language model (Vicuna) with a visual encoder (CLIP's ViT-L/14). The visual encoder processes an input image to generate visual features, which are then projected into the language embedding space using a trainable linear layer. These visual tokens are combined with language instruction tokens and fed into the language model to generate responses. Unlike CLIP, which aligns visual and textual representations using contrastive learning, LLaVA directly integrates visual features into the language model's embedding space for end-to-end vision-language tasks.
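The projection step can be sketched as a single linear layer followed by concatenation with the text tokens. All dimensions and numbers below are toy assumptions (3-dimensional visual features, a 2-dimensional language embedding space); real models use thousands of dimensions and learned weights.

```python
def project(features, weight, bias):
    """Linear layer: map each visual feature vector into the language embedding size."""
    return [
        [sum(f * w for f, w in zip(feature, column)) + b
         for column, b in zip(zip(*weight), bias)]
        for feature in features
    ]


# Hypothetical visual-encoder output: 2 patches, 3 features each.
visual_features = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
# Hypothetical trainable projection: 3 input dims -> 2 language-embedding dims.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.5, -0.5]

visual_tokens = project(visual_features, weight, bias)  # [[4.5, 4.5], [0.5, 0.5]]
text_tokens = [[0.1, 0.2], [0.3, 0.4]]                  # stand-in instruction embeddings

# The language model receives visual and text tokens as one sequence.
llm_input = visual_tokens + text_tokens
```

After projection the visual tokens have the same width as the text embeddings, which is what allows the language model to consume them as if they were ordinary tokens.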

Idefics 2

The Idefics 2 model integrates visual and language processing to generate contextually informed text responses. The architecture comprises three main components: a Vision Encoder, a Vision-Language Connector, and an LLM (Large Language Model) Decoder. The Vision Encoder processes input images to extract high-dimensional visual features. These features are then transformed by the Vision-Language Connector into language embedding tokens that align with the LLM’s word embedding space. This transformation allows the visual data to be seamlessly integrated with textual information. The LLM Decoder takes these integrated tokens, along with any language instructions, to generate coherent text responses.

This approach differs from models like CLIP, which uses separate encoders for images and text to align their representations in a shared embedding space primarily for retrieval tasks. In contrast, Idefics 2 focuses on generating language responses informed by visual data, leveraging a direct transformation layer to bridge the visual and textual modalities effectively. This enables tasks such as describing images, answering visual questions, and generating narratives based on visual inputs.

Idefics 2 is an 8 billion parameter vision-language model designed to excel in Optical Character Recognition (OCR) and document data extraction, such as reading bills and invoices. It outperforms larger models with its efficient architecture, supporting image resolutions up to 980 × 980 pixels. The model has been pre-trained on over 6TB of OCR data, enhancing its ability to accurately extract text from images. This makes it particularly effective for automating data entry and managing document workflows, thanks to its improved visual reasoning and document understanding capabilities. You can also try out the model in this link.

Benchmarking and Evaluation of Vision-Language Large Models (VLLMs)

Common Datasets

Standard benchmarks for evaluating Vision-Language Large Models (VLLMs) often utilize widely recognized datasets such as COCO (Common Objects in Context) and Visual Genome. 

The COCO dataset includes over 200,000 labeled images with annotations for object detection, segmentation, and captioning. It is extensively used for evaluating image captioning, object detection, and segmentation tasks. The dataset is detailed in "Microsoft COCO: Common Objects in Context" by Lin et al. (2014).

The Visual Genome dataset contains over 100,000 images with dense annotations of objects, attributes, and relationships, making it suitable for tasks requiring detailed scene understanding. "Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations" by Krishna et al. (2017) describes this dataset.

The main evaluation metrics are accuracy, F1 score, and exact match. Accuracy measures the proportion of correct predictions made by the model and is commonly used in classification tasks. F1 score is the harmonic mean of precision and recall, providing a balance between the two metrics, which makes it particularly useful for imbalanced datasets. Exact Match (EM) score measures the percentage of predictions that match the ground truth exactly and is often used in tasks like question answering and retrieval.
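These three metrics are easy to compute by hand. The sketch below uses made-up predictions and a simple normalization rule for exact match (lowercasing and collapsing whitespace), which is an assumption; benchmarks define their own normalization.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that equal the label."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def f1_score(predictions, labels, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def normalize(text):
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())


def exact_match(predicted_answers, reference_answers):
    """Fraction of answers that match the reference exactly after normalization."""
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predicted_answers, reference_answers))
    return hits / len(reference_answers)


predictions = [1, 0, 1, 1]
labels = [1, 0, 0, 1]

acc = accuracy(predictions, labels)  # 3 of 4 correct
f1 = f1_score(predictions, labels)   # precision 2/3, recall 1
em = exact_match(["A red bus", "two"], ["a red bus", "three"])  # 1 of 2 match
```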

Recent Benchmarks

CODIS evaluates a model's ability to disambiguate images based on contextual information. This benchmark assesses how well models can understand and interpret images in context, rather than in isolation. "CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models" by Luo et al. (2024) provides a comprehensive overview of this benchmark.

Fine-grained visual concept recognition involves recognizing detailed and specific visual concepts within images, often requiring models to differentiate between subtle differences. It tests the model's ability to understand fine-grained details and nuances in visual data. Relevant research includes "Deep Learning for Fine-Grained Image Analysis: A Survey" by Wei et al. (2019).

By leveraging these benchmarks and evaluation metrics, researchers can systematically assess the performance of VLLMs, identify areas for improvement, and ensure that models are robust and effective across a variety of tasks and datasets.

Transform Your Visual Data with Custom VLLMs

If you're looking to leverage the power of Visual Large Language Models (VLLMs) for your business or research needs, we can help you build a custom VLLM tailored to your specific requirements. Whether it's enhancing image classification, improving visual question answering, or integrating sophisticated visual and textual data analysis into your applications, our team of experts is here to assist you every step of the way. Contact us today to learn how we can transform your vision into reality with cutting-edge VLLM technology.
