Building Medical AI assistants with Visual LLMs

AI in Medical Industry

Medical image analysis has become an integral part of modern healthcare, enabling clinicians to make informed decisions and improve patient outcomes. However, accurately segmenting medical images presents several challenges, including the costly and time-consuming task of manually annotating datasets for training deep learning models. Additionally, while large language models (LLMs) have shown promise in various domains, they often lack the specialisation required for precise medical image analysis, leading to suboptimal performance in critical tasks like tumour detection.

Existing medical LLMs, such as Med-PaLM and MediTron, have demonstrated promising results in various medical domains. However, these models often lack the specialised architecture required for precise medical image segmentation. The solution proposed here combines the strengths of U-Net and visual foundation models, and it has the potential to outperform existing medical LLMs in tasks like tumour detection, where precision is paramount.

In this article, we'll show you how to build a personalised Med VLLM for your particular use case and dataset.

What are Visual Large Language Models (VLLMs)?

Visual Large Language Models (VLLMs) are a class of deep learning models that excel at processing and understanding visual data, such as images and videos. These models are trained on vast amounts of visual data and can extract high-level features and semantic information from images. VLLMs have shown promising results in various computer vision tasks, including image classification, object detection, and image segmentation. The approach behind Visual LLMs is inspired by CLIP (Contrastive Language-Image Pre-training), a model released by OpenAI in 2021.

In medicine, Visual LLMs show promise in tasks such as diagnostic imaging and surgical assistance. However, their effectiveness can be hindered by the difficulty of accurately interpreting medical images, especially without proper image segmentation. This limitation often reduces accuracy in critical tasks such as disease identification and surgical planning. This led us to a solution based on the U-Net model. The following sections explain what U-Net is and how it addresses the difficulties faced by the Visual LLM approach.

What is U-Net?

The U-Net architecture is known for its effectiveness in semantic segmentation tasks in medical imaging analysis. In the context of tumour detection, the encoder component of U-Net acts as a feature extractor, capturing intricate details within the medical images that may indicate the presence of tumours. The decoder reconstructs the segmented regions based on these features, enabling precise delineation of tumour boundaries and shapes, thus aiding in accurate tumour detection and analysis.

The incorporation of skip connections in the U-Net architecture enhances image processing by facilitating the flow of high-resolution information between encoder and decoder layers, preserving fine-grained details and subtle features of tumours throughout the segmentation process.

Why use U-Net Model in Medical Imaging?

Recent studies have shown that integrating VLLMs and U-Net can lead to significant improvements in medical image segmentation accuracy. For example, the Mamba-UNet architecture, which combines the strengths of U-Net with the long-range dependency modelling capabilities of the Mamba architecture, has achieved state-of-the-art results on the ACDC MRI Cardiac segmentation dataset and the Synapse CT Abdomen segmentation dataset. This architecture has been recognized for its ability to capture intricate details and broader semantic contexts within medical images, making it highly suitable for our proposed project.

The table below, taken from a research paper, provides the exact numerical values. It shows that the U-Net architecture achieves higher accuracy than other models such as DeepLabV3 across metrics including the Dice Similarity Coefficient (DSC), F1-score, and Intersection over Union (IoU):
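For readers unfamiliar with these overlap metrics, here is a minimal sketch of how DSC and IoU can be computed for a pair of binary masks. The function names and the toy 4x4 masks are our own illustration and are not taken from the cited paper. For binary masks, the pixel-level F1-score is the same quantity as the Dice coefficient.

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Dice Similarity Coefficient: 2 * |A and B| / (|A| + |B|)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

def iou_score(pred, target, eps=1e-7):
    """Intersection over Union: |A and B| / |A or B|."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return (intersection + eps) / (union + eps)

# Toy masks (illustrative only): 4 predicted pixels, 4 true pixels, 3 shared.
pred = np.array([[0, 1, 1, 0],
                 [0, 1, 1, 0],
                 [0, 0, 0, 0],
                 [0, 0, 0, 0]])
target = np.array([[0, 1, 1, 0],
                   [0, 1, 0, 0],
                   [0, 1, 0, 0],
                   [0, 0, 0, 0]])

dice = dice_coefficient(pred, target)  # 2*3 / (4+4) = 0.75
iou = iou_score(pred, target)          # 3 / 5 = 0.60
print(f"Dice: {dice:.2f}, IoU: {iou:.2f}")
```

The small `eps` term only guards against division by zero when both masks are empty. The two scores are related by Dice = 2 * IoU / (1 + IoU), so they always rank a single prediction the same way.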

By leveraging the power of VLLMs and U-Net, our project aims to develop an efficient and accurate medical image segmentation framework that addresses the challenges and limitations of existing approaches. The integration of these cutting-edge technologies, combined with the innovative approach, has the potential to revolutionise the field of medical image analysis and significantly improve patient outcomes.

Using Visual Large Language Models (VLLMs) in Medical Imaging and Healthcare

Visual Large Language Models (VLLMs) have the potential to revolutionise medical imaging and healthcare by seamlessly integrating advanced language processing capabilities with visual data analysis. Cutting-edge AI models such as BiomedCLIP, ChatGPT-4, Med-PaLM 2, and Med-Flamingo have demonstrated remarkable performance in various medical imaging tasks, including diagnostic analysis, image segmentation, and report generation. Let’s see how VLLMs can help doctors and patients.

Generating Comprehensive Personalised Medical Reports with Radiology Data

One of the key advantages of VLLMs in medical imaging is their ability to process and understand multimodal data efficiently. These models can analyse a wide range of medical images, such as X-rays, CT scans, MRIs, and histopathological slides, and extract relevant information to aid in diagnosis and treatment planning, as demonstrated in the research paper. By leveraging the power of language models, VLLMs can also incorporate patient history, medication records, and other relevant textual data to provide a more comprehensive and personalised assessment. The graph below, taken from the research paper, compares human and LLM performance. The paper used the Flamingo-80B model, whose accuracy falls well short of human accuracy; integrating it with a U-Net model could improve on these metrics.

More Efficient and Faster Diagnostic Processes

VLLMs can significantly streamline diagnostic processes by enabling doctors to quickly analyse medical images and generate detailed reports. Using specific prompts, doctors can direct the model to focus on specific anatomical regions, structures, or tissues, allowing for targeted and efficient analysis. This interactive approach enhances the accuracy of image interpretation and reduces the time required for diagnosis, ultimately leading to faster treatment initiation and improved patient outcomes.

VLLMs can also assist doctors with differential diagnosis and with creating a clinical plan for the patient. Because VLLMs can process a large amount of a patient's medical history, they can brief the doctor on the patient's current status and provide a bullet-point summary of previous medications and health problems. With this longer-context understanding, the doctor can proceed to the next steps and choose a suitable treatment.

Automating Medical Reporting and Integrating Patient History

One of the most promising applications of VLLMs in medical imaging is the generation of comprehensive medical reports. These models can analyse a set of medical images, such as X-rays, CT scans, or MRIs, and generate detailed reports summarising the observed abnormalities, their locations, and potential implications for diagnosis or treatment. By automating the report generation process, VLLMs can significantly reduce the workload of radiologists and other medical professionals, allowing them to focus on more critical tasks. To read more about this work, please see the research paper.

VLLMs can also incorporate patient history and medication records to provide a more holistic assessment of the patient's health. By combining visual data from medical images with textual information from patient records, VLLMs can identify potential correlations, flag potential drug interactions, and provide personalised treatment recommendations. This integration of multimodal data can lead to more accurate diagnoses and more effective treatment plans.

How to build a Custom Medical Visual LLM


The approach integrates a chat interface that allows medical practitioners to interact easily with the system using both text and images. By leveraging the multimodal capabilities of large language models (LLMs), the approach can effectively process and analyse both textual queries and medical images, such as MRI scans.

When a medical practitioner obtains an MRI image, they can simply upload it to the chat interface. The approach then proceeds with the necessary preprocessing steps, including normalisation, augmentation, and noise reduction, to ensure optimal performance of the U-Net architecture. The chat interface provides a user-friendly and accessible way for doctors to interact with the system, enabling them to input specific prompts or questions related to the uploaded MRI image. 
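The preprocessing steps named above can be sketched with NumPy alone. This is a minimal illustration under our own assumptions: the function names, the min-max scaling, the 3x3 mean filter, and the random stand-in scan are all hypothetical choices, not the pipeline's actual implementation. Note also that augmentation (shown here as a horizontal flip) is normally applied while training the U-Net rather than to a scan uploaded for analysis.

```python
import numpy as np

def normalise(image):
    """Min-max scale pixel intensities to the range [0, 1]."""
    image = image.astype(np.float32)
    lo, hi = image.min(), image.max()
    if hi == lo:  # constant image: avoid division by zero
        return np.zeros_like(image)
    return (image - lo) / (hi - lo)

def denoise_mean(image, k=3):
    """Simple k x k mean filter with edge padding (a stand-in for noise reduction)."""
    pad = k // 2
    padded = np.pad(image, pad, mode="edge")
    out = np.zeros_like(image, dtype=np.float32)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + image.shape[0], dx:dx + image.shape[1]]
    return out / (k * k)

def augment_flip(image, mask):
    """Training-time augmentation: flip image and mask together so they stay aligned."""
    return image[:, ::-1].copy(), mask[:, ::-1].copy()

def preprocess(image):
    """Inference-time preprocessing: normalise, then denoise."""
    return denoise_mean(normalise(image))

# Random array standing in for a 64x64 MRI slice (illustrative only).
rng = np.random.default_rng(0)
scan = rng.integers(0, 4096, size=(64, 64)).astype(np.int16)
clean = preprocess(scan)
print(clean.shape, clean.dtype)
```

In practice a medical imaging library would usually replace these helpers, and the choice of filter matters: a mean filter blurs edges, which may be undesirable near tumour boundaries.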

This interactive approach allows medical professionals to direct the model's focus to particular anatomical regions, structures, or tissues of interest, facilitating targeted and efficient analysis.

By combining the chat interface with the powerful capabilities of VLLMs and U-Net, the approach generates detailed medical reports that summarise the observed abnormalities, their locations, and potential implications for diagnosis or treatment. These reports are easily accessible through the chat interface, allowing doctors to quickly review the findings and make informed decisions about patient care.
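One plausible way to connect the U-Net output to the report is to reduce the segmentation mask to a few measurements that the language model can cite. The sketch below is our own assumption about how such a bridge might look: `summarise_mask`, `findings_to_text`, and the `pixel_spacing_mm` parameter are hypothetical names, and the mask is a toy example.

```python
import numpy as np

def summarise_mask(mask, pixel_spacing_mm=(1.0, 1.0)):
    """Reduce a binary segmentation mask to a few report-ready measurements."""
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        return {"found": False}
    rows, cols = np.nonzero(mask)
    pixel_area = pixel_spacing_mm[0] * pixel_spacing_mm[1]
    return {
        "found": True,
        "area_mm2": float(mask.sum() * pixel_area),
        # Bounding box as (top, left, bottom, right), inclusive pixel indices.
        "bbox": (int(rows.min()), int(cols.min()), int(rows.max()), int(cols.max())),
        "centroid": (float(rows.mean()), float(cols.mean())),
    }

def findings_to_text(findings):
    """Turn the measurements into a sentence that can be added to the LLM prompt."""
    if not findings["found"]:
        return "No segmented region was found in this slice."
    top, left, bottom, right = findings["bbox"]
    return (
        f"Segmented region of {findings['area_mm2']:.1f} mm^2, "
        f"rows {top}-{bottom}, columns {left}-{right}."
    )

# Toy 8x8 mask with a 2x3 segmented region (illustrative only).
mask = np.zeros((8, 8), dtype=np.uint8)
mask[2:4, 3:6] = 1
findings = summarise_mask(mask, pixel_spacing_mm=(0.5, 0.5))
print(findings_to_text(findings))
```

Passing measured values to the language model, rather than asking it to estimate sizes and positions from pixels, keeps the numeric parts of the report traceable to the segmentation.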

U-Net Model

The U-Net architecture is particularly well-suited for the task of tumour segmentation in medical images. Its unique encoder-decoder structure and skip connections enable precise localization and accurate delineation of tumour regions.


Encoder: Capturing Tumour Context

The encoder component of U-Net acts as a feature extractor, processing the input image and capturing relevant contextual information about the tumour. It utilises a series of convolutional and pooling layers to extract hierarchical features at different scales. These features include tumour texture, shape, location, and surrounding anatomical structures.

As the encoder progresses through the layers, it gradually reduces the spatial dimensions of the feature maps while increasing the number of feature channels. This allows the model to capture broader contextual information about the tumour and its relationship with the surrounding tissues.
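The trade-off between spatial size and channel count is easy to see by tracking feature-map shapes through the encoder. The sketch below assumes a common configuration that the article does not specify: four stages, 64 base channels that double at each stage, convolutions that preserve spatial size, and 2x2 pooling. The function name is our own.

```python
def encoder_shapes(height, width, in_channels=1, base_channels=64, depth=4):
    """Return (channels, height, width) for the input, each encoder stage, and the bottleneck.

    Assumes size-preserving convolutions and 2x2 pooling between stages.
    """
    shapes = [(in_channels, height, width)]
    channels, h, w = base_channels, height, width
    for _ in range(depth):
        # The convolution block sets the channel count and keeps the spatial size.
        shapes.append((channels, h, w))
        # Pooling halves each spatial dimension; the next stage doubles the channels.
        h, w = h // 2, w // 2
        channels *= 2
    shapes.append((channels, h, w))  # bottleneck between encoder and decoder
    return shapes

shapes = encoder_shapes(256, 256)
for channels, h, w in shapes:
    print(f"{channels:>5} channels at {h}x{w}")
```

For a 256x256 single-channel slice under these assumptions, the bottleneck ends up with 1024 channels at 16x16: each value there summarises a large region of the original image, which is what gives the model its broader context.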

Decoder: Precise Tumour Localization

The decoder part of U-Net is responsible for precisely localising the tumour region within the image. It takes the encoded features from the encoder and progressively upsamples them to restore the original spatial dimensions. At each decoding stage, the decoder combines the upsampled features with the corresponding high-resolution features from the encoder via skip connections.

This combination of upsampled features and skip-connected features enables the decoder to precisely localise the tumour boundaries and reconstruct the segmented tumour region. The decoder's ability to precisely localise the tumour is crucial for accurate tumour volume measurement, delineation for targeted therapy, and classification based on tumour appearance and location.
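A single decoding stage can be sketched in NumPy to show how the upsampled features and the skip-connected features are combined. This is a simplified illustration, not a full implementation: nearest-neighbour upsampling stands in for the learned up-convolution a real U-Net would use, the convolutions that follow the merge are omitted, and the function names and array sizes are our own.

```python
import numpy as np

def upsample2x(features):
    """Nearest-neighbour 2x upsampling of a (channels, height, width) array."""
    return features.repeat(2, axis=1).repeat(2, axis=2)

def decoder_step(deep, skip):
    """Upsample the deeper features and concatenate them with the encoder's skip features."""
    up = upsample2x(deep)
    if up.shape[1:] != skip.shape[1:]:
        raise ValueError(f"spatial mismatch: {up.shape[1:]} vs {skip.shape[1:]}")
    # Concatenate along the channel axis: skip features first, then upsampled ones.
    return np.concatenate([skip, up], axis=0)

# Illustrative feature maps: coarse decoder input and finer encoder output.
deep = np.zeros((128, 16, 16), dtype=np.float32)  # low resolution, many channels
skip = np.ones((64, 32, 32), dtype=np.float32)    # high resolution, from the encoder
merged = decoder_step(deep, skip)
print(merged.shape)
```

After the merge, every spatial position carries both the encoder's fine detail and the decoder's broader context, and the subsequent convolutions learn how to weigh the two.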

Skip Connections: Preserving Tumour Details

Skip connections play a vital role in preserving fine-grained tumour details throughout the U-Net architecture. These connections directly link the encoder and decoder layers, allowing the decoder to access high-resolution spatial information from the corresponding encoder layers.

By retaining these details, skip connections ensure that critical tumour characteristics, such as irregular shapes, heterogeneous textures, and subtle boundaries, are not lost during the encoding and decoding process. This preservation of tumour details is essential for accurate tumour segmentation and subsequent analysis.

The combination of the encoder's contextual feature extraction, the decoder's precise localization capabilities, and the skip connections' preservation of tumour details makes the U-Net architecture a powerful tool for tumour segmentation in medical imaging. By leveraging these strengths, researchers have developed highly accurate and robust tumour segmentation models that significantly aid in diagnosis, treatment planning, and monitoring.

Visual Language Model (VLM)

In this multimodal model architecture, we are focusing on utilising the system for medical purposes, specifically for detecting tumours in MRI images.

The architecture begins with the vision encoder, which in this case is the CLIP ViT-L/336px model. This vision encoder processes the input MRI image, extracting high-level visual features. These features are then passed through the vision-language connector, a Multi-Layer Perceptron (MLP), which bridges the visual features with the language model.

The language model, represented by Vicuna v1.5 with 13 billion parameters, interprets these visual features in the context of medical knowledge. The tokenizer and embedding components convert the user's text query into token embeddings, which the language model processes together with the projected visual features.

When an MRI image is input into the system, the vision encoder captures detailed information about the image, identifying potential areas of concern. The vision-language connector then translates these visual details into a form that the language model can understand and analyse.
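The data flow through the vision-language connector can be sketched as follows. Several details here are assumptions for illustration: the patch size of 14 (the article names only the 336px input), the two-layer MLP with a GELU activation, the toy hidden sizes, and the random arrays standing in for real encoder outputs and embedded query tokens.

```python
import numpy as np

def num_patches(image_size=336, patch_size=14):
    """Number of image patches, and so visual tokens, for a square input."""
    return (image_size // patch_size) ** 2

def gelu(x):
    """GELU activation (tanh approximation)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def mlp_connector(visual_features, w1, b1, w2, b2):
    """Project visual features into the language model's embedding space."""
    hidden = gelu(visual_features @ w1 + b1)
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
vision_dim, llm_dim = 32, 48  # toy sizes; real models use much larger dimensions
n_image_tokens = num_patches()

# Random stand-ins for the vision encoder output and the connector weights.
visual_features = rng.normal(size=(n_image_tokens, vision_dim))
w1 = rng.normal(scale=0.02, size=(vision_dim, llm_dim))
b1 = np.zeros(llm_dim)
w2 = rng.normal(scale=0.02, size=(llm_dim, llm_dim))
b2 = np.zeros(llm_dim)

image_tokens = mlp_connector(visual_features, w1, b1, w2, b2)
text_tokens = rng.normal(size=(12, llm_dim))  # stand-in for the embedded text query

# The language model receives image tokens and text tokens as one sequence.
llm_input = np.concatenate([image_tokens, text_tokens], axis=0)
print(n_image_tokens, llm_input.shape)
```

The key point is that the connector changes only the feature dimension: each image patch becomes one vector of the same width as a text token embedding, so the language model can attend over both in a single sequence.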

Upon receiving a user query such as "What is unusual about this MRI image?", the system utilises the combined capabilities of the vision encoder and language model to detect abnormalities, such as tumours. The language model analyses the encoded features and generates a detailed response describing the location and nature of the tumour within the MRI image. This integrated approach leverages the strengths of both visual and textual data processing, resulting in precise and efficient tumour detection in medical imaging.

U-Net compared to Traditional Methods in terms of Speed and Accuracy

Traditional medical approaches primarily focus on a person's disability, condition, and limitations, often overlooking their personal and psycho-social needs. These methods rely on symptom analysis and diagnostic tests to establish a diagnosis, aiming to address and treat underlying conditions. Traditional computer vision methods, such as those employing algorithms like SIFT, SURF, and BRIEF, rely instead on human-engineered feature extraction and are limited by the features predefined for each image. Designing these features is time-consuming, and that delay is a barrier to responding quickly to a medical situation.

The medical vision approach, particularly with the integration of U-Net, presents a prominent solution for medical image analysis. U-Net, with its encoder-decoder architecture and skip connections, excels in semantic segmentation tasks, allowing the model to capture intricate details within medical images. This U-Net-based methodology has demonstrated groundbreaking advancements in medical image analysis, offering improved performance indicators and structural characteristics.

Advancements and Potential of VLLMs in Medical Imaging

Research teams are actively exploring the use of LLMs in medical imaging. For instance, a study by Huang et al. found that LLMs can outperform specialised, task-specific models in certain contexts. Another study by Li et al. introduced a cross-modal clinical graph transformer for ophthalmic report generation, showcasing the potential of VLLMs in specialised medical domains.

As research progresses, more sophisticated and specialised VLLMs tailored to specific medical imaging tasks will likely emerge. Collaborating with medical professionals ensures these models meet the unique needs and challenges of healthcare.

In summary, Visual Large Language Models can significantly enhance medical imaging by streamlining diagnostic processes, generating comprehensive reports, and integrating patient history and medication records. Despite existing challenges, the improved accuracy, efficiency, and personalised care offered by VLLMs make them a promising tool for the future of healthcare.

Ready to Revolutionise Medical Imaging?

Experience the future of medical image analysis with our advanced vision solutions. Let us guide you through the process of building a personalised medical visual LLM tailored to your specific use case and dataset. Contact us today to embark on your journey towards precise and efficient medical image analysis. We can help you transform the landscape of healthcare and improve patient outcomes.
