Understanding and Training IP Adapters for Diffusion Models

Mayank
Stable Diffusion

Text prompts alone struggle to convey visual concepts and styles. Imagine trying to explain Michelangelo's art through words; it's simply not possible. We humans don't think in words when we imagine images, and the same goes for diffusion models.

Diffusion models can generate amazing images from text prompts alone, but sometimes that's not enough. Following a very specific style often requires an image input, because the style simply can't be conveyed in text. IP-Adapter lets diffusion models accept images as part of their prompts, something that isn't directly possible otherwise, since diffusion models are trained as text-to-image only.

What is IP-Adapter?

IP-Adapter is a small, trainable add-on that lets a text-to-image diffusion model accept pictures as well as words. Normally, you type something like “a fluffy cat on a chair” and hope the model generates the exact stripes, fur texture, and velvet glow you had in mind. But the model will not do that because text alone often isn’t precise enough. The model might give you a different fur pattern, the wrong chair, or completely miss the mood you imagined. You can try improving the prompt, but it’ll be very difficult to get the output to match your imagination, if not impossible.

That's where IP-Adapter helps. The adapter takes your reference image, extracts the key details using an image encoder, and feeds them directly into the diffusion process. It translates the visual language of images into a form the model already understands, so the generated result matches both your text and your picture.

Unlike heavy fine-tuning or ControlNet, IP-Adapter is lightweight: you can plug it into existing diffusion models without breaking their text-to-image ability.
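
To give a quick sense of how lightweight this is, here's a minimal sketch using the diffusers library and the publicly released SD 1.5 IP-Adapter weights (h94/IP-Adapter); the reference image path and prompt are placeholders.

import torch
from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image

# Load a standard Stable Diffusion v1.5 text-to-image pipeline
pipe = AutoPipelineForText2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Plug in the pretrained IP-Adapter weights; the base model stays untouched
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                     weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.6)  # how strongly the reference image steers generation

reference = load_image("reference.png")  # placeholder path to your reference image
result = pipe(
    prompt="a fluffy cat on a chair",
    ip_adapter_image=reference,
    num_inference_steps=30,
).images[0]
result.save("cat_in_reference_style.png")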

Other Ways to Use Images in Diffusion Models

There are, of course, other ways to feed image information into the pipeline. These methods specialize in style transfer, structural control, or identity preservation, which lets a model mimic a painting style, maintain a specific composition, or reproduce a particular subject across scenes.

  • ControlNet: It focuses on structural control (poses, edges, and depth) rather than style or content. It's great for maintaining specific compositions, but it doesn't capture artistic styles or textures as effectively.
  • Full Fine-tuning: It requires retraining the entire model on specific styles or concepts, which is computationally expensive and can overwrite existing knowledge.
  • LoRA: LoRA adapts models to learn specific subjects or styles, but you still need to craft detailed text descriptions, and it often stumbles when handling complex artistic styles or generalizing to new scenarios.

Primer on Diffusion

To understand how the IP Adapter works, you first need to revisit how diffusion models generate images and where text and image information get plugged in. You can think of a diffusion model as having several key components where information flows through different processing stages to create the final image.

Diffusion models create images out of random chaos: they begin with freshly sampled Gaussian noise, then iteratively remove it under the guidance of a text prompt. With each step the network predicts a slightly cleaner image, nudging pixels toward the described scene. Here's how the core components interact (a minimal code sketch of this loop follows the list):

  • CLIP Text Encoder: This converts your text prompt into mathematical representations called embeddings. Since CLIP was trained on millions of image-text pairs, these embeddings don’t just capture word meanings; they also carry an understanding of what those words should look like visually.
  • U-Net: It is the main component that generates the image. It starts with random noise in latent space (a compressed version of the image, for example, 64×64×4 instead of 512×512×3) and, at each step, predicts the noise in this latent while using the text embeddings as guidance.
  • Sampler: Takes the U-Net's noise prediction and, using a numerical rule (for example, DDIM), subtracts part of the noise, updating the latent from zₜ to zₜ₋₁ and gradually removing noise step by step.
  • VAE Decoder: Once denoising is complete, you have a clean latent 𝑧₀. The VAE decoder converts this latent back into a full-resolution pixel image. It basically acts as an image translator.
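
Here's a minimal sketch of that loop, assuming a Stable Diffusion v1.5 checkpoint loaded through diffusers with its default scheduler; classifier-free guidance is omitted to keep the sketch short.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# 1. CLIP text encoder: prompt -> text embeddings
tokens = pipe.tokenizer(
    "a fluffy cat on a chair", padding="max_length",
    max_length=pipe.tokenizer.model_max_length, truncation=True, return_tensors="pt"
).to("cuda")
text_embeddings = pipe.text_encoder(tokens.input_ids)[0]

# 2. Start from pure Gaussian noise in latent space (4 x 64 x 64 for a 512 x 512 image)
latents = torch.randn((1, 4, 64, 64), device="cuda", dtype=torch.float16)
pipe.scheduler.set_timesteps(30)
latents = latents * pipe.scheduler.init_noise_sigma

# 3. Denoising loop: the U-Net predicts the noise, the sampler removes part of it (z_t -> z_{t-1})
for t in pipe.scheduler.timesteps:
    latent_input = pipe.scheduler.scale_model_input(latents, t)
    noise_pred = pipe.unet(latent_input, t, encoder_hidden_states=text_embeddings).sample
    latents = pipe.scheduler.step(noise_pred, t, latents).prev_sample

# 4. VAE decoder: clean latent z_0 -> full-resolution pixel image (values in [-1, 1])
image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample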

How IP-Adapter Works

The key to IP-Adapter's functionality lies in understanding how cross-attention layers operate within the U-Net architecture. 

Cross-Attention in IP-Adapter

Cross-attention allows one sequence or modality to focus on and extract information from another. It connects your text embeddings and the developing image. Unlike self-attention, where the queries, keys, and values all come from the same source, in cross-attention the queries (Q) are taken from one sequence (for example, a developing image's latent features), while the keys (K) and values (V) come from another sequence, such as the text embeddings.

At each denoising step, this mechanism determines which words from your prompt should influence which regions of the image. It works perfectly for text prompts, but it is designed only for text embeddings; it has no way to take in image features.

IP-Adapter extends this by introducing dual cross-attention branches: the original text cross-attention, where K and V come from text embeddings, and a new image cross-attention, where K and V come from image encoder features instead. This allows the model to simultaneously pay attention to both your text prompt and visual references, creating a more comprehensive understanding of what you want to generate.
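
Here's a simplified, single-head PyTorch sketch of this decoupled cross-attention with toy dimensions; the real adapter lives inside the U-Net's attention processors, but the structure is the same: a shared query projection and two separate key/value branches.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledCrossAttention(nn.Module):
    def __init__(self, latent_dim=320, text_dim=768, image_dim=768, scale=1.0):
        super().__init__()
        # Original text branch (kept frozen during IP-Adapter training)
        self.to_q = nn.Linear(latent_dim, latent_dim)
        self.to_k_text = nn.Linear(text_dim, latent_dim)
        self.to_v_text = nn.Linear(text_dim, latent_dim)
        # New image branch (the only attention weights that get trained)
        self.to_k_image = nn.Linear(image_dim, latent_dim)
        self.to_v_image = nn.Linear(image_dim, latent_dim)
        self.scale = scale  # strength of the image branch at inference time

    def forward(self, latent_features, text_embeds, image_embeds):
        q = self.to_q(latent_features)  # queries always come from the noisy latent
        # Branch 1: text cross-attention (what to generate)
        text_out = F.scaled_dot_product_attention(
            q, self.to_k_text(text_embeds), self.to_v_text(text_embeds))
        # Branch 2: image cross-attention (how it should look)
        image_out = F.scaled_dot_product_attention(
            q, self.to_k_image(image_embeds), self.to_v_image(image_embeds))
        # The two outputs are simply added, so each modality contributes independently
        return text_out + self.scale * image_out

# Toy usage: 4096 latent tokens, 77 text tokens, 4 image tokens
attn = DecoupledCrossAttention()
out = attn(torch.randn(1, 4096, 320), torch.randn(1, 77, 768), torch.randn(1, 4, 768))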

IP-Adapter's Decoupled Attention

This decoupled approach creates two distinct but complementary pathways that work together to produce more precise and controllable image generation. Decoupling is so effective because text and image features carry very different types of information: if you push image features through the same text attention, you force the model to cram both into one shared representation space. By keeping them separate, IP-Adapter lets each modality contribute its strengths without interference.

Pathway 1: Semantic Control (Original Text Cross-Attention)

The first cross-attention mechanism maintains Stable Diffusion's original strength in semantic control. This pathway tells the U-Net what to generate, handling object identity and recognition, scene composition and layout, and high-level conceptual understanding.

Pathway 2: Visual Style Control (Image Cross-Attention)

The second cross-attention mechanism introduces an entirely new visual control path that tells the U-Net how the output should look. This pathway focuses on visual style and aesthetic qualities, texture details and surface properties, and color schemes and artistic techniques. It also captures compositional elements like visual hierarchy and more complex artistic concepts such as the metamorphosis and transformation of familiar objects.

Training Process

Training an IP-Adapter is straightforward. Rather than updating the entire model architecture, the training process employs a selective parameter strategy that maximizes efficiency while preserving the foundational capabilities of the pre-trained diffusion model.

Selective Parameter Training Architecture

The training process strategically freezes the vast majority of model parameters, specifically the pre-trained diffusion U-Net, the CLIP text encoder, and the CLIP image encoder, while exclusively training the newly introduced components. These amount to roughly 22 million parameters: the image cross-attention (key/value) layers, the linear projection modules that map image embeddings into the appropriate dimensional space, and the associated LayerNorm layers for training stability. As a result, only about 3-5% of the total model parameters are updated, which dramatically reduces computational requirements.
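
A minimal sketch of this freezing strategy with diffusers and transformers is shown below; the image projection and the extra K/V layers are illustrative toy stand-ins, not the exact modules from the official repository.

import itertools
import torch
import torch.nn as nn
from diffusers import UNet2DConditionModel
from transformers import CLIPTextModel, CLIPVisionModelWithProjection

base = "runwayml/stable-diffusion-v1-5"
unet = UNet2DConditionModel.from_pretrained(base, subfolder="unet")
text_encoder = CLIPTextModel.from_pretrained(base, subfolder="text_encoder")
image_encoder = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14")

# Freeze every pre-trained component
for module in (unet, text_encoder, image_encoder):
    module.requires_grad_(False)

# Trainable pieces: a projection from CLIP image embeddings into the U-Net's
# cross-attention dimension, plus new K/V layers (one to_k_ip / to_v_ip pair per
# cross-attention block in the real adapter; two toy layers here)
image_proj = nn.Linear(image_encoder.config.projection_dim, unet.config.cross_attention_dim)
ip_kv_layers = nn.ModuleList(
    nn.Linear(unet.config.cross_attention_dim, unet.config.cross_attention_dim)
    for _ in range(2)
)

optimizer = torch.optim.AdamW(
    itertools.chain(image_proj.parameters(), ip_kv_layers.parameters()),
    lr=1e-4, weight_decay=1e-2,
)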

Training Pipeline

The training pipeline begins with data preparation, where reference images are processed through the frozen CLIP image encoder to generate image embeddings, while corresponding captions are encoded via the CLIP text encoder. During each training iteration, Gaussian noise is added to target images at randomly sampled timesteps, creating the noisy latents that serve as input to the U-Net. 

The forward pass incorporates both conditioning modalities through parallel cross-attention mechanisms: text cross-attention operates as Attention(Q=latent_features, K=text_embeddings, V=text_embeddings), while the new image cross-attention functions as Attention(Q=latent_features, K=image_embeddings, V=image_embeddings). 

The optimization process minimizes L2 loss between the model's noise predictions and ground truth noise, with gradients updating only the trainable adapter parameters while leaving the frozen base model untouched.
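
Continuing the sketch above, one training iteration could look like the following, with toy tensors standing in for a real dataloader; in the actual adapter the image tokens are routed through the new K/V layers by a custom attention processor, which is omitted here.

import torch.nn.functional as F
from diffusers import DDPMScheduler

noise_scheduler = DDPMScheduler.from_pretrained(base, subfolder="scheduler")

# Stand-ins for one batch: VAE-encoded target images, frozen CLIP text and image embeddings
latents = torch.randn(2, 4, 64, 64)
text_embeds = torch.randn(2, 77, 768)
clip_image_embeds = torch.randn(2, image_proj.in_features)

# 1. Add Gaussian noise at randomly sampled timesteps
noise = torch.randn_like(latents)
timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],))
noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

# 2. Forward pass with both conditioning modalities: the projected image tokens are
#    concatenated to the text tokens (the decoupled attention would split them again)
image_tokens = image_proj(clip_image_embeds).unsqueeze(1)
encoder_hidden_states = torch.cat([text_embeds, image_tokens], dim=1)
noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=encoder_hidden_states).sample

# 3. L2 loss between predicted and ground-truth noise; gradients reach only the adapter params
loss = F.mse_loss(noise_pred, noise)
loss.backward()
optimizer.step()
optimizer.zero_grad()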

Training Your Own IP-Adapter

First, prepare a dataset of images (and captions) that represents the kinds of visual content and styles you want the adapter to handle. Then fine-tune using the standard diffusion noise-prediction objective (L2 loss on the predicted noise).

This lightweight approach lets you specialize in domains like architectural photography or specific artistic styles without heavy computational overhead or the risk of degrading the base model's capabilities.

You can get started with the steps below for training an IP-Adapter. First, clone the official repository:

git clone https://github.com/tencent-ailab/IP-Adapter

Then install accelerate:

pip install accelerate

You also need to describe your dataset in a JSON file.

Here’s a helper script to produce a data.json with default captions as filenames.

import os
import json

# paths
image_folder = "dataset/images"
output_json = "dataset/data.json"

# create a list for entries
data = []

# loop through images
for img_name in os.listdir(image_folder):
    if img_name.lower().endswith((".jpg", ".png", ".jpeg")):
        # default caption = filename (you can edit later)
        caption = os.path.splitext(img_name)[0]
        
        data.append({
            "file_name": img_name,
            "text": caption
        })

# save as json
with open(output_json, "w") as f:
    json.dump(data, f, indent=2)

print(f"Saved {len(data)} entries to {output_json}")

Then set image_encoder_path to a pretrained image encoder (for example, a CLIP ViT model) and image_path to the folder where your dataset images are stored.

Then run the command below. It trains an IP-Adapter that connects an image encoder (like CLIP) with Stable Diffusion v1.5 so the diffusion model can be conditioned on both text prompts and reference images.

accelerate launch --num_processes 8 --multi_gpu --mixed_precision "fp16" \
  tutorial_train.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5/" \
  --image_encoder_path="{image_encoder_path}" \
  --data_json_file="{data.json}" \
  --data_root_path="{image_path}" \
  --mixed_precision="fp16" \
  --resolution=512 \
  --train_batch_size=8 \
  --dataloader_num_workers=4 \
  --learning_rate=1e-04 \
  --weight_decay=0.01 \
  --output_dir="{output_dir}" \
  --save_steps=10000

You can also try the code snippets below to test a pretrained IP-Adapter in Google Colab. First, load the pipeline, attach the adapter, and precompute the image embeddings:

import torch
from diffusers import AutoPipelineForImage2Image
from PIL import Image

# Load the SDXL image-to-image pipeline
pipeline = AutoPipelineForImage2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Attach the pretrained SDXL IP-Adapter weights
pipeline.load_ip_adapter(
    "h94/IP-Adapter",
    subfolder="sdxl_models",
    weight_name="ip-adapter_sdxl.bin",
)

# Reference image that serves as the visual part of the prompt
image = Image.open("image.png").convert("RGB")

# Precompute the IP-Adapter image embeddings so they can be reused across runs
image_embeds = pipeline.prepare_ip_adapter_image_embeds(
    ip_adapter_image=image,
    ip_adapter_image_embeds=None,
    device="cuda",
    num_images_per_prompt=1,
    do_classifier_free_guidance=True,
)

torch.save(image_embeds, "image_embeds.ipadpt")

Inference code block:

# Control how strongly the reference image influences the result (0 = ignore, 1 = full strength)
pipeline.set_ip_adapter_scale(0.8)
image_embeds = torch.load("image_embeds.ipadpt")

result = pipeline(
    prompt="a girl drinking milkshake near volcano attempting to fly",
    image=image,
    ip_adapter_image_embeds=image_embeds,
    negative_prompt="deformed, ugly, low res, bad anatomy, worst quality",
    num_inference_steps=100,
    generator=torch.Generator(device="cuda").manual_seed(42),
).images[0]
result.save("output.png")

IP-Adapter vs. Other Methods

IP-Adapter is usually orders of magnitude cheaper and faster to get working than full fine-tuning, and somewhat cheaper than training a ControlNet, while giving better style transfer than naive adapters and far better composability than full fine-tuning.

Full Fine-Tuning: This means retraining the entire diffusion model on a specific dataset. It needs cloud GPUs and tons of data, and the model can lose its text-prompt ability through catastrophic forgetting, where generalization erodes as the model forgets prior knowledge. After fine-tuning, you also can't combine the model with other tricks (like ControlNet) without retraining. Some research has also shown that fine-tuned models often “forget” general, broad features across the reverse process when adapted to a specific domain, especially in earlier steps (closer to raw noise); this “chain of forgetting” harms generalization.

ControlNet: It was originally designed for low-level structural control (edges, depth, pose, etc.). Its limitation is that it doesn't inherently convey style or content from a reference image. Strong ControlNet weights can also completely dominate the model, producing stiff or repetitive outputs that lack natural variation. There has been some research into combining IP-Adapter with a depth ControlNet to preserve structure while restyling an image.

Challenges of Using IP-Adapter

Feature Mixing: Sometimes IP-Adapter picks up unintended elements from your reference image. If you want just the lighting style from a portrait, you might accidentally get the person's clothing or background elements too.

Complex Compositions: When your reference image has multiple distinct elements, IP-Adapter might struggle to understand which aspects you actually want to transfer to the new generation.

Style-Content Entanglement: IP-Adapter sometimes struggles to separate artistic style from subject matter, making it challenging to extract pure aesthetic qualities without copying content elements.

Limited Semantic Control: Unlike text prompts, you can't precisely specify which aspects of an image to emphasize or ignore.

Alternative and Specialized Methods

DreamBooth: When you need perfect identity preservation across multiple images, DreamBooth fine-tunes the entire model to learn specific subjects through custom tokens. More computationally expensive but offers precise control over character consistency.

Textual Inversion: Creates custom embedding tokens for specific concepts or styles, allowing precise text-based control over learned visual elements. Lightweight but limited to concepts that can be tokenized.

ControlNet + IP-Adapter Combination: Many practitioners combine both: ControlNet handles structural elements (pose, composition), while IP-Adapter manages aesthetic qualities (style, lighting, mood). A minimal sketch of this combination is shown below.
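
Here's a rough diffusers sketch of the combination, assuming a depth ControlNet and the SD 1.5 IP-Adapter weights; the depth map and style reference paths are placeholders.

import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

# Depth ControlNet supplies structure; IP-Adapter supplies style
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11f1p_sd15_depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.7)

depth_map = load_image("depth.png")      # placeholder: structural condition
style_ref = load_image("style_ref.png")  # placeholder: aesthetic reference

image = pipe(
    prompt="a cozy reading nook, warm afternoon light",
    image=depth_map,              # ControlNet conditioning
    ip_adapter_image=style_ref,   # IP-Adapter conditioning
    num_inference_steps=30,
).images[0]
image.save("structured_and_styled.png")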

We talk about these methods in our Product Image Generation Blog.

Real-World Applications

IP-Adapter is used widely in real-world scenarios where preserving fine visual detail while staying compatible with large pre-trained text-to-image models matters.

1. Identity Preservation: IP-Adapter variants like IP-Adapter-FaceID focus on creating highly realistic and consistent depictions of individuals across a wide array of contexts and styles.

2. Multi-Subject Style Transfer: The ICAS (IP-Adapter and ControlNet-based Attention Structure) framework is recent research towards creating more complex and coherent stylized scenes.

3. Identity-Preserving Video: Researchers have adapted image-conditioning ideas (IP-Adapter style) to video diffusion so that a reference image is preserved across frames while motion is generated.

4. Spatial/3D cues via depth & maps: Researchers also combine IP-Adapter conditioning with depth maps/multi-view cues so the adapter encodes explicit 3D layout, which improves 3D placement, occlusion handling, and realism.

Looking to train Stable Diffusion models on your own images? Reach out to Mercity to build Stable Diffusion based products. We have done extensive research in diffusion-based image generation and in optimizing diffusion models.

Subscribe to stay informed

Subscribe to our newsletter to stay updated on all things AI!