How to use Stable Diffusion to generate product images

Diffusion models have revolutionized generative AI, enabling diverse content creation from textual inputs. Among them, Stable Diffusion emerged as a breakthrough that improved both the quality and the efficiency of image generation. It is used in many domains, from art and design to marketing. In this article, we will explore Stable Diffusion and how you can use it to create product images for your business.

What is Stable Diffusion?

Stable Diffusion is a text-to-image deep-learning model released by Stability AI together with the CompVis group and Runway. Its code and weights are publicly available; the original weights were released under the CreativeML OpenRAIL-M license, which permits broad use but adds use-based restrictions. At the time of writing, its 3.0 models are still under development. Stable Diffusion was trained on subsets of the LAION-5B dataset, which is derived from Common Crawl. It is based on a latent diffusion process: it starts from random noise and gradually removes that noise until a realistic image emerges. Unlike cloud-only models such as DALL-E, it can run locally on hardware with a GPU that has around 8 GB of VRAM.

Recently, Stability AI released Stable Diffusion XL (SDXL), one of the strongest openly available diffusion models so far. SDXL generates images at a resolution of 1024 x 1024 and can also use a refiner model to increase the quality of the generated images.

How does Stable Diffusion Work?

Stable Diffusion has two core parts: the diffusion process and the reverse diffusion process. The diffusion process adds noise to an image, producing a slightly noisier image with every step. The reverse diffusion process undoes this by predicting the noise that must be subtracted from the noisy image to recover the original. This prediction and subtraction is repeated several times until a coherent image is obtained.

Let’s dive deeper into both of the concepts:

Forward Diffusion

As mentioned above, forward diffusion is the process of adding noise to an image. This happens in steps: the number of steps is fixed in advance, and a noise schedule determines how much noise is added at each of them. The higher the step, the noisier the image. The process is iterative by definition, but the LDM paper relies on the fact that it can be done in one go: given the original image and a step number, you can compute the noisy image for that step directly and skip all the previous iterations.
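The "one go" shortcut can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used to train Stable Diffusion: the linear schedule, its start and end values, the 1000 steps, and the four-value toy "image" are all assumptions chosen for readability. The key relation is x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * noise, where a_bar_t is the running product of (1 - beta) up to step t.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Noise strength (beta) for each step, increasing linearly (assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar[t] = product of (1 - beta) up to and including step t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(pixels, t, alpha_bars, rng):
    """Jump straight to step t without simulating the steps before it."""
    signal = math.sqrt(alpha_bars[t])        # how much of the image survives
    sigma = math.sqrt(1.0 - alpha_bars[t])   # how much noise is mixed in
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [signal * x + sigma * e for x, e in zip(pixels, noise)]
    return noisy, noise

betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)
image = [0.5, -0.25, 1.0, 0.0]               # toy "image" of four pixel values
noisy_early, _ = add_noise(image, 10, alpha_bars, rng)
noisy_late, noise_late = add_noise(image, 999, alpha_bars, rng)
```

At an early step almost all of the image survives, while at the last step the result is essentially pure noise. The returned noise is what a denoising network would be trained to predict.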

The authors also note that instead of adding noise directly to the image, we can encode the image, perform the same operation in a latent space, and decode the result back into an image later on. This greatly reduces the computational power required to train and use diffusion models.
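A quick calculation shows why working in latent space is cheaper. The numbers below are assumptions based on the commonly cited Stable Diffusion v1 setup (a 512 x 512 RGB image, an autoencoder that downsamples each side by 8, and 4 latent channels); other versions use different sizes.

```python
def num_values(height, width, channels):
    """Number of values the diffusion model has to process per image."""
    return height * width * channels

# Assumed Stable Diffusion v1 shapes: 512x512 RGB image, 8x downsampling, 4 latent channels.
pixel_values = num_values(512, 512, 3)
latent_values = num_values(512 // 8, 512 // 8, 4)
reduction = pixel_values / latent_values

print(pixel_values, latent_values, reduction)  # 786432 16384 48.0
```

Under these assumptions, each denoising step handles 48 times fewer values than it would in pixel space.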

This process is used to generate training samples for the reverse diffusion process.

Reverse Diffusion

Once we have the training samples from the forward diffusion process, we can start training the generative model. The model generates images by removing noise from a completely noisy image, in an iterative fashion. At each step, it predicts the noise that has to be subtracted from the noisy image to get an actual image. A fraction of this predicted noise is then subtracted, and the resulting image is fed back into the network. This process is repeated for a set number of steps.
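The loop below is a rough sketch of this iterative denoising, using the standard DDPM-style update rather than the exact sampler of any particular Stable Diffusion release. The trained network is replaced by a hypothetical `oracle_noise_predictor` that "cheats" by knowing the clean target, so the example runs without any model weights; the schedule values and the 50 steps are assumptions.

```python
import math
import random

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (assumed) and its running products."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars

def oracle_noise_predictor(x_t, t, alpha_bars, target):
    """Stand-in for the trained network: it knows the clean target image."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    return [(x - signal * x0) / sigma for x, x0 in zip(x_t, target)]

def denoise(target, num_steps=50, seed=0):
    rng = random.Random(seed)
    betas, alpha_bars = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in target]       # start from pure noise
    for t in reversed(range(num_steps)):
        eps = oracle_noise_predictor(x, t, alpha_bars, target)
        # Subtract only a fraction of the predicted noise at each step.
        fraction = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        scale = 1.0 / math.sqrt(1.0 - betas[t])
        x = [scale * (xi - fraction * ei) for xi, ei in zip(x, eps)]
        if t > 0:                                    # add a little fresh noise, except at the end
            sigma = math.sqrt(betas[t])
            x = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
    return x

target = [0.5, -0.25, 1.0, 0.0]
result = denoise(target)
```

With a perfect noise predictor the loop lands on the target. A real model only approximates the noise, which is why many steps and a good sampler matter in practice.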

Alternatives to Stable Diffusion


DALL-E

DALL-E is a text-to-image model by OpenAI. At the time of writing, it has two model versions, DALL-E and DALL-E 2. DALL-E 2 has an encoder-decoder architecture. The text encoder takes text as input and generates text embeddings. These are passed to a prior, which is itself a diffusion model and generates the corresponding CLIP image embeddings. The image embeddings are then passed to an image decoder, which generates the actual images.

The embeddings come from CLIP (Contrastive Language-Image Pretraining). CLIP is trained on hundreds of millions of images and their captions to learn the relation between text and images. The model is designed to score how well a given caption matches an image, rather than to predict captions from images. CLIP encodes the text and the image of each image-caption pair, then calculates the cosine similarity between every text embedding and every image embedding in a batch. Training maximizes the similarity of correct pairs and minimizes that of incorrect pairs. Once trained, CLIP is frozen, and DALL-E 2 moves on to the next task: training the prior, which takes CLIP text embeddings and turns them into CLIP image embeddings.
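The contrastive objective can be illustrated with a small pure-Python sketch. It is not OpenAI's implementation: the three-dimensional embeddings are made up, and the temperature of 0.07 is an illustrative assumption. Correct pairs sit on the diagonal of the similarity matrix, and the loss is a cross-entropy computed over both rows (text to image) and columns (image to text).

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity_matrix(text_embs, image_embs):
    """Entry [i][j] is the cosine similarity of caption i and image j."""
    texts = [normalize(t) for t in text_embs]
    images = [normalize(i) for i in image_embs]
    return [[sum(a * b for a, b in zip(t, i)) for i in images] for t in texts]

def cross_entropy_on_diagonal(logits):
    """Average of -log softmax(row)[i]; the correct pair is on the diagonal."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)
        log_sum = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_sum - row[i]
    return total / len(logits)

def contrastive_loss(text_embs, image_embs, temperature=0.07):
    sims = cosine_similarity_matrix(text_embs, image_embs)
    logits = [[s / temperature for s in row] for row in sims]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(transposed))

# Made-up embeddings: caption i is meant to describe image i.
text_embs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
image_embs = [[0.9, 0.0, 0.1], [0.1, 0.8, 0.0], [0.0, 0.2, 1.1]]

aligned = contrastive_loss(text_embs, image_embs)
shuffled = contrastive_loss(text_embs, image_embs[1:] + image_embs[:1])
```

The loss is low when each caption is most similar to its own image and much higher when the images are shuffled, which is the signal that pulls matching pairs together during training.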

Lastly, a decoder generates an image. It uses a modified version of GLIDE (Guided Language to Image Diffusion for Generation and Editing), a diffusion model that extends the diffusion process with text conditioning. In DALL-E 2, this model is adapted to decode the CLIP image embeddings into coherent, photorealistic images that maintain the essence of the original prompt.

DALL-E still faces some limitations. For example, it might struggle to create images where text and visuals align coherently. It also faces challenges in linking attributes to objects or generating complex scenes. Additionally, it may inherit biases from the data it is trained on. However, it is still a powerful generative AI model that can create realistic and creative images from text descriptions. 


Midjourney

Midjourney is a self-funded and independent generative AI program. It can generate high-quality images from text descriptions. It is hosted by an independent research lab, Midjourney, Inc., and operates entirely through Discord, either on the official server or on third-party servers. To use Midjourney, you need a Discord account. You do not need any specialized hardware or software, and you do not need to download any files. Currently, the lab is working to make it accessible through a web interface.

Midjourney is a closed-source program, and no official description of its inner workings is available. It is generally assumed to combine a language model with a diffusion model: the language model would interpret the text prompt and convert it into a vector, which then guides a diffusion process that gradually generates an image consistent with the prompt. To generate images, users enter prompts with the /imagine command. Buttons such as Upscale, Vary, and Redo let users refine the results. Midjourney has a distinctive artistic style, and its outputs often resemble paintings rather than photographs.

Midjourney is still in the beta stage. It continually refines its algorithms, introducing new model versions every few months. Midjourney requires payment upfront, unlike many other image generators. This is because its image generation process is resource-intensive. It requires the use of GPUs and large memory for denoising. Midjourney is a promising AI art generation program that is still under development.

Techniques to generate product images using Stable Diffusion

Textual Inversion

Textual Inversion is a method used in machine learning, specifically in text-to-image generation. It's a way to introduce new concepts to a model using a few example images. The model learns these new concepts by creating new 'words' in the embedding space of the text encoder. These words can then be used in text prompts to generate images with a high degree of control. The technique was first introduced in the research paper "An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion" and has since been applied to various models, including the Stable Diffusion models.

The assumption here is that the embedding space of the text encoder is vast enough to encode the aspects of the new images that one would want to introduce and generate using the image decoder.

To generate product images using this process, start with a few images of the product, preferably in different positions and lighting. The model is then trained on these images, but the embedding of the new token is the only trainable parameter; everything else stays frozen. This means the entire training loss goes toward learning the embedding required to generate the given images. Training thus searches the text encoder's embedding space for the vector that best represents the product or concept shown in the images.
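The "only the new embedding is trainable" idea can be shown with a deliberately tiny toy. Everything here is an assumption made for illustration: the frozen generator is reduced to a fixed 3 x 3 linear map, the product photos to a three-number target, and `<my-product>` is a hypothetical placeholder token initialized from the embedding of a related word. Real Textual Inversion optimizes the same kind of vector, but through the full diffusion loss.

```python
import copy

def decode(weights, embedding):
    """Frozen stand-in for the image generator: a fixed linear map."""
    return [sum(w * e for w, e in zip(row, embedding)) for row in weights]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def train_new_token(embeddings, weights, token, target, steps=200, lr=0.1):
    """Gradient descent on the new token's vector only; nothing else is updated."""
    vec = list(embeddings[token])
    for _ in range(steps):
        output = decode(weights, vec)
        errors = [o - t for o, t in zip(output, target)]
        # d(mse)/d(vec[j]) = (2 / n) * sum_i errors[i] * weights[i][j]
        grads = [2.0 / len(target) * sum(err * row[j] for err, row in zip(errors, weights))
                 for j in range(len(vec))]
        vec = [v - lr * g for v, g in zip(vec, grads)]
    embeddings[token] = vec
    return embeddings

weights = [[1.0, 0.5, 0.0], [0.0, 1.0, 0.5], [0.5, 0.0, 1.0]]   # frozen "model"
embeddings = {
    "shoe": [0.2, 0.1, 0.4],
    "bag": [0.7, 0.3, 0.1],
    "<my-product>": [0.2, 0.1, 0.4],    # new token, initialized from "shoe"
}
target = [0.9, 0.2, 0.6]                # stand-in for features of the product photos

frozen_before = copy.deepcopy({k: v for k, v in embeddings.items() if k != "<my-product>"})
weights_before = copy.deepcopy(weights)
loss_before = mse(decode(weights, embeddings["<my-product>"]), target)
train_new_token(embeddings, weights, "<my-product>", target)
loss_after = mse(decode(weights, embeddings["<my-product>"]), target)
```

After training, the loss for the new token has dropped sharply, while the model weights and the embeddings of all existing words are exactly as they were before.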

The advantage of this process is that we are not expanding the space of the model. We are learning the new embedding required specifically for our product. This means we can use the same model for multiple products, just assigning a new word to every specific product.


DreamBooth

DreamBooth is another method to add new concepts to image generative models. It came out of Google in 2022. Similar to Textual Inversion, DreamBooth lets you introduce new objects or concepts to the model and assign unique identifiers to them. Once the model has learned the new concept, you can use the associated unique identifier to generate images of the object in different settings and scenarios.

Although similar in concept and goal, DreamBooth differs completely in approach from Textual Inversion. Instead of only learning a new embedding in the text embedding space, DreamBooth finetunes the whole network in a few-shot fashion, and that is also the core problem. The authors show that the method works with as few as 3 images, although more is better. But finetuning with such a small amount of data can lead to major issues in the network, such as overfitting. The authors introduce a few ways to make this possible.

To generate custom product images using this process, you first need a few images of the product or person that you want to target. Along with that, you need some images of the same class as the product; in the paper, these are generated by the pretrained model itself. The class is simply what the product is: a bag, shoes, a water bottle, etc. The training data is composed of the new images and the images from the same class. This is necessary to make sure that the model retains its original knowledge and doesn't overfit. To balance preserving the old concepts against learning the new one, the authors introduce a class-preservation loss (called prior preservation loss in the paper), a loss term with an additional parameter that controls the weight of the loss from the class images. One can reduce or increase this parameter as needed. This also helps prevent language drift, the phenomenon in which a language model trained on a large corpus and then finetuned on a smaller, more specific corpus starts to lose its syntactic and semantic knowledge of the language.
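In code, the combined objective amounts to a weighted sum of two losses. The sketch below is a simplification under stated assumptions: plain mean squared error on short lists stands in for the diffusion loss on images, and the names `dreambooth_loss` and `prior_weight` are hypothetical, not taken from the paper or any library.

```python
def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def dreambooth_loss(instance_pred, instance_target,
                    class_pred, class_target, prior_weight=1.0):
    """Loss on the new product images plus a weighted loss on the class images."""
    instance_loss = mse(instance_pred, instance_target)   # learn the new product
    prior_loss = mse(class_pred, class_target)            # keep what the class looked like
    return instance_loss + prior_weight * prior_loss

# Toy numbers standing in for model predictions and their targets.
instance_pred, instance_target = [0.2, 0.4], [0.0, 0.0]
class_pred, class_target = [1.0, 1.0], [0.5, 1.5]

balanced = dreambooth_loss(instance_pred, instance_target, class_pred, class_target)
product_only = dreambooth_loss(instance_pred, instance_target,
                               class_pred, class_target, prior_weight=0.0)
```

Setting `prior_weight` to zero recovers plain finetuning on the product images, which is the setting most prone to overfitting and language drift; raising it pushes the model to stay closer to its original notion of the class.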

Along with the class preservation loss, the authors also put emphasis on the specific technique to use when building prompts. They suggest using prompts in the specific format of: “A [V] [class noun]” where [V] is the unique identifier, and the [class noun] is the class the object belongs to. Using class nouns helps greatly with learning as it helps the model tie the properties of the new images to something that it has already learned. This is because this way the image and the embeddings are more closely related right off the bat instead of being learned slowly.
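As a small illustration of this prompt format, the helper below builds the two prompts used during finetuning: one with the unique identifier for the product images and one with only the class noun for the class images. The function name and the identifier `sks` are hypothetical choices for this example.

```python
def build_prompts(identifier, class_noun):
    """Prompts in the "A [V] [class noun]" format, plus the plain class prompt."""
    return {
        "instance": f"a {identifier} {class_noun}",   # paired with the product images
        "class": f"a {class_noun}",                   # paired with the class images
    }

prompts = build_prompts("sks", "backpack")
print(prompts["instance"])  # a sks backpack
print(prompts["class"])     # a backpack
```

At generation time, the same identifier is reused inside longer prompts to place the product in new settings.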

These two techniques are the core of finetuning with DreamBooth. Once they are in place, one can finetune the low-resolution model with the class-preservation loss and the class-noun prompting technique, and then finetune the higher-resolution model with the new images to ensure the fidelity of the generated images.

Dreambooth vs Textual Inversion

Compared to Textual Inversion, DreamBooth generally shows much better results. This is primarily because DreamBooth finetunes the whole network instead of just a text embedding. But because of this, finetuning with DreamBooth can be notoriously difficult: it is very easy to overfit, language drift can occur, and there are many hyperparameters to control. Many people have been running experiments on this, including a detailed article from Hugging Face. Another big issue with DreamBooth is the high number of trainable parameters. This can be mitigated by finetuning with PEFT (parameter-efficient finetuning) techniques such as LoRA.
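To see why LoRA reduces the number of trainable parameters, consider a single weight matrix. LoRA keeps the original matrix W frozen and learns two small matrices B and A whose product is added to it, W' = W + (alpha / rank) * B A. The sketch below counts parameters and applies that update; the 768 x 768 layer size, the rank of 4, and the tiny 2 x 2 example are assumptions for illustration only.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters for full finetuning vs. a LoRA update of one matrix."""
    full = d_out * d_in
    lora = rank * (d_in + d_out)     # B is d_out x rank, A is rank x d_in
    return full, lora

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def merge_lora(weight, lora_b, lora_a, alpha, rank):
    """Effective weight after training: W + (alpha / rank) * B @ A."""
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

full, lora = lora_param_counts(768, 768, rank=4)
print(full, lora, full // lora)      # 589824 6144 96

weight = [[1.0, 0.0], [0.0, 1.0]]    # frozen pretrained weight (toy size)
lora_b = [[0.5], [1.0]]              # d_out x rank
lora_a = [[0.2, 0.4]]                # rank x d_in
merged = merge_lora(weight, lora_b, lora_a, alpha=1.0, rank=1)
```

For this assumed layer, LoRA trains about 1% of the parameters that full finetuning would, which is what makes DreamBooth-style finetuning feasible on modest hardware.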

That being said, DreamBooth usually gives more faithful results than Textual Inversion. If you want to generate product images using Stable Diffusion, DreamBooth finetuning with LoRA is the better default; if you only need the model to learn the basic concept, without very high accuracy, Textual Inversion is the lighter option.


Outpainting

Outpainting is a basic method of extending an input image beyond its original borders. Stable Diffusion is able to perform this operation using the techniques described in another paper, LaMa (Large Mask Inpainting). The authors generate the training masks and evaluate the performance of the Stable Diffusion inpainting model following the LaMa paper.

It is important to note that outpainting is an extension of inpainting, a technique that regenerates selected parts of an image, typically to remove or replace objects, rather than adding new area to it. Inpainting works by passing the original image and a mask image to the model; the model then replaces the parts of the original image that are highlighted in the mask. LaMa was a SOTA method at the time; newer papers such as Feature Refinement have come out since.

To perform outpainting with this setup, the canvas is made bigger than the original image, and the mask covers the new area around the image rather than the original image itself. This forces the model to fill in the surrounding area and hence extend the original image.
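Preparing the inputs for outpainting can be sketched with nested lists standing in for images. The convention that 1 marks pixels to generate and 0 marks pixels to keep is an assumption here, as are the function name and the fill value; real pipelines usually work with image objects and may use the opposite or a grayscale convention.

```python
def make_outpaint_inputs(image, pad, fill=0):
    """Place the image on a larger canvas and mask only the new border."""
    height, width = len(image), len(image[0])
    new_height, new_width = height + 2 * pad, width + 2 * pad
    canvas = [[fill] * new_width for _ in range(new_height)]
    mask = [[1] * new_width for _ in range(new_height)]    # 1 = generate here
    for row in range(height):
        for col in range(width):
            canvas[row + pad][col + pad] = image[row][col]
            mask[row + pad][col + pad] = 0                 # 0 = keep original pixel
    return canvas, mask

image = [[7, 8],
         [9, 6]]                      # toy 2x2 "image"
canvas, mask = make_outpaint_inputs(image, pad=1)
for mask_row in mask:
    print(mask_row)
```

The canvas and mask would then be passed to an inpainting model, which fills only the masked border and leaves the original pixels in the center untouched.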


ControlNet

ControlNet is not a technique to extend images, but rather a technique to control the output of a generative model. It attaches an additional trainable network to a pretrained diffusion model; this network takes a conditioning image, such as an edge map, a depth map, a pose skeleton, or a scribble, and steers the generation so that the output follows its structure. The text prompt still decides features such as color, texture, and style. This gives much finer control over the shape and composition of the output than a text prompt alone.

ControlNet can be combined with DreamBooth, Textual Inversion, and other Stable Diffusion models to generate finely controlled images of desired products. For example, shoe images can be generated from nothing but a rough scribble of a shoe.

Why Use Stable Diffusion To Generate Product Images?

In rapidly evolving product marketing, visual content is essential for businesses to stand out from the competition. Stable Diffusion is a cutting-edge image generation technique that can help businesses create high-quality product images quickly and easily. It offers several advantages over traditional image creation methods.

More Cost-Effective Than Product Photography

Traditional product photography can be expensive, especially when you factor in the cost of hiring a photographer, renting equipment, and paying for post-processing. Stable Diffusion can help you cut these costs by generating high-quality product images from text prompts. With traditional product photography, there can also be a long wait between the shoot and the delivery of the final photos. Stable Diffusion can generate images much more quickly, so you can get your products listed online faster. You are also not limited by the time and location of a photoshoot: Stable Diffusion lets you generate images of a product in any setting, at any time, which gives you more flexibility to create the right images for your marketing campaigns. Finally, traditional product photography can be inconsistent, depending on the photographer's skills and the lighting conditions, whereas a well-tuned Stable Diffusion setup can help you produce images in a consistent style.

Adaptable To Trends

In dynamic markets, where trends are constantly changing, businesses need to be adaptable. Stable Diffusion can help businesses quickly align with the latest trends and product variations by generating realistic images from simple sketches, textual descriptions, or input images. This makes it well suited to tasks such as image inpainting, style transfer, and upscaling. Diffusion models have also been explored for image segmentation, which involves dividing an image into distinct regions based on contrasts, colors, or features; their iterative nature allows the segmentation to be refined gradually until it is precise and intricate. This adaptability makes Stable Diffusion a valuable asset for businesses that want to stay ahead of the competition.

Efficiency And Speed

Efficiency and speed are essential qualities for businesses in today's competitive landscape. Stable Diffusion can help businesses achieve both of these goals by accelerating the image creation process. Traditional methods of image creation can be time-consuming and expensive. Businesses may need to hire photographers or graphic designers to create high-quality images. However, Stable Diffusion can generate realistic images from simple text descriptions or input images. This can save businesses time and money, while also giving them more control over the process. It can generate multiple images in a short amount of time. This allows businesses to quickly create various images for different marketing campaigns or e-commerce platforms.

Customizable And Diverse

Stable Diffusion offers a wide range of customization and diversity, making it easy to create images that meet specific requirements. For businesses, it can generate high-quality visuals for advertising, creative projects, and product design, and streamline image and video editing, increasing efficiency. It gives users and product designers more control over design choices, quickly generating new and engaging designs. Stable Diffusion can also be used to extract appealing designs and color palettes for web pages, apps, and themes.

In marketing, Stable Diffusion can help develop new designs for logos, promotional materials, and content illustrations. For example, a furniture store could use Stable Diffusion to create images of its products in different room settings, helping customers visualize how the furniture would look in their homes. A clothing brand could use Stable Diffusion to create images of its clothes on different models, helping customers see how the clothes would look on them.

Efficiency In Iteration 

Stable Diffusion streamlines the iterative design process by swiftly generating multiple product images with slight variations, such as different colors, poses, or backgrounds. This allows designers to quickly compare options and make informed decisions about the final design, and it makes the design easier to refine. For example, if a designer is not happy with the color of a product, they can quickly generate a new image with a different color. This can save a lot of time and effort compared to traditional design methods, which often require manual editing of images. Stable Diffusion can also be used to optimize visual assets, for example by adapting images to the formats of different platforms, which helps keep visual assets consistent and effective across all channels.

Want to build Stable Diffusion Products?

If you want to build products with Stable Diffusion or other image generation algorithms, reach out to us. We have a ton of experience working with Stable Diffusion, VQ-GANs, VAEs, and other generative AI technologies. We would love to see how we can help you.
