Guide to fine-tuning LLMs using PEFT and LoRA techniques

Large Language Models (LLMs) like GPT keep getting larger. Even open-source models like MPT and Falcon have reached 30 and 40 billion parameters respectively. With size, the capabilities and complexity of these models have also increased, but so have the challenges. Training larger models requires more extensive datasets, and as a model grows, more parameters must be tuned. This is very compute-heavy and, as a result, costly. This is where fine-tuning comes in. Fine-tuning is a technique that re-purposes pre-trained models, so you can benefit from a large model without paying to train one from scratch.

In this blog, we will discuss advanced fine-tuning techniques like PEFT (Parameter-Efficient Fine-Tuning) and see how they can save you a ton of time and money when training massive LLMs.

What is Fine-tuning?

Fine-tuning is the process of taking a model that has already been trained on some task and tweaking it to perform a similar task. It is often used when a new dataset or task requires some modifications to the model, or when the model is not performing well on a specific task.

For example, a model trained to generate stories can be fine-tuned to generate poems. This is possible because the model has already learned how to generate casual language and write stories. The same skill can be used to generate poems if the model is tweaked properly.

How does Fine-tuning work?

As mentioned, fine-tuning is tweaking an already-trained model for some other task. The way this works is by taking the weights of the original model and adjusting them to fit a new task.

A trained model learns to do specific tasks. For example, GPT-3 has been trained on a massive dataset and, as a result, has learned to generate stories, poems, songs, letters, and a lot of other things. One can take this ability of GPT-3 and fine-tune it on a specific task, like answering customer queries in a specific manner.

There are different ways and techniques to fine-tune a model, the most popular being transfer learning. Transfer learning comes out of the computer vision world. It is the process of freezing the weights of the initial layers of a network and only updating the weights of the later layers. This works because the lower layers, the ones closer to the input, learn the general features of the training dataset, while the upper layers, closer to the output, learn more specific information that is directly tied to generating the correct output.

Here is a quick visualization of how fine-tuning works:

Alammar, J (2018). The Illustrated Transformer [Blog post].

Why use Fine-Tuning?

As a model grows, it becomes more costly and time-consuming to train. A larger model also requires more training data; otherwise, it usually overfits and generates poor results in a production environment. Fine-tuning lets us avoid these issues by efficiently reusing a pre-trained model for our purposes. Here are some reasons why you should consider fine-tuning instead of training a model from scratch:

Larger models generalize to downstream tasks well

Large models like GPT-3 and GPT-4 perform really well on complicated tasks. This is because they have very sophisticated architectures and are trained on massive datasets, which helps them generalize well across many tasks. These models understand the underlying properties of language, and that helps them pick up new tasks with minimal effort, for example through prompt engineering.

But if you want to use these models for a very specific task, like building a legal contract generator, you should probably fine-tune the model instead of relying on prompt engineering. A model that performs well on a very general task like language generation is a strong starting point for a downstream task like generating legal contracts, and fine-tuning specializes it for that task.

Cheaper than training a whole model

As mentioned before, these large models are very expensive and time-consuming to train from scratch. It is almost always cheaper to fine-tune an already-trained model. This also allows you to leverage what is already out there instead of doing everything yourself. Good datasets, for instance, are often very hard and time-consuming to build. Open-source models like MPT and LLaMA have already been trained and validated by some of the best researchers out there, and they are easy to load and train on cloud infrastructure.

Good for online training

One of the biggest challenges in AI is keeping the model up to date with the latest data. Models deployed in production can start degrading in performance if they are not updated regularly. For example, if you deploy an AI model to predict customer behavior in a store, it might stop performing well once the store is restocked with products at different prices or introduces new products. This is a classic example of how changes in data can drastically change the performance of a model.

Fine-tuning can help you keep updating the model with the latest data without having to re-train the whole model. This makes it possible to keep models in production up to date without much effort and cost. This is often called online learning or online training, and it is important for most models in production.

What is PEFT?

PEFT, Parameter-Efficient Fine-Tuning, is a set of techniques for fine-tuning a large model in the most compute- and time-efficient way possible, while keeping performance close to what you would get from full fine-tuning. This matters because models keep growing bigger. BLOOM, for example, has a whopping 176 billion parameters, and it is almost impossible to fine-tune such a model without spending tens of thousands of dollars. But it is sometimes almost necessary to use such big models for better performance. This is where PEFT comes in. It helps you solve the problems you face when fine-tuning such big models.

Here are some PEFT techniques:

‍

Why PEFT?

As mentioned above, it has become a necessity to fine-tune and use bigger models when it comes to production-grade applications. PEFT techniques allow you to fine-tune these models efficiently and save money and time as a result. This is done by fine-tuning only a small, carefully chosen set of parameters in the neural network. The techniques either introduce new parameters in the network or freeze the whole model except for some parts, which makes the model easier to train.

Transfer Learning

Transfer learning is when we take some of the learned parameters of a model and use them for some other task. This sounds similar to fine-tuning but is different. In fine-tuning, we re-adjust all the parameters of the model, or freeze some of the weights and adjust the rest. In transfer learning, we take some of the learned parameters from a model and use them in other networks. This gives us more flexibility in terms of what we can do. For example, we cannot change the architecture of the model when fine-tuning, which limits us in many ways. When using transfer learning, we use only a part of the trained model, which we can then attach to any other model with any architecture.

How Transfer Learning Works

Transfer learning has been a common practice in the computer vision world for a very long time now. This is because of the nature of visual models and how they learn. In CNN models, the early layers extract more general features like edges and curves, whereas the later layers extract more complicated features like whole eyes and faces. This is because the receptive field of the convolutional layers grows as they are stacked on top of each other.

Let’s say, for example, you are trying to train a neural network to classify whether a vehicle in front of you is a car or a motorbike. This is a very basic task. But let’s say you have very limited data and you don’t want to train your model too much. Here is what a basic CNN network looks like.

There are 2 major parts of the network here: the CNN head and the later fully connected layers. As mentioned, the CNN layers extract representations of the data, which the fully connected network then uses to classify the image. Here we can take any other CNN trained on a similar classification problem and use it as the CNN head for this new problem.

As you can see here, we are using transfer learning by reusing the weights of a network pretrained to classify car types. We freeze only the first two layers of the CNN and leave the latter two free to be updated during training. This ensures that the CNN head can still learn new features from the images that might be necessary for the new task we are training the model for.
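The freezing step described above can be sketched in plain Python. This is only a bookkeeping illustration, not a real training setup: the `Layer` class, the layer names, and the parameter counts are hypothetical, and a real framework would attach the "trainable" flag to actual weight tensors.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    num_params: int
    trainable: bool = True


def freeze_first(layers, n_frozen):
    """Freeze the first n_frozen layers and return the trainable parameter count."""
    for index, layer in enumerate(layers):
        layer.trainable = index >= n_frozen
    return sum(layer.num_params for layer in layers if layer.trainable)


# Hypothetical layer sizes for a small pretrained CNN classifier.
network = [
    Layer("conv1", 1_792),
    Layer("conv2", 36_928),
    Layer("conv3", 73_856),
    Layer("conv4", 147_584),
    Layer("classifier", 1_026),
]

# Freeze the first two CNN layers; the last two and the classifier stay trainable.
trainable_params = freeze_first(network, n_frozen=2)
total_params = sum(layer.num_params for layer in network)
print(f"trainable: {trainable_params} of {total_params}")
```

Only the unfrozen layers would receive gradient updates during training, which is what keeps the cost low when data is limited.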

Transfer learning is also often seen in NLP tasks with LLMs where people use the encoder part of the transformer network from a pretrained model like T5 and train the later layers.

Adapters

Adapters were one of the first parameter-efficient fine-tuning techniques released. In the paper, the authors showed that you can add more layers to the pre-existing transformer architecture and fine-tune only those layers instead of the whole model. They showed that this technique results in performance similar to complete fine-tuning.

On the left is the modified transformer architecture with added adapter layers. You can see that adapter layers are added after the attention stack and the feed-forward stack. On the right is the architecture of the adapter layer itself. The adapter layer has a bottleneck architecture: it takes the input, projects it down to a smaller-dimensional representation, passes it through a non-linear activation function, and then scales it back up to the dimension of the input. This makes sure that the next layer in the transformer stack can receive the output of the adapter layer.
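As a rough illustration of the bottleneck, here is a minimal pure-Python sketch of one adapter forward pass. It makes a few assumptions that are not in the description above: it uses ReLU as the non-linearity, omits biases, adds a skip connection around the adapter (as the adapter paper does), and initializes the up-projection to zero so the adapter starts out as an identity function. The dimensions are toy values.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, v) for v in vector]


def adapter_forward(hidden, w_down, w_up):
    """Down-project, apply the non-linearity, up-project, then add the skip connection."""
    bottleneck = relu(matvec(w_down, hidden))
    restored = matvec(w_up, bottleneck)
    return [h + r for h, r in zip(hidden, restored)]


random.seed(0)
d_model, d_bottleneck = 8, 2  # toy sizes; real models use much larger d_model

w_down = [[random.gauss(0, 0.02) for _ in range(d_model)] for _ in range(d_bottleneck)]
w_up = [[0.0] * d_bottleneck for _ in range(d_model)]  # assumed zero init: adapter starts as identity

hidden = [random.gauss(0, 1) for _ in range(d_model)]
output = adapter_forward(hidden, w_down, w_up)

# Only the two small projection matrices are trained.
adapter_params = d_model * d_bottleneck * 2
print(len(output), adapter_params)
```

Because the output has the same dimension as the input, the adapter can be dropped between existing layers without changing anything else in the stack.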

In the paper, the authors show that this method of fine-tuning is comparable to complete fine-tuning while consuming far fewer compute resources and much less training time. They were able to get within 0.4% of full fine-tuning performance on the GLUE benchmark while adding only 3.6% of the parameters.

LoRA - Low-Rank Adaptation

LoRA is a strategy similar to adapter layers, but it aims to further reduce the number of trainable parameters. It takes a more mathematically rigorous approach: LoRA changes how the updatable parameters are represented and trained in the neural network.

Let’s explain this mathematically; you can skip to the next paragraph if you are not interested. The weight matrices of a pretrained neural network are typically full rank, meaning no row or column of a matrix can be produced by linearly combining the others. But the LoRA authors build on the observation that pretrained language models have a low “intrinsic dimension” when they are adapted to a new task: the change can be captured in a much smaller subspace. They hypothesize that the weight update matrix learned during fine-tuning likewise has a low rank, as most of the necessary information has already been captured by the pre-training process and only task-specific adjustments are made during fine-tuning.

A much simpler explanation is that fine-tuning only needs to make a small, highly structured change to the weights, as most of the learning is done during the pretraining phase of the neural network. LoRA uses this fact to reduce the number of trainable parameters.

The image above gives a visual representation of what LoRA is doing. ΔW, of size A × B, is the weight update matrix: the changes that need to be applied to the neural network for it to learn a new task. This matrix can be broken down into two smaller matrices, which are the only things we train; multiplying them together gives back the weight update matrix. As you can see in the image, the two matrices have r columns and r rows respectively, where r can be understood as the rank of the weight update matrix. The bigger the rank, the more parameters are updated during training.
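A minimal sketch of this decomposition in pure Python might look like the following. The names `factor_left` and `factor_right`, the toy dimensions, and the initialization (one factor at zero so the update starts as a no-op, which is what the LoRA paper does) are choices made for illustration; no training loop is shown.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices with the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


random.seed(0)
n_rows, n_cols, rank = 4, 6, 2  # toy sizes

# Pretrained weight matrix: frozen, never updated.
w_frozen = [[random.gauss(0, 1) for _ in range(n_cols)] for _ in range(n_rows)]

# The two trainable low-rank factors.
factor_left = [[0.0] * rank for _ in range(n_rows)]  # n_rows x rank, zero init
factor_right = [[random.gauss(0, 0.01) for _ in range(n_cols)] for _ in range(rank)]  # rank x n_cols

# The weight update is the product of the two factors and has the full n_rows x n_cols shape.
delta_w = matmul(factor_left, factor_right)
w_effective = add(w_frozen, delta_w)

trainable = n_rows * rank + rank * n_cols
frozen = n_rows * n_cols
print(trainable, frozen)
```

With these toy sizes the saving is small, but the trainable count grows with `rank * (n_rows + n_cols)` while the frozen matrix grows with `n_rows * n_cols`, so the gap widens quickly for realistic layer sizes.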

Efficiency of LoRA

The authors of the paper show that LoRA can outperform full fine-tuning with only 2% of the total trainable parameters.

As for the number of parameters it trains, we can largely control that using the rank parameter r. For example, let’s say the weight update matrix is 200 x 500, so it has 100,000 parameters. It can be decomposed into two smaller matrices of lower dimensions, one of size 200 x 3 and the other 3 x 500. This gives us only 200 x 3 + 3 x 500 = 2100 trainable parameters, which is just 2.1% of the total number of parameters. This can be reduced further, as we can decide to apply LoRA only to specific layers.
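The arithmetic above is easy to check, and to repeat for other ranks, with a small helper. The function name `lora_param_counts` is made up for this sketch.

```python
def lora_param_counts(rows, cols, rank):
    """Return (full update size, low-rank size, low-rank fraction) for one weight matrix."""
    full = rows * cols
    low_rank = rows * rank + rank * cols
    return full, low_rank, low_rank / full


full, low_rank, fraction = lora_param_counts(200, 500, 3)
print(full, low_rank, f"{fraction:.1%}")  # 100000 2100 2.1%

# Doubling the rank doubles the number of trainable parameters.
_, low_rank_r6, _ = lora_param_counts(200, 500, 6)
print(low_rank_r6)  # 4200
```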

As the number of parameters trained and applied is MUCH smaller than the actual model, the files can be as small as 8MB. This makes loading, applying, and transferring the learned weights much easier and faster.

You can read the LoRA paper if you want to learn more and do a deeper dive into the topic.

LoRA in Stable Diffusion

One of the most interesting use cases of LoRA is in image generation applications. Images have an inherent style that can be seen visually. Instead of training massive models to get specific styles of images out of them, users can now train only LoRA weights and use them with techniques like Dreambooth to achieve really good quality images with a lot of customizability.

LoRA weights can also be combined with other LoRA weights and used in a weighted combination to generate images that carry multiple styles. You can find a ton of LoRA adapters online on CivitAI and load them into your models.

IA3 - Infused Adapter by Inhibiting and Amplifying Inner Activations

IA3 is an adapter-based technique that is somewhat similar to LoRA. The goal of the authors was to replicate the advantages of ICL (in-context learning, or few-shot prompting) without the issues that come with it. ICL can get messy in terms of cost and inference, as it requires prompting the model with examples, and longer prompts require more time and computation to process. But ICL is perhaps the easiest way to get started working with models.

IA3 works by introducing rescaling vectors that target the activations of the model. A total of 3 vectors are introduced: l_v, l_k, and l_ff. These vectors target the values and keys in the attention layers and the activation after the non-linearity in the feed-forward layers. They are multiplied element-wise with the corresponding activations in the model. Once injected, these vectors are learned during the training process, while the rest of the model remains frozen. The learned vectors essentially rescale (inhibit or amplify) the targeted activations of the pretrained model for the task at hand.
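To make the rescaling concrete, here is a toy sketch for the key vector only. The weight values, the input, and the "learned" values of `l_k` are invented for illustration, and starting `l_k` at all ones (so the model is initially unchanged) is the initialization used in the paper rather than something stated above.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def ia3_rescale(activation, scale_vector):
    """Element-wise product of an activation with a rescaling vector."""
    return [a * s for a, s in zip(activation, scale_vector)]


# Frozen key projection and a toy input (values chosen for easy arithmetic).
w_key = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
x = [2.0, 1.0]
keys = matvec(w_key, x)

l_k_init = [1.0, 1.0, 1.0]     # all ones: activations pass through unchanged
l_k_learned = [0.5, 2.0, 0.0]  # hypothetical values after training

unchanged = ia3_rescale(keys, l_k_init)
rescaled = ia3_rescale(keys, l_k_learned)
print(keys, rescaled)  # [4.0, -1.0, 7.0] [2.0, -2.0, 0.0]
```

Each entry of the vector inhibits (values below 1) or amplifies (values above 1) one dimension of the activation, which is where the method gets its name.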

So far this seems like a basic adapter-type PEFT method. But that’s not all. The authors also use 3 loss terms to enhance the learning process. The 3 losses are L_LM, L_UL, and L_LN. L_LM is the standard cross-entropy loss, which increases the likelihood of generating the correct response. Then there is L_UL, which is the Unlikelihood Loss. This loss term reduces the probability of incorrect outputs using Rank Classification. Finally, we have L_LN, which is a length-normalized loss that applies a softmax cross-entropy loss to length-normalized log probabilities of all output choices. Multiple losses are used here to ensure faster and better learning of the model. Because we are trying to learn from few-shot examples, these losses are necessary.

Now let’s talk about two very important concepts in IA3: Rank Classification and Length Normalization.

In Rank Classification, a model is asked to rank a set of responses by their correctness. This is done by calculating the probability scores for the potential responses. L_UL is then used to reduce the probability of the wrong responses and, as a result, increase the probability of the correct response. But with Rank Classification we face a critical problem: responses with fewer tokens tend to rank higher, because of how probability works. The probability of a response is the product of the probabilities of its tokens, each of which is < 1, so fewer generated tokens usually means a higher probability. To fix this, the authors propose dividing the score of the response by the number of tokens in the response, which normalizes the scores. One very important thing to note here is that normalization is done over log probabilities, not raw probabilities. Log probabilities are zero or negative, since they are logarithms of values between zero and one.
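The length bias and its fix can be shown with a small example. The per-token probabilities below are invented; only the scoring logic (sum of log probabilities, then division by the number of tokens) follows the description above.

```python
import math


def sequence_log_prob(token_probs):
    """Log probability of a response: the sum of its per-token log probabilities."""
    return sum(math.log(p) for p in token_probs)


def length_normalized_score(token_probs):
    """Average log probability per token."""
    return sequence_log_prob(token_probs) / len(token_probs)


# Hypothetical per-token probabilities for two candidate responses.
short_response = [0.6, 0.6]                # 2 tokens, each fairly uncertain
long_response = [0.8, 0.8, 0.8, 0.8, 0.8]  # 5 tokens, each more confident

for name, probs in [("short", short_response), ("long", long_response)]:
    print(name, round(sequence_log_prob(probs), 3), round(length_normalized_score(probs), 3))
```

Without normalization the short response wins simply because it multiplies fewer numbers below one; after dividing by length, the response whose tokens are individually more likely ranks first.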

Efficiency of IA3

IA3, just like LoRA, reduces the number of trainable parameters. But instead of using low-rank matrices, IA3 uses rescaling vectors. This reduces the trainable parameters to about 0.01%, compared to LoRA's > 0.1%, for the T0 model trained in the paper. The frozen state of the LLM also gives us the option of having multiple adapters for multiple use cases. Also, because the authors used element-wise multiplication, it is super easy to merge the adapter into the LLM weights: scaling the output of a layer element-wise gives the same result as scaling the corresponding weights of that layer ahead of time.
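The merging property can be checked numerically. In this sketch the matrix, the rescaling vector, and the input are toy values, and the matrix is stored so that each row produces one output dimension; under that layout, merging means scaling each row by its entry of the vector.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Toy frozen weight, rescaling vector, and input.
w = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
scale = [0.5, 2.0, 0.0]
x = [2.0, 1.0]

# Option 1: keep the adapter separate and rescale the activation at inference time.
scaled_at_runtime = [s * a for s, a in zip(scale, matvec(w, x))]

# Option 2: fold the vector into the weights once, then use a plain matrix multiply.
w_merged = [[s * value for value in row] for s, row in zip(scale, w)]
scaled_by_merge = matvec(w_merged, x)

print(scaled_at_runtime, scaled_by_merge)
```

Both options give the same output, so a merged model has no extra inference cost, while keeping the vectors separate lets one frozen model serve several tasks.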

The above figure shows that IA3 performs better than LoRA and barely affects the FLOPs. This makes IA3 a highly efficient and desirable technique. Also, because IA3 is an additive adapter technique, just like with LoRA we can target specific parts of the model and decide where to introduce the rescaling vectors. This helps us reduce the training time even more.

P-Tuning

The P-Tuning method aims to optimize the representation of the prompt that is passed to the model. In the P-Tuning paper, the authors emphasize that prompt engineering is a very strong technique when working with large language models. P-Tuning builds on top of prompt engineering and tries to further improve the effectiveness of a good prompt.

P-Tuning works by creating a small encoder network that produces a soft prompt for the prompt you pass in. To tune your LLM using P-Tuning, you create a prompt template that represents your prompt, along with a context x that is used in the template to get the label y. This is the approach described in the paper. The tokens used for the prompt template are trainable, learnable parameters called pseudo tokens. We also add a prompt encoder, which helps us adapt the pseudo tokens to the specific task at hand. The prompt encoder is usually a bi-LSTM network that learns the optimal representation of the prompt for the model and then passes that representation to it. The LSTM network is attached to the original model. Only the encoder network and the pseudo tokens are trained here; the weights of the original network remain unaffected. Once training is done, the LSTM head is discarded, as we have the learned embeddings h_i, which can be used directly.
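The template layout can be sketched as follows. This covers only how pseudo tokens, the context x, and the slot for the label y are arranged; the bi-LSTM prompt encoder and the training loop are not shown. The function name, the `[P0]`-style placeholder names, and the `[MASK]` slot are assumptions made for illustration.

```python
def build_template(context_tokens, n_before, n_after, label_slot="[MASK]"):
    """Place trainable pseudo tokens around the context x, followed by a slot for the label y."""
    before = [f"[P{i}]" for i in range(n_before)]
    after = [f"[P{n_before + i}]" for i in range(n_after)]
    return before + list(context_tokens) + after + [label_slot]


# Context x is "Britain"; the model should fill the label slot (for example with its capital).
template = build_template(["Britain"], n_before=3, n_after=3)
print(template)

# Only these positions get trainable embeddings; the context keeps its frozen embedding.
pseudo_positions = [i for i, token in enumerate(template) if token.startswith("[P")]
print(pseudo_positions)
```

In a hand-written prompt the pseudo-token positions would hold fixed words such as "The capital of" and "is"; P-Tuning replaces those words with embeddings that are learned for the task.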

In short, the prompt encoder only changes the embeddings of the passed prompt to better represent the task; everything else remains unchanged.

Efficiency of P-Tuning

In terms of results, P-Tuning holds up well against other methods. In the paper, the authors show that P-Tuning was able to perform better than full fine-tuning on most of the benchmarks. It can be said that P-Tuning is comparable to the full fine-tuning of large language models.

But there is a core issue when it comes to P-Tuning. P-Tuning is a prompt optimization technique: it optimizes the prompt that is passed to the bigger model. This means that we still depend heavily on the capability of the large model. If a model has not been trained on sentiment classification, optimizing sentiment classification prompts using P-Tuning will not do the model a lot of good. P-Tuning is an assistive technique. It is always very important to pick a model that can do the required task “well” out of the box with some prompt engineering, and then further optimize it.

Prefix Tuning

Prefix tuning can be considered the next version of P-Tuning. The authors of P-Tuning published a follow-up paper, P-Tuning v2, addressing the issues of P-Tuning. In it, they implemented the prefix tuning technique introduced in a separate paper. Prefix tuning and P-Tuning do not have a lot of differences but can still lead to different results. Let’s dive into a deeper explanation.

In P-Tuning, we added learnable parameters only to the input embeddings, but in Prefix Tuning we add them to all the layers of the network. This ensures that the model itself learns more about the task it is being fine-tuned on. We prepend learnable parameters to the prompt and to the activations of every transformer layer. The difference from P-Tuning is that instead of completely modifying the prompt embeddings, we only add a few learnable parameters at the start of the prompt at every layer. Here’s a visual explanation:

At every layer in the transformer, we concatenate a soft prompt, which has learnable parameters, with the input. These learnable parameters are tuned using a very small MLP with only 2 fully connected layers. This is done because, as the authors note in the paper, directly updating these prompt tokens is very sensitive to learning rate and initialization. The soft prompts increase the number of trainable parameters but also substantially increase the learning ability of the model. The MLP can be dropped later, as we only care about the soft prompts, which are prepended to the input sequences during inference to guide the model.

Efficiency of Prefix Tuning

Prefix tuning shows massive gains over P-Tuning. And as the model size increases, these gains increase too. This is perhaps because there are more trainable parameters for larger models. In the chart, you can see the authors compare the performance of P-Tuning, full finetuning, and Prefix tuning. Prefix tuning performs better than or as well as P-tuning in almost all tasks. In many cases, it performs even better than Full fine-tuning!

One big reason why prefix tuning works really well is that the trainable parameters are not limited to the input sequence. Learnable parameters are added at every layer, making the model much more flexible. Prefix tuning, unlike P-Tuning, affects not only the prompt tokens but also the model itself. This allows the model to learn more. But this approach is still largely based on the prompt. It is still suggested to take a model that can already perform the task and only then optimize it, as that will lead to much better results. As for the size, the number of trained parameters increases substantially, from about 0.01% to between 0.1% and 3% of the parameters. But the trained parameters still remain small enough to be transferred and loaded easily and quickly.
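A back-of-the-envelope count shows why adding prompts at every layer raises the number of trained parameters. The helper below and its example sizes (20 prompt positions, hidden size 1024, 24 layers) are hypothetical, and the optional factor of 2 assumes an implementation that stores separate key and value prefixes per layer.

```python
def prompt_param_count(prompt_length, hidden_dim, num_layers=1, per_position=1):
    """Count trainable values in the soft prompts.

    num_layers=1 models a prompt added only at the input embeddings.
    per_position=2 models separate key and value prefixes at each layer.
    """
    return prompt_length * hidden_dim * num_layers * per_position


input_only = prompt_param_count(20, 1024)
every_layer = prompt_param_count(20, 1024, num_layers=24)
every_layer_kv = prompt_param_count(20, 1024, num_layers=24, per_position=2)

print(input_only, every_layer, every_layer_kv)  # 20480 491520 983040
```

The count grows linearly with the number of layers, yet even the largest figure here is tiny next to a model with hundreds of millions or billions of weights.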

Prompt Tuning

Prompt tuning was one of the first papers to build upon the idea of fine-tuning only with soft prompts. It is closely related to P-Tuning and Prefix Tuning, and can be seen as a simplification of prefix tuning. Prompt tuning is a very simple and easy-to-implement idea. It involves prepending a specific prompt to the input and using virtual tokens, that is, new trainable tokens, for that specific prompt. These new virtual tokens can be fine-tuned during the process to learn a better representation of the prompt. This means that the prompt is tuned so that the model understands it better. Here is a comparison of prompt tuning with full fine-tuning from the paper:

Here you can see that full model tuning requires a separate copy of the model for every task we want to use it for. But with Prompt Tuning, you only need to store the learned virtual tokens for each task. So, for example, if you use a prompt like “Classify this tweet: {tweet}”, the goal is to learn new, better embeddings for the prompt. During inference, only these new embeddings are used to generate the outputs. This allows the model to tune the prompt to help itself generate better outputs during inference.
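As a rough sketch of the mechanics, the learned virtual tokens are just a small list of embedding vectors that gets prepended to the frozen embeddings of the real input. The toy vocabulary, embedding size, and random values below are placeholders; in practice the soft prompt would be updated by gradient descent while the embedding table stays frozen.

```python
import random


def embed(tokens, table):
    """Look up the frozen embedding of each input token."""
    return [table[token] for token in tokens]


def build_model_input(soft_prompt, tokens, table):
    """Prepend the trainable virtual-token vectors to the frozen input embeddings."""
    return soft_prompt + embed(tokens, table)


random.seed(0)
embedding_dim = 4
num_virtual_tokens = 3

# Frozen embedding table of the pretrained model (toy vocabulary).
vocab = ["great", "movie", "terrible"]
table = {token: [random.gauss(0, 1) for _ in range(embedding_dim)] for token in vocab}

# The only trainable values: one vector per virtual token.
soft_prompt = [[random.gauss(0, 0.5) for _ in range(embedding_dim)] for _ in range(num_virtual_tokens)]

model_input = build_model_input(soft_prompt, ["great", "movie"], table)
trainable_params = num_virtual_tokens * embedding_dim
print(len(model_input), trainable_params)  # 5 12
```

Switching tasks then means swapping one small `soft_prompt` for another while the model and its embedding table stay exactly the same.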

Efficiency of Prompt Tuning

The biggest advantage of using prompt tuning is the small size of the learned parameters. The files can be just a few KBs. As we can choose the dimension size and the number of new tokens, we have a lot of control over the number of parameters we are going to learn. In the paper, the authors show that the method performs really well even with a very small number of trainable tokens, and the performance only goes up as bigger models are used. You can read the Prompt Tuning paper for more details.
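A quick size estimate shows why these files land in the KB range. The helper name and the example numbers (20 virtual tokens, embedding size 1024, 4 bytes per value for 32-bit floats) are assumptions for illustration, and file-format overhead is ignored.

```python
def soft_prompt_size_kb(num_virtual_tokens, embedding_dim, bytes_per_value=4):
    """Approximate storage for a learned soft prompt, in kilobytes."""
    return num_virtual_tokens * embedding_dim * bytes_per_value / 1024


size_kb = soft_prompt_size_kb(20, 1024)
print(size_kb)  # 80.0
```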

Another big advantage is that we can use the same model without any changes for multiple tasks, as the only things being updated are the embeddings of the prompt tokens. This means you can use the same model for a tweet classification task and for a language generation task without any changes to the model itself, given that the model is big and sophisticated enough to perform those tasks. But a big limitation is that the model itself doesn’t learn anything new. This is purely a prompt optimization task. If the model has never been trained on a sentiment classification dataset, prompt tuning might not be of any help. It is very important to note that this method optimizes the prompts, not the model. So, if you cannot handcraft a hard prompt that does the task relatively well, there is little use in trying to optimize a soft prompt using prompt optimization techniques.

LoRA vs Prompt Tuning

Now that we have explored various PEFT techniques, the question becomes whether to use an additive technique like Adapters and LoRA or a prompt-based technique like P-Tuning and Prefix Tuning.

On comparing LoRA vs P-Tuning and Prefix Tuning, one can say for sure LoRA is the best strategy in terms of getting the most out of the model. But it might not be the most efficient based on your needs. If you want to train the model on a much different task than what it has been trained on, LoRA is without a doubt the best strategy for tuning the model efficiently. But if your task is more or less already understood by the model, but the challenge is to properly prompt the model, then you should use Prompt Tuning techniques. Prompt Tuning doesn’t modify many parameters in the model and mainly focuses on the passed prompt instead.

‍

One important point to note is that LoRA decomposes the weight update matrix into smaller, lower-rank matrices and uses them to update the weights of the model. Even though the number of trainable parameters is low, LoRA updates all the parameters in the targeted parts of the neural network. In Prompt Tuning techniques, on the other hand, a few trainable parameters are added to the model. This usually helps the model adjust to and understand the task better but does not help the model learn new properties well.

LoRA and PEFT in comparison to full Finetuning

PEFT, Parameter Efficient Fine Tuning, is proposed as an alternative to full Finetuning. For most of the tasks, it has already been shown in papers that PEFT techniques like LoRA are comparable to full finetuning, if not better. But, if the new task you want the model to adapt to is completely different from the tasks the model has been trained on, PEFT might not be enough for you. The limited number of trainable parameters can result in major issues in such scenarios.

‍

If you are trying to build a code generation model using a text-based model like LLaMA or Alpaca, you should probably consider fine-tuning the whole model instead of tuning the model using LoRA. This is because the task is too different from what the model already knows and has been trained on. Another good example of such a task is training a model, which only understands English, to generate text in the Nepali language.

‍

Why you should Fine-tune models for your business use case

Fine-tuning a model is an important step for any business that wants to get the most out of its machine-learning applications. It allows you to customize the model to your specific use case, which can lead to improved accuracy and performance. It saves time, money, and resources by eliminating the need to build a new model from the ground up. Fine-tuning lets you optimize the use of your proprietary data, adjusting the model to better fit your available data, and even incorporating new data if needed. This ensures a more accurate model that better serves your business needs. Here are some more benefits:

Customization: Fine-tuning allows you to tailor the model to your specific needs, enhancing accuracy and performance.
Resource Efficiency: It saves time, money, and resources by eliminating the need to build a new model from scratch.
Performance Boost: Fine-tuning enhances the performance of the pretrained model using your unique datasets.
Data Optimization: It lets you make the most of your data, adjusting the model to better fit your available data, and even incorporating new data if needed.

‍

But as models grow to billions of parameters, fine-tuning itself can be a challenge. The PEFT techniques we discussed in this blog help reduce the time and resources needed to fine-tune a model. They speed up the training process by making use of the pretrained weights and parameters and allow you to fine-tune the model more efficiently. Also, using PEFT, you can easily transfer models over the internet and even use the same model for multiple purposes. PEFT opens up a whole new world of possibilities for businesses that want to make the most of their machine-learning applications.

Want to Train Custom LLMs with PEFT?

If you want to build or train custom LLMs or Chatbots, we can help you fine-tune them to your specific needs. We have done a ton of work on building custom chatbots and training large language models. Contact us today and let us build a custom LLM that revolutionizes your business.