Fine-tuning large language models (LLMs) such as GPT-3, Llama, and Mistral is essential for adapting them to specific NLP tasks. However, full fine-tuning, which updates every parameter of the pre-trained model, is often impractical: the sheer number of parameters makes it computationally expensive and storage-intensive. GPT-3, for example, has 175 billion trainable parameters, so fully fine-tuning it is inconvenient at best and infeasible at worst. Low-Rank Adaptation (LoRA) offers a more efficient alternative: it dramatically reduces the number of trainable parameters without compromising model performance and, unlike adapters, adds no additional inference latency.
Fine-tuning for a downstream task
In this blog, I will try to explain what I learned while reading the paper LoRA: Low-Rank Adaptation of Large Language Models.
Following will be the index of the blog:
- Introduction to Low-Rank Adaptation (LoRA)
- Theory and Concept of LoRA
- Mathematical Background
- Application of LoRA in Large Language Models
- Example Calculation
- Implementation of LoRA in Python
- Advantages of LoRA
- How does LoRA achieve a 99% reduction in trainable parameters? Explained with Example
- Conclusion
- References
Introduction to Low-Rank Adaptation (LoRA)
Fine-tuning a large language model requires updating all of its parameters, which becomes increasingly infeasible as model sizes grow and new, even larger models are released at a rapid pace. LoRA tackles this issue by freezing the pre-trained model weights and injecting trainable low-rank decomposition matrices into the layers of the Transformer architecture (which layers to adapt is a design choice and can be restricted to particular layers depending on the computation budget). This approach dramatically reduces the number of trainable parameters required for downstream tasks.
Challenges of Full Fine-Tuning
Full fine-tuning involves retraining all the parameters of a pre-trained model. For large models like GPT-3, which has 175 billion parameters, this process is resource-intensive. Storing multiple fine-tuned versions for different tasks can be impractical due to the enormous storage requirements. Moreover, the computational cost of updating and maintaining such large models is prohibitive, making it difficult for smaller organizations to utilize these powerful models.
Need for Efficient Adaptation Techniques
Efficient adaptation techniques are crucial to make the benefits of large pre-trained models accessible to a broader audience. LoRA addresses this need by offering a method that reduces the computational and storage demands of fine-tuning while preserving, or even improving (as claimed in the original paper's abstract), the model's performance on specific tasks.
Theory and Concept of LoRA (explained as simply as possible)
Explanation of LoRA
LoRA stands for Low-Rank Adaptation. It works by freezing the pre-trained weights of a model and introducing trainable low-rank matrices that adapt the model to new tasks. This method not only reduces the number of trainable parameters but also maintains high model quality, often performing on par with or better than full fine-tuning.
LoRA Explained
Importance in Reducing the Number of Trainable Parameters
The key advantage of LoRA is its ability to minimize the number of parameters that need to be trained. By focusing on low-rank adaptations, LoRA enables efficient training with significantly fewer resources, which is crucial for deploying large models such as GPT-3.
Image from [3]
How LoRA Integrates with Existing Architectures
Although LoRA can be applied to the dense layers of any neural network, the authors focused on Transformer language models, as shown in the image below. LoRA integrates seamlessly with existing Transformer architectures: it injects trainable rank-decomposition matrices into each layer without altering the fundamental structure of the original architecture. This allows for easy implementation and compatibility with a wide range of pre-trained models.
Image from [3]
Comparison with Other Adaptation Techniques
Other adaptation techniques, such as adapter layers and prefix tuning, also aim to reduce the number of trainable parameters. However, these methods often introduce additional inference latency and computational overhead because of the extra layers added to the network, as shown in the image below, and they do not always match the performance of full fine-tuning. LoRA overcomes these limitations by maintaining the original model's inference speed while achieving competitive performance with a fraction of the trainable parameters.
Image from [refs]
Mathematical Background
Detailed Explanation of Low-Rank Matrix Decomposition
Taking inspiration from [1], a neural network typically contains many dense layers with full-rank weight matrices. When adapting these models to specific tasks, we hypothesize that the updates to these weights have a low "intrinsic rank." For a pre-trained weight matrix W₀ ∈ ℝ^{d × k}, LoRA constrains its update ΔW by representing it with a low-rank decomposition.
Wₙ = W₀ + ΔW = W₀ + BA
where B ∈ ℝ^{d × r} and A ∈ ℝ^{r × k}, with r ≪ min(d, k) (per the paper's experiments, r can be as low as 1).
Image from [3]
Let’s understand it in simple terms,
Let’s consider a layer with pre-trained weights W and an input x. The output of this layer can be expressed as h = Wx
As mentioned earlier, we will freeze the pre-trained weights W, meaning they will not be updated during backpropagation. Instead, we introduce an additional term ΔW, and all updates will be applied to ΔW. The new equation becomes:
h = (W + ΔW)x = Wx + ΔWx
Here, W ∈ ℝ^{d × k} and ΔW ∈ ℝ^{d × k}.
Now, we have two options for handling ΔW :
- Update the entire ΔW matrix, which would require substantial resources and computational power and is essentially the same as full fine-tuning.
- Decompose ΔW into two smaller matrices using the LoRA technique.
The first option is essentially the same as the basic method and is resource-intensive. The second option is more efficient: we decompose ΔW into two smaller matrices B and A such that [2]
ΔW=BA
where, B ∈ ℝ^{d × r} A ∈ ℝ^{r × k}, and r is much smaller than both d and k.
Image by [Image ref ]
An important point regarding r: as we increase r, the number of trainable parameters grows, and training LoRA roughly converges to training the original model.
Image from [3]
Since the matrices A and B are much smaller than ΔW, updating them requires less memory and computational power during backpropagation. Given that the weights have an intrinsically low rank, we don’t need a lot of information to represent them. We use the rank r as a hyperparameter to determine the rank of the decomposed matrices.
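A tiny NumPy sketch (my own toy example, with arbitrary sizes) makes this tangible: the product of a d × r matrix and an r × k matrix is a full d × k matrix whose rank is at most r, so it can be represented and updated with far fewer stored values.
import numpy as np

d, k, r = 512, 512, 8                   # toy dimensions for illustration

B = np.random.randn(d, r)               # d x r
A = np.random.randn(r, k)               # r x k
delta_W = B @ A                         # d x k update, but rank at most r

print(np.linalg.matrix_rank(delta_W))   # 8
print(delta_W.size, B.size + A.size)    # 262144 vs 8192 stored values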
Key points to note:
- During training, matrix A is initialized with random Gaussian values, and matrix B is initialized with zeros. This makes ΔW = BA initially zero. The term BAx is scaled by α/r, where α is a constant related to r (typically set to the initial value of r and not tuned). This scaling reduces the need to retune hyperparameters when r changes.
- At inference time, the updated weights can be merged with the base pre-trained weights, resulting in zero additional latency (a minimal code sketch of these mechanics follows this list).
- If there are multiple use cases fine-tuned on the same base model, we can swap the updated weights for each use case without reloading the base model.
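To make these points concrete, here is a minimal sketch (my own illustrative code, not the paper's or peft's implementation) of a LoRA-wrapped linear layer in PyTorch: the pre-trained weight is frozen, A starts from a Gaussian, B starts at zero, the update is scaled by α/r, and the adapter can be merged into the base weight for zero-latency inference.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Wraps a frozen linear layer with a trainable low-rank update (illustrative sketch).
    def __init__(self, base: nn.Linear, r: int = 4, alpha: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                         # freeze pre-trained W (and bias)
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # A: random Gaussian init
        self.B = nn.Parameter(torch.zeros(d_out, r))        # B: zeros, so BA = 0 at start
        self.scaling = alpha / r                            # scale the update by alpha/r

    def forward(self, x):
        # h = Wx + (alpha/r) * BAx ; only A and B receive gradients
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

    @torch.no_grad()
    def merge(self):
        # Fold BA into the frozen weight so inference adds zero extra latency
        self.base.weight.data += self.scaling * (self.B @ self.A)

# Example: adapt a 768x768 layer with rank 4
layer = LoRALinear(nn.Linear(768, 768), r=4, alpha=4)
h = layer(torch.randn(2, 768))   # forward pass during training
layer.merge()                    # merge for deployment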
Application of LoRA in Large Language Models
Layers and Weights Targeted
We can apply LoRA to any subset of weight matrices in a neural network to reduce the number of trainable parameters. In the Transformer architecture, there are four weight matrices in the self-attention module (W_q, Wₖ, Wᵥ, Wₒ) and two in the MLP module. We treat W_q (or Wₖ, Wᵥ) as a single matrix of dimension d_model ×d_model, even though the output dimension is usually sliced into attention heads.[3]
In the original paper, the authors focused on adapting only the attention weights (W_q, Wₖ, Wᵥ, Wₒ) for downstream tasks and froze the MLP modules (so they are not trained on downstream tasks), both for simplicity and parameter efficiency.
In their experiments, adapting the query (W_q) and value (Wᵥ) matrices together gave excellent results, whereas adapting only the query or only the key weights performed noticeably worse. Under the same parameter budget, the rank r was set to 4 when adapting two weight matrices at a time and to 8 when adapting a single weight matrix.
Image from [3]
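As a quick illustration of this choice, a peft LoraConfig that adapts only the query and value projections with r = 4 might look like the sketch below; the module names q_proj and v_proj are the ones used by Llama-style models in Hugging Face and are an assumption here, so check the module names of your own model.
from peft import LoraConfig

# Adapt only W_q and W_v with a small rank, following the paper's finding
# that this combination performs well; module names assume a Llama-style model.
lora_config_qv = LoraConfig(
    r=4,                                 # rank when adapting two weight types
    lora_alpha=4,                        # alpha typically set close to r
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)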
Example Calculation
Step-by-Step Calculation Process of Applying LoRA
Let’s consider a hypothetical example to illustrate the process:
- Original Weight Matrix: W ∈ ℝ^{768 × 768}
- Rank-Decomposed Matrices: B ∈ ℝ^{768 × 4} and A ∈ ℝ^{4 × 768}
- Initialization: B is initialized to zeros and A to random values.
- Update Rule: During training, update only A and B, keeping W frozen.
Hypothetical Example with a Dataset
Suppose we have a model with dimensions d=768 and k=768, and we choose r=4. The original weight matrix W has 589,824 parameters, while the rank-decomposed matrices B and A have only 6,144 parameters combined. This represents a significant reduction.
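A quick back-of-the-envelope check of these numbers in plain Python:
d, k, r = 768, 768, 4

full_params = d * k                # parameters in the full weight matrix W
lora_params = d * r + r * k        # parameters in B (d x r) plus A (r x k)

print(full_params)                 # 589824
print(lora_params)                 # 6144
print(f"{lora_params / full_params:.2%}")   # ~1.04% of the original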
Implementation of LoRA in Python
The good news is that to apply LoRA to an LLM we do not have to implement any of this from scratch: the peft library already implements LoRA, and we only need to pass the appropriate configuration values. Below is example code that applies LoRA using the peft library.
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from peft import (
    get_peft_model,
    prepare_model_for_kbit_training,
    LoraConfig,
)
from trl import SFTTrainer

# Set the model name and load the pre-trained model (in 8-bit) and tokenizer
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load a dialogue-summarization dataset; any dataset with "dialogue" and
# "summary" fields will work (samsum is used here only as an example)
dataset = load_dataset("samsum")
data_train, data_val = dataset["train"], dataset["validation"]

# Function to generate a prompt for the model
def generate_prompt(dialogue, summary=None, eos_token="</s>"):
    instruction = "Summarize the following:\n"
    input_text = f"{dialogue}\n"
    summary = f"Summary: {summary + ' ' + eos_token if summary else ''} "
    prompt = " ".join([instruction, input_text, summary])
    return prompt

# Example of generating a prompt from the training data
print(generate_prompt(data_train[0]["dialogue"], data_train[0]["summary"]))

# Tokenize an input prompt and generate output with the base (not yet fine-tuned) model
input_prompt = generate_prompt(data_train[50]["dialogue"])
input_tokens = tokenizer(input_prompt, return_tensors="pt")["input_ids"].to("cuda")
with torch.cuda.amp.autocast():
    generation_output = model.generate(
        input_ids=input_tokens,
        max_new_tokens=1000,
        do_sample=True,
        top_k=10,
        top_p=0.9,
        temperature=0.3,
        repetition_penalty=1.15,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
    )
op = tokenizer.decode(generation_output[0], skip_special_tokens=True)
print(op)

# LoRA configuration to decompose ΔW into smaller matrices
lora_config = LoraConfig(
    r=8,                     # rank of the decomposition
    lora_alpha=8,            # scaling factor alpha
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)

# Adjust the tokenizer and prepare the model for LoRA training
tokenizer.add_special_tokens({"pad_token": "<PAD>"})
model.resize_token_embeddings(len(tokenizer))
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)

# Define training arguments
output_dir = "practise"
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
per_device_eval_batch_size = 4
eval_accumulation_steps = 4
optim = "adamw_torch"
save_steps = 10
logging_steps = 10
learning_rate = 5e-4
max_grad_norm = 0.3
max_steps = 50
warmup_ratio = 0.03
evaluation_strategy = "steps"
lr_scheduler_type = "constant"

training_args = transformers.TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    evaluation_strategy=evaluation_strategy,
    save_steps=save_steps,
    learning_rate=learning_rate,
    logging_steps=logging_steps,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
    ddp_find_unused_parameters=False,
    eval_accumulation_steps=eval_accumulation_steps,
    per_device_eval_batch_size=per_device_eval_batch_size,
)

# Function to format training prompts
def formatting_func(prompt):
    output = []
    for d, s in zip(prompt["dialogue"], prompt["summary"]):
        output.append(generate_prompt(d, s))
    return output

# Set up the SFTTrainer with the model, datasets, and training arguments
trainer = SFTTrainer(
    model=model,
    train_dataset=data_train,
    eval_dataset=data_val,
    peft_config=lora_config,
    formatting_func=formatting_func,
    max_seq_length=1024,
    tokenizer=tokenizer,
    args=training_args,
)

# Pre-process the model by upcasting the layer norms to float32 for more stable training
for name, module in trainer.model.named_modules():
    if "norm" in name:
        module = module.to(torch.float32)

# Train the model and save the final version
trainer.train()
trainer.save_model(f"{output_dir}/final")

from peft import PeftModel

# Load the fine-tuned adapter for inference
peft_model_id = "practise/checkpoint-40"
peft_model = PeftModel.from_pretrained(
    model,
    peft_model_id,
    torch_dtype=torch.float16,
    offload_folder="lora_results/lora_7/temp",
)

# Generate output using the fine-tuned model
input_prompt = generate_prompt(data_train[50]["dialogue"])
input_tokens = tokenizer(input_prompt, return_tensors="pt")["input_ids"].to("cuda")
with torch.cuda.amp.autocast():
    generation_output = peft_model.generate(
        input_ids=input_tokens,
        max_new_tokens=100,
        do_sample=True,
        top_k=10,
        top_p=0.9,
        temperature=0.3,
        repetition_penalty=1.15,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
    )
op = tokenizer.decode(generation_output[0], skip_special_tokens=True)
print(op)
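As a side note, once training is finished the LoRA update can be folded back into the base weights so that inference carries no extra latency. With peft this is a one-liner, sketched below for the peft_model above; note that merging assumes the base weights are in a regular (non-quantized) dtype, so you may need to reload the base model in fp16 before merging.
# Merge the LoRA update BA into the base weights (W <- W + BA) and drop the
# adapter modules; the merged model behaves like a plain transformers model.
merged_model = peft_model.merge_and_unload()
merged_model.save_pretrained(f"{output_dir}/merged")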
Advantages of LoRA
- Parameter Reduction: LoRA drastically reduces the number of parameters that need to be trained, often down to 0.01% of the original model’s parameters. This reduction is achieved by using the low-rank adaptations, making it feasible to fine-tune large models with limited resources.
- Computational Efficiency: Training becomes faster and requires less memory, reducing the hardware barrier and allowing for more parallel experiments. This efficiency enables researchers and practitioners to experiment with large models without incurring prohibitive costs.
- Avoiding catastrophic forgetting: Full fine-tuning updates every weight and can overwrite the model's general knowledge as it specializes in a single direction (catastrophic forgetting). Because LoRA keeps the base weights frozen and trains only small task-specific adapters, the original knowledge of the model is preserved.
- Multi-task flexibility: With a single base model, we can maintain multiple LoRA adapters for different tasks and swap in whichever adapter the task at hand requires.
Image from [3]
How does LoRA achieve a 99% reduction in trainable parameters?
The significant reduction in the number of trainable parameters when using LoRA comes from the fact that LoRA replaces the full-weight matrix updates with much smaller low-rank matrices. Here’s a detailed explanation of how this reduction is achieved:
Weight Decomposition in LoRA
In a transformer model, each weight matrix W typically has dimensions d_model×d_model (for simplicity, assume square matrices). When using LoRA, each update to W is decomposed into two smaller matrices A and B:
- B has dimensions d_model × r
- A has dimensions r × d_model
where r is the rank, which is much smaller than d_model.
Comparison of Parameter Counts
Without LoRA:
- The number of parameters in the full-weight matrix W is d_model × d_model.
With LoRA:
- The number of parameters in the low-rank matrices A and B is d_model × r + r × d_model = 2 × d_model × r.
Reduction Factor
Given that r is much smaller than d_model (e.g., r might be 1/100th or even less of d_model), the number of parameters in A and B is drastically reduced. Let’s calculate the reduction factor:
- Without LoRA: d_model × d_model
- With LoRA: 2 × d_model × r
The ratio of parameters with LoRA to the original number of parameters is:
(2 × d_model × r) / (d_model × d_model) = 2r / d_model
Example Calculation
Suppose d_model =1024 and r = 4:
- Without LoRA: 1024 × 1024=1,048,576 parameters per weight matrix.
- With LoRA: 2×1024×4=8192 parameters per weight matrix.
The reduction factor is 8,192 / 1,048,576 ≈ 0.0078, meaning the low-rank matrices hold only about 0.78% of the original parameters, or approximately a 99.22% reduction in the number of parameters.
Applying LoRA to All Query Weights
When applying LoRA to the query weights in all layers, this reduction is consistent across the entire model. If a transformer has 12 layers and each layer has separate query, key, value, and output projection weights, applying LoRA to only the query weights would still result in a significant reduction. Here’s why:
- Each layer’s query weights are reduced by approximately 99% due to the low-rank decomposition.
- Since only the query weights are being adapted (not the key, value, or output projection weights), the overall reduction is proportional to the fraction of the total parameters they represent.
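To see how this plays out across a whole model, here is a rough, assumed-numbers calculation (12 layers, d_model = 1024, r = 4, LoRA on the query projection only, ignoring embeddings and other weights):
n_layers, d_model, r = 12, 1024, 4

# Attention weights per layer: W_q, W_k, W_v, W_o, each d_model x d_model
total_attn_params = n_layers * 4 * d_model * d_model    # 50,331,648

# LoRA on W_q only: B (d_model x r) + A (r x d_model) per layer
lora_params = n_layers * 2 * d_model * r                # 98,304

print(f"{lora_params / total_attn_params:.4%}")         # ~0.1953% of the attention weights
# i.e. roughly a 99.8% reduction on this subset of the parameters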
Practical Considerations
In practice, the total number of trainable parameters includes contributions from all parts of the model (e.g., embeddings, layer norms, etc.). However, the dramatic reduction in the weights of the largest matrices (e.g., the attention mechanism weights) often dominates the overall parameter count.
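In practice, the easiest way to check the actual number is to ask peft directly: after wrapping the model with get_peft_model (as in the training code above), the following call reports the trainable versus total parameter counts.
# Prints something like:
# trainable params: ... || all params: ... || trainable%: ...
# The exact figures depend on the base model and the LoraConfig used.
model.print_trainable_parameters()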
Conclusion
I would like to conclude this blog by paraphrasing the concluding words of the original LoRA paper:
Fine-tuning large language models is often prohibitively expensive due to the hardware requirements and the storage costs of hosting multiple instances for different tasks. LoRA addresses this with an efficient adaptation strategy that maintains high model quality without introducing inference latency or reducing the usable input sequence length. LoRA also enables rapid task-switching by sharing most of the model parameters, making it well suited for deployment as a service. While the paper focuses on Transformer language models, the principles of LoRA can be applied to any neural network with dense layers.
References
[1] https://arxiv.org/abs/2012.13255
[2] https://medium.com/@lokeshtodwal/demystifying-lora-q-lora-ea267abff48
[3] https://arxiv.org/abs/2106.09685
Thank you for taking the time to read and engage with this article. Your support in the form of following me and clapping on the article is highly valued and appreciated. If you have any queries or doubts about the content of this article, please do not hesitate to reach out to me via email at manindersingh120996@gmail.com. You can also connect with me on LinkedIn.