The landscape of AI and natural language processing has shifted dramatically with the advent of large language models (LLMs). This shift is characterized by advancements like Low-Rank Adaptation (LoRA) and its more advanced iteration, Quantized LoRA (QLoRA), which have transformed fine-tuning from a compute-intensive task into an efficient, scalable procedure.
The Advent of LoRA: A Paradigm Shift in LLM Fine-Tuning
LoRA represents a significant advancement in the fine-tuning of LLMs. Instead of updating every weight, LoRA freezes the large pre-trained model and adds small trainable adapter modules alongside its existing weight matrices, so only a small subset of parameters is trained. These adapters are low-rank matrices, which significantly reduces the computational burden and preserves the valuable pre-trained knowledge embedded within LLMs. The key aspects of LoRA include:
- Low-rank matrix structure: Each adapter matrix is shaped (r x d), where ‘r’ is a small rank hyperparameter and ‘d’ is the hidden dimension size. Because r is much smaller than d, there are far fewer trainable parameters.
- Factorization: The weight update is factorized into two smaller matrices, letting the model adapt to a new task with far fewer parameters.
- Scalability and adaptability: LoRA balances the model’s learning capacity and generalizability by scaling adapters with a parameter α and incorporating dropout for regularization.
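The three points above can be sketched in a few lines of plain Python. This is a toy illustration rather than the PEFT implementation: the names `down`, `up`, `lora_forward` and `trainable_params` are made up for this example, the weight is assumed to be a square (d x d) matrix, and dropout is left out for brevity.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, weight, down, up, alpha, rank):
    """Frozen path x @ weight plus the scaled low-rank path (x @ down) @ up."""
    base = matmul(x, weight)               # pre-trained weights, never updated
    update = matmul(matmul(x, down), up)   # only `down` and `up` would be trained
    scale = alpha / rank                   # the alpha scaling mentioned above
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]

def trainable_params(d, rank):
    """Parameter count for a full (d x d) update versus the two low-rank factors."""
    return {"full": d * d, "lora": 2 * d * rank}

random.seed(0)
d, rank, alpha = 6, 2, 4
weight = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
down = [[random.gauss(0, 0.01) for _ in range(rank)] for _ in range(d)]  # (d x r)
up = [[0.0] * d for _ in range(rank)]  # (r x d), zero init: adapter starts as a no-op
x = [[1.0] * d]

print(lora_forward(x, weight, down, up, alpha, rank) == matmul(x, weight))  # True
print(trainable_params(4096, 8))  # {'full': 16777216, 'lora': 65536}
```

For a hidden size of 4096 and rank 8, the two factors hold well under 1% of the parameters that a full update of the same matrix would need.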
Quantized LoRA (QLoRA): Efficient Fine-Tuning on Intel Hardware
QLoRA advances LoRA by introducing weight quantization, further reducing memory usage. This approach enables the fine-tuning of large models, such as the 70B-parameter Llama 2, on a single GPU like the Intel® Data Center GPU Max Series 1100 with 48 GB VRAM, which was previously considered impossible. QLoRA’s main features include:
- Memory efficiency: Through weight quantization, QLoRA substantially reduces the model’s memory footprint, crucial for handling large LLMs.
- On-the-fly dequantization: The quantized weights are temporarily dequantized for each computation, and gradients are computed only for the adapters during training.
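To make the quantize-then-dequantize idea concrete, here is a deliberately simplified sketch. It uses a uniform signed 4-bit code with one absmax scale per block, which is an assumption chosen for readability; it is not the NF4 data type that QLoRA actually uses, and the function names are hypothetical.

```python
def quantize_block(values, levels=7):
    """Store a block of floats as integers in [-levels, levels] plus one float scale."""
    absmax = max(abs(v) for v in values)
    scale = absmax / levels if absmax > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate floats right before they are needed in a computation."""
    return [c * scale for c in codes]

weights = [0.12, -0.40, 0.33, 0.05, -0.27, 0.08, 0.40, -0.01]
codes, scale = quantize_block(weights)      # what would be kept in memory
restored = dequantize_block(codes, scale)   # what a forward pass would compute with
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(max_error <= scale / 2 + 1e-12)  # rounding error is at most half a step
```

Only the small integer codes and one scale per block need to be stored; the restored floats are temporary, which is where the memory saving comes from.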
Fine-Tuning with QLoRA on Intel Hardware
The fine-tuning process starts with setting up the environment and installing the necessary packages: bigdl-llm for model loading, parameter-efficient fine-tuning (PEFT) for LoRA adapters, Intel® Extension for PyTorch* for training on Intel discrete GPUs, Hugging Face Transformers for fine-tuning, and datasets for loading the dataset. We will walk through the high-level process of fine-tuning an LLM to improve its capabilities. As an example, we will generate SQL queries from natural language input, focusing on general QLoRA fine-tuning. For detailed explanations, check out the full notebook, which takes you through setting up the required Python* packages, loading the model, fine-tuning, and running inference with the fine-tuned LLM to generate SQL from text, on Intel® Developer Cloud and also here.
Model Loading and Configuration for Fine-Tuning
The foundation model is loaded in a 4-bit format using bigdl-llm, significantly reducing memory usage. This step enables fine-tuning of large models like Llama 2 70B. For example:
import torch
from bigdl.llm.transformers import AutoModelForCausalLM

# Loading the model in a 4-bit format for efficient memory usage
model = AutoModelForCausalLM.from_pretrained(
    "model_id",  # Replace with your model ID
    load_in_low_bit="nf4",
    optimize_model=False,
    torch_dtype=torch.float16,
    modules_to_not_convert=["lm_head"],
)
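As a rough back-of-the-envelope check on why 4-bit loading matters, the weight memory can be estimated from the parameter count alone. The helper below is a hypothetical illustration; it counts only the model weights and ignores quantization constants, activations, adapter weights and optimizer state, so real usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9

params_70b = 70_000_000_000
fp16_gb = weight_memory_gb(params_70b, 16)
four_bit_gb = weight_memory_gb(params_70b, 4)

print(f"fp16 weights:  {fp16_gb:.0f} GB")      # fp16 weights:  140 GB
print(f"4-bit weights: {four_bit_gb:.0f} GB")  # 4-bit weights: 35 GB
```

Under these assumptions, the 4-bit weights of a 70B-parameter model fit within a 48 GB device, while the 16-bit weights do not.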
Learning Rate and Stability in Training
Selecting an appropriate learning rate is critical in QLoRA fine-tuning to balance training stability and convergence speed. A learning rate that is too high can destabilize training, with the training loss abnormally dropping to zero after a few steps.
from transformers import TrainingArguments

# Configuration for training
training_args = TrainingArguments(
    learning_rate=2e-5,  # Optimal starting point; adjust as needed
    per_device_train_batch_size=4,
    max_steps=200,
    # Additional parameters...
)
During the fine-tuning process, there is a notable rapid decrease in the loss after just a few steps, which then gradually levels off, reaching a value near 0.6 at approximately 300 steps as seen in the graph below:
Text-to-SQL Conversion: Prompt Engineering
With the fine-tuned model, we can convert natural language queries into SQL commands, a vital capability in data analytics and business intelligence. To fine-tune the model, we must first convert the data into a structured prompt like the one below, forming an instruction dataset with Input, Context and Response fields:
# Function to generate structured prompts for Text-to-SQL tasks
def generate_prompt_sql(input_question, context, output=""):
    return f"""You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.
You must output the SQL query that answers the question.
### Input:
{input_question}
### Context:
{context}
### Response:
{output}"""
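A short usage sketch may help show how the same template serves both training and inference. The record below and its field names (`question`, `context`, `answer`) are invented for illustration and are not taken from the notebook's dataset; the function is repeated only so that the snippet runs on its own.

```python
def generate_prompt_sql(input_question, context, output=""):
    # Same template as above, repeated so this snippet is self-contained
    return f"""You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.
You must output the SQL query that answers the question.
### Input:
{input_question}
### Context:
{context}
### Response:
{output}"""

# Hypothetical record; real datasets may use different field names
record = {
    "question": "How many employees work in the sales department?",
    "context": "CREATE TABLE employees (id INT, name VARCHAR, department VARCHAR)",
    "answer": "SELECT COUNT(*) FROM employees WHERE department = 'sales'",
}

# Training: the expected SQL is filled in so the model learns to produce it
train_prompt = generate_prompt_sql(record["question"], record["context"], record["answer"])

# Inference: the response is left empty for the model to complete
inference_prompt = generate_prompt_sql(record["question"], record["context"])

print(train_prompt)
```

Because the inference prompt is an exact prefix of the training prompt, the model sees the same layout at generation time that it saw during fine-tuning.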
Diverse Model Options
The notebook supports an array of models, each offering unique capabilities for different fine-tuning objectives:
Model Inference with QLoRA: A Comparative Approach
The true test of any fine-tuning process lies in its inference capabilities. In this implementation, the inference stage not only demonstrates the model’s proficiency in task-specific applications but also allows for a comparative analysis between the base and the fine-tuned models. This comparison sheds light on the effectiveness of the LoRA adapters in enhancing the model’s performance for specific tasks.
Model Loading for Inference
For inference, the model is loaded in a low-bit format, typically 4-bit, using the bigdl-llm library. This approach drastically reduces the memory footprint, making it feasible to run multiple LLMs with high parameter counts on a single resource-optimized device such as the Intel® Data Center GPU Max Series 1100. The following code snippet illustrates the model loading process for inference:
import torch
from bigdl.llm.transformers import AutoModelForCausalLM

# Loading the model for inference
model_for_inference = AutoModelForCausalLM.from_pretrained(
    "finetuned_model_path",  # Path to the fine-tuned model
    load_in_4bit=True,  # 4 bit loading
    optimize_model=True,
    use_cache=True,
    torch_dtype=torch.float16,
    modules_to_not_convert=["lm_head"],
)
Running Inference: Comparing Base vs Fine-Tuned Model
Once the model is loaded, we can perform inference to generate SQL queries from natural language inputs. This process can be conducted on both the base model and the fine-tuned model, allowing you to directly compare the outcomes and assess the improvements brought about by fine-tuning with QLoRA:
# Generating a SQL query from a text prompt
# (simplified: tokenizing the prompt and decoding the output are omitted here)
text_prompt = generate_prompt_sql(…)

# Base Model Inference
base_model_sql = base_model.generate(text_prompt)
print("Base Model SQL:", base_model_sql)

# Fine-Tuned Model Inference
finetuned_model_sql = finetuned_model.generate(text_prompt)
print("Fine-Tuned Model SQL:", finetuned_model_sql)
After a training session of just 15 minutes, the fine-tuned model demonstrates enhanced proficiency in generating SQL queries that reflect the given questions more accurately than the base model. With additional training steps, we can anticipate further improvements in the model’s response accuracy:
Finetuned Model:
Base Model:
LoRA Adapters: A Library of Task-Specific Enhancements
One of the most compelling aspects of LoRA is its ability to act as a library of task-specific enhancements. These adapters can be fine-tuned for distinct tasks and then saved. Depending on the requirement, a specific adapter can be loaded and used with the base model, effectively switching the model’s capabilities to suit different tasks. This adaptability makes LoRA a highly versatile tool in the realm of LLM fine-tuning.
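The idea of swapping adapters over one shared base model can be illustrated with a toy example. The `AdapterLibrary` class, its methods and the adapter names below are hypothetical and exist only for this sketch; they are not the PEFT API.

```python
def vecmat(x, m):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [sum(xi * row[j] for xi, row in zip(x, m)) for j in range(len(m[0]))]

class AdapterLibrary:
    """One frozen base weight plus any number of named low-rank adapters."""

    def __init__(self, base_weight):
        self.base_weight = base_weight
        self.adapters = {}
        self.active = None  # None means "plain base model"

    def add(self, name, down, up, scale=1.0):
        self.adapters[name] = (down, up, scale)

    def activate(self, name):
        if name is not None and name not in self.adapters:
            raise KeyError(f"unknown adapter: {name}")
        self.active = name

    def forward(self, x):
        out = vecmat(x, self.base_weight)
        if self.active is not None:
            down, up, scale = self.adapters[self.active]
            delta = vecmat(vecmat(x, down), up)
            out = [o + scale * d for o, d in zip(out, delta)]
        return out

library = AdapterLibrary(base_weight=[[1.0, 0.0], [0.0, 1.0]])
library.add("text_to_sql", down=[[1.0], [0.0]], up=[[0.0, 1.0]])
library.add("summarize", down=[[0.0], [1.0]], up=[[1.0, 0.0]])

x = [1.0, 2.0]
print(library.forward(x))        # [1.0, 2.0]  (base model only)
library.activate("text_to_sql")
print(library.forward(x))        # [1.0, 3.0]
library.activate("summarize")
print(library.forward(x))        # [3.0, 2.0]
```

The base weight is never modified, so switching tasks only changes which small pair of matrices is applied. In PEFT, the corresponding workflow is to load several named adapters onto one base model and select the active one with `set_adapter`.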
Check out the notebook on Intel Developer Cloud
I invite AI practitioners and developers to explore the full notebook on the Intel Developer Cloud, where you can experiment with and explore the capabilities of fine-tuning LLMs using QLoRA on Intel hardware with Intel AI software optimizations. Once you log into Intel Developer Cloud, go to the “Training Catalog”. Under “Gen AI Essentials” in the catalog, you can find the LLM fine-tuning notebook and other notebooks.
Repo
You can find the full code and other related notebooks here.