# LoRA and Parameter-Efficient Fine-Tuning

## The Problem with Full Fine-Tuning
The most straightforward approach is full fine-tuning — updating all model parameters.
But there's a major problem: full fine-tuning a 7B model requires about 60 GB of VRAM (FP16 weights plus gradients and optimizer states), far beyond most consumer GPUs. A 70B model? You'd need a GPU cluster.
Is there a way to achieve similar results with fewer resources?
## LoRA: Low-Rank Adaptation
LoRA is the most popular parameter-efficient fine-tuning method. Its core idea is elegant:
> Don't modify the original model weights; add a small set of trainable parameters alongside them.
### Intuitive Understanding
Imagine a model weight matrix W as a huge matrix, say 4096×4096. Full fine-tuning updates all of its roughly 16.8 million parameters.
LoRA's approach: decompose the update into a product of two small matrices.
```
Original matrix W:  4096 × 4096  → 16.8M parameters

LoRA update ΔW = A × B
  A: 4096 × 16   →  65K parameters
  B: 16 × 4096   →  65K parameters

A + B total: ~131K parameters ← only 0.8% of the original!
```
The 16 here is the rank — LoRA's key hyperparameter. Higher rank means more trainable parameters and more expressive power, but also higher training cost.
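The arithmetic above generalizes to any matrix shape and rank. A quick sketch (the helper name is my own):

```python
# Hypothetical helper: parameters added by one LoRA pair (A: d_in x r, B: r x d_out).
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    return d_in * rank + rank * d_out

full = 4096 * 4096  # parameters in the original W
for r in (8, 16, 32, 64):
    p = lora_params(4096, 4096, r)
    print(f"rank={r:>2}: {p:>9,} params ({100 * p / full:.1f}% of full)")
```

Note that the parameter count grows linearly with rank, while the original matrix grows quadratically with its dimensions; that gap is where the savings come from.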
### Why This Works
Research found that weight changes during fine-tuning have low-rank properties — despite billions of parameters, the actual "effective degrees of freedom" needed for fine-tuning are far fewer than the total parameter count. LoRA exploits exactly this property.
## QLoRA: Going Further
QLoRA combines quantization with LoRA, further reducing memory requirements:
- Quantize the base model to 4-bit (dramatically reduces memory)
- Apply LoRA on top of the quantized model
- Only LoRA parameters are trained (in FP16 precision)
```
Full fine-tuning, 7B model:  ~60 GB VRAM
LoRA fine-tuning, 7B model:  ~16 GB VRAM
QLoRA fine-tuning, 7B model:  ~6 GB VRAM  ← a single RTX 4060 works!
```
QLoRA makes fine-tuning 7B or even 13B models on consumer hardware possible.
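A back-of-the-envelope calculation makes these numbers plausible (my own simplification: weight storage only, ignoring gradients, optimizer states, and activations, which is exactly where full fine-tuning's extra cost comes from):

```python
# Weight-only memory estimate; a rough sketch, not a measurement.
# Training adds gradients, optimizer states, and activations on top,
# which is why full FP16 fine-tuning needs far more than the 14 GB below.
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    return n_params * bits_per_param / 8 / 1e9

n = 7e9  # 7B parameters
print(f"FP16 weights:          {weight_memory_gb(n, 16):.1f} GB")  # 14.0 GB
print(f"4-bit weights (QLoRA):  {weight_memory_gb(n, 4):.1f} GB")  # 3.5 GB
```

The 4-bit base leaves room for the (tiny) LoRA parameters, activations, and optimizer state on a single consumer card, which is roughly where the ~6 GB figure comes from.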
## Three Approaches Compared

| | Full Fine-tuning | LoRA | QLoRA |
|---|---|---|---|
| Trainable params | 100% | 0.1–1% | 0.1–1% |
| 7B model VRAM | ~60 GB | ~16 GB | ~6 GB |
| Training speed | Slowest | Fast | Slower than LoRA (dequantization overhead) |
| Quality | Best (theoretically) | Near full FT | Slightly below LoRA |
| Hardware | Multi-GPU | Single GPU (16GB+) | Single GPU (8GB+) |
| Best for | Ample resources | Most scenarios | Limited resources |
For most developers, QLoRA is the most practical choice.
## Key LoRA Hyperparameters

### Rank (r)
Determines LoRA's "capacity."
| Rank | Trainable Params | Use Case |
|---|---|---|
| 8 | Minimal | Simple tasks (style adjustment) |
| 16 | Moderate | Most tasks |
| 32 | More | Complex tasks |
| 64+ | Many | Approaching full fine-tuning |
Start with 16. Increase if results aren't good enough.
### Alpha (α)

A scaling factor controlling the strength of the LoRA update; the effective update applied to the weights is

```
effective update = ΔW × (alpha / rank)
```

A common heuristic is setting alpha to twice the rank (e.g. alpha = 32 for r = 16).
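A tiny numeric sketch of this scaling rule (made-up 2×2 update matrix; with alpha = 2 × rank, every entry of ΔW is simply doubled):

```python
# Toy illustration of LoRA's alpha/rank scaling (values are made up).
rank, alpha = 16, 32
scale = alpha / rank                     # = 2.0

delta_w = [[0.1, -0.2],
           [0.3,  0.05]]                 # stands in for A × B
effective = [[scale * x for x in row] for row in delta_w]
print(effective)                         # every entry doubled
```

Because the update is divided by the rank before alpha is applied, raising the rank alone doesn't blow up the update magnitude; alpha sets the overall strength independently.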
### Target Modules
Which modules LoRA applies to. Transformers have various weight matrices:
```python
target_modules = [
    "q_proj",     # Query projection
    "k_proj",     # Key projection
    "v_proj",     # Value projection
    "o_proj",     # Output projection
    "gate_proj",  # FFN gate
    "up_proj",    # FFN up projection
    "down_proj",  # FFN down projection
]
```
At minimum, include q_proj and v_proj. Attention layer projections have the most significant impact. If resources allow, include all linear layers.
### Dropout
Regularization to prevent overfitting. Typically 0.05–0.1. Increase slightly with small datasets.
## A Complete LoRA Configuration

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
```
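To see what this config translates to in trainable parameters, here's a rough count assuming Llama-3.1-8B-like shapes (hidden size 4096, intermediate size 14336, 32 layers, 1024-dim k/v projections from grouped-query attention; these dimensions are my approximation, check the model's config.json for the real values):

```python
# Rough trainable-parameter count for r=16 over all seven target modules.
# Shapes below are assumed Llama-3.1-8B-like dims, not read from the model.
hidden, inter, layers, r = 4096, 14336, 32, 16

module_shapes = {
    "q_proj":    (hidden, hidden),
    "k_proj":    (hidden, 1024),   # smaller due to grouped-query attention
    "v_proj":    (hidden, 1024),
    "o_proj":    (hidden, hidden),
    "gate_proj": (hidden, inter),
    "up_proj":   (hidden, inter),
    "down_proj": (inter, hidden),
}

# Each LoRA pair adds r * (d_in + d_out) parameters per module.
per_layer = sum(r * (d_in + d_out) for d_in, d_out in module_shapes.values())
total = per_layer * layers
print(f"{total:,} trainable params (~{100 * total / 8e9:.2f}% of 8B)")
```

That lands around 42M trainable parameters, roughly half a percent of the base model, consistent with the 0.1–1% range in the comparison table.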
## Using LoRA Weights

After training, the LoRA adapter saves as a small file (typically tens of MB):

```
base_model/                    # original model (several GB)
lora_adapter/                  # LoRA weights (tens of MB)
    adapter_config.json
    adapter_model.safetensors
```
Two usage modes:
1. Dynamic loading (overlay at inference)
```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model = PeftModel.from_pretrained(model, "path/to/lora_adapter")
```
Advantage: one base model can pair with multiple LoRAs, switching on demand.
2. Merge weights (before deployment)
```python
merged_model = model.merge_and_unload()
merged_model.save_pretrained("merged_model/")
```
Advantage: no inference overhead, can be quantized to GGUF for deployment.
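Merging is exact, not an approximation: since x·(W + ΔW) = x·W + x·ΔW, folding the scaled update into the base weights gives identical outputs to applying the adapter at inference time. A small NumPy check with toy shapes of my own choosing:

```python
import numpy as np

# Toy demonstration that merging LoRA weights is mathematically exact.
rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 4
W = rng.normal(size=(d, d))   # base weight
A = rng.normal(size=(d, r))   # LoRA down-projection
B = rng.normal(size=(r, d))   # LoRA up-projection
x = rng.normal(size=(d,))     # an input vector

adapter_out = x @ W + (alpha / r) * (x @ A @ B)   # dynamic overlay at inference
merged_out  = x @ (W + (alpha / r) * A @ B)       # merged-weight path
print(np.allclose(adapter_out, merged_out))       # True
```

This is why the merged model can then be quantized or converted (e.g. to GGUF) like any ordinary checkpoint: after merging, nothing about the weights reveals that LoRA was ever involved.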
## Key Takeaways
- LoRA is the current mainstream fine-tuning method — achieves near full fine-tuning quality with less than 1% of parameters.
- QLoRA further lowers the barrier — fine-tune 7B models with an 8GB GPU. Most practical for developers.
- Key hyperparameters: rank=16, alpha=32 is a good starting point. Target modules should include at least q_proj and v_proj.
- LoRA weights are small (tens of MB) and hot-swappable — one base model can serve multiple task-specific LoRAs.