# The Training Process

## Training Tools Overview

Several mainstream tools exist for fine-tuning LLMs:

| Tool | Characteristics | Best For |
|---|---|---|
| Hugging Face (transformers + PEFT + TRL) | Most flexible, component-based | Full control scenarios |
| Axolotl | YAML config-driven, simplified workflow | Quick fine-tuning without much code |
| Unsloth | Speed and memory optimized | Limited resources, efficiency-focused |
| LLaMA Factory | Web UI, beginner-friendly | Prefer graphical interface |
Recommended path: start with Unsloth or Axolotl to get running, then learn the full Hugging Face stack for complete control.
## Quick Start with Unsloth

Unsloth is one of the most efficient fine-tuning tools: its maintainers report 2–5x faster training and roughly 60% lower memory use than standard implementations.
```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# 1. Load model (auto-applies QLoRA when load_in_4bit=True)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.1-8B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

# 2. Configure LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
)

# 3. Load data
dataset = load_dataset("json", data_files="training_data.jsonl")

# 4. Train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    args=TrainingArguments(
        output_dir="./output",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        warmup_steps=10,
        logging_steps=10,
        save_steps=100,
        fp16=True,
    ),
)
trainer.train()

# 5. Save the LoRA adapter
model.save_pretrained("./lora_adapter")
```
## Key Hyperparameters

### Learning Rate

Controls the magnitude of each parameter update. It is the most important hyperparameter in fine-tuning.

- Too high → unstable training; loss oscillates or diverges
- Too low → learning is too slow; needs more epochs

| Method | Recommended LR |
|---|---|
| Full fine-tuning | 1e-5 to 5e-5 |
| LoRA | 1e-4 to 3e-4 |
| QLoRA | 2e-4 (Unsloth default) |
### Epochs

Number of complete passes through the training data.

- Too few → underfitting; the model hasn't learned enough
- Too many → overfitting; the model "memorizes answers" instead of "learning patterns"
Recommended: 1–3 epochs. With lots of data, 1 may suffice. With less data, go up to 3. Beyond 5 usually means overfitting.
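
To sense-check epoch and batch settings, it helps to estimate how many optimizer steps a run will take. The sketch below assumes a single device; the function name and sample count are illustrative, not part of any library:

```python
import math

def total_training_steps(num_samples: int, per_device_batch: int,
                         grad_accum: int, epochs: int) -> int:
    """Optimizer steps for a full run (single device assumed)."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return steps_per_epoch * epochs

# 10,000 samples with the settings from the script above
print(total_training_steps(10_000, 4, 4, 3))  # → 1875
```

Comparing this number against `warmup_steps` and `save_steps` catches misconfigurations early (e.g. a warmup longer than the whole run).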
### Batch Size

How many samples are processed per parameter update, limited mainly by VRAM.

Effective batch size = `per_device_train_batch_size` × `gradient_accumulation_steps`:

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
# Effective batch size = 4 × 4 = 16
```
When VRAM can't handle a large batch, use gradient accumulation to simulate it.
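
The reason this simulation works: averaging per-micro-batch gradients (each scaled by the number of accumulation steps) reproduces the full-batch gradient exactly. A toy check with a hypothetical one-parameter loss:

```python
# Toy model: per-sample loss = (w - x)**2, so dloss/dw = 2 * (w - x).
w = 0.5
data = [i / 10 for i in range(16)]

def grad(batch):
    """Average gradient over one batch."""
    return sum(2 * (w - x) for x in batch) / len(batch)

full = grad(data)                          # one batch of 16

accum = 0.0
for i in range(0, 16, 4):                  # 4 micro-batches of 4 samples
    accum += grad(data[i:i + 4]) / 4       # scale by accumulation steps

print(abs(full - accum) < 1e-9)            # → True
```

The only cost is time: the 16 forward/backward passes happen sequentially instead of in one large batch.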
### Warmup

The learning rate gradually increases from 0 to the set value at the start of training. This prevents large early updates from destroying pre-trained knowledge.

Recommended: 3–10% of total steps, or 10–100 steps.
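
The warmup-then-decay shape can be sketched as a plain function. This mirrors the common linear-warmup/linear-decay schedule; it is an illustration, not the exact implementation any particular trainer uses:

```python
def lr_at_step(step: int, base_lr: float, warmup_steps: int,
               total_steps: int) -> float:
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = total_steps - step
    return base_lr * remaining / (total_steps - warmup_steps)

# With base_lr=2e-4, 10 warmup steps, 100 total steps:
print(lr_at_step(5, 2e-4, 10, 100))   # halfway through warmup → 1e-4
print(lr_at_step(10, 2e-4, 10, 100))  # warmup complete → 2e-4
```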
## Monitoring Training

### Loss Curves

The most important metric during training is loss.

Normal loss curve:
- Rapid decline → slow decline → plateau

Overfitting signals:
- Training loss keeps dropping
- Validation loss starts rising ← danger!

Underfitting signals:
- Loss drops very slowly or barely moves
- Final loss value is still high
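
The overfitting signal above (validation loss rising while training loss keeps falling) is exactly what early stopping watches for. A minimal sketch with a hypothetical patience-based rule:

```python
def detect_overfitting(val_losses, patience=3):
    """Flag the eval step where validation loss has failed to improve
    for `patience` consecutive evaluations."""
    best = float("inf")
    since_best = 0
    for step, loss in enumerate(val_losses):
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                return step  # eval index where we'd stop
    return None

# Validation loss bottoms out at 1.3, then climbs → stop at eval index 5
print(detect_overfitting([2.1, 1.6, 1.3, 1.4, 1.5, 1.7]))  # → 5
```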
### Practical Monitoring

```python
args = TrainingArguments(
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=100,
    load_best_model_at_end=True,
)
```

Visualize with TensorBoard or Weights & Biases:

```shell
tensorboard --logdir ./output/runs
```
## Common Issues

### Overfitting

Symptoms: Training loss is very low, but real-world performance is poor; the model tends to "copy-paste" responses from the training data.

Solutions: Fewer epochs, more data, higher dropout, lower learning rate, smaller LoRA rank.
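
On "smaller rank": LoRA's trainable parameter count scales linearly with `r`, because each adapted `(d_in, d_out)` weight gains two low-rank matrices totaling `r * (d_in + d_out)` parameters. A back-of-the-envelope calculation with hypothetical Llama-like shapes:

```python
def lora_params(rank: int, shapes) -> int:
    """Trainable params for LoRA: each (d_in, d_out) layer adds
    rank * (d_in + d_out) parameters (the A and B matrices)."""
    return sum(rank * (d_in + d_out) for d_in, d_out in shapes)

# Hypothetical: 4096-dim square projections, 32 layers, 4 matrices each
shapes = [(4096, 4096)] * (32 * 4)
print(lora_params(16, shapes))  # r=16 → 16,777,216
print(lora_params(8, shapes))   # r=8  → 8,388,608 (half the capacity to overfit)
```

Halving the rank halves the adapter's capacity, which is why it is one lever against overfitting.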
### Catastrophic Forgetting

Symptoms: The model improves on the target task, but general capabilities degrade noticeably: it forgets math, or can no longer write code.

Solutions: Mix general-purpose data into training, lower the learning rate, use fewer epochs, or use LoRA instead of full fine-tuning (LoRA naturally mitigates forgetting).
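
Mixing in general data can be as simple as sampling a fixed fraction of a general-purpose corpus into the task set. A minimal sketch; the record schema, 20% ratio, and function name are illustrative assumptions:

```python
import random

def mix_datasets(task_data, general_data, general_ratio=0.2, seed=0):
    """Blend general-purpose samples into task data so that roughly
    `general_ratio` of the final set is general-purpose."""
    n_general = int(len(task_data) * general_ratio / (1 - general_ratio))
    rng = random.Random(seed)
    mixed = task_data + rng.sample(general_data, min(n_general, len(general_data)))
    rng.shuffle(mixed)
    return mixed

task = [{"text": f"task-{i}"} for i in range(80)]
general = [{"text": f"general-{i}"} for i in range(100)]
mixed = mix_datasets(task, general, general_ratio=0.2)
print(len(mixed))  # 80 task + 20 general = 100
```

Shuffling matters: interleaving the two sources keeps every batch representative instead of training on them in phases.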
### Abnormal Loss Curves

- Loss oscillating: learning rate is too high; reduce it.
- Loss not decreasing: learning rate is too low, or the data format is wrong (the model isn't learning what you intended).
- Loss spikes: possibly anomalous data; check data quality.
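
For the "check data quality" step, a quick pass over the dataset can surface the records that most often cause spikes. The `{"prompt", "response"}` schema and the length threshold below are assumptions, not a standard:

```python
def find_suspect_samples(records, max_chars=8000):
    """Return indices of records likely to cause loss spikes:
    empty or missing fields, or abnormally long text."""
    bad = []
    for i, rec in enumerate(records):
        prompt = rec.get("prompt", "")
        response = rec.get("response", "")
        if not prompt.strip() or not response.strip():
            bad.append(i)
        elif len(prompt) + len(response) > max_chars:
            bad.append(i)
    return bad

records = [
    {"prompt": "What is LoRA?", "response": "A parameter-efficient method."},
    {"prompt": "", "response": "orphan answer"},   # empty prompt
    {"prompt": "x", "response": "y" * 20_000},     # abnormally long
]
print(find_suspect_samples(records))  # → [1, 2]
```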
## Key Takeaways
- Start with Unsloth or Axolotl — they simplify configuration, getting fine-tuning running in dozens of lines.
- Learning rate is the most important hyperparameter. LoRA: 2e-4, full fine-tuning: 2e-5.
- 1–3 epochs is usually sufficient. Beyond 5 likely means overfitting.
- Loss curves are the core tool for judging training status. Training loss down + validation loss up = overfitting.
- Catastrophic forgetting is a common fine-tuning pitfall. Mitigate with LoRA, low learning rate, and mixing in general data.