Evaluation and Benchmarking
Why Evaluation Is Hard
Traditional ML evaluation is straightforward: accuracy, F1, and AUC all have clear definitions. For generative models, though, "a good answer" is subjective.
The same question can have countless correct answers, so evaluation is less about "right or wrong" and more about "good or not."
Automated Metrics
Perplexity
How "surprised" the model is by test data. Lower is better.
```python
import math

# Hugging Face Trainer: eval_loss is the mean cross-entropy over the eval
# set, so perplexity is simply exp(loss).
eval_results = trainer.evaluate()
perplexity = math.exp(eval_results["eval_loss"])
print(f"Perplexity: {perplexity:.2f}")
```
Limitation: measures token prediction ability, not answer quality. Low perplexity doesn't necessarily mean good answers.
BLEU and ROUGE
These measure overlap between generated and reference text:
- BLEU: How many n-grams in the generated text appear in the reference
- ROUGE: How many n-grams in the reference appear in the generated text
```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
# Note: score(target, prediction) expects the REFERENCE first.
scores = scorer.score(
    "Reference answer here",
    "Model generated answer here",
)
print(scores)  # {'rouge1': Score(precision=..., recall=..., fmeasure=...), ...}
```
Major limitation: These only check word overlap, not semantics. "The cat sat on the mat" and "A feline rested on the rug" mean the same thing but score poorly.
Bottom line: BLEU and ROUGE are largely useless for LLM evaluation. Know them but don't rely on them.
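To see the limitation concretely, here is a minimal unigram-overlap scorer in plain Python (the idea behind ROUGE-1, simplified to set overlap with no stemming). The paraphrase pair from above shares only stopwords, so it scores poorly despite identical meaning:

```python
def rouge1_f(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1: F1 over the sets of unigrams (no stemming)."""
    ref = set(reference.lower().split())
    cand = set(candidate.lower().split())
    overlap = len(ref & cand)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Same meaning, almost no shared words: only "on" and "the" overlap.
print(rouge1_f("The cat sat on the mat", "A feline rested on the rug"))  # ≈ 0.36
```

An identical sentence pair scores 1.0; a perfect paraphrase scores near zero. Nothing in the metric knows that "feline" means "cat."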
LLM-as-Judge
Use a strong model to evaluate your fine-tuned model's output:
```python
import json

judge_prompt = """Rate the quality of this AI response from 1-5.

Dimensions:
- Accuracy: Is the information correct?
- Relevance: Does it answer the user's question?
- Completeness: Does it cover key points?
- Language quality: Is it clear and fluent?

User question: {question}
AI response: {response}

Provide a score (1-5) and brief reason. Output JSON:
{{"score": <score>, "reason": "<reason>"}}"""

def evaluate_with_llm(question, response, judge_model="gpt-4o"):
    # `llm.generate` stands in for your API client of choice.
    result = llm.generate(
        judge_prompt.format(question=question, response=response),
        model=judge_model,
    )
    return json.loads(result)
```
Comparative Evaluation
More reliable than absolute scoring — have the judge do A/B comparison:
```python
compare_prompt = """Here are two answers to the same question. Which is better?

Question: {question}

Answer A: {response_a}
Answer B: {response_b}

Which is better? Output "A", "B", or "tie" with a brief explanation."""
```
Comparative evaluation reduces scoring subjectivity — humans find it easier to judge "which is better" than "how good."
LLM Judge Limitations
- Position bias: Model may favor the first or last answer
- Self-preference: GPT-4 may prefer GPT-style responses
- Mitigation: Evaluate with A/B swapped and average the results
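The swap-and-average mitigation can be sketched as follows. `judge` is a hypothetical callable wrapping the comparison prompt above, assumed to return "A", "B", or "tie", where "A" means the first-listed answer won:

```python
def debiased_compare(judge, question, response_a, response_b):
    """Run the A/B judge twice with positions swapped to cancel position bias."""
    verdict_1 = judge(question, response_a, response_b)  # response_a listed first
    verdict_2 = judge(question, response_b, response_a)  # response_b listed first

    # Map each positional verdict back to the actual answer it favors.
    first_pick = {"A": "a", "B": "b", "tie": "tie"}[verdict_1]
    second_pick = {"A": "b", "B": "a", "tie": "tie"}[verdict_2]

    if first_pick == second_pick:
        return first_pick  # consistent winner across both orderings
    return "tie"           # orderings disagree: the margin is within position bias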
Human Evaluation
The most reliable but most expensive evaluation method.
Evaluation Protocol
Task: Evaluate customer support AI response quality
Each sample includes:
- User question
- AI response
Scoring dimensions (1-5 each):
1. Accuracy: Is the information correct?
2. Helpfulness: Does it resolve the user's issue?
3. Tone: Is it professional and friendly?
4. Conciseness: Does it avoid unnecessary verbosity?
Evaluator requirements:
- At least 2 people independently evaluate each sample
- Disagreements > 2 points require discussion to reach consensus
Practical Execution
- Randomly sample 50–100 items from the test set
- Generate responses from both the fine-tuned and baseline models
- Shuffle — evaluators don't know which is which (blind evaluation)
- Calculate scores and win rates
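The aggregation step above can be sketched like this. The record format is an assumption: one dict per annotated sample, with a blind winner verdict and optional 1-5 ratings per model:

```python
from collections import Counter

def summarize_blind_eval(records):
    """Aggregate blind pairwise annotations into win rate and mean scores.

    Each record is assumed to look like:
      {"winner": "fine_tuned" | "baseline" | "tie",
       "scores": {"fine_tuned": 4, "baseline": 3}}
    """
    verdicts = Counter(r["winner"] for r in records)
    n = len(records)
    decisive = n - verdicts["tie"]  # win rate is usually reported excluding ties
    win_rate = verdicts["fine_tuned"] / decisive if decisive else 0.0

    mean_scores = {}
    for model in ("fine_tuned", "baseline"):
        rated = [r["scores"][model] for r in records if "scores" in r]
        mean_scores[model] = sum(rated) / len(rated) if rated else None

    return {"n": n, "win_rate": win_rate, "ties": verdicts["tie"],
            "mean_scores": mean_scores}
```

Report the tie count alongside the win rate; a high tie rate on 50-100 samples often means the fine-tune made no practical difference.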
Task-Specific Evaluation Sets
Building a dedicated evaluation set for your specific task is the most valuable investment:
```python
eval_set = [
    {
        "input": "I bought a phone case and don't like it, can I return it?",
        "expected_behavior": [
            "Confirm refund is possible (within 30 days)",
            "Ask for order number",
            "Explain the refund process",
        ],
        "bad_behavior": [
            "Refuse the refund",
            "Try to upsell other products",
            "Respond with unrelated content",
        ],
    },
]

def evaluate_on_eval_set(model, eval_set):
    results = []
    for case in eval_set:
        response = model.generate(case["input"])
        # behavior_present(response, behavior) -> bool is a helper you supply,
        # e.g. keyword rules or a yes/no LLM check.
        expected_hits = sum(
            1 for behavior in case["expected_behavior"]
            if behavior_present(response, behavior)
        )
        bad_hits = sum(
            1 for behavior in case["bad_behavior"]
            if behavior_present(response, behavior)
        )
        results.append({
            "expected_score": expected_hits / len(case["expected_behavior"]),
            "bad_score": bad_hits / len(case["bad_behavior"]),
        })
    return results
```
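One way to implement the `behavior_present` helper is to reuse the LLM judge as a yes/no classifier; keyword matching is too brittle for behaviors like "ask for order number." A sketch, where `ask_llm` is a hypothetical callable that sends a prompt to your judge model and returns its raw text:

```python
def behavior_present(response: str, behavior: str, ask_llm) -> bool:
    """Ask a judge model whether `response` exhibits `behavior`.

    `ask_llm(prompt) -> str` is a placeholder for your judge-model call;
    in practice you would bind it once and pass the bound version around.
    """
    prompt = (
        "Does the following response exhibit this behavior?\n"
        f"Behavior: {behavior}\n"
        f"Response: {response}\n"
        'Answer with exactly "yes" or "no".'
    )
    answer = ask_llm(prompt)
    return answer.strip().lower().startswith("yes")
```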
A/B Testing
The ultimate test in production:
Traffic split:
- 50% users → fine-tuned model
- 50% users → baseline model
Monitor:
- User satisfaction scores
- Task completion rate
- Human escalation rate
- Response latency
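With a 50/50 split, the arms can be compared with a standard two-proportion z-test on any binary metric (e.g. task completion). A stdlib-only sketch using the normal approximation; the example counts are made up:

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided two-proportion z-test (normal approximation).

    Returns (z, p_value) for the difference in success rates between
    two A/B arms, e.g. task completion counts out of users served.
    """
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical numbers: 86% vs 80% task completion on 500 users per arm.
z, p = two_proportion_z(430, 500, 400, 500)
print(f"z = {z:.2f}, p = {p:.4f}")
```

The normal approximation is fine at these sample sizes; for small arms or rates near 0/1, prefer an exact test.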
Evaluation Strategy Summary
| Method | Cost | Reliability | When to Use |
|---|---|---|---|
| Perplexity | Low | Low | During training |
| LLM-as-Judge | Medium | Medium | Quick iteration |
| Task eval set | Medium | High | After each fine-tune |
| Human evaluation | High | High | Before key decisions |
| A/B testing | High | Highest | Before/after launch |
Recommended combo: loss during training + eval set after each fine-tune + LLM Judge + human spot-check before launch.
Key Takeaways
- BLEU/ROUGE are largely useless for LLM evaluation. Don't rely on them.
- LLM-as-Judge is the most practical evaluation method — use GPT-4 to evaluate fine-tuned model output. Comparative evaluation is more reliable than absolute scoring.
- Building a dedicated evaluation set is the highest-value investment. Clearly define what "good" and "bad" look like.
- Combine multiple evaluation methods — no single method reliably evaluates generation quality.
- A/B testing is the ultimate production test. No offline evaluation substitutes for real user feedback.