Dataset Preparation
Quality Over Quantity
The #1 factor in fine-tuning results isn't model size or training hyperparameters — it's data quality.
100 high-quality samples typically outperform 1,000 low-quality ones. High quality means:
- Accurate input-output correspondence
- Outputs represent your ideal responses
- Consistent style and format
- No errors or contradictions
Data Formats
Instruction Tuning Format
The most common format. Each sample has an instruction and expected output:
```json
{
  "instruction": "Translate the following text to formal business English",
  "input": "We wanna talk about working together",
  "output": "We would like to discuss potential collaboration opportunities with your organization."
}
```
Chat Format
Better for conversational applications:
```json
{
  "conversations": [
    {"role": "system", "content": "You are a professional customer support assistant"},
    {"role": "user", "content": "I want a refund"},
    {"role": "assistant", "content": "I'd be happy to help with your refund. Could you please provide your order number? And may I ask the reason for the refund?"}
  ]
}
```
Common Data Formats
| Format | Description | Used For |
|---|---|---|
| Alpaca | instruction + input + output | General instruction tuning |
| ShareGPT | Multi-turn conversations array | Chat model fine-tuning |
| OpenAI JSONL | messages array (role + content) | OpenAI fine-tuning API |
| JSONL | One JSON object per line | General purpose |
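Converting between these formats is usually a small script. As a sketch (the field names follow the Alpaca and OpenAI conventions above; the sample record is illustrative), here is how an Alpaca-style record maps onto a messages array suitable for a JSONL file:

```python
import json

def alpaca_to_messages(sample):
    """Convert one Alpaca-style record into an OpenAI-style messages array."""
    # Merge the instruction and optional input into a single user turn
    user_content = sample["instruction"]
    if sample.get("input"):
        user_content += "\n\n" + sample["input"]
    return {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": sample["output"]},
        ]
    }

# Hypothetical record for illustration
record = {
    "instruction": "Translate the following text to formal business English",
    "input": "We wanna talk about working together",
    "output": "We would like to discuss potential collaboration opportunities.",
}
line = json.dumps(alpaca_to_messages(record))  # one line of a JSONL file
```

Writing one such JSON object per line produces the JSONL layout expected by most fine-tuning tooling.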
How Much Data Do You Need?
Rules of thumb:
| Amount | Effect |
|---|---|
| < 50 | Usually insufficient, consider few-shot |
| 50–200 | Noticeable effect, suitable for style/format adjustments |
| 200–1,000 | Good results, suitable for most fine-tuning scenarios |
| 1,000–10,000 | Very good results, suitable for complex tasks |
| > 10,000 | Evaluate if worth it (diminishing returns) |
Key insight: the return on data investment isn't linear. Going from 100 to 500 samples usually brings large improvements; going from 5,000 to 10,000 may bring very little.
Data Collection Strategies
1. Collect from Real Scenarios
The best data comes from your actual business:
- Human agent conversation logs
- Expert-written responses
- Reviewed and approved outputs
```python
raw_data = export_from_crm()  # e.g. exported support tickets with customer/agent pairs

# Turn each record into a single-turn conversation in chat format
training_data = []
for record in raw_data:
    training_data.append({
        "conversations": [
            {"role": "user", "content": record["customer_message"]},
            {"role": "assistant", "content": record["agent_reply"]}
        ]
    })
```
2. Generate Synthetic Data with LLMs
When real data is insufficient, use strong models to generate training data:
```python
prompt = """Generate training data for customer support fine-tuning.

Scenario: User inquiring about refund-related issues

Requirements:
- Generate 10 different user questions and support replies
- User questions should be diverse (different phrasings, situations)
- Support replies should be professional, helpful, following company policy
- Output in JSON format

Company refund policy: 30-day refund window, order number required, 3-5 day review."""

synthetic_data = llm.generate(prompt)
```
Guidelines for synthetic data:
- Generate with strong models, train weak models — use GPT-4/Claude to create data for fine-tuning Llama 8B
- Human review required — generated data needs quality checks
- Mix with real data — synthetic data works best combined with real data
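Before any human review, it helps to machine-validate the generated output, since LLMs sometimes return malformed JSON or incomplete samples. A minimal sketch (assuming the model was asked for a JSON array of objects with "question" and "reply" keys; the field names are illustrative):

```python
import json

def parse_synthetic_batch(raw_text):
    """Parse an LLM's JSON output and keep only well-formed Q/A pairs."""
    try:
        items = json.loads(raw_text)
    except json.JSONDecodeError:
        return []  # unparseable batch: regenerate rather than salvage
    valid = []
    for item in items:
        if not isinstance(item, dict):
            continue  # skip non-object entries
        q = item.get("question", "").strip()
        r = item.get("reply", "").strip()
        if q and r:  # drop samples with empty fields
            valid.append({"question": q, "reply": r})
    return valid

batch = '[{"question": "How do I get a refund?", "reply": "Please share your order number."}, {"bad": 1}]'
samples = parse_synthetic_batch(batch)  # keeps 1 valid sample, drops the malformed one
```

Samples that survive this check still go to human review; the script only removes the mechanically broken ones.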
3. Augment Existing Data
Enhance existing data by varying phrasing while preserving semantics:
```python
augmentation_prompt = """Rewrite this user question in 3 different ways, keeping the same meaning:

Original: "I'm not happy with my purchase, I want my money back"

Output 3 rewrites:"""
```
Data Cleaning
Noise in training data gets learned by the model. Cleaning steps:
Deduplication
```python
from sklearn.metrics.pairwise import cosine_similarity

embeddings = embed_all(data)  # embed every sample, e.g. with a sentence embedding model

# Flag near-duplicate pairs by embedding similarity (O(n^2); fine for small datasets)
duplicate_pairs = []
for i in range(len(embeddings)):
    for j in range(i + 1, len(embeddings)):
        if cosine_similarity([embeddings[i]], [embeddings[j]])[0][0] > 0.95:
            duplicate_pairs.append((i, j))
```
Quality Filtering
- Remove responses that are too short or too long
- Remove data with obvious errors
- Remove inconsistently formatted data
- Remove responses with unnecessary disclaimers ("As an AI...", "I cannot...")
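The filtering rules above can be expressed as a simple predicate. A minimal sketch, assuming chat-format samples; the length thresholds and disclaimer phrases are illustrative and should be tuned to your data:

```python
def passes_quality_filter(sample, min_len=20, max_len=2000):
    """Heuristic quality filter for an assistant reply; thresholds are illustrative."""
    reply = sample["conversations"][-1]["content"]
    if not (min_len <= len(reply) <= max_len):
        return False  # too short or too long
    # Drop replies containing boilerplate disclaimers the model shouldn't imitate
    disclaimers = ("As an AI", "I cannot", "I'm just a language model")
    if any(phrase in reply for phrase in disclaimers):
        return False
    return True

# Hypothetical samples for illustration
data = [
    {"conversations": [
        {"role": "user", "content": "I want a refund"},
        {"role": "assistant", "content": "I'd be happy to help. Could you share your order number?"},
    ]},
    {"conversations": [
        {"role": "user", "content": "Can you write my essay?"},
        {"role": "assistant", "content": "As an AI, I cannot do that."},
    ]},
]
clean = [s for s in data if passes_quality_filter(s)]  # keeps only the first sample
```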
Consistency Checks
Ensure instructions and responses are stylistically consistent:
- Similar questions should have similarly styled answers
- No contradictions between data points
- Uniform output formatting
Dataset Splitting
```python
import random

random.shuffle(data)  # shuffle first so the split isn't biased by collection order

split = int(len(data) * 0.9)
train_data = data[:split]  # 90%
eval_data = data[split:]   # 10%
```
The validation set monitors overfitting during training — if training loss drops but validation loss rises, you're overfitting.
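That overfitting signal can be checked programmatically from the loss histories. A sketch (the window size and the loss values are illustrative):

```python
def is_overfitting(train_losses, eval_losses, window=3):
    """Flag the classic pattern: training loss still falling while eval loss rises.

    Compares the mean of the last `window` loss values against the previous window.
    """
    if len(train_losses) < 2 * window or len(eval_losses) < 2 * window:
        return False  # not enough history to judge

    def mean(xs):
        return sum(xs) / len(xs)

    train_improving = mean(train_losses[-window:]) < mean(train_losses[-2 * window:-window])
    eval_worsening = mean(eval_losses[-window:]) > mean(eval_losses[-2 * window:-window])
    return train_improving and eval_worsening

# Train loss keeps dropping while eval loss turns upward -> overfitting signal
is_overfitting([1.0, 0.8, 0.6, 0.5, 0.4, 0.3],
               [0.9, 0.8, 0.7, 0.75, 0.8, 0.9])  # True
```

In practice this is what early stopping or checkpoint selection in your training framework does for you; the sketch just makes the rule explicit.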
Key Takeaways
- Data quality > data quantity. 100 high-quality samples usually beat 1,000 low-quality ones.
- Real-scenario data is the best source. Synthetic data is a useful supplement but needs human review.
- 50–200 samples can show results, but more high-quality data brings better outcomes.
- Data cleaning and consistency checks are non-negotiable. Noisy data will be faithfully learned by the model.
- Always maintain a validation set. Without one, you can't tell if the model is learning or overfitting.