How Models Are Trained
Three Stages
Training a model like ChatGPT isn't a single step — it happens in three stages:
- Pre-training: Learning language itself
- Supervised Fine-tuning (SFT): Learning to have conversations
- RLHF: Learning human preferences
Each stage solves a different problem with different data. Understanding this pipeline helps you understand why models excel at some tasks and struggle with others.
Stage 1: Pre-training
This is the most resource-intensive stage. The model learns one thing from massive amounts of text: predict the next token.
Training data sources include:
- Internet web pages (Common Crawl, etc.)
- Books, papers, Wikipedia
- Code repositories (GitHub, etc.)
- Other public text data
The scale is staggering. GPT-3's training data contained roughly 300 billion tokens, and subsequent models use even more. Training requires thousands of GPUs running for weeks to months, costing millions to hundreds of millions of dollars.
After pre-training, you get a base model. It has learned language, knowledge, and reasoning patterns, but it can't hold a conversation — if you ask it a question, it will often just continue the text, for example by generating more similar questions rather than answering you.
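The next-token objective can be made concrete with a toy sketch — here a bigram count model stands in for a real neural network, but the loss is the same idea: the average negative log-probability the model assigns to the true next token.

```python
import math
from collections import defaultdict, Counter

# Toy corpus; the model's only job is to predict the next token.
corpus = "the cat sat on the mat the cat ran".split()

# A minimal "model": bigram counts turned into next-token probabilities.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_token_probs(prev):
    c = counts[prev]
    total = sum(c.values())
    return {tok: n / total for tok, n in c.items()}

# Pre-training loss: average negative log-likelihood of the true next token.
pairs = list(zip(corpus, corpus[1:]))
nll = sum(-math.log(next_token_probs(prev)[nxt]) for prev, nxt in pairs)
print(round(nll / len(pairs), 3))  # → 0.412
```

A real model replaces the count table with a neural network over billions of parameters, but pre-training optimizes exactly this quantity, token by token, over trillions of tokens.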
Stage 2: Supervised Fine-tuning (SFT)
To turn the base model into a useful assistant, it's fine-tuned on human-annotated conversation data.
This data looks like:
User: Explain what recursion is
Assistant: Recursion is a programming technique where a function calls itself during execution...
Human annotators write high-quality responses, and the model learns this question-answer pattern. After SFT, the model starts understanding instructions and producing structured responses.
SFT uses far less data than pre-training (typically tens of thousands to hundreds of thousands of conversations), but it fundamentally changes the model's behavior.
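A common detail of SFT (assumed here; exact templates and tokenizers vary by model) is that the loss is computed only on the assistant's tokens — the prompt is masked out so the model learns to *produce* answers, not to reproduce questions. A toy sketch with a whitespace "tokenizer" and the conventional -100 ignore label:

```python
# Hypothetical chat template; real models each define their own.
def format_sft_example(user_msg, assistant_msg):
    prompt = f"User: {user_msg}\nAssistant: "
    full = prompt + assistant_msg
    tokens = full.split()              # toy whitespace "tokenizer"
    n_prompt = len(prompt.split())
    # Loss mask: -100 (ignored) for prompt positions,
    # real tokens for the assistant's response.
    labels = [-100] * n_prompt + tokens[n_prompt:]
    return tokens, labels

tokens, labels = format_sft_example(
    "Explain what recursion is",
    "Recursion is when a function calls itself",
)
```

Training on many such (tokens, labels) pairs is what turns "continue the text" behavior into "answer the question" behavior.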
Stage 3: RLHF
RLHF (Reinforcement Learning from Human Feedback) is the key step that makes models "helpful."
Roughly, the process works like this:
- The model generates multiple responses for the same question
- Human annotators rank these responses from best to worst
- The ranking data trains a "reward model" — it learns to predict what humans would prefer
- Reinforcement learning optimizes the original model to generate responses that score higher with the reward model
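The reward-model step is typically trained with a pairwise preference loss (a Bradley-Terry-style objective — an assumption about the setup, though it is the standard choice): the loss shrinks as the reward model scores the human-preferred response above the rejected one.

```python
import math

def reward_model_loss(score_chosen, score_rejected):
    # Pairwise preference loss: -log(sigmoid(margin)).
    # Loss approaches 0 as the chosen response's score
    # pulls ahead of the rejected one's.
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Equal scores: the model can't tell them apart, loss = ln(2).
print(round(reward_model_loss(0.0, 0.0), 3))
# A clear margin in the right direction: much smaller loss.
print(round(reward_model_loss(2.0, -1.0), 3))
```

The reinforcement-learning step then tunes the chat model to maximize this learned reward, usually with a penalty that keeps it from drifting too far from the SFT model.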
RLHF teaches the model:
- Safety: Refusing harmful requests
- Helpfulness: Providing detailed, organized answers
- Honesty: Expressing uncertainty when unsure
This is why ChatGPT and Claude say "I'm not sure" or refuse certain requests — these behaviors are trained through RLHF.
Training Data Cutoff
Pre-training uses data collected before a certain point in time — the knowledge cutoff date.
- The model doesn't know about events after the cutoff
- It can't automatically update its knowledge
- This is one reason technologies like RAG (Retrieval-Augmented Generation) exist — to provide models with real-time information
When a model performs poorly on recent events, it's often not because it's "dumb" — it's because that information simply isn't in the training data.
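The RAG idea can be sketched in a few lines: retrieve relevant documents, then prepend them to the prompt so the model answers from fresh context instead of stale training data. This toy version scores documents by keyword overlap — real systems use vector embeddings and a proper retriever, and the document store here is invented for illustration.

```python
# Hypothetical document store; in practice this is a vector database.
documents = [
    "The 2024 release notes describe the new streaming API.",
    "Recursion is a technique where a function calls itself.",
]

def retrieve(question, docs, k=1):
    # Toy relevance score: count of shared lowercase words.
    q_words = set(question.lower().split())
    scored = sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(question):
    # Stuff the retrieved text into the prompt as context.
    context = "\n".join(retrieve(question, documents))
    return f"Context:\n{context}\n\nQuestion: {question}"
```

The model never "learns" the retrieved facts — they are simply placed in the prompt at query time, which is why RAG can cover events after the knowledge cutoff without retraining.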
Practical Implications
- Base model ≠ Chat model. When you call an API, you're usually using a version that went through SFT + RLHF. Some APIs also offer base model access, which behaves very differently.
- Model knowledge has a time boundary. Don't expect it to know the latest information — use RAG for real-time data.
- The model's "personality" is trained. Its caution, politeness, and refusals are the result of RLHF, not some form of "consciousness."
- Training data sets the capability ceiling. If your use case involves very specialized or niche domains, the model may underperform — there's simply less of that content in the training data.
- Fine-tuning is something you can do. In the Fine-tuning track, you'll learn how to tune models with your own data.