Tokens and Text Representation
Why Tokens
Computers can't process text directly. Every piece of text you write must be split into small units before being fed into a model — these units are called tokens.
Tokens are the fundamental currency of the LLM world. Model input is billed by tokens, output is billed by tokens, and context windows are measured in tokens. Understanding tokens is the starting point for understanding LLMs.
Tokenization: How Text Gets Split
You might assume models process text character by character or word by word. In practice, most modern models use a method called BPE (Byte Pair Encoding), producing units somewhere between characters and words.
Some examples (using GPT-family tokenizers):
- "hello" → 1 token
- "indescribable" → "indes" + "crib" + "able" → 3 tokens
- "你好世界" → likely 2-4 tokens (Chinese characters are typically 1-2 tokens each)
- "ChatGPT" → "Chat" + "GPT" → 2 tokens
Key observations:
- Common words are usually one token; rare words get split into subwords
- Token efficiency varies by language. English averages ~4 characters per token, while Chinese averages ~1-2 characters per token
- Spaces and punctuation are part of tokens
This explains practical quirks: why Chinese text consumes more tokens than English, why models sometimes struggle with spelling or character counting — they "see" tokens, not characters.
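The core idea of BPE can be sketched in a few lines. Real tokenizers learn their merge rules from a huge corpus and then apply them to new text; the toy version below just repeatedly merges the most frequent adjacent pair within a single string, using the classic "aaabdaaabac" illustration:

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent symbol pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def bpe(text, num_merges):
    """Start from individual characters, then greedily merge pairs."""
    tokens = list(text)
    for _ in range(num_merges):
        if len(tokens) < 2:
            break
        tokens = merge_pair(tokens, most_frequent_pair(tokens))
    return tokens

print(bpe("aaabdaaabac", 3))  # ['aaab', 'd', 'aaab', 'a', 'c']
```

After three merges, frequent character runs have fused into subword units while rare characters remain single tokens, which is exactly why common words end up as one token and rare words get split.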
From Tokens to Vectors: Embeddings
After tokenization, each token must be converted into numbers the model can process. This conversion is called an embedding.
Think of embeddings as a semantic coordinate system:
- Each token is mapped to a point in a high-dimensional space
- Semantically similar words are close together in this space
- For example, "cat" and "dog" would be near each other, while "cat" and "car" would be far apart
These coordinates aren't hand-crafted — the model learns them during training. A typical embedding dimension ranges from several hundred to several thousand, meaning each token is represented by a vector of that many numbers.
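The standard way to measure "closeness" in this space is cosine similarity. Here is a sketch with made-up 4-dimensional vectors (real embeddings are learned and have hundreds to thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1.0 = more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, NOT from a real model.
embeddings = {
    "cat": [0.8, 0.6, 0.1, 0.0],
    "dog": [0.7, 0.7, 0.2, 0.1],
    "car": [0.1, 0.0, 0.9, 0.8],
}

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))  # high (~0.98)
print(cosine_similarity(embeddings["cat"], embeddings["car"]))  # low  (~0.14)
```

Cosine similarity is preferred over raw distance because it ignores vector length and compares only direction, which is what carries the semantic signal.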
Why Embeddings Matter
Embeddings aren't just an internal implementation detail — they're extremely useful in applications:
- Semantic search: Convert documents and queries into embeddings, measure relevance by vector distance — far more accurate than keyword matching
- RAG (Retrieval-Augmented Generation): Use embeddings to find relevant documents, then feed them to the LLM for answer generation
- Classification and clustering: Use embeddings for text classification without complex feature engineering
You'll explore these applications in depth in the RAG track.
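The semantic-search pattern above reduces to "embed everything, then rank by similarity." A minimal sketch, with hand-picked placeholder vectors standing in for an embedding model's output:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hypothetical embeddings; a real system would call an embedding model here.
docs = {
    "Cats are small domesticated felines.": [0.9, 0.2, 0.1],
    "The engine needs an oil change.":      [0.1, 0.9, 0.2],
    "Dogs are loyal companion animals.":    [0.8, 0.3, 0.2],
}
query_vec = [0.85, 0.25, 0.15]  # pretend embedding of "tell me about pets"

ranked = sorted(docs, key=lambda d: cosine_similarity(docs[d], query_vec),
                reverse=True)
for doc in ranked:
    print(doc)  # pet-related documents rank above the oil-change one
```

A RAG pipeline is this same ranking step followed by pasting the top documents into the LLM's prompt.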
Context Window: Your Token Budget
Every model has a context window — the maximum number of tokens it can process at once.
- GPT-3.5: 4K tokens
- GPT-4: 8K / 32K tokens (128K for GPT-4 Turbo)
- Claude: 100K / 200K tokens
- Some latest models: over 1M tokens
Context window = input tokens + output tokens. If your prompt uses too many tokens, there's less room for the response.
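This budget arithmetic is worth making concrete. Assuming a hypothetical 8,192-token window, a long prompt directly shrinks the room left for the answer:

```python
def max_output_tokens(context_window, prompt_tokens):
    """Remaining token budget for the response after the prompt is counted."""
    return max(context_window - prompt_tokens, 0)

# A 6,500-token prompt in an 8,192-token window leaves 1,692 tokens
# for the model's reply; an oversized prompt leaves nothing.
print(max_output_tokens(8192, 6500))  # 1692
print(max_output_tokens(4096, 5000))  # 0
```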
Practical concerns:
- Long documents: A book might contain hundreds of thousands of tokens — anything beyond the window is invisible to the model
- Cost control: Token usage directly impacts API billing
- Performance: More tokens means slower responses
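Since billing is per token, a quick estimator helps reason about cost. The prices below are hypothetical placeholders; check your provider's current rates:

```python
def estimate_cost(input_tokens, output_tokens,
                  input_price_per_1k=0.01, output_price_per_1k=0.03):
    """Rough API cost in dollars. Prices are made-up example values."""
    return (input_tokens / 1000) * input_price_per_1k \
         + (output_tokens / 1000) * output_price_per_1k

# 2,000 input tokens + 500 output tokens at the example rates:
print(round(estimate_cost(2000, 500), 4))  # 0.035
```

Note that output tokens are typically priced higher than input tokens, so verbose responses often dominate the bill.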
Summary
- Tokens are the basic unit of text processing in LLMs — they're not characters or words
- Tokenization splits text into subword-level segments
- Embeddings map tokens to semantic vectors, enabling the model to understand meaning
- The context window determines how much content a model can process at once
- Token count directly affects cost and performance