Inference: How Models Generate Text
Training vs Inference
The previous lesson covered how models "learn." This lesson covers how models "work" — that is, inference.
When you call the ChatGPT API or type a question in the web interface, inference is what happens: the model reads your input and generates a response token by token.
Inference parameters are what you directly control. Understanding them lets you tune the model's behavior to your needs.
Temperature: Controlling Randomness
Temperature is the most important inference parameter. It controls how random the model is when selecting the next token.
Suppose the model's probability distribution for the next token is:
"great"→ 50%"wonderful"→ 30%"nice"→ 15%- others → 5%
Different temperature values affect the selection like this:
- Temperature = 0: Almost always picks the highest-probability token ("great"). Output is highly deterministic but potentially monotonous
- Temperature = 0.7: Samples roughly proportional to probabilities, producing moderate variation
- Temperature = 1.5: Probabilities get "flattened," lower-probability tokens have a much bigger chance, output becomes more random or even chaotic
Practical guidelines:
- Code generation, data extraction, factual Q&A: Low temperature (0 - 0.3)
- General conversation, writing: Medium temperature (0.5 - 0.8)
- Creative writing, brainstorming: Higher temperature (0.8 - 1.2)
Top-p: An Alternative Control
Top-p (also called nucleus sampling) controls randomness from a different angle:
Sample only from the smallest set of tokens whose cumulative probability reaches p.
For example, Top-p = 0.9 means: sort tokens by probability from highest to lowest, take those that sum to 90%, and randomly sample from that set.
The advantage of Top-p is that it's adaptive:
- If the model is very confident (one token at 95% probability), the candidate set is tiny
- If the model is uncertain, the candidate set automatically grows
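The rule "sort, accumulate to p, sample from the rest" is short enough to sketch directly. Again a toy illustration, not any vendor's actual decoder:

```python
import random

def top_p_sample(probs, p):
    """Nucleus sampling sketch: keep the smallest set of top tokens whose
    cumulative probability reaches p, then sample within that set."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break  # the nucleus is complete
    tokens = [t for t, _ in nucleus]
    weights = [w for _, w in nucleus]  # random.choices re-normalizes these
    return random.choices(tokens, weights=weights)[0]

# Confident model: one dominant token, so the nucleus is a single token.
confident = {"yes": 0.95, "no": 0.03, "maybe": 0.02}
print(top_p_sample(confident, 0.9))  # always "yes"

# Uncertain model: the nucleus automatically grows to several tokens.
uncertain = {"a": 0.30, "b": 0.25, "c": 0.20, "d": 0.15, "e": 0.10}
print(top_p_sample(uncertain, 0.8))  # one of "a", "b", "c", "d"
```

This is the adaptivity described above: the same p value yields a one-token candidate set for the confident distribution and a four-token set for the uncertain one.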
Temperature and Top-p are often exposed side by side, and a common recommendation is to adjust only one of them at a time rather than both.
Stop Sequences: When to Stop
How does the model know when to stop generating? There are several mechanisms:
- Special token: The model generates an <EOS> (end of sequence) token, signaling the response is complete
- Stop sequences: You can specify strings that, when generated, cause the model to stop
- Max tokens: You can set a maximum output token count, and generation is cut off when reached
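The three mechanisms compose into a single generation loop. Here is a toy version in which `step_fn` is a hypothetical stand-in for the model's next-token call; real APIs run this loop server-side:

```python
def generate(step_fn, stop_sequences=(), max_tokens=256, eos="<EOS>"):
    """Toy generation loop demonstrating the three stop mechanisms.

    Returns (text, reason), where reason records why generation ended.
    """
    out = ""
    for _ in range(max_tokens):
        token = step_fn()
        if token == eos:
            return out, "eos"                      # model signaled completion
        out += token
        for stop in stop_sequences:
            if out.endswith(stop):
                # Cut the stop sequence itself out of the returned text.
                return out[: -len(stop)], "stop_sequence"
    return out, "max_tokens"                       # cut off at the limit

# The model finishes on its own:
tokens = iter(["Hello", " world", "<EOS>"])
print(generate(lambda: next(tokens)))  # ("Hello world", "eos")
```

Checking the `reason` in real code is worthwhile: a `max_tokens` finish usually means the response was truncated mid-thought, which you may want to surface to the user.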
Setting these properly prevents the model from rambling or getting cut off unexpectedly.
Streaming
You've probably noticed ChatGPT's responses appear word by word rather than all at once. This is streaming.
Technically, it's usually implemented via Server-Sent Events (SSE): each time one or a few tokens are generated, the server pushes an update.
Streaming matters for user experience:
- Users don't stare at a blank screen waiting
- They can judge early whether the response is going in the right direction
- Perceived response time drops dramatically
Most LLM APIs support streaming mode. In your applications, you should enable it by default.
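On the wire, an SSE stream is just lines of the form `data: <payload>`. A minimal client-side parser might look like the sketch below; the `[DONE]` sentinel is a convention used by some APIs to mark the end of the stream, not part of the SSE standard itself:

```python
def parse_sse(lines):
    """Minimal SSE parser sketch: yield each event's payload as it arrives.

    lines: an iterable of raw text lines from the connection.
    Stops when a 'data: [DONE]' sentinel is seen (a convention some
    LLM APIs use to terminate the stream).
    """
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip comments, keep-alives, and blank separators
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return
        yield payload

# Each chunk can be handed to the UI the moment it arrives:
for chunk in parse_sse(["data: Hel", "data: lo", "data: [DONE]"]):
    print(chunk, end="")  # renders "Hel", then "lo"
```

Because `parse_sse` is a generator, the caller can render each chunk immediately rather than waiting for the full response, which is exactly the perceived-latency win described above.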
Cost Structure of Inference
APIs typically charge per token, with different prices for input and output:
- Input tokens (your prompt): Usually cheaper
- Output tokens (model's response): Usually more expensive (because they must be generated one at a time)
Some optimization strategies:
- Trim your prompt: Remove unnecessary content
- Set reasonable max tokens: Prevent excessively long responses
- Caching: Cache results for identical or similar requests
- Choose the right model: Use smaller models for simple tasks, reserve large models for complex ones
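A quick cost estimate makes the input/output asymmetry concrete. The per-million-token rates below are hypothetical placeholders, not any vendor's actual pricing:

```python
def estimate_cost(input_tokens, output_tokens, input_price, output_price):
    """Estimate one request's cost in dollars.

    input_price / output_price are dollars per 1M tokens
    (hypothetical rates for illustration only).
    """
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Hypothetical rates: $0.50 per 1M input tokens, $1.50 per 1M output tokens.
cost = estimate_cost(input_tokens=2_000, output_tokens=500,
                     input_price=0.50, output_price=1.50)
print(f"${cost:.5f}")  # $0.00175
```

Note that even though the prompt here is four times longer than the response, the output still accounts for over 40% of the cost, which is why capping max tokens is an effective lever.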
Key Takeaways
- Temperature and Top-p balance creativity vs determinism. Choose values based on your use case.
- Temperature = 0 doesn't mean 100% deterministic, but it's close enough for most purposes.
- Streaming should be your default, unless you have specific reasons to wait for the complete response.
- Output tokens cost more than input tokens. Factor this into your prompt design.
- Inference speed depends on model size, input length, and output length. Latency-sensitive applications need to balance these trade-offs.