Model Formats and Quantization
What's Inside a Model File
When you download an LLM, you're downloading weight parameters — billions of floating-point numbers that store everything the model learned from its training data.
These parameters need to be saved in some file format. Different formats have different characteristics, and understanding them helps you make better choices.
Major Model Formats
GGUF
GGUF (GPT-Generated Unified Format) is the dominant format for local inference, defined by the llama.cpp project.
Key features:
- Single file: All model information (weights, tokenizer, metadata) packed into one file
- Built-in quantization: Multiple quantization schemes supported
- Cross-platform: Runs on CPU and on GPU backends such as CUDA and Metal
- Ollama's underlying format: When you `ollama pull` a model, it's GGUF under the hood
GGUF filenames typically include quantization info, like `llama-3-8b-Q4_K_M.gguf`, so you can see the model size and quantization level at a glance.
SafeTensors
SafeTensors is Hugging Face's format, primarily for GPU inference and training.
Key features:
- Safe: No pickle serialization, avoiding deserialization attack risks
- Fast loading: Supports memory mapping for quick load times
- Standard format for GPU training and inference
- Usually saved as multiple shard files
If you use the transformers library or vLLM for inference, you're using SafeTensors.
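For intuition about why the format is both safe and fast to load, its on-disk layout is simple: an 8-byte little-endian header length, a JSON header describing each tensor's dtype, shape, and byte offsets, then the raw tensor data. A minimal hand-built sketch (the `"weight"` tensor is made-up example data):

```python
import json
import struct

# Build a minimal SafeTensors-style blob by hand: 8-byte little-endian
# header length, JSON header, then raw tensor bytes. The "weight"
# tensor is made-up example data (a 2x2 FP32 matrix = 16 bytes).
header = {"weight": {"dtype": "F32", "shape": [2, 2], "data_offsets": [0, 16]}}
header_bytes = json.dumps(header).encode("utf-8")
data = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
blob = struct.pack("<Q", len(header_bytes)) + header_bytes + data

# Reading it back: the first 8 bytes give the JSON header's length.
# No pickle involved — parsing the header can't execute arbitrary code.
(n,) = struct.unpack("<Q", blob[:8])
parsed = json.loads(blob[8 : 8 + n])
print(parsed["weight"]["shape"])  # [2, 2]
```

Because tensor data sits at fixed offsets after the header, loaders can memory-map the file and read weights lazily, which is where the fast load times come from.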
PyTorch (.bin) and GGML
- PyTorch .bin: Legacy format using pickle serialization; it carries deserialization security risks and is being replaced by SafeTensors
- GGML: GGUF's predecessor, outdated, not recommended
Quantization: Trading Precision for Space
A 7B parameter model stored in FP16 (16-bit floating point) requires about 14GB of memory. A 70B model needs 140GB — far beyond consumer GPU capacity.
Quantization is the key technology: represent each parameter with fewer bits, trading some precision for smaller memory footprint and faster inference.
The principle is simple. FP16 uses 16 bits per parameter, but many parameters don't need that much precision. Using 4 bits instead cuts memory usage to 1/4.
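A toy sketch of the idea: round each weight to a small integer and keep one shared scale per block. Real schemes such as the k-quants below are considerably more elaborate, but the precision-for-space trade is the same:

```python
# Toy 4-bit quantization: map each weight to an integer in -8..7 with
# one shared float scale per block. Storage drops from 16 bits to
# roughly 4 bits per weight; the reconstruction is approximate.
def quantize_4bit(weights: list[float]) -> tuple[float, list[int]]:
    scale = max(abs(w) for w in weights) / 7 or 1.0  # avoid a zero scale
    return scale, [max(-8, min(7, round(w / scale))) for w in weights]

def dequantize(scale: float, quants: list[int]) -> list[float]:
    return [scale * q for q in quants]

scale, quants = quantize_4bit([0.12, -0.31, 0.07, 0.29])
approx = dequantize(scale, quants)  # close to the originals, not exact
```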
Quantization Levels Explained
| Level | Bits per param | 7B model size | Quality loss | Use case |
|---|---|---|---|---|
| F16 | 16 bit | ~14 GB | None | Best choice when you have enough VRAM |
| Q8_0 | 8 bit | ~7 GB | Minimal | Quality-first, space available |
| Q6_K | 6 bit | ~5.5 GB | Very small | Good balance of quality and space |
| Q5_K_M | 5 bit | ~4.8 GB | Small | Recommended general choice |
| Q4_K_M | 4 bit | ~4.0 GB | Slight | Most popular, best value |
| Q3_K_M | 3 bit | ~3.3 GB | Noticeable | When memory is tight |
| Q2_K | 2 bit | ~2.5 GB | Severe | Not recommended, too much quality loss |
The K in the name indicates the k-quant method (a smarter quantization scheme); M means Medium precision. There are also S (Small: smaller file, lower quality) and L (Large: bigger file, better quality) variants.
How to Choose a Quantization Level
Practical decision flow:
- VRAM is sufficient → F16 or Q8: Best quality, no compromise needed
- Want best value → Q4_K_M: The community's top recommendation, minimal quality loss with significant memory savings
- Memory is tight → Q5_K_M or Q4_K_M: Balance between quality and space
- Very tight → Q3_K_M: Usable, but noticeable quality degradation
- Q2 → Basically don't: Quality loss is too severe; better to use a smaller model
A common misconception: a 70B model at Q2 is usually worse than a 7B model at Q8. Excessive quantization often hurts more than switching to a smaller model.
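The decision flow above can be sketched as a simple chooser. The thresholds are rough assumptions taken from the 7B file sizes in the table, not hard requirements, and the function name is hypothetical:

```python
# Illustrative chooser for a 7B model given free memory in GB.
# Thresholds are the approximate 7B sizes from the table above;
# treat them as rough guidance, not exact requirements.
def pick_quant_7b(free_gb: float) -> str:
    if free_gb >= 14:
        return "F16"
    if free_gb >= 7:
        return "Q8_0"
    if free_gb >= 4.8:
        return "Q5_K_M"
    if free_gb >= 4.0:
        return "Q4_K_M"
    if free_gb >= 3.3:
        return "Q3_K_M"
    return "use a smaller model"  # Q2 is rarely worth it
```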
Estimating Model Size
Quick estimation formula:
Memory needed ≈ Parameters (billions) × Bits per parameter ÷ 8
Example: 7B model with Q4 quantization
7 × 4 ÷ 8 = 3.5 GB (actual usage ~4 GB with overhead)
This helps you quickly determine if a model can run on your hardware.
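The formula translates directly into a one-line helper (the name `model_size_gb` is made up for illustration; the numbers reproduce the examples above):

```python
# Quick feasibility check: parameters (billions) × bits per param ÷ 8.
# Actual memory use runs a bit higher (KV cache, runtime buffers).
def model_size_gb(params_billion: float, bits: int) -> float:
    return params_billion * bits / 8

print(model_size_gb(7, 4))    # 3.5 — plan for ~4 GB in practice
print(model_size_gb(70, 16))  # 140.0 — far beyond consumer GPUs
```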
Where to Find Quantized Models
- Hugging Face: Search for model name + "GGUF", e.g., "llama 3 8b GGUF"
- TheBloke: The most well-known quantized model publisher on Hugging Face (though many model authors now publish GGUF themselves)
- Ollama model library: `ollama pull` a model from the library and it arrives pre-configured and ready to go (`ollama list` shows what you've already downloaded)
Key Takeaways
- GGUF is the standard format for local inference, SafeTensors for GPU training and inference. Choose based on your use case.
- Quantization is the key to running large models locally — trading precision for space. Q4_K_M is the most popular choice with the best value.
- Don't over-quantize. Q2 is generally unusable. Rather than heavily quantizing a large model, use a smaller model at a higher quantization level.
- Model size ≈ Parameters × Bits ÷ 8. Use this formula to quickly assess hardware feasibility.