# Deploying Fine-Tuned Models

## From Training to Production

The model is trained and evaluation looks good. The next question: how do you deploy it to production?
## Merging LoRA Weights
If you fine-tuned with LoRA, first decide whether to merge.
### Dynamic Loading (No Merge)

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the base model, then attach the LoRA adapter at runtime
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model = PeftModel.from_pretrained(base_model, "./lora_adapter")
```
Good for:
- Same base model with multiple LoRAs
- Frequent LoRA switching
### Merge and Save

```python
from transformers import AutoTokenizer

# Fold the LoRA weights into the base model permanently
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged_model")

# Save the tokenizer alongside so the directory is self-contained
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer.save_pretrained("./merged_model")
```
Good for:
- Single final model needed
- Further format conversion (e.g., GGUF)
- Production deployment
Merging is recommended for most production scenarios: no inference overhead and simpler deployment.
## Model Format Conversion

### Convert to GGUF (Local Deployment)
Merged models can be converted to GGUF for Ollama or llama.cpp deployment:
```bash
# Using llama.cpp's conversion script
python convert_hf_to_gguf.py ./merged_model --outfile model-f16.gguf

# Quantize (optional, reduces size)
./llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```
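To get a feel for how much quantization shrinks the file, here is back-of-the-envelope arithmetic for an 8B-parameter model. The bits-per-weight figures are approximate effective rates (real GGUF files add metadata and keep some layers at higher precision, so treat these as rough estimates):

```python
def gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough model file size: parameters x bits per weight, in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

n = 8e9  # 8B parameters
print(f"F16:    {gguf_size_gb(n, 16):.1f} GB")    # 16.0 GB full-precision export
print(f"Q8_0:   {gguf_size_gb(n, 8.5):.1f} GB")   # ~8.5 bits/weight effective
print(f"Q4_K_M: {gguf_size_gb(n, 4.85):.1f} GB")  # ~4.85 bits/weight effective
```

This is why a 4-bit quant of an 8B model fits comfortably on consumer hardware while the F16 export does not.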
### Use with Ollama

Create a Modelfile:

```
FROM ./model-Q4_K_M.gguf
SYSTEM "You are a professional customer support assistant."
PARAMETER temperature 0.3
```

Then build and run:

```bash
ollama create my-finetuned-model -f Modelfile
ollama run my-finetuned-model
```
### Keep SafeTensors (GPU Deployment)

For vLLM or TGI deployment, use the merged SafeTensors format directly:

```bash
# vLLM
vllm serve ./merged_model --port 8000

# Or upload to Hugging Face
huggingface-cli upload my-org/my-model ./merged_model
```
## Choosing a Deployment Solution
| Solution | Hardware | Best For | Performance |
|---|---|---|---|
| Ollama + GGUF | CPU / Apple Silicon / GPU | Personal use, small teams | Smooth for single users |
| llama-server + GGUF | CPU / Apple Silicon / GPU | When more control is needed | Smooth for single users |
| vLLM + SafeTensors | NVIDIA GPU | High-concurrency production | High throughput |
| TGI + SafeTensors | NVIDIA GPU | Production | High throughput |
| HF Inference Endpoints | Cloud GPU | No infrastructure management | Elastic scaling |
## Performance Optimization

### KV Cache
During inference, the model caches previous tokens' Key-Value pairs to avoid recomputation. This is the most basic inference optimization, enabled by default in most frameworks.
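A toy count of attention dot products makes the benefit concrete. Without caching, each new token recomputes attention over the whole prefix; with a KV cache, only the new token's query attends to the stored keys. This is a simplified model that ignores layers, heads, and hidden dimensions:

```python
def attention_ops(seq_len: int, use_kv_cache: bool) -> int:
    """Count query-key dot products needed to generate seq_len tokens."""
    ops = 0
    for t in range(1, seq_len + 1):
        if use_kv_cache:
            ops += t       # one new query against t cached keys
        else:
            ops += t * t   # recompute the full t x t attention each step
    return ops

print(attention_ops(1024, use_kv_cache=False))  # 358438400 (~358 million)
print(attention_ops(1024, use_kv_cache=True))   # 524800 (~525 thousand)
```

The gap grows quadratically with sequence length, which is why every serious inference framework enables the cache by default.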
### Continuous Batching

The core advantage of vLLM and TGI: token generation for different requests is interleaved, significantly improving GPU utilization.
Traditional batching:

```
Request 1: [generating...............] batch blocked until the longest request finishes
Request 2: [generating.......] done, but its slot sits idle until request 1 finishes
```

Continuous batching:

```
Request 1: [generating...............]
Request 2: [generating.......] → done, slot freed, new request joins
Request 3: [generating...........]
```
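The difference can be sketched with a toy scheduler. Static batching admits a new batch only after the previous one has fully drained; continuous batching refills a freed slot on the very next step. The request lengths and slot count below are hypothetical, and each request generates one token per step:

```python
def total_steps(lengths, slots, continuous):
    """Steps to finish all requests, each needing lengths[i] decode steps."""
    queue = list(lengths)
    active = []
    steps = 0
    while queue or active:
        # static batching only admits when the previous batch has drained;
        # continuous batching refills freed slots every step
        if continuous or not active:
            while queue and len(active) < slots:
                active.append(queue.pop(0))
        steps += 1
        active = [n - 1 for n in active if n > 1]
    return steps

jobs = [16, 4, 4, 4]  # one long request, three short ones
print(total_steps(jobs, slots=2, continuous=False))  # 20 steps (static)
print(total_steps(jobs, slots=2, continuous=True))   # 16 steps (continuous)
```

With continuous batching the short requests pack into the slot next to the long one, so total time is bounded by the longest request instead of by batch boundaries.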
### Quantized Deployment

Production can use quantized models too. INT8 or INT4 quantization has minimal quality loss for most tasks. Note that vLLM's `--quantization awq` flag expects a checkpoint that was already quantized with an AWQ tool; it tells vLLM how to load the weights rather than quantizing on the fly:

```bash
vllm serve ./merged_model \
  --quantization awq \
  --port 8000
```
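The reason quality loss is small can be seen in a minimal symmetric INT8 round trip: weights are scaled into the int8 range, rounded, then scaled back, and the reconstruction error is bounded by half a quantization step. This is a toy sketch of the idea only; AWQ itself is activation-aware and considerably more sophisticated:

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: scale, round, clamp."""
    scale = max(abs(w) for w in weights) / 127
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map the int8 codes back to floating point."""
    return [x * scale for x in q]

w = [0.42, -1.27, 0.008, 0.91, -0.33]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
assert max_err <= s / 2  # error bounded by half a quantization step
```

Because individual weight errors stay this small, most tasks see little measurable quality drop while memory use falls by 2-4x.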
## Cost Considerations

### Self-Hosted vs Cloud

| | Self-Hosted | Cloud (API) |
|---|---|---|
| Upfront cost | High (GPU hardware) | Low |
| Running cost | Electricity + ops | Pay per use |
| Best for | High-frequency, latency-sensitive | Low-frequency, elastic demand |
| Scalability | Manual scaling | Auto-scaling |
### Cost Estimation
Self-hosted (RTX 4090, 24GB):
- Hardware: ~$2,000
- Can run: 7B Q4 model, ~50-80 tokens/s
- Electricity: ~$30/month (24/7)
- Break-even: depends on usage volume
Cloud (e.g., RunPod, Lambda):
- A100 80GB: ~$1.5/hour
- Good for on-demand use, no 24/7 needed
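The break-even point is straightforward arithmetic. Using the illustrative numbers above (hypothetical prices; check current rates before deciding), self-hosting pays off once saved cloud GPU-hours outweigh the hardware cost plus electricity:

```python
HARDWARE = 2000          # one-time RTX 4090 build cost (illustrative)
POWER_PER_MONTH = 30     # 24/7 electricity (illustrative)
CLOUD_PER_HOUR = 1.5     # cloud GPU on-demand rate (illustrative)

def breakeven_months(hours_per_month: float) -> float:
    """Months until self-hosting is cheaper than renting cloud GPUs."""
    monthly_saving = hours_per_month * CLOUD_PER_HOUR - POWER_PER_MONTH
    if monthly_saving <= 0:
        return float("inf")  # cloud stays cheaper at this usage level
    return HARDWARE / monthly_saving

print(f"{breakeven_months(200):.1f} months")  # 7.4 months at moderate usage
print(f"{breakeven_months(720):.1f} months")  # 1.9 months running 24/7
```

Below a few dozen GPU-hours a month the saving never covers the hardware, which matches the table's advice: low-frequency, elastic demand belongs in the cloud.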
## Monitoring and Iteration

Going live isn't the end; it's the start of a new cycle.

### Key Metrics
```python
metrics = {
    "latency_p50": "50th percentile latency",
    "latency_p99": "99th percentile latency",
    "throughput": "Requests per second",
    "error_rate": "Error rate",
    "user_satisfaction": "User satisfaction",
    "gpu_utilization": "GPU utilization",
}
```
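Latency percentiles are worth computing properly rather than eyeballing averages: a p99 far above p50 signals a tail-latency problem that a mean would hide. A minimal nearest-rank implementation over collected request latencies (sample values are made up):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: p in (0, 100] over a list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [120, 95, 110, 480, 105, 99, 130, 101, 97, 1250]
print(percentile(latencies_ms, 50))  # 105 — typical request
print(percentile(latencies_ms, 99))  # 1250 — worst-case tail
```

Here the median looks healthy while the tail is 12x slower, exactly the pattern continuous batching and quantization are meant to address.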
### Continuous Improvement Loop

```
Collect user feedback
        ↓
Identify cases where the model performs poorly
        ↓
Add these cases to the training data
        ↓
Re-fine-tune
        ↓
Evaluate → Deploy → Continue collecting feedback
```
This loop drives continuous improvement. Each iteration doesn't need massive new data; a few dozen targeted samples can fix specific problems.
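Folding failure cases back into the training set can be as simple as appending corrected examples to a JSONL file in the same chat format used for fine-tuning. The field names below are assumptions; match whatever schema your training pipeline expects:

```python
import json

def add_failure_case(path, user_message, corrected_response):
    """Append one corrected example to the fine-tuning dataset (JSONL)."""
    record = {"messages": [
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": corrected_response},
    ]}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

add_failure_case(
    "train_patch.jsonl",
    "How do I reset my password?",
    "Go to Settings > Security > Reset Password, then check your email.",
)
```

Appending (`"a"` mode) keeps the patch file growing across review sessions, so the next fine-tuning run simply picks up the accumulated corrections.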
## Key Takeaways

- Merging LoRA weights before deployment is the most common path. Post-merge, convert to GGUF (local) or keep SafeTensors (GPU).
- Personal/small team: Ollama + GGUF. High-concurrency production: vLLM + SafeTensors.
- Quantized deployment has minimal quality loss in most scenarios while significantly reducing hardware requirements and cost.
- Deployment is the start of iteration, not the end. Collect feedback → improve data → re-fine-tune in a continuous improvement loop.