Serving Local Models as APIs
From Command Line to API
Chatting with a model in the terminal is just the first step. To integrate local models into your applications, you need an HTTP API. Good news: virtually all local inference tools provide OpenAI-compatible API endpoints.
OpenAI-Compatible API: The De Facto Standard
OpenAI's Chat Completions API format has become the de facto standard for LLM APIs. Almost every local inference tool supports it:
POST /v1/chat/completions
{
"model": "llama3.1:8b",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"}
],
"temperature": 0.7,
"stream": true
}
This means switching between local models and cloud APIs usually requires changing only one line in your code: the base_url.
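To make that concrete, here is a stdlib-only sketch that assembles the same request for any backend; only the base URL changes. The `build_chat_request` helper is hypothetical (not part of any SDK), shown purely to illustrate the shared request shape:

```python
import json

def build_chat_request(base_url, model, messages, **params):
    """Assemble an OpenAI-style Chat Completions request (hypothetical helper)."""
    url = base_url.rstrip("/") + "/v1/chat/completions"
    body = json.dumps({"model": model, "messages": messages, **params})
    return url, body

url, body = build_chat_request(
    "http://localhost:11434",  # swap for :8080 (llama-server) or :8000 (vLLM)
    "llama3.1:8b",
    [{"role": "user", "content": "Hello!"}],
    temperature=0.7,
)
print(url)  # http://localhost:11434/v1/chat/completions
```

Everything except the host and port stays identical across backends, which is exactly why the migration cost is so low.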
Common Serving Solutions
1. Ollama
The simplest option. Runs an API server automatically after installation.
# Start (runs automatically after install)
ollama serve
# API endpoint
# http://localhost:11434/v1/chat/completions
Pros: Zero config, automatic model management, supports concurrency. Cons: Limited parameter customization, not ideal for high-performance production.
2. llama-server (llama.cpp)
Lightweight with more control.
# -m: GGUF model path, -c: context window, -ngl: layers offloaded to GPU (99 = all)
llama-server \
  -m model.gguf \
  -c 8192 \
  -ngl 99 \
  --port 8080
Pros: Fine-grained parameter control, small resource footprint. Cons: Manual model file management.
3. vLLM
Production-oriented high-performance inference engine.
pip install vllm
# serves an OpenAI-compatible API; --max-model-len caps context to bound KV-cache memory
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --port 8000 \
  --max-model-len 8192
Key advantages:
- PagedAttention: Efficient KV Cache management, significantly improves throughput
- Continuous batching: Processes multiple requests simultaneously, better GPU utilization
- High concurrency: Suitable for multi-user scenarios
Cons: Requires NVIDIA GPU, doesn't support GGUF (uses SafeTensors), more complex setup than Ollama.
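To build intuition for why continuous batching raises throughput, here is a toy simulation. It is a conceptual sketch, not vLLM's actual scheduler: with static batching, short requests sit idle until the longest request in their batch finishes, while continuous batching refills a freed slot immediately:

```python
def static_batching(requests, batch_size):
    """Whole batch occupies the GPU until its slowest member finishes."""
    steps = 0
    for i in range(0, len(requests), batch_size):
        batch = requests[i:i + batch_size]
        steps += max(batch)  # everyone waits for the longest request
    return steps

def continuous_batching(requests, batch_size):
    """A finished request's slot is refilled on the very next step."""
    pending = list(requests)  # each number = decode steps the request needs
    active = []
    steps = 0
    while pending or active:
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))
        steps += 1
        active = [r - 1 for r in active if r - 1 > 0]
    return steps

reqs = [3, 100, 3, 3]  # three short requests, one long one
print(static_batching(reqs, batch_size=2))      # 103: short requests wait on the long one
print(continuous_batching(reqs, batch_size=2))  # 100: the long request sets the floor
```

The gap widens as request lengths become more varied, which is typical of real multi-user traffic.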
4. Text Generation Inference (TGI)
Built by Hugging Face, with a similar positioning to vLLM.
# minimal invocation; production setups usually also mount a volume to cache model weights
docker run --gpus all \
  -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference \
  --model-id meta-llama/Llama-3.1-8B-Instruct
Pros: Easy Docker deployment, good Hugging Face ecosystem integration. Cons: Also requires NVIDIA GPU.
Solution Comparison
| | Ollama | llama-server | vLLM | TGI |
|---|---|---|---|---|
| Setup difficulty | Minimal | Simple | Medium | Medium |
| Model format | GGUF | GGUF | SafeTensors | SafeTensors |
| GPU support | Multi-platform | Multi-platform | NVIDIA | NVIDIA |
| CPU support | Yes | Yes | No | No |
| Concurrency | Basic | Basic | Excellent | Excellent |
| Continuous batching | No | No | Yes | Yes |
| Best for | Dev/personal use | Lightweight deploy | Production | Production |
Code Integration
All these services are compatible with the OpenAI SDK — the code is nearly identical:
Python
from openai import OpenAI
# Switch backends by changing base_url
client = OpenAI(
base_url="http://localhost:11434/v1", # Ollama
# base_url="http://localhost:8080/v1", # llama-server
# base_url="http://localhost:8000/v1", # vLLM
api_key="not-needed"
)
response = client.chat.completions.create(
model="llama3.1:8b",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Implement binary search in Python"}
],
temperature=0.3
)
print(response.choices[0].message.content)
TypeScript
import OpenAI from 'openai';
const client = new OpenAI({
baseURL: 'http://localhost:11434/v1',
apiKey: 'not-needed',
});
const stream = await client.chat.completions.create({
model: 'llama3.1:8b',
messages: [{ role: 'user', content: 'Explain async/await' }],
stream: true,
});
for await (const chunk of stream) {
process.stdout.write(chunk.choices[0]?.delta?.content || '');
}
Streaming
Streaming output is important for user experience — no need to wait for the full response:
stream = client.chat.completions.create(
model="llama3.1:8b",
messages=[{"role": "user", "content": "Write a poem"}],
stream=True
)
for chunk in stream:
content = chunk.choices[0].delta.content
if content:
print(content, end="", flush=True)
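Under the hood, stream=True delivers server-sent events: each chunk arrives as a `data: {json}` line, and the stream is terminated by `data: [DONE]`. Here is a minimal stdlib parsing sketch, fed canned lines rather than a live connection:

```python
import json

def parse_sse_chunks(lines):
    """Extract delta text from OpenAI-style SSE lines (simplified sketch)."""
    out = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip comments, blank keep-alive lines, etc.
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break  # end-of-stream sentinel
        delta = json.loads(payload)["choices"][0]["delta"]
        out.append(delta.get("content") or "")  # first chunk may carry only the role
    return "".join(out)

sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    'data: [DONE]',
]
print(parse_sse_chunks(sample))  # Hello
```

The OpenAI SDK does this parsing for you; the sketch just shows why incremental printing works at all.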
Production Considerations
If you plan to use local models in production, also consider:
Load balancing: Multiple model instances behind Nginx or Caddy.
Health checks: Regularly verify the inference service is alive.
Request queuing: LLM inference is slow; you need proper queuing and timeout mechanisms.
Monitoring: Track GPU utilization, inference latency, and memory usage.
Security: Don't expose the inference API directly to the internet. Add authentication and rate limiting.
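As a sketch of the queuing and timeout point above, here is a minimal asyncio pattern: a semaphore bounds in-flight requests and wait_for enforces a hard deadline. The fake_backend stand-in and the concurrency cap of 4 are illustrative assumptions; a real backend call would hit the HTTP API:

```python
import asyncio

async def guarded_inference(sem, call_backend, prompt, timeout=30.0):
    """Queue a request behind `sem` and enforce a hard timeout."""
    async with sem:
        return await asyncio.wait_for(call_backend(prompt), timeout=timeout)

async def fake_backend(prompt):
    # stand-in for a real HTTP call to the inference server
    await asyncio.sleep(0.01)
    return f"echo: {prompt}"

async def main():
    sem = asyncio.Semaphore(4)  # hypothetical cap matching backend capacity
    return await asyncio.gather(
        *(guarded_inference(sem, fake_backend, f"q{i}") for i in range(10))
    )

results = asyncio.run(main())
print(results[0])  # echo: q0
```

Requests beyond the cap simply wait their turn instead of overwhelming the inference server, and asyncio.TimeoutError gives you a clean place to return a 503 or retry.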
Key Takeaways
- The OpenAI-compatible API is the standard interface. All major local inference tools support it, keeping migration costs minimal.
- Ollama is great for development and personal use, vLLM for production environments. Choose based on your needs.
- Code integration just requires changing base_url: use the OpenAI SDKs to connect to local models with zero switching cost.
- Production deployment needs load balancing, monitoring, and security, not just getting the model running.