tech/ollama/models

MODELS

Ollama model selection and management skill. Use when choosing, pulling, quantising, or removing local models, or matching a model to a task.

status: production
requires: Ollama v0.5+
improves: tech/ollama

Ollama — Model Selection & Management

Stub — core patterns below.

Pull & Manage

# Pull a model (downloads to ~/.ollama/models)
ollama pull llama3.2

# Pull specific quantisation
ollama pull llama3.1:8b-instruct-q4_K_M

# Pull from Hugging Face (GGUF)
ollama pull hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M

# List local models
ollama list

# Show model info (params, quantisation, context length)
ollama show llama3.2

# Remove a model
ollama rm llama3.2

# Copy/alias a model
ollama cp llama3.2 my-custom-llama
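
The same information is available programmatically from the local REST API. A minimal sketch that reproduces `ollama list` via GET /api/tags (field names follow the Ollama API; the printed columns are just a formatting choice):

// List local models over the REST API (equivalent to `ollama list`)
interface ModelTag {
  name: string;
  size: number; // bytes on disk
  details: { parameter_size: string; quantization_level: string };
}

const res = await fetch('http://localhost:11434/api/tags');
const { models } = await res.json() as { models: ModelTag[] };

for (const m of models) {
  console.log(`${m.name}  ${(m.size / 1e9).toFixed(1)} GB  ${m.details.quantization_level}`);
}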

Quantisation Guide

| Format | Size vs fp16 | Quality loss | Best for                     |
|--------|--------------|--------------|------------------------------|
| q2_K   | ~25%         | High         | Tiny RAM, experiments        |
| q4_K_M | ~45%         | Low          | Default choice, best balance |
| q5_K_M | ~55%         | Very low     | When VRAM allows             |
| q8_0   | ~75%         | Minimal      | Near-lossless, big GPU       |
| fp16   | 100%         | None         | Maximum quality, research    |

Rule of thumb: Use q4_K_M unless you have a specific quality requirement. It fits a 7B model in ~4GB VRAM and a 13B in ~8GB.
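
As a rough cross-check of that rule, a sketch: the bits-per-weight averages and the ~20% overhead factor for KV cache and runtime buffers are approximations, not Ollama internals.

// Estimate VRAM for a model at a given quantisation.
// Approximate bits per weight: q4_K_M ≈ 4.8, q8_0 ≈ 8.5, fp16 = 16.
function estimateVramGB(paramsBillions: number, bitsPerWeight: number): number {
  const weightsGB = (paramsBillions * 1e9 * bitsPerWeight) / 8 / 1e9;
  return weightsGB * 1.2; // assumed ~20% overhead for KV cache and buffers
}

console.log(estimateVramGB(7, 4.8).toFixed(1));  // ≈ 5.0 GB for a 7B q4_K_M
console.log(estimateVramGB(13, 4.8).toFixed(1)); // ≈ 9.4 GB for a 13B q4_K_M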

VRAM Requirements

| Model size | q4_K_M VRAM | q8_0 VRAM | Notes                         |
|------------|-------------|-----------|-------------------------------|
| 1B–3B      | ~1–2 GB     | ~2–3 GB   | Runs on integrated GPU or CPU |
| 7B–8B      | ~4–5 GB     | ~8 GB     | RTX 3060 / M2                 |
| 13B        | ~8 GB       | ~14 GB    | RTX 3080 / M2 Pro             |
| 30B–34B    | ~20 GB      | ~34 GB    | RTX 3090 / M3 Max             |
| 70B        | ~40 GB      | ~70 GB    | A100 40GB / 2× RTX 3090       |
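
These figures are estimates; to see what a loaded model actually occupies, ask the running server. `ollama ps` shows this on the CLI, and GET /api/ps returns the same data (sketch below; size fields are bytes):

// Show loaded models and how much of each sits in VRAM
const res = await fetch('http://localhost:11434/api/ps');
const { models } = await res.json() as {
  models: { name: string; size: number; size_vram: number }[];
};

for (const m of models) {
  const pct = m.size > 0 ? (100 * m.size_vram / m.size).toFixed(0) : '0';
  console.log(`${m.name}: ${(m.size_vram / 1e9).toFixed(1)} GB VRAM (${pct}% offloaded)`);
}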

Recommended Models by Task

# General chat / instruction following
ollama pull llama3.2          # 3B — fast, good quality
ollama pull llama3.1:8b       # 8B — better reasoning

# Coding
ollama pull qwen2.5-coder:7b  # Best small coding model
ollama pull deepseek-coder-v2 # Strong for code review

# Reasoning / chain-of-thought
ollama pull deepseek-r1:7b    # Thinking model, shows reasoning
ollama pull deepseek-r1:32b   # Stronger reasoning, needs big GPU

# Ultra-fast / on-device
ollama pull phi4-mini         # 3.8B, excellent for its size
ollama pull llama3.2:1b       # 1B, CPU-only capable

# Multimodal (vision)
ollama pull llava:7b          # Image + text
ollama pull minicpm-v         # Efficient vision model
ollama pull llama3.2-vision   # Meta's vision variant

# Embeddings
ollama pull nomic-embed-text  # 768-dim, fast, good quality
ollama pull mxbai-embed-large # 1024-dim, higher quality
ollama pull bge-m3            # Multilingual, 1024-dim
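
After pulling, a one-shot request against /api/chat is a quick way to confirm a model loads and responds. A minimal sketch, non-streaming for brevity; the model name assumes the llama3.2 pull above:

// One-shot, non-streaming chat request to a pulled model
const res = await fetch('http://localhost:11434/api/chat', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'llama3.2',
    messages: [{ role: 'user', content: 'Reply with one word: ready?' }],
    stream: false, // default is a stream of JSON lines
  }),
});
const data = await res.json() as { message: { role: string; content: string } };
console.log(data.message.content);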

Embedding Workflow

// Generate embeddings for RAG
async function embed(text: string): Promise<number[]> {
  const res = await fetch('http://localhost:11434/api/embed', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'nomic-embed-text',
      input: text,
    }),
  });
  const data = await res.json() as { embeddings: number[][] };
  return data.embeddings[0];
}

// Batch embedding: one request, one vector back per input
const res = await fetch('http://localhost:11434/api/embed', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'nomic-embed-text',
    input: ['Document one text', 'Document two text', 'Query text'],
  }),
});
const { embeddings } = await res.json() as { embeddings: number[][] };
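
To turn those vectors into a retrieval step, rank documents against the query by cosine similarity. The helper below is the standard formula, not an Ollama API call, and it reuses the embeddings array from the batch request above:

// Cosine similarity between two embedding vectors
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank both documents against the query from the batch call
const [docOne, docTwo, query] = embeddings;
console.log('doc one vs query:', cosine(docOne, query).toFixed(3));
console.log('doc two vs query:', cosine(docTwo, query).toFixed(3));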

CPU-Only Operation

# Force CPU (no GPU offload)
OLLAMA_NUM_GPU=0 ollama serve

# Or set in Modelfile
PARAMETER num_gpu 0

# Recommended CPU models
ollama pull phi4-mini          # 3.8B — fast on modern CPU
ollama pull llama3.2:1b        # 1B — usable on any machine
ollama pull gemma3:1b          # 1B Google model
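
When a custom Modelfile is overkill, the same num_gpu knob can be passed per request in the API options field; a minimal sketch against /api/generate:

// Force CPU-only inference for a single request via options
const res = await fetch('http://localhost:11434/api/generate', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'llama3.2:1b',
    prompt: 'Say hello.',
    stream: false,
    options: { num_gpu: 0 }, // 0 GPU layers = pure CPU
  }),
});
const data = await res.json() as { response: string };
console.log(data.response);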