tech/ollama

OLLAMA

Ollama local LLM runtime skill. Use when: running open-weight models locally, serving private inference for sensitive workloads, or handling cost-sensitive high-volume tasks.

production Ollama v0.5+, REST API, OpenAI-compatible API, Docker
improves: tech

Ollama

Ollama is a local LLM runtime that lets you pull, run, and serve open-weight models with a single command. It exposes a REST API (and an OpenAI-compatible endpoint) so existing tooling — LangChain, OpenAI SDKs, Continue.dev — works without modification. In the 2nth.ai stack it provides private inference for sensitive workloads or cost-sensitive high-volume tasks.

Stub — full skill pending. Core patterns documented below.

Subdomains

| Path | Focus | Status |
| --- | --- | --- |
| `tech/ollama/models` | Model selection, pull, GGUF, quantisation | stub |
| `tech/ollama/api` | REST API, OpenAI-compatible endpoint, streaming | stub |
| `tech/ollama/modelfile` | Custom Modelfile — system prompts, parameters, adapters (sketch below) | stub |
| `tech/ollama/deployment` | Docker, Linux service, Kubernetes, GPU configuration | stub |
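For flavour, a minimal Modelfile sketch; the base model, parameter values, and persona here are illustrative, not part of this skill:

```
# Modelfile: bake a system prompt and parameters into a named model
FROM llama3.2

PARAMETER temperature 0.3
PARAMETER num_ctx 8192

SYSTEM """You are a concise technical support assistant."""
```

Build and run it with `ollama create support-bot -f Modelfile`, then `ollama run support-bot`.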

Quick Start

```bash
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama pull llama3.2
ollama run llama3.2

# Pull a smaller/faster model for edge use
ollama pull phi4-mini
ollama run phi4-mini "Summarise this in one sentence: ..."
```
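The installers normally start the server for you; if the API calls below cannot connect, start it manually:

```bash
# Run the Ollama server in the foreground (listens on localhost:11434)
ollama serve
```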

OpenAI-Compatible API

Ollama serves an OpenAI-compatible endpoint at http://localhost:11434/v1 — swap the base URL and it works with any OpenAI SDK:

```ts
import OpenAI from 'openai';

const ollama = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama', // required by SDK, value ignored
});

const res = await ollama.chat.completions.create({
  model: 'llama3.2',
  messages: [{ role: 'user', content: 'Explain OSPF in one paragraph.' }],
  stream: true,
});

for await (const chunk of res) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
}
```
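The compatibility layer also exposes `/v1/embeddings`, so the same client can feed a retrieval pipeline. A short sketch, reusing the `ollama` client from the block above:

```ts
// Embeddings via the OpenAI-compatible endpoint
const emb = await ollama.embeddings.create({
  model: 'nomic-embed-text',
  input: 'Ollama runs models locally',
});

console.log(emb.data[0].embedding.length); // vector dimensionality
```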

Model Families

| Family | Best for | Recommended pull |
| --- | --- | --- |
| Llama 3.2 | General purpose, tool use | `llama3.2` (3B), `llama3.2:1b` |
| Llama 3.1 | Long context (128K), coding | `llama3.1:8b`, `llama3.1:70b` |
| Mistral / Mixtral | Fast inference, instruction following | `mistral`, `mixtral:8x7b` |
| Gemma 3 | Multimodal, efficient | `gemma3:4b`, `gemma3:12b` |
| Phi-4 Mini | Ultra-small, on-device | `phi4-mini` |
| Qwen 2.5 | Code, maths, Chinese | `qwen2.5-coder:7b` |
| DeepSeek R1 | Reasoning, chain-of-thought | `deepseek-r1:7b`, `deepseek-r1:32b` |
| Nomic Embed | Embeddings | `nomic-embed-text` |
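Most families also publish quantisation-specific tags. The `q4_K_M` tag below is illustrative; check each model's page on ollama.com/library for the tags that actually exist:

```bash
# Pull a specific quantisation variant (tag shown is illustrative)
ollama pull llama3.1:8b-instruct-q4_K_M

# Inspect a local model: parameters, template, licence
ollama show llama3.2
```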

Native REST API

```bash
# Generate (single-turn)
curl http://localhost:11434/api/generate \
  -d '{"model":"llama3.2","prompt":"What is Cloudflare Workers?","stream":false}'

# Chat (multi-turn)
curl http://localhost:11434/api/chat \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is RAG?"}
    ]
  }'

# Embeddings
curl http://localhost:11434/api/embed \
  -d '{"model":"nomic-embed-text","input":"Ollama runs models locally"}'

# List running models
curl http://localhost:11434/api/ps

# List available models
curl http://localhost:11434/api/tags
```
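Unlike the OpenAI-compatible endpoint, the native API streams newline-delimited JSON by default (set `"stream": false` to disable, as above). A minimal sketch of consuming `/api/chat` from Node 18+:

```ts
// Each streamed line is a JSON object with a partial message.content;
// the final object has done: true.
const res = await fetch('http://localhost:11434/api/chat', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'llama3.2',
    messages: [{ role: 'user', content: 'What is RAG?' }],
  }),
});

const reader = res.body!.getReader();
const decoder = new TextDecoder();
let buffer = '';

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });

  // Each complete line in the buffer is one JSON chunk
  let nl;
  while ((nl = buffer.indexOf('\n')) >= 0) {
    const line = buffer.slice(0, nl).trim();
    buffer = buffer.slice(nl + 1);
    if (line) process.stdout.write(JSON.parse(line).message?.content ?? '');
  }
}
```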

Integration Points in 2nth.ai Stack

| Use case | Pattern |
| --- | --- |
| Private document analysis | Ollama + `nomic-embed-text` → Vectorize or pgvector |
| Cost fallback from Claude | OpenAI-compatible swap — same SDK, different baseURL |
| Local agent development | Continue.dev in VS Code pointing to local Ollama |
| On-prem client deployment | Docker container on client GPU server |
| Classification at edge | `phi4-mini` for fast intent routing before Claude |
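The cost-fallback row amounts to a constructor swap. A sketch under stated assumptions: the env flag and the hosted model name are illustrative, not part of this skill:

```ts
import OpenAI from 'openai';

// Hypothetical routing flag; wire this to your own config
const useLocal = process.env.USE_LOCAL_LLM === 'true';

const client = new OpenAI(
  useLocal
    ? { baseURL: 'http://localhost:11434/v1', apiKey: 'ollama' } // local Ollama
    : { apiKey: process.env.OPENAI_API_KEY },                    // hosted provider
);

const res = await client.chat.completions.create({
  model: useLocal ? 'llama3.2' : 'gpt-4o-mini', // hosted model name illustrative
  messages: [{ role: 'user', content: 'Classify intent: "reset my password"' }],
});

console.log(res.choices[0].message.content);
```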