Ollama local LLM runtime skill. Use when: running open-weight models locally, serving private inference for sensitive workloads, or cutting cost on high-volume tasks.
Ollama is a local LLM runtime that lets you pull, run, and serve open-weight models with a single command. It exposes a REST API (and an OpenAI-compatible endpoint) so existing tooling — LangChain, OpenAI SDKs, Continue.dev — works without modification. In the 2nth.ai stack it provides private inference for sensitive workloads or cost-sensitive high-volume tasks.
Stub — full skill pending. Core patterns documented below.
| Path | Focus | Status |
|---|---|---|
| tech/ollama/models | Model selection, pull, GGUF, quantisation | stub |
| tech/ollama/api | REST API, OpenAI-compatible endpoint, streaming | stub |
| tech/ollama/modelfile | Custom Modelfile — system prompts, parameters, adapters | stub |
| tech/ollama/deployment | Docker, Linux service, Kubernetes, GPU configuration | stub |
```bash
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama pull llama3.2
ollama run llama3.2

# Pull a smaller/faster model for edge use
ollama pull phi4-mini
ollama run phi4-mini "Summarise this in one sentence: ..."
```
Ollama serves an OpenAI-compatible endpoint at http://localhost:11434/v1 — swap the base URL and it works with any OpenAI SDK:
```typescript
import OpenAI from 'openai';

const ollama = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama', // required by SDK, value ignored
});

const res = await ollama.chat.completions.create({
  model: 'llama3.2',
  messages: [{ role: 'user', content: 'Explain OSPF in one paragraph.' }],
  stream: true,
});

for await (const chunk of res) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
}
```
| Family | Best for | Recommended pull |
|---|---|---|
| Llama 3.2 | General purpose, tool use | llama3.2 (3B), llama3.2:1b |
| Llama 3.1 | Long context (128K), coding | llama3.1:8b, llama3.1:70b |
| Mistral / Mixtral | Fast inference, instruction following | mistral, mixtral:8x7b |
| Gemma 3 | Multimodal, efficient | gemma3:4b, gemma3:12b |
| Phi-4 Mini | Ultra-small, on-device | phi4-mini |
| Qwen 2.5 | Code, maths, Chinese | qwen2.5-coder:7b |
| DeepSeek R1 | Reasoning, chain-of-thought | deepseek-r1:7b, deepseek-r1:32b |
| Nomic Embed | Embeddings | nomic-embed-text |
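For embeddings, the same OpenAI-compatible endpoint applies. A minimal sketch, assuming nomic-embed-text has already been pulled; the input strings are illustrative:

```typescript
import OpenAI from 'openai';

const ollama = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama', // required by the SDK, value ignored by Ollama
});

// Embed a batch of strings with nomic-embed-text
const { data } = await ollama.embeddings.create({
  model: 'nomic-embed-text',
  input: ['Ollama runs models locally', 'Private inference for sensitive workloads'],
});

console.log(data[0].embedding.length); // vector dimensionality
```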
```bash
# Generate (single-turn)
curl http://localhost:11434/api/generate \
  -d '{"model":"llama3.2","prompt":"What is Cloudflare Workers?","stream":false}'

# Chat (multi-turn)
curl http://localhost:11434/api/chat \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is RAG?"}
    ]
  }'

# Embeddings
curl http://localhost:11434/api/embed \
  -d '{"model":"nomic-embed-text","input":"Ollama runs models locally"}'

# List running models
curl http://localhost:11434/api/ps

# List available models
curl http://localhost:11434/api/tags
```
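The same native endpoints are callable from code with plain fetch. A minimal non-streaming sketch against /api/generate; with stream set to false the server returns a single JSON object whose response field holds the completion:

```typescript
// Non-streaming call to the native generate endpoint
const res = await fetch('http://localhost:11434/api/generate', {
  method: 'POST',
  body: JSON.stringify({
    model: 'llama3.2',
    prompt: 'What is Cloudflare Workers?',
    stream: false,
  }),
});

const { response } = await res.json();
console.log(response);
```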
- num_ctx in Modelfile or request body; default varies by model
- num_gpu to force GPU layer count
- keep_alive: -1 to keep a model loaded forever (see the sketch after the table below)

| Use case | Pattern |
|---|---|
| Private document analysis | Ollama + nomic-embed-text → Vectorize or pgvector |
| Cost fallback from Claude | OpenAI-compatible swap — same SDK, different baseURL |
| Local agent development | Continue.dev in VS Code pointing to local Ollama |
| On-prem client deployment | Docker container on client GPU server |
| Classification at edge | phi4-mini for fast intent routing before Claude |
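As noted in the parameter list above, runtime options can be set per request. A hedged sketch against the native /api/chat endpoint; the num_ctx and num_gpu values are illustrative, not recommendations:

```typescript
// Per-request tuning on the native chat endpoint
const res = await fetch('http://localhost:11434/api/chat', {
  method: 'POST',
  body: JSON.stringify({
    model: 'llama3.2',
    messages: [{ role: 'user', content: 'ping' }],
    stream: false,
    options: { num_ctx: 8192, num_gpu: 99 }, // context window; layers offloaded to GPU
    keep_alive: -1, // keep the model loaded indefinitely
  }),
});

const { message } = await res.json();
console.log(message.content);
```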