Ollama Complete Guide 2026: Run LLMs Locally for Free

Sanjeev Sharma

Ollama 2026: Run Any LLM Locally, Free Forever

No API keys. No rate limits. No data leaving your machine. Ollama lets you run powerful open-source LLMs (Llama 3.3, Mistral, Gemma 3, DeepSeek-R1) locally with a single command.

Why Run LLMs Locally?

  • Privacy: Your data never leaves your machine — critical for medical, legal, financial apps
  • Cost: $0 per token vs $15/1M tokens for GPT-4o
  • Speed: No network latency; runs at your GPU's speed
  • Offline: Works without internet
  • Customization: Full control over system prompts, parameters, and model weights

Installation (All Platforms)

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download installer from https://ollama.com/download

# Verify installation
ollama --version
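
Once installed, the Ollama server runs in the background on localhost:11434. Here is a quick health check from Python (a minimal sketch; assumes the requests package is installed and uses the /api/version endpoint):

import requests

# The local Ollama server listens on port 11434 after installation;
# /api/version returns the running version as JSON.
r = requests.get('http://localhost:11434/api/version', timeout=5)
print(r.json())  # e.g. {'version': '0.5.7'}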

Pull and Run Models

# Pull Llama 3.3 (70B — state of the art open source)
ollama pull llama3.3

# Pull smaller, faster models
ollama pull llama3.2          # 3B — great for laptops
ollama pull mistral           # 7B — fast, great quality
ollama pull gemma3            # Google's Gemma 3
ollama pull deepseek-r1       # DeepSeek R1 — incredible reasoning
ollama pull codellama         # Optimized for code
ollama pull phi4              # Microsoft Phi-4 — compact + smart
ollama pull qwen2.5-coder     # Alibaba's coding model

# Chat with a model
ollama run llama3.2
>>> Tell me a joke about programmers

REST API (Use from Any Language)

Ollama exposes a REST API on localhost:11434:

# Generate a completion
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.2", "prompt": "Why is the sky blue?", "stream": false}'

# Chat API (multi-turn messages format, similar to OpenAI's; see OpenAI-compatible mode below)
curl http://localhost:11434/api/chat \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Write a Python quicksort"}],
    "stream": false
  }'
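
By default both endpoints stream newline-delimited JSON, one object per chunk, with "done": true on the final line (the curl examples above disable this with "stream": false). A minimal sketch of consuming the stream from Python with requests:

import json
import requests

# Stream tokens as they are generated; each line is a JSON object with a
# partial 'response' field, and the final one has 'done': true.
with requests.post(
    'http://localhost:11434/api/generate',
    json={'model': 'llama3.2', 'prompt': 'Why is the sky blue?'},
    stream=True,
) as r:
    for line in r.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get('response', ''), end='', flush=True)
        if chunk.get('done'):
            break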

Python Integration

# pip install ollama
import ollama

# Simple generation
response = ollama.generate(model='llama3.2', prompt='Explain closures in JavaScript')
print(response['response'])

# Chat with history
messages = [
    {'role': 'system', 'content': 'You are a senior software engineer. Be concise.'},
    {'role': 'user', 'content': 'What is the difference between TCP and UDP?'},
]
response = ollama.chat(model='llama3.2', messages=messages)
print(response['message']['content'])

# Streaming response
for chunk in ollama.generate(model='llama3.2', prompt='Write a merge sort', stream=True):
    print(chunk['response'], end='', flush=True)

OpenAI-Compatible Mode

Ollama speaks the OpenAI API format — drop it into any OpenAI app:

from openai import OpenAI

# Point OpenAI client to local Ollama
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # any string works
)

response = client.chat.completions.create(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Write a binary search in Python'}]
)
print(response.choices[0].message.content)

This means you can use Ollama as a free drop-in replacement for any OpenAI-powered app during development!
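
One handy pattern: recent openai-python releases also read OPENAI_BASE_URL and OPENAI_API_KEY from the environment, so you can redirect an existing app to Ollama without touching its code. A sketch, assuming that client behavior:

from openai import OpenAI

# In a dev shell, two environment variables are enough:
#   export OPENAI_BASE_URL=http://localhost:11434/v1
#   export OPENAI_API_KEY=ollama
client = OpenAI()  # picks up both variables; no code changes needed
response = client.chat.completions.create(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'ping'}],
)
print(response.choices[0].message.content)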


Build a Local RAG App with Ollama

# pip install langchain langchain-ollama langchain-community chromadb pypdf
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader

# 100% local — no API keys needed
# (pull the embedding model first: ollama pull nomic-embed-text)
llm = ChatOllama(model="llama3.2", temperature=0)
embeddings = OllamaEmbeddings(model="nomic-embed-text")

# Load and embed docs
loader = PyPDFLoader("my_document.pdf")
chunks = RecursiveCharacterTextSplitter(chunk_size=500).split_documents(loader.load())
db = Chroma.from_documents(chunks, embeddings)

# Query
retriever = db.as_retriever(search_kwargs={"k": 3})
docs = retriever.invoke("What is the main topic?")
context = "\n".join([d.page_content for d in docs])

response = llm.invoke(f"Context: {context}\n\nQuestion: What is the main topic?")
print(response.content)

Custom Modelfiles (Customize Model Behavior)

# Modelfile — create a custom model persona
FROM llama3.2

SYSTEM """
You are an expert Python developer who writes clean, well-commented code.
Always include type hints. Always handle exceptions. Use f-strings.
When explaining code, be concise and focus on the key insight.
"""

PARAMETER temperature 0.1
PARAMETER num_ctx 8192

# Build and run your custom model
ollama create python-expert -f Modelfile
ollama run python-expert
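
The custom model behaves like any other model name, including over the API. A short sketch using the Python client with the python-expert model created above:

import ollama

# The system prompt and parameters baked into the Modelfile apply automatically.
resp = ollama.chat(
    model='python-expert',
    messages=[{'role': 'user', 'content': 'Read a CSV file and sum the "price" column.'}],
)
print(resp['message']['content'])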

Performance Guide

| Model | RAM Required | Speed (tokens/s) | Quality |
|---|---|---|---|
| llama3.2:3b | 4 GB | 60-80 | Good |
| llama3.1:8b | 8 GB | 30-40 | Very Good |
| mistral:7b | 8 GB | 35-45 | Very Good |
| llama3.3:70b | 48 GB | 5-10 | Excellent |
| deepseek-r1:32b | 24 GB | 10-15 | Excellent |

Apple Silicon (M1/M2/M3): Ollama uses Metal GPU acceleration — 3-5x faster than CPU. NVIDIA GPU: Set CUDA_VISIBLE_DEVICES — near cloud-speed performance.
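
To check what your own hardware actually delivers, you can compute tokens/sec from the metadata Ollama returns with each completion (eval_count is tokens generated, eval_duration is in nanoseconds). A quick sketch:

import ollama

# eval_count and eval_duration come back with every non-streamed completion.
resp = ollama.generate(model='llama3.2', prompt='Write a haiku about autumn.')
print(f"{resp['eval_count'] / (resp['eval_duration'] / 1e9):.1f} tokens/sec")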


Use Cases

  • Code completion server: Integrate with VS Code via Continue extension
  • Private chatbot: Customer support without data leaving premises
  • Local RAG: Query confidential documents securely
  • CI/CD AI reviews: Run code review in your pipeline, no API costs
  • Edge AI: Deploy on-premise for air-gapped environments


Written by Sanjeev Sharma

Full Stack Engineer · E-mopro