Ollama 2026: Run Any LLM Locally, Free Forever
No API keys. No rate limits. No data leaving your machine. Ollama lets you run powerful open-source LLMs (Llama 3.3, Mistral, Gemma 3, DeepSeek-R1) locally with a single command.
- Why Run LLMs Locally?
- Installation (All Platforms)
- Pull and Run Models
- REST API (Use from Any Language)
- Python Integration
- OpenAI-Compatible Mode
- Build a Local RAG App with Ollama
- Custom Modelfiles (Fine-tune Behavior)
- Performance Guide
- Use Cases
Why Run LLMs Locally?
- Privacy: Your data never leaves your machine — critical for medical, legal, financial apps
- Cost: $0 per token locally, versus API pricing on the order of $15 per 1M output tokens for GPT-4o
- Speed: No network latency; runs at your GPU's speed
- Offline: Works without internet
- Customization: Full control over system prompts, parameters, and model weights
Installation (All Platforms)
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows: download installer from https://ollama.com/download
# Verify installation
ollama --version
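The installer normally starts the Ollama server as a background service. If later commands can't connect, you can start it yourself and confirm it's listening on the default port 11434:
# Start the server manually (only needed if it isn't already running)
ollama serve
# In another terminal, confirm it responds
curl http://localhost:11434
# Should reply with "Ollama is running"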
Pull and Run Models
# Pull Llama 3.3 (70B — state of the art open source)
ollama pull llama3.3
# Pull smaller, faster models
ollama pull llama3.2 # 3B — great for laptops
ollama pull mistral # 7B — fast, great quality
ollama pull gemma3 # Google's Gemma 3
ollama pull deepseek-r1 # DeepSeek R1 — incredible reasoning
ollama pull codellama # Optimized for code
ollama pull phi4 # Microsoft Phi-4 — compact + smart
ollama pull qwen2.5-coder # Alibaba's coding model
# Chat with a model
ollama run llama3.2
>>> Tell me a joke about programmers
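Once you have a few models on disk, a handful of housekeeping commands cover most day-to-day management:
# List downloaded models and their sizes
ollama list
# Show which models are currently loaded in memory
ollama ps
# Inspect a model's parameters, template, and license
ollama show llama3.2
# Delete a model you no longer need
ollama rm codellama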
REST API (Use from Any Language)
Ollama exposes a REST API on localhost:11434:
# Generate a completion
curl http://localhost:11434/api/generate \
-d '{"model": "llama3.2", "prompt": "Why is the sky blue?", "stream": false}'
# Chat API (multi-turn messages, similar to OpenAI's chat format)
curl http://localhost:11434/api/chat \
-d '{
"model": "llama3.2",
"messages": [{"role": "user", "content": "Write a Python quicksort"}],
"stream": false
}'
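Because it's plain HTTP, any language works. Here's the same thing from Python using only the requests library; by default /api/generate streams its output as newline-delimited JSON chunks, so the streaming loop below parses one JSON object per line:
import json
import requests

# Non-streaming: one JSON object comes back with the full response
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2", "prompt": "Why is the sky blue?", "stream": False},
)
print(resp.json()["response"])

# Streaming: newline-delimited JSON chunks, each with a partial "response"
with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2", "prompt": "Write a haiku about GPUs"},
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if line:
            chunk = json.loads(line)
            print(chunk.get("response", ""), end="", flush=True)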
Python Integration
import ollama
# Simple generation
response = ollama.generate(model='llama3.2', prompt='Explain closures in JavaScript')
print(response['response'])
# Chat with history
messages = [
{'role': 'system', 'content': 'You are a senior software engineer. Be concise.'},
{'role': 'user', 'content': 'What is the difference between TCP and UDP?'},
]
response = ollama.chat(model='llama3.2', messages=messages)
print(response['message']['content'])
# Streaming response
for chunk in ollama.generate(model='llama3.2', prompt='Write a merge sort', stream=True):
print(chunk['response'], end='', flush=True)
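You can also override model options per call. This sketch assumes the standard Ollama option names (temperature, num_ctx), the same ones used in Modelfiles later in this guide:
import ollama

# Override sampling parameters for a single request
response = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Summarize TCP slow start in two sentences.'}],
    options={'temperature': 0.2, 'num_ctx': 4096},
)
print(response['message']['content'])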
OpenAI-Compatible Mode
Ollama speaks the OpenAI API format — drop it into any OpenAI app:
from openai import OpenAI
# Point OpenAI client to local Ollama
client = OpenAI(
base_url='http://localhost:11434/v1',
api_key='ollama' # any string works
)
response = client.chat.completions.create(
model='llama3.2',
messages=[{'role': 'user', 'content': 'Write a binary search in Python'}]
)
print(response.choices[0].message.content)
This means you can use Ollama as a free drop-in replacement for any OpenAI-powered app during development!
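Streaming works through the same endpoint, so OpenAI-style streaming code runs unchanged. This snippet reuses the client configured above:
# Stream tokens as they are generated
stream = client.chat.completions.create(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Explain Python list comprehensions'}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or '', end='', flush=True)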
Build a Local RAG App with Ollama
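The example below assumes you've pulled an embedding model and installed the LangChain packages it imports (package names reflect recent LangChain releases, so adjust them to your setup):
# Embedding model used by OllamaEmbeddings below
ollama pull nomic-embed-text
# Python dependencies for the snippet
pip install langchain langchain-ollama langchain-community chromadb pypdf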
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
# 100% local — no API keys needed
llm = ChatOllama(model="llama3.2", temperature=0)
embeddings = OllamaEmbeddings(model="nomic-embed-text")
# Load and embed docs
loader = PyPDFLoader("my_document.pdf")
chunks = RecursiveCharacterTextSplitter(chunk_size=500).split_documents(loader.load())
db = Chroma.from_documents(chunks, embeddings)
# Query
retriever = db.as_retriever(search_kwargs={"k": 3})
docs = retriever.invoke("What is the main topic?")
context = "\n".join([d.page_content for d in docs])
response = llm.invoke(f"Context: {context}\n\nQuestion: What is the main topic?")
print(response.content)
Custom Modelfiles (Fine-tune Behavior)
# Modelfile — create a custom model persona
FROM llama3.2
SYSTEM """
You are an expert Python developer who writes clean, well-commented code.
Always include type hints. Always handle exceptions. Use f-strings.
When explaining code, be concise and focus on the key insight.
"""
PARAMETER temperature 0.1
PARAMETER num_ctx 8192
# Build and run your custom model
ollama create python-expert -f Modelfile
ollama run python-expert
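The custom model is addressable like any other, from the CLI, the REST API, or the Python client. The snippet below simply reuses the python-expert name created above:
import ollama

# The system prompt and parameters from the Modelfile are already baked in
response = ollama.chat(
    model='python-expert',
    messages=[{'role': 'user', 'content': 'Write a function that reads a CSV file into a dict.'}],
)
print(response['message']['content'])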
Performance Guide
| Model | RAM Required | Speed (tokens/s) | Quality |
|---|---|---|---|
| llama3.2:3b | 4 GB | 60-80 t/s | Good |
| llama3.1:8b | 8 GB | 30-40 t/s | Very Good |
| mistral:7b | 8 GB | 35-45 t/s | Very Good |
| llama3.3:70b | 48 GB | 5-10 t/s | Excellent |
| deepseek-r1:32b | 24 GB | 10-15 t/s | Excellent |
Apple Silicon (M1/M2/M3): Ollama uses Metal GPU acceleration, typically 3-5x faster than CPU-only inference. NVIDIA GPU: CUDA is detected automatically; set CUDA_VISIBLE_DEVICES to pin Ollama to a specific card, and expect near cloud-speed performance.
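To verify that a model actually fits in VRAM, load it and check the processor column of ollama ps, which reports how much of the model is running on GPU versus CPU:
# Load a model with a one-off prompt, then inspect where it's running
ollama run llama3.2 "hello"
ollama ps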
Use Cases
- Code completion server: Integrate with VS Code via the Continue extension
- Private chatbot: Customer support without data leaving your premises
- Local RAG: Query confidential documents securely
- CI/CD AI reviews: Run code review in your pipeline, no API costs
- Edge AI: Deploy on-premise for air-gapped environments