Ollama Complete Guide 2026: Run LLMs Locally for Free

Sanjeev Sharma

Ollama 2026: Run Any LLM Locally, Free Forever

No API keys. No rate limits. No data leaving your machine. Ollama lets you run powerful open-source LLMs (Llama 3.3, Mistral, Gemma 3, DeepSeek-R1) locally with a single command.

Why Run LLMs Locally?

  • Privacy: Your data never leaves your machine — critical for medical, legal, financial apps
  • Cost: $0 per token vs $15/1M tokens for GPT-4o
  • Speed: No network latency; runs at your GPU's speed
  • Offline: Works without internet
  • Customization: Full control over system prompts, parameters, and model weights

Installation (All Platforms)

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download installer from https://ollama.com/download

# Verify installation
ollama --version
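
Once installed, the Ollama server runs in the background on localhost:11434. Here is a quick health check from Python (a minimal sketch; assumes the requests package is installed and uses the /api/version endpoint):

import requests

# The local Ollama server listens on port 11434 after installation;
# /api/version returns the running version as JSON.
r = requests.get('http://localhost:11434/api/version', timeout=5)
print(r.json())  # e.g. {'version': '0.5.7'}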

Pull and Run Models

# Pull Llama 3.3 (70B — state of the art open source)
ollama pull llama3.3

# Pull smaller, faster models
ollama pull llama3.2          # 3B — great for laptops
ollama pull mistral           # 7B — fast, great quality
ollama pull gemma3            # Google's Gemma 3
ollama pull deepseek-r1       # DeepSeek R1 — incredible reasoning
ollama pull codellama         # Optimized for code
ollama pull phi4              # Microsoft Phi-4 — compact + smart
ollama pull qwen2.5-coder     # Alibaba's coding model

# Chat with a model
ollama run llama3.2
>>> Tell me a joke about programmers

REST API (Use from Any Language)

Ollama exposes a REST API on localhost:11434:

# Generate a completion
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.2", "prompt": "Why is the sky blue?", "stream": false}'

# Chat API (multi-turn messages format, similar to OpenAI's; see OpenAI-compatible mode below)
curl http://localhost:11434/api/chat \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Write a Python quicksort"}],
    "stream": false
  }'
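
By default both endpoints stream newline-delimited JSON, one object per chunk, with "done": true on the final line (the curl examples above disable this with "stream": false). A minimal sketch of consuming the stream from Python with requests:

import json
import requests

# Stream tokens as they are generated; each line is a JSON object with a
# partial 'response' field, and the final one has 'done': true.
with requests.post(
    'http://localhost:11434/api/generate',
    json={'model': 'llama3.2', 'prompt': 'Why is the sky blue?'},
    stream=True,
) as r:
    for line in r.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get('response', ''), end='', flush=True)
        if chunk.get('done'):
            break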

Python Integration

# pip install ollama
import ollama

# Simple generation
response = ollama.generate(model='llama3.2', prompt='Explain closures in JavaScript')
print(response['response'])

# Chat with history
messages = [
    {'role': 'system', 'content': 'You are a senior software engineer. Be concise.'},
    {'role': 'user', 'content': 'What is the difference between TCP and UDP?'},
]
response = ollama.chat(model='llama3.2', messages=messages)
print(response['message']['content'])

# Streaming response
for chunk in ollama.generate(model='llama3.2', prompt='Write a merge sort', stream=True):
    print(chunk['response'], end='', flush=True)

OpenAI-Compatible Mode

Ollama speaks the OpenAI API format — drop it into any OpenAI app:

from openai import OpenAI

# Point OpenAI client to local Ollama
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # any string works
)

response = client.chat.completions.create(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Write a binary search in Python'}]
)
print(response.choices[0].message.content)

This means you can use Ollama as a free drop-in replacement for any OpenAI-powered app during development!
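
One handy pattern: recent openai-python releases also read OPENAI_BASE_URL and OPENAI_API_KEY from the environment, so you can redirect an existing app to Ollama without touching its code. A sketch, assuming that client behavior:

from openai import OpenAI

# In a dev shell, two environment variables are enough:
#   export OPENAI_BASE_URL=http://localhost:11434/v1
#   export OPENAI_API_KEY=ollama
client = OpenAI()  # picks up both variables; no code changes needed
response = client.chat.completions.create(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'ping'}],
)
print(response.choices[0].message.content)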


Build a Local RAG App with Ollama

# pip install langchain langchain-ollama langchain-community chromadb pypdf
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader

# 100% local — no API keys needed
# (pull the embedding model first: ollama pull nomic-embed-text)
llm = ChatOllama(model="llama3.2", temperature=0)
embeddings = OllamaEmbeddings(model="nomic-embed-text")

# Load and embed docs
loader = PyPDFLoader("my_document.pdf")
chunks = RecursiveCharacterTextSplitter(chunk_size=500).split_documents(loader.load())
db = Chroma.from_documents(chunks, embeddings)

# Query
retriever = db.as_retriever(search_kwargs={"k": 3})
docs = retriever.invoke("What is the main topic?")
context = "\n".join([d.page_content for d in docs])

response = llm.invoke(f"Context: {context}\n\nQuestion: What is the main topic?")
print(response.content)

Custom Modelfiles (Customize Model Behavior)

# Modelfile — create a custom model persona
FROM llama3.2

SYSTEM """
You are an expert Python developer who writes clean, well-commented code.
Always include type hints. Always handle exceptions. Use f-strings.
When explaining code, be concise and focus on the key insight.
"""

PARAMETER temperature 0.1
PARAMETER num_ctx 8192

# Build and run your custom model
ollama create python-expert -f Modelfile
ollama run python-expert
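
The custom model behaves like any other model name, including over the API. A short sketch using the Python client with the python-expert model created above:

import ollama

# The system prompt and parameters baked into the Modelfile apply automatically.
resp = ollama.chat(
    model='python-expert',
    messages=[{'role': 'user', 'content': 'Read a CSV file and sum the "price" column.'}],
)
print(resp['message']['content'])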

Performance Guide

| Model | RAM Required | Speed (tokens/s) | Quality |
|---|---|---|---|
| llama3.2:3b | 4 GB | 60-80 | Good |
| llama3.1:8b | 8 GB | 30-40 | Very Good |
| mistral:7b | 8 GB | 35-45 | Very Good |
| llama3.3:70b | 48 GB | 5-10 | Excellent |
| deepseek-r1:32b | 24 GB | 10-15 | Excellent |

Apple Silicon (M1/M2/M3): Ollama uses Metal GPU acceleration — 3-5x faster than CPU. NVIDIA GPU: Set CUDA_VISIBLE_DEVICES — near cloud-speed performance.
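
To check what your own hardware actually delivers, you can compute tokens/sec from the metadata Ollama returns with each completion (eval_count is tokens generated, eval_duration is in nanoseconds). A quick sketch:

import ollama

# eval_count and eval_duration come back with every non-streamed completion.
resp = ollama.generate(model='llama3.2', prompt='Write a haiku about autumn.')
print(f"{resp['eval_count'] / (resp['eval_duration'] / 1e9):.1f} tokens/sec")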


Use Cases

  • Code completion server: Integrate with VS Code via Continue extension
  • Private chatbot: Customer support without data leaving premises
  • Local RAG: Query confidential documents securely
  • CI/CD AI reviews: Run code review in your pipeline, no API costs
  • Edge AI: Deploy on-premise for air-gapped environments


Written by Sanjeev Sharma

Full Stack Engineer · E-mopro