Ollama — Run LLMs Locally on Your Mac

Sanjeev Sharma
5 min read

Introduction

Ollama makes running LLMs locally simple. This guide covers installation, downloading models, and integrating Ollama with Python applications.

Installation

Mac

# Download from ollama.ai or use brew
brew install ollama

# Start Ollama service
ollama serve

Linux

curl https://ollama.ai/install.sh | sh

# Start service
ollama serve

Windows

Download from ollama.ai and run the installer.

Downloading Models

# List locally installed models
ollama list

# Download Mistral 7B
ollama pull mistral

# Download Llama 2
ollama pull llama2

# Download Orca Mini
ollama pull orca-mini

# View model info
ollama show mistral

Available Models

Popular:
- mistral: 7B, fast and capable
- llama2: 7B, 13B, 70B variants
- neural-chat: Optimized for chat
- orca-mini: Small, efficient
- dolphin-mistral: Uncensored Mistral fine-tune
- vicuna: High quality responses

Interactive Chat

# Start interactive chat
ollama run mistral

# Type messages and get responses interactively
>>> What is machine learning?
Machine learning is...
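
# Slash commands work inside the session, e.g. to adjust parameters or exit:
>>> /set parameter temperature 0.5
>>> /bye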

Ollama CLI Commands

# Run model and get output
ollama run mistral "Explain quantum computing"

# The run command has no --temperature flag; set sampling parameters
# via a Modelfile, the API's options, or /set parameter inside a session
ollama run mistral "Generate a poem"

# List models
ollama list

# Show model details
ollama show mistral

# Remove model
ollama rm mistral

# Pull specific version
ollama pull llama2:13b

Using Ollama with Python

Direct HTTP API

import requests
import json

# Ollama runs on localhost:11434 by default

def generate_text(prompt: str, model: str = "mistral"):
    """Generate text using Ollama."""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False
        }
    )

    return response.json()["response"]

# Test
result = generate_text("What is the capital of France?")
print(result)
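
The HTTP API also exposes a chat endpoint that accepts a message list, which is more convenient for multi-turn conversations because Ollama applies the model's chat template for you. A minimal sketch against /api/chat on the same default port:

import requests

def chat_once(messages: list, model: str = "mistral"):
    """Send a conversation to Ollama's /api/chat endpoint."""
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": model,
            "messages": messages,
            "stream": False
        },
        timeout=60
    )
    response.raise_for_status()
    # The reply is returned under "message" -> "content"
    return response.json()["message"]["content"]

print(chat_once([{"role": "user", "content": "What is the capital of France?"}]))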

Using ollama-python Library

pip install ollama

import ollama

# Generate text
response = ollama.generate(
    model="mistral",
    prompt="Explain machine learning in one sentence"
)

print(response["response"])
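
Sampling settings such as temperature and the maximum number of generated tokens are passed through the options dictionary, the same parameters a Modelfile's PARAMETER lines set:

import ollama

# Pass generation parameters via the options dict
response = ollama.generate(
    model="mistral",
    prompt="Generate a haiku about the ocean",
    options={
        "temperature": 0.5,   # lower = more deterministic output
        "num_predict": 100    # cap on generated tokens
    }
)

print(response["response"])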

Streaming Responses

import ollama

def stream_response(prompt: str):
    """Stream responses in real-time."""
    stream = ollama.generate(
        model="mistral",
        prompt=prompt,
        stream=True
    )

    for chunk in stream:
        print(chunk["response"], end="", flush=True)
    print()

# Usage
stream_response("Write a short story about AI")

Building a Chat Application

import ollama

class LocalChatBot:
    def __init__(self, model: str = "mistral"):
        self.model = model
        self.conversation = []

    def chat(self, user_message: str) -> str:
        # Add user message
        self.conversation.append({
            "role": "user",
            "content": user_message
        })

        # Build conversation history
        history = "You are a helpful assistant.\n\n"
        for msg in self.conversation[:-1]:  # Exclude last message
            history += f"{msg['role'].capitalize()}: {msg['content']}\n"

        # Generate response
        response = ollama.generate(
            model=self.model,
            prompt=history + f"User: {user_message}\nAssistant:",
            stream=False
        )

        assistant_response = response["response"].strip()

        # Add to conversation
        self.conversation.append({
            "role": "assistant",
            "content": assistant_response
        })

        return assistant_response

# Usage
bot = LocalChatBot()
print(bot.chat("What is machine learning?"))
print(bot.chat("Give me an example"))
print(bot.chat("How does it work?"))

Integration with LangChain

pip install langchain langchain-community

from langchain_community.llms import Ollama
from langchain_core.prompts import ChatPromptTemplate

# Create Ollama LLM
llm = Ollama(
    model="mistral",
    base_url="http://localhost:11434"
)

# Create chain
prompt = ChatPromptTemplate.from_template(
    "Explain {topic} in one sentence"
)

chain = prompt | llm

# Execute
result = chain.invoke({"topic": "quantum computing"})
print(result)

Using with RAG

# Requires a local vector store backend: pip install chromadb
from langchain_community.llms import Ollama
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OllamaEmbeddings
from langchain_core.documents import Document
from langchain.chains import RetrievalQA

# Local embeddings with Ollama
embeddings = OllamaEmbeddings(
    model="mistral",
    base_url="http://localhost:11434"
)

# Example documents (replace with the output of your own loader)
docs = [
    Document(page_content="Ollama runs large language models locally."),
    Document(page_content="Mistral 7B is a fast, capable open-weight model."),
]

# Create vector store (using local embeddings)
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings
)

# Create QA chain
qa = RetrievalQA.from_chain_type(
    llm=Ollama(model="mistral"),
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

# Query
answer = qa.run("What is the main topic?")

Customizing Models

Create Custom Model

# Create Modelfile
cat > Modelfile << 'EOF'
FROM mistral

SYSTEM You are an expert Python programmer. Help users with their code.

PARAMETER temperature 0.7
EOF

# Build custom model
ollama create my-python-helper -f Modelfile

# Use custom model
ollama run my-python-helper "How do I sort a list in Python?"
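
The custom model can then be called from Python by name, just like the stock models:

import ollama

# Use the model created from the Modelfile above
response = ollama.generate(
    model="my-python-helper",
    prompt="How do I sort a list of tuples by the second element?"
)
print(response["response"])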

Performance Optimization

# Use GPU acceleration (if available)
# Ollama automatically uses GPU on supported hardware

# Check GPU usage
nvidia-smi

# Rough, hardware-dependent throughput (varies with quantization):
# Mistral 7B: ~40 tokens/sec on CPU, roughly an order of magnitude faster on GPU
# Llama2 7B: ~30 tokens/sec on CPU, roughly an order of magnitude faster on GPU
# Orca Mini: ~100 tokens/sec on CPU (very efficient)

Multi-Model Comparison

import ollama
import time

models = ["mistral", "orca-mini", "neural-chat"]
prompt = "What is artificial intelligence?"

for model in models:
    start = time.time()
    response = ollama.generate(model=model, prompt=prompt, stream=False)
    elapsed = time.time() - start

    print(f"\n{model}:")
    print(f"Response: {response['response'][:100]}...")
    print(f"Time: {elapsed:.2f}s")

Production Considerations

  1. CPU vs GPU: A supported GPU (CUDA/Metal) is typically around 10x faster than CPU
  2. Memory: Models need roughly 4-48 GB depending on size and quantization
  3. Latency: Acceptable for non-real-time applications
  4. Scaling: Run multiple Ollama instances for load distribution (see the sketch after the error-handling example below)
  5. Reliability: Implement error handling and timeouts, for example:

import requests
from requests.exceptions import Timeout, ConnectionError

def robust_generate(prompt: str, model: str = "mistral"):
    """Generate with error handling."""
    try:
        response = requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": model,
                "prompt": prompt,
                "stream": False
            },
            timeout=30
        )
        return response.json()["response"]
    except (Timeout, ConnectionError) as e:
        print(f"Ollama connection error: {e}")
        return None
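
For the scaling point above, one simple approach is to run several Ollama instances (for example a second one started with OLLAMA_HOST=0.0.0.0:11435 ollama serve) and round-robin requests across them. The second port below is illustrative; a minimal sketch:

import itertools
import requests

# Illustrative endpoints: two Ollama instances on different ports
OLLAMA_ENDPOINTS = itertools.cycle([
    "http://localhost:11434",
    "http://localhost:11435",
])

def balanced_generate(prompt: str, model: str = "mistral"):
    """Round-robin requests across local Ollama instances."""
    base_url = next(OLLAMA_ENDPOINTS)
    response = requests.post(
        f"{base_url}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=60
    )
    response.raise_for_status()
    return response.json()["response"]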

Conclusion

Ollama democratizes local LLM access: you can run models privately, offline, and without API costs, which makes it a great fit for development, education, and privacy-sensitive applications.

FAQ

Q: What model should I use? A: Start with Mistral 7B for a good balance of speed and quality. Use Orca Mini for speed, Llama 2 for quality.

Q: Can I run multiple models simultaneously? A: Yes, but they'll share resources. Separate GPU instances are recommended for production.

Q: Is Ollama suitable for production? A: For internal tools, prototypes, and offline applications. For high-traffic services, use cloud APIs.

Written by

Sanjeev Sharma

Full Stack Engineer · E-mopro