Ollama — Run LLMs Locally on Your Mac

Sanjeev Sharma
5 min read

Introduction

Ollama makes running LLMs locally simple. This guide covers installation, downloading models, and integrating Ollama with Python applications.

Installation

Mac

# Download from ollama.ai or use brew
brew install ollama

# Start Ollama service
ollama serve

Linux

curl https://ollama.ai/install.sh | sh

# Start service
ollama serve

Windows

Download from ollama.ai and run the installer.

Downloading Models

# List locally installed models
ollama list

# Download Mistral 7B
ollama pull mistral

# Download Llama 2
ollama pull llama2

# Download Orca Mini
ollama pull orca-mini

# View model info
ollama show mistral

Available Models

Popular:
- mistral: 7B, fast and capable
- llama2: 7B, 13B, 70B variants
- neural-chat: Optimized for chat
- orca-mini: Small, efficient
- dolphin-mistral: Uncensored Mistral fine-tune
- vicuna: High quality responses

Interactive Chat

# Start interactive chat
ollama run mistral

# Type messages and get responses interactively
>>> What is machine learning?
Machine learning is...
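
# Slash commands work inside the session, e.g. to adjust parameters or exit:
>>> /set parameter temperature 0.5
>>> /bye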

Ollama CLI Commands

# Run model and get output
ollama run mistral "Explain quantum computing"

# The run command has no --temperature flag; set sampling parameters
# via a Modelfile, the API's options, or /set parameter inside a session
ollama run mistral "Generate a poem"

# List models
ollama list

# Show model details
ollama show mistral

# Remove model
ollama rm mistral

# Pull specific version
ollama pull llama2:13b

Using Ollama with Python

Direct HTTP API

import requests
import json

# Ollama runs on localhost:11434 by default

def generate_text(prompt: str, model: str = "mistral"):
    """Generate text using Ollama."""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False
        }
    )

    return response.json()["response"]

# Test
result = generate_text("What is the capital of France?")
print(result)
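
The HTTP API also exposes a chat endpoint that accepts a message list, which is more convenient for multi-turn conversations because Ollama applies the model's chat template for you. A minimal sketch against /api/chat on the same default port:

import requests

def chat_once(messages: list, model: str = "mistral"):
    """Send a conversation to Ollama's /api/chat endpoint."""
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": model,
            "messages": messages,
            "stream": False
        },
        timeout=60
    )
    response.raise_for_status()
    # The reply is returned under "message" -> "content"
    return response.json()["message"]["content"]

print(chat_once([{"role": "user", "content": "What is the capital of France?"}]))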

Using ollama-python Library

pip install ollama

import ollama

# Generate text
response = ollama.generate(
    model="mistral",
    prompt="Explain machine learning in one sentence"
)

print(response["response"])
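
Sampling settings such as temperature and the maximum number of generated tokens are passed through the options dictionary, the same parameters a Modelfile's PARAMETER lines set:

import ollama

# Pass generation parameters via the options dict
response = ollama.generate(
    model="mistral",
    prompt="Generate a haiku about the ocean",
    options={
        "temperature": 0.5,   # lower = more deterministic output
        "num_predict": 100    # cap on generated tokens
    }
)

print(response["response"])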

Streaming Responses

import ollama

def stream_response(prompt: str):
    """Stream responses in real-time."""
    stream = ollama.generate(
        model="mistral",
        prompt=prompt,
        stream=True
    )

    for chunk in stream:
        print(chunk["response"], end="", flush=True)
    print()

# Usage
stream_response("Write a short story about AI")

Building a Chat Application

import ollama

class LocalChatBot:
    def __init__(self, model: str = "mistral"):
        self.model = model
        self.conversation = []

    def chat(self, user_message: str) -> str:
        # Add user message
        self.conversation.append({
            "role": "user",
            "content": user_message
        })

        # Build conversation history
        history = "You are a helpful assistant.\n\n"
        for msg in self.conversation[:-1]:  # Exclude last message
            history += f"{msg['role'].capitalize()}: {msg['content']}\n"

        # Generate response
        response = ollama.generate(
            model=self.model,
            prompt=history + f"User: {user_message}\nAssistant:",
            stream=False
        )

        assistant_response = response["response"].strip()

        # Add to conversation
        self.conversation.append({
            "role": "assistant",
            "content": assistant_response
        })

        return assistant_response

# Usage
bot = LocalChatBot()
print(bot.chat("What is machine learning?"))
print(bot.chat("Give me an example"))
print(bot.chat("How does it work?"))

Integration with LangChain

pip install langchain langchain-community

from langchain_community.llms import Ollama
from langchain_core.prompts import ChatPromptTemplate

# Create Ollama LLM
llm = Ollama(
    model="mistral",
    base_url="http://localhost:11434"
)

# Create chain
prompt = ChatPromptTemplate.from_template(
    "Explain {topic} in one sentence"
)

chain = prompt | llm

# Execute
result = chain.invoke({"topic": "quantum computing"})
print(result)

Using with RAG

# Requires a local vector store backend: pip install chromadb
from langchain_community.llms import Ollama
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OllamaEmbeddings
from langchain_core.documents import Document
from langchain.chains import RetrievalQA

# Local embeddings with Ollama
embeddings = OllamaEmbeddings(
    model="mistral",
    base_url="http://localhost:11434"
)

# Example documents (replace with the output of your own loader)
docs = [
    Document(page_content="Ollama runs large language models locally."),
    Document(page_content="Mistral 7B is a fast, capable open-weight model."),
]

# Create vector store (using local embeddings)
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings
)

# Create QA chain
qa = RetrievalQA.from_chain_type(
    llm=Ollama(model="mistral"),
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

# Query
answer = qa.run("What is the main topic?")

Customizing Models

Create Custom Model

# Create Modelfile
cat > Modelfile << 'EOF'
FROM mistral

SYSTEM You are an expert Python programmer. Help users with their code.

PARAMETER temperature 0.7
EOF

# Build custom model
ollama create my-python-helper -f Modelfile

# Use custom model
ollama run my-python-helper "How do I sort a list in Python?"
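
The custom model can then be called from Python by name, just like the stock models:

import ollama

# Use the model created from the Modelfile above
response = ollama.generate(
    model="my-python-helper",
    prompt="How do I sort a list of tuples by the second element?"
)
print(response["response"])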

Performance Optimization

# Use GPU acceleration (if available)
# Ollama automatically uses GPU on supported hardware

# Check GPU usage
nvidia-smi

# Rough, hardware-dependent throughput (varies with quantization):
# Mistral 7B: ~40 tokens/sec on CPU, roughly an order of magnitude faster on GPU
# Llama2 7B: ~30 tokens/sec on CPU, roughly an order of magnitude faster on GPU
# Orca Mini: ~100 tokens/sec on CPU (very efficient)

Multi-Model Comparison

import ollama
import time

models = ["mistral", "orca-mini", "neural-chat"]
prompt = "What is artificial intelligence?"

for model in models:
    start = time.time()
    response = ollama.generate(model=model, prompt=prompt, stream=False)
    elapsed = time.time() - start

    print(f"\n{model}:")
    print(f"Response: {response['response'][:100]}...")
    print(f"Time: {elapsed:.2f}s")

Production Considerations

  1. CPU vs GPU: A supported GPU (CUDA/Metal) is typically around 10x faster than CPU
  2. Memory: Models need roughly 4-48 GB depending on size and quantization
  3. Latency: Acceptable for non-real-time applications
  4. Scaling: Run multiple Ollama instances for load distribution (see the sketch after the error-handling example below)
  5. Reliability: Implement error handling and timeouts, for example:

import requests
from requests.exceptions import Timeout, ConnectionError

def robust_generate(prompt: str, model: str = "mistral"):
    """Generate with error handling."""
    try:
        response = requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": model,
                "prompt": prompt,
                "stream": False
            },
            timeout=30
        )
        return response.json()["response"]
    except (Timeout, ConnectionError) as e:
        print(f"Ollama connection error: {e}")
        return None
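
For the scaling point above, one simple approach is to run several Ollama instances (for example a second one started with OLLAMA_HOST=0.0.0.0:11435 ollama serve) and round-robin requests across them. The second port below is illustrative; a minimal sketch:

import itertools
import requests

# Illustrative endpoints: two Ollama instances on different ports
OLLAMA_ENDPOINTS = itertools.cycle([
    "http://localhost:11434",
    "http://localhost:11435",
])

def balanced_generate(prompt: str, model: str = "mistral"):
    """Round-robin requests across local Ollama instances."""
    base_url = next(OLLAMA_ENDPOINTS)
    response = requests.post(
        f"{base_url}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=60
    )
    response.raise_for_status()
    return response.json()["response"]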

Conclusion

Ollama democratizes local LLM access: you can run models privately, offline, and without API costs, which makes it a great fit for development, education, and privacy-sensitive applications.

FAQ

Q: What model should I use? A: Start with Mistral 7B for a good balance of speed and quality. Use Orca Mini for speed, Llama 2 for quality.

Q: Can I run multiple models simultaneously? A: Yes, but they'll share resources. Separate GPU instances are recommended for production.

Q: Is Ollama suitable for production? A: For internal tools, prototypes, and offline applications. For high-traffic services, use cloud APIs.

Written by

Sanjeev Sharma

Full Stack Engineer · E-mopro