Ollama — Run LLMs Locally on Your Mac
Introduction
Ollama makes running LLMs locally simple. This guide covers installation, downloading models, and integrating Ollama with Python applications.
- Installation
  - Mac
  - Linux
  - Windows
- Downloading Models
  - Available Models
- Interactive Chat
- Ollama CLI Commands
- Using Ollama with Python
  - Direct HTTP API
  - Using ollama-python Library
  - Streaming Responses
  - Building a Chat Application
- Integration with LangChain
  - Using with RAG
- Customizing Models
  - Create Custom Model
- Performance Optimization
- Multi-Model Comparison
- Production Considerations
- Conclusion
- FAQ
Installation
Mac
# Download from ollama.ai or use brew
brew install ollama
# Start Ollama service
ollama serve
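Once the server is running, you can confirm it's reachable; the root endpoint replies with a plain status message:
# Verify the server is up
curl http://localhost:11434
# -> "Ollama is running"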
Linux
curl -fsSL https://ollama.com/install.sh | sh
# Start service
ollama serve
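On Linux the install script typically registers Ollama as a systemd service, so instead of running ollama serve manually you can manage it with systemctl:
# Check / enable the background service
systemctl status ollama
sudo systemctl enable ollama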
Windows
Download from ollama.ai and run the installer.
Downloading Models
# List models already downloaded locally
ollama list
# Download Mistral 7B
ollama pull mistral
# Download Llama 2
ollama pull llama2
# Download Orca Mini
ollama pull orca-mini
# View model info
ollama show mistral
Available Models
Popular:
- mistral: 7B; fast and capable general-purpose model
- llama2: available in 7B, 13B, and 70B variants
- neural-chat: tuned for conversational use
- orca-mini: small and efficient; good for low-end hardware
- dolphin-mistral: uncensored fine-tune of Mistral
- vicuna: chat model known for high-quality responses
Interactive Chat
# Start interactive chat
ollama run mistral
# Type messages and get responses interactively
>>> What is machine learning?
Machine learning is...
Ollama CLI Commands
# Run model and get output
ollama run mistral "Explain quantum computing"
# The CLI has no sampling flags; set parameters inside an interactive session
ollama run mistral
>>> /set parameter temperature 0.5
>>> Generate a poem
# List models
ollama list
# Show model details
ollama show mistral
# Remove model
ollama rm mistral
# Pull specific version
ollama pull llama2:13b
Using Ollama with Python
Direct HTTP API
import requests

# Ollama listens on localhost:11434 by default
def generate_text(prompt: str, model: str = "mistral"):
    """Generate text using Ollama's /api/generate endpoint."""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False
        }
    )
    return response.json()["response"]

# Test
result = generate_text("What is the capital of France?")
print(result)
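The HTTP API also exposes a chat endpoint that accepts a message list instead of a raw prompt, which is more convenient for multi-turn conversations. A minimal sketch against the /api/chat route:
import requests

# /api/chat takes role/content messages and, with streaming disabled,
# returns a single assistant message
response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "mistral",
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "stream": False
    }
)
print(response.json()["message"]["content"])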
Using ollama-python Library
pip install ollama
import ollama

# Generate text
response = ollama.generate(
    model="mistral",
    prompt="Explain machine learning in one sentence"
)
print(response["response"])
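Sampling parameters are passed via an options dict rather than as top-level arguments. A small sketch; temperature and num_predict are standard Ollama options:
import ollama

# Pass sampling parameters through the options dict
response = ollama.generate(
    model="mistral",
    prompt="Write a haiku about the ocean",
    options={
        "temperature": 0.5,  # lower = more deterministic
        "num_predict": 64    # cap on generated tokens
    }
)
print(response["response"])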
Streaming Responses
import ollama

def stream_response(prompt: str):
    """Stream the response token-by-token as it is generated."""
    stream = ollama.generate(
        model="mistral",
        prompt=prompt,
        stream=True
    )
    for chunk in stream:
        print(chunk["response"], end="", flush=True)
    print()

# Usage
stream_response("Write a short story about AI")
Building a Chat Application
import ollama

class LocalChatBot:
    def __init__(self, model: str = "mistral"):
        self.model = model
        self.conversation = []

    def chat(self, user_message: str) -> str:
        # Add user message
        self.conversation.append({
            "role": "user",
            "content": user_message
        })

        # Build conversation history as a plain-text prompt
        history = "You are a helpful assistant.\n\n"
        for msg in self.conversation[:-1]:  # Exclude the message just added
            history += f"{msg['role'].capitalize()}: {msg['content']}\n"

        # Generate response
        response = ollama.generate(
            model=self.model,
            prompt=history + f"User: {user_message}\nAssistant:",
            stream=False
        )
        assistant_response = response["response"].strip()

        # Record the assistant's reply
        self.conversation.append({
            "role": "assistant",
            "content": assistant_response
        })
        return assistant_response

# Usage
bot = LocalChatBot()
print(bot.chat("What is machine learning?"))
print(bot.chat("Give me an example"))
print(bot.chat("How does it work?"))
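The library also provides a chat function that takes the message list directly, so you don't have to flatten history into a prompt yourself; the model's own chat template is applied for you. A sketch of the same bot using it:
import ollama

conversation = []

def chat(user_message: str) -> str:
    conversation.append({"role": "user", "content": user_message})
    # ollama.chat applies the model's chat template to the full history
    response = ollama.chat(model="mistral", messages=conversation)
    reply = response["message"]["content"]
    conversation.append({"role": "assistant", "content": reply})
    return reply

print(chat("What is machine learning?"))
print(chat("Give me an example"))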
Integration with LangChain
pip install langchain langchain-community
from langchain_community.llms import Ollama
from langchain_core.prompts import ChatPromptTemplate

# Create Ollama LLM
llm = Ollama(
    model="mistral",
    base_url="http://localhost:11434"
)

# Create chain
prompt = ChatPromptTemplate.from_template(
    "Explain {topic} in one sentence"
)
chain = prompt | llm

# Execute
result = chain.invoke({"topic": "quantum computing"})
print(result)
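Newer LangChain releases ship a dedicated langchain-ollama package; if you're on a recent version, the equivalent import looks like this:
pip install langchain-ollama
from langchain_ollama import OllamaLLM

llm = OllamaLLM(model="mistral", base_url="http://localhost:11434")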
Using with RAG
from langchain_community.llms import Ollama
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OllamaEmbeddings
from langchain.chains import RetrievalQA

# Local embeddings with Ollama
embeddings = OllamaEmbeddings(
    model="mistral",
    base_url="http://localhost:11434"
)

# Create vector store from your documents
# (docs is a list of LangChain Document objects, e.g. from a loader + splitter)
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings
)

# Create QA chain
qa = RetrievalQA.from_chain_type(
    llm=Ollama(model="mistral"),
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

# Query
answer = qa.run("What is the main topic?")
print(answer)
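General chat models can produce embeddings, but a dedicated embedding model is usually faster and gives better retrieval quality. Ollama's library includes nomic-embed-text for exactly this:
# Pull a dedicated embedding model
ollama pull nomic-embed-text
embeddings = OllamaEmbeddings(
    model="nomic-embed-text",
    base_url="http://localhost:11434"
)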
Customizing Models
Create Custom Model
# Create Modelfile
cat > Modelfile << 'EOF'
FROM mistral
SYSTEM You are an expert Python programmer. Help users with their code.
PARAMETER temperature 0.7
EOF
# Build custom model
ollama create my-python-helper -f Modelfile
# Use custom model
ollama run my-python-helper "How do I sort a list in Python?"
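Modelfiles aren't limited to a system prompt: PARAMETER lines map to the same options as the API, and you can print the Modelfile behind any installed model. A short sketch, assuming the standard num_ctx and top_p parameter names:
FROM mistral
SYSTEM You are an expert Python programmer. Help users with their code.
PARAMETER temperature 0.7
# num_ctx sets the context window; top_p adjusts nucleus sampling
PARAMETER num_ctx 4096
PARAMETER top_p 0.9
# Inspect the Modelfile of any installed model
ollama show mistral --modelfile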
Performance Optimization
# Ollama uses GPU acceleration automatically where available
# (Metal on Apple Silicon, CUDA on NVIDIA hardware)
# Check GPU usage on NVIDIA systems
nvidia-smi
# Throughput depends heavily on hardware and quantization; as rough guides:
# - 7B models (mistral, llama2) run at tens of tokens/sec on a modern CPU
#   and considerably faster on a recent GPU or Apple Silicon
# - Smaller models like orca-mini are noticeably faster on the same hardware
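Rather than relying on published figures, measure on your own machine; the CLI's --verbose flag prints timing statistics, including the eval rate, after each response:
# Print load/eval timing stats after the response
ollama run mistral --verbose "Explain quantum computing"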
Multi-Model Comparison
import ollama
import time

models = ["mistral", "orca-mini", "neural-chat"]
prompt = "What is artificial intelligence?"

for model in models:
    start = time.time()
    response = ollama.generate(model=model, prompt=prompt, stream=False)
    elapsed = time.time() - start
    print(f"\n{model}:")
    print(f"Response: {response['response'][:100]}...")
    print(f"Time: {elapsed:.2f}s")
Production Considerations
- CPU vs GPU: GPU inference is often an order of magnitude faster but requires CUDA or Metal support
- Memory: roughly 4GB for a quantized 7B model, up to 40GB+ for 70B variants
- Latency: acceptable for non-real-time applications
- Scaling: run multiple Ollama instances for load distribution
- Reliability: implement error handling and timeouts, as shown below
import requests
from requests.exceptions import RequestException

def robust_generate(prompt: str, model: str = "mistral"):
    """Generate with a timeout and error handling."""
    try:
        response = requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": model,
                "prompt": prompt,
                "stream": False
            },
            timeout=30
        )
        response.raise_for_status()  # surface HTTP errors (e.g., unknown model)
        return response.json()["response"]
    except RequestException as e:  # covers timeouts and connection errors
        print(f"Ollama request failed: {e}")
        return None
Conclusion
Ollama democratizes local LLM access: run models privately, offline, and without API costs. It's a great fit for development, education, and privacy-sensitive applications.
FAQ
Q: What model should I use? A: Start with Mistral 7B for good balance. Use Orca Mini for speed, Llama2 for quality.
Q: Can I run multiple models simultaneously? A: Yes, but they'll share resources. Separate GPU instances recommended for production.
Q: Is Ollama suitable for production? A: For internal tools, prototypes, and offline applications. For high-traffic services, use cloud APIs.