Mixtral 8x7B — Run and Use Locally
Introduction
Mixtral 8x7B uses a sparse Mixture of Experts (MoE) architecture to deliver quality comparable to much larger dense models while activating only about 13B of its ~47B parameters per token. This guide covers setup and optimization for local deployment.
- What is Mixture of Experts?
- Installation
- Via Ollama
- Via Hugging Face
- Memory Requirements
- Basic Usage
- Quantization
- Multi-GPU Inference
- Chat Application
- RAG with Mixtral
- Fine-tuning Mixtral
- Monitoring Expert Usage
- Performance Benchmarks
- Comparison: Mixtral vs Dense Models
- Conclusion
- FAQ
What is Mixture of Experts?
An MoE architecture contains specialized sub-networks (experts); a learned router activates only a few of them for each input token:
Dense model: every parameter processes every token → compute cost grows with total model size
MoE model: a router selects a small subset of experts per token → only a fraction of the parameters do work for any given token
Benefit: much more total capacity (quality) at roughly the compute cost of a much smaller dense model
Mixtral 8x7B:
- 8 experts per MoE layer, roughly 7B parameters each (~47B total, since the experts share the attention layers)
- 2 experts active per token (top-2 routing)
- Active parameters per token: ~13B
- Quality: comparable to Llama 2 70B on many benchmarks
- Speed: similar to a ~13B dense model, since only the active parameters are computed (see the routing sketch below)
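To make the routing concrete, here is a minimal, self-contained sketch of a top-2 MoE layer in PyTorch. It is illustrative only: the expert definition, class names (TinyExpert, Top2MoELayer), and hidden size are made up and far smaller than Mixtral's real SwiGLU experts.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyExpert(nn.Module):
    """Stand-in for one feed-forward expert (Mixtral's real experts are SwiGLU blocks)."""
    def __init__(self, dim: int):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return self.ff(x)

class Top2MoELayer(nn.Module):
    """Route each token to its top-2 experts and mix their outputs."""
    def __init__(self, dim: int, num_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)        # one logit per expert
        self.experts = nn.ModuleList([TinyExpert(dim) for _ in range(num_experts)])

    def forward(self, x):                                # x: (tokens, dim)
        weights, idx = self.router(x).topk(2, dim=-1)    # pick 2 experts per token
        weights = F.softmax(weights, dim=-1)             # normalize the two gate scores
        out = torch.zeros_like(x)
        for slot in range(2):                            # run only the selected experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = Top2MoELayer(dim=64)
print(layer(torch.randn(10, 64)).shape)  # torch.Size([10, 64])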
Installation
Via Ollama
ollama pull mixtral:latest
ollama run mixtral
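Ollama also serves a local HTTP API (on port 11434 by default), so you can call Mixtral from your own code without loading the weights yourself. A minimal sketch using requests:
import requests

# Assumes `ollama serve` is running locally and the mixtral model has been pulled
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mixtral", "prompt": "Explain MoE in one sentence.", "stream": False},
    timeout=300,
)
print(resp.json()["response"])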
Via Hugging Face
pip install transformers accelerate bitsandbytes torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.float16,
device_map="auto"
)
Memory Requirements
Mixtral 8x7B weights (~46.7B parameters):
- FP16/BF16: ~94GB
- 8-bit quantization: ~47GB
- 4-bit quantization: ~23-26GB (plus KV cache and runtime overhead)
Recommendations:
- CPU-only: 32GB+ system RAM for a 4-bit GGUF build via Ollama/llama.cpp (64GB is more comfortable)
- Single GPU: a 24GB card (e.g. RTX 4090) with 4-bit quantization, usually with some layers offloaded to CPU
- Multiple GPUs: split the model across devices with device_map="auto"
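The numbers above follow directly from parameter count times bytes per parameter; a quick back-of-the-envelope check (weights only, ignoring KV cache and framework overhead):
params = 46.7e9  # total Mixtral 8x7B parameters

for name, bytes_per_param in [("fp16", 2.0), ("int8", 1.0), ("4-bit", 0.5)]:
    gb = params * bytes_per_param / 1e9
    print(f"{name:>6}: ~{gb:.0f} GB")  # fp16 ~93 GB, int8 ~47 GB, 4-bit ~23 GB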
Basic Usage
def generate(prompt: str) -> str:
    """Generate text with Mixtral."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,      # required for temperature/top_p to take effect
        temperature=0.7,
        top_p=0.95,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
# Test
result = generate("Machine learning is")
print(result)
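Because this is the Instruct variant, wrapping the prompt in Mistral's [INST] ... [/INST] instruction format usually gives better answers than a bare completion prompt:
# Same generate() helper, but with the instruction format the model was tuned on
result = generate("[INST] Explain machine learning in one short paragraph. [/INST]")
print(result)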
Quantization
from transformers import BitsAndBytesConfig
# 8-bit quantization
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
"mistralai/Mixtral-8x7B-Instruct-v0.1",
quantization_config=bnb_config,
device_map="auto"
)
# 4-bit quantization
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_quant_type="nf4"
)
model = AutoModelForCausalLM.from_pretrained(
"mistralai/Mixtral-8x7B-Instruct-v0.1",
quantization_config=bnb_config,
device_map="auto"
)
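After loading a quantized model, you can sanity-check how much memory the weights actually occupy with get_memory_footprint(), a standard transformers method:
# Rough size of the loaded weights in GB
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.1f} GB")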
Multi-GPU Inference
from transformers import AutoModelForCausalLM
# Automatic distribution across GPUs
model = AutoModelForCausalLM.from_pretrained(
"mistralai/Mixtral-8x7B-Instruct-v0.1",
device_map="auto",
torch_dtype=torch.float16
)
# Check allocation
print(model.hf_device_map)
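If the automatic split comes out unbalanced, you can cap how much each device may receive with the max_memory argument (handled by accelerate under the hood). The values below are only examples for a machine with two 24GB cards:
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    device_map="auto",
    torch_dtype=torch.float16,
    # leave headroom per GPU for activations/KV cache; overflow spills to CPU RAM
    max_memory={0: "22GiB", 1: "22GiB", "cpu": "64GiB"},
)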
Chat Application
def chat_mixtral(messages: list) -> str:
    """Chat with Mixtral using the Mistral instruction format."""
    formatted = ""
    for msg in messages:
        if msg["role"] == "user":
            formatted += f"[INST] {msg['content']} [/INST] "
        else:
            formatted += f"{msg['content']}</s> "
    inputs = tokenizer(formatted, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=512)
    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
# Multi-turn conversation
messages = [
{"role": "user", "content": "What is quantum computing?"},
]
response1 = chat_mixtral(messages)
print(response1)
messages.append({"role": "assistant", "content": response1})
messages.append({"role": "user", "content": "How does it differ from classical computing?"})
response2 = chat_mixtral(messages)
print(response2)
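Hand-rolled prompt formatting is easy to get subtly wrong. Since the tokenizer ships with the model's chat template, a more robust variant builds the prompt with apply_chat_template; this is a sketch of the same idea, not a replacement mandated by the model:
def chat_mixtral_templated(messages: list) -> str:
    """Same idea, but let the tokenizer apply the official chat template."""
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(input_ids, max_new_tokens=512)
    return tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True)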
RAG with Mixtral
from langchain_community.llms import Ollama
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA

# Use Ollama for easy setup (Mixtral must already be pulled)
llm = Ollama(model="mixtral")

# Create QA chain; `vectorstore` is an existing Chroma index (see the sketch below)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)
# Query
answer = qa.run("What is the main topic?")
print(answer)
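The chain above assumes vectorstore already exists. A minimal sketch for building one from a couple of strings, using a small Hugging Face embedding model; the texts and the embedding model name are just examples:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

texts = [
    "Mixtral 8x7B is a sparse mixture-of-experts language model.",
    "It activates two of eight experts for every token.",
]
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma.from_texts(texts, embeddings)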
Fine-tuning Mixtral
from peft import get_peft_model, prepare_model_for_kbit_training, LoraConfig
from transformers import BitsAndBytesConfig

# Load the model in 8-bit to keep training memory manageable
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Apply LoRA (only the small adapter matrices are trained)
peft_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.1,
    bias="none"
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# Fine-tune (see the Trainer sketch below)
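For completeness, here is a minimal sketch of the training loop with the standard Trainer. The dataset (train_dataset) is assumed to already be tokenized with input_ids/labels, and the hyperparameters are placeholders rather than tuned values:
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    output_dir="mixtral-lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch size of 8
    learning_rate=2e-4,
    num_train_epochs=1,
    logging_steps=10,
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,     # assumed: a pre-tokenized dataset
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("mixtral-lora")  # saves only the LoRA adapters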
Monitoring Expert Usage
def analyze_expert_usage(prompt: str):
    """Show which experts the router selects for each token (last MoE layer)."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs, output_router_logits=True)
    # outputs.router_logits is a tuple with one tensor per MoE layer;
    # higher logits mean the expert is more likely to be selected (top-2 routing)
    last_layer = outputs.router_logits[-1]           # shape: (num_tokens, num_experts)
    top2 = last_layer.topk(2, dim=-1).indices
    print(f"Prompt: {prompt}")
    for token_id, experts in zip(inputs["input_ids"][0], top2):
        print(f"{tokenizer.decode(int(token_id)):>12} -> experts {experts.tolist()}")

analyze_expert_usage("Machine learning is a fascinating field")
Performance Benchmarks
import time
def benchmark_mixtral():
    """Benchmark Mixtral generation speed."""
    prompt = "Write a detailed explanation of neural networks"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start = time.time()
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True
    )
    elapsed = time.time() - start
    # Count only the newly generated tokens, not the prompt
    tokens = outputs[0].shape[0] - inputs["input_ids"].shape[1]
    speed = tokens / elapsed
    print(f"Generated: {tokens} tokens")
    print(f"Time: {elapsed:.2f}s")
    print(f"Speed: {speed:.1f} tokens/sec")
benchmark_mixtral()
# Expected speeds vary widely with the backend (transformers vs llama.cpp/Ollama),
# the quantization level, and whether layers are offloaded to CPU;
# benchmark on your own hardware rather than relying on published figures.
Comparison: Mixtral vs Dense Models
Mixtral 8x7B:
- Quality: comparable to 70B dense models on many benchmarks
- Speed: roughly that of a ~13B dense model (only ~13B parameters active per token)
- Memory: ~24-47GB (4-bit to 8-bit quantized)
- Efficiency: high quality per unit of compute
Llama 2 70B:
- Quality: strong (full 70B dense)
- Speed: slow; every parameter is used for every token
- Memory: ~40-80GB (quantized)
- Efficiency: lower
Llama 2 7B:
- Quality: modest (7B dense)
- Speed: very fast
- Memory: 4-8GB (quantized)
- Efficiency: lightest footprint, lowest quality of the three
Conclusion
Mixtral 8x7B offers a distinctive trade-off: near-70B quality at roughly the inference cost of a 13B model, making it a strong choice when you need both capability and speed on local hardware.
FAQ
Q: When should I use Mixtral over Llama? A: Use Mixtral for quality-sensitive tasks when you have the memory for it; Llama 2 7B/13B is simpler and lighter.
Q: Can I run Mixtral on RTX 3090? A: Mostly. The 4-bit weights are roughly 23-26GB, so a 24GB card is very tight and usually needs some layers offloaded to CPU (e.g. via Ollama/llama.cpp).
Q: Is Mixtral better than GPT-3.5? A: On many benchmarks, Mixtral outperforms GPT-3.5. For very complex tasks, GPT-4 is still superior.