Mixtral 8x7B — Run and Use Locally

Sanjeev Sharma
4 min read


Introduction

Mixtral 8x7B uses a sparse Mixture of Experts (MoE) architecture to deliver quality comparable to much larger dense models while activating only about 13B of its ~47B parameters per token. This guide covers setup and optimization for local deployment.

What is Mixture of Experts?

An MoE architecture contains specialized sub-networks (experts), and a router activates only a few of them for each input token:

Dense model: All parameters are active for every input token.
Performance: Consistent, but cost grows with model size.

MoE model: A router selects a small subset of experts per token.
Benefit: Much more total capacity at roughly the same per-token compute.

Mixtral 8x7B at a glance (a minimal routing sketch follows after this list):

  • 8 experts of roughly 7B parameters each in every MoE layer
  • 2 experts selected per token by the router
  • Total size: ~47B parameters, of which ~13B are active per token
  • Quality: comparable to Llama 2 70B on many benchmarks
  • Speed: inference cost similar to a ~13B dense model
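
To make the routing idea concrete, here is a minimal, self-contained top-2 routing layer in PyTorch. It is only an illustration of the mechanism, not Mixtral's actual implementation; the dimensions and expert definition are invented for the example.

import torch
import torch.nn as nn

class TinyTop2MoE(nn.Module):
    """Toy top-2 MoE layer: a router scores experts, only the best 2 run per token."""
    def __init__(self, dim=64, num_experts=8):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)  # produces routing logits
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                           # x: (tokens, dim)
        logits = self.router(x)                     # (tokens, num_experts)
        weights, picked = torch.topk(logits.softmax(dim=-1), k=2, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(2):                       # combine the 2 chosen experts per token
            for e, expert in enumerate(self.experts):
                mask = picked[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Each token only pays for 2 of the 8 experts, yet total capacity is 8 experts.
moe = TinyTop2MoE()
print(moe(torch.randn(5, 64)).shape)  # torch.Size([5, 64])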

Installation

Via Ollama

ollama pull mixtral:latest
ollama run mixtral
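
Once pulled, Ollama also serves a local HTTP API on port 11434, so you can call Mixtral from Python. A minimal sketch using the requests library (the prompt is just an example):

import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mixtral", "prompt": "Explain Mixture of Experts in one sentence.", "stream": False},
)
print(response.json()["response"])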

Via Hugging Face

pip install transformers accelerate torch

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

Memory Requirements

Mixtral 8x7B has ~47B total parameters, so memory needs are substantial:
- float16/bfloat16: ~94GB of weights
- 8-bit quantization: ~47GB
- 4-bit quantization: ~24-26GB

Recommendations:
- CPU-only: 32GB+ system RAM for a 4-bit build (e.g., via Ollama or llama.cpp); ~94GB+ for unquantized float16
- Single 24GB GPU (RTX 3090/4090): 4-bit quantization, possibly with some layers offloaded to CPU
- Multiple GPUs: split the model across devices with device_map="auto"
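
These figures follow directly from the parameter count: weight memory ≈ parameters × bytes per parameter, plus overhead for activations and the KV cache. A rough back-of-the-envelope check (46.7B is the published total parameter count):

params = 46.7e9  # total parameters in Mixtral 8x7B

for name, bytes_per_param in [("float16", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    gb = params * bytes_per_param / 1e9
    print(f"{name:8s}: ~{gb:.0f} GB of weights (plus activations/KV cache)")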

Basic Usage

def generate(prompt: str):
    """Generate text with Mixtral."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,      # required for temperature/top_p to take effect
        temperature=0.7,
        top_p=0.95
    )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test
result = generate("Machine learning is")
print(result)
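
For long generations it is nicer to stream tokens as they are produced. transformers ships a TextStreamer utility that prints tokens incrementally; a minimal sketch reusing the model and tokenizer loaded above:

from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

inputs = tokenizer("Explain gradient descent simply.", return_tensors="pt").to(model.device)
model.generate(**inputs, max_new_tokens=200, streamer=streamer)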

Quantization

# Requires the bitsandbytes package: pip install bitsandbytes
from transformers import BitsAndBytesConfig

# 8-bit quantization (weights ~47GB)
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    quantization_config=bnb_config,
    device_map="auto"
)

# 4-bit quantization (weights ~24GB, NF4 data type)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    quantization_config=bnb_config,
    device_map="auto"
)
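
After loading, it is worth confirming how much VRAM the quantized model actually occupies. A quick check using PyTorch's CUDA allocator statistics:

import torch

for i in range(torch.cuda.device_count()):
    allocated_gb = torch.cuda.memory_allocated(i) / 1e9
    print(f"GPU {i}: {allocated_gb:.1f} GB allocated")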

Multi-GPU Inference

from transformers import AutoModelForCausalLM

# Automatic distribution across GPUs
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    device_map="auto",
    torch_dtype=torch.float16
)

# Check allocation
print(model.hf_device_map)
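
If the automatic split is unbalanced, you can cap how much memory each device may use via the max_memory argument. The GiB values below are placeholders for your own hardware:

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    device_map="auto",
    torch_dtype=torch.float16,
    max_memory={0: "22GiB", 1: "22GiB", "cpu": "64GiB"}  # per-device limits; layers that don't fit are offloaded
)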

Chat Application

def chat_mixtral(messages: list) -> str:
    """Chat with Mixtral using the Mistral [INST] instruction format."""
    # User turns are wrapped in [INST] ... [/INST]; assistant turns end with </s>
    formatted = ""
    for msg in messages:
        if msg["role"] == "user":
            formatted += f"[INST] {msg['content']} [/INST] "
        else:
            formatted += f"{msg['content']}</s>"

    inputs = tokenizer(formatted, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=512)

    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Multi-turn conversation
messages = [
    {"role": "user", "content": "What is quantum computing?"},
]
response1 = chat_mixtral(messages)
print(response1)

messages.append({"role": "assistant", "content": response1})
messages.append({"role": "user", "content": "How does it differ from classical computing?"})
response2 = chat_mixtral(messages)
print(response2)
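
Hand-building the [INST] template works, but the Mixtral-Instruct tokenizer also ships a chat template, so you can let transformers do the formatting. A sketch using the same messages list as above:

input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=512)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True))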

RAG with Mixtral

from langchain_community.llms import Ollama
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA

# Use Ollama for easy setup
llm = Ollama(model="mixtral")

# Create QA chain (with RAG); `vectorstore` is an existing Chroma store built from your documents
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)
)

# Query
answer = qa.run("What is the main topic?")
print(answer)
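
The snippet above assumes vectorstore already exists. One minimal way to build it is with Chroma and Ollama's embedding endpoint (requires pip install chromadb and ollama pull nomic-embed-text; the embedding model name and the texts below are placeholders for your own data):

from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

texts = [
    "Mixtral 8x7B is a sparse mixture-of-experts language model.",
    "It activates two of its eight experts for each token.",
]
vectorstore = Chroma.from_texts(texts, embedding=OllamaEmbeddings(model="nomic-embed-text"))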

Fine-tuning Mixtral

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load model in 8-bit
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    load_in_8bit=True,
    device_map="auto"
)
model = prepare_model_for_kbit_training(model)  # freeze base weights, prepare for k-bit training

# Apply LoRA (only the small adapter matrices are trained)
peft_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

# Fine-tune (see previous guides for Trainer setup)
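
For completeness, here is a minimal training-loop sketch with the Hugging Face Trainer. The train_dataset variable is a hypothetical tokenized dataset, and the hyperparameters are illustrative rather than tuned values:

from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

args = TrainingArguments(
    output_dir="mixtral-lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    num_train_epochs=1,
    logging_steps=10,
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # assumed: tokenized dataset with input_ids/labels
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()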

Monitoring Expert Usage

def analyze_expert_usage(prompt: str):
    """Analyze which experts the router selects for a prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model(**inputs, output_router_logits=True)

    # outputs.router_logits is a tuple with one (tokens, num_experts) tensor per MoE layer.
    # Higher logits = more likely that expert was selected for the token.
    first_layer = outputs.router_logits[0]
    top_experts = first_layer.topk(2, dim=-1).indices

    print(f"Prompt: {prompt}")
    print(f"Router logits shape (layer 0): {tuple(first_layer.shape)}")
    print(f"Top-2 experts per token (layer 0): {top_experts.tolist()}")
    print("Expert usage varies by token position and layer")
analyze_expert_usage("Machine learning is a fascinating field")

Performance Benchmarks

import time

def benchmark_mixtral():
    """Benchmark Mixtral speed."""
    prompt = "Write a detailed explanation of neural networks"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    start = time.time()
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True
    )
    elapsed = time.time() - start

    tokens = outputs[0].shape[0] - inputs["input_ids"].shape[1]  # count only newly generated tokens
    speed = tokens / elapsed

    print(f"Generated: {tokens} tokens")
    print(f"Time: {elapsed:.2f}s")
    print(f"Speed: {speed:.1f} tokens/sec")

benchmark_mixtral()

# Note: throughput depends heavily on the backend (llama.cpp/Ollama vs. transformers + bitsandbytes),
# the quantization level, and whether the whole model fits in VRAM or layers spill to CPU.
# Measure on your own hardware rather than relying on headline numbers.

Comparison: Mixtral vs Dense Models

Mixtral 8x7B:
- Quality: ~70B equivalent
- Speed: ~13B-class inference cost
- Memory: ~24-47GB (quantized)
- Efficiency: High

Llama 2 70B:
- Quality: 70B (full dense)
- Speed: 70B speed
- Memory: ~35-70GB (quantized)
- Efficiency: Lower

Llama 2 7B:
- Quality: 7B
- Speed: Very fast
- Memory: 4-8GB (quantized)
- Efficiency: Highest

Conclusion

Mixtral 8x7B offers unique value: large-model quality with efficient inference. Perfect for scenarios needing both capability and speed.

FAQ

Q: When should I use Mixtral over Llama? A: Use Mixtral for quality-sensitive tasks with performance constraints. Llama is simpler and lighter.

Q: Can I run Mixtral on an RTX 3090? A: Only just. A 4-bit build needs roughly 24GB, so on a 24GB RTX 3090 you may need to offload some layers to CPU and manage memory carefully.

Q: Is Mixtral better than GPT-3.5? A: On many benchmarks, Mixtral outperforms GPT-3.5. For very complex tasks, GPT-4 is still superior.

Written by

Sanjeev Sharma

Full Stack Engineer · E-mopro