LLaMA 3 — Complete Setup and Usage Guide

Sanjeev Sharma · 4 min read

Introduction

LLaMA 3, Meta's latest openly available LLM family, delivers strong output quality with modest hardware requirements. This guide covers setup, usage, and optimization techniques.

Understanding LLaMA 3

LLaMA 3 comes in two sizes:

  • 8B: 8 billion parameters, efficient
  • 70B: 70 billion parameters, highest quality

Performance:

  • 8B outperforms earlier open models of similar size (e.g., Mistral 7B, Gemma 7B) and even LLaMA 2 13B
  • 70B is competitive with leading proprietary models such as Claude 3 Sonnet and Gemini Pro on many benchmarks

Getting Access

# Download from Meta or Hugging Face
# Meta's official weights are available at https://llama.meta.com (license acceptance required)

# Via Hugging Face (requires access approval)
pip install huggingface-hub

huggingface-cli login
# Follow prompts to authenticate
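
Once access to the gated repo is approved, the weights can also be fetched programmatically. Below is a minimal sketch using the huggingface_hub Python API; the repo id shown assumes you want the 8B instruct variant:

from huggingface_hub import snapshot_download

# Downloads the full model repository into the local HF cache and returns its path
local_dir = snapshot_download(repo_id="meta-llama/Meta-Llama-3-8B-Instruct")
print(local_dir)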

Installation with Ollama

# Easiest method
ollama pull llama3:8b

# Run interactively
ollama run llama3:8b

# Pull the larger 70B variant (needs far more memory)
ollama pull llama3:70b
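
Ollama also exposes a local REST API (port 11434 by default), so a pulled model can be called from code. A minimal sketch using requests; the prompt is just an example:

import requests

# Ask the locally running Ollama server for a single, non-streamed completion
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3:8b", "prompt": "Explain machine learning in one sentence", "stream": False},
)
print(resp.json()["response"])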

Installation with llama.cpp

# Clone repository
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Build
make

# Download a quantized GGUF build (one example community conversion; check the repo's file list for the exact filename)
wget https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf

# Run
./main -m Meta-Llama-3-8B-Instruct.Q4_K_M.gguf -n 128 -p "Machine learning is"
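
The same GGUF file can be used from Python through the llama-cpp-python bindings (pip install llama-cpp-python). A minimal sketch; the model path assumes the file downloaded above, and n_gpu_layers=-1 offloads all layers to the GPU when one is available:

from llama_cpp import Llama

# Load the quantized GGUF model
llm = Llama(model_path="Meta-Llama-3-8B-Instruct.Q4_K_M.gguf", n_ctx=4096, n_gpu_layers=-1)

# Simple completion call
out = llm("Machine learning is", max_tokens=128)
print(out["choices"][0]["text"])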

Using Transformers Library

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load model (fp16 weights for 8B need a GPU with roughly 16GB of VRAM)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Generate text
prompt = "Explain machine learning in simple terms"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
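
For interactive use it is often nicer to stream tokens as they are produced instead of waiting for the full completion. A short sketch using transformers' built-in TextStreamer with the model, tokenizer, and inputs from above:

from transformers import TextStreamer

# Prints tokens to stdout as they are generated, skipping the echoed prompt
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    streamer=streamer
)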

Quantization with bitsandbytes

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 8-bit quantization
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

# 4-bit quantization (even smaller)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)
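
To see what quantization actually buys you, transformers models expose get_memory_footprint(), which reports the weight memory in bytes. A quick check to run after loading either of the models above:

# Compare memory usage of the loaded (quantized) model
print(f"Model weights occupy {model.get_memory_footprint() / 1e9:.2f} GB")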

Chat Application

from transformers import pipeline
import torch

# Create chat pipeline
chat_pipeline = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Chat
conversation = []

def chat(user_input: str) -> str:
    conversation.append({
        "role": "user",
        "content": user_input
    })

    # Format the conversation with LLaMA 3's chat template
    prompt = chat_pipeline.tokenizer.apply_chat_template(
        conversation,
        tokenize=False,
        add_generation_prompt=True
    )

    response = chat_pipeline(
        prompt,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        return_full_text=False
    )[0]["generated_text"]

    # With return_full_text=False the pipeline returns only the newly generated reply
    assistant_response = response.strip()

    conversation.append({
        "role": "assistant",
        "content": assistant_response
    })

    return assistant_response

# Usage
print(chat("What is machine learning?"))
print(chat("Give me an example"))
print(chat("How does it work?"))

LLaMA with LangChain

from langchain_community.llms import Ollama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Create Ollama LLM
llm = Ollama(model="llama2")

# Create prompt
prompt = ChatPromptTemplate.from_template(
    "Explain {topic} for a {audience}"
)

# Create chain
chain = prompt | llm | StrOutputParser()

# Execute
result = chain.invoke({
    "topic": "neural networks",
    "audience": "10-year-old child"
})

print(result)
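
LCEL chains also support token streaming, which is handy for chat-style UIs. A minimal sketch with the same chain; chunks arrive as plain strings when the underlying LLM is Ollama:

# Stream the answer token by token instead of waiting for the full string
for chunk in chain.stream({"topic": "neural networks", "audience": "10-year-old child"}):
    print(chunk, end="", flush=True)
print()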

Fine-tuning LLaMA 3

from peft import get_peft_model, prepare_model_for_kbit_training, LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments, Trainer

# Load the base model in 8-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Configure LoRA
peft_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA adapters (only these small matrices are trained)
model = get_peft_model(model, peft_config)

# Training arguments
training_args = TrainingArguments(
    output_dir="./llama-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-4
)

# Create trainer and train
# (dataset preparation code)
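
The elided step typically looks like the sketch below: tokenize a plain-text dataset, wrap it with a causal-LM data collator, and hand everything to Trainer. The file name train.txt and the 512-token cutoff are placeholder assumptions:

from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizers have no pad token by default

# Load a plain-text training file (placeholder path) and tokenize it
dataset = load_dataset("text", data_files={"train": "train.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()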

System Requirements

LLaMA 3 8B:
- RAM: 8GB minimum (4-bit quantized, CPU inference)
- VRAM: ~6GB with 4-bit quantization, ~10GB with 8-bit, ~16GB in fp16
- Disk: ~5GB for a 4-bit quantized model, ~16GB for full fp16 weights

LLaMA 3 70B:
- RAM: 32GB minimum (combined with GPU offloading for a 4-bit model)
- VRAM: 24GB+ recommended; ~40GB to fit a 4-bit model entirely on GPU
- Disk: ~40GB for a 4-bit quantized model, ~140GB for full fp16 weights

Optimal:
- GPU: RTX 4090 or A100
- CPU: Recent Intel/AMD processor
- Storage: NVMe SSD for faster loading
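
Before downloading a model, it is worth checking what your GPU actually offers. A tiny sketch using PyTorch:

import torch

# Report the available CUDA device and its total VRAM
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1e9:.1f} GB VRAM")
else:
    print("No CUDA GPU detected; expect slower CPU-only inference")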

Performance Benchmarks

import time
import torch

def benchmark_model(model, tokenizer, prompt: str):
    """Benchmark generation speed."""
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    start = time.time()
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=100,
            do_sample=True,
            temperature=0.7
        )
    elapsed = time.time() - start

    # Count only newly generated tokens (exclude the prompt)
    new_tokens = outputs[0].shape[-1] - inputs["input_ids"].shape[-1]
    speed = new_tokens / elapsed

    print(f"Generated {new_tokens} tokens in {elapsed:.2f}s ({speed:.1f} tokens/sec)")

benchmark_model(model, tokenizer, "Machine learning is")

Optimization Tips

  1. Use quantization for memory efficiency
  2. Enable Flash Attention for speed (example below)
  3. Use bfloat16 on GPUs that support it (Ampere and newer)
  4. Batch requests when possible (see the batching sketch after the Flash Attention example)
  5. Keep KV caching enabled (use_cache=True, the default in generate()) for faster decoding

# Flash Attention (if supported)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
    device_map="auto"
)
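
Tip 4 (batching) in practice: pad several prompts to a common length and run a single generate() call. A minimal sketch using the model and tokenizer from earlier; decoder-only models should be left-padded for generation:

# Batch several prompts into one forward pass
prompts = [
    "Machine learning is",
    "The capital of France is",
    "Python is popular because",
]
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizers have no pad token by default

inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)

for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)
    print("---")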

Conclusion

LLaMA 3 provides powerful open-weight LLM capabilities. With quantization and the optimizations above, the 8B model runs efficiently on consumer hardware, and the 70B model is within reach of a well-equipped workstation.

FAQ

Q: What's the difference between LLaMA 2 and 3? A: LLaMA 3 was trained on substantially more data, follows instructions better, and has improved reasoning; the 8B model roughly matches or beats LLaMA 2 13B.

Q: Can I run 70B locally? A: Yes, with 4-bit quantization. The quantized weights are roughly 40GB, so you need either a 48GB-class GPU, two 24GB GPUs, or llama.cpp/Ollama with partial CPU offload (for example 32GB of RAM plus a 24GB GPU).

Q: Should I use LLaMA or Mistral? A: Mistral is faster and lighter. LLaMA is more capable. Choose based on your speed/quality trade-off.

Written by Sanjeev Sharma · Full Stack Engineer