LLaMA 3 — Complete Setup and Usage Guide
Introduction
LLaMA 3, Meta's latest openly available LLM, delivers excellent quality with modest resource requirements. This guide covers setup, usage, and optimization techniques.
- Understanding LLaMA 3
- Getting Access
- Installation with Ollama
- Installation with llama.cpp
- Using Transformers Library
- Quantization with bitsandbytes
- Chat Application
- LLaMA with LangChain
- Fine-tuning LLaMA 3
- System Requirements
- Performance Benchmarks
- Optimization Tips
- Conclusion
- FAQ
Understanding LLaMA 3
LLaMA 3 comes in two sizes:
- 8B: 8 billion parameters, efficient
- 70B: 70 billion parameters, highest quality
Performance:
- The 8B model outperforms most open models of similar size
- The 70B model approaches top proprietary models on many benchmarks
Getting Access
# Weights are available from Meta or Hugging Face
# Meta's official downloads: https://llama.meta.com
# Via Hugging Face (requires accepting the license on the model page)
pip install huggingface-hub
huggingface-cli login
# Paste your Hugging Face access token when prompted
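Once access is approved, you can pull the weights programmatically. A minimal sketch using huggingface_hub's snapshot_download (the repo id assumes the 8B instruct variant):

from huggingface_hub import snapshot_download

# Download the full model repo into the local Hugging Face cache
# (requires approved access and a valid login token)
local_path = snapshot_download("meta-llama/Meta-Llama-3-8B-Instruct")
print(local_path)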
Installation with Ollama
# Easiest method
ollama pull llama3:8b
# Run interactively
ollama run llama3:8b
# The 70B variant (needs far more memory)
ollama pull llama3:70b
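Ollama also serves a local HTTP API on port 11434, which is convenient for scripting. A minimal Python sketch, assuming the requests package and a running Ollama daemon with llama3:8b pulled:

import requests

# Single non-streaming generation request against the local Ollama daemon
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3:8b", "prompt": "Machine learning is", "stream": False},
)
print(resp.json()["response"])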
Installation with llama.cpp
# Clone repository
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Build
make
# Download a quantized GGUF model (example: a community Q4_K_M build;
# any Llama 3 GGUF from Hugging Face works)
wget https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf
# Run
./main -m Meta-Llama-3-8B-Instruct.Q4_K_M.gguf -n 128 -p "Machine learning is"
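The same GGUF file can also be driven from Python through the llama-cpp-python bindings (pip install llama-cpp-python); a minimal sketch assuming the model downloaded above:

from llama_cpp import Llama

# Load the quantized GGUF model; n_ctx sets the context window
llm = Llama(model_path="Meta-Llama-3-8B-Instruct.Q4_K_M.gguf", n_ctx=4096)

# Generate a short completion
out = llm("Machine learning is", max_tokens=128)
print(out["choices"][0]["text"])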
Using Transformers Library
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "meta-llama/Llama-2-7b-chat-hf"
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load model (requires GPU for 7B)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.float16,
device_map="auto"
)
# Generate text
prompt = "Explain machine learning in simple terms"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(
**inputs,
max_new_tokens=256,
temperature=0.7,
top_p=0.9
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
Quantization with bitsandbytes
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 8-bit quantization
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto"
)

# 4-bit quantization (even smaller)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4"
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto"
)
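To see how much memory a quantized load actually uses, query the model's footprint with get_memory_footprint(), a built-in Transformers helper:

# Report the model's in-memory size after loading
print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")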
Chat Application
from transformers import pipeline
import torch

# Create chat pipeline
chat_pipeline = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Chat
conversation = []

def chat(user_input: str) -> str:
    conversation.append({
        "role": "user",
        "content": user_input
    })
    # Format the history with the model's own chat template
    # (Llama 3 uses a different template than Llama 2's [INST] tags)
    prompt = chat_pipeline.tokenizer.apply_chat_template(
        conversation, tokenize=False, add_generation_prompt=True
    )
    response = chat_pipeline(
        prompt,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        return_full_text=False  # return only the newly generated text
    )[0]["generated_text"]
    assistant_response = response.strip()
    conversation.append({
        "role": "assistant",
        "content": assistant_response
    })
    return assistant_response
# Usage
print(chat("What is machine learning?"))
print(chat("Give me an example"))
print(chat("How does it work?"))
LLaMA with LangChain
from langchain_community.llms import Ollama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Create Ollama LLM (assumes `ollama pull llama3` has been run)
llm = Ollama(model="llama3")
# Create prompt
prompt = ChatPromptTemplate.from_template(
    "Explain {topic} for a {audience}"
)
# Create chain
chain = prompt | llm | StrOutputParser()
# Execute
result = chain.invoke({
    "topic": "neural networks",
    "audience": "10-year-old child"
})
print(result)
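For long generations you can stream tokens as they arrive instead of waiting for the full string; .stream() is part of LangChain's standard runnable interface:

# Stream the response chunk by chunk
for chunk in chain.stream({"topic": "neural networks", "audience": "10-year-old child"}):
    print(chunk, end="", flush=True)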
Fine-tuning LLaMA 3
from peft import get_peft_model, prepare_model_for_kbit_training, LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments, Trainer

# Load model in 8-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Configure LoRA
peft_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA
model = get_peft_model(model, peft_config)

# Training arguments
training_args = TrainingArguments(
    output_dir="./llama-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-4
)
# Create trainer and train
# (dataset preparation code)
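To make that last step concrete, here is a minimal sketch of the trainer wiring; train_dataset is a hypothetical tokenized dataset you would prepare from your own data (for example with the datasets library):

# Assumes train_dataset yields input_ids/attention_mask/labels
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # hypothetical: your own prepared dataset
)
trainer.train()
# Save only the small LoRA adapter weights, not the full model
model.save_pretrained("./llama-finetuned-adapter")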
System Requirements
LLaMA 3 8B:
- RAM: 8GB minimum (4-bit quantized, CPU inference)
- VRAM: ~6GB with 4-bit quantization, ~10GB with 8-bit
- Disk: ~5GB for a 4-bit model (~16GB for fp16)
LLaMA 3 70B:
- RAM: 48GB+ for 4-bit CPU inference
- VRAM: ~40GB for a 4-bit build (e.g., two 24GB GPUs)
- Disk: ~40GB for a 4-bit model
Optimal:
- GPU: RTX 4090 or A100
- CPU: Recent Intel/AMD processor
- Storage: NVMe SSD for faster loading
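The RAM/VRAM figures above follow a simple rule of thumb: weight memory ≈ parameter count × bytes per weight, plus roughly 15% overhead for activations and the KV cache. A quick sanity check:

def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Rough estimate: parameters x bytes per weight, plus ~15% overhead."""
    return params_billions * (bits_per_weight / 8) * 1.15

print(f"8B @ 4-bit:  {weight_memory_gb(8, 4):.1f} GB")   # ~4.6 GB
print(f"8B @ fp16:   {weight_memory_gb(8, 16):.1f} GB")  # ~18.4 GB
print(f"70B @ 4-bit: {weight_memory_gb(70, 4):.1f} GB")  # ~40.3 GB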
Performance Benchmarks
import time
import torch
def benchmark_model(model, tokenizer, prompt: str):
    """Benchmark generation speed in new tokens per second."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start = time.time()
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=100,
            do_sample=False  # greedy decoding for a reproducible benchmark
        )
    elapsed = time.time() - start
    # Count only newly generated tokens, not the prompt
    tokens = outputs[0].shape[0] - inputs["input_ids"].shape[1]
    speed = tokens / elapsed
    print(f"Generated {tokens} tokens in {elapsed:.2f}s ({speed:.1f} tokens/sec)")
benchmark_model(model, tokenizer, "Machine learning is")
Optimization Tips
- Use quantization for memory efficiency
- Enable Flash Attention 2 for speed (requires a compatible GPU and the flash-attn package)
- Prefer bfloat16 on Ampere or newer GPUs
- Batch requests when possible (see the batched-generation sketch after the Flash Attention example below)
- Keep the KV cache enabled (use_cache=True, the generate() default) for fast autoregressive decoding
# Flash Attention 2 (if supported by your GPU and flash-attn is installed)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
    device_map="auto"
)
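Batching amortizes per-request overhead across prompts. A minimal sketch reusing the model and tokenizer loaded earlier (Llama tokenizers ship without a pad token, so the EOS token is reused):

# Llama has no pad token by default; reuse EOS and pad on the left
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # left padding suits decoder-only generation

prompts = ["Machine learning is", "Neural networks are"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, pad_token_id=tokenizer.eos_token_id)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))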
Conclusion
LLaMA 3 provides powerful, openly available LLM capabilities. With quantization and the optimizations above, these models run efficiently even on consumer hardware.
FAQ
Q: What's the difference between LLaMA 2 and 3? A: LLaMA 3 is more capable overall: it follows instructions better and reasons more reliably, and its 8B model is roughly on par with the previous generation's 13B.
Q: Can I run 70B locally? A: Yes. With 4-bit quantization the 70B model fits in roughly 40GB of memory; practical setups include two 24GB GPUs, a single 48GB GPU, or CPU inference with 48GB+ of RAM (slow but workable).
Q: Should I use LLaMA or Mistral? A: Mistral 7B is faster and lighter; LLaMA 3 8B is generally more capable. Choose based on your speed/quality trade-off.