LLaMA 3 — Complete Setup and Usage Guide
Introduction
LLaMA 3, Meta's latest openly available LLM, delivers excellent quality with modest resource requirements. This guide covers setup, usage, and optimization techniques.
- Understanding LLaMA 3
- Getting Access
- Installation with Ollama
- Installation with llama.cpp
- Using Transformers Library
- Quantization with bitsandbytes
- Chat Application
- LLaMA with LangChain
- Fine-tuning LLaMA 3
- System Requirements
- Performance Benchmarks
- Optimization Tips
- Conclusion
- FAQ
Understanding LLaMA 3
LLaMA 3 comes in two sizes:
- 8B: 8 billion parameters, efficient
- 70B: 70 billion parameters, highest quality
Performance:
- The 8B model outperforms most open models of similar size
- The 70B model approaches top proprietary models on many benchmarks
Getting Access
# Weights are available from Meta or Hugging Face
# Meta's official downloads: https://llama.meta.com
# Via Hugging Face (requires accepting the license on the model page)
pip install huggingface-hub
huggingface-cli login
# Paste your Hugging Face access token when prompted
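Once access is approved, you can pull the weights programmatically. A minimal sketch using huggingface_hub's snapshot_download (the repo id assumes the 8B instruct variant):

from huggingface_hub import snapshot_download

# Download the full model repo into the local Hugging Face cache
# (requires approved access and a valid login token)
local_path = snapshot_download("meta-llama/Meta-Llama-3-8B-Instruct")
print(local_path)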
Installation with Ollama
# Easiest method
ollama pull llama3:8b
# Run interactively
ollama run llama3:8b
# The 70B variant (needs far more memory)
ollama pull llama3:70b
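Ollama also serves a local HTTP API on port 11434, which is convenient for scripting. A minimal Python sketch, assuming the requests package and a running Ollama daemon with llama3:8b pulled:

import requests

# Single non-streaming generation request against the local Ollama daemon
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3:8b", "prompt": "Machine learning is", "stream": False},
)
print(resp.json()["response"])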
Installation with llama.cpp
# Clone repository
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Build
make
# Download a quantized GGUF model (example: a community Q4_K_M build;
# any Llama 3 GGUF from Hugging Face works)
wget https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf
# Run
./main -m Meta-Llama-3-8B-Instruct.Q4_K_M.gguf -n 128 -p "Machine learning is"
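The same GGUF file can also be driven from Python through the llama-cpp-python bindings (pip install llama-cpp-python); a minimal sketch assuming the model downloaded above:

from llama_cpp import Llama

# Load the quantized GGUF model; n_ctx sets the context window
llm = Llama(model_path="Meta-Llama-3-8B-Instruct.Q4_K_M.gguf", n_ctx=4096)

# Generate a short completion
out = llm("Machine learning is", max_tokens=128)
print(out["choices"][0]["text"])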
Using Transformers Library
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "meta-llama/Llama-2-7b-chat-hf"
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load model (requires GPU for 7B)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.float16,
device_map="auto"
)
# Generate text
prompt = "Explain machine learning in simple terms"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(
**inputs,
max_new_tokens=256,
temperature=0.7,
top_p=0.9
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
Quantization with bitsandbytes
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 8-bit quantization
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto"
)

# 4-bit quantization (even smaller)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4"
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto"
)
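To see how much memory a quantized load actually uses, query the model's footprint with get_memory_footprint(), a built-in Transformers helper:

# Report the model's in-memory size after loading
print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")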
Chat Application
from transformers import pipeline
import torch

# Create chat pipeline
chat_pipeline = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Chat
conversation = []

def chat(user_input: str) -> str:
    conversation.append({
        "role": "user",
        "content": user_input
    })
    # Format the history with the model's own chat template
    # (Llama 3 uses a different template than Llama 2's [INST] tags)
    prompt = chat_pipeline.tokenizer.apply_chat_template(
        conversation, tokenize=False, add_generation_prompt=True
    )
    response = chat_pipeline(
        prompt,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        return_full_text=False  # return only the newly generated text
    )[0]["generated_text"]
    assistant_response = response.strip()
    conversation.append({
        "role": "assistant",
        "content": assistant_response
    })
    return assistant_response
# Usage
print(chat("What is machine learning?"))
print(chat("Give me an example"))
print(chat("How does it work?"))
LLaMA with LangChain
from langchain_community.llms import Ollama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Create Ollama LLM (assumes `ollama pull llama3` has been run)
llm = Ollama(model="llama3")
# Create prompt
prompt = ChatPromptTemplate.from_template(
    "Explain {topic} for a {audience}"
)
# Create chain
chain = prompt | llm | StrOutputParser()
# Execute
result = chain.invoke({
    "topic": "neural networks",
    "audience": "10-year-old child"
})
print(result)
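For long generations you can stream tokens as they arrive instead of waiting for the full string; .stream() is part of LangChain's standard runnable interface:

# Stream the response chunk by chunk
for chunk in chain.stream({"topic": "neural networks", "audience": "10-year-old child"}):
    print(chunk, end="", flush=True)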
Fine-tuning LLaMA 3
from peft import get_peft_model, prepare_model_for_kbit_training, LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments, Trainer

# Load model in 8-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Configure LoRA
peft_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA
model = get_peft_model(model, peft_config)

# Training arguments
training_args = TrainingArguments(
    output_dir="./llama-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-4
)
# Create trainer and train
# (dataset preparation code)
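To make that last step concrete, here is a minimal sketch of the trainer wiring; train_dataset is a hypothetical tokenized dataset you would prepare from your own data (for example with the datasets library):

# Assumes train_dataset yields input_ids/attention_mask/labels
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # hypothetical: your own prepared dataset
)
trainer.train()
# Save only the small LoRA adapter weights, not the full model
model.save_pretrained("./llama-finetuned-adapter")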
System Requirements
LLaMA 3 8B:
- RAM: 8GB minimum (4-bit quantized, CPU inference)
- VRAM: ~6GB with 4-bit quantization, ~10GB with 8-bit
- Disk: ~5GB for a 4-bit model (~16GB for fp16)
LLaMA 3 70B:
- RAM: 48GB+ for 4-bit CPU inference
- VRAM: ~40GB for a 4-bit build (e.g., two 24GB GPUs)
- Disk: ~40GB for a 4-bit model
Optimal:
- GPU: RTX 4090 or A100
- CPU: Recent Intel/AMD processor
- Storage: NVMe SSD for faster loading
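The RAM/VRAM figures above follow a simple rule of thumb: weight memory ≈ parameter count × bytes per weight, plus roughly 15% overhead for activations and the KV cache. A quick sanity check:

def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Rough estimate: parameters x bytes per weight, plus ~15% overhead."""
    return params_billions * (bits_per_weight / 8) * 1.15

print(f"8B @ 4-bit:  {weight_memory_gb(8, 4):.1f} GB")   # ~4.6 GB
print(f"8B @ fp16:   {weight_memory_gb(8, 16):.1f} GB")  # ~18.4 GB
print(f"70B @ 4-bit: {weight_memory_gb(70, 4):.1f} GB")  # ~40.3 GB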
Performance Benchmarks
import time
import torch
def benchmark_model(model, tokenizer, prompt: str):
    """Benchmark generation speed in new tokens per second."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start = time.time()
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=100,
            do_sample=False  # greedy decoding for a reproducible benchmark
        )
    elapsed = time.time() - start
    # Count only newly generated tokens, not the prompt
    tokens = outputs[0].shape[0] - inputs["input_ids"].shape[1]
    speed = tokens / elapsed
    print(f"Generated {tokens} tokens in {elapsed:.2f}s ({speed:.1f} tokens/sec)")
benchmark_model(model, tokenizer, "Machine learning is")
Optimization Tips
- Use quantization for memory efficiency
- Enable Flash Attention 2 for speed (requires a compatible GPU and the flash-attn package)
- Prefer bfloat16 on Ampere or newer GPUs
- Batch requests when possible (see the batched-generation sketch after the Flash Attention example below)
- Keep the KV cache enabled (use_cache=True, the generate() default) for fast autoregressive decoding
# Flash Attention 2 (if supported by your GPU and flash-attn is installed)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
    device_map="auto"
)
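Batching amortizes per-request overhead across prompts. A minimal sketch reusing the model and tokenizer loaded earlier (Llama tokenizers ship without a pad token, so the EOS token is reused):

# Llama has no pad token by default; reuse EOS and pad on the left
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # left padding suits decoder-only generation

prompts = ["Machine learning is", "Neural networks are"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, pad_token_id=tokenizer.eos_token_id)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))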
Conclusion
LLaMA 3 provides powerful, openly available LLM capabilities. With quantization and the optimizations above, these models run efficiently even on consumer hardware.
FAQ
Q: What's the difference between LLaMA 2 and 3? A: LLaMA 3 is more capable overall: it follows instructions better and reasons more reliably, and its 8B model is roughly on par with the previous generation's 13B.
Q: Can I run 70B locally? A: Yes. With 4-bit quantization the 70B model fits in roughly 40GB of memory; practical setups include two 24GB GPUs, a single 48GB GPU, or CPU inference with 48GB+ of RAM (slow but workable).
Q: Should I use LLaMA or Mistral? A: Mistral 7B is faster and lighter; LLaMA 3 8B is generally more capable. Choose based on your speed/quality trade-off.