Claude for Long Document Analysis
Introduction
Claude's 200K token context window (expanded from earlier 100K) is a game-changer for document analysis. You can analyze entire books, research papers, or large codebases in a single request. This guide covers techniques for maximizing Claude's document analysis capabilities and building effective workflows around long-context understanding.
- Understanding Token Context
- Single-Pass Document Analysis
- Codebase Analysis
- Academic Paper Analysis
- Multi-File Comparison
- Conversation with Long Context
- Extracting Structured Information
- Summarization at Different Levels
- Cost Optimization for Long Documents
- Limitations and Challenges
- Practical Workflow
- Conclusion
- FAQ
Understanding Token Context
200K tokens roughly equals 150,000 words—an entire novel, dissertation, or substantial codebase. This enables fundamentally different workflows than models with smaller contexts.
Typical sizes:
- Single file: 1K-20K tokens
- Research paper: 10K-30K tokens
- Book chapter: 15K-40K tokens
- Entire codebase (small): 50K-100K tokens
- Large codebase: 100K-150K tokens
- Technical documentation: 20K-100K tokens
def estimate_tokens(text):
    """Rough estimate: ~1 token per 4 characters"""
    return len(text) // 4

# Examples
print(estimate_tokens("Hello world"))  # ~2 tokens (actual tokenization varies)

with open("document.txt") as f:
    content = f.read()
estimated = estimate_tokens(content)
print(f"Document estimated: {estimated} tokens (max 200K)")
Single-Pass Document Analysis
The power of long context is analyzing entire documents without chunking:
from anthropic import Anthropic

def analyze_full_document(file_path, analysis_prompt):
    """Analyze an entire document in one pass"""
    client = Anthropic()

    # Read entire document
    with open(file_path, 'r') as f:
        document = f.read()

    # Token check
    tokens = len(document) // 4  # rough estimate
    if tokens > 200000:
        print(f"Warning: Document may exceed token limit ({tokens} estimated)")

    # Single request to Claude
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2000,
        messages=[
            {
                "role": "user",
                "content": f"{document}\n\n---\n\n{analysis_prompt}"
            }
        ]
    )
    return response.content[0].text

# Usage
analysis = analyze_full_document(
    "thesis.pdf.txt",
    """Analyze this thesis for:
1. Main argument and thesis statement
2. Key evidence supporting it
3. Potential weaknesses
4. How it relates to existing literature
5. Practical implications"""
)
print(analysis)
Codebase Analysis
Analyze entire projects to understand architecture:
import os
import anthropic
from pathlib import Path

def prepare_codebase_for_analysis(project_dir, max_tokens=150000):
    """Prepare codebase for Claude analysis"""
    code_content = []
    total_tokens = 0

    # Include relevant files only
    include_extensions = {'.py', '.js', '.ts', '.java', '.go', '.rs'}
    exclude_dirs = {'node_modules', '.git', '__pycache__', 'venv', '.env'}

    for root, dirs, files in os.walk(project_dir):
        # Skip excluded directories
        dirs[:] = [d for d in dirs if d not in exclude_dirs]
        for file in files:
            if Path(file).suffix in include_extensions:
                file_path = os.path.join(root, file)
                try:
                    with open(file_path, 'r') as f:
                        content = f.read()
                    file_tokens = len(content) // 4
                    if total_tokens + file_tokens > max_tokens:
                        continue  # skip files that would push past the budget
                    code_content.append(f"\n{'='*50}\n# {file_path}\n{'='*50}\n{content}")
                    total_tokens += file_tokens
                except (UnicodeDecodeError, IOError):
                    continue

    return "\n".join(code_content), total_tokens

def analyze_codebase(project_dir, analysis_focus):
    """Comprehensive codebase analysis"""
    client = anthropic.Anthropic()
    code, tokens = prepare_codebase_for_analysis(project_dir)
    print(f"Analyzing codebase ({tokens} tokens)...")

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=3000,
        messages=[
            {
                "role": "user",
                "content": f"""Analyze this codebase:

{code}

Focus on: {analysis_focus}

Provide:
1. Architecture overview
2. Key components and their relationships
3. Design patterns used
4. Potential improvements
5. Technical debt or issues"""
            }
        ]
    )
    return response.content[0].text

# Usage
analysis = analyze_codebase(
    "./my_project",
    "scalability, security, test coverage"
)
print(analysis)
Academic Paper Analysis
Extract insights from research papers:
import anthropic

def analyze_research_paper(paper_text, questions):
    """Extract insights from an academic paper"""
    client = anthropic.Anthropic()

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2000,
        messages=[
            {
                "role": "user",
                "content": f"""Paper:

{paper_text}

---

Please answer these questions:
{questions}

Format your response with:
- Direct quotes from the paper
- Page references
- Your synthesis"""
            }
        ]
    )
    return response.content[0].text

# Usage
questions = """
1. What is the main research question?
2. What methodology was used?
3. What are the key findings?
4. How does this relate to prior work?
5. What are limitations and future work?
"""
insights = analyze_research_paper(paper_text, questions)
Multi-File Comparison
Compare multiple documents:
import anthropic

def compare_documents(docs_dict, comparison_focus):
    """Compare multiple documents (e.g., contract versions)"""
    client = anthropic.Anthropic()

    # Prepare document content
    content = "Documents to Compare:\n"
    for name, text in docs_dict.items():
        content += f"\n{'='*50}\n{name}\n{'='*50}\n{text}\n"

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2000,
        messages=[
            {
                "role": "user",
                "content": f"""{content}

Compare these documents focusing on:
{comparison_focus}

Provide:
1. Key differences
2. Similarities
3. Significance of changes
4. Recommendations"""
            }
        ]
    )
    return response.content[0].text

# Usage
docs = {
    "Version 1": old_contract_text,
    "Version 2": new_contract_text
}
comparison = compare_documents(
    docs,
    "Legal implications, financial terms, liability changes"
)
Conversation with Long Context
Maintain multi-turn conversations while analyzing documents:
import anthropic

class DocumentConversation:
    def __init__(self, document_text, system_prompt=None):
        self.client = anthropic.Anthropic()
        self.document = document_text
        self.messages = []
        self.system = system_prompt or "You are analyzing the provided document. Answer questions thoroughly."

    def ask(self, question):
        """Ask a follow-up question about the document"""
        # Embed the document only in the first turn; later turns rely on history
        if not self.messages:
            content = f"Document:\n\n{self.document}\n\n---\n\nQuestion: {question}"
        else:
            content = question
        self.messages.append({"role": "user", "content": content})

        response = self.client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1000,
            system=self.system,
            messages=self.messages
        )
        answer = response.content[0].text
        self.messages.append({"role": "assistant", "content": answer})
        return answer

# Usage
conv = DocumentConversation(
    long_document_text,
    "You are an expert analyst of this technical document"
)
print(conv.ask("What are the main security concerns?"))
print(conv.ask("How does authentication work?"))
print(conv.ask("How does the authentication design relate to the security concerns?"))
Extracting Structured Information
Extract structured data from documents:
import anthropic

def extract_from_document(document, extraction_schema):
    """Extract structured data from a document"""
    client = anthropic.Anthropic()

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1000,
        messages=[
            {
                "role": "user",
                "content": f"""Document:

{document}

---

Extract the following information and format as JSON:
{extraction_schema}"""
            }
        ]
    )
    return response.content[0].text

# Usage
schema = """
{
    "title": "...",
    "authors": [...],
    "keyFindings": [...],
    "methodology": "...",
    "limitations": [...],
    "futureWork": "..."
}
"""
extracted = extract_from_document(paper_text, schema)
# Parse and use the extracted JSON (see below)
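Models sometimes wrap the JSON in prose or a markdown code fence, so parse defensively rather than calling json.loads on the raw response. A small sketch (the helper name is ours):

import json
import re

def parse_extracted_json(raw_text):
    """Pull the first JSON object out of a model response (hypothetical helper)."""
    # Grab everything between the outermost braces, ignoring surrounding prose/fences
    match = re.search(r"\{.*\}", raw_text, re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in response")
    return json.loads(match.group(0))

data = parse_extracted_json(extracted)
print(data["title"], data["keyFindings"])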
Summarization at Different Levels
Create summaries of different lengths:
import anthropic

def summarize_document(document, length="medium"):
    """Generate summaries of different lengths"""
    client = anthropic.Anthropic()

    length_specs = {
        "short": "1-2 paragraphs (executive summary)",
        "medium": "3-4 paragraphs (comprehensive overview)",
        "long": "detailed summary preserving major points (1-2 pages)"
    }

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2000 if length == "long" else 1000,
        messages=[
            {
                "role": "user",
                "content": f"""Document:

{document}

---

Create a {length_specs.get(length, length_specs['medium'])} summary.

The summary should:
- Capture main ideas
- Be understandable to someone unfamiliar with the document
- Preserve key details
- Use clear, concise language"""
            }
        ]
    )
    return response.content[0].text

# Usage
summary = summarize_document(long_paper, "medium")
Cost Optimization for Long Documents
Long context requests cost more due to more input tokens:
def estimate_cost_for_document(document_length):
    """Estimate cost for analyzing a document of the given character length"""
    # Claude 3.5 Sonnet pricing at the time of writing: $0.003 per 1K input tokens
    tokens = document_length // 4
    input_cost = (tokens / 1000) * 0.003
    output_cost = (1000 / 1000) * 0.015  # assume ~1K output tokens at $0.015 per 1K
    return input_cost + output_cost

# Cost optimization strategies
def optimize_for_cost(document):
    """Suggest a strategy to reduce costs while maintaining quality"""
    tokens = len(document) // 4
    if tokens > 100000:
        print("Consider breaking analysis into chunks")
        return "chunk"
    elif tokens > 50000:
        print("Use focused questions to limit output tokens")
        return "focused_questions"
    else:
        print("Full analysis feasible")
        return "full"
Limitations and Challenges
Even with 200K tokens, challenges remain:
- Accuracy can drop on very long documents (details buried in the middle of the context are the most often missed)
- Cost increases linearly with document length
- Response time increases with very long contexts
- Models sometimes hallucinate when analyzing massive documents
Mitigation strategies:
- Break documents into logical sections
- Ask specific questions rather than general analysis
- Request citations and specific quotes
- Verify important findings with targeted re-analysis (a sketch of this follows)
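One way to combine the last two strategies: a verification pass that re-checks a single claim and demands verbatim quotes, so unsupported findings surface explicitly. A minimal sketch (the function name and prompt wording are ours):

import anthropic

def verify_finding(document, claim):
    """Re-check one claim against the source, demanding verbatim evidence (sketch)."""
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Document:

{document}

---

Claim: {claim}

Does the document support this claim? Answer SUPPORTED, CONTRADICTED,
or NOT FOUND, and include verbatim quotes from the document as evidence."""
        }]
    )
    return response.content[0].text

print(verify_finding(long_document_text, "The system uses OAuth 2.0 for authentication."))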
Practical Workflow
An effective long-document workflow (a sketch stringing these steps together follows the list):
1. Prepare document (convert PDFs, clean text)
2. Estimate tokens to ensure fit
3. Define specific analysis questions
4. Make initial analysis request
5. Ask follow-up questions based on findings
6. Extract and structure key insights
7. Verify critical findings with targeted re-analysis
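A minimal sketch wiring steps 2 through 7 together, reusing the helpers defined earlier (estimate_tokens, analyze_full_document, DocumentConversation); step 1, PDF-to-text conversion, is left to whatever extractor you already use:

def long_document_workflow(text_path, questions):
    """Run steps 2-7 on an already-converted text file (assumes the helpers above)."""
    with open(text_path) as f:
        document = f.read()

    # Step 2: make sure the document plausibly fits the context window
    if estimate_tokens(document) > 200_000:
        raise ValueError("Document likely exceeds the context window; chunk it first")

    # Steps 3-4: initial analysis against specific questions
    initial = analyze_full_document(text_path, questions)

    # Steps 5-7: follow-ups and targeted verification in one conversation
    conv = DocumentConversation(document)
    follow_up = conv.ask("List the document's key claims and quote, verbatim, the passages that support each.")
    return initial, follow_up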
Conclusion
Claude's 200K context window enables entirely new applications—analyzing entire codebases, dissertations, or technical documentation in single requests. This capability is transformative for research, code review, and document intelligence applications. Use it strategically to reduce complexity and improve analysis quality.
FAQ
Q: How accurate is Claude on long documents? A: Generally accurate on main points, but it can miss details in very long documents (100K+ tokens). Verify critical findings.
Q: Is it cheaper to process long documents with multiple requests? A: Usually no. One long request typically costs less than several chunked requests, which repeat instructions and overlapping context.
Q: Can I analyze documents longer than 200K tokens? A: No, that's the maximum. Break into chunks or use specialized tools for extremely large documents.