Claude for Long Document Analysis

Sanjeev Sharma
7 min read

Introduction

Claude's 200K token context window (expanded from earlier 100K) is a game-changer for document analysis. You can analyze entire books, research papers, or large codebases in a single request. This guide covers techniques for maximizing Claude's document analysis capabilities and building effective workflows around long-context understanding.

Understanding Token Context

200K tokens is roughly 150,000 words (a token averages about 0.75 English words): enough for an entire novel, a dissertation, or a substantial codebase. This enables fundamentally different workflows from models with smaller context windows.

Typical sizes:

  • Single file: 1K-20K tokens
  • Research paper: 10K-30K tokens
  • Book chapter: 15K-40K tokens
  • Entire codebase (small): 50K-100K tokens
  • Large codebase: 100K-150K tokens
  • Technical documentation: 20K-100K tokens

A quick heuristic (roughly one token per four characters of English text) is enough to check whether a document will fit:

def estimate_tokens(text):
    """Rough estimate: ~1 token per 4 characters of English text"""
    return len(text) // 4

# Examples
print(estimate_tokens("Hello world"))  # ~2 tokens

with open("document.txt") as f:
    content = f.read()
    estimated = estimate_tokens(content)
    print(f"Document estimated: {estimated} tokens (max 200K)")

Single-Pass Document Analysis

The power of long context is analyzing entire documents without chunking:

from anthropic import Anthropic

def analyze_full_document(file_path, analysis_prompt):
    """Analyze entire document in one pass"""
    client = Anthropic()

    # Read entire document
    with open(file_path, 'r') as f:
        document = f.read()

    # Token check
    tokens = len(document) // 4  # rough estimate
    if tokens > 200000:
        print(f"Warning: Document may exceed token limit ({tokens} estimated)")

    # Single request to Claude
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2000,
        messages=[
            {
                "role": "user",
                "content": f"{document}\n\n---\n\n{analysis_prompt}"
            }
        ]
    )

    return response.content[0].text

# Usage
analysis = analyze_full_document(
    "thesis.pdf.txt",
    """Analyze this thesis for:
    1. Main argument and thesis statement
    2. Key evidence supporting it
    3. Potential weaknesses
    4. How it relates to existing literature
    5. Practical implications"""
)
print(analysis)
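
The usage above assumes the PDF has already been converted to plain text (hence thesis.pdf.txt). One way to do that conversion, sketched here with the pypdf library (an assumption on my part; any text extractor works):

# Hypothetical helper: extract plain text from a PDF with pypdf (pip install pypdf)
from pypdf import PdfReader

def pdf_to_text(pdf_path, txt_path):
    reader = PdfReader(pdf_path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    with open(txt_path, "w") as f:
        f.write(text)

pdf_to_text("thesis.pdf", "thesis.pdf.txt")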

Codebase Analysis

Analyze entire projects to understand architecture:

import os
from pathlib import Path

def prepare_codebase_for_analysis(project_dir, max_tokens=150000):
    """Prepare codebase for Claude analysis"""
    code_content = []
    total_tokens = 0

    # Include relevant files only
    include_extensions = {'.py', '.js', '.ts', '.java', '.go', '.rs'}
    exclude_dirs = {'node_modules', '.git', '__pycache__', 'venv', '.env'}

    for root, dirs, files in os.walk(project_dir):
        # Skip excluded directories
        dirs[:] = [d for d in dirs if d not in exclude_dirs]

        for file in files:
            if Path(file).suffix in include_extensions:
                file_path = os.path.join(root, file)
                try:
                    with open(file_path, 'r') as f:
                        content = f.read()

                    file_tokens = len(content) // 4
                    # Skip files that would blow the budget; smaller files later may still fit
                    if total_tokens + file_tokens > max_tokens:
                        continue

                    code_content.append(f"\n{'='*50}\n# {file_path}\n{'='*50}\n{content}")
                    total_tokens += file_tokens

                except (UnicodeDecodeError, IOError):
                    continue

    return "\n".join(code_content), total_tokens

def analyze_codebase(project_dir, analysis_focus):
    """Comprehensive codebase analysis"""
    client = Anthropic()

    code, tokens = prepare_codebase_for_analysis(project_dir)

    print(f"Analyzing codebase ({tokens} tokens)...")

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=3000,
        messages=[
            {
                "role": "user",
                "content": f"""Analyze this codebase:\n\n{code}\n\n
                Focus on: {analysis_focus}

                Provide:
                1. Architecture overview
                2. Key components and their relationships
                3. Design patterns used
                4. Potential improvements
                5. Technical debt or issues"""
            }
        ]
    )

    return response.content[0].text

# Usage
analysis = analyze_codebase(
    "./my_project",
    "scalability, security, test coverage"
)
print(analysis)

Academic Paper Analysis

Extract insights from research papers:

def analyze_research_paper(paper_text, questions):
    """Extract insights from academic paper"""
    client = Anthropic()

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2000,
        messages=[
            {
                "role": "user",
                "content": f"""Paper:\n\n{paper_text}\n\n---

                Please answer these questions:
                {questions}

                Format your response with:
                - Direct quotes from the paper
                - Section or page references where available
                - Your own synthesis"""
            }
        ]
    )

    return response.content[0].text

# Usage
questions = """
1. What is the main research question?
2. What methodology was used?
3. What are the key findings?
4. How does this relate to prior work?
5. What are limitations and future work?
"""

insights = analyze_research_paper(paper_text, questions)

Multi-File Comparison

Compare multiple documents:

def compare_documents(docs_dict, comparison_focus):
    """Compare multiple documents (e.g., contract versions)"""
    client = Anthropic()

    # Prepare document content
    content = "Documents to Compare:\n"
    for name, text in docs_dict.items():
        content += f"\n{'='*50}\n{name}\n{'='*50}\n{text}\n"

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2000,
        messages=[
            {
                "role": "user",
                "content": f"""{content}

                Compare these documents focusing on:
                {comparison_focus}

                Provide:
                1. Key differences
                2. Similarities
                3. Significance of changes
                4. Recommendations"""
            }
        ]
    )

    return response.content[0].text

# Usage
docs = {
    "Version 1": old_contract_text,
    "Version 2": new_contract_text
}

comparison = compare_documents(
    docs,
    "Legal implications, financial terms, liability changes"
)

Conversation with Long Context

Maintain multi-turn conversations while analyzing documents:

class DocumentConversation:
    def __init__(self, document_text, system_prompt=None):
        self.client = Anthropic()
        self.document = document_text
        self.messages = []
        self.system = system_prompt or "You are analyzing the provided document. Answer questions thoroughly."

    def ask(self, question):
        """Ask a follow-up question about the document"""
        # Embed the document only in the first turn; later turns rely on
        # conversation history (the document is still resent as input tokens
        # on every call, so costs scale with the number of turns)
        if not self.messages:
            content = f"Document:\n\n{self.document}\n\n---\n\nQuestion: {question}"
        else:
            content = question

        self.messages.append({"role": "user", "content": content})

        response = self.client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1000,
            system=self.system,
            messages=self.messages
        )

        answer = response.content[0].text
        self.messages.append({"role": "assistant", "content": answer})

        return answer

# Usage
conv = DocumentConversation(
    long_document_text,
    "You are an expert analyst of this technical document"
)

print(conv.ask("What are the main security concerns?"))
print(conv.ask("How does authentication work?"))
print(conv.ask("What about the authentication relates to the security concerns?"))

Extracting Structured Information

Extract structured data from documents:

def extract_from_document(document, extraction_schema):
    """Extract structured data from document"""
    client = Anthropic()

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1000,
        messages=[
            {
                "role": "user",
                "content": f"""Document:\n\n{document}\n\n---

                Extract the following information and format as JSON:
                {extraction_schema}"""
            }
        ]
    )

    return response.content[0].text

# Usage
schema = """
{
    "title": "...",
    "authors": [...],
    "keyFindings": [...],
    "methodology": "...",
    "limitations": [...],
    "futureWork": "..."
}
"""

extracted = extract_from_document(paper_text, schema)
# Parse and use the extracted JSON
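
Claude sometimes wraps its JSON output in a markdown code fence, so a little defensive parsing helps; a minimal sketch:

import json

def parse_extracted_json(raw):
    """Strip an optional markdown fence, then parse the JSON payload"""
    text = raw.strip()
    if text.startswith("```"):
        # Drop the opening fence (with optional language tag) and the closing fence
        text = text.split("\n", 1)[1]
        text = text.rsplit("```", 1)[0]
    return json.loads(text)

data = parse_extracted_json(extracted)
print(data["title"], data["keyFindings"])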

Summarization at Different Levels

Create summaries of different lengths:

def summarize_document(document, length="medium"):
    """Generate summaries of different lengths"""
    client = Anthropic()

    length_specs = {
        "short": "1-2 paragraphs (executive summary)",
        "medium": "3-4 paragraphs (comprehensive overview)",
        "long": "Detailed summary preserving major points (1-2 pages)"
    }

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2000 if length == "long" else 1000,
        messages=[
            {
                "role": "user",
                "content": f"""Document:\n\n{document}\n\n---

                Create a {length_specs.get(length, 'medium')} summary.

                The summary should:
                - Capture main ideas
                - Be understandable to someone unfamiliar with the document
                - Preserve key details
                - Use clear, concise language"""
            }
        ]
    )

    return response.content[0].text

# Usage
summary = summarize_document(long_paper, "medium")

Cost Optimization for Long Documents

Long-context requests cost more simply because you pay for every input token:

def estimate_cost_for_document(document_length):
    """Estimate analysis cost (document_length is in characters)"""
    # Claude 3.5 Sonnet: ~$0.003 per 1K input tokens, ~$0.015 per 1K output tokens
    tokens = document_length // 4
    input_cost = (tokens / 1000) * 0.003
    output_cost = (1000 / 1000) * 0.015  # assume ~1K output tokens
    return input_cost + output_cost

# Cost optimization strategies
def optimize_for_cost(document):
    """Strategies to reduce costs while maintaining quality"""
    tokens = len(document) // 4

    if tokens > 100000:
        print("Consider breaking analysis into chunks")
        return "chunk"
    elif tokens > 50000:
        print("Use focused questions to limit output tokens")
        return "focused_questions"
    else:
        print("Full analysis feasible")
        return "full"

Limitations and Challenges

Even with 200K tokens, challenges remain:

  • Accuracy can drop on very long documents (details buried in the middle are sometimes missed)
  • Cost increases linearly with document length
  • Response time increases with very long contexts
  • Models sometimes hallucinate when analyzing massive documents

Mitigation strategies:

  1. Break documents into logical sections
  2. Ask specific questions rather than general analysis
  3. Request citations and specific quotes
  4. Verify important findings with targeted re-analysis (see the sketch below)
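
A minimal sketch of the fourth strategy, asking Claude to re-check a single claim against the source (the claim in the usage line is illustrative):

def verify_finding(document, finding):
    """Re-check one specific claim against the source document"""
    client = Anthropic()

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Document:\n\n{document}\n\n---

            Claim: {finding}

            Is this claim supported by the document? Answer yes or no,
            then quote the exact passages that support or contradict it."""
        }]
    )

    return response.content[0].text

# Usage: double-check a key finding from the initial analysis (illustrative claim)
print(verify_finding(document, "The proposed method outperforms the baseline"))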

Practical Workflow

An effective long-document workflow (a sketch tying the steps together follows the list):

1. Prepare document (convert PDFs, clean text)
2. Estimate tokens to ensure fit
3. Define specific analysis questions
4. Make initial analysis request
5. Ask follow-up questions based on findings
6. Extract and structure key insights
7. Verify critical findings with targeted re-analysis
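
A minimal sketch wiring these steps together, reusing the helpers defined earlier (estimate_tokens, analyze_full_document, extract_from_document); follow-up questions and verification would go through DocumentConversation and verify_finding as shown above:

def document_workflow(file_path, questions, schema):
    """Steps 1-6: prepare, size-check, analyze, extract"""
    # Steps 1-2: load the prepared text and confirm it fits the context window
    with open(file_path) as f:
        document = f.read()
    if estimate_tokens(document) > 200000:
        raise ValueError("Document likely exceeds the 200K context window")

    # Steps 3-4: run the initial analysis with specific questions
    analysis = analyze_full_document(file_path, questions)

    # Step 6: pull structured insights out for downstream use
    extracted = extract_from_document(document, schema)

    return analysis, extracted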

Conclusion

Claude's 200K context window enables entirely new applications—analyzing entire codebases, dissertations, or technical documentation in single requests. This capability is transformative for research, code review, and document intelligence applications. Use it strategically to reduce complexity and improve analysis quality.

FAQ

Q: How accurate is Claude on long documents? A: Generally very accurate on the main points, but it can miss details in very long documents (100K+ tokens). Verify critical findings.

Q: Is it cheaper to process long documents with multiple requests? A: Usually no. A single long request avoids resending instructions and overlapping context for every chunk, so it typically costs less than several smaller requests.

Q: Can I analyze documents longer than 200K tokens? A: No, that's the maximum. Break the document into chunks and synthesize the per-chunk results (a sketch follows), or use specialized tools for extremely large documents.
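
A minimal map-reduce style sketch of that chunking approach, reusing the Anthropic import from earlier (chunk size and prompts are illustrative):

def analyze_oversized_document(document, question, chunk_chars=400000):
    """Analyze each chunk separately, then synthesize the partial answers"""
    client = Anthropic()

    # ~400K characters is roughly 100K tokens per chunk (4 chars/token heuristic)
    chunks = [document[i:i + chunk_chars] for i in range(0, len(document), chunk_chars)]

    partials = []
    for i, chunk in enumerate(chunks):
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1000,
            messages=[{
                "role": "user",
                "content": f"Document part {i + 1} of {len(chunks)}:\n\n{chunk}\n\n---\n\n{question}"
            }]
        )
        partials.append(response.content[0].text)

    # Reduce step: merge the per-chunk answers into a single response
    combined = "\n\n".join(partials)
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1500,
        messages=[{
            "role": "user",
            "content": f"Here are partial analyses of one long document:\n\n{combined}\n\n---\n\nSynthesize them into a single coherent answer to: {question}"
        }]
    )
    return response.content[0].text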

Written by Sanjeev Sharma
Full Stack Engineer · E-mopro