RAG Chunking Strategies — How You Split Documents Changes Everything

Introduction

How you chunk documents directly determines retrieval quality. Split too small and you lose context; too large and you waste tokens with irrelevant information. More critically, different chunking strategies optimize for different retrieval patterns.

This post covers the full spectrum of chunking approaches used in production RAG systems.

Fixed-Size Chunking with Overlap

The simplest approach: split by character count with overlap to preserve context:

interface ChunkingConfig {
  chunkSize: number; // characters per chunk
  overlapSize: number; // characters to overlap between chunks
}

function fixedSizeChunking(
  text: string,
  config: ChunkingConfig
): Array<{ id: string; text: string; startIdx: number; endIdx: number }> {
  const chunks: Array<{
    id: string;
    text: string;
    startIdx: number;
    endIdx: number;
  }> = [];
  const stride = config.chunkSize - config.overlapSize;
  if (stride <= 0) throw new Error('overlapSize must be smaller than chunkSize');

  for (let i = 0; i < text.length; i += stride) {
    const endIdx = Math.min(i + config.chunkSize, text.length);
    const chunkText = text.substring(i, endIdx);

    if (chunkText.length > 50) { // Skip tiny fragments
      chunks.push({
        id: `chunk_${chunks.length}`,
        text: chunkText,
        startIdx: i,
        endIdx: endIdx,
      });
    }

    if (endIdx === text.length) break;
  }

  return chunks;
}

// Usage: 1024 chars per chunk, 128 chars overlap
const chunks = fixedSizeChunking(largeText, {
  chunkSize: 1024,
  overlapSize: 128,
});

Pros: Simple, fast, predictable token usage
Cons: No understanding of content structure, loses semantic boundaries
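A quick way to sanity-check the stride arithmetic: consecutive chunks should share exactly `overlapSize` characters. Here is a compact standalone restatement of the loop above (without the tiny-fragment filter) run on an alphabet string:

```typescript
// Compact restatement of the fixed-size loop, minus the tiny-fragment filter
function chunkFixed(text: string, size: number, overlap: number): string[] {
  const out: string[] = [];
  const stride = size - overlap;
  for (let i = 0; i < text.length; i += stride) {
    out.push(text.substring(i, Math.min(i + size, text.length)));
    if (i + size >= text.length) break;
  }
  return out;
}

const demoChunks = chunkFixed('abcdefghijklmnopqrstuvwxyz', 10, 4);
// demoChunks: ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
// The last 4 characters of each chunk equal the first 4 of the next
```

The distinct letters make the shared overlap region easy to eyeball.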

Recursive Character Splitting

LangChain's recursive splitter: split on progressively smaller delimiters to preserve structure:

function recursiveCharacterSplit(
  text: string,
  targetChunkSize: number = 1024
): string[] {
  const separators = [
    '\n\n', // paragraph breaks
    '\n', // line breaks
    '. ', // sentence breaks
    ' ', // words
    '', // characters (last resort)
  ];

  function splitText(textToSplit: string, separatorIndex: number): string[] {
    if (textToSplit.length <= targetChunkSize) return [textToSplit];

    const separator = separators[separatorIndex];

    // Last resort: hard split by character count
    if (separator === '') {
      const pieces: string[] = [];
      for (let i = 0; i < textToSplit.length; i += targetChunkSize) {
        pieces.push(textToSplit.substring(i, i + targetChunkSize));
      }
      return pieces;
    }

    const splits = textToSplit.split(separator);
    const chunks: string[] = [];
    let merged = '';

    for (const s of splits) {
      const candidate = merged ? merged + separator + s : s;
      if (candidate.length <= targetChunkSize) {
        // Greedily merge small pieces up to the size budget
        merged = candidate;
      } else {
        if (merged) chunks.push(merged);
        if (s.length > targetChunkSize) {
          // Piece is still too big: recurse with the next smaller separator
          chunks.push(...splitText(s, separatorIndex + 1));
          merged = '';
        } else {
          merged = s;
        }
      }
    }
    if (merged) chunks.push(merged);

    return chunks.map(c => c.trim()).filter(c => c.length > 0);
  }

  return splitText(text, 0);
}

Pros: Preserves logical structure, respects paragraph/sentence boundaries
Cons: Still structural, not semantic
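The ordering of the separator list is the whole trick: larger delimiters are tried first, and the splitter only falls back to a smaller one when a piece is still over budget. A one-line illustration of that priority (the sample string is made up):

```typescript
// Separators in priority order: the first one present in the text wins
const separators = ['\n\n', '\n', '. ', ' '];
const piece = 'First paragraph.\n\nSecond paragraph that goes on.';
const chosen = separators.find(sep => piece.includes(sep));
// chosen: '\n\n' — paragraph breaks are tried before lines, sentences, words
```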

Semantic Chunking

Split based on semantic similarity between sentences:

import { embed } from './embeddings';

async function semanticChunking(
  text: string,
  similarityThreshold: number = 0.5
): Promise<string[]> {
  // Step 1: Split into sentences
  const sentences = text.match(/[^.!?]+[.!?]+/g) || [];

  if (sentences.length < 2) return sentences;

  // Step 2: Embed each sentence
  const embeddings = await Promise.all(
    sentences.map(s => embed(s.trim()))
  );

  // Step 3: Compute similarity between consecutive sentences
  function cosineSimilarity(a: number[], b: number[]): number {
    const dotProduct = a.reduce((sum, x, i) => sum + x * b[i], 0);
    const normA = Math.sqrt(a.reduce((sum, x) => sum + x * x, 0));
    const normB = Math.sqrt(b.reduce((sum, x) => sum + x * x, 0));
    return dotProduct / (normA * normB);
  }

  // Step 4: Identify chunk boundaries at low-similarity transitions
  const chunks: string[] = [];
  let currentChunk = sentences[0];

  for (let i = 1; i < sentences.length; i++) {
    const similarity = cosineSimilarity(embeddings[i - 1], embeddings[i]);

    if (similarity < similarityThreshold) {
      // Start new chunk at semantic boundary
      chunks.push(currentChunk.trim());
      currentChunk = sentences[i];
    } else {
      // Continue building current chunk
      currentChunk += ' ' + sentences[i];
    }
  }

  if (currentChunk) chunks.push(currentChunk.trim());

  return chunks;
}

Pros: Respects semantic boundaries, improves retrieval relevance
Cons: Slow (requires embedding every sentence), less predictable sizes
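To see the boundary rule in isolation, here is the threshold check run on hand-made 2-d sentence vectors (toy values, not output from a real embedding model): the third sentence points in a different direction, so similarity drops below 0.5 and a new chunk starts there.

```typescript
// Toy boundary detection: a similarity drop below the threshold marks a chunk break
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const na = Math.sqrt(a.reduce((s, x) => s + x * x, 0));
  const nb = Math.sqrt(b.reduce((s, x) => s + x * x, 0));
  return dot / (na * nb);
}

const sentenceVecs = [[1, 0], [0.9, 0.1], [0, 1]]; // third sentence shifts topic
const boundaries: number[] = [];
for (let i = 1; i < sentenceVecs.length; i++) {
  if (cosineSimilarity(sentenceVecs[i - 1], sentenceVecs[i]) < 0.5) boundaries.push(i);
}
// boundaries: [2] — a new chunk starts at the third sentence
```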

Document Structure Chunking

Parse document structure (headers, sections) and chunk accordingly:

interface StructuredChunk {
  text: string;
  metadata: {
    heading: string;
    section: string;
    hierarchy: string[];
    pageNumber?: number;
  };
}

async function structureAwareChunking(markdown: string): Promise<StructuredChunk[]> {
  const lines = markdown.split('\n');
  const chunks: StructuredChunk[] = [];
  let currentHeading = 'Root';
  let currentSection = '';
  let currentHierarchy: string[] = [];
  let buffer = '';

  for (const line of lines) {
    // Detect headers
    const headerMatch = line.match(/^(#{1,6})\s+(.+)$/);

    if (headerMatch) {
      // Save previous chunk
      if (buffer.trim()) {
        chunks.push({
          text: buffer.trim(),
          metadata: {
            heading: currentHeading,
            section: currentSection,
            hierarchy: [...currentHierarchy],
          },
        });
        buffer = '';
      }

      // Update hierarchy
      const level = headerMatch[1].length;
      currentHierarchy = currentHierarchy.slice(0, level - 1);
      currentHeading = headerMatch[2];
      currentHierarchy.push(currentHeading);
      currentSection = currentHierarchy.join(' > ');
    } else if (line.trim()) {
      buffer += line + '\n';

      // Create chunk when buffer reaches ~1000 chars
      if (buffer.length > 1000) {
        chunks.push({
          text: buffer.trim(),
          metadata: {
            heading: currentHeading,
            section: currentSection,
            hierarchy: [...currentHierarchy],
          },
        });
        buffer = '';
      }
    }
  }

  // Save final chunk
  if (buffer.trim()) {
    chunks.push({
      text: buffer.trim(),
      metadata: {
        heading: currentHeading,
        section: currentSection,
        hierarchy: [...currentHierarchy],
      },
    });
  }

  return chunks;
}

Pros: Preserves document structure in metadata, enables structure-aware retrieval
Cons: Requires parsing, format-dependent
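The hierarchy bookkeeping is the subtle part: each new header truncates the stack to its level before pushing, so sibling sections replace each other instead of accumulating. A standalone trace on a tiny made-up document:

```typescript
// Standalone restatement of just the header-tracking logic from above
const lines = ['# Guide', '## Install', 'Run npm install.', '## Usage', '### CLI', 'Run the tool.'];
let hierarchy: string[] = [];
const sections: string[] = [];

for (const line of lines) {
  const m = line.match(/^(#{1,6})\s+(.+)$/);
  if (m) {
    const level = m[1].length;
    hierarchy = hierarchy.slice(0, level - 1); // drop deeper/sibling headings
    hierarchy.push(m[2]);
  } else {
    sections.push(hierarchy.join(' > ')); // section path for this body line
  }
}
// sections: ['Guide > Install', 'Guide > Usage > CLI']
```

Note how "## Usage" replaces "## Install" at level 2 rather than nesting under it.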

Sentence-Window Retrieval

Embed individual sentences for precise matching, but return each hit together with a window of surrounding sentences for context:

interface SentenceWindowChunk {
  sentenceId: string;
  sentence: string;
  sentenceEmbedding: number[];
  window: {
    before?: string;
    after?: string;
  };
  metadata: Record<string, unknown>;
}

async function sentenceWindowRetrieval(
  text: string,
  windowSize: number = 2
): Promise<SentenceWindowChunk[]> {
  // Step 1: Split into sentences
  const sentences = text.match(/[^.!?]+[.!?]+/g) || [];

  // Step 2: Create sentence-level chunks with window context
  const chunks: SentenceWindowChunk[] = [];

  for (let i = 0; i < sentences.length; i++) {
    const sentence = sentences[i].trim();
    const sentenceEmbedding = await embed(sentence);

    // Collect surrounding sentences (window)
    const before = sentences.slice(Math.max(0, i - windowSize), i).join(' ');
    const after = sentences.slice(i + 1, Math.min(sentences.length, i + 1 + windowSize)).join(' ');

    chunks.push({
      sentenceId: `sent_${i}`,
      sentence,
      sentenceEmbedding,
      window: {
        before: before || undefined,
        after: after || undefined,
      },
      metadata: {
        sentenceIndex: i,
        docOffset: text.indexOf(sentence),
      },
    });
  }

  return chunks;
}

// At retrieval time: search by sentence, return window
async function retrieveWithWindow(
  query: string,
  chunks: SentenceWindowChunk[],
  topK: number = 5
): Promise<Array<{ text: string; metadata: Record<string, unknown> }>> {
  const queryEmbedding = await embed(query);

  // Compute similarity for each sentence
  const similarities = chunks.map(chunk => {
    const similarity = cosineSimilarity(queryEmbedding, chunk.sentenceEmbedding);
    return { chunk, similarity };
  });

  // Get top-k sentences
  const topSentences = similarities
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, topK);

  // Return sentences with their windows for context
  return topSentences.map(({ chunk }) => ({
    text: [
      chunk.window.before ? chunk.window.before + ' ' : '',
      chunk.sentence,
      chunk.window.after ? ' ' + chunk.window.after : '',
    ]
      .filter(Boolean)
      .join(''),
    metadata: chunk.metadata,
  }));
}

function cosineSimilarity(a: number[], b: number[]): number {
  const dotProduct = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const normA = Math.sqrt(a.reduce((sum, x) => sum + x * x, 0));
  const normB = Math.sqrt(b.reduce((sum, x) => sum + x * x, 0));
  return dotProduct / (normA * normB);
}

Pros: Balances search granularity with context preservation
Cons: More complex implementation, double storage
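The window construction reduces to two `slice` calls whose indices are clamped at the document edges. A standalone trace with placeholder sentences:

```typescript
// Window extraction for sentence index 3 with windowSize 2
const sentences = ['S0.', 'S1.', 'S2.', 'S3.', 'S4.'];
const windowSize = 2;
const i = 3;

const before = sentences.slice(Math.max(0, i - windowSize), i).join(' ');
const after = sentences.slice(i + 1, Math.min(sentences.length, i + 1 + windowSize)).join(' ');
// before: 'S1. S2.' (two neighbors available), after: 'S4.' (clamped at the end)
```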

Parent-Child Chunking

Embed fine-grained chunks, but retrieve parent (coarse) chunks:

interface ParentChildChunks {
  parentId: string;
  parentText: string;
  children: Array<{
    childId: string;
    text: string;
    embedding: number[];
  }>;
}

function parentChildChunking(
  text: string,
  parentSize: number = 2048,
  childSize: number = 512,
  overlap: number = 100
): ParentChildChunks[] {
  const parents: ParentChildChunks[] = [];
  const stride = parentSize - overlap;

  for (let i = 0; i < text.length; i += stride) {
    const endIdx = Math.min(i + parentSize, text.length);
    const parentText = text.substring(i, endIdx);

    if (parentText.length < 50) continue;

    const parentId = `parent_${parents.length}`;
    const children = [];
    const childStride = childSize - (overlap / 2);

    for (let j = 0; j < parentText.length; j += childStride) {
      const childEndIdx = Math.min(j + childSize, parentText.length);
      const childText = parentText.substring(j, childEndIdx);

      if (childText.length > 20) {
        children.push({
          childId: `${parentId}_child_${children.length}`,
          text: childText,
          embedding: [], // Populate via embedModel
        });
      }

      if (childEndIdx === parentText.length) break;
    }

    parents.push({
      parentId,
      parentText,
      children,
    });

    if (endIdx === text.length) break;
  }

  return parents;
}

// Retrieval: search children, return parents
async function retrieveWithParentChild(
  query: string,
  parentChildStructure: ParentChildChunks[],
  topK: number = 3
): Promise<string[]> {
  const queryEmbedding = await embed(query);
  const similarities: Array<{
    parentId: string;
    childId: string;
    similarity: number;
  }> = [];

  for (const parent of parentChildStructure) {
    for (const child of parent.children) {
      const similarity = cosineSimilarity(queryEmbedding, child.embedding);
      similarities.push({ parentId: parent.parentId, childId: child.childId, similarity });
    }
  }

  // Get top-k by child similarity, but return parent text
  const topChildren = similarities
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, topK);

  const parentIds = new Set(topChildren.map(c => c.parentId));
  const results: string[] = [];

  for (const parentId of parentIds) {
    const parent = parentChildStructure.find(p => p.parentId === parentId);
    if (parent) results.push(parent.parentText);
  }

  return results;
}

Pros: Fine-grained search, coarse context
Cons: Higher storage, complexity
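Note the deduplication step at retrieval time: several top-ranked children often share one parent, so the number of returned parents can be smaller than `topK`. In isolation:

```typescript
// Three top children, but only two distinct parents to return
const topChildren = [
  { parentId: 'parent_0', similarity: 0.92 },
  { parentId: 'parent_0', similarity: 0.88 },
  { parentId: 'parent_3', similarity: 0.85 },
];
const parentIds = [...new Set(topChildren.map(c => c.parentId))];
// parentIds: ['parent_0', 'parent_3']
```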

Late Chunking

Embed the entire document first, then split the text into chunks afterwards:

async function lateChunking(
  documents: Array<{ id: string; text: string }>,
  chunkSize: number = 10
): Promise<
  Array<{
    chunkId: string;
    docId: string;
    text: string;
    embedding: number[];
  }>
> {
  const chunks = [];
  let chunkCounter = 0;

  for (const doc of documents) {
    // Step 1: Get single embedding for entire document
    const docEmbedding = await embed(doc.text);

    // Step 2: Split document into logical chunks
    const docChunks = doc.text
      .split(/\n\n+/) // Split by paragraph
      .filter(c => c.length > 50);

    // Step 3: Assign shared embedding to all chunks
    // In practice, you'd use weighted combination based on chunk positions
    for (const chunkText of docChunks) {
      chunks.push({
        chunkId: `chunk_${chunkCounter++}`,
        docId: doc.id,
        text: chunkText,
        embedding: docEmbedding, // Shared!
      });
    }
  }

  return chunks;
}

Pros: Lower embedding costs, document-level coherence
Cons: Less granular search, all chunks share same embedding
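The sketch above shares one document embedding across every chunk; the published late-chunking technique instead runs a long-context model once and mean-pools the per-token vectors inside each chunk's span. A toy version of that pooling step, with invented 2-d token vectors standing in for real model output:

```typescript
// Mean-pool a span of per-token embeddings into one chunk embedding
function meanPool(tokenEmbeddings: number[][], start: number, end: number): number[] {
  const dim = tokenEmbeddings[0].length;
  const pooled: number[] = new Array(dim).fill(0);
  for (let t = start; t < end; t++) {
    for (let d = 0; d < dim; d++) pooled[d] += tokenEmbeddings[t][d];
  }
  return pooled.map(v => v / (end - start));
}

const tokenEmbeddings = [[1, 0], [3, 0], [0, 2], [0, 4]]; // 4 tokens, dim 2
const chunkA = meanPool(tokenEmbeddings, 0, 2); // tokens 0-1 → [2, 0]
const chunkB = meanPool(tokenEmbeddings, 2, 4); // tokens 2-3 → [0, 3]
```

Because the tokens were contextualized together before pooling, each chunk embedding still reflects the whole document.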

Chunk Size Experiments

Track these metrics to optimize chunk size for your domain:

interface ChunkingMetrics {
  avgChunkSize: number;
  minChunkSize: number;
  maxChunkSize: number;
  totalChunks: number;
  avgTokensPerChunk: number;
  avgRetrievalRank: number; // How high is relevant chunk ranked
  retrievalHitRate: number; // % of queries where relevant chunk is in top-5
}

async function evaluateChunkingStrategy(
  chunks: string[],
  testQueries: Array<{ query: string; relevantChunkIds: string[] }>,
  embedding: (text: string) => Promise<number[]>,
  tokenize: (text: string) => string[]
): Promise<ChunkingMetrics> {
  const tokenCounts = chunks.map(c => tokenize(c).length);
  const retrievalRanks: number[] = [];
  const hits: boolean[] = [];

  // Embed every chunk once up front rather than per query (an `await` inside
  // a non-async map callback would not compile, and re-embedding is wasteful)
  const chunkEmbeddings = await Promise.all(chunks.map(c => embedding(c)));

  for (const test of testQueries) {
    const queryEmbedding = await embedding(test.query);
    const similarities = chunkEmbeddings.map((chunkEmbedding, idx) => ({
      idx,
      similarity: cosineSimilarity(queryEmbedding, chunkEmbedding),
    }));

    similarities.sort((a, b) => b.similarity - a.similarity);

    for (const relevantId of test.relevantChunkIds) {
      // Assumes relevantChunkIds are stringified chunk indices
      const rank = similarities.findIndex(s => s.idx === parseInt(relevantId, 10));
      if (rank !== -1) {
        retrievalRanks.push(rank + 1);
        hits.push(rank < 5);
      }
    }
  }

  return {
    avgChunkSize: chunks.reduce((sum, c) => sum + c.length, 0) / chunks.length,
    minChunkSize: Math.min(...chunks.map(c => c.length)),
    maxChunkSize: Math.max(...chunks.map(c => c.length)),
    totalChunks: chunks.length,
    avgTokensPerChunk: tokenCounts.reduce((a, b) => a + b, 0) / tokenCounts.length,
    avgRetrievalRank: retrievalRanks.reduce((a, b) => a + b, 0) / retrievalRanks.length,
    retrievalHitRate: hits.filter(Boolean).length / hits.length,
  };
}
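The two retrieval metrics are easy to misread, so here they are computed on toy data: ranks are 1-based positions of the relevant chunk in the result list, and a hit means the rank lands in the top 5.

```typescript
// Toy metric computation: four test queries, one rank each
const retrievalRanks = [1, 3, 7, 2];

const avgRetrievalRank =
  retrievalRanks.reduce((a, b) => a + b, 0) / retrievalRanks.length;
const retrievalHitRate =
  retrievalRanks.filter(r => r <= 5).length / retrievalRanks.length;
// avgRetrievalRank: 3.25, retrievalHitRate: 0.75 (3 of 4 queries hit the top 5)
```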

Checklist

  • Start with recursive character splitting for general documents
  • Measure retrieval hit rate for your domain
  • Consider semantic chunking for highly technical content
  • Implement sentence-window retrieval for balanced context
  • Use structure-aware chunking for markdown/PDFs
  • Track chunk size distribution (target 512-1024 tokens)
  • Test multiple overlap sizes (10-20% recommended)
  • Evaluate reranking quality by chunk size
  • Monitor embedding costs relative to retrieval quality gains
  • A/B test chunking strategies on production queries
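On the overlap recommendation, a back-of-envelope on storage cost helps: more overlap means a smaller stride and therefore more chunks to embed and store. The document length and chunk size below are arbitrary illustration values.

```typescript
// Chunk count vs. overlap for a 100k-character document, 1024-char chunks
const docLength = 100_000;
const chunkSize = 1024;
const counts: Record<number, number> = {};

for (const overlapPct of [0, 10, 20]) {
  const overlap = Math.round(chunkSize * (overlapPct / 100));
  const stride = chunkSize - overlap;
  counts[overlapPct] = Math.ceil(docLength / stride);
}
// counts: { 0: 98, 10: 109, 20: 123 } — 20% overlap costs ~25% more chunks
```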

Conclusion

There's no universal best chunking strategy. The optimal approach depends on your document type, embedding model, and downstream task. Start with recursive splitting, measure hit rates, then progressively add semantic or structure-aware chunking. The key metric: does your retrieval system find relevant chunks in the top-k results? Everything else is implementation detail.