How Vector Search Actually Works — HNSW, IVFFlat, and Choosing the Right Index

Introduction

Vector search underlies every RAG system, but most engineers don't know what happens when they call vectorDb.search(). Under the hood, approximate nearest neighbor algorithms make the difference between 10ms and 10s retrieval times. Understanding HNSW, IVFFlat, and their parameters transforms you from a user into an operator.

Exact Nearest Neighbor: Brute Force, O(n)

Before approximation, understand exact search.

function exactNearestNeighbor(
  query: number[],
  vectors: number[][]
): { index: number; distance: number } {
  let bestIndex = 0;
  let bestDistance = Infinity;

  // Compare query against every vector
  for (let i = 0; i < vectors.length; i++) {
    const distance = euclideanDistance(query, vectors[i]);

    if (distance < bestDistance) {
      bestDistance = distance;
      bestIndex = i;
    }
  }

  return { index: bestIndex, distance: bestDistance };
}

function euclideanDistance(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    sum += diff * diff;
  }
  return Math.sqrt(sum);
}

Brute force is O(n × d) time per query and O(n × d) space (n = vectors, d = dimensions). For 1M vectors, every query performs 1M distance computations. Unacceptable for production.

Approximate Nearest Neighbor (ANN) Concept

ANN trades accuracy for speed by accepting approximate matches. Instead of finding the true nearest neighbor, find a "nearby" neighbor quickly.

The ANN frontier:

  • Recall: % of true top-10 nearest neighbors found
  • Latency: Query time
  • Index size: Memory overhead

You can trade any one of these for the others: a denser index buys recall at the cost of memory and build time; a wider search beam buys recall at the cost of latency. This is the fundamental ANN trade-off.
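
Recall is straightforward to measure once you have exact ground-truth neighbors from brute force. A minimal sketch (the function name and ID-list representation are illustrative, not from a specific library):

```typescript
// Recall@k: fraction of the true top-k neighbors that the ANN index returned.
// Both arguments are lists of vector IDs; `exact` comes from brute-force search.
function recallAtK(approximate: number[], exact: number[], k: number): number {
  const truth = new Set(exact.slice(0, k));
  let hits = 0;
  for (const id of approximate.slice(0, k)) {
    if (truth.has(id)) hits++;
  }
  return hits / k;
}
```

If the index returns three of the five true nearest neighbors, recall@5 is 0.6. Every recall figure in this article is this quantity, averaged over a query set.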

HNSW: Hierarchical Navigable Small World

HNSW is a graph-based index. Imagine a subway system: express lines skip most stations, local lines stop everywhere. HNSW rides the sparse upper layers to cover distance quickly, then drops to the denser lower layers to close in on the target.

interface HNSWNode {
  id: number;
  vector: number[];
  neighbors: Set<number>[]; // one adjacency set per layer
}

class HNSWIndex {
  nodes: HNSWNode[] = [];
  M: number; // max connections per node
  efConstruction: number; // beam width during construction

  constructor(options: { M?: number; efConstruction?: number } = {}) {
    this.M = options.M ?? 16;
    this.efConstruction = options.efConstruction ?? 200;
  }

  insert(vector: number[]): void {
    // Simplified sketch: real HNSW assigns each node a random top layer;
    // here every node lives on layer 0 only.
    const newNode: HNSWNode = {
      id: this.nodes.length,
      vector,
      neighbors: [new Set<number>()],
    };

    if (this.nodes.length === 0) {
      this.nodes.push(newNode);
      return;
    }

    // Start from an entry point at the highest layer
    let nearest = this.nodes[0];

    // Navigate from the top layer down to layer 1, greedily moving closer
    for (let layer = this.nodes[0].neighbors.length - 1; layer >= 1; layer--) {
      nearest = this.searchLayer(vector, [nearest], layer, 1)[0];
    }

    // At layer 0, gather candidates with a beam search of width ef_construction
    const candidates = this.searchLayer(
      vector,
      [nearest],
      0,
      this.efConstruction
    );

    // Connect the new node bidirectionally to its M nearest candidates
    const m = Math.min(this.M, candidates.length);
    for (let i = 0; i < m; i++) {
      const neighbor = candidates[i];
      newNode.neighbors[0].add(neighbor.id);
      neighbor.neighbors[0].add(newNode.id);
    }

    this.nodes.push(newNode);
  }

  search(query: number[], ef: number = 40, k: number = 10): number[] {
    // ef = search beam width; higher = better recall, slower
    let nearest = this.nodes[0];

    // Navigate from the top layer down to layer 1 with beam width 1
    for (let layer = this.nodes[0].neighbors.length - 1; layer >= 1; layer--) {
      nearest = this.searchLayer(query, [nearest], layer, 1)[0];
    }

    // Widen the beam to ef at layer 0, then take the top k
    const candidates = this.searchLayer(query, [nearest], 0, ef);
    return candidates.slice(0, k).map(c => c.id);
  }

  private searchLayer(
    query: number[],
    entryPoints: HNSWNode[],
    layer: number,
    ef: number
  ): HNSWNode[] {
    const visited = new Set<number>();
    const candidates = new PriorityQueue<HNSWNode>(); // min-heap by distance
    const results = new PriorityQueue<HNSWNode>(); // max-heap via negated distance

    for (const entry of entryPoints) {
      const dist = euclideanDistance(query, entry.vector);
      candidates.push(entry, dist);
      results.push(entry, -dist);
      visited.add(entry.id);
    }

    while (candidates.size() > 0) {
      // Stop once the closest unexplored candidate is farther away
      // than the worst result we are keeping
      const lowerBound = candidates.peek()[1];
      if (lowerBound > -results.peek()[1]) break;

      const current = candidates.pop()[0];

      // Explore neighbors
      for (const neighborId of current.neighbors[layer]) {
        if (visited.has(neighborId)) continue;
        visited.add(neighborId);

        const neighbor = this.nodes[neighborId];
        const dist = euclideanDistance(query, neighbor.vector);

        if (results.size() < ef || dist < -results.peek()[1]) {
          candidates.push(neighbor, dist);
          results.push(neighbor, -dist);

          if (results.size() > ef) results.pop(); // evict the farthest
        }
      }
    }

    // Priorities are negated distances, so sort descending for nearest-first
    return results
      .toArray()
      .sort((a, b) => b[1] - a[1])
      .map(item => item[0]);
  }
}
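
The class above leans on a PriorityQueue with push(item, priority), peek() and pop() returning [item, priority] pairs, size(), and toArray(). That helper is not part of the algorithm itself, but for completeness here is one minimal binary min-heap satisfying the assumed interface (the max-heap behavior in searchLayer comes from pushing negated distances into it):

```typescript
// Minimal binary min-heap matching the PriorityQueue interface assumed above.
// A sketch for illustration, not a tuned implementation.
class PriorityQueue<T> {
  private heap: Array<[T, number]> = [];

  size(): number {
    return this.heap.length;
  }

  peek(): [T, number] {
    return this.heap[0];
  }

  toArray(): Array<[T, number]> {
    // Copy sorted ascending by priority
    return [...this.heap].sort((a, b) => a[1] - b[1]);
  }

  push(item: T, priority: number): void {
    this.heap.push([item, priority]);
    let i = this.heap.length - 1;
    while (i > 0) {
      const parent = (i - 1) >> 1;
      if (this.heap[parent][1] <= this.heap[i][1]) break;
      [this.heap[parent], this.heap[i]] = [this.heap[i], this.heap[parent]];
      i = parent;
    }
  }

  pop(): [T, number] {
    const top = this.heap[0];
    const last = this.heap.pop()!;
    if (this.heap.length > 0) {
      this.heap[0] = last;
      let i = 0;
      for (;;) {
        const left = 2 * i + 1;
        const right = 2 * i + 2;
        let smallest = i;
        if (left < this.heap.length && this.heap[left][1] < this.heap[smallest][1]) smallest = left;
        if (right < this.heap.length && this.heap[right][1] < this.heap[smallest][1]) smallest = right;
        if (smallest === i) break;
        [this.heap[smallest], this.heap[i]] = [this.heap[i], this.heap[smallest]];
        i = smallest;
      }
    }
    return top;
  }
}
```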

HNSW properties:

  • Build time: O(n log n), since each of the n inserts performs a logarithmic search
  • Query time: O(log n) in practice, thanks to the layer hierarchy
  • Recall: can approach 99% with well-tuned parameters
  • Memory: roughly 10-20x the raw vector size
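
To make the memory bullet concrete, here is a back-of-envelope estimate. It assumes 4-byte float32 components, about 8 bytes per stored graph edge, and an average of ~1.5 layers' worth of links per node; all three numbers are assumptions for illustration, not measurements:

```typescript
// Rough HNSW memory footprint: raw vectors plus graph links.
function estimateHnswMemoryMB(n: number, d: number, M: number): number {
  const vectorBytes = n * d * 4; // float32 components
  const linkBytes = n * M * 1.5 * 8; // edges across ~1.5 layers per node
  return (vectorBytes + linkBytes) / (1024 * 1024);
}
```

For 1M 768-dimensional vectors at M=16 this lands around 3 GB, which is why memory, not CPU, is usually the first HNSW bottleneck you hit.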

HNSW Parameters: ef_construction and M

Two critical knobs:

  • M (default 16): Max connections per node. Higher M = more memory, better recall
  • ef_construction (default 200): Beam width during index building. Higher = better recall, slower build

// Tuning HNSW for different scenarios

// Low latency, high throughput (e-commerce search)
const lowLatency = { M: 8, efConstruction: 100 };

// High recall, acceptable latency (RAG retrieval)
const highRecall = { M: 24, efConstruction: 400 };

// Massive scale, limited memory (1B+ vectors)
const massiveScale = { M: 4, efConstruction: 50 };

// At query time, tune ef (search beam width) separately
async function search(index: HNSWIndex, query: number[]): Promise<number[]> {
  // ef = beam width for search. Higher = slower, better recall
  // Typical: ef = 40-100
  return index.search(query, /* ef */ 50, /* k */ 10);
}

Typical impact:

  • M=8: recall 90%, 8ms query time
  • M=16: recall 95%, 12ms query time
  • M=32: recall 98%, 25ms query time

IVFFlat: Inverted File Index

IVFFlat (Inverted File with Flat quantization) uses a different strategy: cluster first, then search within clusters.

class IVFFlatIndex {
  nlist: number; // number of clusters (inverted lists)
  nprobe: number; // how many clusters to search
  centroids: number[][] = [];
  // lists[clusterId] = vectors assigned to that cluster, with original indexes
  lists: Array<Array<{ id: number; vector: number[] }>> = [];

  constructor(options: { nlist: number; nprobe?: number }) {
    this.nlist = options.nlist;
    this.nprobe = options.nprobe ?? 10;
  }

  async build(vectors: number[][]): Promise<void> {
    // Cluster vectors into nlist clusters using k-means
    // (assumes a KMeans helper exposing fit() and centroids)
    const kmeans = new KMeans(this.nlist);
    const assignments = await kmeans.fit(vectors);

    // Build inverted lists
    this.centroids = kmeans.centroids;
    this.lists = Array(this.nlist)
      .fill(null)
      .map(() => []);

    for (let i = 0; i < vectors.length; i++) {
      const clusterId = assignments[i];
      this.lists[clusterId].push({ id: i, vector: vectors[i] });
    }
  }

  search(query: number[], k: number = 10): number[] {
    // Find the nprobe nearest clusters to the query
    const clusterDistances = this.centroids.map((centroid, clusterId) => ({
      clusterId,
      distance: euclideanDistance(query, centroid),
    }));

    clusterDistances.sort((a, b) => a.distance - b.distance);
    const nearestClusters = clusterDistances
      .slice(0, this.nprobe)
      .map(c => c.clusterId);

    // Exhaustively search within the chosen clusters
    const candidates: Array<{ index: number; distance: number }> = [];

    for (const clusterId of nearestClusters) {
      for (const entry of this.lists[clusterId]) {
        const distance = euclideanDistance(query, entry.vector);
        candidates.push({ index: entry.id, distance });
      }
    }

    // Return top-k
    candidates.sort((a, b) => a.distance - b.distance);
    return candidates.slice(0, k).map(c => c.index);
  }
}
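
The cost model behind nlist and nprobe is simple: with roughly balanced clusters, each probed list holds about n / nlist vectors, so a query scans approximately nprobe / nlist of the corpus. A sketch of that arithmetic (the function name is illustrative):

```typescript
// Expected number of candidate vectors scanned per IVFFlat query,
// assuming k-means produced roughly balanced clusters.
function expectedScannedVectors(n: number, nlist: number, nprobe: number): number {
  return Math.round((n / nlist) * nprobe);
}
```

With 1M vectors, nlist = 1000, and nprobe = 10, a query scans about 10,000 vectors (1% of the corpus) instead of all 1M. Raising nprobe buys recall linearly in scan cost.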

HNSW vs IVFFlat Trade-Offs

Aspect      | HNSW                      | IVFFlat
Build time  | O(n log n), slow          | O(n), fast
Query time  | O(log n), fast            | O((nprobe / nlist) × n), parameter-dependent
Recall      | 95-99% (tunable)          | 85-95% (tunable)
Memory      | 10-20x vector size        | Low overhead (centroids only)
Best for    | High-recall applications  | Large-scale, cost-constrained

HNSW wins: RAG systems where recall > cost, indexes <100M vectors

IVFFlat wins: Massive scale (1B+ vectors), memory-constrained, acceptable recall loss

// Typical production choice: HNSW for RAG
// (better recall, manageable memory)
async function chooseIndex(vectorCount: number): Promise<VectorIndex> {
  if (vectorCount < 10_000_000) {
    // <10M: use HNSW for best recall
    return new HNSWIndex({ M: 16, efConstruction: 200 });
  } else {
    // >=10M: use IVFFlat for memory efficiency
    return new IVFFlatIndex({
      nlist: Math.round(Math.sqrt(vectorCount)), // nlist must be an integer
      nprobe: 100,
    });
  }
}

Quantization Impact on Recall

Quantization reduces vector precision (e.g., 32-bit float → 8-bit int) to save memory.

function quantize8bit(vector: number[]): {
  values: number[];
  min: number;
  max: number;
} {
  // Map each component linearly onto [0, 255]; keep min/max so the
  // vector can be dequantized later
  const min = Math.min(...vector);
  const max = Math.max(...vector);
  const range = max - min || 1; // guard against constant vectors

  return {
    values: vector.map(v => Math.round(((v - min) / range) * 255)),
    min,
    max,
  };
}

function dequantize(quantized: number[], min: number, max: number): number[] {
  const range = max - min;
  return quantized.map(v => (v / 255) * range + min);
}

Recall impact:

  • 32-bit float: 100% (baseline)
  • 8-bit quantization: 98-99% (0.5-2% loss)
  • 1-bit (binary): 85-90% (10-15% loss)

Memory savings:

  • 32-bit: 1.0x
  • 8-bit: 0.25x (75% savings)
  • 1-bit: 0.03x (97% savings)

For RAG, 8-bit quantization is usually the sweet spot: minimal recall loss, huge memory savings.
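
A quick sanity check on why the 8-bit loss is so small: uniform quantization over a range introduces at most half a quantization step of error per component. The helper below is illustrative, not from a library:

```typescript
// Worst-case per-component reconstruction error of uniform 8-bit quantization:
// half of one quantization step, i.e. (max - min) / (2 * 255).
function maxQuantizationError(min: number, max: number): number {
  return (max - min) / (2 * 255);
}
```

For embedding components in [-1, 1] that is about 0.004 per component, small enough that nearest-neighbor rankings rarely change, consistent with the 98-99% recall figure above.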

Benchmarking With ANN-Benchmarks

Use standardized benchmarks (ANN-Benchmarks) to compare indexes.

interface BenchmarkResult {
  indexType: string; // HNSW, IVFFlat, etc.
  parameters: Record<string, number>;
  queryLatencyMs: number;
  buildTimeMs: number;
  memoryMB: number;
  recall: number; // 0-1
}

async function benchmarkIndexes(
  vectors: number[][],
  queries: number[][],
  goldNeighbors: number[][]
): Promise<BenchmarkResult[]> {
  const results: BenchmarkResult[] = [];

  // Benchmark HNSW across a small parameter grid
  for (const M of [8, 16, 32]) {
    for (const efConstruction of [100, 200, 400]) {
      const index = new HNSWIndex({ M, efConstruction });

      const buildStart = Date.now();
      for (const vector of vectors) index.insert(vector);
      const buildTime = Date.now() - buildStart;

      let totalLatency = 0;
      const recalls: number[] = [];

      for (let i = 0; i < queries.length; i++) {
        const queryStart = Date.now();
        const found = index.search(queries[i]);
        totalLatency += Date.now() - queryStart;

        recalls.push(computeRecall(found, goldNeighbors[i]));
      }

      results.push({
        indexType: 'HNSW',
        parameters: { M, efConstruction },
        queryLatencyMs: totalLatency / queries.length,
        buildTimeMs: buildTime,
        memoryMB: estimateMemory(vectors.length, M),
        recall: recalls.reduce((a, b) => a + b, 0) / recalls.length,
      });
    }
  }

  return results;
}

Use benchmarking results to choose parameters for your workload.

Checklist

  • Understand exact NN (brute force) as baseline
  • Use HNSW for high-recall applications (<100M vectors)
  • Use IVFFlat for massive scale (>1B vectors), memory-constrained
  • Tune HNSW: M for memory/quality, ef for search speed/quality
  • Tune IVFFlat: nlist for cluster granularity, nprobe for search scope
  • Consider 8-bit quantization for 75% memory savings with <2% recall loss
  • Benchmark on your corpus with realistic query patterns
  • Monitor recall: aim for >90% for RAG applications

Conclusion

Vector search is a solved problem if you understand the trade-offs. HNSW dominates for typical RAG systems: fast (milliseconds), accurate (95-99% recall), and manageable memory. IVFFlat takes over at billion scale. 8-bit quantization can cut memory by roughly 75% with negligible accuracy loss. Benchmark on your data; parameters vary wildly across domains.