How Vector Search Actually Works — HNSW, IVFFlat, and Choosing the Right Index

Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
Vector search underlies every RAG system, yet most engineers don't know what happens when they call vectorDb.search(). Under the hood, approximate nearest neighbor (ANN) algorithms make the difference between 10 ms and 10 s retrieval times. Understanding HNSW, IVFFlat, and their parameters turns you from a user into an operator.
- Exact Nearest Neighbor: Brute Force, O(n)
- Approximate Nearest Neighbor (ANN) Concept
- HNSW: Hierarchical Navigable Small World
- HNSW Parameters: ef_construction and M
- IVFFlat: Inverted File Index
- HNSW vs IVFFlat Trade-Offs
- Quantization Impact on Recall
- Benchmarking With ANN-Benchmarks
- Checklist
- Conclusion
Exact Nearest Neighbor: Brute Force, O(n)
Before approximation, understand exact search.
function exactNearestNeighbor(
  query: number[],
  vectors: number[][]
): { index: number; distance: number } {
  let bestIndex = 0;
  let bestDistance = Infinity;
  // Compare the query against every stored vector
  for (let i = 0; i < vectors.length; i++) {
    const distance = euclideanDistance(query, vectors[i]);
    if (distance < bestDistance) {
      bestDistance = distance;
      bestIndex = i;
    }
  }
  return { index: bestIndex, distance: bestDistance };
}

function euclideanDistance(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += Math.pow(a[i] - b[i], 2);
  }
  return Math.sqrt(sum);
}
Brute force is O(n × d) per query and O(n × d) space (n = vectors, d = dimensions). For 1M vectors, every query performs 1M distance computations. That's unacceptable for production latency budgets.
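To make the cost concrete, here is the brute-force function in action (the helpers are repeated so the snippet runs on its own):

```typescript
function euclideanDistance(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += Math.pow(a[i] - b[i], 2);
  }
  return Math.sqrt(sum);
}

function exactNearestNeighbor(
  query: number[],
  vectors: number[][]
): { index: number; distance: number } {
  let bestIndex = 0;
  let bestDistance = Infinity;
  // O(n) scan: every stored vector gets compared to the query
  for (let i = 0; i < vectors.length; i++) {
    const distance = euclideanDistance(query, vectors[i]);
    if (distance < bestDistance) {
      bestDistance = distance;
      bestIndex = i;
    }
  }
  return { index: bestIndex, distance: bestDistance };
}

// [0.9, 1.1] is the true nearest neighbor of [1, 1]
const hit = exactNearestNeighbor([1, 1], [[0, 0], [0.9, 1.1], [5, 5]]);
```

With three vectors the scan is instant; with a million, every single query pays the full linear cost, which is exactly what ANN indexes avoid.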
Approximate Nearest Neighbor (ANN) Concept
ANN trades accuracy for speed by accepting approximate matches. Instead of finding the true nearest neighbor, find a "nearby" neighbor quickly.
The ANN frontier:
- Recall: % of true top-10 nearest neighbors found
- Latency: Query time
- Index size: Memory overhead
You can't win on all three at once: a denser index buys recall at the cost of memory and build time, and a wider search beam buys recall at the cost of latency. This is the fundamental ANN trade-off.
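Recall is easy to measure once you have ground-truth neighbors from brute force. A minimal recall@k helper (my own sketch, not from any particular library) might look like:

```typescript
// recall@k: what fraction of the true top-k neighbor ids
// did the approximate search actually return?
function recallAtK(approx: number[], groundTruth: number[], k: number): number {
  const truth = new Set(groundTruth.slice(0, k));
  let hits = 0;
  for (const id of approx.slice(0, k)) {
    if (truth.has(id)) hits++;
  }
  return hits / Math.min(k, truth.size);
}

// ANN returned [3, 7, 9, 2]; true top-4 is [3, 9, 4, 2] → 3 of 4 found
const r = recallAtK([3, 7, 9, 2], [3, 9, 4, 2], 4);
```

Tracking this number while sweeping index parameters is how all the trade-off curves later in this post are produced.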
HNSW: Hierarchical Navigable Small World
HNSW is a graph-based index. Imagine a subway system: express lines skip most stations, local lines stop everywhere. HNSW enters on an express layer, rides it as close to the destination as it can, then transfers down to progressively more local layers to finish the trip.
interface HNSWNode {
  id: number;
  vector: number[];
  neighbors: Set<number>[]; // one adjacency set per layer
}

class HNSWIndex {
  nodes: HNSWNode[] = [];
  M: number; // max connections per node
  efConstruction: number; // beam width during construction

  insert(vector: number[]): void {
    const newNode: HNSWNode = {
      id: this.nodes.length,
      vector,
      neighbors: [new Set<number>()], // layer 0 (higher layers omitted in this sketch)
    };
    if (this.nodes.length === 0) {
      this.nodes.push(newNode);
      return;
    }
    // Start from an entry point at the highest layer
    let nearest = this.nodes[0];
    // Navigate from the top layer down, greedily tracking the nearest node
    for (let layer = this.nodes[0].neighbors.length - 1; layer > 0; layer--) {
      nearest = this.searchLayer(vector, [nearest], layer, 1)[0];
    }
    // At layer 0, run a beam search of width efConstruction
    const candidates = this.searchLayer(
      vector,
      [nearest],
      0,
      this.efConstruction
    );
    // Connect the new node bidirectionally to its M nearest candidates
    const m = Math.min(this.M, candidates.length);
    for (let i = 0; i < m; i++) {
      const neighbor = candidates[i];
      newNode.neighbors[0].add(neighbor.id);
      neighbor.neighbors[0].add(newNode.id);
    }
    this.nodes.push(newNode);
  }

  search(query: number[], ef: number = 40, k: number = 10): number[] {
    // ef = search beam width; higher = better recall, slower
    let nearest = this.nodes[0];
    // Navigate top-down through the upper layers
    for (let layer = this.nodes[0].neighbors.length - 1; layer > 0; layer--) {
      nearest = this.searchLayer(query, [nearest], layer, 1)[0];
    }
    // Full beam search at layer 0
    const candidates = this.searchLayer(query, [nearest], 0, ef);
    return candidates.slice(0, k).map(c => c.id);
  }

  private searchLayer(
    query: number[],
    entryPoints: HNSWNode[],
    layer: number,
    ef: number
  ): HNSWNode[] {
    // PriorityQueue is assumed: push(item, priority); peek()/pop() return [item, priority]
    const visited = new Set<number>();
    const candidates = new PriorityQueue<HNSWNode>(); // min-heap by distance
    const results = new PriorityQueue<HNSWNode>(); // max-heap via negated distance
    for (const entry of entryPoints) {
      const dist = euclideanDistance(query, entry.vector);
      candidates.push(entry, dist);
      results.push(entry, -dist);
      visited.add(entry.id);
    }
    while (candidates.size() > 0) {
      const [, lowerBound] = candidates.peek();
      // Stop when the closest unexplored candidate is farther than
      // the worst result we already hold
      if (lowerBound > -results.peek()[1]) break;
      const [current] = candidates.pop();
      // Explore the current node's neighbors on this layer
      for (const neighborId of current.neighbors[layer]) {
        if (visited.has(neighborId)) continue;
        visited.add(neighborId);
        const neighbor = this.nodes[neighborId];
        const dist = euclideanDistance(query, neighbor.vector);
        if (results.size() < ef || dist < -results.peek()[1]) {
          candidates.push(neighbor, dist);
          results.push(neighbor, -dist);
          if (results.size() > ef) results.pop();
        }
      }
    }
    // Return results ordered nearest-first (priorities are negated distances)
    return results
      .toArray()
      .sort((a, b) => b[1] - a[1])
      .map(([node]) => node);
  }
}
HNSW properties:
- Build time: O(n log n) due to multi-layer hierarchy
- Query time: O(log n) logarithmic, competitive with B-trees
- Recall: Can approach 99% with good parameters
- Memory: highest of the common ANN indexes — the full vectors plus per-node graph links that grow with M
HNSW Parameters: ef_construction and M
Two critical knobs:
- M (default 16): Max connections per node. Higher M = more memory, better recall
- ef_construction (default 200): Beam width during index building. Higher = better recall, slower build
// Tuning HNSW for different scenarios
// Low latency, high throughput (e-commerce search)
const lowLatency = { M: 8, efConstruction: 100 };
// High recall, acceptable latency (RAG retrieval)
const highRecall = { M: 24, efConstruction: 400 };
// Massive scale, limited memory (1B+ vectors)
const massiveScale = { M: 4, efConstruction: 50 };
// At query time, tune ef (the search beam width) separately
async function search(index: HNSWIndex, query: number[]): Promise<number[]> {
  // Higher ef = better recall, slower queries; typical values are 40-100
  return index.search(query, 50, 10); // ef = 50, k = 10
}
Typical impact (illustrative numbers; measure on your own corpus and hardware):
- M=8: recall 90%, 8ms query time
- M=16: recall 95%, 12ms query time
- M=32: recall 98%, 25ms query time
IVFFlat: Inverted File Index
IVFFlat (Inverted File with Flat quantization) uses a different strategy: cluster first, then search within clusters.
class IVFFlatIndex {
  nlist: number; // number of clusters (inverted lists)
  nprobe: number; // how many clusters to search
  centroids: number[][] = [];
  lists: number[][] = []; // lists[clusterId] = indices of vectors in that cluster
  vectors: number[][] = [];

  async build(vectors: number[][]): Promise<void> {
    this.vectors = vectors;
    // Cluster vectors into nlist clusters using k-means
    const kmeans = new KMeans(this.nlist);
    const assignments = await kmeans.fit(vectors);
    // Build inverted lists of vector indices
    this.centroids = kmeans.centroids;
    this.lists = Array.from({ length: this.nlist }, () => []);
    for (let i = 0; i < vectors.length; i++) {
      this.lists[assignments[i]].push(i);
    }
  }

  search(query: number[], k: number = 10): number[] {
    // Find the nprobe clusters whose centroids are nearest to the query
    const clusterDistances = this.centroids.map((centroid, clusterId) => ({
      clusterId,
      distance: euclideanDistance(query, centroid),
    }));
    clusterDistances.sort((a, b) => a.distance - b.distance);
    const nearestClusters = clusterDistances
      .slice(0, this.nprobe)
      .map(c => c.clusterId);
    // Exhaustively search within the chosen clusters only
    const candidates: Array<{ index: number; distance: number }> = [];
    for (const clusterId of nearestClusters) {
      for (const index of this.lists[clusterId]) {
        const distance = euclideanDistance(query, this.vectors[index]);
        candidates.push({ index, distance });
      }
    }
    // Return top-k by distance
    candidates.sort((a, b) => a.distance - b.distance);
    return candidates.slice(0, k).map(c => c.index);
  }
}
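IVFFlat has its own two knobs: nlist (cluster count) and nprobe (clusters searched per query). Two common rules of thumb — assumptions worth validating on your own corpus, not hard rules — are nlist ≈ √n and nprobe at a small percentage of nlist. A hypothetical helper:

```typescript
// Rule-of-thumb IVFFlat starting parameters (a sketch; tune against
// your own data — these are heuristics, not guarantees).
function suggestIvfParams(vectorCount: number): { nlist: number; nprobe: number } {
  // nlist ≈ sqrt(n) balances per-cluster scan cost against centroid-scan cost
  const nlist = Math.max(1, Math.round(Math.sqrt(vectorCount)));
  // Probing ~5% of clusters is a common starting point; raise it for recall
  const nprobe = Math.max(1, Math.round(nlist * 0.05));
  return { nlist, nprobe };
}

// e.g. 1M vectors → nlist = 1000, nprobe = 50
const params = suggestIvfParams(1_000_000);
```

Raising nprobe toward nlist smoothly trades latency back for recall; at nprobe = nlist you've reconstructed brute force.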
HNSW vs IVFFlat Trade-Offs
| Aspect | HNSW | IVFFlat |
|---|---|---|
| Build time | O(n log n), slow | O(n) after k-means, fast |
| Query time | O(log n), fast | O(n × nprobe / nlist), parameter-dependent |
| Recall | 95-99% (tunable) | 85-95% (tunable) |
| Memory | Vectors + graph links (highest) | Vectors + centroids (no graph) |
| Best for | High-recall applications | Large-scale, cost-constrained |
HNSW wins: RAG systems where recall > cost, indexes <100M vectors
IVFFlat wins: Massive scale (1B+ vectors), memory-constrained, acceptable recall loss
// Typical production choice: HNSW for RAG
// (better recall, manageable memory)
async function chooseIndex(vectorCount: number): Promise<VectorIndex> {
  if (vectorCount < 10_000_000) {
    // <10M vectors: HNSW for best recall
    return new HNSWIndex({ M: 16, efConstruction: 200 });
  } else {
    // >=10M vectors: IVFFlat for memory efficiency
    const nlist = Math.round(Math.sqrt(vectorCount)); // common heuristic: nlist ≈ √n
    return new IVFFlatIndex({ nlist, nprobe: 100 });
  }
}
Quantization Impact on Recall
Quantization reduces vector precision (e.g., 32-bit float → 8-bit int) to save memory.
function quantize8bit(
  vector: number[]
): { values: number[]; min: number; max: number } {
  // Map each component linearly onto [0, 255];
  // min and max must be kept so the vector can be dequantized later
  const min = Math.min(...vector);
  const max = Math.max(...vector);
  const range = max - min || 1; // guard against constant vectors
  return {
    values: vector.map(v => Math.round(((v - min) / range) * 255)),
    min,
    max,
  };
}

function dequantize(quantized: number[], min: number, max: number): number[] {
  const range = max - min;
  return quantized.map(v => (v / 255) * range + min);
}
Recall impact:
- 32-bit float: 100% (baseline)
- 8-bit quantization: 98-99% (0.5-2% loss)
- 1-bit (binary): 85-90% (10-15% loss)
Memory savings:
- 32-bit: 1.0x
- 8-bit: 0.25x (75% savings)
- 1-bit: 0.03x (97% savings)
For RAG, 8-bit quantization is usually the sweet spot: minimal recall loss, huge memory savings.
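To see why the loss is so small, round-trip a vector through the 8-bit scheme and look at the worst per-component error (the quantize/dequantize pair is repeated here so the snippet is self-contained):

```typescript
// 8-bit scalar quantization round-trip. Per-component error is bounded
// by about half a quantization step: (max - min) / (2 × 255).
function quantize8bit(
  vector: number[]
): { values: number[]; min: number; max: number } {
  const min = Math.min(...vector);
  const max = Math.max(...vector);
  const range = max - min || 1; // guard against constant vectors
  return {
    values: vector.map(v => Math.round(((v - min) / range) * 255)),
    min,
    max,
  };
}

function dequantize(values: number[], min: number, max: number): number[] {
  const range = max - min;
  return values.map(v => (v / 255) * range + min);
}

const original = [-0.42, 0.13, 0.99, -1.0, 0.5];
const q = quantize8bit(original);
const restored = dequantize(q.values, q.min, q.max);
const maxError = Math.max(
  ...original.map((v, i) => Math.abs(v - restored[i]))
);
// For this vector, range = 1.99, so maxError stays under ~0.004
```

Distances computed on the restored vectors barely move, which is why recall drops by only a point or two while memory shrinks 4x.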
Benchmarking With ANN-Benchmarks
Use standardized benchmarks (ANN-Benchmarks) to compare indexes.
interface BenchmarkResult {
indexType: string; // HNSW, IVFFlat, etc.
parameters: Record<string, number>;
queryLatencyMs: number;
buildTimeMs: number;
memoryMB: number;
recall: number; // 0-1
}
async function benchmarkIndexes(
  vectors: number[][],
  queries: number[][],
  goldNeighbors: number[][]
): Promise<BenchmarkResult[]> {
  const results: BenchmarkResult[] = [];
  // Sweep HNSW parameters
  for (const M of [8, 16, 32]) {
    for (const efConstruction of [100, 200, 400]) {
      const index = new HNSWIndex({ M, efConstruction });
      const buildStart = Date.now();
      for (const vector of vectors) index.insert(vector);
      const buildTimeMs = Date.now() - buildStart;
      let totalLatency = 0;
      const recalls: number[] = [];
      for (let i = 0; i < queries.length; i++) {
        const queryStart = Date.now();
        const found = index.search(queries[i]); // named to avoid shadowing `results`
        totalLatency += Date.now() - queryStart;
        recalls.push(computeRecall(found, goldNeighbors[i]));
      }
      results.push({
        indexType: 'HNSW',
        parameters: { M, efConstruction },
        queryLatencyMs: totalLatency / queries.length,
        buildTimeMs,
        memoryMB: estimateMemory(vectors.length, M),
        recall: recalls.reduce((a, b) => a + b, 0) / recalls.length,
      });
    }
  }
  return results;
}
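The harness above leans on two helpers, computeRecall and estimateMemory, that this post never defines. Minimal sketches might look like this — note the memory formula is a rough assumption (float32 vectors plus roughly 2 × M four-byte links per node; the dimension default is hypothetical), so measure your real deployment rather than trusting it:

```typescript
// Fraction of ground-truth neighbor ids the index actually returned
function computeRecall(found: number[], gold: number[]): number {
  const goldSet = new Set(gold);
  const hits = found.filter(id => goldSet.has(id)).length;
  return hits / gold.length;
}

// Very rough HNSW memory estimate in MB: float32 vectors + graph links.
// dim = 768 and 4-byte link ids are assumptions, not measurements.
function estimateMemory(
  vectorCount: number,
  M: number,
  dim: number = 768
): number {
  const vectorBytes = vectorCount * dim * 4; // float32 components
  const graphBytes = vectorCount * M * 2 * 4; // ~2M links per node, 4 bytes each
  return (vectorBytes + graphBytes) / (1024 * 1024);
}
```

Plugging these in makes the benchmark loop runnable end-to-end against any index that exposes insert and search.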
Use benchmarking results to choose parameters for your workload.
Checklist
- Understand exact NN (brute force) as baseline
- Use HNSW for high-recall applications (<100M vectors)
- Use IVFFlat for massive scale (>1B vectors), memory-constrained
- Tune HNSW: M for memory/quality, ef for search speed/quality
- Tune IVFFlat: nlist for cluster granularity, nprobe for search scope
- Consider 8-bit quantization for 75% memory savings with <2% recall loss
- Benchmark on your corpus with realistic query patterns
- Monitor recall: aim for >90% for RAG applications
Conclusion
Vector search is a solved problem if you understand the trade-offs. HNSW dominates for typical RAG systems: fast (milliseconds), accurate (95-99% recall), and manageable memory. IVFFlat takes over at billion-scale. 8-bit quantization cuts memory by roughly 75% with negligible accuracy loss. Benchmark on your data — parameters vary wildly across domains.