How Vector Search Actually Works — HNSW, IVFFlat, and Choosing the Right Index

Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
Vector search underlies every RAG system, yet most engineers don't know what happens when they call vectorDb.search(). Under the hood, approximate nearest neighbor (ANN) algorithms make the difference between 10 ms and 10 s retrieval times. Understanding HNSW, IVFFlat, and their parameters turns you from a user into an operator.
- Exact Nearest Neighbor: Brute Force, O(n)
- Approximate Nearest Neighbor (ANN) Concept
- HNSW: Hierarchical Navigable Small World
- HNSW Parameters: ef_construction and M
- IVFFlat: Inverted File Index
- HNSW vs IVFFlat Trade-Offs
- Quantization Impact on Recall
- Benchmarking With ANN-Benchmarks
- Checklist
- Conclusion
Exact Nearest Neighbor: Brute Force, O(n)
Before approximation, understand exact search.
function exactNearestNeighbor(
  query: number[],
  vectors: number[][]
): { index: number; distance: number } {
  let bestIndex = 0;
  let bestDistance = Infinity;
  // Compare the query against every stored vector
  for (let i = 0; i < vectors.length; i++) {
    const distance = euclideanDistance(query, vectors[i]);
    if (distance < bestDistance) {
      bestDistance = distance;
      bestIndex = i;
    }
  }
  return { index: bestIndex, distance: bestDistance };
}

function euclideanDistance(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += Math.pow(a[i] - b[i], 2);
  }
  return Math.sqrt(sum);
}
Brute force is O(n × d) per query and O(n × d) space (n = vectors, d = dimensions). For 1M vectors, every query performs 1M distance computations. That's unacceptable for production latency budgets.
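To make the cost concrete, here is the brute-force function in action (the helpers are repeated so the snippet runs on its own):

```typescript
function euclideanDistance(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += Math.pow(a[i] - b[i], 2);
  }
  return Math.sqrt(sum);
}

function exactNearestNeighbor(
  query: number[],
  vectors: number[][]
): { index: number; distance: number } {
  let bestIndex = 0;
  let bestDistance = Infinity;
  // O(n) scan: every stored vector gets compared to the query
  for (let i = 0; i < vectors.length; i++) {
    const distance = euclideanDistance(query, vectors[i]);
    if (distance < bestDistance) {
      bestDistance = distance;
      bestIndex = i;
    }
  }
  return { index: bestIndex, distance: bestDistance };
}

// [0.9, 1.1] is the true nearest neighbor of [1, 1]
const hit = exactNearestNeighbor([1, 1], [[0, 0], [0.9, 1.1], [5, 5]]);
```

With three vectors the scan is instant; with a million, every single query pays the full linear cost, which is exactly what ANN indexes avoid.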
Approximate Nearest Neighbor (ANN) Concept
ANN trades accuracy for speed by accepting approximate matches. Instead of finding the true nearest neighbor, find a "nearby" neighbor quickly.
The ANN frontier:
- Recall: % of true top-10 nearest neighbors found
- Latency: Query time
- Index size: Memory overhead
You can't win on all three at once: a denser index buys recall at the cost of memory and build time, and a wider search beam buys recall at the cost of latency. This is the fundamental ANN trade-off.
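Recall is easy to measure once you have ground-truth neighbors from brute force. A minimal recall@k helper (my own sketch, not from any particular library) might look like:

```typescript
// recall@k: what fraction of the true top-k neighbor ids
// did the approximate search actually return?
function recallAtK(approx: number[], groundTruth: number[], k: number): number {
  const truth = new Set(groundTruth.slice(0, k));
  let hits = 0;
  for (const id of approx.slice(0, k)) {
    if (truth.has(id)) hits++;
  }
  return hits / Math.min(k, truth.size);
}

// ANN returned [3, 7, 9, 2]; true top-4 is [3, 9, 4, 2] → 3 of 4 found
const r = recallAtK([3, 7, 9, 2], [3, 9, 4, 2], 4);
```

Tracking this number while sweeping index parameters is how all the trade-off curves later in this post are produced.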
HNSW: Hierarchical Navigable Small World
HNSW is a graph-based index. Imagine a subway system: express lines skip most stations, local lines stop everywhere. HNSW enters on an express layer, rides it as close to the destination as it can, then transfers down to progressively more local layers to finish the trip.
interface HNSWNode {
  id: number;
  vector: number[];
  neighbors: Set<number>[]; // one adjacency set per layer
}

class HNSWIndex {
  nodes: HNSWNode[] = [];
  M: number; // max connections per node
  efConstruction: number; // beam width during construction

  insert(vector: number[]): void {
    const newNode: HNSWNode = {
      id: this.nodes.length,
      vector,
      neighbors: [new Set<number>()], // layer 0 (higher layers omitted in this sketch)
    };
    if (this.nodes.length === 0) {
      this.nodes.push(newNode);
      return;
    }
    // Start from an entry point at the highest layer
    let nearest = this.nodes[0];
    // Navigate from the top layer down, greedily tracking the nearest node
    for (let layer = this.nodes[0].neighbors.length - 1; layer > 0; layer--) {
      nearest = this.searchLayer(vector, [nearest], layer, 1)[0];
    }
    // At layer 0, run a beam search of width efConstruction
    const candidates = this.searchLayer(
      vector,
      [nearest],
      0,
      this.efConstruction
    );
    // Connect the new node bidirectionally to its M nearest candidates
    const m = Math.min(this.M, candidates.length);
    for (let i = 0; i < m; i++) {
      const neighbor = candidates[i];
      newNode.neighbors[0].add(neighbor.id);
      neighbor.neighbors[0].add(newNode.id);
    }
    this.nodes.push(newNode);
  }

  search(query: number[], ef: number = 40, k: number = 10): number[] {
    // ef = search beam width; higher = better recall, slower
    let nearest = this.nodes[0];
    // Navigate top-down through the upper layers
    for (let layer = this.nodes[0].neighbors.length - 1; layer > 0; layer--) {
      nearest = this.searchLayer(query, [nearest], layer, 1)[0];
    }
    // Full beam search at layer 0
    const candidates = this.searchLayer(query, [nearest], 0, ef);
    return candidates.slice(0, k).map(c => c.id);
  }

  private searchLayer(
    query: number[],
    entryPoints: HNSWNode[],
    layer: number,
    ef: number
  ): HNSWNode[] {
    // PriorityQueue is assumed: push(item, priority); peek()/pop() return [item, priority]
    const visited = new Set<number>();
    const candidates = new PriorityQueue<HNSWNode>(); // min-heap by distance
    const results = new PriorityQueue<HNSWNode>(); // max-heap via negated distance
    for (const entry of entryPoints) {
      const dist = euclideanDistance(query, entry.vector);
      candidates.push(entry, dist);
      results.push(entry, -dist);
      visited.add(entry.id);
    }
    while (candidates.size() > 0) {
      const [, lowerBound] = candidates.peek();
      // Stop when the closest unexplored candidate is farther than
      // the worst result we already hold
      if (lowerBound > -results.peek()[1]) break;
      const [current] = candidates.pop();
      // Explore the current node's neighbors on this layer
      for (const neighborId of current.neighbors[layer]) {
        if (visited.has(neighborId)) continue;
        visited.add(neighborId);
        const neighbor = this.nodes[neighborId];
        const dist = euclideanDistance(query, neighbor.vector);
        if (results.size() < ef || dist < -results.peek()[1]) {
          candidates.push(neighbor, dist);
          results.push(neighbor, -dist);
          if (results.size() > ef) results.pop();
        }
      }
    }
    // Return results ordered nearest-first (priorities are negated distances)
    return results
      .toArray()
      .sort((a, b) => b[1] - a[1])
      .map(([node]) => node);
  }
}
HNSW properties:
- Build time: O(n log n) due to multi-layer hierarchy
- Query time: O(log n) logarithmic, competitive with B-trees
- Recall: Can approach 99% with good parameters
- Memory: highest of the common ANN indexes — the full vectors plus per-node graph links that grow with M
HNSW Parameters: ef_construction and M
Two critical knobs:
- M (default 16): Max connections per node. Higher M = more memory, better recall
- ef_construction (default 200): Beam width during index building. Higher = better recall, slower build
// Tuning HNSW for different scenarios
// Low latency, high throughput (e-commerce search)
const lowLatency = { M: 8, efConstruction: 100 };
// High recall, acceptable latency (RAG retrieval)
const highRecall = { M: 24, efConstruction: 400 };
// Massive scale, limited memory (1B+ vectors)
const massiveScale = { M: 4, efConstruction: 50 };
// At query time, tune ef (the search beam width) separately
async function search(index: HNSWIndex, query: number[]): Promise<number[]> {
  // Higher ef = better recall, slower queries; typical values are 40-100
  return index.search(query, 50, 10); // ef = 50, k = 10
}
Typical impact (illustrative numbers; measure on your own corpus and hardware):
- M=8: recall 90%, 8ms query time
- M=16: recall 95%, 12ms query time
- M=32: recall 98%, 25ms query time
IVFFlat: Inverted File Index
IVFFlat (Inverted File with Flat quantization) uses a different strategy: cluster first, then search within clusters.
class IVFFlatIndex {
  nlist: number; // number of clusters (inverted lists)
  nprobe: number; // how many clusters to search
  centroids: number[][] = [];
  lists: number[][] = []; // lists[clusterId] = indices of vectors in that cluster
  vectors: number[][] = [];

  async build(vectors: number[][]): Promise<void> {
    this.vectors = vectors;
    // Cluster vectors into nlist clusters using k-means
    const kmeans = new KMeans(this.nlist);
    const assignments = await kmeans.fit(vectors);
    // Build inverted lists of vector indices
    this.centroids = kmeans.centroids;
    this.lists = Array.from({ length: this.nlist }, () => []);
    for (let i = 0; i < vectors.length; i++) {
      this.lists[assignments[i]].push(i);
    }
  }

  search(query: number[], k: number = 10): number[] {
    // Find the nprobe clusters whose centroids are nearest to the query
    const clusterDistances = this.centroids.map((centroid, clusterId) => ({
      clusterId,
      distance: euclideanDistance(query, centroid),
    }));
    clusterDistances.sort((a, b) => a.distance - b.distance);
    const nearestClusters = clusterDistances
      .slice(0, this.nprobe)
      .map(c => c.clusterId);
    // Exhaustively search within the chosen clusters only
    const candidates: Array<{ index: number; distance: number }> = [];
    for (const clusterId of nearestClusters) {
      for (const index of this.lists[clusterId]) {
        const distance = euclideanDistance(query, this.vectors[index]);
        candidates.push({ index, distance });
      }
    }
    // Return top-k by distance
    candidates.sort((a, b) => a.distance - b.distance);
    return candidates.slice(0, k).map(c => c.index);
  }
}
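IVFFlat has its own two knobs: nlist (cluster count) and nprobe (clusters searched per query). Two common rules of thumb — assumptions worth validating on your own corpus, not hard rules — are nlist ≈ √n and nprobe at a small percentage of nlist. A hypothetical helper:

```typescript
// Rule-of-thumb IVFFlat starting parameters (a sketch; tune against
// your own data — these are heuristics, not guarantees).
function suggestIvfParams(vectorCount: number): { nlist: number; nprobe: number } {
  // nlist ≈ sqrt(n) balances per-cluster scan cost against centroid-scan cost
  const nlist = Math.max(1, Math.round(Math.sqrt(vectorCount)));
  // Probing ~5% of clusters is a common starting point; raise it for recall
  const nprobe = Math.max(1, Math.round(nlist * 0.05));
  return { nlist, nprobe };
}

// e.g. 1M vectors → nlist = 1000, nprobe = 50
const params = suggestIvfParams(1_000_000);
```

Raising nprobe toward nlist smoothly trades latency back for recall; at nprobe = nlist you've reconstructed brute force.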
HNSW vs IVFFlat Trade-Offs
| Aspect | HNSW | IVFFlat |
|---|---|---|
| Build time | O(n log n), slow | O(n) after k-means, fast |
| Query time | O(log n), fast | O(n × nprobe / nlist), parameter-dependent |
| Recall | 95-99% (tunable) | 85-95% (tunable) |
| Memory | Vectors + graph links (highest) | Vectors + centroids (no graph) |
| Best for | High-recall applications | Large-scale, cost-constrained |
HNSW wins: RAG systems where recall > cost, indexes <100M vectors
IVFFlat wins: Massive scale (1B+ vectors), memory-constrained, acceptable recall loss
// Typical production choice: HNSW for RAG
// (better recall, manageable memory)
async function chooseIndex(vectorCount: number): Promise<VectorIndex> {
  if (vectorCount < 10_000_000) {
    // <10M vectors: HNSW for best recall
    return new HNSWIndex({ M: 16, efConstruction: 200 });
  } else {
    // >=10M vectors: IVFFlat for memory efficiency
    const nlist = Math.round(Math.sqrt(vectorCount)); // common heuristic: nlist ≈ √n
    return new IVFFlatIndex({ nlist, nprobe: 100 });
  }
}
Quantization Impact on Recall
Quantization reduces vector precision (e.g., 32-bit float → 8-bit int) to save memory.
function quantize8bit(
  vector: number[]
): { values: number[]; min: number; max: number } {
  // Map each component linearly onto [0, 255];
  // min and max must be kept so the vector can be dequantized later
  const min = Math.min(...vector);
  const max = Math.max(...vector);
  const range = max - min || 1; // guard against constant vectors
  return {
    values: vector.map(v => Math.round(((v - min) / range) * 255)),
    min,
    max,
  };
}

function dequantize(quantized: number[], min: number, max: number): number[] {
  const range = max - min;
  return quantized.map(v => (v / 255) * range + min);
}
Recall impact:
- 32-bit float: 100% (baseline)
- 8-bit quantization: 98-99% (0.5-2% loss)
- 1-bit (binary): 85-90% (10-15% loss)
Memory savings:
- 32-bit: 1.0x
- 8-bit: 0.25x (75% savings)
- 1-bit: 0.03x (97% savings)
For RAG, 8-bit quantization is usually the sweet spot: minimal recall loss, huge memory savings.
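To see why the loss is so small, round-trip a vector through the 8-bit scheme and look at the worst per-component error (the quantize/dequantize pair is repeated here so the snippet is self-contained):

```typescript
// 8-bit scalar quantization round-trip. Per-component error is bounded
// by about half a quantization step: (max - min) / (2 × 255).
function quantize8bit(
  vector: number[]
): { values: number[]; min: number; max: number } {
  const min = Math.min(...vector);
  const max = Math.max(...vector);
  const range = max - min || 1; // guard against constant vectors
  return {
    values: vector.map(v => Math.round(((v - min) / range) * 255)),
    min,
    max,
  };
}

function dequantize(values: number[], min: number, max: number): number[] {
  const range = max - min;
  return values.map(v => (v / 255) * range + min);
}

const original = [-0.42, 0.13, 0.99, -1.0, 0.5];
const q = quantize8bit(original);
const restored = dequantize(q.values, q.min, q.max);
const maxError = Math.max(
  ...original.map((v, i) => Math.abs(v - restored[i]))
);
// For this vector, range = 1.99, so maxError stays under ~0.004
```

Distances computed on the restored vectors barely move, which is why recall drops by only a point or two while memory shrinks 4x.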
Benchmarking With ANN-Benchmarks
Use standardized benchmarks (ANN-Benchmarks) to compare indexes.
interface BenchmarkResult {
indexType: string; // HNSW, IVFFlat, etc.
parameters: Record<string, number>;
queryLatencyMs: number;
buildTimeMs: number;
memoryMB: number;
recall: number; // 0-1
}
async function benchmarkIndexes(
  vectors: number[][],
  queries: number[][],
  goldNeighbors: number[][]
): Promise<BenchmarkResult[]> {
  const results: BenchmarkResult[] = [];
  // Sweep HNSW parameters
  for (const M of [8, 16, 32]) {
    for (const efConstruction of [100, 200, 400]) {
      const index = new HNSWIndex({ M, efConstruction });
      const buildStart = Date.now();
      for (const vector of vectors) index.insert(vector);
      const buildTimeMs = Date.now() - buildStart;
      let totalLatency = 0;
      const recalls: number[] = [];
      for (let i = 0; i < queries.length; i++) {
        const queryStart = Date.now();
        const found = index.search(queries[i]); // named to avoid shadowing `results`
        totalLatency += Date.now() - queryStart;
        recalls.push(computeRecall(found, goldNeighbors[i]));
      }
      results.push({
        indexType: 'HNSW',
        parameters: { M, efConstruction },
        queryLatencyMs: totalLatency / queries.length,
        buildTimeMs,
        memoryMB: estimateMemory(vectors.length, M),
        recall: recalls.reduce((a, b) => a + b, 0) / recalls.length,
      });
    }
  }
  return results;
}
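The harness above leans on two helpers, computeRecall and estimateMemory, that this post never defines. Minimal sketches might look like this — note the memory formula is a rough assumption (float32 vectors plus roughly 2 × M four-byte links per node; the dimension default is hypothetical), so measure your real deployment rather than trusting it:

```typescript
// Fraction of ground-truth neighbor ids the index actually returned
function computeRecall(found: number[], gold: number[]): number {
  const goldSet = new Set(gold);
  const hits = found.filter(id => goldSet.has(id)).length;
  return hits / gold.length;
}

// Very rough HNSW memory estimate in MB: float32 vectors + graph links.
// dim = 768 and 4-byte link ids are assumptions, not measurements.
function estimateMemory(
  vectorCount: number,
  M: number,
  dim: number = 768
): number {
  const vectorBytes = vectorCount * dim * 4; // float32 components
  const graphBytes = vectorCount * M * 2 * 4; // ~2M links per node, 4 bytes each
  return (vectorBytes + graphBytes) / (1024 * 1024);
}
```

Plugging these in makes the benchmark loop runnable end-to-end against any index that exposes insert and search.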
Use benchmarking results to choose parameters for your workload.
Checklist
- Understand exact NN (brute force) as baseline
- Use HNSW for high-recall applications (<100M vectors)
- Use IVFFlat for massive scale (>1B vectors), memory-constrained
- Tune HNSW: M for memory/quality, ef for search speed/quality
- Tune IVFFlat: nlist for cluster granularity, nprobe for search scope
- Consider 8-bit quantization for 75% memory savings with <2% recall loss
- Benchmark on your corpus with realistic query patterns
- Monitor recall: aim for >90% for RAG applications
Conclusion
Vector search is a solved problem if you understand the trade-offs. HNSW dominates for typical RAG systems: fast (milliseconds), accurate (95-99% recall), and manageable memory. IVFFlat takes over at billion-scale. 8-bit quantization cuts memory by roughly 75% with negligible accuracy loss. Benchmark on your data — parameters vary wildly across domains.