Semantic Caching for LLMs — Reducing API Costs With Similarity-Based Cache Hits

Introduction

Traditional caches look for exact string matches. Semantic caches find similar queries and reuse responses. "What is the capital of France?" and "Which city is the capital of France?" are semantically identical—why call the LLM twice? On workloads with many repeated or paraphrased queries, semantic caching can reduce API costs by 40-60% while staying transparent to users.

Semantic Cache Architecture

A semantic cache stores query-response pairs with embeddings, then retrieves similar cached responses on new queries.

interface CachedEntry {
  id: string;
  query: string;
  queryEmbedding: number[];
  response: string;
  metadata: {
    timestamp: number;
    ttl: number; // milliseconds
    tokenCost: number;
  };
}

interface SemanticCache {
  store: Map<string, CachedEntry>; // in-memory or Redis
  vectorIndex: VectorStore; // FAISS or similar
}

async function semanticCacheLookup(
  query: string,
  cache: SemanticCache,
  similarityThreshold: number = 0.95
): Promise<CachedEntry | null> {
  const queryEmbedding = await embed(query);

  // Find cached entries with cosine similarity > threshold
  const candidates = await cache.vectorIndex.search(
    queryEmbedding,
    {
      topK: 5,
      filter: entry => Date.now() - entry.metadata.timestamp < entry.metadata.ttl,
    }
  );

  const bestMatch = candidates[0];
  if (
    bestMatch &&
    cosineSimilarity(queryEmbedding, bestMatch.queryEmbedding) > similarityThreshold
  ) {
    return bestMatch;
  }

  return null;
}

async function cachedLLMGenerate(
  query: string,
  cache: SemanticCache,
  options: { threshold?: number; ttl?: number } = {}
): Promise<{ response: string; cached: boolean }> {
  // Try semantic cache first
  const cached = await semanticCacheLookup(
    query,
    cache,
    options.threshold ?? 0.95
  );

  if (cached) {
    console.log(`Cache hit: ${query}`);
    return { response: cached.response, cached: true };
  }

  // Cache miss: call LLM
  console.log(`Cache miss: ${query}`);
  const response = await llm.generate({ prompt: query });

  // Store in cache
  const queryEmbedding = await embed(query);
  const entry: CachedEntry = {
    id: generateId(),
    query,
    queryEmbedding,
    response,
    metadata: {
      timestamp: Date.now(),
      ttl: options.ttl ?? 24 * 60 * 60 * 1000, // 24h default
      tokenCost: estimateTokens(response),
    },
  };

  await cache.store.set(entry.id, entry);
  await cache.vectorIndex.add([entry]);

  return { response, cached: false };
}

This approach is transparent: callers don't know if a response came from cache or the LLM.
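The snippets above assume helpers like embed, generateId, estimateTokens, and cosineSimilarity. The one that matters most for correctness is the similarity function; a plain cosine similarity over raw embedding arrays might look like this (a sketch, not a prescribed implementation):

```typescript
// Cosine similarity between two equal-length embedding vectors.
// Returns a value in [-1, 1]; 1.0 means the vectors point the same way.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Guard against zero vectors to avoid dividing by zero.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Note that many embedding APIs return unit-normalized vectors, in which case cosine similarity reduces to a plain dot product.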

Similarity Threshold Tuning

The similarity threshold is critical. Set it too low and you serve irrelevant cached responses (a confident answer to a different question); set it too high and you miss valid cache hits (wasted API calls).

interface ThresholdAnalysis {
  threshold: number;
  precision: number; // % of cached responses marked "correct"
  recall: number; // % of cacheable queries that hit cache
  costSavings: number; // % of API calls avoided
}

async function evaluateThreshold(
  testQueries: Array<{ query: string; goldAnswer: string }>,
  cache: SemanticCache,
  thresholds: number[]
): Promise<ThresholdAnalysis[]> {
  const results: ThresholdAnalysis[] = [];

  for (const threshold of thresholds) {
    let hits = 0;
    let precisionHits = 0;
    let cacheableQueries = 0;

    for (const { query, goldAnswer } of testQueries) {
      const cached = await semanticCacheLookup(query, cache, threshold);

      if (cached) {
        hits++;
        // Grade: is cached response acceptable?
        const isCorrect = await llm.evaluateSimilarity(cached.response, goldAnswer);
        if (isCorrect) precisionHits++;
      }

      cacheableQueries++; // assume all are cacheable for simplicity
    }

    results.push({
      threshold,
      precision: hits > 0 ? precisionHits / hits : 0,
      recall: hits / cacheableQueries,
      costSavings: hits / testQueries.length,
    });
  }

  return results;
}

Typical thresholds (rough guides; the exact trade-off depends on your embedding model and workload):

  • 0.90: Aggressive. High recall, lower precision; risks serving roughly 10% wrong answers
  • 0.95: Balanced. Roughly 95% of cached responses are correct
  • 0.99: Conservative. ~99% correct, but misses many valid cache opportunities

Start at 0.95 and adjust based on your tolerance for false positives.
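Given the ThresholdAnalysis rows that evaluateThreshold produces, one way to automate the choice (an illustration, not part of the API above; the interface is repeated here so the snippet is self-contained) is to pick the lowest threshold whose precision clears a floor, which maximizes recall subject to a correctness target:

```typescript
interface ThresholdAnalysis {
  threshold: number;
  precision: number;
  recall: number;
  costSavings: number;
}

// Pick the lowest threshold whose measured precision meets the floor.
// Lower thresholds hit more often, so this maximizes recall (and cost
// savings) subject to the correctness target.
function pickThreshold(
  results: ThresholdAnalysis[],
  precisionFloor: number = 0.95
): number {
  const acceptable = results
    .filter(r => r.precision >= precisionFloor)
    .sort((a, b) => a.threshold - b.threshold);
  // Fall back to the most conservative threshold if none qualifies.
  return acceptable.length > 0
    ? acceptable[0].threshold
    : Math.max(...results.map(r => r.threshold));
}
```

With rows at 0.90, 0.95, and 0.99 and a 0.95 precision floor, this would select whichever of the three first clears the floor, falling back to 0.99 if none does.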

Redis + Vector Store for Cache

For distributed systems, use Redis for fast TTL-managed storage + a vector database for similarity search.

import Redis from 'ioredis';

class DistributedSemanticCache {
  redis: Redis;
  vectorDb: VectorStore;

  async get(query: string, threshold: number = 0.95): Promise<string | null> {
    const queryEmbedding = await embed(query);

    // Search vector DB for similar queries
    const candidates = await this.vectorDb.search(queryEmbedding, { topK: 5 });

    for (const candidate of candidates) {
      if (cosineSimilarity(queryEmbedding, candidate.embedding) > threshold) {
        // Fetch from Redis
        const cached = await this.redis.get(`cache:${candidate.id}`);
        if (cached) {
          return JSON.parse(cached).response;
        }
      }
    }

    return null;
  }

  async set(
    query: string,
    response: string,
    ttlSeconds: number = 86_400
  ): Promise<void> {
    const embedding = await embed(query);
    const id = generateId();

    // Store in Redis with TTL
    await this.redis.setex(
      `cache:${id}`,
      ttlSeconds,
      JSON.stringify({ query, response })
    );

    // Index in vector DB for similarity search
    // (Vector DB TTL handled separately, or periodically evict stale entries)
    await this.vectorDb.add([{ id, embedding, metadata: { query, createdAt: Date.now() } }]);
  }
}

This hybrid approach:

  • Redis handles O(1) exact lookups and TTL management
  • The vector DB handles approximate nearest-neighbor search in sublinear time (roughly O(log N) for index structures like HNSW)
  • Each component scales independently, and expired entries age out of Redis automatically
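The VectorStore used throughout is left abstract. For local development or small caches, a brute-force in-memory stand-in is enough to exercise the rest of the pipeline; this is a sketch with a linear O(N) scan, not the indexed search a real vector DB provides:

```typescript
interface VectorEntry {
  id: string;
  embedding: number[];
  metadata?: Record<string, unknown>;
}

// Cosine similarity over raw vectors (zero-vector safe).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force in-memory vector index: scores every entry on each query.
// Fine for a few thousand entries; swap in a real vector DB beyond that.
class InMemoryVectorStore {
  private entries: VectorEntry[] = [];

  async add(items: VectorEntry[]): Promise<void> {
    this.entries.push(...items);
  }

  async delete(id: string): Promise<void> {
    this.entries = this.entries.filter(e => e.id !== id);
  }

  async search(
    query: number[],
    opts: { topK: number }
  ): Promise<Array<VectorEntry & { score: number }>> {
    return this.entries
      .map(e => ({ ...e, score: cosineSimilarity(query, e.embedding) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, opts.topK);
  }
}
```

Swap in FAISS, a hosted vector DB, or Redis's own vector search once the entry count makes linear scans noticeable.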

TTL Per Query Type

Different queries have different freshness requirements.

type QueryType = 'fact' | 'code' | 'advice' | 'math';

function getTTL(queryType: QueryType): number {
  const ttlMap: Record<QueryType, number> = {
    fact: 7 * 24 * 60 * 60 * 1000, // 7 days (stable facts)
    code: 24 * 60 * 60 * 1000, // 1 day (code changes frequently)
    advice: 60 * 60 * 1000, // 1 hour (contextual, time-sensitive)
    math: Infinity, // never expire (deterministic)
  };

  return ttlMap[queryType];
}

async function cachedGenerateWithTTL(
  query: string,
  queryType: QueryType,
  cache: SemanticCache
): Promise<string> {
  const ttl = getTTL(queryType);
  const { response } = await cachedLLMGenerate(query, cache, { ttl });

  return response;
}

Facts and math are stable; cache aggressively. Code and advice change quickly; cache minimally.

Cache Warming for Common Queries

Pre-populate the cache with answers to frequently-asked questions.

async function warmCache(
  commonQueries: Array<{ query: string; category: QueryType }>,
  cache: SemanticCache
): Promise<void> {
  console.log(`Warming cache with ${commonQueries.length} common queries...`);

  for (const { query, category } of commonQueries) {
    const ttl = getTTL(category);
    await cachedLLMGenerate(query, cache, { ttl });
  }

  console.log('Cache warmed.');
}

// At startup:
await warmCache([
  { query: 'What is your pricing?', category: 'fact' },
  { query: 'How do I reset my password?', category: 'advice' },
  { query: 'Write a Hello World program in Python', category: 'code' },
], cache);

This ensures instant hits on the most common user questions.

Cache Invalidation on Data Update

When underlying data changes (updated docs, new policies), invalidate related cache entries.

interface InvalidationRule {
  pattern: string; // regex or keyword
  ttlReset: number; // new TTL (0 = delete immediately)
}

async function invalidateCache(
  rules: InvalidationRule[],
  cache: SemanticCache
): Promise<number> {
  let deleted = 0;

  for (const entry of cache.store.values()) {
    for (const rule of rules) {
      if (new RegExp(rule.pattern, 'i').test(entry.query)) {
        if (rule.ttlReset === 0) {
          cache.store.delete(entry.id);
          await cache.vectorIndex.delete(entry.id);
          deleted++;
          break; // entry is deleted; skip remaining rules for it
        } else {
          // Shorten the TTL so the entry expires sooner
          entry.metadata.ttl = rule.ttlReset;
        }
      }
    }
  }

  return deleted;
}

// Usage: when product pricing changes
await invalidateCache([
  { pattern: 'pricing|cost|price', ttlReset: 0 }, // delete immediately
  { pattern: 'refund|return', ttlReset: 1 * 60 * 60 * 1000 }, // 1h TTL
], cache);

This prevents stale answers after data updates.

Measuring Cache Hit Rate and Cost Savings

Instrument your cache to measure impact.

interface CacheMetrics {
  totalQueries: number;
  cacheHits: number;
  cacheMisses: number;
  hitRate: number; // hits / total
  tokensFromCache: number;
  tokensSaved: number;
  costSaved: number; // $$
}

class InstrumentedSemanticCache {
  costPerToken = 0.00002; // example rate; set to your model's output-token price
  metrics: CacheMetrics = {
    totalQueries: 0,
    cacheHits: 0,
    cacheMisses: 0,
    hitRate: 0,
    tokensFromCache: 0,
    tokensSaved: 0,
    costSaved: 0,
  };

  async generate(query: string, cache: SemanticCache): Promise<string> {
    this.metrics.totalQueries++;

    const { response, cached } = await cachedLLMGenerate(query, cache);

    if (cached) {
      this.metrics.cacheHits++;
      const tokens = estimateTokens(response);
      this.metrics.tokensFromCache += tokens;
      this.metrics.tokensSaved += tokens; // a hit avoids paying for these tokens
    } else {
      this.metrics.cacheMisses++;
    }

    this.metrics.hitRate = this.metrics.cacheHits / this.metrics.totalQueries;
    this.metrics.costSaved =
      this.metrics.tokensSaved * (this.costPerToken ?? 0.00002); // example rate

    return response;
  }

  report(): void {
    console.log(`Cache Hit Rate: ${(this.metrics.hitRate * 100).toFixed(1)}%`);
    console.log(`Tokens Saved: ${this.metrics.tokensSaved.toLocaleString()}`);
    console.log(`Cost Saved: $${this.metrics.costSaved.toFixed(2)}`);
  }
}

Monitor these metrics weekly. Typical production systems see 40-60% hit rates with proper tuning.
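As a back-of-envelope sanity check on those report numbers, the savings arithmetic is simple enough to sketch; every input here is an illustrative assumption to plug your own traffic and pricing into:

```typescript
// Rough monthly savings estimate: queries avoided times tokens not generated
// times the per-token price.
function estimateMonthlySavings(
  monthlyQueries: number,
  hitRate: number, // fraction of queries served from cache
  avgTokensPerResponse: number,
  costPerToken: number // output-token price for your model
): number {
  const avoidedCalls = monthlyQueries * hitRate;
  return avoidedCalls * avgTokensPerResponse * costPerToken;
}

// e.g. 1M queries/month at a 50% hit rate, ~500 tokens per response,
// $0.00002/token: 500k avoided calls * 500 tokens * $0.00002 ≈ $5,000/month
```

Run the same calculation with your actual hit rate from the metrics report to decide whether the cache infrastructure pays for itself.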

Checklist

  • Implement semantic cache with Redis + vector DB
  • Set similarity threshold to 0.95 initially; tune based on precision/recall
  • Assign TTLs per query type (facts: long, advice: short)
  • Pre-warm cache with common queries at startup
  • Implement cache invalidation rules for data updates
  • Instrument cache to measure hit rate, tokens saved, and cost reduction
  • Monitor for cache poisoning (incorrect cached answers)

Conclusion

Semantic caching is a high-impact, low-friction optimization. By embedding queries and reusing stored responses, you can avoid 40-60% of LLM calls on repetitive workloads. Start conservative with a 0.95 threshold, tune based on your error tolerance, and measure cost savings weekly. For SaaS platforms with millions of queries, semantic caching often pays for its infrastructure within days.