Semantic Caching for LLMs — Reducing API Costs With Similarity-Based Cache Hits

Author: Sanjeev Sharma (@webcoderspeed1)

Introduction
Traditional caches look for exact string matches. Semantic caches find similar queries and reuse responses. "What is the capital of France?" and "Which city is the capital of France?" are semantically identical—why call the LLM twice? Semantic caching can reduce API costs by 40-60% on typical production workloads, while staying transparent to users.
- Semantic Cache Architecture
- Similarity Threshold Tuning
- Redis + Vector Store for Cache
- TTL Per Query Type
- Cache Warming for Common Queries
- Cache Invalidation on Data Update
- Measuring Cache Hit Rate and Cost Savings
- Checklist
- Conclusion
Semantic Cache Architecture
A semantic cache stores query-response pairs with embeddings, then retrieves similar cached responses on new queries.
interface CachedEntry {
id: string;
query: string;
queryEmbedding: number[];
response: string;
metadata: {
timestamp: number;
ttl: number; // milliseconds
tokenCost: number;
};
}
interface SemanticCache {
store: Map<string, CachedEntry>; // in-memory or Redis
vectorIndex: VectorStore; // FAISS or similar
}
async function semanticCacheLookup(
query: string,
cache: SemanticCache,
similarityThreshold: number = 0.95
): Promise<CachedEntry | null> {
const queryEmbedding = await embed(query);
// Find cached entries with cosine similarity > threshold
const candidates = await cache.vectorIndex.search(
queryEmbedding,
{
topK: 5,
filter: entry => Date.now() - entry.metadata.timestamp < entry.metadata.ttl,
}
);
const bestMatch = candidates[0];
if (
bestMatch &&
cosineSimilarity(queryEmbedding, bestMatch.queryEmbedding) > similarityThreshold
) {
return bestMatch;
}
return null;
}
async function cachedLLMGenerate(
query: string,
cache: SemanticCache,
options: { threshold?: number; ttl?: number } = {}
): Promise<{ response: string; cached: boolean }> {
// Try semantic cache first
const cached = await semanticCacheLookup(
query,
cache,
options.threshold ?? 0.95
);
if (cached) {
console.log(`Cache hit: ${query}`);
return { response: cached.response, cached: true };
}
// Cache miss: call LLM
console.log(`Cache miss: ${query}`);
const response = await llm.generate({ prompt: query });
// Store in cache
const queryEmbedding = await embed(query);
const entry: CachedEntry = {
id: generateId(),
query,
queryEmbedding,
response,
metadata: {
timestamp: Date.now(),
ttl: options.ttl ?? 24 * 60 * 60 * 1000, // 24h default
tokenCost: estimateTokens(response),
},
};
await cache.store.set(entry.id, entry);
await cache.vectorIndex.add([entry]);
return { response, cached: false };
}
This approach is transparent: callers don't know if a response came from cache or the LLM.
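The snippets above assume two helpers: `embed` (provider-specific, not shown) and `cosineSimilarity`. The latter is simple enough to sketch directly:

```typescript
// Cosine similarity between two equal-length embedding vectors.
// Returns a value in [-1, 1]; most embedding models produce vectors
// whose pairwise similarities land in roughly [0, 1].
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('dimension mismatch');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

If your embedding model returns unit-normalized vectors, a plain dot product gives the same ranking and skips the square roots.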
Similarity Threshold Tuning
The threshold is critical. Set it too low and you serve irrelevant cached responses as if they were correct answers to the new query. Set it too high and you miss valid cache hits, wasting API calls.
interface ThresholdAnalysis {
threshold: number;
precision: number; // % of cached responses marked "correct"
recall: number; // % of cacheable queries that hit cache
costSavings: number; // % of API calls avoided
}
async function evaluateThreshold(
testQueries: Array<{ query: string; goldAnswer: string }>,
cache: SemanticCache,
thresholds: number[]
): Promise<ThresholdAnalysis[]> {
const results: ThresholdAnalysis[] = [];
for (const threshold of thresholds) {
let hits = 0;
let precisionHits = 0;
let cacheableQueries = 0;
for (const { query, goldAnswer } of testQueries) {
const cached = await semanticCacheLookup(query, cache, threshold);
if (cached) {
hits++;
// Grade: is cached response acceptable?
const isCorrect = await llm.evaluateSimilarity(cached.response, goldAnswer);
if (isCorrect) precisionHits++;
}
cacheableQueries++; // assume all are cacheable for simplicity
}
results.push({
threshold,
precision: hits > 0 ? precisionHits / hits : 0,
recall: hits / cacheableQueries,
costSavings: hits / testQueries.length,
});
}
return results;
}
Typical thresholds:
- 0.90: Aggressive, high recall, low precision. Risk: serve ~10% wrong answers
- 0.95: Balanced. ~95% cached responses are correct
- 0.99: Conservative. ~99% correct, but miss many cache opportunities
Start at 0.95 and adjust based on your tolerance for false positives.
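Once `evaluateThreshold` has produced its rows, threshold selection can be automated: take the lowest (most aggressive) threshold whose precision still clears your quality floor. A hypothetical helper, with the interface redeclared so the snippet stands alone:

```typescript
interface ThresholdAnalysis {
  threshold: number;
  precision: number; // % of cached responses marked "correct"
  recall: number; // % of cacheable queries that hit cache
  costSavings: number; // % of API calls avoided
}

// Pick the lowest threshold (maximum savings) whose precision still
// meets the quality floor; fall back to the most conservative row.
function pickThreshold(rows: ThresholdAnalysis[], minPrecision = 0.95): number {
  const acceptable = rows
    .filter(r => r.precision >= minPrecision)
    .sort((a, b) => a.threshold - b.threshold);
  if (acceptable.length > 0) return acceptable[0].threshold;
  return Math.max(...rows.map(r => r.threshold));
}
```

Re-run the sweep whenever you change embedding models: similarity scores are not comparable across models, so a threshold tuned for one will be miscalibrated for another.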
Redis + Vector Store for Cache
For distributed systems, use Redis for fast TTL-managed storage + a vector database for similarity search.
import Redis from 'ioredis';
class DistributedSemanticCache {
redis: Redis;
vectorDb: VectorStore;
async get(query: string, threshold: number = 0.95): Promise<string | null> {
const queryEmbedding = await embed(query);
// Search vector DB for similar queries
const candidates = await this.vectorDb.search(queryEmbedding, { topK: 5 });
for (const candidate of candidates) {
if (cosineSimilarity(queryEmbedding, candidate.embedding) > threshold) {
// Fetch from Redis
const cached = await this.redis.get(`cache:${candidate.id}`);
if (cached) {
return JSON.parse(cached).response;
}
}
}
return null;
}
async set(
query: string,
response: string,
ttlSeconds: number = 86_400
): Promise<void> {
const embedding = await embed(query);
const id = generateId();
// Store in Redis with TTL
await this.redis.setex(
`cache:${id}`,
ttlSeconds,
JSON.stringify({ query, response })
);
// Index in vector DB for similarity search
// (Vector DB TTL handled separately, or periodically evict stale entries)
await this.vectorDb.add([{ id, embedding, metadata: { query, createdAt: Date.now() } }]);
}
}
This hybrid approach:
- Redis handles O(1) exact lookups and TTL management
- Vector DB handles approximate nearest-neighbor search in sublinear time
- Both are fast and scalable
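The `VectorStore` interface used throughout is an assumption. For small caches (up to a few thousand entries), a brute-force in-memory index is a perfectly serviceable starting point before reaching for FAISS or an HNSW index:

```typescript
interface IndexedEntry {
  id: string;
  embedding: number[];
}

// Naive O(N) vector index: fine for small caches; swap in FAISS/HNSW at scale.
class BruteForceVectorStore {
  private entries: IndexedEntry[] = [];

  add(items: IndexedEntry[]): void {
    this.entries.push(...items);
  }

  delete(id: string): void {
    this.entries = this.entries.filter(e => e.id !== id);
  }

  // Returns the topK entries by cosine similarity, best first.
  search(query: number[], topK: number): Array<IndexedEntry & { score: number }> {
    const dot = (a: number[], b: number[]) => a.reduce((s, v, i) => s + v * b[i], 0);
    const norm = (a: number[]) => Math.sqrt(dot(a, a));
    return this.entries
      .map(e => ({ ...e, score: dot(query, e.embedding) / (norm(query) * norm(e.embedding)) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK);
  }
}
```

The linear scan is also a useful correctness oracle: when you do adopt an approximate index, compare its results against this one on a sample of queries to measure recall loss.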
TTL Per Query Type
Different queries have different freshness requirements.
type QueryType = 'fact' | 'code' | 'advice' | 'math';
function getTTL(queryType: QueryType): number {
const ttlMap: Record<QueryType, number> = {
fact: 7 * 24 * 60 * 60 * 1000, // 7 days (stable facts)
code: 24 * 60 * 60 * 1000, // 1 day (code changes frequently)
advice: 60 * 60 * 1000, // 1 hour (contextual, time-sensitive)
math: Infinity, // never expire (deterministic)
};
return ttlMap[queryType];
}
async function cachedGenerateWithTTL(
query: string,
queryType: QueryType,
cache: SemanticCache
): Promise<string> {
const ttl = getTTL(queryType);
const { response } = await cachedLLMGenerate(query, cache, { ttl });
return response;
}
Facts and math are stable; cache aggressively. Code and advice change quickly; cache minimally.
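`getTTL` assumes you already know the query type, so in practice you need a classifier in front of it. A keyword heuristic is a cheap first pass; an LLM-based classifier is more robust but adds latency and cost to every lookup. The keyword lists below are illustrative only:

```typescript
type QueryType = 'fact' | 'code' | 'advice' | 'math';

// Cheap keyword-based classifier. The regexes are illustrative;
// tune them against a labeled sample of your own traffic.
function classifyQuery(query: string): QueryType {
  const q = query.toLowerCase();
  if (/\b(calculate|sum|multiply|divide|equation|solve)\b/.test(q)) return 'math';
  if (/\b(code|function|program|script|implement|debug)\b/.test(q)) return 'code';
  if (/\b(should i|how do i|recommend|best way|advice)\b/.test(q)) return 'advice';
  return 'fact'; // default bucket: longest TTL, so misclassification here is the risky direction
}
```

Note the default bucket gets the longest TTL, so misclassifying a time-sensitive query as `fact` is the failure mode to watch for.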
Cache Warming for Common Queries
Pre-populate the cache with answers to frequently-asked questions.
async function warmCache(
commonQueries: Array<{ query: string; category: QueryType }>,
cache: SemanticCache
): Promise<void> {
console.log(`Warming cache with ${commonQueries.length} common queries...`);
for (const { query, category } of commonQueries) {
const ttl = getTTL(category);
await cachedLLMGenerate(query, cache, { ttl });
}
console.log('Cache warmed.');
}
// At startup:
await warmCache([
{ query: 'What is your pricing?', category: 'fact' },
{ query: 'How do I reset my password?', category: 'advice' },
{ query: 'Write a Hello World program in Python', category: 'code' },
], cache);
This ensures instant hits on the most common user questions.
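The warm list above is hand-picked; you can also derive it from query logs by frequency. A hypothetical helper that counts lightly normalized queries and returns the top N:

```typescript
// Returns the N most frequent queries from a log, after light normalization.
// Exact-string counting undercounts paraphrases; clustering log entries by
// embedding similarity would catch those, at the cost of more machinery.
function topQueries(log: string[], n: number): string[] {
  const counts = new Map<string, number>();
  for (const raw of log) {
    const q = raw.trim().toLowerCase();
    counts.set(q, (counts.get(q) ?? 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, n)
    .map(([q]) => q);
}
```

Feed the result into `warmCache` on a schedule (e.g., nightly) so the warm set tracks how user behavior actually shifts.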
Cache Invalidation on Data Update
When underlying data changes (updated docs, new policies), invalidate related cache entries.
interface InvalidationRule {
pattern: string; // regex or keyword
ttlReset: number; // new TTL (0 = delete immediately)
}
async function invalidateCache(
rules: InvalidationRule[],
cache: SemanticCache
): Promise<number> {
let deleted = 0;
for (const entry of cache.store.values()) {
for (const rule of rules) {
if (new RegExp(rule.pattern).test(entry.query)) {
if (rule.ttlReset === 0) {
cache.store.delete(entry.id);
await cache.vectorIndex.delete(entry.id);
deleted++;
} else {
// Reduce TTL
entry.metadata.ttl = rule.ttlReset;
}
}
}
}
return deleted;
}
// Usage: when product pricing changes
await invalidateCache([
{ pattern: 'pricing|cost|price', ttlReset: 0 }, // delete immediately
{ pattern: 'refund|return', ttlReset: 1 * 60 * 60 * 1000 }, // 1h TTL
], cache);
This prevents stale answers after data updates.
Measuring Cache Hit Rate and Cost Savings
Instrument your cache to measure impact.
interface CacheMetrics {
totalQueries: number;
cacheHits: number;
cacheMisses: number;
hitRate: number; // hits / total
tokensFromCache: number;
tokensSaved: number;
costSaved: number; // $$
}
class InstrumentedSemanticCache {
metrics: CacheMetrics = {
totalQueries: 0,
cacheHits: 0,
cacheMisses: 0,
hitRate: 0,
tokensFromCache: 0,
tokensSaved: 0,
costSaved: 0,
};
costPerToken = 0.00002; // example rate; set this to your model's actual output-token price
async generate(query: string, cache: SemanticCache): Promise<string> {
this.metrics.totalQueries++;
const { response, cached } = await cachedLLMGenerate(query, cache);
const tokens = estimateTokens(response);
if (cached) {
this.metrics.cacheHits++;
this.metrics.tokensFromCache += tokens;
this.metrics.tokensSaved += tokens; // a hit avoids an LLM call entirely
} else {
this.metrics.cacheMisses++;
}
this.metrics.hitRate = this.metrics.cacheHits / this.metrics.totalQueries;
this.metrics.costSaved = this.metrics.tokensSaved * this.costPerToken;
return response;
}
report(): void {
console.log(`Cache Hit Rate: ${(this.metrics.hitRate * 100).toFixed(1)}%`);
console.log(`Tokens Saved: ${this.metrics.tokensSaved.toLocaleString()}`);
console.log(`Cost Saved: $${this.metrics.costSaved.toFixed(2)}`);
}
}
Monitor these metrics weekly. Typical production systems see 40-60% hit rates with proper tuning.
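Hits are not free: every lookup still pays for one embedding call. A quick back-of-envelope check that the cache nets out positive, with illustrative prices (substitute your own):

```typescript
// Net savings per 1,000 queries, given a hit rate and per-call prices.
// The embedding cost is paid on EVERY query (hit or miss);
// the LLM cost is paid only on misses.
function netSavingsPer1k(
  hitRate: number,
  llmCostPerCall: number, // e.g. $0.01 for a mid-sized completion (illustrative)
  embeddingCostPerCall: number // e.g. $0.0001 (illustrative)
): number {
  const baseline = 1000 * llmCostPerCall; // no cache: every query hits the LLM
  const withCache =
    1000 * embeddingCostPerCall + 1000 * (1 - hitRate) * llmCostPerCall;
  return baseline - withCache;
}
```

Because embedding calls are typically orders of magnitude cheaper than completions, the cache breaks even at very low hit rates; the calculation mainly matters when responses are short or your embedding model is expensive.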
Checklist
- Implement semantic cache with Redis + vector DB
- Set similarity threshold to 0.95 initially; tune based on precision/recall
- Assign TTLs per query type (facts: long, advice: short)
- Pre-warm cache with common queries at startup
- Implement cache invalidation rules for data updates
- Instrument cache to measure hit rate, tokens saved, and cost reduction
- Monitor for cache poisoning (incorrect cached answers)
Conclusion
Semantic caching is a high-impact, low-friction optimization. By embedding queries and storing responses, you avoid 40-60% of LLM calls. Start conservative with threshold 0.95, tune based on your error tolerance, and measure cost savings weekly. For SaaS platforms with millions of queries, semantic caching often pays for its infrastructure within days.