System Design for AI-Powered Products — Architecture Decisions That Scale
Introduction
Building AI-powered products at scale requires rethinking traditional system design principles. Unlike deterministic services, AI systems face unique challenges: non-deterministic outputs, variable latency, unpredictable costs, and dependencies on third-party APIs with rate limits. This post covers architectural patterns that handle these realities.
- The Core Challenges of AI Architecture
- Async-First Design for LLM Calls
- Streaming Architecture for Real-Time AI Responses
- LLM Response Caching Strategy
- Fallback Chain Design
- Cost Metering Per User and Tenant
- Rate Limiting by Tokens, Not Requests
- AI Feature Flags for Gradual Rollout
- Observability Stack for AI Systems
- Multi-Tenant AI Data Isolation
- Checklist
- Conclusion
The Core Challenges of AI Architecture
AI systems differ fundamentally from traditional backends. A database query returns the same result every time; an LLM call doesn't. A typical API endpoint completes in around 100ms; an LLM call takes 2-10 seconds. Costs scale with usage in unpredictable ways—one user's prompt might tokenize to 50 tokens while another's is 50,000.
These differences force architectural decisions early:
Non-determinism: You can't rely on response caching as aggressively as traditional backends. A request for "summarize my data" will produce different outputs each time, making deterministic caching risky.
Latency variability: A user request hitting an LLM directly ties up your entire request cycle. If the LLM takes 8 seconds, the user waits 8 seconds. If rate limits hit, everyone waits.
Cost opacity: Without careful metering, a single runaway query can cost hundreds of dollars. You need cost visibility at request granularity.
Async-First Design for LLM Calls
Never block user requests on LLM responses. Use async patterns everywhere.
// Bad: user waits for the LLM inside the request cycle
app.post('/api/summarize', async (req, res) => {
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: req.body.text }]
  });
  res.json({ summary: completion.choices[0].message.content });
});

// Good: enqueue an async job, respond immediately with a job ID
app.post('/api/summarize', async (req, res) => {
  const jobId = uuidv4();
  await queue.enqueue({
    type: 'summarize',
    jobId,
    userId: req.user.id,
    text: req.body.text
  });
  res.status(202).json({ jobId }); // client polls for the result
});

// Worker processes the LLM call off the request path
worker.on('summarize', async (job) => {
  const completion = await openai.chat.completions.create({...});
  await db.summaries.update(job.jobId, {
    result: completion.choices[0].message.content
  });
});
Queue-based architectures decouple LLM latency from user experience. Users get immediate feedback ("Your summary is processing") while expensive operations run in the background.
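The other half of this pattern is the status endpoint the client polls with its `jobId`. Here is a minimal sketch of that side, assuming a hypothetical in-memory store (`createJob`, `completeJob`, and `getJobStatus` are illustrative names, not part of any library; production systems would back this with Redis or a database):

```javascript
// Minimal in-memory job store sketch showing the polling side of the
// queue pattern: the client holds the jobId returned by POST
// /api/summarize and polls until the worker writes a result.
const jobs = new Map();

// Called by the enqueue handler when the job is accepted
function createJob(jobId) {
  jobs.set(jobId, { status: 'processing', result: null });
}

// Called by the worker once the LLM call completes
function completeJob(jobId, result) {
  jobs.set(jobId, { status: 'done', result });
}

// The GET /api/summarize/:jobId handler would return this object
function getJobStatus(jobId) {
  return jobs.get(jobId) ?? { status: 'not_found', result: null };
}
```

The client polls every second or two (or subscribes over WebSocket) and renders "Your summary is processing" until `status` flips to `done`.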
Streaming Architecture for Real-Time AI Responses
When users expect streaming responses (like ChatGPT), use server-sent events or WebSockets to push tokens as they arrive.
app.get('/api/stream', async (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');

  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: req.query.prompt }],
    stream: true
  });

  // Forward each token delta to the client as an SSE frame
  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta?.content || '';
    res.write(`data: ${JSON.stringify({ delta })}\n\n`);
  }
  res.end();
});
Streaming keeps users engaged by showing progress. Each token arriving is a signal that the system is thinking.
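On the consuming side, the client reassembles the deltas into the full response. A sketch of that reassembly, assuming frames of the exact `data: {json}\n\n` shape the endpoint above writes (`concatSseDeltas` is a hypothetical helper name):

```javascript
// Parse raw SSE frames of the form "data: {json}\n\n" and concatenate
// the token deltas into the full response text. In a browser you would
// get frames one at a time via EventSource's onmessage instead.
function concatSseDeltas(raw) {
  return raw
    .split('\n\n')
    .filter((frame) => frame.startsWith('data: '))
    .map((frame) => JSON.parse(frame.slice('data: '.length)).delta)
    .join('');
}
```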
LLM Response Caching Strategy
Cache carefully. Two strategies work:
Exact match caching: Cache identical prompts with identical parameters. Useful for deterministic workflows.
// Key on everything that affects the output: model, prompt, parameters
const cacheKey = hash(JSON.stringify({
  model: 'gpt-4o',
  prompt: req.body.prompt,
  temperature: 0.7
}));

const cached = await redis.get(cacheKey);
if (cached) return JSON.parse(cached);

const response = await openai.chat.completions.create({...});
await redis.setex(cacheKey, 3600, JSON.stringify(response)); // 1-hour TTL
Semantic caching: Cache similar prompts. If "What's the weather in NYC?" was asked before, reuse it for "Weather in New York City?"—requires embedding similarity.
const promptEmbedding = await embed(req.body.prompt);
const similar = await vectordb.search(promptEmbedding, { threshold: 0.95 });
if (similar.length > 0) {
return similar[0].cachedResponse;
}
Cache only when the cost of generating exceeds the cost of storage and lookup.
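That rule can be made concrete as an expected-value check. A sketch, with illustrative numbers only (real pricing and hit rates vary by model and workload; `shouldCache` is a hypothetical helper):

```javascript
// Cache only when the expected savings from future cache hits exceed
// the fixed cost of storing and looking up the entry.
function shouldCache({ tokens, pricePer1kTokens, hitRate, lookupCostUsd }) {
  const generationCost = (tokens / 1000) * pricePer1kTokens;
  // Expected savings per future request vs. lookup/storage overhead
  return hitRate * generationCost > lookupCostUsd;
}
```

A 50K-token response with any realistic hit rate clears this bar easily; a 50-token response that is almost never repeated does not.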
Fallback Chain Design
Design multilayered fallback chains:
async function getAIResponse(prompt, userId) {
  const user = await db.users.get(userId);

  // Tier 1: premium users get GPT-4o
  if (user.tier === 'premium') {
    try {
      return await openai.chat.completions.create({
        model: 'gpt-4o',
        messages: [{ role: 'user', content: prompt }]
      });
    } catch (error) {
      // Fall through to the cheaper model
    }
  }

  // Tier 2: standard model
  try {
    return await openai.chat.completions.create({
      model: 'gpt-4o-mini',
      messages: [{ role: 'user', content: prompt }]
    });
  } catch (error) {
    // Fall through to the cache
  }

  // Tier 3: previously cached response for a similar prompt
  const similar = await cache.findSimilar(prompt);
  if (similar) return similar.response;

  throw new Error('All fallbacks exhausted');
}
Fallback chains ensure availability when primary systems fail or rate limits hit.
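The tiering logic above generalizes to any ordered list of providers. A generic sketch (`runFallbackChain` is a hypothetical helper, not a library function):

```javascript
// Try each async provider in order; return the first success. Only
// throws if every tier fails, preserving the last error for debugging.
async function runFallbackChain(providers) {
  let lastError;
  for (const provider of providers) {
    try {
      return await provider();
    } catch (error) {
      lastError = error; // try the next tier
    }
  }
  throw new Error(`All fallbacks exhausted: ${lastError}`);
}
```

Usage: pass `[callGpt4o, callGpt4oMini, readSemanticCache]` and the chain degrades gracefully tier by tier.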
Cost Metering Per User and Tenant
Track costs at request granularity:
async function recordCost(userId, tenantId, operation, tokens) {
  const costUsd = (tokens / 1000) * 0.01; // adjust to the model's pricing
  await db.costs.insert({
    userId,
    tenantId,
    operation,
    tokens,
    costUsd,
    timestamp: new Date()
  });

  // Alert and cut off the user if they exceed their tier's daily limit
  const user = await db.users.get(userId);
  const dailyCost = await db.costs.sumByDate(userId, today());
  if (dailyCost > limits[user.tier]) {
    await sendAlert(userId, `Cost limit exceeded: $${dailyCost}`);
    await disableAIFeatures(userId);
  }
}
Without cost metering, a single user can bankrupt your service.
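In practice the flat `$0.01 per 1K tokens` above should be replaced with per-model pricing, and input and output tokens priced separately. A sketch with illustrative numbers (these are placeholders, not actual OpenAI prices; check the provider's current pricing page):

```javascript
// Hypothetical per-model price table; the numbers are illustrative.
const PRICE_PER_1K_TOKENS = {
  'gpt-4o':      { input: 0.005,   output: 0.015 },
  'gpt-4o-mini': { input: 0.00015, output: 0.0006 }
};

// Turn token counts into the costUsd value inserted into db.costs
function computeCostUsd(model, inputTokens, outputTokens) {
  const price = PRICE_PER_1K_TOKENS[model];
  if (!price) throw new Error(`Unknown model: ${model}`);
  return (inputTokens / 1000) * price.input
       + (outputTokens / 1000) * price.output;
}
```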
Rate Limiting by Tokens, Not Requests
Traditional rate limiting (X requests per minute) fails for AI. One request with 100K tokens is far costlier than one request with 100 tokens.
const tokenBucket = new TokenBucket({
  capacity: 100000,  // tokens
  refillRate: 10000  // tokens per minute
});

app.use(async (req, res, next) => {
  const estimatedTokens = estimateTokens(req.body.prompt);
  if (!tokenBucket.tryConsume(estimatedTokens)) {
    return res.status(429).json({
      error: 'Rate limit exceeded',
      retryAfter: tokenBucket.timeUntilRefill()
    });
  }
  next();
});
Token-based limits align costs with rate limiting.
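The middleware above assumes a `TokenBucket` class. A minimal in-process sketch of one (a production version would keep a bucket per user and back it with Redis rather than process memory):

```javascript
// Classic token-bucket: refill continuously at refillRate tokens per
// minute, cap at capacity, and consume atomically per request.
class TokenBucket {
  constructor({ capacity, refillRate }) {
    this.capacity = capacity;
    this.refillRate = refillRate; // tokens per minute
    this.tokens = capacity;
    this.lastRefill = Date.now();
  }

  refill() {
    const minutes = (Date.now() - this.lastRefill) / 60000;
    this.tokens = Math.min(this.capacity, this.tokens + minutes * this.refillRate);
    this.lastRefill = Date.now();
  }

  tryConsume(amount) {
    this.refill();
    if (this.tokens < amount) return false;
    this.tokens -= amount;
    return true;
  }

  timeUntilRefill() {
    // Rough seconds until the next refill batch; good enough for a
    // Retry-After hint
    return Math.ceil((60000 - (Date.now() - this.lastRefill)) / 1000);
  }
}
```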
AI Feature Flags for Gradual Rollout
Feature flags let you roll out new models safely:
async function getResponse(prompt, userId) {
  const flags = await featureFlags.getForUser(userId);
  const model = flags['use-new-gpt4o-model'] ? 'gpt-4o' : 'gpt-4o-mini';
  return await openai.chat.completions.create({
    model,
    messages: [{ role: 'user', content: prompt }]
  });
}
Roll out incrementally: enable for 5% of users first, then 25%, then 100%.
Observability Stack for AI Systems
Monitor at multiple levels:
// Trace-level: full request context
const span = tracer.startSpan('llm_request', {
attributes: {
model: 'gpt-4o',
prompt_tokens: 150,
max_tokens: 500,
temperature: 0.7,
userId,
tenantId
}
});
// Metric-level: aggregate patterns
metrics.histogram('llm.latency_ms', latencyMs, {
model: 'gpt-4o',
user_tier: 'premium'
});
metrics.counter('llm.tokens_used', tokenCount, {
model: 'gpt-4o',
operation: 'summarize'
});
// Log-level: events
logger.info('LLM request completed', {
llmLatency: '2500ms',
totalLatency: '2650ms',
model: 'gpt-4o',
tokensSaved: 0
});
Observability reveals cost spikes, latency outliers, and error patterns before they become incidents.
Multi-Tenant AI Data Isolation
In multi-tenant systems, prevent cross-tenant data leakage:
// Vector DB query: always filter by tenant
const results = await vectordb.search(embedding, {
filter: { tenantId: req.user.tenantId },
topK: 5
});
// LLM context: include tenant ID
const systemPrompt = `You are an AI assistant for ${tenant.name}.
Use only the following context that belongs to this tenant:`;
// Cache key: include tenant
const cacheKey = hash(JSON.stringify({
tenantId: req.user.tenantId,
prompt: req.body.prompt
}));
Tenant isolation is non-negotiable for security and compliance.
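One way to make the vector-DB filter impossible to forget is to funnel every search through a guard wrapper. A sketch (`tenantScopedSearch` is a hypothetical helper; the `vectordb.search(embedding, options)` shape follows the example above):

```javascript
// Inject tenantId into every search filter and reject calls that
// arrive without one, so no code path can query across tenants.
function tenantScopedSearch(vectordb, tenantId, embedding, options = {}) {
  if (!tenantId) throw new Error('tenantId is required');
  return vectordb.search(embedding, {
    ...options,
    filter: { ...(options.filter || {}), tenantId }
  });
}
```

Application code then never calls `vectordb.search` directly; a lint rule or code review convention enforces the wrapper.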
Checklist
- Implement async-first job queues for LLM calls
- Add streaming endpoints for real-time responses
- Deploy exact-match and semantic caching
- Design three-tier fallback chains (premium → standard → cache)
- Track costs per user and tenant
- Rate limit by tokens, not requests
- Use feature flags for model rollouts
- Instrument LLM latency, tokens, and costs
- Enforce tenant data isolation in vector stores and prompts
- Set hard cost limits and kill switches
Conclusion
Scaling AI products requires fundamentally different architecture than traditional backends. Async-first patterns, intelligent caching, fallback chains, and cost-aware rate limiting form the foundation. Pair these with comprehensive observability and tenant isolation, and you'll build systems that survive real-world usage: rate limits, cost overruns, and model failures.