Cloudflare Workers AI — Running LLMs at the Edge in 60 Countries

Author: Sanjeev Sharma (@webcoderspeed1)

Introduction
Cloudflare Workers AI runs inference across Cloudflare's global network, in data centers spanning 60+ countries, rather than in a single centralized region. Your LLM request executes within ~50 ms of the user: no cold starts, no model downloads. Embed AI into your edge application with env.AI.run() and watch latency collapse.
- Available Models on Workers AI
- env.AI.run() API Basics
- Streaming AI Responses From Workers
- Workers AI + Vectorize for Edge RAG
- D1 for Conversation Storage at Edge
- R2 for File Storage
- Rate Limiting With Durable Objects
- Workers AI Limitations
- Pricing Model
- When Workers AI Beats Cloud LLM APIs
- Full Edge AI Stack Architecture
- Checklist
- Conclusion
Available Models on Workers AI
Workers AI catalog (2026):
- Llama models: Llama 3 70B (instruction-tuned, reasoning), Llama 3.1 8B (lightweight)
- Mistral: Mistral 7B instruct, Mixtral 8x7B mixture-of-experts
- Embeddings: BGE-large-en-v1.5 (1,024 dimensions), multilingual support
- Image classification: NSFW detection, object classification
- Speech models: Whisper (speech-to-text), text-to-speech
All models run locally at your nearest Cloudflare edge location. No egress to centralized infrastructure.
env.AI.run() API Basics
Workers AI exposes models via the env.AI binding:
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const response = await env.AI.run('@cf/meta/llama-3-8b-instruct', {
      prompt: 'Explain edge computing in 3 sentences',
      max_tokens: 256,
    });

    return new Response(JSON.stringify(response));
  },
};
The @cf/ prefix identifies Cloudflare's managed models. Custom fine-tuned models use different naming schemes.
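The env.AI binding itself comes from the Worker's configuration. A minimal wrangler.toml sketch covering the bindings used throughout this article (the binding, index, database, and bucket names here are placeholders, not required values):

```toml
name = "edge-ai-worker"
main = "src/index.ts"
compatibility_date = "2024-09-01"

# Workers AI binding, surfaced as env.AI
[ai]
binding = "AI"

# Vectorize index for embeddings, surfaced as env.VECTORIZE_INDEX
[[vectorize]]
binding = "VECTORIZE_INDEX"
index_name = "documents"

# D1 database for conversation history, surfaced as env.DB
[[d1_databases]]
binding = "DB"
database_name = "conversations"
database_id = "<your-database-id>"

# R2 bucket for document storage, surfaced as env.BUCKET
[[r2_buckets]]
binding = "BUCKET"
bucket_name = "documents"
```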
Response includes generated text, token counts, and timing:
{
  "result": {
    "response": "Edge computing moves computation closer to users...",
    "tokens_generated": 42,
    "tokens_total": 55
  }
}
Streaming AI Responses From Workers
Enable real-time token streaming:
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // With stream: true, env.AI.run() resolves to a ReadableStream whose
    // bytes are already formatted as server-sent events (data: {...} lines),
    // so it can be returned to the client directly.
    const stream = await env.AI.run('@cf/meta/llama-3-70b-instruct', {
      prompt: 'Generate a deployment checklist',
      stream: true,
    });

    return new Response(stream, {
      headers: { 'Content-Type': 'text/event-stream' },
    });
  },
};
Streaming cuts perceived latency: instead of waiting ~3 seconds for a complete response, the first tokens arrive in under 200 ms at the edge.
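On the client, the text/event-stream body arrives as `data: {...}` lines. A minimal parser sketch for one buffered chunk, assuming each event carries a JSON payload with a `response` field and the stream terminates with `data: [DONE]` (the function name is ours):

```typescript
// Extract generated-text fragments from a block of SSE data.
// Lines look like: data: {"response":"some tokens"}
function parseSSEChunk(raw: string): string[] {
  const fragments: string[] = [];
  for (const line of raw.split('\n')) {
    if (!line.startsWith('data: ')) continue;
    const payload = line.slice('data: '.length).trim();
    if (payload === '[DONE]') break;
    try {
      const event = JSON.parse(payload);
      if (typeof event.response === 'string') fragments.push(event.response);
    } catch {
      // Ignore partial JSON at a chunk boundary; a real client would
      // buffer the tail and retry on the next chunk.
    }
  }
  return fragments;
}
```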
Workers AI + Vectorize for Edge RAG
Vectorize stores embeddings at edge locations:
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { query } = await request.json();

    // Generate query embedding
    const queryEmbedding = await env.AI.run('@cf/baai/bge-large-en-v1.5', {
      text: query,
    });

    // Search Vectorize index
    const matches = await env.VECTORIZE_INDEX.query(queryEmbedding.data[0], {
      topK: 5,
    });

    // Fetch documents from D1
    const contexts = await Promise.all(
      matches.matches.map((m) =>
        env.DB.prepare('SELECT content FROM documents WHERE id = ?')
          .bind(m.metadata.docId)
          .first()
      )
    );

    // Generate response using retrieval context (skip any ids missing from D1)
    const response = await env.AI.run('@cf/meta/llama-3-70b-instruct', {
      prompt: `Context:
${contexts.filter(Boolean).map((c) => c.content).join('\n')}

Question: ${query}`,
      max_tokens: 512,
    });

    return new Response(JSON.stringify(response));
  },
};
Vectorize distributes embeddings globally; searches execute locally without centralized database queries.
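The prompt assembly step is easy to unit-test when factored out of the handler; a pure sketch (the function name is ours):

```typescript
// Build a grounded prompt from retrieved context rows and the user query.
function buildRagPrompt(contexts: string[], query: string): string {
  return ['Context:', ...contexts, '', `Question: ${query}`].join('\n');
}
```

Keeping this pure means retrieval quality experiments (ordering, truncation, separators) can be tested without touching the Worker at all.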
D1 for Conversation Storage at Edge
D1 (SQLite) persists conversation history at the edge:
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { userId, message } = await request.json();

    // Retrieve the 10 most recent messages
    const history = await env.DB.prepare(
      'SELECT role, content FROM messages WHERE user_id = ? ORDER BY created_at DESC LIMIT 10'
    )
      .bind(userId)
      .all();

    // Build prompt from history; .all() returns { results }, and rows
    // arrive newest-first, so reverse into chronological order
    const messages = history.results.reverse().map((row) => ({
      role: row.role,
      content: row.content,
    }));
    messages.push({ role: 'user', content: message });

    // Generate response
    const response = await env.AI.run('@cf/meta/llama-3-8b-instruct', {
      prompt: messages.map((m) => `${m.role}: ${m.content}`).join('\n'),
      max_tokens: 256,
    });

    // Store message and response
    await env.DB.prepare(
      'INSERT INTO messages (user_id, role, content, created_at) VALUES (?, ?, ?, ?)'
    )
      .bind(userId, 'user', message, new Date().toISOString())
      .run();
    await env.DB.prepare(
      'INSERT INTO messages (user_id, role, content, created_at) VALUES (?, ?, ?, ?)'
    )
      .bind(userId, 'assistant', response.result.response, new Date().toISOString())
      .run();

    return new Response(JSON.stringify(response));
  },
};
Conversation context stays regional; no latency spike fetching history from a central database.
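The handler above assumes a messages table; a schema sketch matching the columns it reads and writes (the index name is ours):

```sql
CREATE TABLE IF NOT EXISTS messages (
  id         INTEGER PRIMARY KEY AUTOINCREMENT,
  user_id    TEXT NOT NULL,
  role       TEXT NOT NULL,   -- 'user' or 'assistant'
  content    TEXT NOT NULL,
  created_at TEXT NOT NULL    -- ISO 8601 timestamp
);

-- Speeds up the per-user history lookup ordered by recency.
CREATE INDEX IF NOT EXISTS idx_messages_user_time
  ON messages (user_id, created_at DESC);
```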
R2 for File Storage
R2 (object storage) stores documents and fine-tuning data:
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Store uploaded document
    const formData = await request.formData();
    const document = formData.get('document') as File;
    const uploadKey = `documents/${Date.now()}-${document.name}`;

    await env.BUCKET.put(uploadKey, document.stream(), {
      customMetadata: { userId: 'user-123' },
    });

    // Retrieve for processing
    const stored = await env.BUCKET.get(uploadKey);
    const text = await stored.text();

    // Generate embeddings for indexing
    const embeddings = await env.AI.run('@cf/baai/bge-large-en-v1.5', {
      text: text,
    });

    return new Response(JSON.stringify({ key: uploadKey }));
  },
};
R2 integrates seamlessly with Workers AI for document processing workflows.
Rate Limiting With Durable Objects
Durable Objects enforce per-user rate limits across the globe:
export class RateLimiter {
  private state: DurableObjectState;
  private env: Env;

  constructor(state: DurableObjectState, env: Env) {
    this.state = state;
    this.env = env;
  }

  async fetch(request: Request): Promise<Response> {
    const now = Date.now();
    const windowStart = now - 60_000; // 1-minute sliding window

    let tokens = (await this.state.storage.get<number[]>('tokens')) ?? [];
    tokens = tokens.filter((t) => t > windowStart);

    if (tokens.length >= 100) {
      return new Response('Rate limit exceeded', { status: 429 });
    }

    tokens.push(now);
    await this.state.storage.put('tokens', tokens);

    return new Response('OK');
  }
}
Every user's rate limit state is consistent globally, even with geographically distributed requests.
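The window logic in fetch() is a plain sliding-window check; factored into a pure function (names are ours), it becomes trivially testable outside the Durable Object:

```typescript
// Return the surviving timestamps and whether a new request is allowed,
// given a sliding window of `windowMs` and a cap of `limit` requests.
function slidingWindow(
  timestamps: number[],
  now: number,
  limit: number,
  windowMs: number,
): { allowed: boolean; timestamps: number[] } {
  const live = timestamps.filter((t) => t > now - windowMs);
  if (live.length >= limit) return { allowed: false, timestamps: live };
  return { allowed: true, timestamps: [...live, now] };
}
```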
Workers AI Limitations
Model selection is curated, not unlimited. Custom models require Hugging Face integration (beta). Context window varies:
- Llama 3 70B: 8,192 tokens
- Mistral: 32,768 tokens
- Embeddings: 512 tokens per input
Long-context tasks (RAG documents >4KB) require chunking strategies.
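A chunking strategy can be as simple as a fixed window with overlap, so sentences straddling a boundary land in both neighbouring chunks; a sketch (the sizes are illustrative, not tuned):

```typescript
// Split text into fixed-size chunks with overlap between neighbours.
function chunkText(text: string, chunkSize = 1000, overlap = 100): string[] {
  if (chunkSize <= overlap) throw new Error('chunkSize must exceed overlap');
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

Each chunk would then be embedded separately and upserted into Vectorize with metadata pointing back at the parent document.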
Pricing Model
Workers AI charges per inference:
- Included: 10,000 free AI inferences/day on the Free plan
- Paid: ~$0.50 per 1M tokens (varies by model)
- Bundled: Cloudflare's bundled pricing ($200/month Enterprise) includes 50M AI tokens
At scale, AWS Bedrock or dedicated inference endpoints may be cheaper. Edge AI wins for latency-sensitive, low-frequency inferences.
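A back-of-the-envelope estimator makes that break-even analysis concrete; a sketch using the ~$0.50 per 1M-token figure above (actual per-model pricing varies, and the function name is ours):

```typescript
// Rough monthly cost at a flat per-token rate, assuming a 30-day month.
function monthlyTokenCost(
  requestsPerDay: number,
  tokensPerRequest: number,
  usdPerMillionTokens = 0.5,
): number {
  const tokensPerMonth = requestsPerDay * tokensPerRequest * 30;
  return (tokensPerMonth / 1_000_000) * usdPerMillionTokens;
}

// e.g. 10,000 requests/day at ~500 tokens each
// monthlyTokenCost(10_000, 500) → $75/month at $0.50 per 1M tokens
```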
When Workers AI Beats Cloud LLM APIs
Workers AI is optimal when:
- User latency matters (<100ms response time required)
- Per-request costs are high (billions of requests/month)
- Data residency or compliance requires processing at edge
- Inference frequency is bursty (avoid over-provisioning)
Cloud LLM APIs win for:
- High token throughput (>100M tokens/month)
- Complex agent workflows (multiple tool calls)
- Fine-tuned models unique to your domain
- Batch processing (non-real-time)
Full Edge AI Stack Architecture
A complete edge AI application:
- Workers route requests and orchestrate
- Workers AI runs inference locally
- Vectorize stores and searches embeddings
- D1 persists conversation state
- R2 archives documents
- Durable Objects manage rate limiting and state
- Analytics Engine logs inferences for cost tracking
All components replicate globally. No central bottleneck exists.
Checklist
- Create Cloudflare Worker project with wrangler init
- Bind Workers AI model via wrangler.toml configuration
- Test model inference in local development
- Implement streaming responses for user-facing endpoints
- Set up Vectorize index for document embeddings
- Create D1 database schema for conversation storage
- Add R2 bucket for document archival
- Implement rate limiting with Durable Objects
- Monitor token usage in Cloudflare Analytics Dashboard
- Set up cost alerts for high inference volume
Conclusion
Cloudflare Workers AI collapses latency by running LLMs at the edge. For applications prioritizing response speed and global distribution, Workers AI offers unbeatable simplicity. Pair it with Vectorize and D1 for a complete, edge-native AI stack that scales to billions of requests without centralized infrastructure.