Cloudflare Workers AI — Running LLMs at the Edge in 60 Countries

Introduction

Cloudflare Workers AI runs inference in data centers across 60+ countries rather than in a single centralized region. Your LLM request executes within roughly 50ms of the user, with no cold starts and no model downloads to manage. Embed AI into your edge application with env.AI.run() and watch latency collapse.

Available Models on Workers AI

Workers AI catalog (2026):

  • Llama models: Llama 3 70B (instruction-tuned, reasoning), Llama 3.1 8B (lightweight)
  • Mistral: Mistral 7B instruct, Mixtral 8x7B mixture-of-experts
  • Embeddings: BGE-large-en-v1.5 (1,024 dimensions); multilingual variants available
  • Image classification: NSFW detection, object classification
  • Speech models: Whisper (speech-to-text), text-to-speech

Models run at the nearest Cloudflare edge location with GPU capacity. Requests never leave Cloudflare's network for centralized third-party infrastructure.

env.AI.run() API Basics

Workers AI exposes models via the env.AI binding:

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const response = await env.AI.run('@cf/meta/llama-3-8b-instruct', {
      prompt: 'Explain edge computing in 3 sentences',
      max_tokens: 256,
    });

    return new Response(JSON.stringify(response));
  },
};

The @cf/ prefix identifies Cloudflare's managed models. Custom fine-tuned models use different naming schemes.
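
The env.AI binding itself comes from project configuration. A minimal wrangler.toml might look like this (project name, entry point, and compatibility date are placeholders):

```toml
name = "edge-ai-worker"
main = "src/index.ts"
compatibility_date = "2024-09-01"

# Expose Workers AI to the script as env.AI
[ai]
binding = "AI"
```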

The REST API wraps the output in a result envelope with the generated text and token counts (the env.AI.run() binding resolves to the inner object directly):

{
  "result": {
    "response": "Edge computing moves computation closer to users...",
    "tokens_generated": 42,
    "tokens_total": 55
  }
}

Streaming AI Responses From Workers

Enable real-time token streaming:

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // With stream: true, run() resolves to a ReadableStream whose chunks
    // are already server-sent-event encoded bytes, so it can be returned
    // as the response body directly.
    const stream = await env.AI.run('@cf/meta/llama-3-70b-instruct', {
      prompt: 'Generate a deployment checklist',
      stream: true,
    });

    return new Response(stream, {
      headers: { 'Content-Type': 'text/event-stream' },
    });
  },
};

Streaming cuts perceived latency from several seconds (waiting for the full completion) to the time-to-first-token, often under 200ms at the edge.
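
On the client side, the buffered stream has to be decoded. Assuming each server-sent event is a data: line carrying JSON with a response field, terminated by a data: [DONE] sentinel (verify against the actual wire format), a small parser might look like:

```typescript
// Parse a buffered server-sent-events payload into response tokens.
// Assumes lines of the form `data: {"response":"..."}` ending with
// `data: [DONE]` -- an assumption to check against the real stream.
function parseSSETokens(buffer: string): string[] {
  const tokens: string[] = [];
  for (const line of buffer.split('\n')) {
    const trimmed = line.trim();
    if (!trimmed.startsWith('data:')) continue;
    const payload = trimmed.slice('data:'.length).trim();
    if (payload === '[DONE]') break;
    try {
      const parsed = JSON.parse(payload) as { response?: string };
      if (typeof parsed.response === 'string') tokens.push(parsed.response);
    } catch {
      // Ignore partial JSON at a chunk boundary.
    }
  }
  return tokens;
}
```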

Workers AI + Vectorize for Edge RAG

Vectorize stores embeddings at edge locations:

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { query } = await request.json<{ query: string }>();

    // Generate query embedding
    const queryEmbedding = await env.AI.run('@cf/baai/bge-large-en-v1.5', {
      text: query,
    });

    // Search Vectorize index
    const matches = await env.VECTORIZE_INDEX.query(
      queryEmbedding.data[0],
      { topK: 5 }
    );

    // Fetch documents from D1
    const contexts = await Promise.all(
      matches.matches.map(m =>
        env.DB.prepare('SELECT content FROM documents WHERE id = ?')
          .bind(m.metadata.docId)
          .first<{ content: string }>()
      )
    );

    // Generate response using retrieval context
    const response = await env.AI.run('@cf/meta/llama-3-70b-instruct', {
      prompt: `
Context:
${contexts.map(c => c?.content ?? '').join('\n')}

Question: ${query}
`,
      max_tokens: 512,
    });

    return new Response(JSON.stringify(response));
  },
};

Vectorize distributes embeddings globally; searches execute locally without centralized database queries.
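
Query-time retrieval assumes vectors were indexed beforehand. A sketch of the shaping step, pairing each embedding with its document id into the { id, values, metadata } records that Vectorize's upsert() accepts (helper and id names are illustrative):

```typescript
interface VectorRecord {
  id: string;
  values: number[];
  metadata: Record<string, string>;
}

// Shape embedding output into records for env.VECTORIZE_INDEX.upsert(records)
// at index time. One embedding vector is expected per document.
function toVectorRecords(docIds: string[], embeddings: number[][]): VectorRecord[] {
  if (docIds.length !== embeddings.length) {
    throw new Error('one embedding per document expected');
  }
  return embeddings.map((values, i) => ({
    id: `vec-${docIds[i]}`,
    values,
    metadata: { docId: docIds[i] },
  }));
}
```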

D1 for Conversation Storage at Edge

D1 (SQLite) persists conversation history at edge:

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { userId, message } = await request.json<{ userId: string; message: string }>();

    // Retrieve the last 10 turns of conversation history
    const history = await env.DB.prepare(
      'SELECT role, content FROM messages WHERE user_id = ? ORDER BY created_at DESC LIMIT 10'
    )
      .bind(userId)
      .all<{ role: string; content: string }>();

    // Build prompt from history (.all() returns rows under .results)
    const messages = history.results.reverse().map(row => ({
      role: row.role,
      content: row.content,
    }));
    messages.push({ role: 'user', content: message });

    // Generate response
    const response = await env.AI.run('@cf/meta/llama-3-8b-instruct', {
      prompt: messages.map(m => `${m.role}: ${m.content}`).join('\n'),
      max_tokens: 256,
    });

    // Store both sides of the exchange
    await env.DB.prepare(
      'INSERT INTO messages (user_id, role, content, created_at) VALUES (?, ?, ?, ?)'
    )
      .bind(userId, 'user', message, new Date().toISOString())
      .run();

    await env.DB.prepare(
      'INSERT INTO messages (user_id, role, content, created_at) VALUES (?, ?, ?, ?)'
    )
      .bind(userId, 'assistant', response.response, new Date().toISOString())
      .run();

    return new Response(JSON.stringify(response));
  },
};

With D1 read replication enabled, conversation history can be read from a copy near the user; writes still go to the database's primary location.
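
For reference, a minimal D1 schema matching the queries above; the index is a suggested optimization for the per-user history lookup, applied via wrangler d1 execute:

```sql
CREATE TABLE IF NOT EXISTS messages (
  id         INTEGER PRIMARY KEY AUTOINCREMENT,
  user_id    TEXT NOT NULL,
  role       TEXT NOT NULL,          -- 'user' or 'assistant'
  content    TEXT NOT NULL,
  created_at TEXT NOT NULL           -- ISO-8601 timestamp
);

-- Speeds up the per-user history lookup ordered by recency.
CREATE INDEX IF NOT EXISTS idx_messages_user_time
  ON messages (user_id, created_at DESC);
```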

R2 for File Storage

R2 (object storage) stores documents and fine-tuning data:

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Store uploaded document
    const form = await request.formData();
    const document = form.get('document');
    if (!(document instanceof File)) {
      return new Response('Missing document upload', { status: 400 });
    }

    const uploadKey = `documents/${Date.now()}-${document.name}`;
    await env.BUCKET.put(uploadKey, document.stream(), {
      customMetadata: { userId: 'user-123' },
    });

    // Retrieve for processing
    const stored = await env.BUCKET.get(uploadKey);
    if (!stored) {
      return new Response('Upload not found', { status: 500 });
    }
    const text = await stored.text();

    // Generate embeddings for indexing (upsert into Vectorize afterwards)
    const embeddings = await env.AI.run('@cf/baai/bge-large-en-v1.5', {
      text: text,
    });

    return new Response(JSON.stringify({ key: uploadKey }));
  },
};

R2 integrates seamlessly with Workers AI for document processing workflows.

Rate Limiting With Durable Objects

Durable Objects enforce per-user rate limits across the globe:

export class RateLimiter {
  private state: DurableObjectState;
  private env: Env;

  constructor(state: DurableObjectState, env: Env) {
    this.state = state;
    this.env = env;
  }

  async fetch(request: Request): Promise<Response> {
    const now = Date.now();
    const windowStart = now - 60000; // 1-minute sliding window

    // Keep only request timestamps inside the current window
    let tokens = (await this.state.storage.get<number[]>('tokens')) ?? [];
    tokens = tokens.filter((t) => t > windowStart);

    if (tokens.length >= 100) {
      return new Response('Rate limit exceeded', { status: 429 });
    }

    tokens.push(now);
    await this.state.storage.put('tokens', tokens);

    return new Response('OK');
  }
}

Because each user maps to a single Durable Object instance, their rate-limit state stays consistent even when requests arrive at different edge locations.
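
The window logic above can be factored into a pure function (the name and return shape here are illustrative), which makes it easy to unit-test outside the Workers runtime:

```typescript
interface RateDecision {
  allowed: boolean;
  remaining: number;    // requests left in the window
  timestamps: number[]; // pruned list to persist back to storage
}

// Sliding-window check: drop timestamps outside the window, reject if the
// limit is already reached, otherwise record the new request.
function slidingWindowCheck(
  timestamps: number[],
  now: number,
  windowMs: number,
  limit: number,
): RateDecision {
  const recent = timestamps.filter((t) => t > now - windowMs);
  if (recent.length >= limit) {
    return { allowed: false, remaining: 0, timestamps: recent };
  }
  recent.push(now);
  return { allowed: true, remaining: limit - recent.length, timestamps: recent };
}
```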

Workers AI Limitations

Model selection is curated, not unlimited. Custom models require Hugging Face integration (beta). Context window varies:

  • Llama 3 70B: 8,192 tokens
  • Mistral: 32,768 tokens
  • Embeddings: 512 tokens per input

Long-context tasks (RAG documents >4KB) require chunking strategies.
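A minimal chunking sketch, using a character budget as a rough stand-in for the 512-token embedding limit (the ~3 characters-per-token ratio is an assumption, and a single oversized sentence can still exceed the budget):

```typescript
// Greedily pack sentence-ish segments into chunks of at most maxChars.
// maxChars ~ 1500 roughly approximates a 512-token budget at ~3 chars/token.
function chunkText(text: string, maxChars = 1500): string[] {
  const sentences = text.split(/(?<=[.!?])\s+/);
  const chunks: string[] = [];
  let current = '';
  for (const sentence of sentences) {
    if (current && current.length + sentence.length + 1 > maxChars) {
      chunks.push(current);
      current = sentence;
    } else {
      current = current ? `${current} ${sentence}` : sentence;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Each chunk is then embedded separately and upserted into Vectorize with metadata pointing back at the source document.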

Pricing Model

Workers AI charges per inference:

  • Included: 10,000 free Neurons/day (Cloudflare's inference compute unit) on the Free plan
  • Paid: $0.50 per 1M tokens (varies by model)
  • Bundled: Cloudflare's bundled pricing ($200/month Enterprise) includes 50M AI tokens

At scale, AWS Bedrock or dedicated inference endpoints may be cheaper. Edge AI wins for latency-sensitive, low-frequency inferences.

When Workers AI Beats Cloud LLM APIs

Workers AI is optimal when:

  • User latency matters (<100ms response time required)
  • Per-request round trips to a central API add up (billions of requests/month)
  • Data residency or compliance requires processing at edge
  • Inference frequency is bursty (avoid over-provisioning)

Cloud LLM APIs win for:

  • High token throughput (>100M tokens/month)
  • Complex agent workflows (multiple tool calls)
  • Fine-tuned models unique to your domain
  • Batch processing (non-real-time)

Full Edge AI Stack Architecture

A complete edge AI application:

  1. Workers route requests and orchestrate
  2. Workers AI runs inference locally
  3. Vectorize stores and searches embeddings
  4. D1 persists conversation state
  5. R2 archives documents
  6. Durable Objects manage rate limiting and state
  7. Analytics Engine logs inferences for cost tracking

All components replicate globally. No central bottleneck exists.
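
Declared together, the bindings for such a stack could look like the following wrangler.toml fragment (binding names match the examples above; index, database, and bucket names are placeholders):

```toml
[ai]
binding = "AI"

[[vectorize]]
binding = "VECTORIZE_INDEX"
index_name = "docs-index"

[[d1_databases]]
binding = "DB"
database_name = "conversations"
database_id = "<database-id>"

[[r2_buckets]]
binding = "BUCKET"
bucket_name = "documents"

[[durable_objects.bindings]]
name = "RATE_LIMITER"
class_name = "RateLimiter"

[[migrations]]
tag = "v1"
new_classes = ["RateLimiter"]
```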

Checklist

  • Create Cloudflare Worker project with wrangler init
  • Bind Workers AI model via wrangler.toml configuration
  • Test model inference in local development
  • Implement streaming responses for user-facing endpoints
  • Set up Vectorize index for document embeddings
  • Create D1 database schema for conversation storage
  • Add R2 bucket for document archival
  • Implement rate limiting with Durable Objects
  • Monitor token usage in Cloudflare Analytics Dashboard
  • Set up cost alerts for high inference volume

Conclusion

Cloudflare Workers AI collapses latency by running LLMs at the edge. For applications prioritizing response speed and global distribution, Workers AI offers unbeatable simplicity. Pair it with Vectorize and D1 for a complete, edge-native AI stack that scales to billions of requests without centralized infrastructure.