AWS Bedrock in Production — Enterprise LLM Without Sending Data to OpenAI

Introduction

AWS Bedrock eliminates the data residency dilemma: run Claude, Llama, and Mistral models without sending customer data to external AI companies. Your data stays in AWS, model training never touches your content, and you control every byte through VPC endpoints and IAM policies.

Why Enterprises Choose Bedrock

The compliance question drives adoption: "Where does our data go?" With OpenAI or Anthropic APIs, your inference data flows to their infrastructure. Bedrock runs models in your AWS account or private VPC endpoints. PCI-DSS, HIPAA, and SOC 2 audits become simpler when customer data never leaves your AWS region.

Model training on inference data is explicitly disabled in Bedrock. Your competitive advantage—unique prompts, proprietary context, conversation history—stays yours. No model gets smarter from your usage patterns.

Available Models on Bedrock

Bedrock offers a diverse model family:

  • Claude models: Claude 3.5 Sonnet (best balance of intelligence and speed), Claude 3 Opus (most capable), Claude 3 Haiku (fastest, cost-optimized). All support 200k context windows.
  • Llama models: Meta's Llama 3.1 and 3.2, open-source weights, fine-tuning support.
  • Mistral: Mistral 7B, Mixtral 8x7B, Mistral Large for high-performance inference.
  • Titan models: Amazon-built embeddings and text generation, optimized for AWS workloads.

Each model has provisioned and on-demand pricing. Provisioned throughput (a one- or six-month commitment) costs roughly 33% less per token but requires capacity planning.
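As a rough planning aid, the trade-off can be sketched in a few lines. The rates are the Claude 3.5 Sonnet on-demand figures cited later in this article, and the flat 33% discount is an approximation; real provisioned pricing is quoted per model unit per hour, so treat this as an estimate only.

```typescript
// Rough planning sketch: on-demand vs provisioned monthly cost.
// Rates are Claude 3.5 Sonnet on-demand figures from this article;
// the flat 33% discount is an approximation of provisioned pricing.
const INPUT_PER_1M = 3.0;    // USD per 1M input tokens
const OUTPUT_PER_1M = 15.0;  // USD per 1M output tokens
const PROVISIONED_DISCOUNT = 0.33;

function monthlyCost(inputTokens: number, outputTokens: number, provisioned = false): number {
  const onDemand =
    (inputTokens / 1_000_000) * INPUT_PER_1M +
    (outputTokens / 1_000_000) * OUTPUT_PER_1M;
  return provisioned ? onDemand * (1 - PROVISIONED_DISCOUNT) : onDemand;
}

console.log(monthlyCost(2_000_000_000, 500_000_000));       // on-demand
console.log(monthlyCost(2_000_000_000, 500_000_000, true)); // provisioned
```

If the provisioned estimate undercuts on-demand at your forecast volume, the commitment is worth pricing out properly.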

Bedrock Runtime API With TypeScript

The @aws-sdk/client-bedrock-runtime package handles inference:

import { BedrockRuntimeClient, InvokeModelCommand } from '@aws-sdk/client-bedrock-runtime';

const client = new BedrockRuntimeClient({ region: 'us-east-1' });

const response = await client.send(new InvokeModelCommand({
  modelId: 'anthropic.claude-3-5-sonnet-20241022-v2:0',
  body: JSON.stringify({
    anthropic_version: 'bedrock-2023-05-31',
    max_tokens: 1024,
    messages: [{ role: 'user', content: 'Explain Bedrock pricing' }],
  }),
}));

const body = JSON.parse(new TextDecoder().decode(response.body));
console.log(body.content[0].text);

Model IDs include version numbers. AWS releases new model versions periodically; pin specific versions in production for reproducibility.
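One lightweight way to enforce that pinning is a single registry of approved model IDs, so every service resolves an alias through one reviewed map. The aliases here are hypothetical; the IDs are the Sonnet ID used in this article plus the public Claude 3 Haiku ID.

```typescript
// Hypothetical model registry: bumping a model version becomes one
// reviewed change in one place, not silent drift across services.
const MODEL_IDS = {
  sonnet: 'anthropic.claude-3-5-sonnet-20241022-v2:0',
  haiku: 'anthropic.claude-3-haiku-20240307-v1:0',
} as const;

type ModelAlias = keyof typeof MODEL_IDS;

function resolveModelId(alias: ModelAlias): string {
  return MODEL_IDS[alias];
}
```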

Streaming Responses With InvokeModelWithResponseStream

Non-streamed inference waits for complete model output before returning. For user-facing applications, stream tokens:

import { InvokeModelWithResponseStreamCommand } from '@aws-sdk/client-bedrock-runtime';

const command = new InvokeModelWithResponseStreamCommand({
  modelId: 'anthropic.claude-3-5-sonnet-20241022-v2:0',
  body: JSON.stringify({
    anthropic_version: 'bedrock-2023-05-31',
    max_tokens: 2048,
    messages: [{ role: 'user', content: 'Write a scaling strategy' }],
  }),
});

const response = await client.send(command);

for await (const event of response.body ?? []) {
  if (event.chunk?.bytes) {
    // Each event wraps raw bytes; decode and parse the Anthropic stream chunk.
    const chunk = JSON.parse(new TextDecoder().decode(event.chunk.bytes));
    if (chunk.type === 'content_block_delta') {
      process.stdout.write(chunk.delta.text);
    }
  }
}

Streaming cuts time-to-first-token (TTFT) from several seconds to a few hundred milliseconds for typical chat-length responses.

Bedrock Agents for Managed Workflows

Bedrock Agents orchestrate tool calling automatically:

import { BedrockAgentRuntimeClient, InvokeAgentCommand } from '@aws-sdk/client-bedrock-agent-runtime';

const agentClient = new BedrockAgentRuntimeClient({ region: 'us-east-1' });

const response = await agentClient.send(new InvokeAgentCommand({
  agentId: 'my-agent-id',
  agentAliasId: 'FFFFFFFF', // Production alias
  sessionId: `session-${userId}`,
  inputText: 'Check our database for Q3 revenue trends',
}));
// response.completion is an event stream of agent output chunks.

The agent calls your Lambda functions or APIs (knowledge base, database queries, external APIs) based on the user prompt. No explicit tool-call parsing required.
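On the Lambda side, the handler receives the agent's chosen function and parameters and returns a structured result. A minimal sketch using the function-details response contract for Agents action groups; the revenue lookup, action group name, and parameter names are hypothetical stand-ins.

```typescript
// Sketch of a Lambda behind a Bedrock Agent action group (function-details
// style). Field names follow the Agents Lambda contract; the data is fake.
interface AgentEvent {
  actionGroup: string;
  function: string;
  parameters?: { name: string; type: string; value: string }[];
}

function handler(event: AgentEvent) {
  // Hypothetical lookup the agent can invoke for revenue questions.
  const quarter = event.parameters?.find((p) => p.name === 'quarter')?.value ?? 'Q3';
  const body = `Revenue for ${quarter}: stand-in result from your data source`;

  return {
    messageVersion: '1.0',
    response: {
      actionGroup: event.actionGroup,
      function: event.function,
      functionResponse: {
        responseBody: { TEXT: { body } },
      },
    },
  };
}
```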

Knowledge Bases for Managed RAG

Bedrock Knowledge Bases handle retrieval:

  1. Upload documents to S3
  2. Create a Knowledge Base pointing to that S3 bucket
  3. Bedrock automatically chunks, embeds, and indexes documents

import { BedrockAgentRuntimeClient, RetrieveAndGenerateCommand } from '@aws-sdk/client-bedrock-agent-runtime';

const agentClient = new BedrockAgentRuntimeClient({ region: 'us-east-1' });

const response = await agentClient.send(new RetrieveAndGenerateCommand({
  input: { text: 'How do we handle customer refunds?' },
  retrieveAndGenerateConfiguration: {
    type: 'KNOWLEDGE_BASE',
    knowledgeBaseConfiguration: {
      knowledgeBaseId: 'kb-12345',
      modelArn: 'arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0',
    },
  },
}));

No managing vector databases, embeddings pipelines, or chunk sizes yourself.
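A small helper can surface the source citations alongside the generated answer. The field names below follow the RetrieveAndGenerate response shape, with types narrowed to just what this sketch reads.

```typescript
// Sketch: extract answer text plus S3 source URIs from a
// RetrieveAndGenerate response (types narrowed to the fields read here).
interface RagResponse {
  output?: { text?: string };
  citations?: {
    retrievedReferences?: { location?: { s3Location?: { uri?: string } } }[];
  }[];
}

function summarizeRag(response: RagResponse): { answer: string; sources: string[] } {
  const sources =
    response.citations?.flatMap((c) =>
      (c.retrievedReferences ?? [])
        .map((r) => r.location?.s3Location?.uri)
        .filter((u): u is string => Boolean(u)),
    ) ?? [];
  return { answer: response.output?.text ?? '', sources };
}
```

Surfacing the S3 URIs lets users verify which document an answer came from.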

Bedrock Guardrails for Content Filtering

Guardrails block harmful outputs before they reach users:

const response = await client.send(new InvokeModelCommand({
  modelId: 'anthropic.claude-3-5-sonnet-20241022-v2:0',
  guardrailIdentifier: 'gr-12345',
  guardrailVersion: '1',
  body: JSON.stringify({
    anthropic_version: 'bedrock-2023-05-31',
    max_tokens: 1024,
    messages: [{ role: 'user', content: userInput }],
  }),
}));

const body = JSON.parse(new TextDecoder().decode(response.body));
const guardrailAction = body['amazon-bedrock-guardrailAction']; // 'INTERVENED' or 'NONE'

Guardrails catch PII leakage, illegal activities, and violence without custom filters.
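When a guardrail does intervene, most applications substitute a canned message rather than surfacing the blocked output. A sketch, assuming the parsed response body carries the `amazon-bedrock-guardrailAction` field:

```typescript
// Sketch: fall back to a canned message on guardrail intervention.
// Assumes the parsed body includes `amazon-bedrock-guardrailAction`.
interface GuardedBody {
  'amazon-bedrock-guardrailAction'?: 'INTERVENED' | 'NONE';
  content?: { text: string }[];
}

function safeText(body: GuardedBody): string {
  if (body['amazon-bedrock-guardrailAction'] === 'INTERVENED') {
    return "Sorry, I can't help with that request.";
  }
  return body.content?.[0]?.text ?? '';
}
```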

Cost Comparison vs OpenAI

Bedrock pricing (on-demand):

  • Claude 3.5 Sonnet: $3 per 1M input tokens, $15 per 1M output tokens
  • Llama 3.1 70B: $0.99 / $1.32 per 1M tokens

OpenAI pricing (the same models aren't offered; the closest comparison is GPT-4o):

  • GPT-4o: $5 / $15 per 1M tokens

Bedrock saves 40% on inputs for equivalent capability. Provisioned throughput adds another 33% discount at scale (>10B tokens/month).
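The 40% figure is just the ratio of input rates; checking the arithmetic at an example volume (100M input and 20M output tokens per month, rates as listed above):

```typescript
// Arithmetic check of the comparison above (rates in USD per 1M tokens).
function cost(inputTokens: number, outputTokens: number, inRate: number, outRate: number): number {
  return (inputTokens / 1e6) * inRate + (outputTokens / 1e6) * outRate;
}

const IN = 100_000_000;  // example: 100M input tokens / month
const OUT = 20_000_000;  // example: 20M output tokens / month

const sonnetOnBedrock = cost(IN, OUT, 3, 15); // 300 + 300 = 600
const gpt4o = cost(IN, OUT, 5, 15);           // 500 + 300 = 800
console.log({ sonnetOnBedrock, gpt4o });
```

At this mix the total saving is 25%, since output rates are identical; input-heavy workloads get closer to the 40% input-rate gap.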

IAM Permissions Setup

Limit access by model and operation:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream"
      ],
      "Resource": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet*"
    },
    {
      "Effect": "Allow",
      "Action": ["bedrock:Retrieve"],
      "Resource": "arn:aws:bedrock:us-east-1:123456789012:knowledge-base/kb-*"
    }
  ]
}

Least-privilege ensures compromised application keys can't invoke expensive models.

Cross-Region Inference

Bedrock is available in multiple AWS regions, including us-east-1, us-west-2, eu-west-1, and ap-southeast-1. Route requests to the nearest region and fail over on errors:

const regions = ['us-east-1', 'us-west-2'];
const clients = regions.map(r => new BedrockRuntimeClient({ region: r }));

// Try primary region, fall back to secondary
let response;
try {
  response = await clients[0].send(command);
} catch (e) {
  response = await clients[1].send(command);
}

Cross-region failover typically adds under 100ms of latency between nearby regions but keeps the application available.

Checklist

  • Enable Bedrock in target AWS regions
  • Request model access (instant for most models; some providers ask for a brief use-case form)
  • Create IAM role with bedrock:InvokeModel and bedrock:InvokeModelWithResponseStream
  • Test streaming vs non-streamed for your use case
  • Set up Knowledge Bases for retrieval-augmented generation
  • Configure Guardrails for sensitive workloads
  • Estimate token usage for cost forecasting
  • Implement cross-region failover for critical applications
  • Log inference latency and token counts for observability
  • Review SOC 2 compliance documentation for audits
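For the observability item above, a sketch that logs token counts from a parsed Anthropic-style response body (its `usage` field) alongside measured latency:

```typescript
// Sketch: structured log line for inference observability.
// Assumes an Anthropic-style parsed body with a `usage` field.
interface Usage { input_tokens: number; output_tokens: number }

function logInference(modelId: string, latencyMs: number, usage: Usage): string {
  const line = JSON.stringify({
    modelId,
    latencyMs,
    inputTokens: usage.input_tokens,
    outputTokens: usage.output_tokens,
  });
  console.log(line);
  return line;
}
```

Feeding these lines into CloudWatch or your log pipeline gives per-model cost and latency dashboards with no extra instrumentation.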

Conclusion

AWS Bedrock moves LLM inference behind your AWS firewall. Compliance improves, costs drop, and you control the entire stack. For enterprises handling sensitive data—healthcare, financial services, government—Bedrock eliminates the "where does it go?" question entirely. Start with on-demand, optimize to provisioned throughput once you forecast monthly token volume.