LLM Data Privacy — Preventing Your Users' Data From Training OpenAI's Models

Introduction

When you send user data to OpenAI's API, you're trusting OpenAI to honour its data-usage terms. But those terms are complex, they differ between consumer and API products, and many teams don't understand what they've agreed to. Azure OpenAI offers stronger contractual guarantees, AWS Bedrock isolates data per region, and self-hosted LLMs give you full control.

This post clarifies privacy guarantees from major LLM providers, shows how to detect and redact PII before sending to LLMs, and helps you choose the right LLM architecture for your privacy requirements.

How LLM Providers Use Your Data

OpenAI API (public):

This depends on the product. OpenAI's consumer apps (ChatGPT) use your conversations for training unless you opt out in the data-controls settings. For the API, OpenAI states that it does not train on inputs or outputs by default (since March 2023), though requests may be retained for up to 30 days for abuse monitoring. Either way, the protection is a policy commitment, not architectural isolation.

Azure OpenAI:

Prompts and completions stay within your Azure tenant, are not shared with OpenAI, and are not used for training. These are contractual guarantees, not just policy.

AWS Bedrock:

Data is isolated per region and not used for training. Similar to Azure.

Anthropic (Claude API):

Anthropic states that it does not use API conversations for training unless you explicitly opt in. Opt-in, not opt-out.

Azure OpenAI vs OpenAI API Privacy Differences

| Feature | OpenAI API | Azure OpenAI |
| --- | --- | --- |
| Training on your data | Not by default (policy commitment) | No (contractual) |
| Data residency | OpenAI's servers | Your Azure region |
| HIPAA compliance | No | Yes (with a signed agreement) |
| SOC 2 audit | No | Yes |
| Cost | Lower | Higher |
| Latency | Via OpenAI's shared endpoints | Direct to your Azure region |

Choose Azure OpenAI if:

  • Handling PII (healthcare, financial, legal)
  • HIPAA or compliance requirements
  • Data must stay in specific regions
  • Security posture matters more than cost

Choose OpenAI API if:

  • Non-sensitive data (public content, dev tooling)
  • Cost is the priority
  • Policy-level (rather than contractual) guarantees are sufficient
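One way to operationalise this decision is a small routing layer that picks a provider based on the sensitivity of the data in the request. The sketch below is illustrative: the sensitivity labels and the Azure endpoint URL are hypothetical, and you would substitute your own classification logic and deployment names.

```typescript
// Hypothetical routing sketch: send sensitive traffic to a tenant-isolated
// Azure OpenAI deployment, everything else to the cheaper public OpenAI API.
type Sensitivity = 'public' | 'internal' | 'pii';

interface ProviderRoute {
  provider: 'openai' | 'azure-openai';
  baseUrl: string;
}

function routeBySensitivity(sensitivity: Sensitivity): ProviderRoute {
  // PII and internal data go to the Azure deployment with contractual guarantees
  if (sensitivity === 'pii' || sensitivity === 'internal') {
    return {
      provider: 'azure-openai',
      baseUrl: 'https://my-tenant.openai.azure.com', // hypothetical deployment URL
    };
  }
  // Public, non-sensitive content can use the public API
  return { provider: 'openai', baseUrl: 'https://api.openai.com/v1' };
}
```

Keeping the routing decision in one function makes the privacy policy auditable: you can point legal and security reviewers at a single place where the choice is made.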

AWS Bedrock Data Isolation Guarantees

AWS Bedrock provides strong isolation:

  • No training data sharing: Your data is not used to train Amazon's models and is not shared with the model providers.
  • Regional isolation: Data stays within the region you deploy to.
  • Encryption: Data encrypted at rest and in transit.
  • No prompt logging: Prompts and completions are not stored or logged by AWS by default; model invocation logging is opt-in.

Cost: Higher than OpenAI API, comparable to Azure OpenAI.

import { BedrockRuntimeClient, InvokeModelCommand } from '@aws-sdk/client-bedrock-runtime';

const client = new BedrockRuntimeClient({ region: 'us-east-1' });

const response = await client.send(
  new InvokeModelCommand({
    modelId: 'anthropic.claude-3-sonnet-20240229-v1:0',
    contentType: 'application/json',
    body: JSON.stringify({
      anthropic_version: 'bedrock-2023-05-31', // required version string for Anthropic models on Bedrock
      max_tokens: 1024,
      messages: [
        {
          role: 'user',
          content: 'Your prompt here',
        },
      ],
    }),
  })
);

// The response body is a byte array; decode and parse it
const result = JSON.parse(new TextDecoder().decode(response.body));
console.log(result.content[0].text);

AWS Bedrock is ideal for regulated industries.

Self-Hosted LLMs for Maximum Privacy

Run LLMs on your infrastructure. No data leaves your systems.

Options:

  1. Ollama (easiest):
ollama pull llama2
ollama run llama2 "Your prompt"
  2. LM Studio (GUI):

Download a model and run it locally. Simple UI.

  3. LLaMA 2, Mistral, or Phi (open-source models):

Fine-tune and deploy on your own hardware.

// Using a local Ollama server (Node 18+ ships a global fetch)
async function generateText(prompt: string): Promise<string> {
  const response = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'llama2',
      prompt,
      stream: false, // return one complete response instead of a token stream
    }),
  });

  const data = (await response.json()) as { response: string };
  return data.response;
}

const response = await generateText('Summarise this document');
console.log(response);

Tradeoff: Slower, more expensive to operate, but maximum privacy.

For on-premise deployments with strict data requirements, self-hosted is worth the cost.

PII Detection Before Sending to LLM

Use Microsoft Presidio to detect PII before sending to LLM APIs.

pip install presidio-analyzer presidio-anonymizer
python -m spacy download en_core_web_lg

Presidio is a Python library, so run detection in a small Python service or sidecar and call it from your application.

# Detect PII before the text ever reaches an LLM API
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()

def detect_pii(text: str):
    results = analyzer.analyze(text=text, language='en')
    # Keep only confident matches (email, phone, credit card, etc.);
    # tune the threshold to your data
    pii_found = [r for r in results if r.score >= 0.75]
    if pii_found:
        print('PII detected:', [r.entity_type for r in pii_found])
    return pii_found

text = 'Call me at 555-123-4567 or email john@example.com'
pii = detect_pii(text)
# Each result carries entity_type, start/end offsets, and a confidence score

Presidio identifies emails, phone numbers, credit cards, SSNs, names, and more.
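If your backend is Node and you cannot call out to a Python Presidio service, a minimal regex pre-flight check can still catch the most obvious identifiers. This is far less thorough than Presidio; the patterns below are illustrative, not production-grade.

```typescript
// Minimal fallback PII check for Node services without access to Presidio.
// Regexes only catch obvious formats; prefer Presidio (or a managed PII
// detection service) in production.
const PII_PATTERNS: Record<string, RegExp> = {
  EMAIL_ADDRESS: /[\w.+-]+@[\w-]+\.[\w.]+/,
  PHONE_NUMBER: /\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b/,
  CREDIT_CARD: /\b(?:\d[ -]?){13,16}\b/,
};

function findPII(text: string): string[] {
  return Object.entries(PII_PATTERNS)
    .filter(([, pattern]) => pattern.test(text))
    .map(([entityType]) => entityType);
}

const hits = findPII('Call me at 555-123-4567 or email john@example.com');
// hits contains 'EMAIL_ADDRESS' and 'PHONE_NUMBER'
```

A check like this works as a cheap guard in front of the LLM call: block or route the request for redaction whenever `findPII` returns a non-empty list.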

PII Redaction and Replacement with Synthetic Data

Instead of removing PII outright, replace it with placeholders or synthetic values that preserve the sentence structure.

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def redact_pii(text: str) -> str:
    # Run the analyzer first; its results tell the anonymizer where the PII is
    results = analyzer.analyze(text=text, language='en')
    redacted = anonymizer.anonymize(
        text=text,
        analyzer_results=results,
        operators={
            'EMAIL_ADDRESS': OperatorConfig('replace', {'new_value': '[EMAIL]'}),
            'PHONE_NUMBER': OperatorConfig(
                'mask',
                {'masking_char': '*', 'chars_to_mask': 8, 'from_end': False},
            ),
            'CREDIT_CARD': OperatorConfig('replace', {'new_value': '[CARD]'}),
        },
    )
    return redacted.text

text = 'Email me at john@example.com or call 555-123-4567'
print(redact_pii(text))
# e.g. 'Email me at [EMAIL] or call ********4567'

Redaction preserves text structure. LLM sees valid sentences, not blank holes.
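A related pattern is reversible pseudonymization: replace each PII value with a stable placeholder before calling the LLM, keep the mapping, and restore the originals in the model's response. The sketch below uses a hypothetical `<PII_n>` placeholder format; any unambiguous token scheme works.

```typescript
// Reversible pseudonymization sketch: swap PII for placeholders before the
// LLM call, then restore the originals in the model's reply.
function pseudonymize(text: string, piiValues: string[]) {
  const mapping = new Map<string, string>();
  let redacted = text;
  piiValues.forEach((value, i) => {
    const placeholder = `<PII_${i}>`; // arbitrary placeholder format
    mapping.set(placeholder, value);
    redacted = redacted.split(value).join(placeholder);
  });
  return { redacted, mapping };
}

function restore(text: string, mapping: Map<string, string>): string {
  let result = text;
  for (const [placeholder, value] of mapping) {
    result = result.split(placeholder).join(value);
  }
  return result;
}

const { redacted, mapping } = pseudonymize(
  'Email john@example.com about the invoice',
  ['john@example.com']
);
// redacted: 'Email <PII_0> about the invoice'
// restore(llmReply, mapping) puts the real address back afterwards
```

The third party never sees the real values, but the user still gets a response containing them. The mapping should live only in memory for the duration of the request.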

Conversation Data Retention Policies

Define how long conversation data is stored. Automatic deletion after N days.

// Store conversation with TTL
interface ConversationRecord {
  id: string;
  userId: string;
  messages: Message[];
  createdAt: Date;
  expiresAt: Date; // Delete after 30 days
}

// In DynamoDB
await dynamodb.putItem({
  TableName: 'conversations',
  Item: {
    id: { S: conversationId },
    userId: { S: userId },
    messages: { S: JSON.stringify(messages) },
    expiresAt: { N: (Math.floor(Date.now() / 1000) + 30 * 24 * 60 * 60).toString() }, // TTL in whole epoch seconds
  },
});

// DynamoDB auto-deletes items after TTL
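The TTL arithmetic is easy to get wrong (DynamoDB expects whole epoch seconds, not milliseconds), so it is worth factoring into a helper. `ttlEpochSeconds` is a hypothetical name for illustration:

```typescript
// Helper for DynamoDB TTL values: whole epoch seconds, N days from now.
// DynamoDB deletes items some time (typically within days) after the TTL passes,
// so treat it as garbage collection, not a hard deadline.
function ttlEpochSeconds(days: number, now: Date = new Date()): number {
  return Math.floor(now.getTime() / 1000) + days * 24 * 60 * 60;
}

// Usage: expiresAt: { N: ttlEpochSeconds(30).toString() }
```

Because TTL deletion is eventual, filter out expired items on read as well if exact retention matters for compliance.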

Or use PostgreSQL with UNLOGGED tables for conversation logs that can tolerate loss (UNLOGGED tables skip the write-ahead log, so they are faster but are truncated after a crash):

CREATE UNLOGGED TABLE conversations (
  id UUID PRIMARY KEY,
  user_id UUID NOT NULL,
  messages JSONB NOT NULL,
  created_at TIMESTAMP DEFAULT NOW(),
  expires_at TIMESTAMP DEFAULT NOW() + INTERVAL '30 days'
);

-- Cron job to delete expired
DELETE FROM conversations WHERE expires_at < NOW();

User Data Deletion from AI Systems

Implement user deletion that removes all traces from LLM systems.

export async function deleteUser(userId: string) {
  // 1. Delete conversations
  await db.conversations.deleteMany({
    where: { userId },
  });

  // 2. Delete fine-tuning data
  const jobs = await openai.fineTuning.jobs.list();
  for (const job of jobs.data) {
    // Assumes the user ID was embedded in the training-file name at upload time
    if (job.training_file?.includes(userId)) {
      await openai.files.del(job.training_file);
    }
  }

  // 3. Delete from custom indexes or vector DBs
  await vectorDb.delete({
    filter: { userId },
  });

  // 4. Request OpenAI delete from training (if applicable)
  // Note: This is a manual process with OpenAI support

  // 5. Log deletion
  await db.auditLog.create({
    data: {
      action: 'user_data_deleted',
      userId,
      timestamp: new Date(),
    },
  });
}

For GDPR compliance, data deletion must be fast and traceable.

Privacy Impact Assessment for AI Features

Document privacy decisions for AI features.

# AI Feature: Smart Email Summarization

## Data Processed
- User email content (subject, body)
- Metadata (sender, date, attachments)

## Third-party Services
- OpenAI GPT-4 API

## Privacy Guarantees
- OpenAI does not train on our data (privacy settings enabled)
- Conversations deleted after 30 days
- User can opt-out

## Risks
- Email content sent to OpenAI (encrypted in transit)
- Data shared with OpenAI employees for abuse detection

## Mitigations
- Enable data privacy in OpenAI settings
- Redact PII before sending
- Use Azure OpenAI for sensitive data
- Document opt-out process

## User Consent
- Checkbox in settings: "Enable smart summarization"
- Privacy policy updated

Share this with legal and security teams.

Checklist

  • Document which LLM provider you use and their privacy policy
  • Enable data privacy in OpenAI settings (if using OpenAI API)
  • Implement PII detection with Presidio
  • Redact PII before sending to LLM
  • Define conversation retention policy (< 30 days recommended)
  • Implement automatic deletion of conversation data
  • Set up user data deletion flow
  • If handling sensitive data, migrate to Azure OpenAI or AWS Bedrock
  • Document privacy impact assessment
  • Audit compliance quarterly

Conclusion

LLM privacy is not guaranteed by default. OpenAI's consumer products train on your conversations unless you opt out, and API data is protected by policy rather than by architecture. Azure OpenAI and AWS Bedrock offer stronger contractual guarantees. For maximum privacy, self-host with Ollama or similar. Always detect and redact PII before sending text to third-party LLMs. Document your privacy decisions. Data privacy is a feature, not an afterthought.