- Published on
LLM Data Privacy — Preventing Your Users' Data From Training OpenAI's Models
- Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
When you send user data to OpenAI's API, you're trusting OpenAI not to train its models on it. But the terms are complex, and many teams don't understand what they've agreed to. Azure OpenAI offers stronger guarantees. AWS Bedrock isolates data. Self-hosted LLMs give you full control.
This post clarifies privacy guarantees from major LLM providers, shows how to detect and redact PII before sending to LLMs, and helps you choose the right LLM architecture for your privacy requirements.
- How LLM Providers Use Your Data
- Azure OpenAI vs OpenAI API Privacy Differences
- AWS Bedrock Data Isolation Guarantees
- Self-Hosted LLMs for Maximum Privacy
- PII Detection Before Sending to LLM
- PII Redaction and Replacement with Synthetic Data
- Conversation Data Retention Policies
- User Data Deletion from AI Systems
- Privacy Impact Assessment for AI Features
- Checklist
- Conclusion
How LLM Providers Use Your Data
OpenAI API (public):
OpenAI states that data sent to the API is not used to train its models by default (a policy in effect since March 2023), though requests may be retained for up to 30 days for abuse monitoring unless you qualify for zero data retention. The consumer ChatGPT product is the opposite: conversations may be used for training unless you opt out in the data controls. Verify the current terms before relying on either; policies change.
Azure OpenAI:
Your data stays in your Azure tenant and is not used for training. Stronger guarantees.
AWS Bedrock:
Data is isolated per region and not used for training. Similar to Azure.
Anthropic (Claude API):
The Claude API does not use your conversations for training unless you explicitly opt in. Opt-in, not opt-out.
Azure OpenAI vs OpenAI API Privacy Differences
| Feature | OpenAI API | Azure OpenAI |
|---|---|---|
| Training on your data | No by default (API; verify terms) | No |
| Data residency | OpenAI-managed servers | Your Azure region |
| HIPAA | BAA available case by case | BAA available |
| Private networking | No | Yes (VNet, private endpoints) |
| Cost | Lower | Typically higher |
| Latency | Shared OpenAI endpoints | Your chosen Azure region |
Choose Azure OpenAI if:
- Handling PII (healthcare, financial, legal)
- HIPAA or compliance requirements
- Data must stay in specific regions
- Security posture matters more than cost
Choose OpenAI API if:
- Non-sensitive data (public content, dev apps)
- Cost is priority
- Default API data-handling terms are sufficient
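In code, the difference is just the endpoint and the auth header. A minimal sketch of building the Azure OpenAI request (the resource name, deployment name, and API version below are placeholder assumptions you would substitute with your own):

```typescript
// Placeholder API version -- pick the one your deployment supports.
const API_VERSION = '2024-02-01';

// Azure OpenAI: requests go to your resource's regional endpoint,
// not to api.openai.com, so data residency follows the resource.
function azureChatUrl(resource: string, deployment: string): string {
  return (
    `https://${resource}.openai.azure.com/openai/deployments/` +
    `${deployment}/chat/completions?api-version=${API_VERSION}`
  );
}

async function chatViaAzure(
  resource: string,
  deployment: string,
  apiKey: string,
  prompt: string
) {
  const response = await fetch(azureChatUrl(resource, deployment), {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'api-key': apiKey, // Azure uses an api-key header, not a Bearer token
    },
    body: JSON.stringify({ messages: [{ role: 'user', content: prompt }] }),
  });
  return response.json();
}
```

With the public OpenAI API the same request would go to https://api.openai.com/v1/chat/completions with an Authorization: Bearer header; the payload shape is the same.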
AWS Bedrock Data Isolation Guarantees
AWS Bedrock provides strong isolation:
- No training data sharing: Your data is not used to train Amazon's models or shared with model providers.
- Regional isolation: Data stays within the region you deploy to.
- Encryption: Data encrypted at rest and in transit.
- No prompt logging: Bedrock does not store your prompts or completions; CloudWatch invocation logging is opt-in.
Cost: Higher than OpenAI API, comparable to Azure OpenAI.
import { BedrockRuntimeClient, InvokeModelCommand } from '@aws-sdk/client-bedrock-runtime';

const client = new BedrockRuntimeClient({ region: 'us-east-1' });

const response = await client.send(
  new InvokeModelCommand({
    modelId: 'anthropic.claude-3-sonnet-20240229-v1:0',
    contentType: 'application/json',
    body: JSON.stringify({
      anthropic_version: 'bedrock-2023-05-31', // Bedrock-specific version string
      max_tokens: 1024,
      messages: [
        {
          role: 'user',
          content: 'Your prompt here',
        },
      ],
    }),
  })
);

// The response body is a Uint8Array of JSON
const result = JSON.parse(new TextDecoder().decode(response.body));
AWS Bedrock is ideal for regulated industries.
Self-Hosted LLMs for Maximum Privacy
Run LLMs on your infrastructure. No data leaves your systems.
Options:
- Ollama (easiest):
ollama pull llama2
ollama run llama2 "Your prompt"
- LM Studio (GUI):
Download model, run locally. Simple UI.
- LLaMA 2, Mistral, or Phi (open-source models):
Fine-tune and deploy on your hardware.
// Using local Ollama (Node 18+ ships a global fetch; node-fetch also works)
async function generateText(prompt: string): Promise<string> {
  const response = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    body: JSON.stringify({
      model: 'llama2',
      prompt,
      stream: false,
    }),
  });
  const data = (await response.json()) as { response: string };
  return data.response;
}

const response = await generateText('Summarise this document');
console.log(response);
Tradeoff: Slower, more expensive to operate, but maximum privacy.
For on-premise deployments with strict data requirements, self-hosted is worth the cost.
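One practical middle ground is routing: send sensitive requests to the self-hosted model and everything else to a cheaper hosted API. A minimal sketch (the keyword rule and route names are illustrative assumptions; a real system would use a PII detector or data classification labels):

```typescript
type Route = 'self-hosted' | 'hosted-api';

// Illustrative sensitivity markers -- not a substitute for real detection.
const SENSITIVE_MARKERS = ['ssn', 'diagnosis', 'account number', 'salary'];

function chooseRoute(prompt: string, userOptedIntoCloud: boolean): Route {
  const lower = prompt.toLowerCase();
  const looksSensitive = SENSITIVE_MARKERS.some((m) => lower.includes(m));
  // Sensitive content, or users who have not consented to cloud
  // processing, stay on infrastructure you control.
  if (looksSensitive || !userOptedIntoCloud) return 'self-hosted';
  return 'hosted-api';
}
```

This keeps the privacy-sensitive traffic local while letting low-risk traffic benefit from hosted model quality and cost.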
PII Detection Before Sending to LLM
Use Microsoft Presidio to detect PII before sending text to LLM APIs. Presidio is a Python library (pip install presidio-analyzer presidio-anonymizer); there is no official npm package, but the project ships Docker images exposing a REST API you can call from Node:
docker run -d -p 5002:3000 mcr.microsoft.com/presidio-analyzer:latest
interface AnalyzerResult {
  entity_type: string;
  start: number;
  end: number;
  score: number;
}

// Call the Presidio analyzer container (port mapping from the docker run above)
async function detectPII(text: string): Promise<AnalyzerResult[]> {
  const response = await fetch('http://localhost:5002/analyze', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text, language: 'en' }),
  });
  const results = (await response.json()) as AnalyzerResult[];
  // Keep only high-confidence detections
  return results.filter((r) => r.score > 0.75);
}

const text = 'Call me at 555-123-4567 or email john@example.com';
const pii = await detectPII(text);
// e.g. [{ entity_type: 'PHONE_NUMBER', ... }, { entity_type: 'EMAIL_ADDRESS', ... }]
Presidio identifies emails, phone numbers, credit cards, SSNs, names, and more.
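If running the Presidio service is not an option, a lightweight regex pass can act as a coarse fallback gate. This catches only obvious formats and is no substitute for a real PII engine, but it costs nothing to run inline:

```typescript
// Coarse patterns for obvious PII formats; a real detector (Presidio,
// AWS Comprehend, etc.) also handles names, addresses, and context.
const PII_PATTERNS: Record<string, RegExp> = {
  EMAIL_ADDRESS: /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/,
  PHONE_NUMBER: /\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b/,
  SSN: /\b\d{3}-\d{2}-\d{4}\b/,
};

// Return the entity types whose pattern matches the text.
function quickPiiScan(text: string): string[] {
  return Object.entries(PII_PATTERNS)
    .filter(([, pattern]) => pattern.test(text))
    .map(([entityType]) => entityType);
}
```

A non-empty result can be used to block or reroute the LLM call until the text has been redacted.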
PII Redaction and Replacement with Synthetic Data
Instead of removing PII, replace it with synthetic data that preserves structure.
docker run -d -p 5001:3000 mcr.microsoft.com/presidio-anonymizer:latest

// Call the anonymizer container; analyzer_results would normally come
// from the /analyze call shown earlier.
async function redactPII(text: string) {
  const response = await fetch('http://localhost:5001/anonymize', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      text,
      analyzer_results: [
        { start: 12, end: 28, score: 0.95, entity_type: 'EMAIL_ADDRESS' },
        { start: 37, end: 49, score: 0.95, entity_type: 'PHONE_NUMBER' },
      ],
      anonymizers: {
        EMAIL_ADDRESS: { type: 'replace', new_value: '[EMAIL]' },
        PHONE_NUMBER: { type: 'mask', masking_char: '*', chars_to_mask: 8, from_end: false },
        CREDIT_CARD: { type: 'replace', new_value: '[CARD]' },
      },
    }),
  });
  const data = (await response.json()) as { text: string };
  return data.text;
}

const text = 'Email me at john@example.com or call 555-123-4567';
const redacted = await redactPII(text);
// e.g. 'Email me at [EMAIL] or call ********4567'
Redaction preserves text structure. LLM sees valid sentences, not blank holes.
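A variant worth considering is reversible pseudonymization: swap each PII value for a stable placeholder before the LLM call, then restore the original values in the model's response. A self-contained sketch (span offsets would normally come from a detector like Presidio; here they are supplied directly):

```typescript
interface PiiSpan {
  start: number;      // inclusive offset into the text
  end: number;        // exclusive offset
  entityType: string; // e.g. 'EMAIL_ADDRESS'
}

// Replace each span with a numbered placeholder and keep the mapping
// so values can be restored in the LLM's answer.
function pseudonymize(text: string, spans: PiiSpan[]) {
  const mapping = new Map<string, string>();
  const ordered = [...spans].sort((a, b) => a.start - b.start);
  let result = '';
  let cursor = 0;
  ordered.forEach((span, i) => {
    const placeholder = `[${span.entityType}_${i + 1}]`;
    mapping.set(placeholder, text.slice(span.start, span.end));
    result += text.slice(cursor, span.start) + placeholder;
    cursor = span.end;
  });
  result += text.slice(cursor);
  return { result, mapping };
}

// Put the real values back into the model's output.
function restore(llmOutput: string, mapping: Map<string, string>): string {
  let restored = llmOutput;
  for (const [placeholder, original] of mapping) {
    restored = restored.split(placeholder).join(original);
  }
  return restored;
}
```

The LLM never sees the raw values, but the user still gets a response that references them correctly.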
Conversation Data Retention Policies
Define how long conversation data is stored. Automatic deletion after N days.
// Store conversation with TTL
interface ConversationRecord {
  id: string;
  userId: string;
  messages: Message[];
  createdAt: Date;
  expiresAt: Date; // Delete after 30 days
}

// In DynamoDB (TTL must be enabled on the table's expiresAt attribute)
await dynamodb.putItem({
  TableName: 'conversations',
  Item: {
    id: { S: conversationId },
    userId: { S: userId },
    messages: { S: JSON.stringify(messages) },
    expiresAt: { N: Math.floor(Date.now() / 1000 + 30 * 24 * 60 * 60).toString() }, // epoch seconds
  },
});
// DynamoDB auto-deletes items shortly after the TTL timestamp passes
Or use PostgreSQL. An UNLOGGED table skips write-ahead logging, so its contents are lost on a crash (often acceptable for ephemeral conversation logs), but expired rows still need a scheduled delete:
CREATE UNLOGGED TABLE conversations (
  id UUID PRIMARY KEY,
  user_id UUID NOT NULL,
  messages JSONB NOT NULL,
  created_at TIMESTAMP DEFAULT NOW(),
  expires_at TIMESTAMP DEFAULT NOW() + INTERVAL '30 days'
);

-- Cron job (or pg_cron) to delete expired rows
DELETE FROM conversations WHERE expires_at < NOW();
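The same 30-day policy can also be enforced in application code as defense in depth: compute the expiry on write and filter expired records on read, so stale conversations are never served even if the background deletion job falls behind. A small sketch:

```typescript
const RETENTION_DAYS = 30; // mirrors the retention policy above
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Compute the expiry timestamp for a newly stored conversation.
function expiryFor(createdAt: Date): Date {
  return new Date(createdAt.getTime() + RETENTION_DAYS * MS_PER_DAY);
}

// Drop any record whose expiry has passed before returning results.
function filterLive<T extends { expiresAt: Date }>(records: T[], now = new Date()): T[] {
  return records.filter((r) => r.expiresAt.getTime() > now.getTime());
}
```

Keeping the retention constant in one place makes it easy to audit that storage, cleanup, and reads all agree on the same window.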
User Data Deletion from AI Systems
Implement user deletion that removes all traces from LLM systems.
export async function deleteUser(userId: string) {
  // 1. Delete conversations
  await db.conversations.deleteMany({
    where: { userId },
  });

  // 2. Delete fine-tuning data (openai-node v4: fineTuning / files.del)
  const jobs = await openai.fineTuning.jobs.list();
  for (const job of jobs.data) {
    // Assumes you encode the userId in your training-file names
    if (job.training_file?.includes(userId)) {
      await openai.files.del(job.training_file);
    }
  }

  // 3. Delete from custom indexes or vector DBs
  await vectorDb.delete({
    filter: { userId },
  });

  // 4. Request OpenAI delete any retained data (a manual process with OpenAI support)

  // 5. Log deletion for auditability
  await db.auditLog.create({
    data: {
      action: 'user_data_deleted',
      userId,
      timestamp: new Date(),
    },
  });
}
For GDPR compliance, data deletion must be fast and traceable.
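To make deletion traceable, it helps to collect a per-store receipt that can be attached to the audit log. A sketch, assuming each store exposes its own async delete step (the store names here are illustrative):

```typescript
interface DeletionReceipt {
  store: string;
  deleted: boolean;
  error?: string;
}

// Run each store's delete step independently so one failure does not
// silently skip the rest, and record the outcome per store.
async function deleteEverywhere(
  userId: string,
  steps: Record<string, (id: string) => Promise<void>>
): Promise<DeletionReceipt[]> {
  const receipts: DeletionReceipt[] = [];
  for (const [store, step] of Object.entries(steps)) {
    try {
      await step(userId);
      receipts.push({ store, deleted: true });
    } catch (err) {
      receipts.push({ store, deleted: false, error: String(err) });
    }
  }
  return receipts;
}
```

Any receipt with deleted: false can trigger a retry or an alert, which is exactly the traceability GDPR reviews ask for.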
Privacy Impact Assessment for AI Features
Document privacy decisions for AI features.
# AI Feature: Smart Email Summarization
## Data Processed
- User email content (subject, body)
- Metadata (sender, date, attachments)
## Third-party Services
- OpenAI GPT-4 API
## Privacy Guarantees
- OpenAI's API terms exclude API data from model training (verified and documented)
- Conversations deleted after 30 days
- User can opt-out
## Risks
- Email content sent to OpenAI (encrypted in transit)
- Retained data may be reviewed by OpenAI for abuse monitoring
## Mitigations
- Verify API data-handling terms; request zero data retention if eligible
- Redact PII before sending
- Use Azure OpenAI for sensitive data
- Document opt-out process
## User Consent
- Checkbox in settings: "Enable smart summarization"
- Privacy policy updated
Share this with legal and security teams.
Checklist
- Document which LLM provider you use and their privacy policy
- Verify your provider's training exclusion (and request zero data retention from OpenAI if eligible)
- Implement PII detection with Presidio
- Redact PII before sending to LLM
- Define conversation retention policy (< 30 days recommended)
- Implement automatic deletion of conversation data
- Set up user data deletion flow
- If handling sensitive data, migrate to Azure OpenAI or AWS Bedrock
- Document privacy impact assessment
- Audit compliance quarterly
Conclusion
LLM privacy is not guaranteed by default. Consumer tools like ChatGPT follow an opt-out model, and even API data may be retained for abuse monitoring unless you qualify for zero data retention, so verify the terms rather than assuming them. Azure OpenAI and AWS Bedrock offer stronger guarantees. For maximum privacy, self-host with Ollama or similar. Always detect and redact PII before sending to third-party LLMs. Document your privacy decisions. Data privacy is a feature, not an afterthought.