Third-Party API Dependency Failure — When Twilio Goes Down and You Can't Send OTPs
- Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
Every third-party API you depend on is an availability risk you inherit. Twilio, SendGrid, Stripe, Auth0, OpenAI, Google Maps — when any of them has an outage, your users experience your service as broken. The dependency failure becomes your incident. The engineering question isn't "will our dependencies fail?" (they will) but "what does our service do when they do?" The answer needs to be planned before the outage, not improvised during it.
- Dependency Failure Patterns
- Fix 1: Multi-Provider Fallback for Critical Services
- Fix 2: Circuit Breaker With Graceful Degradation
- Fix 3: Email Delivery With Queue and Fallback
- Fix 4: Alternative Auth Flows When SMS Is Down
- Fix 5: Monitor Third-Party Dependencies Proactively
- Third-Party Dependency Checklist
- Conclusion
Dependency Failure Patterns
How third-party failures manifest:
1. Hard dependency — service stops entirely
→ SMS provider down: no OTPs → no logins
→ Payment processor down: no checkouts
→ Auth provider down: nobody can log in
2. Latency degradation — service is slow
→ Calls to geocoding API take 10s instead of 100ms
→ Every request waiting on the slow call → timeouts cascade
3. Partial failure — some calls succeed, some fail
→ Email delivery: 30% of emails dropped silently
→ You don't know what was delivered
4. Rate limit hit — you're being throttled
→ Short-window limit exceeded: requests rejected (typically HTTP 429) until the window resets
→ Monthly quota exhausted: all requests rejected until the next billing cycle
5. Breaking change — provider changed their API
→ New required field, new response format, deprecated endpoint
→ Integration broken until you update
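The patterns above call for different reactions, so it can help to classify each failed call before deciding what to do with it. The sketch below is one possible way to do that from the HTTP status and elapsed time alone. `FailureKind`, `CallOutcome`, `classifyFailure`, and the 2-second threshold are hypothetical names and values, not part of any provider SDK. Partial failures (pattern 3) are not visible from a single call and need aggregate delivery metrics instead.

```typescript
// Hypothetical classification of a single third-party call (illustrative names).
type FailureKind = 'ok' | 'hard-failure' | 'latency' | 'throttled' | 'request-rejected'

interface CallOutcome {
  status: number | null // HTTP status, or null if no response arrived at all
  elapsedMs: number     // wall-clock duration of the call
}

function classifyFailure(outcome: CallOutcome, slowThresholdMs = 2_000): FailureKind {
  // No response: connection refused, DNS failure, or client-side timeout
  if (outcome.status === null) return 'hard-failure'
  // 429 is the conventional "too many requests" status
  if (outcome.status === 429) return 'throttled'
  // 5xx: the provider itself is failing
  if (outcome.status >= 500) return 'hard-failure'
  // Other 4xx: our request is being rejected, possibly a changed contract or bad credentials
  if (outcome.status >= 400) return 'request-rejected'
  // Successful but slow responses are the early sign of a latency cascade
  if (outcome.elapsedMs > slowThresholdMs) return 'latency'
  return 'ok'
}
```

A rough mapping from result to action: `hard-failure` and `latency` are candidates for fallback providers and circuit breakers, `throttled` for backoff, and `request-rejected` for an alert to the team that owns the integration.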
Fix 1: Multi-Provider Fallback for Critical Services
// SMS provider with automatic fallback: Twilio → AWS SNS
// Assumes twilioClient, sns, and logger are initialized elsewhere.
interface SMSProvider {
  sendSMS(to: string, message: string): Promise<void>
}

class TwilioProvider implements SMSProvider {
  async sendSMS(to: string, message: string): Promise<void> {
    await twilioClient.messages.create({
      body: message,
      from: process.env.TWILIO_FROM_NUMBER!,
      to,
    })
  }
}

class AWSSNSProvider implements SMSProvider {
  async sendSMS(to: string, message: string): Promise<void> {
    await sns.publish({
      PhoneNumber: to,
      Message: message,
    }).promise()
  }
}

class ResilientSMSService {
  private providers: SMSProvider[]

  constructor() {
    this.providers = [
      new TwilioProvider(), // Primary
      new AWSSNSProvider(), // Fallback
    ]
  }

  async sendSMS(to: string, message: string): Promise<void> {
    let lastError: Error | null = null

    for (const provider of this.providers) {
      try {
        await provider.sendSMS(to, message)
        return // Success on first working provider
      } catch (err) {
        // A caught value is not guaranteed to be an Error, so normalize it first
        lastError = err instanceof Error ? err : new Error(String(err))
        logger.warn({ provider: provider.constructor.name, to, error: lastError.message },
          'SMS provider failed, trying next')
      }
    }

    // All providers failed
    throw new Error(`All SMS providers failed. Last error: ${lastError?.message}`)
  }
}
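A fallback path that has never run is hard to trust during an outage. One way to exercise it ahead of time is to pass the provider list in rather than constructing it internally, then drive the same loop with stubs. The sketch below assumes that small refactor. `DownProvider`, `RecordingProvider`, and `sendWithFallback` are hypothetical names used only for illustration.

```typescript
interface SMSProvider {
  sendSMS(to: string, message: string): Promise<void>
}

// Stub that always fails, standing in for a provider during an outage.
class DownProvider implements SMSProvider {
  async sendSMS(): Promise<void> {
    throw new Error('503 Service Unavailable')
  }
}

// Stub that records what it was asked to send instead of calling a real API.
class RecordingProvider implements SMSProvider {
  sent: Array<{ to: string; message: string }> = []

  async sendSMS(to: string, message: string): Promise<void> {
    this.sent.push({ to, message })
  }
}

// Same try-in-order loop as the service above, with providers injected.
// Resolves with the index of the provider that accepted the message.
async function sendWithFallback(
  providers: SMSProvider[],
  to: string,
  message: string
): Promise<number> {
  let lastError: Error | null = null

  for (let i = 0; i < providers.length; i++) {
    try {
      await providers[i].sendSMS(to, message)
      return i
    } catch (err) {
      lastError = err instanceof Error ? err : new Error(String(err))
    }
  }

  throw new Error(`All SMS providers failed. Last error: ${lastError?.message}`)
}
```

Calling `sendWithFallback([new DownProvider(), backup], ...)` with a `RecordingProvider` as `backup` should resolve with index 1 and leave exactly one entry in `backup.sent`, which is the behavior a unit test can pin down.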
Fix 2: Circuit Breaker With Graceful Degradation
import CircuitBreaker from 'opossum'

// Geocoding service with circuit breaker.
// Assumes geocodingAPI, logger, and db are initialized elsewhere.
const geocodeBreaker = new CircuitBreaker(
  (address: string) => geocodingAPI.geocode(address),
  {
    timeout: 3_000,               // Calls slower than 3s count as failures
    errorThresholdPercentage: 30, // Open the circuit once 30% of calls fail
    resetTimeout: 60_000,         // Allow a trial call again after 60s
  }
)

// Fallback when geocoding is unavailable
geocodeBreaker.fallback((address: string) => {
  logger.warn({ address }, 'Geocoding unavailable, using null coordinates')
  // Don't fail the whole request: return null and let the caller degrade
  return null
})

// In the application:
async function getProductsNearby(address: string, radius: number) {
  const coords = await geocodeBreaker.fire(address)

  if (!coords) {
    // Geocoding is down: return products without distance sorting
    return db.query('SELECT * FROM products LIMIT 50')
  }

  return db.query(`
    SELECT *, ST_Distance(location, ST_Point($1, $2)) as distance
    FROM products
    WHERE ST_DWithin(location, ST_Point($1, $2), $3)
    ORDER BY distance
  `, [coords.lng, coords.lat, radius])
}
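To make the breaker's behavior less of a black box, here is a minimal sketch of the closed, open, and half-open state machine that circuit breakers are built around. It is not opossum's implementation: it trips on consecutive failures rather than a rolling error percentage, and `SimpleBreaker` and its method names are hypothetical.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open'

class SimpleBreaker {
  private state: BreakerState = 'closed'
  private failures = 0
  private openedAt = 0
  private readonly failureThreshold: number
  private readonly resetTimeoutMs: number

  constructor(failureThreshold: number, resetTimeoutMs: number) {
    this.failureThreshold = failureThreshold
    this.resetTimeoutMs = resetTimeoutMs
  }

  // Should the next call be attempted? `now` is injectable so tests need no real clock.
  allowRequest(now: number = Date.now()): boolean {
    if (this.state === 'open' && now - this.openedAt >= this.resetTimeoutMs) {
      this.state = 'half-open' // cool-down elapsed: let trial calls through
    }
    return this.state !== 'open'
  }

  // Any success closes the circuit and clears the failure count.
  recordSuccess(): void {
    this.failures = 0
    this.state = 'closed'
  }

  // A failed trial call, or too many failures in a row, opens the circuit.
  recordFailure(now: number = Date.now()): void {
    this.failures += 1
    if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
      this.state = 'open'
      this.openedAt = now
    }
  }

  get current(): BreakerState {
    return this.state
  }
}
```

While the circuit is open, callers skip the slow dependency entirely and go straight to the fallback, which is what stops one slow API from tying up every request.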
Fix 3: Email Delivery With Queue and Fallback
// Email is critical: queue it, retry it, fall back to a secondary provider.
// Assumes sendgrid, ses, and logger are initialized elsewhere.
import Bull from 'bull'

const emailQueue = new Bull('email', { redis: process.env.REDIS_URL })

// Producer: never block on sending email
async function sendEmail(options: EmailOptions): Promise<void> {
  await emailQueue.add(options, {
    attempts: 5,
    backoff: { type: 'exponential', delay: 5000 },
    removeOnComplete: true,
  })
}

// Consumer: try primary provider, fall back to secondary
emailQueue.process(async (job) => {
  const { to, subject, html } = job.data

  // Try SendGrid first
  try {
    await sendgrid.send({ to, from: 'noreply@myapp.com', subject, html })
    return
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err)
    logger.warn({ attempt: job.attemptsMade, error: message }, 'SendGrid failed')
  }

  // job.attemptsMade is 0 on the first run, so SES is only tried
  // from the third attempt onward, after SendGrid has failed twice.
  if (job.attemptsMade >= 2) {
    await ses.sendEmail({
      Destination: { ToAddresses: [to] },
      Message: {
        Subject: { Data: subject },
        Body: { Html: { Data: html } },
      },
      Source: 'noreply@myapp.com',
    }).promise()
    return
  }

  // Throwing marks the job as failed so the queue schedules the next attempt
  throw new Error('Email send failed, will retry')
})
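It is worth working out how long an outage these retry settings can actually ride through. The sketch below assumes plain doubling from the base delay, which may not match the exact formula Bull applies internally, so treat the result as an order-of-magnitude estimate. `retryDelays` is a hypothetical helper, not a Bull API.

```typescript
// Delays between attempts, assuming the wait doubles after each failure.
// (Assumption: simple doubling; the queue library's real formula may differ.)
function retryDelays(attempts: number, baseDelayMs: number): number[] {
  const delays: number[] = []
  // `attempts` includes the first try, so there are attempts - 1 waits
  for (let retry = 1; retry < attempts; retry++) {
    delays.push(baseDelayMs * Math.pow(2, retry - 1))
  }
  return delays
}

const delays = retryDelays(5, 5000)
const totalMs = delays.reduce((sum, d) => sum + d, 0)
```

Under that assumption, five attempts with a 5-second base delay are exhausted within a minute or two of the first failure. Provider outages often last longer than that, so it may be worth raising `attempts` or the base delay, and alerting on jobs that end up in the failed state.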
Fix 4: Alternative Auth Flows When SMS Is Down
// OTP auth with fallback when SMS is unavailable.
// Assumes generateOTP, storeOTP, smsService, emailService, db, and logger exist elsewhere.
async function sendOTP(
  userId: string,
  phone: string
): Promise<{ method: 'sms' | 'email' | 'totp' }> {
  const otp = generateOTP()
  await storeOTP(userId, otp, { expiresInMinutes: 5 })

  // Try SMS first
  try {
    await smsService.sendSMS(phone, `Your login code: ${otp}`)
    return { method: 'sms' }
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err)
    logger.warn({ userId, error: message }, 'SMS OTP failed, falling back to email')
  }

  // Fall back to email OTP
  const user = await db.query('SELECT email FROM users WHERE id = $1', [userId])
  if (user.rows[0]?.email) {
    try {
      await emailService.sendEmail({
        to: user.rows[0].email,
        subject: 'Your login code',
        html: `<p>Your login code is: <strong>${otp}</strong></p>`,
      })
      return { method: 'email' }
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err)
      logger.warn({ userId, error: message }, 'Email OTP also failed')
    }
  }

  // If the user has TOTP enrolled, tell the client to prompt for an authenticator code
  const hasTOTP = await db.query('SELECT id FROM totp_enrollments WHERE user_id = $1', [userId])
  if (hasTOTP.rows.length > 0) {
    return { method: 'totp' } // User can use their authenticator app
  }

  throw new Error('All OTP delivery methods unavailable')
}
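The flow above calls `generateOTP()` without showing it. A minimal sketch of what such a helper might look like follows, assuming a numeric code and Node's built-in `crypto.randomInt` as the random source. The six-digit default is an assumption, not something the original specifies.

```typescript
import { randomInt } from 'node:crypto'

// Generate a zero-padded numeric code from a cryptographically secure source.
// Math.random() is not suitable here because its output is predictable.
function generateOTP(digits: number = 6): string {
  const upperBound = Math.pow(10, digits) // exclusive upper bound, e.g. 1_000_000
  return randomInt(0, upperBound).toString().padStart(digits, '0')
}
```

Padding matters because a value such as 4217 would otherwise be sent as a four-digit code and fail any fixed-length validation on the client.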
Fix 5: Monitor Third-Party Dependencies Proactively
// external-health.ts: monitor your dependencies before they affect users.
// Assumes twilioClient, stripe, metrics, alerting, and logger are initialized elsewhere.
import cron from 'node-cron'

interface DependencyCheck {
  name: string
  check: () => Promise<void>
  timeout: number
}

const dependencies: DependencyCheck[] = [
  {
    name: 'Twilio',
    check: async () => {
      // Lightweight check: verify account status
      await twilioClient.api.accounts(process.env.TWILIO_ACCOUNT_SID!).fetch()
    },
    timeout: 5000,
  },
  {
    name: 'Stripe',
    check: async () => {
      await stripe.balance.retrieve()
    },
    timeout: 5000,
  },
  {
    name: 'SendGrid',
    check: async () => {
      // Check API key validity
      const response = await fetch('https://api.sendgrid.com/v3/user/account', {
        headers: { Authorization: `Bearer ${process.env.SENDGRID_API_KEY}` },
      })
      if (!response.ok) throw new Error(`SendGrid status: ${response.status}`)
    },
    timeout: 5000,
  },
]

cron.schedule('*/5 * * * *', async () => {
  for (const dep of dependencies) {
    let timer: ReturnType<typeof setTimeout> | undefined
    try {
      await Promise.race([
        dep.check(),
        new Promise<never>((_, reject) => {
          timer = setTimeout(() => reject(new Error('timeout')), dep.timeout)
        }),
      ])
      metrics.gauge(`dependency.health.${dep.name}`, 1)
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err)
      logger.error({ dependency: dep.name, error: message }, 'Dependency unhealthy')
      metrics.gauge(`dependency.health.${dep.name}`, 0)
      await alerting.warn(`Dependency degraded: ${dep.name}: ${message}`)
    } finally {
      // Clear the timeout so a fast check does not leave a pending timer behind
      clearTimeout(timer)
    }
  }
})
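The cron job above sends an alert on every failed check, so a one-hour outage produces twelve alerts per dependency, and a single network blip pages someone. One possible refinement is to alert only on state changes. The sketch below is a hypothetical `HealthTracker` that reports a transition after a configurable number of consecutive failures and again on recovery. The threshold of 2 is an assumption to tune.

```typescript
type Transition = 'degraded' | 'recovered' | null

class HealthTracker {
  private consecutiveFailures = new Map<string, number>()
  private alerted = new Set<string>()
  private readonly threshold: number

  constructor(threshold: number = 2) {
    this.threshold = threshold
  }

  // Record one check result; returns a transition only when an alert is warranted.
  record(name: string, healthy: boolean): Transition {
    if (healthy) {
      this.consecutiveFailures.set(name, 0)
      // Set.delete returns true only if we had previously alerted for this dependency
      return this.alerted.delete(name) ? 'recovered' : null
    }

    const failures = (this.consecutiveFailures.get(name) ?? 0) + 1
    this.consecutiveFailures.set(name, failures)

    if (failures >= this.threshold && !this.alerted.has(name)) {
      this.alerted.add(name)
      return 'degraded'
    }
    return null // either below threshold or already alerted
  }
}
```

In the scheduled job, the result of `tracker.record(dep.name, ok)` would decide whether to call the alerting client at all, while the health gauge keeps being emitted on every run.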
Third-Party Dependency Checklist
- ✅ Every critical third-party service has a fallback (secondary provider or degraded mode)
- ✅ Circuit breakers protect against slow/failing third-party APIs
- ✅ Critical flows (auth, payment, email) are queued with retry — never fire-and-forget
- ✅ Alternative auth flows available when SMS is down (email, TOTP)
- ✅ Proactive health checks monitor dependency status before users are affected
- ✅ Dependency status page monitored (subscribe to Twilio/Stripe/SendGrid status pages)
- ✅ On-call runbook documents what to do when each critical dependency goes down
Conclusion
Third-party API dependencies are a risk you import. The architecture question is: what does your service do when each of these dependencies fails? For every critical dependency, you need an answer that isn't "the feature stops working." Multi-provider fallback handles SMS and email outages. Circuit breakers prevent latency cascades. Queuing with retry turns synchronous failures into delayed successes. And proactive monitoring tells you a dependency is degraded before your users start reporting it. Build the fallbacks when you build the integration, because an outage is the worst possible time to retrofit them.