Third-Party API Dependency Failure — When Twilio Goes Down and You Can't Send OTPs

Introduction

Every third-party API you depend on is an availability risk you inherit. Twilio, SendGrid, Stripe, Auth0, OpenAI, Google Maps — when any of them has an outage, your users experience your service as broken. The dependency failure becomes your incident. The engineering question isn't "will our dependencies fail?" (they will) but "what does our service do when they do?" The answer needs to be planned before the outage, not improvised during it.

Dependency Failure Patterns

How third-party failures manifest:

1. Hard dependency — service stops entirely
SMS provider down: no OTPs → no logins
Payment processor down: no checkouts
Auth provider down: nobody can log in

2. Latency degradation — service is slow
Calls to geocoding API take 10s instead of 100ms
Every request waiting on the slow call → timeouts cascade

3. Partial failure — some calls succeed, some fail
Email delivery: 30% of emails dropped silently
You don't know what was delivered

4. Rate limit hit — you're being throttled
Monthly quota for your API key exhausted
All requests rejected until next billing cycle

5. Breaking change — provider changed their API
New required field, new response format, deprecated endpoint
Integration broken until you update
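
The patterns above call for different reactions, so it can help to classify each failed call before deciding what to do with it. The following is a minimal sketch, not a complete taxonomy: the names `FailureKind`, `CallOutcome`, `classifyFailure`, and the 2-second latency budget are all assumptions made for illustration. Silent partial failures (pattern 3) cannot be detected from a single response and need delivery receipts or reconciliation instead.

```typescript
// Hypothetical names throughout; the status-code mapping is a heuristic.
type FailureKind = 'ok' | 'slow' | 'unavailable' | 'throttled' | 'contract_error'

interface CallOutcome {
  status: number | null // HTTP status, or null if no response arrived at all
  elapsedMs: number     // wall-clock duration of the call
}

// Assumed latency budget; tune it per dependency.
const SLOW_THRESHOLD_MS = 2_000

function classifyFailure(outcome: CallOutcome): FailureKind {
  if (outcome.status === null) return 'unavailable' // refused connection, DNS failure, timeout
  if (outcome.status === 429) return 'throttled'    // rate limit or exhausted quota
  if (outcome.status >= 500) return 'unavailable'   // provider-side error
  if (outcome.status >= 400) return 'contract_error' // possibly a changed or deprecated API
  if (outcome.elapsedMs > SLOW_THRESHOLD_MS) return 'slow'
  return 'ok'
}
```

A caller might fail over to a secondary provider on `unavailable`, back off on `throttled`, and page a human on `contract_error`, since retrying a request the provider no longer accepts will not help.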

Fix 1: Multi-Provider Fallback for Critical Services

// SMS provider with automatic fallback: Twilio → AWS SNS
interface SMSProvider {
  sendSMS(to: string, message: string): Promise<void>
}

class TwilioProvider implements SMSProvider {
  async sendSMS(to: string, message: string): Promise<void> {
    await twilioClient.messages.create({
      body: message,
      from: process.env.TWILIO_FROM_NUMBER!,
      to,
    })
  }
}

class AWSSNSProvider implements SMSProvider {
  async sendSMS(to: string, message: string): Promise<void> {
    await sns.publish({
      PhoneNumber: to,
      Message: message,
    }).promise()
  }
}

class ResilientSMSService {
  private providers: SMSProvider[]

  constructor() {
    this.providers = [
      new TwilioProvider(),  // Primary
      new AWSSNSProvider(),   // Fallback
    ]
  }

  async sendSMS(to: string, message: string): Promise<void> {
    let lastError: Error | null = null

    for (const provider of this.providers) {
      try {
        await provider.sendSMS(to, message)
        return  // Success on first working provider
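      } catch (err) {
        // `err` is `unknown` under strict TypeScript, so normalize it before reading `.message`
        lastError = err instanceof Error ? err : new Error(String(err))
        logger.warn({ provider: provider.constructor.name, to, error: lastError.message },
          'SMS provider failed, trying next')
      }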
    }

    // All providers failed
    throw new Error(`All SMS providers failed. Last error: ${lastError?.message}`)
  }
}
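
One gap in the fallback loop above: it only moves on when a provider rejects. A provider that hangs would stall the loop and the fallback would never run. A minimal sketch of a per-call timeout follows; `withTimeout` is a hypothetical helper, not part of any SDK, and it does not cancel the underlying request, it only stops waiting for it.

```typescript
// Hypothetical helper: rejects if `work` does not settle within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms)
  })
  // Whichever side wins, clear the timer so it does not linger in the event loop.
  return Promise.race([work, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer)
  })
}
```

Inside the loop, the call could then become `await withTimeout(provider.sendSMS(to, message), 5_000, provider.constructor.name)`, where the 5-second budget is an assumed value to tune against how long a user will wait for a login code.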

Fix 2: Circuit Breaker With Graceful Degradation

import CircuitBreaker from 'opossum'

// Geocoding service with circuit breaker
const geocodeBreaker = new CircuitBreaker(
  (address: string) => geocodingAPI.geocode(address),
  {
    timeout: 3_000,
    errorThresholdPercentage: 30,
    resetTimeout: 60_000,
  }
)

// Fallback when geocoding is unavailable
geocodeBreaker.fallback((address: string) => {
  logger.warn({ address }, 'Geocoding unavailable — using null coordinates')
  // Don't fail the whole request — return null and handle gracefully
  return null
})

// In the application:
async function getProductsNearby(address: string, radius: number) {
  const coords = await geocodeBreaker.fire(address)

  if (!coords) {
    // Geocoding is down: return an unfiltered, unsorted page of products instead of failing
    return db.query('SELECT * FROM products LIMIT 50')
  }

  return db.query(`
    SELECT *, ST_Distance(location, ST_Point($1, $2)) as distance
    FROM products
    WHERE ST_DWithin(location, ST_Point($1, $2), $3)
    ORDER BY distance
  `, [coords.lng, coords.lat, radius])
}
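
To make the `errorThresholdPercentage` and `resetTimeout` options above more concrete, here is a deliberately simplified state machine showing the closed, open, and half-open states. It is an illustration only, not how opossum is implemented (the library tracks statistics in rolling windows); `TinyBreaker`, its `minCalls` guard, and the injectable clock are all assumptions of this sketch.

```typescript
// Simplified illustration; all names are hypothetical.
type BreakerState = 'closed' | 'open' | 'halfOpen'

class TinyBreaker {
  private state: BreakerState = 'closed'
  private successes = 0
  private failures = 0
  private openedAt = 0
  private errorThresholdPercentage: number
  private resetTimeoutMs: number
  private minCalls: number
  private now: () => number

  constructor(errorThresholdPercentage: number, resetTimeoutMs: number,
              minCalls = 5, now: () => number = Date.now) {
    this.errorThresholdPercentage = errorThresholdPercentage
    this.resetTimeoutMs = resetTimeoutMs
    this.minCalls = minCalls // assumed guard: do not trip on the very first error
    this.now = now           // injectable clock, useful in tests
  }

  // Should the next call be attempted, or short-circuited to the fallback?
  allowRequest(): boolean {
    if (this.state === 'open' && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = 'halfOpen' // reset timeout elapsed: let a trial call through
    }
    return this.state !== 'open'
  }

  record(success: boolean): void {
    if (this.state === 'open') return
    if (this.state === 'halfOpen') {
      // The trial call decides: close on success, reopen on failure.
      if (success) this.close()
      else this.open()
      return
    }
    if (success) this.successes++
    else this.failures++
    const total = this.successes + this.failures
    const errorPct = (this.failures / total) * 100
    if (total >= this.minCalls && errorPct >= this.errorThresholdPercentage) this.open()
  }

  getState(): BreakerState { return this.state }

  private open(): void { this.state = 'open'; this.openedAt = this.now() }
  private close(): void { this.state = 'closed'; this.successes = 0; this.failures = 0 }
}
```

With the values used above (30 and 60_000), two failures out of five calls would open the breaker, and calls would go straight to the fallback for a minute before a trial request is allowed.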

Fix 3: Email Delivery With Queue and Fallback

// Email is critical — queue it, retry it, fall back to secondary provider
import Bull from 'bull'

const emailQueue = new Bull('email', { redis: process.env.REDIS_URL })

// Producer: never block on sending email
async function sendEmail(options: EmailOptions): Promise<void> {
  await emailQueue.add(options, {
    attempts: 5,
    backoff: { type: 'exponential', delay: 5000 },
    removeOnComplete: true,
  })
}

// Consumer: try primary provider, fall back to secondary
emailQueue.process(async (job) => {
  const { to, subject, html } = job.data

  // Try SendGrid first
  try {
    await sendgrid.send({ to, from: 'noreply@myapp.com', subject, html })
    return
  } catch (err) {
    logger.warn({ attempt: job.attemptsMade, error: (err as Error).message }, 'SendGrid failed')
  }

  // Once earlier attempts have already failed (attemptsMade >= 2), fall back to AWS SES
  if (job.attemptsMade >= 2) {
    await ses.sendEmail({
      Destination: { ToAddresses: [to] },
      Message: {
        Subject: { Data: subject },
        Body: { Html: { Data: html } },
      },
      Source: 'noreply@myapp.com',
    }).promise()
    return
  }

  throw new Error('Email send failed — will retry')
})
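
It is worth working out how long the retry settings above actually cover. The sketch below assumes the wait simply doubles on each retry, starting from the configured delay. That is a generic model, and the queue library's own exponential formula may differ in detail, so treat the numbers as an estimate. `backoffSchedule` is a hypothetical helper.

```typescript
// Delay before each retry under a plain doubling model: base, 2x, 4x, ...
// Generic sketch; check the queue library's documentation for its exact formula.
function backoffSchedule(attempts: number, baseDelayMs: number): number[] {
  const retries = Math.max(0, attempts - 1) // the first attempt is not a retry
  return Array.from({ length: retries }, (_, i) => baseDelayMs * 2 ** i)
}

const delays = backoffSchedule(5, 5000)                // [5000, 10000, 20000, 40000]
const totalMs = delays.reduce((sum, d) => sum + d, 0)  // 75000
```

Under this model, five attempts with a 5-second base delay span a little over a minute. If both providers are down for longer than that, the job ends up failed, so a longer schedule or a process for replaying failed jobs is worth considering for mail that must arrive.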

Fix 4: Alternative Auth Flows When SMS Is Down

// OTP auth with fallback when SMS is unavailable
async function sendOTP(userId: string, phone: string): Promise<{ method: 'sms' | 'email' | 'totp' }> {
  const otp = generateOTP()
  await storeOTP(userId, otp, { expiresInMinutes: 5 })

  // Try SMS first
  try {
    await smsService.sendSMS(phone, `Your login code: ${otp}`)
    return { method: 'sms' }
  } catch (err) {
    logger.warn({ userId, error: (err as Error).message }, 'SMS OTP failed, falling back to email')
  }

  // Fall back to email OTP
  const user = await db.query('SELECT email FROM users WHERE id = $1', [userId])
  if (user.rows[0]?.email) {
    try {
      await emailService.sendEmail({
        to: user.rows[0].email,
        subject: 'Your login code',
        html: `<p>Your login code is: <strong>${otp}</strong></p>`,
      })
      return { method: 'email' }
    } catch (err) {
      logger.warn({ userId, error: (err as Error).message }, 'Email OTP also failed')
    }
  }

  // If user has TOTP enrolled, suggest that
  const hasTOTP = await db.query('SELECT id FROM totp_enrollments WHERE user_id = $1', [userId])
  if (hasTOTP.rows.length > 0) {
    return { method: 'totp' }  // User can use their authenticator app
  }

  throw new Error('All OTP delivery methods unavailable')
}
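
The `generateOTP()` helper used above is not shown. One possible implementation, sketched here as an assumption rather than the original code, uses Node's `crypto.randomInt` so the code comes from a cryptographically secure source (`Math.random()` is not suitable for login codes). `otpMatches` is a hypothetical companion for comparing a submitted code in constant time.

```typescript
import { randomInt, timingSafeEqual } from 'node:crypto'

// Hypothetical implementation of the generateOTP() helper referenced above.
function generateOTP(digits = 6): string {
  // randomInt draws from a CSPRNG; the upper bound is exclusive.
  const n = randomInt(0, 10 ** digits)
  return n.toString().padStart(digits, '0') // keep leading zeros
}

// Compare a submitted code with the stored one in constant time.
function otpMatches(submitted: string, stored: string): boolean {
  const a = Buffer.from(submitted)
  const b = Buffer.from(stored)
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return a.length === b.length && timingSafeEqual(a, b)
}
```

Because the same stored code is valid whichever channel delivers it, verification does not need to know whether the user received it by SMS or email.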

Fix 5: Monitor Third-Party Dependencies Proactively

// external-health.ts — monitor your dependencies before they affect users
import cron from 'node-cron'

interface DependencyCheck {
  name: string
  check: () => Promise<void>
  timeout: number
}

const dependencies: DependencyCheck[] = [
  {
    name: 'Twilio',
    check: async () => {
      // Lightweight check: verify account status
      await twilioClient.api.accounts(process.env.TWILIO_ACCOUNT_SID!).fetch()
    },
    timeout: 5000,
  },
  {
    name: 'Stripe',
    check: async () => {
      await stripe.balance.retrieve()
    },
    timeout: 5000,
  },
  {
    name: 'SendGrid',
    check: async () => {
      // Check API key validity
      const response = await fetch('https://api.sendgrid.com/v3/user/account', {
        headers: { Authorization: `Bearer ${process.env.SENDGRID_API_KEY}` },
      })
      if (!response.ok) throw new Error(`SendGrid status: ${response.status}`)
    },
    timeout: 5000,
  },
]

cron.schedule('*/5 * * * *', async () => {
  for (const dep of dependencies) {
    try {
      await Promise.race([
        dep.check(),
        new Promise((_, reject) =>
          setTimeout(() => reject(new Error('timeout')), dep.timeout)
        ),
      ])
      metrics.gauge(`dependency.health.${dep.name}`, 1)
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err)
      logger.error({ dependency: dep.name, error: message }, 'Dependency unhealthy')
      metrics.gauge(`dependency.health.${dep.name}`, 0)
      await alerting.warn(`Dependency degraded: ${dep.name}: ${message}`)
    }
  }
})
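
As written, the loop above alerts on every failed check, so a single network blip pages someone and a long outage re-alerts every five minutes. One way to reduce that noise, sketched here under the assumption that alerts should fire only after several consecutive failures, is to track a failure streak per dependency. `FailureStreak` and `Transition` are hypothetical names.

```typescript
// Hypothetical helper: signal 'alert' once when a dependency reaches `threshold`
// consecutive failures, and 'recovered' once when it becomes healthy again.
type Transition = 'none' | 'alert' | 'recovered'

class FailureStreak {
  private streaks = new Map<string, number>()
  private threshold: number

  constructor(threshold: number) {
    this.threshold = threshold
  }

  record(name: string, healthy: boolean): Transition {
    const previous = this.streaks.get(name) ?? 0
    if (healthy) {
      this.streaks.set(name, 0)
      // Only announce recovery if an alert had actually been raised.
      return previous >= this.threshold ? 'recovered' : 'none'
    }
    const current = previous + 1
    this.streaks.set(name, current)
    // Fire exactly once, at the moment the threshold is reached.
    return current === this.threshold ? 'alert' : 'none'
  }
}
```

In the cron job, `alerting.warn` would then be called only when `record(dep.name, false)` returns `'alert'`, while the health gauge keeps being updated on every run.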

Third-Party Dependency Checklist

  • ✅ Every critical third-party service has a fallback (secondary provider or degraded mode)
  • ✅ Circuit breakers protect against slow/failing third-party APIs
  • ✅ Critical flows (auth, payment, email) are queued with retry — never fire-and-forget
  • ✅ Alternative auth flows available when SMS is down (email, TOTP)
  • ✅ Proactive health checks monitor dependency status before users are affected
  • ✅ Dependency status page monitored (subscribe to Twilio/Stripe/SendGrid status pages)
  • ✅ On-call runbook documents what to do when each critical dependency goes down

Conclusion

Third-party API dependencies are a risk you import. The architecture question is: what does your service do when each of these dependencies fails? For every critical dependency, you need an answer that isn't "the feature stops working." Multi-provider fallback handles SMS and email outages. Circuit breakers prevent latency cascades. Queuing with retry turns synchronous failures into delayed successes. Proactive monitoring tells you a dependency is degraded before your users start reporting it. Build the fallbacks when you build the integration, because an outage is the worst possible time to retrofit them.