Bot Traffic Killing Your APIs — When 80% of Your Traffic Isn't Human

Introduction

Bot traffic is the background noise of the internet, but for most APIs, it's not background — it's the majority of traffic. Content scrapers, price monitors, competitor intelligence bots, credential stuffers, and inventory hoarders collectively generate more traffic than real users on most public-facing services. They pay nothing, strain your infrastructure, and if you're not distinguishing them from legitimate traffic, they're affecting your real users' experience. The defense is layered: rate limiting, bot fingerprinting, behavioral analysis, and CAPTCHAs at the right friction points.

Bot Traffic Patterns

How to identify bot traffic:

1. Velocity attacks — clearly non-human
   • 1,000 requests in 10 seconds from one IP
   • Same endpoint, same payload, sequential IDs
   • No pause between requests

2. Credential stuffing — bots checking stolen passwords
   • Login endpoint: many failed attempts
   • Rotating IPs, but the same user-agent
   • Attempts distributed across IPs but with the same timing pattern

3. Scrapers — taking your content
   • All product pages visited in rapid sequence
   • No CSS/image requests (headless browser or direct HTTP)
   • No session cookies carried between requests

4. Inventory manipulation
   • Add-to-cart without checkout (hoarding)
   • Price-change events trigger an immediate response
   • No "browse" behavior before purchase

5. Account creation abuse
   • Signups from disposable email domains
   • Same IP, slightly varied user data
   • No email verification follow-through
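The velocity pattern (1) is the easiest to detect mechanically. As a rough sketch, a detector over a request log might look like this — the `RequestLog` shape and the 50ms gap threshold are illustrative assumptions, not a standard:

```typescript
// Hypothetical request-log entry — adapt to your own logging schema
interface RequestLog {
  ip: string
  path: string
  timestamp: number // ms since epoch
}

// Flags a velocity attack: `maxRequests` or more requests inside `windowMs`,
// with uniformly tiny gaps between requests ("no pause" behavior).
function isVelocityAttack(
  log: RequestLog[],
  windowMs = 10_000,
  maxRequests = 1000,
): boolean {
  if (log.length < 2) return false
  const sorted = [...log].sort((a, b) => a.timestamp - b.timestamp)
  const span = sorted[sorted.length - 1].timestamp - sorted[0].timestamp
  if (span > windowMs || sorted.length < maxRequests) return false
  // "No pause between requests": every inter-request gap under 50ms
  return sorted.every((r, i) => i === 0 || r.timestamp - sorted[i - 1].timestamp < 50)
}
```

In practice you would run this per-IP over a rolling buffer rather than a full log, but the thresholds are the interesting part: a human cannot produce 1,000 evenly spaced requests in 10 seconds.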

Fix 1: Rate Limiting With Multiple Dimensions

// Rate limit by IP, by user, and by endpoint — not just one dimension
import { RateLimiterRedis } from 'rate-limiter-flexible'
import { createClient } from 'redis'
import type { Request, Response, NextFunction } from 'express'

const redis = createClient({ url: process.env.REDIS_URL })
await redis.connect()  // node-redis v4 clients must connect before use

// Burst limiter: short window, catches immediate spikes
const burstLimiter = new RateLimiterRedis({
  storeClient: redis,
  keyPrefix: 'rl_burst',
  points: 30,      // 30 requests
  duration: 10,    // per 10 seconds
  blockDuration: 60, // Block for 60 seconds if exceeded
})

// Sustained limiter: longer window, catches sustained bots
const sustainedLimiter = new RateLimiterRedis({
  storeClient: redis,
  keyPrefix: 'rl_sustained',
  points: 500,     // 500 requests
  duration: 3600,  // per hour
  blockDuration: 3600,
})

async function rateLimitMiddleware(req: Request, res: Response, next: NextFunction) {
  const ip = req.ip
  const userId = req.user?.id

  try {
    // Check both burst and sustained limits
    await Promise.all([
      burstLimiter.consume(ip),
      sustainedLimiter.consume(ip),
      userId ? burstLimiter.consume(`user_${userId}`) : Promise.resolve(),
    ])
    next()
  } catch (err) {
    // Math.ceil returns NaN (not null) on a missing value, so default first
    const msBeforeNext = (err as any)?.msBeforeNext ?? 60_000
    const retryAfter = Math.ceil(msBeforeNext / 1000)

    res.set({
      'Retry-After': String(retryAfter),
      'X-RateLimit-Limit': String(burstLimiter.points),
      'X-RateLimit-Reset': new Date(Date.now() + msBeforeNext).toISOString(),
    })

    return res.status(429).json({
      error: 'Too many requests',
      retryAfter,
    })
  }
}
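Conceptually, the burst + sustained combination is just two sliding windows that must both have capacity. A Redis-free, in-memory sketch of that idea (for illustration and tests only — production traffic needs the shared Redis store above):

```typescript
// In-memory sliding-window limiter illustrating the burst + sustained idea.
class SlidingWindowLimiter {
  private hits = new Map<string, number[]>()
  constructor(private points: number, private durationMs: number) {}

  consume(key: string, now = Date.now()): boolean {
    const windowStart = now - this.durationMs
    const recent = (this.hits.get(key) ?? []).filter(t => t > windowStart)
    if (recent.length >= this.points) {
      this.hits.set(key, recent)
      return false // limit exceeded
    }
    recent.push(now)
    this.hits.set(key, recent)
    return true
  }
}

// Mirror the article's dimensions: 30 req/10s burst plus 500/hour sustained
const burst = new SlidingWindowLimiter(30, 10_000)
const sustained = new SlidingWindowLimiter(500, 3_600_000)

function allowRequest(ip: string, now = Date.now()): boolean {
  // Both windows must have capacity for the request to pass
  return burst.consume(ip, now) && sustained.consume(ip, now)
}
```

The sustained window is what catches patient bots: a scraper issuing one request every two seconds never trips the burst limit, but crosses 500/hour in under 20 minutes.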

Fix 2: Bot Fingerprinting

// Identify bots by their request characteristics
interface BotSignal {
  signal: string
  weight: number
}

// Last-seen timestamp per IP, for the timing signal below
// (in-memory; use Redis with a TTL if you run multiple instances)
const recentRequests = new Map<string, number>()

function calculateBotScore(req: Request): number {
  const signals: BotSignal[] = []
  let score = 0

  // Missing common browser headers
  if (!req.headers['accept-language']) {
    signals.push({ signal: 'no_accept_language', weight: 20 })
  }

  if (!req.headers['accept-encoding']) {
    signals.push({ signal: 'no_accept_encoding', weight: 20 })
  }

  // Known bot user agents
  const ua = req.headers['user-agent'] ?? ''
  const botPatterns = [/bot/i, /crawler/i, /spider/i, /curl/i, /wget/i, /python-requests/i, /go-http/i]
  if (botPatterns.some(p => p.test(ua))) {
    signals.push({ signal: 'bot_user_agent', weight: 60 })
  }

  // Missing or wrong referer for page navigations
  if (req.path.startsWith('/api/') && !req.headers['referer']) {
    signals.push({ signal: 'no_referer_on_api', weight: 10 })
  }

  // Request timing too fast (< 50ms since last request from this IP)
  const lastRequest = recentRequests.get(req.ip)
  if (lastRequest && Date.now() - lastRequest < 50) {
    signals.push({ signal: 'too_fast', weight: 30 })
  }

  score = signals.reduce((sum, s) => sum + s.weight, 0)
  recentRequests.set(req.ip, Date.now())

  if (score > 40) {
    logger.warn({ ip: req.ip, score, signals }, 'High bot score detected')
  }

  return score
}

// Apply bot score to routing
app.use((req, res, next) => {
  const botScore = calculateBotScore(req)
  req.botScore = botScore

  if (botScore >= 80) {
    // High confidence bot — block or challenge
    return res.status(403).json({ error: 'Request blocked' })
  }

  if (botScore >= 50) {
    // Suspected bot — rate limit more aggressively
    req.rateLimit = 'strict'
  }

  next()
})
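The `req.botScore` and `req.rateLimit` assignments assume Express's `Request` type has been extended. One way to do that is declaration merging in a `.d.ts` file picked up by your tsconfig — a sketch, with the `user` field included because the middleware above also reads `req.user`:

```typescript
// express.d.ts — type-level only, no runtime code.
declare global {
  namespace Express {
    interface Request {
      botScore?: number
      rateLimit?: 'strict' | 'normal'
      user?: { id: string }  // shape assumed; match your auth middleware
    }
  }
}
export {}  // make this file a module so `declare global` applies
```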

Fix 3: Behavioral Analysis for Credential Stuffing

// credential-stuffing-detector.ts
// Multiple failed logins across many accounts from same IP cluster

interface LoginAttempt {
  ip: string
  email: string
  success: boolean
  timestamp: number  // ms since epoch, as stored via Date.now()
}

async function detectCredentialStuffing(ip: string, email: string, success: boolean): Promise<void> {
  await redis.lPush(`login_attempts:${ip}`, JSON.stringify({
    email,
    success,
    timestamp: Date.now(),
  }))
  await redis.expire(`login_attempts:${ip}`, 3600)

  const attempts = await redis.lRange(`login_attempts:${ip}`, 0, 99)
  const parsed: LoginAttempt[] = attempts.map(a => JSON.parse(a))

  // Credential stuffing pattern: many different emails, mostly failures
  const uniqueEmails = new Set(parsed.map(a => a.email)).size
  const failureRate = parsed.filter(a => !a.success).length / parsed.length

  if (uniqueEmails > 10 && failureRate > 0.8) {
    // This IP is credential stuffing
    await redis.set(`blocked_ip:${ip}`, '1', { EX: 86400 })  // Block for 24 hours
    await alerting.critical(`Credential stuffing detected from ${ip}: ${uniqueEmails} unique emails, ${(failureRate * 100).toFixed(0)}% failure rate`)
  }

  // Single account stuffing: too many failures on one account
  const accountAttempts = parsed.filter(a => a.email === email)
  if (accountAttempts.length > 10) {
    // Lock the account temporarily
    await redis.set(`account_locked:${email}`, '1', { EX: 900 })  // 15 minutes
    logger.warn({ email, ip, attempts: accountAttempts.length }, 'Account temporarily locked')
  }
}
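The two thresholds above (many distinct emails, mostly failures) can be factored into a pure predicate so they are unit-testable without Redis. A minimal sketch, using the same numbers as the detector:

```typescript
// Pure extraction of the stuffing thresholds: more than 10 distinct emails
// and over 80% failures from one IP's recent attempts.
function isStuffingPattern(attempts: { email: string; success: boolean }[]): boolean {
  if (attempts.length === 0) return false
  const uniqueEmails = new Set(attempts.map(a => a.email)).size
  const failureRate = attempts.filter(a => !a.success).length / attempts.length
  return uniqueEmails > 10 && failureRate > 0.8
}
```

Note what this deliberately does not catch: a brute-force run against a single account stays under the unique-email threshold, which is why the single-account lockout above exists as a separate check.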

Fix 4: Protect Scraping-Prone Endpoints

// Anti-scraping for product catalog or content pages
// Make it expensive to scrape without blocking legitimate users

// 1. Require a browser challenge for high-value pages
// (Use Cloudflare Turnstile, hCaptcha, or similar — not reCAPTCHA v2 which hurts UX)

// 2. Return data gradually — force pagination that bots find expensive
const sleep = (ms: number) => new Promise(resolve => setTimeout(resolve, ms))

router.get('/api/products', async (req, res) => {
  // parseInt returns NaN (not null) on bad input, so use || rather than ??
  const page = parseInt(req.query.page as string, 10) || 1
  const limit = Math.min(parseInt(req.query.limit as string, 10) || 20, 20)  // Max 20 per page

  // Add artificial delay for non-authenticated requests to raise scraping cost
  if (!req.user && (req.botScore ?? 0) > 30) {
    await sleep(500 + Math.random() * 500)  // 500-1000ms delay
  }

  const products = await db.query(
    'SELECT id, name, price FROM products LIMIT $1 OFFSET $2',
    [limit, (page - 1) * limit]
  )

  // Omit the next_page cursor for high bot-score requests
  const nextPage = (req.botScore ?? 0) < 30 ? page + 1 : undefined

  res.json({ products: products.rows, nextPage })
})

// 3. Honeypot endpoint — only bots will visit it
router.get('/api/internal/all-products', async (req, res) => {
  // Never linked from the UI and disallowed in robots.txt —
  // any request here is a bot, so log and block the IP
  logger.warn({ ip: req.ip, userAgent: req.headers['user-agent'] }, 'Honeypot triggered')
  await redis.set(`blocked_ip:${req.ip}`, '1', { EX: 86400 })
  res.status(404).json({ error: 'Not found' })
})
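Point 1 above recommends a browser challenge such as Cloudflare Turnstile. Server-side verification of the client's token is a single POST to Cloudflare's siteverify endpoint; this sketch uses the documented URL and field names, but treat the exact request shape and `TURNSTILE_SECRET` variable as assumptions to check against current docs:

```typescript
// Verify a Cloudflare Turnstile token server-side before serving the
// protected page. Returns true only when Cloudflare confirms the challenge.
async function verifyTurnstile(token: string, ip?: string): Promise<boolean> {
  const res = await fetch('https://challenges.cloudflare.com/turnstile/v0/siteverify', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      secret: process.env.TURNSTILE_SECRET,  // your site's secret key
      response: token,                        // token from the client widget
      remoteip: ip,                           // optional, for extra validation
    }),
  })
  const data = await res.json() as { success: boolean }
  return data.success
}
```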

Fix 5: Cloudflare as the First Line of Defense

# cloudflare-rules.yaml — WAF rules to block obvious bots
# (Even on free plan, Cloudflare handles much of this automatically)

rules:
  # Block known bad user agents
  - name: Block known crawlers
    expression: '(http.user_agent contains "python-requests") or
                 (http.user_agent contains "Go-http-client") or
                 (http.user_agent contains "curl") or
                 (http.user_agent eq "")'
    action: block

  # Rate limit login endpoint specifically
  - name: Login rate limit
    expression: 'http.request.uri.path eq "/api/auth/login"'
    action: rate_limit
    ratelimit:
      requests_per_period: 5
      period: 60
      mitigation_timeout: 300

  # Challenge traffic from datacenter/hosting networks if applicable to your business
  - name: Challenge datacenter traffic
    expression: '(ip.geoip.asnum in {396982 14061 16276})'  # Example hosting ASNs (Google Cloud, DigitalOcean, OVH)
    action: managed_challenge

Bot Defense Checklist

  • ✅ Rate limiting: burst (30 req/10s) + sustained (500/hour) per IP
  • ✅ Bot fingerprinting scores requests by header patterns and timing
  • ✅ Credential stuffing detection: block IPs with high failure rates across many accounts
  • ✅ Account lockout after repeated failures (with CAPTCHA to unlock, not just time)
  • ✅ Honeypot endpoints catch aggressive scrapers
  • ✅ Cloudflare or equivalent WAF handles volumetric bot traffic at the edge
  • ✅ Bot traffic metrics separated from legitimate traffic in dashboards

Conclusion

Bot traffic is a cost and reliability problem disguised as a security problem. The bots that scrape your product catalog are paying your RDS and egress bills. The credential stuffers are hammering your authentication service. The inventory hoarders are degrading your real users' experience. The defense is layered: rate limiting handles volume, bot fingerprinting handles suspicious clients, behavioral analysis catches sophisticated attacks, and Cloudflare catches everything else at the edge before it hits your servers. No single layer is sufficient — but all five together make your service economically uninteresting to scrape.