Temporal.io — Durable Workflows That Survive Server Crashes and Network Failures

Introduction

Temporal.io is a workflow orchestration engine that runs workflows to completion even when your servers crash, your database goes down, or your network flakes. With plain queues (BullMQ, SQS), retries, state tracking, and recovery across multiple steps are left to application code. Temporal instead records every step of a workflow in an event history and deterministically replays that history to rebuild state after a failure. This post covers Temporal's core concepts, production patterns, and how to architect scalable workflow systems.

Temporal Core Concepts

Temporal's architecture separates concerns into three roles: Workflows (durable state machines), Activities (side effects), and Workers (execution engines). Understanding this separation is crucial.

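A Workflow is a deterministic state machine. Its code must be replay-safe: Temporal re-executes the workflow function against the recorded event history to recover state, so every run has to issue the same commands in the same order. Keep wall-clock time, randomness, and I/O out of workflow logic. The TypeScript SDK helps by running workflows in a sandbox that swaps Math.random(), Date, and setTimeout for deterministic versions, but network, filesystem, and database access still belong in activities.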

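An Activity is where side effects happen: API calls, database writes, file uploads. Activities are retried automatically according to their retry policy, and their results are recorded in the workflow history, so a replay reuses the recorded result instead of running the activity again. Because an activity can run more than once before it succeeds, it should be idempotent.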

A Task Queue is a work distribution channel. Workers poll task queues and execute workflows or activities. Multiple workers can process the same queue for scalability.

// workflow.ts - Deterministic state machine
import { proxyActivities, defineQuery, setHandler, sleep } from '@temporalio/workflow';
import type * as activities from './activities';

const { fetchUserData, sendWelcomeEmail, chargeCard } = proxyActivities<typeof activities>({
  startToCloseTimeout: '5 minutes',
});

export interface OnboardingInput {
  userId: string;
  email: string;
}

// Query: read current workflow state
export const getWorkflowStatus = defineQuery<string>('getStatus');

export async function onboardingWorkflow(input: OnboardingInput): Promise<void> {
  // Handlers are registered inside the workflow function so they can read its local state
  let status = 'fetching-user';
  setHandler(getWorkflowStatus, () => status);

  // Step 1: Fetch user data (activity with built-in retry)
  const userData = await fetchUserData(input.userId);

  // Step 2: Durable timer; it survives worker restarts
  status = 'waiting';
  await sleep('2 hours');

  // Step 3: Charge card (retried according to the activity retry policy)
  // Assumption: the user record carries the amount to charge, which chargeCard requires
  status = 'charging';
  await chargeCard(userData.paymentId, userData.amount);

  // Step 4: Send email
  status = 'sending-email';
  await sendWelcomeEmail(input.email, userData.firstName);
  status = 'completed';
}

Activity Retries and Timeouts

Activities fail. Temporal's retry policy ensures transient failures don't kill your workflow. You configure:

  • initialInterval: Delay before the first retry (default: 1 second)
  • maximumInterval: Cap on the delay between retries (default: 100 times initialInterval)
  • backoffCoefficient: Multiplier applied to the delay after each retry (default: 2.0)
  • maximumAttempts: Total attempts, including the first one, before giving up (default: unlimited)
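
To make the interaction of these four settings concrete, here is a minimal sketch that computes the wait before each retry. The RetryPolicySketch type and the retryDelaysMs helper are hypothetical names used for illustration only; they are not part of the Temporal SDK, and intervals are plain milliseconds rather than duration strings.

```typescript
// Hypothetical type mirroring the four retry settings above (intervals in ms).
interface RetryPolicySketch {
  initialIntervalMs: number;
  maximumIntervalMs: number;
  backoffCoefficient: number;
  maximumAttempts: number; // total attempts, including the first one
}

// Returns the delay before each retry (i.e. before attempt 2, 3, ...).
function retryDelaysMs(policy: RetryPolicySketch): number[] {
  const delays: number[] = [];
  for (let retry = 0; retry < policy.maximumAttempts - 1; retry++) {
    const uncapped = policy.initialIntervalMs * Math.pow(policy.backoffCoefficient, retry);
    delays.push(Math.min(uncapped, policy.maximumIntervalMs));
  }
  return delays;
}

const delays = retryDelaysMs({
  initialIntervalMs: 1000,
  maximumIntervalMs: 60000,
  backoffCoefficient: 2,
  maximumAttempts: 8,
});
console.log(delays); // [1000, 2000, 4000, 8000, 16000, 32000, 60000]
```

With a 1-second initial interval and a coefficient of 2, the seventh retry would wait 64 seconds, so the 1-minute cap takes over. In the TypeScript SDK, the real settings go under the retry key of the options object passed to proxyActivities.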

// activities.ts - Real production activity
import axios from 'axios';
import { ApplicationFailure, Context } from '@temporalio/activity';

interface PaymentGatewayResponse {
  transactionId: string;
  status: 'success' | 'failed';
}

export async function chargeCard(paymentId: string, amount: number): Promise<PaymentGatewayResponse> {
  // The key must be identical on every retry, so derive it from the workflow and
  // activity IDs (both stay fixed across attempts) rather than from Date.now()
  const { workflowExecution, activityId } = Context.current().info;
  const idempotencyKey = `charge-${workflowExecution.workflowId}-${activityId}`;

  try {
    const response = await axios.post<PaymentGatewayResponse>('https://payment-gateway.example.com/charge', {
      paymentId,
      amount,
      idempotencyKey,
    }, {
      timeout: 10000,
    });

    return response.data;
  } catch (error) {
    // A 409 means the gateway rejected the request as a duplicate; retrying cannot help
    if (axios.isAxiosError(error) && error.response?.status === 409) {
      throw ApplicationFailure.nonRetryable('Duplicate charge attempt', 'DuplicateCharge');
    }
    // Everything else (ECONNRESET, ETIMEDOUT, 5xx, ...) is rethrown and retried by Temporal
    throw error;
  }
}

export async function fetchUserData(userId: string) {
  const response = await axios.get(`https://api.example.com/users/${userId}`, {
    timeout: 5000,
  });
  return response.data;
}

export async function sendWelcomeEmail(email: string, firstName: string): Promise<void> {
  // Email service call with idempotent semantics
  await axios.post('https://email-service.example.com/send', {
    to: email,
    template: 'welcome',
    context: { firstName },
    deduplicationId: `welcome-${email}`,
  });
}
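
An idempotency key only prevents double charges if every retry of the same activity sends the same key. The toy sketch below illustrates the difference between a stable key and one that changes per attempt. FakeGateway and runAttempts are hypothetical stand-ins written for this example, not a real payment API or part of Temporal.

```typescript
// Toy in-memory gateway that deduplicates charges by idempotency key.
class FakeGateway {
  private seen = new Map<string, string>();
  chargeCount = 0;

  charge(idempotencyKey: string): string {
    const existing = this.seen.get(idempotencyKey);
    if (existing !== undefined) return existing; // repeated request: no new charge
    this.chargeCount++;
    const txId = `tx_${this.chargeCount}`;
    this.seen.set(idempotencyKey, txId);
    return txId;
  }
}

// Simulates an activity that is attempted several times, e.g. because the
// response was lost after the gateway had already processed the request.
function runAttempts(gateway: FakeGateway, attempts: number, keyFor: (attempt: number) => string): void {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    gateway.charge(keyFor(attempt));
  }
}

const stable = new FakeGateway();
runAttempts(stable, 3, () => 'charge-workflow-42-activity-7');
console.log(stable.chargeCount); // 1

const unstable = new FakeGateway();
runAttempts(unstable, 3, (attempt) => `charge-pm_123-attempt-${attempt}`);
console.log(unstable.chargeCount); // 3
```

A key built from a timestamp behaves like the second case: each retry looks like a new request to the gateway.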

Workflow Signals and Queries

Workflows can be interrupted mid-execution by signals. Signals communicate external events: user cancelled subscription, admin paused process, payment received.

Queries let external systems read workflow state without modifying it. Use queries for dashboards, status pages.

// workflow-with-signals.ts
import { proxyActivities, defineSignal, defineQuery, setHandler, condition } from '@temporalio/workflow';
import type * as activities from './activities';

const { sendRefund, logEvent } = proxyActivities<typeof activities>({
  startToCloseTimeout: '10 minutes',
});

export interface SubscriptionWorkflowInput {
  subscriptionId: string;
  monthlyPrice: number;
}

// Signal: pause subscription
export const pauseSubscription = defineSignal<[string]>('pauseSubscription');

// Query: get current status
export const getSubscriptionStatus = defineQuery<{ active: boolean; reason: string }>('getStatus');

export async function subscriptionWorkflow(input: SubscriptionWorkflowInput): Promise<void> {
  // State lives inside the workflow function so each execution has its own copy
  let subscriptionActive = true;
  let cancellationReason = '';
  let billingCycleCount = 0;

  setHandler(pauseSubscription, (reason: string) => {
    subscriptionActive = false;
    cancellationReason = reason;
  });
  setHandler(getSubscriptionStatus, () => ({
    active: subscriptionActive,
    reason: cancellationReason,
  }));

  while (subscriptionActive) {
    // Wait up to one month, but wake up as soon as the signal arrives.
    // condition() resolves to true when the predicate holds and false on timeout.
    const paused = await condition(() => !subscriptionActive, '30 days');

    if (!paused) {
      billingCycleCount++;
      await logEvent('billing-cycle-started', { cycleNumber: billingCycleCount });
    }
  }

  // User paused: issue the refund (this sketch refunds the full monthly price)
  await sendRefund(input.subscriptionId, input.monthlyPrice);
  await logEvent('subscription-cancelled', { reason: cancellationReason });
}
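
A pro-rata refund returns only the unused share of the billing cycle, whereas the workflow above passes the full monthlyPrice to sendRefund. A minimal sketch of the arithmetic follows, assuming a hypothetical helper named proRataRefundCents, amounts in cents, and the 30-day cycle used in the workflow. How daysUsed is obtained (tracked by the workflow or computed in an activity) is left open.

```typescript
// Hypothetical helper: refund for the unused part of a billing cycle.
// Amounts are integers in minor units (e.g. cents) to avoid floating-point drift.
function proRataRefundCents(monthlyPriceCents: number, daysUsed: number, daysInCycle = 30): number {
  if (daysInCycle <= 0) throw new RangeError('daysInCycle must be positive');
  const used = Math.min(Math.max(daysUsed, 0), daysInCycle); // clamp to [0, daysInCycle]
  return Math.round((monthlyPriceCents * (daysInCycle - used)) / daysInCycle);
}

console.log(proRataRefundCents(3000, 10)); // 2000
console.log(proRataRefundCents(3000, 30)); // 0
```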

Deterministic Workflow Constraints

Workflows replay from history. This means:

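  1. No side effects in workflow code: Only call activities or use Temporal utilities
  2. No external time: Use sleep() for timers. The TypeScript SDK sandbox also replaces setTimeout() and Date with deterministic versions
  3. No unrecorded randomness: The sandbox seeds Math.random() deterministically, and uuid4() from @temporalio/workflow produces replay-safe IDs
  4. No mutable external state: Workflows can't depend on changing globals

Violating these causes non-determinism errors on replay. The fix: move the offending code to an activity.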
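
To see why replay demands determinism, consider a toy model of history-based replay. This is a deliberately simplified sketch, not Temporal's implementation: ReplayContext, HistoryEvent, and NonDeterminismError are hypothetical names, and activities are synchronous functions.

```typescript
// Toy model: each completed activity is recorded as one history event.
type HistoryEvent = { activity: string; result: unknown };

class NonDeterminismError extends Error {}

class ReplayContext {
  private cursor = 0;
  constructor(private readonly history: HistoryEvent[]) {}

  // Returns the recorded result if this step is already in history;
  // otherwise runs the activity and appends its result.
  callActivity<T>(name: string, run: () => T): T {
    if (this.cursor < this.history.length) {
      const event = this.history[this.cursor++];
      if (event.activity !== name) {
        throw new NonDeterminismError(`history has "${event.activity}", code asked for "${name}"`);
      }
      return event.result as T;
    }
    const result = run();
    this.history.push({ activity: name, result });
    this.cursor++;
    return result;
  }
}

let charges = 0;
const chargeCard = (): string => { charges++; return 'tx_1'; };

// Deterministic: always issues the same activity calls in the same order.
function goodFlow(ctx: ReplayContext): string {
  ctx.callActivity('fetchUser', () => ({ id: '123' }));
  return ctx.callActivity('chargeCard', chargeCard);
}

// Non-deterministic: the branch depends on a value that is not in history.
function badFlow(ctx: ReplayContext, coinFlip: boolean): string {
  return coinFlip
    ? ctx.callActivity('chargeCard', chargeCard)
    : ctx.callActivity('sendCoupon', () => 'coupon_1');
}

const history: HistoryEvent[] = [];
goodFlow(new ReplayContext(history)); // first run executes both activities
goodFlow(new ReplayContext(history)); // replay reuses the recorded results
console.log(charges); // 1

const badHistory: HistoryEvent[] = [];
badFlow(new ReplayContext(badHistory), true);
try {
  badFlow(new ReplayContext(badHistory), false); // replay takes the other branch
} catch (err) {
  console.log((err as Error).message); // history has "chargeCard", code asked for "sendCoupon"
}
```

The deterministic flow charges the card once no matter how often it is replayed. The other flow diverges from its own history as soon as the unrecorded input changes.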

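// WRONG: I/O inside workflow code
export async function badWorkflow(userId: string) {
  // DON'T DO THIS: the response is not recorded in history, so a replay
  // cannot reproduce it, and the workflow sandbox is not meant to perform I/O
  const response = await fetch(`https://api.example.com/users/${userId}`);
  return response.json();
}

// CORRECT: Deterministic workflow
import { proxyActivities, sleep } from '@temporalio/workflow';
import type * as activities from './activities';

const { fetchUserData } = proxyActivities<typeof activities>({
  startToCloseTimeout: '1 minute',
});

export async function goodWorkflow(userId: string) {
  await sleep('1 hour'); // Deterministic timer
  // All external calls are activities; their results are recorded in history
  return fetchUserData(userId);
}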

Child Workflows for Orchestration

For complex workflows with branches or parallel work, spawn child workflows. Each child is independently durable and can be monitored.

// parent-workflow.ts
import { proxyActivities, executeChild } from '@temporalio/workflow';
import { sendEmailWorkflow, processPaymentWorkflow } from './child-workflows';

export interface LargeOrderWorkflowInput {
  orderId: string;
  items: Array<{ productId: string; quantity: number }>;
  customerId: string;
}

export async function largeOrderWorkflow(input: LargeOrderWorkflowInput) {
  // Parallel child workflows using Promise.all
  const [paymentResult, emailResult] = await Promise.all([
    executeChild(processPaymentWorkflow, {
      args: [input.orderId, input.customerId],
      workflowId: `payment-${input.orderId}`,
    }),
    executeChild(sendEmailWorkflow, {
      args: [input.customerId, `Order ${input.orderId} confirmed`],
      workflowId: `email-${input.orderId}`,
    }),
  ]);

  return { paymentResult, emailResult };
}
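
Promise.all rejects as soon as any branch fails, which hides the outcome of the other branches. When each branch should be handled on its own, every promise can be wrapped so that it never rejects. The sketch below uses plain promises as stand-ins for the executeChild calls; settle, Outcome, and the two stub functions are hypothetical names for illustration.

```typescript
type Outcome<T> = { ok: true; value: T } | { ok: false; error: string };

// Converts a promise into one that never rejects, so one failed branch
// does not hide the results of the others.
function settle<T>(p: Promise<T>): Promise<Outcome<T>> {
  return p.then(
    (value): Outcome<T> => ({ ok: true, value }),
    (err): Outcome<T> => ({ ok: false, error: err instanceof Error ? err.message : String(err) }),
  );
}

// Stand-ins for executeChild(...) calls.
const processPayment = (): Promise<string> => Promise.resolve('tx_123');
const sendEmail = (): Promise<string> => Promise.reject(new Error('mailbox unavailable'));

async function runBranches(): Promise<[Outcome<string>, Outcome<string>]> {
  return Promise.all([settle(processPayment()), settle(sendEmail())]);
}

runBranches().then(([payment, email]) => {
  console.log(payment); // { ok: true, value: 'tx_123' }
  console.log(email);   // { ok: false, error: 'mailbox unavailable' }
});
```

The parent can then decide per branch whether to compensate, retry, or fail the whole order.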

Worker Setup in Node.js

Workers are the execution runtime. They connect to Temporal Server, poll task queues, and execute code.

// worker.ts - Production setup
import { Worker, NativeConnection } from '@temporalio/worker';
import * as activities from './activities';

async function runWorker() {
  const connection = await NativeConnection.connect({
    address: process.env.TEMPORAL_ADDRESS || 'localhost:7233',
  });

  const worker = await Worker.create({
    connection,
    namespace: process.env.TEMPORAL_NAMESPACE || 'default',
    taskQueue: 'onboarding-queue',
    // Workflows are loaded by path so the SDK can bundle them for the sandbox
    workflowsPath: require.resolve('./workflows'),
    // Activities run in the normal Node.js environment and are passed as an object
    activities,
    maxActivitiesPerSecond: 100,
    maxConcurrentActivityTaskExecutions: 10,
    maxConcurrentWorkflowTaskExecutions: 40,
  });

  console.log('Worker listening on task queue: onboarding-queue');
  try {
    await worker.run();
  } finally {
    // Release the connection once the worker has shut down
    await connection.close();
  }
}

runWorker().catch((err) => {
  console.error('Worker failed:', err);
  process.exit(1);
});

Testing Workflows with TestWorkflowEnvironment

Test workflows without a real Temporal server using the test environment. This enables fast, deterministic tests.

// workflow.test.ts
import { TestWorkflowEnvironment } from '@temporalio/testing';
import { Worker } from '@temporalio/worker';
import { onboardingWorkflow } from './workflow';

describe('onboardingWorkflow', () => {
  let testEnv: TestWorkflowEnvironment;

  beforeAll(async () => {
    // The time-skipping test server fast-forwards timers, so sleep('2 hours') does not block the test
    testEnv = await TestWorkflowEnvironment.createTimeSkipping();
  });

  afterAll(async () => {
    await testEnv?.teardown();
  });

  test('completes onboarding successfully', async () => {
    const { client, nativeConnection } = testEnv;

    const worker = await Worker.create({
      connection: nativeConnection,
      taskQueue: 'test-queue',
      // Workflows are registered by path, not as an object
      workflowsPath: require.resolve('./workflow'),
      // Mock activities replace the real HTTP calls
      activities: {
        fetchUserData: async () => ({
          userId: '123',
          firstName: 'John',
          paymentId: 'pm_123',
        }),
        chargeCard: async () => ({ transactionId: 'tx_123', status: 'success' }),
        sendWelcomeEmail: async () => {},
      },
    });

    // runUntil() keeps the worker polling until the promise settles, then shuts it down
    const result = await worker.runUntil(
      client.workflow.execute(onboardingWorkflow, {
        args: [{ userId: '123', email: 'john@example.com' }],
        taskQueue: 'test-queue',
        workflowId: 'test-workflow-1',
      })
    );

    expect(result).toBeUndefined();
  });
});

Temporal vs BullMQ vs SQS Decision Matrix

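| Feature                | Temporal                       | BullMQ           | SQS          |
|------------------------|--------------------------------|------------------|--------------|
| Workflow orchestration | Yes                            | Basic chains     | No           |
| Durability             | Server state                   | Redis            | AWS managed  |
| Scalability            | Millions of workflows          | Depends on Redis | Unlimited    |
| Long-running jobs      | Yes (months)                   | Hours            | Hours        |
| Deterministic replay   | Yes                            | No               | No           |
| Price                  | Self-hosted or Temporal Cloud  | Redis cost       | Per request  |
| Operational complexity | Temporal cluster               | Redis cluster    | AWS managed  |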

Use Temporal for complex multi-step workflows spanning days/weeks. Use BullMQ for job queues with moderate complexity. Use SQS for simple message passing.

Checklist

  • Configure activity retry policies and timeouts
  • Use activities for all side effects
  • Keep I/O and other non-deterministic code out of workflow code
  • Implement query handlers for monitoring
  • Handle signals for pause/cancel operations
  • Set up worker autoscaling based on task queue depth
  • Use child workflows for parallel orchestration
  • Write tests using TestWorkflowEnvironment
  • Monitor workflow execution with Temporal UI
  • Plan for Temporal Server HA and persistence

Conclusion

Temporal.io asks you to run more infrastructure in exchange for simpler application code. Your code becomes a simple state machine; Temporal handles retries, durability, and recovery. Start with a single worker, test with TestWorkflowEnvironment, and scale by adding workers to task queues. For mission-critical workflows such as subscriptions, onboarding, and payments, Temporal eliminates entire classes of bugs that plague queue-based systems.