Change Data Capture With Debezium — Sync Your Database to Anywhere in Real Time

Introduction

Your source of truth is PostgreSQL. Your search index is Elasticsearch. Your vector DB is Pinecone. Keeping them synchronized is a nightmare: polling is slow and wasteful, application-side change tracking is error-prone.

Change Data Capture (CDC) solves this. Debezium reads your database's transaction log in real time, extracts changes, and streams them to Kafka. Subscribe to Kafka and sync anywhere. No polling, no triggers, no consistency headaches.

What CDC is and Why It Beats Polling

Polling: Every 5 minutes, query for rows with updated_at > last_sync. Slow, expensive, prone to races with in-flight writes, and blind to hard deletes (a deleted row simply stops appearing in results).

CDC: Listen to the database's transaction log. When data changes, you know instantly. Stream the change to subscribers.

Debezium implements CDC. It reads PostgreSQL's Write-Ahead Log (WAL), MySQL's binlog, or MongoDB's oplog. These logs are already written for durability; Debezium just reads them.

No polling. No triggers. No application code. Just database logs flowing to Kafka.

Debezium PostgreSQL Connector Setup

Install Kafka Connect with the Debezium PostgreSQL connector. For local development, use Docker Compose; the debezium/connect image ships with the Debezium connectors pre-installed:

version: '3'
services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_DB: mydb
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: secret
    command:
      - '-c'
      - 'wal_level=logical'
      - '-c'
      - 'max_wal_senders=4'
      - '-c'
      - 'max_replication_slots=4'
    ports:
      - '5432:5432'

  kafka:
    image: confluentinc/cp-kafka:7.5.0
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    depends_on:
      - zookeeper
    ports:
      - '9092:9092'

  kafka-connect:
    image: debezium/connect:2.4
    environment:
      BOOTSTRAP_SERVERS: kafka:29092
      GROUP_ID: compose-group
      CONFIG_STORAGE_TOPIC: docker-connect-configs
      OFFSET_STORAGE_TOPIC: docker-connect-offsets
      STATUS_STORAGE_TOPIC: docker-connect-status
    depends_on:
      - kafka
    ports:
      - '8083:8083'

  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
Create the PostgreSQL replication user. With the pgoutput plugin, the user also needs SELECT on the captured tables, and CREATE on the database if Debezium is to create its publication:

CREATE USER replicator WITH REPLICATION ENCRYPTED PASSWORD 'secret';
GRANT CONNECT, CREATE ON DATABASE mydb TO replicator;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO replicator;

Now configure the Debezium connector via Kafka Connect's REST API:

curl -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "postgres-cdc",
    "config": {
      "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
      "database.hostname": "postgres",
      "database.port": "5432",
      "database.user": "replicator",
      "database.password": "secret",
      "database.dbname": "mydb",
      "topic.prefix": "myserver",
      "plugin.name": "pgoutput",
      "table.include.list": "public.users,public.products"
    }
  }'

(In Debezium 2.x, topic.prefix replaces the older database.server.name setting.)

Debezium now streams changes from the users and products tables to the Kafka topics myserver.public.users and myserver.public.products.
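Each row change arrives as an event whose payload carries the old and new row state plus metadata. Simplified (assuming the JSON converter with the schema portion omitted):

```json
{
  "before": null,
  "after": { "id": 1, "name": "Alice", "email": "alice@example.com" },
  "source": { "db": "mydb", "schema": "public", "table": "users" },
  "op": "c",
  "ts_ms": 1700000000000
}
```

op is "c" for create, "u" for update, "d" for delete, and "r" for snapshot reads. For deletes, after is null and before holds the last row state.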

WAL-Based Capture (No Polling, No Triggers)

PostgreSQL writes all changes to the Write-Ahead Log (WAL) before committing to disk. Debezium reads the WAL via logical replication.

No triggers: Debezium doesn't add triggers. Your application doesn't change.

No polling: Debezium streams changes as they happen, not on a schedule.

Low overhead: Logical decoding is a built-in PostgreSQL feature and adds little load. One caveat: a replication slot that isn't being consumed retains WAL, so monitor disk usage if Debezium stalls.

Filtering Events by Table and Operation

You don't need all changes. Filter by table and operation type (insert, update, delete):

{
  "name": "postgres-cdc-filtered",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "replicator",
    "database.password": "secret",
    "database.dbname": "mydb",
    "topic.prefix": "myserver",
    "table.include.list": "public.products",
    "column.include.list": "public.products.id,public.products.name,public.products.price",
    "skipped.operations": "u,d"
  }
}

Now only inserts (plus initial-snapshot reads) on the products table flow to Kafka, and events carry only the three listed columns.

Schema Evolution Handling

Your schema changes over time. Debezium includes schema information with each Kafka message (embedded alongside the payload with the JSON converter, or via Avro and a schema registry). Consumers can adapt:

{
  "schema": {
    "type": "struct",
    "fields": [
      { "field": "id", "type": "int32", "optional": false },
      { "field": "name", "type": "string", "optional": false },
      { "field": "email", "type": "string", "optional": true }
    ]
  },
  "payload": {
    "id": 1,
    "name": "Alice",
    "email": "alice@example.com"
  }
}

When you add a column to PostgreSQL, the schema updates in Kafka. Consumers that don't understand the new field ignore it; consumers that do, process it.
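On the consumer side, you can make that tolerance explicit by mapping events into an application type with optional fields. A minimal sketch (the UserRow shape and toUserRow helper are hypothetical, not part of Debezium):

```typescript
interface UserRow {
  id: number;
  name: string;
  email: string | null; // column added later; older events won't carry it
}

// Map a Debezium "after" payload to the application type,
// tolerating fields that older (or newer) schema versions don't have.
function toUserRow(after: Record<string, unknown>): UserRow {
  return {
    id: Number(after.id),
    name: String(after.name ?? ''),
    email: typeof after.email === 'string' ? after.email : null
  };
}
```

Unknown extra fields are simply dropped; missing optional fields get a default.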

Debezium + Kafka Connect

Kafka Connect is a framework for connectors: sources (read from DBs) and sinks (write to systems).

Debezium provides the PostgreSQL source connector. Third-party sinks exist for Elasticsearch, S3, JDBC, and more. Chain them:

Database (Postgres) → Debezium Source → Kafka → Elasticsearch Sink → Elasticsearch

Setup (this example assumes Confluent's Elasticsearch sink connector is installed on the Connect workers):

curl -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "elasticsearch-sink",
    "config": {
      "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
      "topics": "myserver.public.products",
      "connection.url": "http://elasticsearch:9200",
      "key.ignore": "true",
      "key.converter": "org.apache.kafka.connect.json.JsonConverter",
      "value.converter": "org.apache.kafka.connect.json.JsonConverter"
    }
  }'

Now every change to your products table appears in Elasticsearch instantly.

Routing Events to Redis, Elasticsearch, Vector DB

Use Kafka's routing flexibility. Emit changes to a Kafka topic, then subscribe from any system:

Redis cache: Invalidate cache on product update:

import { Kafka } from 'kafkajs';
import { createClient } from 'redis';

const kafka = new Kafka({
  clientId: 'cache-invalidator',
  brokers: [process.env.KAFKA_BROKER ?? 'localhost:9092']
});

const consumer = kafka.consumer({ groupId: 'cache-group' });
const redisClient = createClient();

await redisClient.connect();
await consumer.connect();
await consumer.subscribe({ topic: 'myserver.public.products' });
await consumer.run({
  eachMessage: async ({ message }) => {
    const change = JSON.parse(message.value?.toString() || '{}');
    // "after" is null for deletes; fall back to "before" for the cache key.
    const productId = change.payload?.after?.id ?? change.payload?.before?.id;
    if (productId !== undefined) {
      await redisClient.del(`product:${productId}`);
    }
  }
});

Elasticsearch: The Elasticsearch sink connector configured earlier handles this automatically.

Vector DB: A consumer process reads changes and computes embeddings:

const change = JSON.parse(message.value?.toString() || '{}');
const product = change.payload?.after;

if (product) {
  const embedding = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: `${product.name} ${product.description}`
  });

  // Upsert via the Pinecone Node client (the index name is illustrative).
  const index = pinecone.index('products');
  await index.upsert([
    {
      id: String(product.id),
      values: embedding.data[0].embedding,
      metadata: { name: product.name }
    }
  ]);
}

One database change, multiple systems synchronized.

Outbox Pattern With Debezium

Traditional outbox pattern: the application writes to both the orders table and an outbox table in one transaction, and a separate polling process reads the outbox and publishes to Kafka. Reliable, but slow and operationally clumsy.

Debezium outbox: keep the transactional write, drop the poller. The application still inserts the business row and the outbox row in the same transaction; Debezium captures the outbox inserts from the WAL, and its EventRouter transform extracts each event and routes it to the right Kafka topic.

This replaces the polling outbox processor entirely. Events flow to Kafka as soon as the transaction commits in PostgreSQL.
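A sketch of the connector side, assuming an outbox table public.outbox with the columns Debezium's EventRouter expects by default (id, aggregatetype, aggregateid, type, payload); events are routed to topics named outbox.event.&lt;aggregatetype&gt; unless configured otherwise:

```json
{
  "name": "outbox-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.user": "replicator",
    "database.password": "secret",
    "database.dbname": "mydb",
    "topic.prefix": "myserver",
    "plugin.name": "pgoutput",
    "table.include.list": "public.outbox",
    "transforms": "outbox",
    "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter"
  }
}
```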

Debezium Server for Lightweight CDC Without Kafka

Not everyone runs Kafka. Debezium Server is a standalone service that reads CDC logs and pushes to HTTP, Pulsar, Kinesis, Redis, RabbitMQ, and other sinks without requiring Kafka. Configure it via conf/application.properties:

debezium.sink.type=http
debezium.sink.http.url=http://localhost:3000/webhooks/cdc
debezium.source.connector.class=io.debezium.connector.postgresql.PostgresConnector
debezium.source.database.hostname=localhost
debezium.source.database.port=5432
debezium.source.database.user=replicator
debezium.source.database.password=secret
debezium.source.database.dbname=mydb
debezium.source.topic.prefix=myserver
debezium.source.plugin.name=pgoutput
debezium.transforms=unwrap
debezium.transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState

Debezium Server POSTs each change to your webhook. Process it in your application. No Kafka, no extra infrastructure.
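A minimal receiver for that webhook, sketched with Node's built-in http module (the /webhooks/cdc path matches the http.url in the config; the describeChange helper is illustrative):

```typescript
import http from 'node:http';

// Map Debezium op codes to readable names
// (c = create, u = update, d = delete, r = snapshot read).
function describeChange(event: { op?: string; source?: { table?: string } }): string {
  const ops: Record<string, string> = { c: 'insert', u: 'update', d: 'delete', r: 'snapshot' };
  return `${ops[event.op ?? ''] ?? 'unknown'} on ${event.source?.table ?? '?'}`;
}

// Debezium Server's HTTP sink delivers one JSON change event per POST.
const server = http.createServer((req, res) => {
  if (req.method === 'POST' && req.url === '/webhooks/cdc') {
    let body = '';
    req.on('data', (chunk) => (body += chunk));
    req.on('end', () => {
      console.log(describeChange(JSON.parse(body || '{}')));
      res.writeHead(200).end('ok');
    });
  } else {
    res.writeHead(404).end();
  }
});

server.listen(3000);
```

In production you would add authentication and return an error status on failure so Debezium Server retries the delivery.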

Monitoring Lag and Offset

Track how far behind your downstream consumers are:

kafka-consumer-groups --bootstrap-server localhost:9092 \
  --group cache-group --describe

Output shows lag per topic partition. High lag means a consumer can't keep up: scale out the consumer group (and the topic's partitions, if needed) or speed up processing. For the Debezium source itself, watch its JMX metrics such as MilliSecondsBehindSource.
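Consumer lag only covers the Kafka side. Also watch the replication slot in PostgreSQL: if Debezium stalls, the slot retains WAL and can fill the disk. One way to check retained WAL per slot:

```sql
SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
       ) AS retained_wal
FROM pg_replication_slots;
```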

Or monitor from application code with the kafkajs admin client (single-partition sketch):

const admin = kafka.admin();
await admin.connect();

// Latest offset (high watermark) vs. the group's committed offset.
const [latest] = await admin.fetchTopicOffsets('myserver.public.products');
const [committed] = await admin.fetchOffsets({
  groupId: 'cache-group',
  topics: ['myserver.public.products']
});

const lag = Number(latest.offset) - Number(committed.partitions[0].offset);
console.log(`Lag: ${lag} messages`);

Set alerts when lag exceeds threshold (e.g., >10k messages).

Checklist

  • Enable logical replication in PostgreSQL
  • Create replication user with appropriate permissions
  • Deploy Kafka and Kafka Connect
  • Configure Debezium PostgreSQL connector
  • Test that changes appear in Kafka topics
  • Set up sink connectors for target systems
  • Implement custom consumers for complex transformations
  • Monitor CDC lag and set up alerting
  • Document your CDC topology and schema mappings
  • Test failover scenarios (connector crash, DB restart)

Conclusion

Debezium eliminates the polling and consistency headaches of syncing data across systems. Read from your database's transaction log, stream changes to Kafka, and subscribe from anywhere.

For PostgreSQL shops, Debezium is table stakes for real-time data architecture. Start small: sync one table to Redis or Elasticsearch. Expand to the full platform as confidence grows.

Your data will finally stay in sync.