Change Data Capture (CDC)

Change Data Capture (CDC) is the practice of streaming database mutations as ordered events so downstream systems can stay in sync without periodic full loads. Rather than asking “what does my data look like now?”, CDC answers “what changed, and when?”

DeltaForge is a CDC engine built to make log-based change streams reliable, observable, and operationally simple in modern, containerized environments. It focuses on row-level CDC from MySQL binlog and Postgres logical replication to keep consumers accurate and latency-aware while minimizing impact on source databases.

In short: DeltaForge tails committed transactions from MySQL and Postgres logs, preserves ordering and transaction boundaries, and delivers events to Kafka and Redis with checkpointed delivery, configurable batching, and Prometheus metrics - without requiring a JVM or distributed coordinator.


Why CDC matters

Traditional data integration relies on periodic batch jobs that query source systems, compare snapshots, and push differences downstream. This approach worked for decades, but modern architectures demand something better.

The batch ETL problem: A nightly sync means your analytics are always a day stale. An hourly sync still leaves gaps and hammers your production database with expensive SELECT * queries during business hours. As data volumes grow, these jobs take longer, fail more often, and compete for the same resources your customers need.

CDC flips the model. Instead of pulling data on a schedule, you subscribe to changes as they happen.

| Aspect | Batch ETL | CDC |
|---|---|---|
| Latency | Minutes to hours | Seconds to milliseconds |
| Source load | High (repeated scans) | Minimal (log tailing) |
| Data freshness | Stale between runs | Near real-time |
| Failure recovery | Re-run entire job | Resume from checkpoint |
| Change detection | Diff comparison | Native from source |

These benefits compound as systems scale and teams decentralize:

  • Deterministic replay: Ordered events allow consumers to reconstruct state or power exactly-once delivery with checkpointing.
  • Polyglot delivery: The same change feed can serve caches, queues, warehouses, and search indexes simultaneously without additional source queries.

How CDC works

All CDC implementations share a common goal: detect changes and emit them as events. The approaches differ in how they detect those changes.

Log-based CDC

Databases maintain transaction logs (MySQL binlog, Postgres WAL) that record every committed change for durability and replication. Log-based CDC reads these logs directly, capturing changes without touching application tables.

┌─────────────┐    commits    ┌─────────────┐    tails     ┌─────────────┐
│ Application │──────────────▶│  Database   │─────────────▶│ CDC Engine  │
└─────────────┘               │   + WAL     │              └──────┬──────┘
                              └─────────────┘                     │
                                                                  ▼
                                                           ┌─────────────┐
                                                           │   Kafka /   │
                                                           │   Redis     │
                                                           └─────────────┘

Advantages:

  • Zero impact on source table performance
  • Captures all changes including those from triggers and stored procedures
  • Preserves transaction boundaries and ordering
  • Can capture deletes without soft-delete columns

Trade-offs:

  • Requires database configuration (replication slots, binlog retention)
  • Schema changes need careful handling
  • Log retention limits how far back you can replay

DeltaForge uses log-based CDC exclusively. This allows DeltaForge to provide stronger ordering guarantees, lower source impact, and simpler operational semantics than hybrid approaches that mix log tailing with polling or triggers.

Trigger-based CDC

Database triggers fire on INSERT, UPDATE, and DELETE operations, writing change records to a shadow table that a separate process polls.

Advantages:

  • Works on databases without accessible transaction logs
  • Can capture application-level context unavailable in logs

Trade-offs:

  • Adds write overhead to every transaction
  • Triggers can be disabled or forgotten during schema migrations
  • Shadow tables require maintenance and can grow unbounded

Polling-based CDC

A process periodically queries tables for rows modified since the last check, typically using an updated_at timestamp or incrementing ID.
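
As an illustration of the mechanism (and its blind spots), a minimal polling loop might look like the sketch below. The table, columns, interval, and publisher are hypothetical, and DeltaForge itself does not poll; the point is the watermark-based query this approach depends on.

// Hypothetical polling loop using node-postgres. Table and column names are
// illustrative only.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });
let watermark = new Date(0); // last successfully processed updated_at

async function pollOnce(): Promise<void> {
  // Rows updated since the watermark. Deletes are invisible to this query,
  // and rows that commit "in the past" after the watermark advances are missed.
  const { rows } = await pool.query(
    "SELECT * FROM orders WHERE updated_at > $1 ORDER BY updated_at ASC LIMIT 1000",
    [watermark]
  );
  for (const row of rows) {
    await publishChange(row);   // hypothetical downstream publisher
    watermark = row.updated_at; // advance only after publishing
  }
}

async function publishChange(row: unknown): Promise<void> {
  console.log("change detected:", row);
}

// Poll interval trades latency against database load.
setInterval(() => pollOnce().catch(console.error), 5_000);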

Advantages:

  • Simple to implement
  • No special database configuration required

Trade-offs:

  • Cannot reliably detect deletes
  • Requires updated_at columns on every table
  • Polling frequency trades off latency against database load
  • Clock skew and transaction visibility can cause missed or duplicate events

Anatomy of a CDC event

A well-designed CDC event contains everything downstream consumers need to process the change correctly.

{
  "id": "evt_01J7K9X2M3N4P5Q6R7S8T9U0V1",
  "source": {
    "database": "shop",
    "table": "orders",
    "server_id": "mysql-prod-1"
  },
  "operation": "update",
  "timestamp": "2025-01-15T14:32:01.847Z",
  "transaction": {
    "id": "gtid:3E11FA47-71CA-11E1-9E33-C80AA9429562:42",
    "position": 15847293
  },
  "before": {
    "id": 12345,
    "status": "pending",
    "total": 99.99,
    "updated_at": "2025-01-15T14:30:00.000Z"
  },
  "after": {
    "id": 12345,
    "status": "shipped",
    "total": 99.99,
    "updated_at": "2025-01-15T14:32:01.000Z"
  },
  "schema_version": "v3"
}

Key components:

| Field | Purpose |
|---|---|
| operation | INSERT, UPDATE, DELETE, or DDL |
| before / after | Row state before and after the change (enables diff logic) |
| transaction | Groups changes from the same database transaction |
| timestamp | When the change was committed at the source |
| schema_version | Helps consumers handle schema evolution |

Not all fields are present for every operation: before is omitted for INSERTs and after is omitted for DELETEs. The event envelope and metadata fields, however, are consistent across MySQL and Postgres sources.
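
For consumers written in TypeScript, the envelope above maps naturally onto a small type. This is an illustrative model derived from the example payload, not DeltaForge's canonical schema definition:

// Illustrative consumer-side model of the event shown above.
// Field names mirror the JSON example; verify against your actual payloads.
type Operation = "insert" | "update" | "delete" | "ddl";

interface ChangeEvent {
  id: string;
  source: { database: string; table: string; server_id: string };
  operation: Operation;
  timestamp: string;                          // commit time at the source (ISO 8601)
  transaction: { id: string; position: number };
  before?: Record<string, unknown>;           // omitted for inserts
  after?: Record<string, unknown>;            // omitted for deletes
  schema_version: string;
}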


Real-world use cases

Cache invalidation and population

Problem: Your Redis cache serves product catalog data, but cache invalidation logic is scattered across dozens of services. Some forget to invalidate, others invalidate too aggressively, and debugging staleness issues takes hours.

CDC solution: Stream changes from the products table to a message queue. A dedicated consumer reads the stream and updates or deletes cache keys based on the operation type. This centralizes invalidation logic, provides replay capability for cache rebuilds, and removes cache concerns from application code entirely.

┌──────────┐     CDC      ┌─────────────┐   consumer   ┌─────────────┐
│  MySQL   │─────────────▶│   Stream    │─────────────▶│ Cache Keys  │
│ products │              │  (Kafka/    │              │ product:123 │
└──────────┘              │   Redis)    │              └─────────────┘
                          └─────────────┘
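
A consumer implementing this pattern can stay very small. The sketch below assumes the ChangeEvent shape from the anatomy section and an ioredis client; the key naming and TTL are illustrative.

// Illustrative cache-maintenance consumer: upsert on insert/update, evict on delete.
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

async function applyToCache(event: ChangeEvent): Promise<void> {
  const key = `product:${(event.after ?? event.before)?.id}`;

  if (event.operation === "delete") {
    await redis.del(key);                                           // evict stale entry
  } else {
    await redis.set(key, JSON.stringify(event.after), "EX", 3600);  // upsert with TTL
  }
}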

Event-driven microservices

Problem: Your order service needs to notify inventory, shipping, billing, and analytics whenever an order changes state. Direct service-to-service calls create tight coupling and cascade failures.

CDC solution: Publish order changes to a durable message queue. Each downstream service subscribes independently, processes at its own pace, and can replay events after failures or during onboarding. The order service doesn’t need to know about its consumers.

┌─────────────┐     CDC      ┌─────────────┐
│   Orders    │─────────────▶│   Message   │
│   Database  │              │   Queue     │
└─────────────┘              └──────┬──────┘
                    ┌───────────────┼───────────────┐
                    ▼               ▼               ▼
             ┌───────────┐   ┌───────────┐   ┌───────────┐
             │ Inventory │   │ Shipping  │   │ Analytics │
             └───────────┘   └───────────┘   └───────────┘
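
Each subscriber is an ordinary consumer in whatever client library it already uses. A minimal sketch with kafkajs, where the topic name, group ID, broker address, and domain logic are assumptions:

// Illustrative inventory-service consumer using kafkajs.
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "inventory-service", brokers: ["kafka:9092"] });
const consumer = kafka.consumer({ groupId: "inventory" });

async function run(): Promise<void> {
  await consumer.connect();
  await consumer.subscribe({ topic: "shop.orders", fromBeginning: false });

  await consumer.run({
    eachMessage: async ({ message }) => {
      const event = JSON.parse(message.value?.toString() ?? "{}");
      // React only to the state transitions this service cares about.
      if (event.operation === "update" && event.after?.status === "paid") {
        // reserveStock(event.after) would go here (hypothetical domain logic)
      }
    },
  });
}

run().catch(console.error);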

Search index synchronization

Problem: Your Elasticsearch index drifts from the source of truth. Full reindexing takes hours and blocks search updates during the process.

CDC solution: Stream changes continuously to keep the index synchronized. Use the before image to remove old documents and the after image to index new content.

Data warehouse incremental loads

Problem: Nightly full loads into your warehouse take 6 hours and block analysts until noon. Business users complain that dashboards show yesterday’s numbers.

CDC solution: Stream changes to a staging topic, then micro-batch into your warehouse every few minutes. Analysts get near real-time data without impacting source systems.

Audit logging and compliance

Problem: Regulations require you to maintain a complete history of changes to sensitive data. Application-level audit logging is inconsistent and can be bypassed.

CDC solution: The transaction log captures every committed change regardless of how it was made - application code, admin scripts, or direct SQL. Stream these events to immutable storage for compliance.

Cross-region replication

Problem: You need to replicate data to a secondary region for disaster recovery, but built-in replication doesn’t support the transformations you need.

CDC solution: Stream changes through a CDC pipeline that filters, transforms, and routes events to the target region’s databases or message queues.


Architecture patterns

The outbox pattern

When you need to update a database and publish an event atomically, the outbox pattern provides exactly-once semantics without distributed transactions.

┌─────────────────────────────────────────┐
│              Single Transaction         │
│  ┌─────────────┐    ┌─────────────────┐ │
│  │ UPDATE      │    │ INSERT INTO     │ │
│  │ orders SET  │ +  │ outbox (event)  │ │
│  │ status=...  │    │                 │ │
│  └─────────────┘    └─────────────────┘ │
└─────────────────────────────────────────┘
                      │
                      ▼ CDC
               ┌─────────────┐
               │   Kafka     │
               └─────────────┘
  1. The application writes business data and an event record in the same transaction.
  2. CDC tails the outbox table and publishes events to Kafka.
  3. A cleanup process (or CDC itself) removes processed outbox rows.

This guarantees that events are published if and only if the transaction commits.
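
On the application side, step 1 is a plain transaction. A sketch using node-postgres, with illustrative table and column names:

// Illustrative outbox write: the business update and the event insert commit together.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function shipOrder(orderId: number): Promise<void> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    await client.query("UPDATE orders SET status = 'shipped' WHERE id = $1", [orderId]);
    await client.query(
      "INSERT INTO outbox (aggregate_id, event_type, payload) VALUES ($1, $2, $3)",
      [orderId, "order_shipped", JSON.stringify({ orderId, status: "shipped" })]
    );
    await client.query("COMMIT");   // CDC publishes the outbox row only after this commit
  } catch (err) {
    await client.query("ROLLBACK"); // neither the update nor the event becomes visible
    throw err;
  } finally {
    client.release();
  }
}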

When to skip the outbox: If your only requirement is to react to committed database state (not application intent), direct CDC from business tables is often simpler than introducing an outbox. The outbox pattern adds value when you need custom event payloads, explicit event versioning, or when the event schema differs significantly from table structure.

Event sourcing integration

CDC complements event sourcing by bridging legacy systems that weren’t built event-first. Stream changes from existing tables into event stores, then gradually migrate to native event sourcing.

CQRS (Command Query Responsibility Segregation)

CDC naturally supports CQRS by populating read-optimized projections from the write model. Changes flow through the CDC pipeline to update denormalized views, search indexes, or cache layers.


Challenges and solutions

Schema evolution

Databases change. Columns get added, renamed, or removed. Types change. CDC pipelines need to handle this gracefully.

Strategies:

  • Schema registry: Store and version schemas centrally (e.g., Confluent Schema Registry with Avro/Protobuf).
  • Forward compatibility: Add columns as nullable; avoid removing columns that consumers depend on.
  • Consumer tolerance: Design consumers to ignore unknown fields and handle missing optional fields.
  • Processor transforms: Use DeltaForge’s JavaScript processors to normalize schemas before sinks.
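
As an illustration of the last strategy, a processor could rename and backfill columns before events reach a sink. The exact processor contract (function name, arguments, return value) is not defined here, so treat the shape below as an assumption; actual processors are plain JavaScript.

// Assumed processor shape: receives one change event, returns the (possibly
// modified) event. Check the processor documentation for the real contract.
function normalizeSchema(event: any): any {
  const after = event.after;
  if (after) {
    // A column renamed from "total" to "amount" in a recent migration:
    if ("total" in after && !("amount" in after)) {
      after.amount = after.total;
      delete after.total;
    }
    // A newly added nullable column, backfilled so every consumer sees it:
    if (!("currency" in after)) {
      after.currency = null;
    }
  }
  return event;
}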

Ordering guarantees

Events must arrive in the correct order for consumers to reconstruct state accurately. A DELETE arriving before its corresponding INSERT would be catastrophic.

DeltaForge guarantees:

  • Source-order preservation per table partition (always enabled)
  • Transaction boundary preservation when respect_source_tx: true is configured in batch settings

The Kafka sink partitions events consistently by primary key, so all changes to a given row land in the same partition and reach consumers in order.

Exactly-once delivery

Network failures, process crashes, and consumer restarts can cause duplicates or gaps. True exactly-once semantics require coordination between source, pipeline, and sink.

DeltaForge approach:

  • Checkpoints track the last committed position in the source log.
  • Configurable commit policies (all, required, quorum) control when checkpoints advance.
  • Kafka sink supports idempotent producers; transactional writes available via exactly_once: true.

Default behavior: DeltaForge provides at-least-once delivery out of the box. Exactly-once semantics require sink support and explicit configuration.

High availability

Production CDC pipelines need to handle failures without data loss or extended downtime.

Best practices:

  • Run multiple pipeline instances with leader election.
  • Store checkpoints in durable storage (DeltaForge persists them to local files; mount a persistent volume when running in containers).
  • Monitor lag between source position and checkpoint position.
  • Set up alerts for pipeline failures and excessive lag.

Expectations: DeltaForge's checkpoints ensure no data loss on restart, but DeltaForge does not currently include built-in leader election. For HA deployments, use external coordination (Kubernetes leader election, etcd locks) or run active-passive with health-check-based failover.

Backpressure

When sinks can’t keep up with the change rate, pipelines need to slow down gracefully rather than dropping events or exhausting memory.

DeltaForge handles backpressure through:

  • Configurable batch sizes (max_events, max_bytes, max_ms).
  • In-flight limits (max_inflight) that bound concurrent sink writes.
  • Blocking reads from the source when batches queue up.
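
To make the in-flight limit concrete, the sketch below shows the general mechanism: once the cap is reached, no new batch is read until a sink write finishes. It illustrates the idea only and is not DeltaForge's internal implementation.

// Illustrative in-flight limiter: at most maxInflight sink writes run at once,
// so the read loop naturally stalls when the sink falls behind.
async function pump(
  readBatch: () => Promise<unknown[]>,
  writeBatch: (batch: unknown[]) => Promise<void>,
  maxInflight: number
): Promise<void> {
  const inflight = new Set<Promise<void>>();

  while (true) {
    if (inflight.size >= maxInflight) {
      await Promise.race(inflight);      // backpressure: wait for a free slot
      continue;
    }
    const batch = await readBatch();     // only read when there is capacity
    const p = writeBatch(batch).finally(() => inflight.delete(p));
    inflight.add(p);
  }
}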

Performance considerations

Batching trade-offs

| Setting | Low value | High value |
|---|---|---|
| max_events | Lower latency, more overhead | Higher throughput, more latency |
| max_ms | Faster flush, smaller batches | Larger batches, delayed flush |
| max_bytes | Memory-safe, frequent commits | Efficient for large rows |

Start with DeltaForge defaults and tune based on observed latency and throughput.

Source database impact

Log-based CDC has minimal impact, but consider:

  • Replication slot retention: Paused pipelines cause WAL/binlog accumulation.
  • Connection limits: Each pipeline holds a replication connection.
  • Network bandwidth: High-volume tables generate significant log traffic.

Sink throughput

  • Kafka: Tune batch.size, linger.ms, and compression in client_conf.
  • Redis: Use pipelining and connection pooling for high-volume streams.

Monitoring and observability

CDC pipelines are long-running, stateful processes; without metrics and alerts, failures are silent by default. A healthy pipeline requires visibility into lag, throughput, and errors.

Key metrics to track

| Metric | Description | Alert threshold |
|---|---|---|
| cdc_lag_seconds | Time between event timestamp and processing | > 60s |
| events_processed_total | Throughput counter | Sudden drops |
| checkpoint_lag_events | Events since last checkpoint | > 10,000 |
| sink_errors_total | Failed sink writes | Any sustained errors |
| batch_size_avg | Events per batch | Outside expected range |

DeltaForge exposes Prometheus metrics on the configurable metrics endpoint (default :9000).

Health checks

  • GET /healthz: Liveness probe - is the process running?
  • GET /readyz: Readiness probe - are pipelines connected and processing?
  • GET /pipelines: Detailed status of each pipeline including configuration.

Choosing a CDC solution

When evaluating CDC tools, consider:

| Factor | Questions to ask |
|---|---|
| Source support | Does it support your databases? MySQL binlog? Postgres logical replication? |
| Sink flexibility | Can it write to your target systems? Kafka, Redis, HTTP, custom? |
| Transformation | Can you filter, enrich, or reshape events in-flight? |
| Operational overhead | How much infrastructure does it require? JVM? Distributed coordinator? |
| Resource efficiency | What’s the memory/CPU footprint per pipeline? |
| Cloud-native | Does it containerize cleanly? Support health checks? Emit metrics? |

Where DeltaForge fits

DeltaForge intentionally avoids the operational complexity of JVM-based CDC stacks (Kafka Connect-style deployments with Zookeeper, Connect workers, and converter configurations) while remaining compatible with Kafka-centric architectures.

DeltaForge is designed for teams that want:

  • Lightweight runtime: Single binary, minimal memory footprint, no JVM warmup.
  • Config-driven pipelines: YAML specs instead of code for common patterns.
  • Inline transformation: JavaScript processors for custom logic without recompilation.
  • Container-native operations: Built for Kubernetes with health endpoints and Prometheus metrics.

DeltaForge is not designed for:

  • Complex DAG-based stream processing with windowed aggregations
  • Stateful joins across multiple streams
  • Sources beyond MySQL and Postgres (currently)
  • Built-in schema registry integration (use external registries)

If you need those capabilities, consider dedicated stream processors or the broader Kafka ecosystem. DeltaForge excels at getting data out of databases and into your event infrastructure reliably and efficiently.


Getting started

Ready to try CDC with DeltaForge? Head to the Quickstart guide to run your first pipeline in minutes.

For production deployments, review the Development guide for container builds and operational best practices.