
Configuration

Pipelines are defined as YAML documents that map directly to the PipelineSpec type. Environment variables are expanded before parsing using ${VAR} syntax, so secrets and connection strings can be injected at runtime. Unknown variables (e.g. ${source.table}) pass through for use as routing templates.
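
The expansion behavior described above can be sketched in Python. This is illustrative only, not DeltaForge's actual implementation; the function name is hypothetical. The key property is that known variables are substituted while unknown placeholders survive verbatim:

```python
import os
import re

def expand_env(text: str) -> str:
    """Replace ${VAR} with the value from the environment when it is set;
    leave unknown placeholders (e.g. ${source.table}) untouched so they
    can serve as routing templates later."""
    def sub(match: re.Match) -> str:
        value = os.environ.get(match.group(1))
        return value if value is not None else match.group(0)
    return re.sub(r"\$\{([^}]+)\}", sub, text)

os.environ["MYSQL_DSN"] = "mysql://user:pass@db:3306"
print(expand_env("dsn: ${MYSQL_DSN}"))       # known variable is expanded
print(expand_env("topic: ${source.table}"))  # unknown placeholder passes through
```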

Document structure

apiVersion: deltaforge/v1
kind: Pipeline
metadata:
  name: <pipeline-name>
  tenant: <tenant-id>
spec:
  source: { ... }
  processors: [ ... ]
  sinks: [ ... ]
  batch: { ... }
  commit_policy: { ... }
  schema_sensing: { ... }

Metadata

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | Yes | Unique pipeline identifier. Used in API routes and metrics. |
| tenant | string | Yes | Business-oriented tenant label for multi-tenancy. |

Spec fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| source | object | Yes | Database source configuration. See Sources. |
| processors | array | No | Ordered list of processors. See Processors. |
| sinks | array | Yes (at least one) | One or more sinks that receive each batch. See Sinks. |
| sharding | object | No | Optional hint for downstream distribution. |
| connection_policy | object | No | How the runtime establishes upstream connections. |
| batch | object | No | Commit unit thresholds. See Batching. |
| commit_policy | object | No | How sink acknowledgements gate checkpoints. See Commit policy. |
| schema_sensing | object | No | Automatic schema inference from event payloads. See Schema sensing. |

Sources

MySQL

Captures row-level changes via binlog replication. See MySQL source documentation for prerequisites and detailed configuration.

source:
  type: mysql
  config:
    id: orders-mysql
    dsn: ${MYSQL_DSN}
    tables:
      - shop.orders
      - shop.order_items
    outbox:
      tables: ["shop.outbox"]
    snapshot:
      mode: initial

| Field | Type | Description |
| --- | --- | --- |
| id | string | Unique identifier for checkpoints and metrics |
| dsn | string | MySQL connection string with replication privileges |
| tables | array | Table patterns to capture; omit for all tables |
| outbox.tables | array | Table patterns to tag as outbox events. Must also appear in tables. Supports globs: shop.outbox, *.outbox, shop.outbox_%. |
| snapshot.mode | string | never (default); initial: snapshot once if no checkpoint exists; always: re-snapshot on every restart |
| snapshot.max_parallel_tables | int | Tables snapshotted concurrently (default: 8) |
| snapshot.chunk_size | int | Rows per range chunk for integer-PK tables (default: 10000) |

Table patterns support SQL LIKE syntax:

  • db.table - exact match
  • db.prefix% - tables matching prefix
  • db.% - all tables in database
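
The LIKE-style matching above can be sketched in Python (illustrative only; the helper name is hypothetical and this is not DeltaForge's implementation). Each % matches any run of characters, everything else is literal:

```python
import re

def like_to_regex(pattern: str) -> re.Pattern:
    """Translate a SQL LIKE-style table pattern (% = any run of
    characters) into an anchored regular expression."""
    parts = [re.escape(p) for p in pattern.split("%")]
    return re.compile("^" + ".*".join(parts) + "$")

print(bool(like_to_regex("shop.order%").match("shop.orders")))  # prefix match
print(bool(like_to_regex("db.%").match("db.anything")))         # whole database
print(bool(like_to_regex("db.table").match("db.table2")))       # exact match fails
```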

PostgreSQL

Captures row-level changes via logical replication. See PostgreSQL source documentation for prerequisites and detailed configuration.

source:
  type: postgres
  config:
    id: users-postgres
    dsn: ${POSTGRES_DSN}
    slot: deltaforge_users
    publication: users_pub
    tables:
      - public.users
      - public.sessions
    start_position: earliest
    outbox:
      prefixes: ["outbox", "order_outbox_%"]
    snapshot:
      mode: initial

| Field | Type | Description |
| --- | --- | --- |
| id | string | Unique identifier |
| dsn | string | PostgreSQL connection string |
| slot | string | Replication slot name |
| publication | string | Publication name |
| tables | array | Table patterns to capture |
| start_position | string | earliest, latest, or lsn |
| outbox.prefixes | array | pg_logical_emit_message prefixes to tag as outbox events. Supports globs: outbox, outbox_%, *. |
| snapshot.mode | string | never (default); initial: snapshot once if no checkpoint exists; always: re-snapshot on every restart |
| snapshot.max_parallel_tables | int | Tables snapshotted concurrently (default: 8) |
| snapshot.chunk_size | int | Rows per range chunk (default: 10000) |

Note: The source-level outbox field only tags matching events with the __outbox sentinel. Routing and transformation are handled by the outbox processor.


Processors

Processors transform events between source and sinks. They run in order and can filter, enrich, or modify events.

JavaScript

processors:
  - type: javascript
    id: transform
    inline: |
      function processBatch(events) {
        return events.map(e => {
          e.tags = ["processed"];
          return e;
        });
      }
    limits:
      cpu_ms: 50
      mem_mb: 128
      timeout_ms: 500

| Field | Type | Description |
| --- | --- | --- |
| id | string | Processor identifier |
| inline | string | JavaScript code |
| limits.cpu_ms | int | CPU time limit |
| limits.mem_mb | int | Memory limit |
| limits.timeout_ms | int | Execution timeout |

Flatten

processors:
  - type: flatten
    id: flat
    separator: "__"
    max_depth: 3
    on_collision: last
    empty_object: preserve
    lists: preserve
    empty_list: drop

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| id | string | "flatten" | Processor identifier |
| separator | string | "__" | Separator between path segments |
| max_depth | int | unlimited | Stop recursing at this depth; objects at the boundary are kept as-is |
| on_collision | string | last | Key collision policy: last, first, or error |
| empty_object | string | preserve | Empty object policy: preserve, drop, or null |
| lists | string | preserve | Array policy: preserve or index |
| empty_list | string | preserve | Empty array policy: preserve, drop, or null |
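
The default behavior can be sketched in Python (an illustrative sketch of the defaults only: separator "__", lists and empty objects preserved; the collision, depth, and index policies are not modeled here):

```python
def flatten(obj, separator="__", prefix=""):
    """Collapse nested objects into a single level, joining path
    segments with `separator`. Lists and empty objects are kept
    as-is, mirroring the processor's defaults."""
    out = {}
    for key, value in obj.items():
        path = f"{prefix}{separator}{key}" if prefix else key
        if isinstance(value, dict) and value:
            out.update(flatten(value, separator, path))
        else:
            out[path] = value
    return out

event = {"order": {"id": 7, "customer": {"name": "Ada"}}, "tags": ["new"]}
print(flatten(event))
# {'order__id': 7, 'order__customer__name': 'Ada', 'tags': ['new']}
```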

Outbox

Transforms raw outbox events into routed, sink-ready events. Requires the source to have outbox configured so events are tagged before reaching this processor. See Outbox pattern documentation for full details.

processors:
  - type: outbox
    id: outbox
    topic: "${aggregate_type}.${event_type}"
    default_topic: "events.unrouted"
    raw_payload: true
    columns:
      payload: data
    additional_headers:
      x-trace-id: trace_id

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| id | string | "outbox" | Processor identifier |
| tables | array | [] | Filter: only process outbox events matching these patterns. Empty = all outbox events. |
| topic | string | - | Topic template resolved against the raw payload using ${field} placeholders |
| default_topic | string | - | Fallback topic when template resolution fails and no topic column exists |
| key | string | - | Key template resolved against the raw payload. Defaults to the aggregate_id value. |
| raw_payload | bool | false | Deliver the extracted payload as-is to sinks, bypassing envelope wrapping |
| strict | bool | false | Fail the batch if required fields are missing rather than silently falling back |
| columns.payload | string | payload | Column containing the event payload |
| columns.aggregate_type | string | aggregate_type | Column for aggregate type |
| columns.aggregate_id | string | aggregate_id | Column for aggregate ID |
| columns.event_type | string | event_type | Column for event type |
| columns.topic | string | topic | Column for pre-computed topic override |
| columns.event_id | string | id | Column extracted as df-event-id header |
| additional_headers | map | {} | Forward extra payload fields as routing headers: header-name: column-name |
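
The non-strict template resolution can be sketched in Python (illustrative only; the function name is hypothetical and the real processor also consults the topic column before falling back):

```python
import re

def resolve_topic(template, payload, default_topic):
    """Resolve ${field} placeholders against the raw outbox payload.
    If any referenced field is missing, fall back to default_topic
    (the non-strict behavior described above)."""
    try:
        return re.sub(r"\$\{([^}]+)\}",
                      lambda m: str(payload[m.group(1)]), template)
    except KeyError:
        return default_topic

row = {"aggregate_type": "order", "event_type": "created", "aggregate_id": "42"}
print(resolve_topic("${aggregate_type}.${event_type}", row, "events.unrouted"))
# order.created
print(resolve_topic("${missing_field}", row, "events.unrouted"))
# events.unrouted
```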

Sinks

Sinks deliver events to downstream systems. Each sink supports configurable envelope formats and wire encodings to match consumer expectations. See the Sinks documentation for detailed information on multi-sink patterns, commit policies, and failure handling.

Envelope and encoding

All sinks support these serialization options:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| envelope | object | native | Output structure format. See Envelopes. |
| encoding | string | json | Wire encoding format |

Envelope types:

  • native - Direct Debezium payload structure (default, most efficient)
  • debezium - Full {"payload": ...} wrapper
  • cloudevents - CloudEvents 1.0 specification (requires type_prefix)

# Native (default)
envelope:
  type: native

# Debezium wrapper
envelope:
  type: debezium

# CloudEvents
envelope:
  type: cloudevents
  type_prefix: "com.example.cdc"
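
The shapes the three envelope types produce can be sketched as follows. This is illustrative Python, not DeltaForge's serializer: the exact CloudEvents attributes emitted beyond the required ones are assumptions here, and the source value is hypothetical.

```python
import uuid

def wrap(payload, envelope_type, type_prefix=None, event_type=None):
    """Sketch of the three envelope shapes. Only the required
    CloudEvents 1.0 context attributes are shown."""
    if envelope_type == "native":
        return payload                      # direct payload, no wrapper
    if envelope_type == "debezium":
        return {"payload": payload}         # full wrapper
    if envelope_type == "cloudevents":
        return {
            "specversion": "1.0",
            "id": str(uuid.uuid4()),
            "source": "deltaforge",         # hypothetical source URI
            "type": f"{type_prefix}.{event_type}",
            "data": payload,
        }
    raise ValueError(envelope_type)

print(wrap({"op": "c"}, "debezium"))  # {'payload': {'op': 'c'}}
```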

Kafka

See Kafka sink documentation for detailed configuration options and best practices.

sinks:
  - type: kafka
    config:
      id: orders-kafka
      brokers: ${KAFKA_BROKERS}
      topic: orders
      envelope:
        type: debezium
      encoding: json
      required: true
      exactly_once: false
      send_timeout_secs: 30
      client_conf:
        security.protocol: SASL_SSL

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| id | string | - | Sink identifier |
| brokers | string | - | Kafka broker addresses |
| topic | string | - | Target topic or template |
| key | string | - | Message key template |
| envelope | object | native | Output format |
| encoding | string | json | Wire encoding |
| required | bool | true | Gates checkpoints |
| exactly_once | bool | false | Transactional mode |
| send_timeout_secs | int | 30 | Send timeout |
| client_conf | map | - | librdkafka overrides |

CloudEvents example:

sinks:
  - type: kafka
    config:
      id: events-kafka
      brokers: ${KAFKA_BROKERS}
      topic: events
      envelope:
        type: cloudevents
        type_prefix: "com.acme.cdc"
      encoding: json

Redis

See Redis sink documentation for detailed configuration options and best practices.

sinks:
  - type: redis
    config:
      id: orders-redis
      uri: ${REDIS_URI}
      stream: orders
      envelope:
        type: native
      encoding: json
      required: true

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| id | string | - | Sink identifier |
| uri | string | - | Redis connection URI |
| stream | string | - | Redis stream key or template |
| key | string | - | Entry key template |
| envelope | object | native | Output format |
| encoding | string | json | Wire encoding |
| required | bool | true | Gates checkpoints |
| send_timeout_secs | int | 5 | XADD timeout |
| batch_timeout_secs | int | 30 | Pipeline timeout |
| connect_timeout_secs | int | 10 | Connection timeout |

NATS

See NATS sink documentation for detailed configuration options and best practices.

sinks:
  - type: nats
    config:
      id: orders-nats
      url: ${NATS_URL}
      subject: orders.events
      stream: ORDERS
      envelope:
        type: native
      encoding: json
      required: true
      send_timeout_secs: 5
      batch_timeout_secs: 30

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| id | string | - | Sink identifier |
| url | string | - | NATS server URL |
| subject | string | - | Subject or template |
| key | string | - | Message key template |
| stream | string | - | JetStream stream name |
| envelope | object | native | Output format |
| encoding | string | json | Wire encoding |
| required | bool | true | Gates checkpoints |
| send_timeout_secs | int | 5 | Publish timeout |
| batch_timeout_secs | int | 30 | Batch timeout |
| connect_timeout_secs | int | 10 | Connection timeout |
| credentials_file | string | - | NATS credentials file |
| username | string | - | Auth username |
| password | string | - | Auth password |
| token | string | - | Auth token |

Batching

batch:
  max_events: 500
  max_bytes: 1048576
  max_ms: 1000
  respect_source_tx: true
  max_inflight: 2

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| max_events | int | 500 | Flush after this many events |
| max_bytes | int | 1048576 | Flush after size reaches limit |
| max_ms | int | 1000 | Flush after time (ms) |
| respect_source_tx | bool | true | Never split source transactions |
| max_inflight | int | 2 | Max concurrent batches |
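
The three flush thresholds combine as a simple "first one wins" rule, sketched below (illustrative only; respect_source_tx, which can hold a batch open past these limits, is not modeled):

```python
def should_flush(event_count, byte_size, elapsed_ms,
                 max_events=500, max_bytes=1_048_576, max_ms=1000):
    """A batch is flushed as soon as any one threshold is reached."""
    return (event_count >= max_events
            or byte_size >= max_bytes
            or elapsed_ms >= max_ms)

print(should_flush(500, 10_000, 120))  # event threshold hit
print(should_flush(10, 10_000, 120))   # no threshold hit yet
```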

Commit policy

commit_policy:
  mode: required

# For quorum mode:
commit_policy:
  mode: quorum
  quorum: 2

| Mode | Description |
| --- | --- |
| all | Every sink must acknowledge before checkpoint |
| required | Only required: true sinks must acknowledge (default) |
| quorum | Checkpoint after quorum sinks acknowledge |
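
The three modes can be sketched as a predicate over sink acknowledgements (illustrative only; the function name and data shapes are hypothetical):

```python
def may_checkpoint(acks, sinks, mode, quorum=None):
    """Decide whether a checkpoint may be taken. `acks` is the set of
    sink ids that acknowledged the batch; `sinks` maps sink id to its
    required flag."""
    if mode == "all":
        return set(sinks) <= acks
    if mode == "required":
        return all(sid in acks for sid, required in sinks.items() if required)
    if mode == "quorum":
        return len(acks) >= quorum
    raise ValueError(f"unknown mode: {mode}")

sinks = {"kafka": True, "redis": False}
print(may_checkpoint({"kafka"}, sinks, "required"))  # required sink acked
print(may_checkpoint({"kafka"}, sinks, "all"))       # redis still pending
```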

Schema sensing

Schema sensing automatically infers and tracks schema from event payloads. See the Schema Sensing documentation for detailed information on how it works, drift detection, and API endpoints.

Performance tip: Schema sensing can be CPU-intensive, especially with deep JSON inspection. Consider your throughput requirements when configuring:

  • Set enabled: false if you don’t need runtime schema inference
  • Limit deep_inspect.max_depth to avoid traversing deeply nested structures
  • Increase sampling.sample_rate to analyze fewer events (e.g., 1 in 100 instead of 1 in 10)
  • Reduce sampling.warmup_events if you’re confident in schema stability

schema_sensing:
  enabled: true
  deep_inspect:
    enabled: true
    max_depth: 3
    max_sample_size: 500
  sampling:
    warmup_events: 50
    sample_rate: 5
    structure_cache: true
    structure_cache_size: 50
  high_cardinality:
    enabled: true
    min_events: 100
    stable_threshold: 0.5
    min_dynamic_fields: 5

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | false | Enable schema sensing |
| deep_inspect.enabled | bool | true | Inspect nested JSON |
| deep_inspect.max_depth | int | 10 | Max nesting depth |
| deep_inspect.max_sample_size | int | 1000 | Max events for deep analysis |
| sampling.warmup_events | int | 1000 | Events to fully analyze first |
| sampling.sample_rate | int | 10 | After warmup, analyze 1 in N |
| sampling.structure_cache | bool | true | Cache structure fingerprints |
| sampling.structure_cache_size | int | 100 | Max cached structures |
| high_cardinality.enabled | bool | true | Detect dynamic map keys |
| high_cardinality.min_events | int | 100 | Events before classification |
| high_cardinality.stable_threshold | float | 0.5 | Frequency for stable fields |
| high_cardinality.min_dynamic_fields | int | 5 | Min unique fields for map |
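
The warmup-then-sample policy can be sketched as follows (an illustrative model of the documented behavior, assuming deterministic 1-in-N sampling; the actual selection strategy may differ):

```python
def should_analyze(event_index, warmup_events=1000, sample_rate=10):
    """The first `warmup_events` events are fully analyzed; after the
    warmup, only 1 in `sample_rate` events is."""
    if event_index < warmup_events:
        return True
    return (event_index - warmup_events) % sample_rate == 0

# With the example config above (warmup 50, rate 5), out of 1050 events:
analyzed = sum(should_analyze(i, warmup_events=50, sample_rate=5)
               for i in range(1050))
print(analyzed)  # 50 warmup + 1000/5 sampled = 250
```

Raising sample_rate shrinks only the second term, which is why it is the main lever for reducing steady-state CPU cost.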

Complete examples

MySQL to Kafka with Debezium envelope

apiVersion: deltaforge/v1
kind: Pipeline
metadata:
  name: orders-mysql-to-kafka
  tenant: acme

spec:
  source:
    type: mysql
    config:
      id: orders-mysql
      dsn: ${MYSQL_DSN}
      tables:
        - shop.orders

  processors:
    - type: javascript
      id: transform
      inline: |
        function processBatch(events) {
          return events.map(event => {
            event.tags = (event.tags || []).concat(["normalized"]);
            return event;
          });
        }
      limits:
        cpu_ms: 50
        mem_mb: 128
        timeout_ms: 500

  sinks:
    - type: kafka
      config:
        id: orders-kafka
        brokers: ${KAFKA_BROKERS}
        topic: orders
        envelope:
          type: debezium
        encoding: json
        required: true
        exactly_once: false
        client_conf:
          message.timeout.ms: "5000"

  batch:
    max_events: 500
    max_bytes: 1048576
    max_ms: 1000
    respect_source_tx: true
    max_inflight: 2

  commit_policy:
    mode: required

PostgreSQL to Kafka with CloudEvents

apiVersion: deltaforge/v1
kind: Pipeline
metadata:
  name: users-postgres-to-kafka
  tenant: acme

spec:
  source:
    type: postgres
    config:
      id: users-postgres
      dsn: ${POSTGRES_DSN}
      slot: deltaforge_users
      publication: users_pub
      tables:
        - public.users
        - public.user_sessions
      start_position: earliest

  sinks:
    - type: kafka
      config:
        id: users-kafka
        brokers: ${KAFKA_BROKERS}
        topic: user-events
        envelope:
          type: cloudevents
          type_prefix: "com.acme.users"
        encoding: json
        required: true

  batch:
    max_events: 500
    max_ms: 1000
    respect_source_tx: true

  commit_policy:
    mode: required

Multi-sink with different formats

apiVersion: deltaforge/v1
kind: Pipeline
metadata:
  name: orders-multi-sink
  tenant: acme

spec:
  source:
    type: mysql
    config:
      id: orders-mysql
      dsn: ${MYSQL_DSN}
      tables:
        - shop.orders

  sinks:
    # Kafka Connect expects Debezium format
    - type: kafka
      config:
        id: connect-sink
        brokers: ${KAFKA_BROKERS}
        topic: connect-events
        envelope:
          type: debezium
        required: true

    # Lambda expects CloudEvents
    - type: kafka
      config:
        id: lambda-sink
        brokers: ${KAFKA_BROKERS}
        topic: lambda-events
        envelope:
          type: cloudevents
          type_prefix: "com.acme.cdc"
        required: false

    # Redis cache uses native format
    - type: redis
      config:
        id: cache-redis
        uri: ${REDIS_URI}
        stream: orders-cache
        envelope:
          type: native
        required: false

  batch:
    max_events: 500
    max_ms: 1000
    respect_source_tx: true

  commit_policy:
    mode: required

MySQL to NATS

apiVersion: deltaforge/v1
kind: Pipeline
metadata:
  name: orders-mysql-to-nats
  tenant: acme

spec:
  source:
    type: mysql
    config:
      id: orders-mysql
      dsn: ${MYSQL_DSN}
      tables:
        - shop.orders
        - shop.order_items

  sinks:
    - type: nats
      config:
        id: orders-nats
        url: ${NATS_URL}
        subject: orders.events
        stream: ORDERS
        envelope:
          type: native
        encoding: json
        required: true

  batch:
    max_events: 500
    max_ms: 1000
    respect_source_tx: true

  commit_policy:
    mode: required