Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Observability playbook

DeltaForge already ships a Prometheus exporter, structured logging, and a panic hook. The runtime now emits source ingress counters, batching/processor histograms, and sink latency/throughput so operators can build production dashboards immediately. The tables below capture what is wired today and the remaining gaps to make the platform production ready for data and infra engineers.

What exists today

  • Prometheus endpoint served at /metrics (default 0.0.0.0:9000) with descriptors for pipeline counts, source/sink counters, and a stage latency histogram. The recorder is installed automatically when metrics are enabled.
  • Structured logging via tracing_subscriber with JSON output by default, optional targets, and support for RUST_LOG overrides.
  • Panic hook increments a deltaforge_panics_total counter and logs captured panics before delegating to the default hook.

Instrumentation gaps and recommendations

The sections below call out concrete metrics and log events to add per component. All metrics should include pipeline, tenant, and component identifiers where applicable so users can aggregate across fleets.

Sources (MySQL/Postgres)

StatusMetric/logRationale
✅ Implementeddeltaforge_source_events_total{pipeline,source,table} counter increments when MySQL events are handed to the coordinator.Surfaces ingress per table and pipeline.
✅ Implementeddeltaforge_source_reconnects_total{pipeline,source} counter when binlog reads reconnect.Makes retry storms visible.
🚧 Gapdeltaforge_source_lag_seconds{pipeline,source} gauge based on binlog/WAL position vs. server time.Alert when sources fall behind.
🚧 Gapdeltaforge_source_idle_seconds{pipeline,source} gauge updated when no events arrive within the inactivity window.Catch stuck readers before downstream backlogs form.

Coordinator and batching

StatusMetric/logRationale
✅ Implementeddeltaforge_batch_events{pipeline} and deltaforge_batch_bytes{pipeline} histograms in Coordinator::process_deliver_and_maybe_commit.Tune batching policies with data.
✅ Implementeddeltaforge_stage_latency_seconds{pipeline,stage,trigger} histogram for processor stage.Provides batch timing per trigger (timer/limits/shutdown).
✅ Implementeddeltaforge_processor_latency_seconds{pipeline,processor} histogram around every processor invocation.Identify slow user functions.
🚧 Gapdeltaforge_pipeline_channel_depth{pipeline} gauge from mpsc::Sender::capacity()/len().Detect backpressure between sources and coordinator.
🚧 GapCheckpoint outcome counters/logs (deltaforge_checkpoint_success_total / _failure_total).Alert on persistence regressions and correlate to data loss risk.

Sinks (Kafka/Redis/custom)

StatusMetric/logRationale
✅ Implementeddeltaforge_sink_events_total{pipeline,sink} counter and deltaforge_sink_latency_seconds{pipeline,sink} histogram around each send.Throughput and responsiveness per sink.
✅ Implementeddeltaforge_sink_batch_total{pipeline,sink} counter for send.Number of batches sent per sink.
🚧 GapError taxonomy in deltaforge_sink_failures_total (add kind/details).Easier alerting on specific failure classes (auth, timeout, schema).
🚧 GapBackpressure gauge for client buffers (rdkafka queue, Redis pipeline depth).Early signal before errors occur.
🚧 GapDrop/skip counters from processors/sinks.Auditing and reconciliation.

Control plane and health endpoints

NeedSuggested metric/logRationale
API request accountingdeltaforge_api_requests_total{route,method,status} counter and latency histogram using Axum middleware.Production-grade visibility of operator actions.
Ready/Liveness transitionsLogs with pipeline counts and per-pipeline status when readiness changes.Explain probe failures in log aggregation.
Pipeline lifecycleCounters for create/patch/stop actions with success/error labels; include tenant and caller metadata in logs.Auditable control-plane operations.