Integrating Databricks with ClickHouse: ETL patterns and connectors

Practical guide to moving data between Delta Lake and ClickHouse: CDC, batch syncs, materialized views, and hybrid analytics patterns.

Stop trading latency for scale: practical patterns to move data between Delta Lake and ClickHouse

If your teams struggle to deliver sub-second analytics on streaming events while keeping Delta Lake as the governed source of truth, you’re not alone. Hybrid stacks — Delta Lake for reliable, auditable storage and ClickHouse for high-concurrency, low-latency OLAP — are now the pragmatic standard for 2026. This guide gives engineers a hands-on playbook for CDC, batch sync, materialized views, and production-ready architecture patterns for moving data between Delta Lake and ClickHouse.

Executive summary (what this article gives you)

Most important first: use Delta Change Data Feed (CDF) + Structured Streaming or Kafka to deliver near-real-time change events into ClickHouse for sub-second analytics. For cost and governance, keep Delta Lake as your durable, auditable source and use ClickHouse for hot query access via ReplacingMergeTree (idempotent upserts) and ClickHouse materialized views (pre-aggregations). When real-time is not required, use partitioned micro-batch exports. Below you’ll find concrete code snippets, table DDLs, operational checklists, and architecture diagrams to implement these patterns safely at scale.

Why integrate Delta Lake + ClickHouse in 2026?

Two market forces shaped this integration landscape in late 2025–early 2026:

  • ClickHouse’s rapid growth and ecosystem funding (major rounds in late 2025) accelerated connector development and adoption for real-time analytics.
  • Enterprises demand governed, auditable storage (Delta Lake + Unity Catalog) while offloading high-concurrency BI and event analytics to specialized OLAP engines.

That combination — robust governance in Delta and sub-second query performance in ClickHouse — is now a common hybrid architecture for analytics at scale.

High-level architecture patterns

Choose the pattern that matches your latency, consistency, and operational complexity requirements.

  • CDC (near-real-time): Delta CDF -> Spark Structured Streaming -> Kafka / Direct ClickHouse writes -> ClickHouse MergeTree / Materialized View.
  • Batch sync (micro-batch): Scheduled Spark jobs export partition-level snapshots from Delta -> ClickHouse bulk load (HTTP, CSV/TSV, or native connector).
  • Materialized views: ClickHouse consumes Kafka topics (Kafka engine) and maintains materialized views for pre-aggregated dashboards.
  • Hybrid (hot+warm): Keep cold history in Delta; stream recent windows to ClickHouse. Archive from ClickHouse to Delta as TTL expires.

Pattern 1 — CDC: Delta CDF to ClickHouse for near-real-time analytics

When you need sub-second to second freshness, use Delta’s Change Data Feed (CDF) to emit row-level changes and deliver them to ClickHouse via an event bus (Kafka) or direct writes.

Why CDF?

  • Auditability: Delta remains authoritative with commits and versions.
  • Efficiency: Only changed rows are streamed.
  • Schema awareness: Delta CDF includes commit metadata for safe replay and debugging.

Data flow: Delta (CDF) -> Databricks Structured Streaming job -> Kafka topic -> ClickHouse Kafka Engine -> Materialized View -> MergeTree target table

Implementation: key steps

  1. Enable CDF on your Delta table.
  2. Create a Structured Streaming job that reads the CDF and emits canonical events to Kafka (topic per table).
  3. Configure ClickHouse to consume that Kafka topic and hydrate MergeTree tables via a materialized view.
  4. Use ReplacingMergeTree or version columns to ensure idempotent upserts and safe replays.
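
Step 1 is a one-line table property change. A minimal sketch in Databricks SQL, using the example table name from the snippets below:

-- Enable the Change Data Feed on the source Delta table (step 1)
ALTER TABLE catalog.db.events_delta
SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');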

Example: Structured Streaming -> Kafka (PySpark)

from pyspark.sql import SparkSession
from pyspark.sql.functions import struct, to_json

spark = SparkSession.builder.getOrCreate()

# Read the Delta Change Data Feed (CDF) as a stream
df = (spark.readStream
      .format("delta")
      .option("readChangeFeed", "true")
      .option("startingVersion", 0)
      .table("catalog.db.events_delta"))

# Serialize each change row (including CDF metadata columns such as
# _change_type and _commit_version) into a JSON string; the Kafka sink
# expects the message body in a column named "value"
payload = df.select(to_json(struct("*")).alias("value"))

# Write change events to Kafka; the checkpoint makes the stream
# restartable without re-emitting already-delivered changes
(payload.writeStream
 .format("kafka")
 .option("kafka.bootstrap.servers", "kafka:9092")
 .option("topic", "events.topic")
 .option("checkpointLocation", "/mnt/checkpoints/events_cdf")
 .start())

ClickHouse side: consume Kafka and maintain MergeTree

-- Create a MergeTree table with versioning for idempotent upserts
CREATE TABLE events_mv (
  event_id UInt64,
  payload String,
  ts DateTime,
  version UInt64
) ENGINE = ReplacingMergeTree(version)
PARTITION BY toYYYYMM(ts)
ORDER BY (event_id);

-- Create Kafka engine table (reads the Kafka topic).
-- JSONAsString delivers each message as one raw JSON string into the
-- single payload column, which the materialized view below parses.
CREATE TABLE events_kafka (
  payload String
) ENGINE = Kafka
SETTINGS
  kafka_broker_list = 'kafka:9092',
  kafka_topic_list = 'events.topic',
  kafka_group_name = 'ch_events_group',
  kafka_format = 'JSONAsString';

-- Materialized view to populate events_mv
CREATE MATERIALIZED VIEW events_from_kafka TO events_mv AS
SELECT
  JSONExtractUInt(payload, 'event_id') AS event_id,
  payload AS payload,
  parseDateTimeBestEffort(JSONExtractString(payload, 'ts')) AS ts,
  JSONExtractUInt(payload, '_commit_version') AS version
FROM events_kafka;

Notes: Use ReplacingMergeTree (or CollapsingMergeTree with operation markers) for idempotent upserts. Store a version/commit column from Delta (CDF) to prevent out-of-order application.
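
If you need explicit delete semantics, the CollapsingMergeTree alternative mentioned above looks roughly like this. This is a sketch, not the article's reference DDL: the sign column is an assumed convention (+1 for insert/upsert, -1 for delete) that you would derive from the CDF _change_type field during ingestion.

-- Sketch: CollapsingMergeTree with an explicit sign column derived
-- from the Delta CDF _change_type (+1 insert/update, -1 delete)
CREATE TABLE events_collapsing (
  event_id UInt64,
  payload String,
  ts DateTime,
  sign Int8
) ENGINE = CollapsingMergeTree(sign)
PARTITION BY toYYYYMM(ts)
ORDER BY (event_id);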

Pattern 2 — Batch sync: partitioned snapshots for predictable cost

When near-real-time is not required or ClickHouse cluster cost is a concern, schedule micro-batches to copy changed partitions from Delta into ClickHouse.

When to choose batch sync

  • Data freshness tolerance: minutes to hours
  • Large-volume backfills or initial bulk loads
  • Strict transactional semantics only needed in Delta, not in ClickHouse

Implementation options

  • Export partition files (Parquet/CSV) from Delta and bulk-load using ClickHouse HTTP/CSV interface.
  • Use Spark JDBC/connector to write directly if data sizes are moderate.
  • Schema mapping: keep column types compatible; prefer simple types (String, UInt64, DateTime) for ClickHouse.

Example: Export partition and POST into ClickHouse HTTP

# 1) On Databricks: write partition to CSV
(spark.table('catalog.db.events_delta')
 .filter("dt='2026-01-15'")
 .select('event_id', 'payload', 'ts')
 .write
 .mode('overwrite')
 .csv('/tmp/exports/events_2026-01-15'))

# 2) From an edge node or job runner: upload CSV via ClickHouse HTTP
curl -sS -u user:pass --data-binary @/tmp/exports/events_2026-01-15/part-00000.csv \
  "http://clickhouse-host:8123/?query=INSERT%20INTO%20events%20FORMAT%20CSV"

Bulk loads are fastest when you use native ClickHouse ingestion formats (TSV/CSV/JSONEachRow). Use compression (gzip) across network and parallelize uploads for large exports.

Pattern 3 — ClickHouse materialized views for pre-aggregations

Offload expensive aggregations from dashboards to ClickHouse materialized views and keep source logic centralized in Delta for governance.

Pattern mechanics

  1. Stream events from Delta via Kafka.
  2. Create materialized views in ClickHouse that aggregate into summary tables.
  3. Expose summary tables to BI with fast, low-latency queries.

Example: hourly aggregation

CREATE TABLE events_hourly
(
    ts_hour DateTime,
    event_type String,
    cnt UInt64
) ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(ts_hour)
ORDER BY (event_type, ts_hour);

CREATE MATERIALIZED VIEW events_hourly_mv TO events_hourly AS
SELECT
  toStartOfHour(parseDateTimeBestEffort(JSONExtractString(payload, 'ts'))) AS ts_hour,
  JSONExtractString(payload, 'type') AS event_type,
  count() AS cnt
FROM events_kafka
GROUP BY ts_hour, event_type;

Materialized views make BI queries extremely fast and reduce load on ClickHouse by precomputing heavy aggregations.
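
A dashboard query against the summary table then stays cheap. Note that with SummingMergeTree you should still aggregate with sum(), because rows are only combined during background merges. A sketch against the events_hourly table above:

-- Last 24 hours of event counts per type from the pre-aggregated table
SELECT
  event_type,
  sum(cnt) AS events
FROM events_hourly
WHERE ts_hour >= now() - INTERVAL 1 DAY
GROUP BY event_type
ORDER BY events DESC;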

Practical operational considerations

Schema evolution and mapping

  • Map complex nested types in Delta to flattened JSON strings in ClickHouse, or pre-flatten during streaming.
  • Version columns and commit metadata from Delta CDF to enable safe replays and deduplication.
  • When schemas change, use contract tests and CI to ensure ClickHouse table DDLs are updated in-sync.

Idempotency and ordering

ClickHouse deduplicates and collapses rows only when background merges run, so the effect of an upsert is eventually consistent. To implement idempotent upserts:

  • Use ReplacingMergeTree with a version column derived from Delta commit version.
  • Alternatively, use a separate 'operation' column and CollapsingMergeTree for explicit insert/delete semantics.
  • Always apply events with a deterministic sort key (e.g., commit version + event timestamp).
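
Because deduplication happens at merge time, reads can still see multiple versions of a row. Two common ways to get merge-independent results, sketched against the events_mv table above (FINAL is simpler but adds query-time cost; the argMax pattern is usually cheaper on large scans):

-- Option A: force deduplication at read time
SELECT event_id, payload, ts
FROM events_mv FINAL
WHERE ts >= today();

-- Option B: pick the latest version per key explicitly
SELECT
  event_id,
  argMax(payload, version) AS payload,
  argMax(ts, version) AS ts
FROM events_mv
WHERE ts >= today()
GROUP BY event_id;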

Latency vs cost trade-offs

  • Realtime CDC (Kafka + ClickHouse consumers) increases operational complexity and ClickHouse hot-storage cost.
  • Micro-batches lower cost but increase time-to-insight; choose hourly or minute-level jobs based on SLOs.
  • Hybrid: keep a short hot window (e.g., 7–14 days) in ClickHouse and store the remainder in Delta Lake.
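
A hot-window policy can be enforced directly on the ClickHouse serving table with a TTL. For example, assuming the events_mv table above and a 14-day window (Delta retains the full history, so expired rows only disappear from the serving layer):

-- Keep only the last 14 days in the ClickHouse serving table
ALTER TABLE events_mv
MODIFY TTL ts + INTERVAL 14 DAY DELETE;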

Monitoring, SLOs, and observability

  • Track lag and consumer offsets for Kafka topics.
  • Export ClickHouse metrics (query latency, merges, memory) into Prometheus/Grafana.
  • Alert on increasing merge backlog or long-running joins that indicate schema drift or broken pipelines.
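
On the ClickHouse side, the built-in system tables cover most of this without extra tooling. A couple of sketches (the interpretation thresholds are illustrative, not fixed rules):

-- Long-running merges; a growing list here often signals ingest pressure
SELECT table, elapsed, progress, total_size_bytes_compressed
FROM system.merges
ORDER BY elapsed DESC;

-- Active parts per partition; sustained growth suggests merges are falling behind
SELECT table, partition, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY table, partition
ORDER BY active_parts DESC
LIMIT 20;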

Security and governance

  • Use network-level controls: private VPC peering or PrivateLink equivalents between Databricks and ClickHouse clusters.
  • Encrypt transport (TLS) and enforce mutual TLS when possible.
  • Keep Delta Lake as the governed dataset with Unity Catalog for column-level policies; log all downstream consumer access.
  • Use Delta Sharing for controlled cross-account access rather than exporting raw files when collaborating across organizations.

Concrete example: end-to-end recipe (CDC -> ClickHouse via Kafka)

Below is a compact, runnable recipe outline you can adapt:

  1. Enable CDF on Delta table: ALTER TABLE ... SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true').
  2. Start a Structured Streaming job that reads CDF and writes JSON events into Kafka topics (see snippet above).
  3. Create ClickHouse Kafka engine table + materialized views to populate MergeTree tables with ReplacingMergeTree versioning.
  4. Use monitoring for Kafka lag and ClickHouse merge queue; schedule compactions or TTLs for older data.
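
To validate steps 1–2 before wiring up Kafka, you can inspect the change feed directly from Databricks SQL; a sketch, where version 0 simply means "from the beginning":

-- Inspect raw CDF output, including _change_type and _commit_version
SELECT * FROM table_changes('catalog.db.events_delta', 0) LIMIT 10;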

Common pitfalls and how to avoid them

  • Pitfall: Applying events out of order. Fix: include commit version and enforce ordering at consumer side or rely on ReplacingMergeTree.
  • Pitfall: Schema drift breaks materialized views. Fix: CI gating for DDL changes and backward-compatible schema evolution patterns (nullable fields, new JSON column for unknowns).
  • Pitfall: High ClickHouse costs for long retention. Fix: implement hot-window strategy and TTL / archiving to Delta Lake.
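
For the schema-drift pitfall, the backward-compatible path on the ClickHouse side is usually an additive, nullable column plus a corresponding materialized view update. A sketch, where user_agent is a hypothetical new field:

-- Additive, backward-compatible column change on the target table
-- (user_agent is a hypothetical example field)
ALTER TABLE events_mv
ADD COLUMN IF NOT EXISTS user_agent Nullable(String);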

How the ecosystem is trending

Through early 2026, the ecosystem matured in three ways relevant to integrations:

  • Major investments in ClickHouse’s ecosystem accelerated stable connectors (Kafka, JDBC advances, HTTP ingestion), making streaming at scale easier.
  • Data platform teams standardized on a “Delta as source, ClickHouse as serving” pattern for event analytics and streaming ML feature stores.
  • Tooling for governance and lineage bridges (Unity Catalog, Delta Sharing, and open source lineage solutions) reduced risks of hybrid architectures.

Prediction: by end of 2026, most large analytics platforms will adopt automated tiering (hot ClickHouse, warm Delta files) and richer change capture integrations out of the box.

Quick checklist before you go to production

  • Enable and validate Delta CDF for all source tables you plan to stream.
  • Design ClickHouse table engines (ReplacingMergeTree, CollapsingMergeTree) to match your update/delete semantics.
  • Instrument Kafka consumer lag, ClickHouse merges, and Spark streaming checkpoints.
  • Implement schema migration strategy that’s backward-compatible.
  • Set an archival policy: how and when to move data back to Delta and free ClickHouse storage.
  • Secure the pipeline (TLS, network peering, role-based access).

Practical takeaway: use Delta for authoritative storage and governance, ClickHouse for hot, high-concurrency analytics. Build CDC pipelines with versioned events and idempotent sinks for safe, scalable integrations.

Actionable takeaways

  • For immediate speed: Start with Kafka + ClickHouse Kafka engine + materialized views for sub-second dashboards.
  • For minimal ops: Use hourly micro-batches exporting partition files into ClickHouse HTTP ingestion.
  • For correctness: Always include Delta commit version in events and choose ReplacingMergeTree or CollapsingMergeTree patterns.
  • For cost control: enforce a hot-window policy and TTL-based archiving to Delta Lake.

Further resources & next steps

Start with a small pilot: pick a single Delta table with moderate write volume, enable CDF, and deploy a streaming job to push events into a ClickHouse staging cluster. Measure latency, query patterns, and cost. Iterate on table engines and compaction settings.

Call to action: Ready to build a pilot? Download our reference implementation (Spark Structured Streaming + Kafka + ClickHouse materialized view examples), run the end-to-end pipeline in a test environment, and adapt the table engines for your update semantics. If you want help designing a production-grade hybrid stack, contact our solutions engineering team for an architecture review and cost projection.
