Mikołaj Sędek’s Post

Data Architect @ Limango / Lakehouse Architecture / AWS / Databricks / Terraform / PySpark / SQL / Applied ML

https://lnkd.in/dyZcR9RP I really think this is a great pattern for building ETLs on top of your operational RDBMS sources (use a database replica if you can; if you can ingest messages from topics or event buses instead, switch to events for sure, they are immutable and easy to ingest):

- Deploy AWS DMS to track your databases' changelogs at table level and write the results to storage buckets (Parquet files, date partitioning, KMS encryption). This can be defined in Terraform and deployed through CI/CD.
- Attach the S3 buckets holding the CDC files to your Unity Catalog as volumes; again, Terraform automation is a great way to go.
- Use Spark Structured Streaming-based Auto Loader to load the data in parallel into bronze Delta tables in your UC (materialized views, actually; you won't see a difference unless you start using single-user clusters ;) ). This is even easier with Delta Live Tables (much less Auto Loader configuration needed). The greatest feature is checkpointing: you don't need to load the same files twice or run complex state checks. Plus, the Auto Loader pattern works for both streaming and batch loads; it doesn't have to run continuously, it can run on a batch schedule too. A minimal PySpark sketch is included below the post.
- Use the SCD (Slowly Changing Dimensions) features in Delta Live Tables to rebuild the source tables in UC (a great feature of UC: you get time travel for every table, plus lineage, which is so handy in debugging). A hedged DLT sketch also follows below.
- From there, build all your ETLs on UC Delta tables without any live connection to the RDBMS, which should be avoided at all costs on production systems.
- All your ETLs then run with the full power of the Spark / Photon compute engines and fast Delta reads.
- All your PySpark code can be tested and linted with your favourite tool, e.g. GitHub Actions, and deployed to the Lakehouse with Asset Bundles automation.

As a result, you avoid direct integration with the source database system, you can track data changes continuously, and your ETLs are reproducible, idempotent, and securely monitored (with AWS CloudWatch + alerts, Databricks Workflows and data alerts). In case of reprocessing you avoid major bottlenecks and can scale resources up, and all your configuration is safely stored and versioned in code :)

#Lakehouse #AWS #CloudETLs #Databricks #DeltaLiveTables #UnityCatalog #PySpark #CDC #DataEngineering
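
Here is roughly what the Auto Loader step can look like in PySpark. The volume path, checkpoint path and table name are illustrative placeholders, not my actual setup; the point is that the checkpoint makes reruns idempotent and trigger(availableNow=True) turns the same stream into a scheduled batch job.

```python
# Minimal Auto Loader sketch: ingest DMS CDC Parquet files from a UC volume
# into a bronze Delta table. All paths and table names below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available in Databricks notebooks

CDC_VOLUME_PATH = "/Volumes/main/cdc/dms_files/customers"     # UC volume backed by the DMS S3 bucket
CHECKPOINT_PATH = "/Volumes/main/cdc/_checkpoints/customers"  # Auto Loader remembers processed files here
BRONZE_TABLE = "main.bronze.customers_cdc"

(
    spark.readStream.format("cloudFiles")                   # Auto Loader source
    .option("cloudFiles.format", "parquet")                 # DMS writes Parquet
    .option("cloudFiles.schemaLocation", CHECKPOINT_PATH)   # schema inference / evolution state
    .load(CDC_VOLUME_PATH)
    .writeStream
    .option("checkpointLocation", CHECKPOINT_PATH)          # no file is ever loaded twice
    .trigger(availableNow=True)                             # process everything new, then stop (batch-style)
    .toTable(BRONZE_TABLE)
)
```

Scheduled from a Databricks Workflow, this behaves like an incremental batch load; drop the trigger to run it as a continuous stream instead.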
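
And a hedged sketch of the DLT SCD step. The key, sequence and operation columns (customer_id, commit_ts, Op) assume DMS is configured to emit an operation flag and a transaction timestamp column; rename them to match your own CDC output.

```python
# Delta Live Tables sketch: Auto Loader bronze feed plus apply_changes to rebuild
# the source table as an SCD target. Names and columns are placeholders.
import dlt
from pyspark.sql.functions import col, expr


@dlt.table(name="customers_cdc_bronze", comment="Raw DMS change records")
def customers_cdc_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .load("/Volumes/main/cdc/dms_files/customers")  # illustrative UC volume path
    )


# Streaming table that mirrors the operational source table.
dlt.create_streaming_table("customers")

dlt.apply_changes(
    target="customers",
    source="customers_cdc_bronze",
    keys=["customer_id"],               # primary key of the source table
    sequence_by=col("commit_ts"),       # ordering column (DMS TimestampColumnName)
    apply_as_deletes=expr("Op = 'D'"),  # DMS marks deletes with Op = 'D'
    except_column_list=["Op", "commit_ts"],
    stored_as_scd_type=1,               # switch to 2 to keep full history
)
```

Type 1 keeps only the latest state of each row; type 2 adds __START_AT / __END_AT columns so the full change history is preserved alongside Delta time travel.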

Using DMS and DLT for Change Data Capture
