https://lnkd.in/dyZcR9RP I really think it's a great pattern for building ETLs on top of your operational RDBMS sources (use a database replica if you can; and if you can ingest messages from topics or event buses instead, switch to events for sure, they are immutable and easy to ingest):

- deploy the AWS DMS service to track your databases' changelogs at table level and write the results to storage buckets (Parquet files, date partitioning, KMS encryption) - this can be defined in Terraform and deployed via CI/CD
- attach the S3 buckets holding the CDC files to your Unity Catalog as volumes - again, Terraform automation is a great way to go
- use the Spark Structured Streaming-based Auto Loader to load the data in parallel into bronze Delta tables in your UC (materialized views, actually - you won't see a difference unless you start using single-user clusters ;) ) - this is even easier when you use Delta Live Tables (much less Auto Loader configuration needed)
- the greatest feature is the use of checkpoints - you don't need to load the same files twice or run complex state checks - and the Auto Loader pattern works for both streaming and batch loads (you don't need to run it continuously, it can run on a batch schedule too)
- use the SCD (Slowly Changing Dimensions) features in Delta Live Tables to rebuild the source tables in UC (a great feature of UC - you get time travel for every table, and lineage too, which is so handy in debugging)
- then you can build all your ETLs on UC Delta tables without any live connection to the RDBMS - which should be avoided at all costs on production systems
- all your ETLs will then run with the full power of the Spark / Photon compute engines and fast Delta reads
- all your PySpark code can be tested and linted with your favourite tool, e.g. GitHub Actions, and deployed to the Lakehouse with Asset Bundles automation

As a result, you avoid direct database integration against the source system, you can track data changes continuously, your ETLs are reproducible and idempotent, they can be securely monitored (with AWS CloudWatch + alerts, Databricks Workflows and data alerts), and in case of reprocessing you avoid the major bottlenecks and can scale resources up - plus all your configurations are safely stored and versioned in code :)

#Lakehouse #AWS #CloudETLs #Databricks #DeltaLiveTables #UnityCatalog #PySpark #CDC #DataEngineering
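To make the Auto Loader + SCD step concrete, here is a minimal Delta Live Tables sketch in PySpark. It assumes a recent DLT runtime (where "spark" and checkpointing are provided by the pipeline), and the volume path, table names, key column and the DMS "Op" / timestamp columns are hypothetical placeholders you would swap for your own DMS output layout.

import dlt
from pyspark.sql import functions as F

CDC_PATH = "/Volumes/main/cdc/customers"  # hypothetical UC volume backed by the DMS bucket

@dlt.table(name="customers_cdc_bronze", comment="Raw DMS change records ingested with Auto Loader")
def customers_cdc_bronze():
    return (
        spark.readStream.format("cloudFiles")      # Auto Loader; DLT manages the checkpoint
        .option("cloudFiles.format", "parquet")    # DMS writes Parquet change files
        .load(CDC_PATH)
    )

# Rebuild the source table from the change feed (SCD Type 1; use 2 to keep full history)
dlt.create_streaming_table("customers")

dlt.apply_changes(
    target="customers",
    source="customers_cdc_bronze",
    keys=["customer_id"],                    # hypothetical primary key
    sequence_by=F.col("dms_timestamp"),      # hypothetical ordering column from DMS
    apply_as_deletes=F.expr("Op = 'D'"),     # treat DMS delete markers as deletes
    except_column_list=["Op", "dms_timestamp"],
    stored_as_scd_type=1,
)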
More Relevant Posts
-
Announcement 📣 📣 We are excited to announce the launch of our new data-driven ETL framework, Smart ETL V1.0, within the Microsoft Fabric environment. Smart ETL combines the power of Fabric Pipelines, Delta Lake, and PySpark to bring data from various sources into the Fabric Lakehouse and implement the medallion architecture (Bronze, Silver, Gold) with a no-code/low-code approach.

Key Features and Functions of Smart ETL:
🔹 Easy Integration: Seamlessly integrate Smart ETL into Fabric workspaces.
🔹 Smooth Configuration and Deployment: Simplified setup and deployment processes.
🔹 Secure Data Transfer: Safely and securely transfer data from relational databases.
🔹 Flexible Data Loading: Load data from various file formats such as text or CSV into Fabric.
🔹 Versatile Load Management: Handle both full and incremental data loads.
🔹 SCD Handling: Manage Slowly Changing Dimensions (SCD) Type 1 and Type 2 with ease.
🔹 Optimized Performance: Maximize Fabric capacity by dividing workloads into multiple streams.

Stay tuned for more features and updates with Smart ETL as we continue to innovate and enhance our offerings!

#SmartETL #DataIntegration #ETL #MicrosoftFabric #DataAnalytics #DataLakehouse #Innovation
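Smart ETL itself is no-code, but for readers curious what an incremental load into a Fabric Lakehouse bronze table looks like under the hood, here is a generic PySpark sketch. This is not Smart ETL's API; the table names and the watermark column are hypothetical, and both tables are assumed to already exist in the Lakehouse.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()   # provided automatically in a Fabric notebook

WATERMARK_COL = "modified_at"                # hypothetical change-tracking column

# Find the latest watermark already loaded into the bronze table.
last_loaded = (
    spark.read.table("bronze_customers")     # hypothetical Lakehouse table
    .agg(F.max(WATERMARK_COL).alias("wm"))
    .collect()[0]["wm"]
)

# Pull only rows that changed since the last load (incremental load).
source = spark.read.table("staging_customers")   # hypothetical landing table
increment = source if last_loaded is None else source.filter(F.col(WATERMARK_COL) > F.lit(last_loaded))

# Append the increment to the bronze Delta table.
increment.write.format("delta").mode("append").saveAsTable("bronze_customers")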
-
You can now deliver your application in a fraction of the time with one solution. HarperDB unifies user-programmed applications, high-performance database, and real-time data streaming into one technology.

Logic with Direct Data Access: Applications require data to complete most tasks. HarperDB's unified-system architecture ensures that logic executes fast and overhead stays low.

Distributed Data: Geo-replicated and globally synchronized data only takes minutes to configure.

Learn more below or connect with our team! #AppDevelopment #DistributedData #Database https://hubs.li/Q02dCGzn0
-
🚀 Handling Sync Delays in Data Pipelines with Apache Airflow 🛠️

Data pipelines that move data from live systems to warehouses often face a common challenge: sync delays. Whether you're working with real-time or batch data, ensuring you don't miss late-arriving records is crucial for data integrity. Here are some strategies I've found helpful to handle sync delay issues using Apache Airflow:

Catch-Up Mechanism: Enable catchup=True to automatically backfill missed intervals when systems experience delays.

Window-Based Extraction: Instead of just pulling yesterday's data, create time windows (e.g., pulling data from T-2 days to T-1 day) to account for late arrivals.

Data Validation Checks: Always validate data before marking a sync as complete. Simple checks like row counts or validation tasks can catch gaps.

Retries and Idempotent Loads: Set up retries for failed tasks and ensure that your loads are idempotent (safe to re-run without duplicating data).

Sensors for Sync Monitoring: Use Airflow's sensors to wait for the source system to finish syncing before starting extraction.

Upserts for Late Data: Implement upserts to avoid missing records and handle updates to existing ones seamlessly.

By combining these techniques, you can significantly reduce the risks of missing data due to sync delays, ensuring your data pipeline is both robust and resilient.

How do you handle sync delays in your workflows? Feel free to share your experiences or tips!

#DataEngineering #ApacheAirflow #ETL #DataPipeline #SyncDelays #DataIntegrity #Automation #BigData #DataOps
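A minimal Airflow 2.x sketch tying several of these ideas together (catchup, retries, a sensor, and a two-day extraction window). The DAG id and callables are hypothetical placeholders; the real sync check would query your source system's sync-status table.

from datetime import timedelta

import pendulum
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.python import PythonSensor


def source_sync_finished(**context):
    # Hypothetical check - replace with a query against your source's sync-status table.
    return True


def extract_window(data_interval_start=None, data_interval_end=None, **context):
    # Pull a two-day window ending at the scheduled interval to catch late-arriving rows.
    window_start = data_interval_start - timedelta(days=1)
    print(f"Extracting rows updated between {window_start} and {data_interval_end}")


with DAG(
    dag_id="daily_sync_with_late_data",          # hypothetical name
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule="@daily",                           # 'schedule_interval' on Airflow < 2.4
    catchup=True,                                # backfill missed intervals automatically
    default_args={"retries": 3, "retry_delay": timedelta(minutes=10)},
) as dag:
    wait_for_sync = PythonSensor(
        task_id="wait_for_source_sync",
        python_callable=source_sync_finished,
        poke_interval=300,
        timeout=4 * 60 * 60,
        mode="reschedule",                       # free the worker slot while waiting
    )
    extract = PythonOperator(task_id="extract_window", python_callable=extract_window)

    wait_for_sync >> extract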
-
3 tips for effectively deploying your vector database in your #RAG application production environment with #Milvus:

📈 Design an effective schema
Carefully consider your data structure and how it will be queried to create a schema that optimizes performance and scalability.

📈 Plan for scalability
Anticipate future growth and design your architecture to accommodate increasing data volumes and user traffic.

📈 Select the optimal index and fine-tune performance
Choose the most suitable indexing method for your use case and continuously monitor and adjust performance settings.

Read more: https://bit.ly/3UH9oeI
Practical Tips and Tricks for Developers Building RAG Applications - Zilliz blog
zilliz.com
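As an illustration of the first and third tips, here is a small pymilvus sketch that defines an explicit schema and picks an index up front. The collection name, field names, embedding dimension and index parameters are hypothetical; tune them for your own model and workload.

from pymilvus import (
    Collection,
    CollectionSchema,
    DataType,
    FieldSchema,
    connections,
)

connections.connect(host="localhost", port="19530")   # assumes a locally running Milvus

fields = [
    FieldSchema(name="doc_id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="chunk", dtype=DataType.VARCHAR, max_length=2048),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),  # must match your embedding model
]
schema = CollectionSchema(fields, description="RAG document chunks")
collection = Collection(name="rag_chunks", schema=schema)

# HNSW is a common choice for low-latency search; M and efConstruction trade recall for build cost.
collection.create_index(
    field_name="embedding",
    index_params={"index_type": "HNSW", "metric_type": "IP", "params": {"M": 16, "efConstruction": 200}},
)
collection.load()   # load into memory so the collection is searchable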
-
⏯ Unlocking the Power of Delta Lake: My Favorite Features and Why They Matter

Hi folks,

What is your favorite Delta Lake feature? Personally, I value its ACID transactions and robust data management capabilities, including INSERT, UPDATE, DELETE, and MERGE commands. For those who have worked with relational databases and experienced the challenges of performing the same transactions with Parquet files in data lakes, Delta Lake offers a transformative solution. Despite being built on Parquet, Delta Lake significantly enhances usability.

Here are my top 5 favorite features of Delta Lake:

1. ACID Transactions: Delta Lake ensures data integrity with ACID (Atomicity, Consistency, Isolation, Durability) transactions, simplifying data operations and preserving data quality.

2. Time Travel: This feature enables access to previous versions of your data, facilitating auditing, rollback, and recovery from accidental updates.

3. Efficient Data Management: Delta Lake optimizes storage and management, supporting extensive data processing and integrating batch and streaming data operations seamlessly.

4. Scalability: Designed for large-scale data handling, Delta Lake maintains reliable performance even with substantial datasets.

5. Schema Enforcement and Evolution: It supports schema enforcement and evolution, ensuring data consistency while allowing for flexibility.

For more detailed information on these features, check out the Delta Lake documentation: https://lnkd.in/dm84r_V5

#DataEngineering #BigData #DataLakes #DeltaLake
What are all the Delta things in Databricks?
docs.databricks.com
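As a quick taste of features 1 and 2, here is a small PySpark sketch of a MERGE upsert and a time-travel read. It assumes a Spark session with the Delta Lake extensions already available (any Databricks cluster, or delta-spark installed locally) and a hypothetical existing table bronze.customers with id and email columns.

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

updates = spark.createDataFrame(
    [(1, "alice@new.example"), (4, "dave@example.com")], ["id", "email"]
)

target = DeltaTable.forName(spark, "bronze.customers")   # hypothetical Delta table

# MERGE: update matching rows and insert new ones in a single ACID operation.
(
    target.alias("t")
    .merge(updates.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# Time travel: read the table as it looked at an earlier version (here, version 0).
previous = spark.read.format("delta").option("versionAsOf", 0).table("bronze.customers")
previous.show()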
-
In the software world, data persistence is crucial. Please spend some time analyzing the options and approaches before you commit to one.

Invest time to learn about the following:
- Transactions
- Query optimization
- Entity-relationship models
- Indexing
- ...

Bad persistence logic will ruin the entire software's functionality.

#databases #softwareengineering
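To make two of those topics tangible, here is a tiny Python sketch using the standard-library sqlite3 module: an index on a frequently filtered column and an atomic two-statement transaction. The database file, tables and columns are hypothetical and assumed to already exist.

import sqlite3

conn = sqlite3.connect("shop.db")  # hypothetical database with 'orders' and 'accounts' tables

# Indexing: speed up lookups that filter on customer_id.
conn.execute("CREATE INDEX IF NOT EXISTS idx_orders_customer ON orders(customer_id)")

# Transactions: both updates commit together, or neither does.
try:
    with conn:  # the connection is a transaction context manager: commit on success, rollback on error
        conn.execute("UPDATE accounts SET balance = balance - 100 WHERE id = ?", (1,))
        conn.execute("UPDATE accounts SET balance = balance + 100 WHERE id = ?", (2,))
except sqlite3.Error as exc:
    print(f"Transfer rolled back: {exc}")

conn.close()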
-
🤔 What is the simplest cost-efficient data platform/mesh architecture? Let me start with one possibility..

1. Stream data to S3 as Parquet at scale (Data Taps)
2. Run SQL over S3 at scale with any query engine (e.g. BoilingData)

Data ends up on S3 anyway.. So why not put it there from the start? The ultimate interworking interface: S3 and Parquet 😀.

You need a couple of building blocks on top to make it usable. Namely, *easy* secure data sharing and data discovery ("Data Mesh"). Well, maybe also Iceberg for cases where you need to update existing data.

Data Taps and BoilingData provide easy data sharing (authenticated users with ACLs for sharing). But we lag behind on data discovery (catalog)...

To achieve real-time streaming, you can configure the flushing thresholds. But if that's not enough, write additional aggregating/filtering SQL clause(s) with output to... another Data Tap? ...or directly to a "BI tool" with an embedded db like DuckDB (WASM)!?

You can always backfill from S3 as the data is there anyway?

What am I missing? Can we get simpler?
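For the "run SQL over S3" half, here is a minimal sketch using plain DuckDB from Python (not BoilingData's API). The bucket, prefix and column names are hypothetical; credentials are assumed to come from the environment or DuckDB's s3_* settings.

import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")   # adds S3 support
con.sql("SET s3_region = 'eu-west-1';")   # credentials via env vars or SET s3_access_key_id / s3_secret_access_key

# Query the Parquet files right where the stream landed them - no load step.
result = con.sql("""
    SELECT event_type, count(*) AS events
    FROM read_parquet('s3://my-data-taps-bucket/events/*.parquet')  -- hypothetical bucket/prefix
    GROUP BY event_type
    ORDER BY events DESC
""").df()
print(result)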
-
Can your business benefit from leveraging mainframe data for modern applications — via a seamless, no-code integration with over 100 databases? If so, this blog is for you! https://lnkd.in/gmM6VmNj
Connect Your Mainframe Data to Over 100 Applications and Databases with PropelZ™ | VirtualZ Computing
https://virtualzcomputing.com