Selected Data Engineering Posts . . . November 2024

The most popular of my data engineering posts in November 2024 ... with additional references ...


Welcome to the November 2024 edition of Selected Data Engineering Posts! This month, we delve into a diverse range of topics reshaping the data engineering landscape.

Explore how empathy-driven engineering fosters inclusive user experiences and how AI is accelerating product-market fit. Learn how CFOs are transforming reporting with data-driven insights. Discover techniques to tackle time-sensitive data pipelines and understand the nuances of dataset and field lineage.

Delve into cutting-edge topics like reverse concept drift, dbt on Databricks, small language models, compound AI systems, and the EU AI Act. Uncover the potential of Shift Left Architecture and the synergy between Data Mesh and Data Fabric.

Each post includes references for deeper insights, and offers valuable advice and practical tips.

Subscribe to stay updated and leverage these advancements to elevate your data engineering practices.


This issue:

Engineering with Empathy: Engineers can enhance user experiences by integrating empathy into their work. Drawing on insights from an accessibility team, the post explores the importance of understanding diverse user needs, collaborating across roles, and making informed product decisions that drive inclusivity.

Innovation and AI: Achieving product-market fit (PMF) is vital for growth and sustainability, requiring deep customer understanding, agile iteration, and a clear value proposition. AI accelerates PMF by enhancing market insights, predicting behaviors, and personalizing experiences, helping companies align products with evolving market needs.

Impactful Reporting: CFOs can transform reporting by prioritizing decision-focused insights, system-based management, and user-centric design. Shifting from static data to actionable insights, this approach addresses biases and adapts to evolving needs, enhancing strategic decisions and driving significant business value.

Time-Sensitive Data Pipelines: The Time Warp Dilemma in data engineering arises when data updates at different frequencies, creating challenges like version inconsistency and join complexity. Solutions include full refreshes, snapshots, versioning, and time-aware joins to maintain data integrity across varying timelines.

Dataset- and Field-Lineage: Data lineage has multiple types, each offering unique insights into data flow. Dataset lineage captures high-level relationships between data sources and targets, while field lineage traces transformations at a column level. Together, they enhance data governance, integrity, and informed decision-making in complex ecosystems.

Reverse Concept Drift: Post-deployment monitoring is critical for maintaining model relevance as data evolves. Focus on detecting concept drift, using methods like residual analysis, resampling, and the Reverse Concept Drift Algorithm to measure its impact and maintain model performance.

dbt on Databricks: dbt on Databricks enables robust data quality by validating sources, transformations, and outputs. Leverage unit tests, constraints, freshness checks, and data contracts to ensure pipeline integrity while optimizing metadata management with Unity Catalog for seamless data discovery and lineage.

Small Language Models (SLM): Small language models (SLMs) like Meta's Llama 3.2 enable cost-effective, energy-efficient AI applications on edge devices, enhancing productivity, privacy, and accessibility across industries while reducing reliance on large-scale cloud-based solutions. SLMs unlock transformative potential for real-time, localized AI use cases.

Compound AI Systems: Compound AI systems combine multiple AI components to efficiently handle complex tasks and improve performance, privacy and scalability. BentoML is presented as a platform that streamlines the setup, optimization and deployment of these systems with full observability.

EU AI Act Compliance: The EU AI Act establishes a comprehensive legal framework to ensure AI systems are safe, transparent, non-discriminatory, and respectful of fundamental rights. It categorizes AI by risk, mandates compliance for high-risk systems, and sets a global regulatory benchmark for ethical AI.

Shift Left Architecture: The Shift Left Architecture integrates real-time data streaming with tools like Kafka, Flink, and Iceberg to unify transactional and analytical workloads, reducing costs, increasing flexibility, ensuring data consistency, and accelerating time-to-market for innovative applications.

Blending Data Mesh and Data Fabric: Data mesh and data fabric are complementary approaches to data strategy. The post explores their differences, benefits, and trade-offs, and shows how combining elements of both balances autonomy and centralization, enabling adaptive, scalable, and efficient data ecosystems tailored to the needs of the business.

We look forward to sharing this knowledge with you and supporting you on your path to data excellence.

Enjoy reading!


𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴 𝘄𝗶𝘁𝗵 𝗘𝗺𝗽𝗮𝘁𝗵𝘆: 𝗔 𝗚𝘂𝗶𝗱𝗲 𝗳𝗼𝗿 𝗗𝗮𝘁𝗮 𝗣𝗿𝗼𝗳𝗲𝘀𝘀𝗶𝗼𝗻𝗮𝗹𝘀

Data professionals should prioritize empathy in their work to create more user-centered and effective products. As Lena Wang (Slack) points out, by understanding the needs and perspectives of users, engineers can make informed decisions, collaborate effectively and contribute meaningfully to product development.

𝗥𝗲𝗰𝗼𝗺𝗺𝗲𝗻𝗱𝗮𝘁𝗶𝗼𝗻𝘀:

  • 𝗢𝘃𝗲𝗿𝗰𝗼𝗺𝗶𝗻𝗴 𝗜𝗺𝗽𝗼𝘀𝘁𝗲𝗿 𝗦𝘆𝗻𝗱𝗿𝗼𝗺𝗲: Embrace self-doubt and seek guidance from experienced professionals.

  • 𝗗𝗲𝘃𝗲𝗹𝗼𝗽𝗶𝗻𝗴 𝗘𝗺𝗽𝗮𝘁𝗵𝘆: Practice active listening, observe user behavior, and engage in open discussions with stakeholders.

  • 𝗖𝗼𝗹𝗹𝗮𝗯𝗼𝗿𝗮𝘁𝗶𝗻𝗴 𝗘𝗳𝗳𝗲𝗰𝘁𝗶𝘃𝗲𝗹𝘆: Work closely with designers, product managers, and other team members to understand user needs and perspectives.

  • 𝗖𝗼𝗻𝘁𝗿𝗶𝗯𝘂𝘁𝗶𝗻𝗴 𝗠𝗲𝗮𝗻𝗶𝗻𝗴𝗳𝘂𝗹𝗹𝘆: Focus on creating products that solve real problems and improve user experiences.

The most surprising or unusual piece of advice in this text is the recommendation to 𝗮𝗹𝗹𝗼𝘄 𝘀𝗲𝗹𝗳-𝗱𝗼𝘂𝗯𝘁 and 𝘀𝗲𝗲𝗸 𝗮𝗱𝘃𝗶𝗰𝗲 from experienced professionals. This advice challenges the common notion that self-confidence is always a positive trait and emphasizes the importance of humility and seeking help when needed.

𝗖𝗼𝗻𝗰𝗹𝘂𝘀𝗶𝗼𝗻: Empathetic engineering is essential to the development of successful data-driven products. By understanding the needs and perspectives of users, data experts can develop products that are not only technically sound, but also meaningful and impactful.

Go to Article

Further Reading

How to Foster a Culture of Empathy Within Your Data Team

Empathy Mapping: The First Step in Design Thinking

Empathy might be the one skill every engineer should focus on

Data engineering: a role redefined by business needs and interpersonal skills


𝗠𝗮𝘀𝘁𝗲𝗿 𝗣𝗿𝗼𝗱𝘂𝗰𝘁-𝗠𝗮𝗿𝗸𝗲𝘁 𝗙𝗶𝘁: 𝗧𝗵𝗲 𝗣𝗼𝘄𝗲𝗿 𝗼𝗳 𝗔𝗜

Unleash Your Product's Full Potential ...

Product-market fit is crucial for business success. As David Henkin (Forbes) points out, it's about aligning the product offering with market needs to drive growth and profitability. AI can significantly improve the PMF process by providing valuable insights into market trends, customer behavior and personalization opportunities.

𝗖𝗵𝗮𝗹𝗹𝗲𝗻𝗴𝗲𝘀:

  • 𝗨𝗻𝗱𝗲𝗿𝘀𝘁𝗮𝗻𝗱𝗶𝗻𝗴 𝗖𝘂𝘀𝘁𝗼𝗺𝗲𝗿 𝗡𝗲𝗲𝗱𝘀: Conduct thorough market research and leverage AI tools to identify and understand customer pain points.

  • 𝗔𝘃𝗼𝗶𝗱𝗶𝗻𝗴 𝗙𝗲𝗮𝘁𝘂𝗿𝗲 𝗖𝗿𝗲𝗲𝗽: Focus on core value propositions and avoid unnecessary features that can dilute the product's appeal.

  • 𝗖𝗼𝗻𝘁𝗶𝗻𝘂𝗼𝘂𝘀 𝗜𝘁𝗲𝗿𝗮𝘁𝗶𝗼𝗻: Prioritize customer feedback and adapt product offerings based on market changes and evolving needs.

𝗥𝗲𝗰𝗼𝗺𝗺𝗲𝗻𝗱𝗮𝘁𝗶𝗼𝗻𝘀:

  • 𝗨𝘁𝗶𝗹𝗶𝘇𝗲 𝗔𝗜 𝗳𝗼𝗿 𝗠𝗮𝗿𝗸𝗲𝘁 𝗔𝗻𝗮𝗹𝘆𝘀𝗶𝘀: Leverage AI tools to analyze market trends, segment customers, and identify unmet needs.

  • 𝗣𝗿𝗶𝗼𝗿𝗶𝘁𝗶𝘇𝗲 𝗖𝘂𝘀𝘁𝗼𝗺𝗲𝗿 𝗙𝗲𝗲𝗱𝗯𝗮𝗰𝗸: Implement AI-driven feedback analysis to understand customer preferences and pain points.

  • 𝗣𝗲𝗿𝘀𝗼𝗻𝗮𝗹𝗶𝘇𝗲 𝗣𝗿𝗼𝗱𝘂𝗰𝘁 𝗢𝗳𝗳𝗲𝗿𝗶𝗻𝗴𝘀: Use AI to tailor products and marketing efforts to individual customer preferences.

  • 𝗠𝗲𝗮𝘀𝘂𝗿𝗲 𝗮𝗻𝗱 𝗔𝗱𝗮𝗽𝘁: Continuously track product performance and customer satisfaction to make informed adjustments.

Adapting products to the market requires a deep understanding of the market, a customer-centric approach and the ability to adapt to changing dynamics. AI can provide valuable insights and tools to accelerate this process and drive long-term business success.

Go to Article

Further Reading

Unlocking Product-Market Fit: How AI is Revolutionizing Market Research

Mastering Product-Market Fit in AI: Strategies and Steps

Betting on AI? You must first consider product-market fit


𝗧𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺𝗶𝗻𝗴 𝗥𝗲𝗽𝗼𝗿𝘁𝗶𝗻𝗴 𝗳𝗿𝗼𝗺 𝗕𝘂𝗿𝗱𝗲𝗻 𝘁𝗼 𝗕𝘂𝘀𝗶𝗻𝗲𝘀𝘀 𝗗𝗿𝗶𝘃𝗲𝗿 ...

𝗮 𝟲-𝗣𝗼𝗶𝗻𝘁 𝗔𝗽𝗽𝗿𝗼𝗮𝗰𝗵 𝗖𝗵𝗮𝗹𝗹𝗲𝗻𝗴𝗲𝘀 𝗮𝗻𝗱 𝗥𝗲𝗰𝗼𝗺𝗺𝗲𝗻𝗱𝗮𝘁𝗶𝗼𝗻𝘀

Traditional reports often fail to provide actionable insights, leading to information overload and hindering decision-making. The Boston Consulting Group (BCG) points out that, with a strategic approach, CFOs can transform reporting into a powerful tool that drives business value. This article presents six key tactics that can deliver a quantum leap in reporting effectiveness.

𝗥𝗲𝗰𝗼𝗺𝗺𝗲𝗻𝗱𝗮𝘁𝗶𝗼𝗻𝘀:

  • 𝗗𝗲𝗰𝗶𝘀𝗶𝗼𝗻-𝗖𝗲𝗻𝘁𝗿𝗶𝗰: Focus on reports influencing management decisions. Identify crucial questions and tailor reports to provide the necessary information.

  • 𝗦𝘆𝘀𝘁𝗲𝗺 𝗧𝗵𝗶𝗻𝗸𝗶𝗻𝗴: Embed reporting within a structured performance management process. This includes clear roles, preparation, follow-up, and fostering a culture of constructive discussion.

  • 𝗗𝗲𝗰𝗶𝘀𝗶𝗼𝗻 𝗗𝗲𝗯𝗶𝗮𝘀𝗶𝗻𝗴: Recognize and counteract cognitive biases that can hinder decision-making. Reporting strategies might require adjustments to address these.

  • 𝗘𝗺𝗽𝗶𝗿𝗶𝗰𝗮𝗹 𝗩𝗮𝗹𝗶𝗱𝗶𝘁𝘆: Move beyond traditional bookkeeping methods. Future reports should be multi-dimensional and use statistical analysis to assess trends, driver significance, and success probabilities. This may involve training finance professionals in advanced analytics.

  • 𝗨𝘀𝗲𝗿 𝗘𝘅𝗽𝗲𝗿𝗶𝗲𝗻𝗰𝗲 (𝗨𝗫) 𝗗𝗲𝘀𝗶𝗴𝗻: Design reports with user needs in mind. Understand how decision-makers use reports, their starting points, and preferences. Tell stories with data and embed reports in user-friendly tools.

  • 𝗖𝗼𝗻𝘁𝗶𝗻𝘂𝗼𝘂𝘀 𝗜𝗺𝗽𝗿𝗼𝘃𝗲𝗺𝗲𝗻𝘁: Reporting needs to evolve. Regularly gather feedback, monitor user behavior, and adapt dashboards to changing needs.

Redesigning reporting is not a one-time event, but an ongoing process. By implementing these tactics, CFOs can unlock the true value of reporting by driving data-driven decision making across the organization.

Go to Article

Further Reading

Six Ways to Drive Value with Impact Reporting

8 Tips For Writing An Impactful Impact Report

Ultimate Guide to Impact Reporting

Data Reporting: Best Practices for Crafting Impactful Reports


𝗧𝗮𝗺𝗶𝗻𝗴 𝘁𝗵𝗲 𝗧𝗶𝗺𝗲 𝗪𝗮𝗿𝗽 ...

𝗦𝘁𝗿𝗮𝘁𝗲𝗴𝗶𝗲𝘀 𝗳𝗼𝗿 𝗖𝗼𝗻𝘀𝗶𝘀𝘁𝗲𝗻𝘁 𝗗𝗮𝘁𝗮 𝗣𝗶𝗽𝗲𝗹𝗶𝗻𝗲𝘀

Data engineers face a persistent challenge: ensuring data consistency when source data updates at different frequencies, while downstream processes operate on various timelines. As 𝗞𝗲𝘃𝗶𝗻 𝗪𝗼𝗻𝗴 points out, this inconsistency can lead to flawed decisions, compliance issues, and wasted time.

𝗞𝗲𝘆 𝗜𝘀𝘀𝘂𝗲𝘀:

  • Temporal Disconnect: Source data evolves daily, but not all processes run daily (e.g., sales reports generated weekly).

  • Version Inconsistency: Different tasks might use different versions of the same data source.

  • Join Complexity: Joining data from multiple sources with different time stamps becomes tricky.

  • Historical Accuracy: Reports for past periods should reflect the data state at that time.

𝗦𝘁𝗿𝗮𝘁𝗲𝗴𝗶𝗲𝘀 𝗳𝗼𝗿 𝗦𝘂𝗰𝗰𝗲𝘀𝘀:

  • Full Refresh: This straightforward approach rebuilds everything from scratch whenever the source data changes. While simple to implement, it can be resource-intensive and lead to data downtime.

  • Snapshot-Based Processing: This strategy involves taking regular snapshots of the entire dataset, preserving its state at specific points in time. It enables historical analysis and simplifies data reconciliation but increases storage requirements (a minimal sketch follows this list).

  • Data Genealogy: This approach assigns version numbers to your data and maintains relationships between entities across time. It supports accurate historical analysis and data lineage but increases data model complexity.

  • Temporal Data Modeling: This advanced strategy leverages database features to track the full history of data changes. But it requires databases with specific features and can impact storage and performance.
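
To make the snapshot idea concrete, here is a minimal Python sketch using pandas and Parquet files; the paths, column names, and partition layout are illustrative assumptions, not the article's implementation. It writes date-partitioned snapshots and lets a report read the dataset state as of its reporting date.

```python
from pathlib import Path
import pandas as pd

SNAPSHOT_ROOT = Path("warehouse/customers")  # hypothetical storage location

def write_snapshot(df: pd.DataFrame, snapshot_date: str) -> None:
    """Persist the full state of the dataset under a date partition."""
    out_dir = SNAPSHOT_ROOT / f"snapshot_date={snapshot_date}"
    out_dir.mkdir(parents=True, exist_ok=True)
    df.assign(snapshot_date=snapshot_date).to_parquet(out_dir / "data.parquet", index=False)

def read_snapshot_asof(report_date: str) -> pd.DataFrame:
    """Load the most recent snapshot taken on or before the report date."""
    dates = sorted(p.name.split("=")[1] for p in SNAPSHOT_ROOT.glob("snapshot_date=*"))
    valid = [d for d in dates if d <= report_date]
    if not valid:
        raise FileNotFoundError(f"No snapshot on or before {report_date}")
    return pd.read_parquet(SNAPSHOT_ROOT / f"snapshot_date={valid[-1]}" / "data.parquet")

# A weekly sales report for 2024-11-10 reads the customer state as of that date,
# even if the source has been refreshed several times since:
# customers_asof = read_snapshot_asof("2024-11-10")
```

The same pattern maps directly onto warehouse tables partitioned by a snapshot_date column.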

𝗠𝗮𝘀𝘁𝗲𝗿𝗶𝗻𝗴 𝗧𝗶𝗺𝗲-𝗩𝗮𝗿𝗶𝗮𝗻𝘁 𝗝𝗼𝗶𝗻𝘀: Joining tables with different timestamps requires careful consideration:

  • Version-Aware Joins: These explicitly consider data versions when joining tables, ensuring temporal consistency but potentially missing valid matches.

  • Latest Common Version Joins: These align joins to the most recent version common to both tables, handling data with different update frequencies but potentially not using the absolute latest data.

  • Asof Joins: These join the latest available data from each table up to a certain point in time, offering flexibility but potentially being computationally expensive.
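
As an illustration of the asof-join pattern, the sketch below uses pandas' merge_asof; the tables and keys are made up for the example, and the same logic can be expressed with ASOF JOIN syntax in engines that support it, such as DuckDB.

```python
import pandas as pd

# Daily-updated exchange rates and weekly order extracts (hypothetical data).
rates = pd.DataFrame({
    "valid_from": pd.to_datetime(["2024-11-01", "2024-11-04", "2024-11-07"]),
    "currency": ["EUR", "EUR", "EUR"],
    "rate": [1.08, 1.07, 1.09],
})
orders = pd.DataFrame({
    "order_ts": pd.to_datetime(["2024-11-03", "2024-11-06"]),
    "currency": ["EUR", "EUR"],
    "amount": [100.0, 250.0],
})

# merge_asof requires both frames to be sorted on the time key.
joined = pd.merge_asof(
    orders.sort_values("order_ts"),
    rates.sort_values("valid_from"),
    left_on="order_ts",
    right_on="valid_from",
    by="currency",          # exact match on the business key
    direction="backward",   # use the latest rate known at order time
)
print(joined[["order_ts", "amount", "rate"]])
```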

𝗖𝗼𝗻𝗰𝗹𝘂𝘀𝗶𝗼𝗻: By understanding these strategies, data engineers can create robust, time-aware data pipelines that maintain data integrity across different time scales and processing frequencies. This not only solves immediate problems, but also provides valuable functionality for historical analysis, audit trails and data lineage tracking.

Go to Article

Further Reading

What is: Time-Sensitive Data

Understanding the Complexities of Real-time Data Pipelines

Data Integration Techniques for Time-Sensitive Data

What Techniques Help You Conduct Time-Sensitive Data Analyses?


𝗨𝗻𝗱𝗲𝗿𝘀𝘁𝗮𝗻𝗱𝗶𝗻𝗴 𝗗𝗮𝘁𝗮 𝗙𝗹𝗼𝘄: 𝗔 𝗚𝘂𝗶𝗱𝗲 𝘁𝗼 𝗗𝗮𝘁𝗮𝘀𝗲𝘁 𝗮𝗻𝗱 𝗙𝗶𝗲𝗹𝗱 𝗟𝗶𝗻𝗲𝗮𝗴𝗲

... critical for effective data governance and management

Data lineage provides a critical map of data's journey, revealing its transformations, relationships, and dependencies. This article by Louise de Leyritz (CastorDoc) explores the importance of data lineage and its two main types: dataset lineage and field lineage. Dataset lineage tracks relationships between datasets, while field lineage focuses on individual data points.

𝗖𝗵𝗮𝗹𝗹𝗲𝗻𝗴𝗲𝘀:

  • Lack of Transparency: Difficulty in understanding how data flows through data pipelines can lead to errors, compliance issues, and wasted resources.

  • Data Integrity Concerns: Unclear data lineage makes it challenging to ensure data accuracy and consistency across visualizations and reports.

  • Limited Impact Analysis: Inability to trace the source and transformations of data hinders impact analysis when changes are made to data pipelines.

𝗥𝗲𝗰𝗼𝗺𝗺𝗲𝗻𝗱𝗮𝘁𝗶𝗼𝗻𝘀:

There are two main types of data lineage that address these challenges:

𝗗𝗮𝘁𝗮𝘀𝗲𝘁 𝗟𝗶𝗻𝗲𝗮𝗴𝗲: Provides a high-level overview of how data moves between tables and dashboards.

  • Computation Methods: SQL parsing, pattern recognition.

  • Benefits: Enables understanding of overall data flow, simplifies troubleshooting data movement issues.

  • Examples: Table-to-table lineage (tracks connections between tables in data warehouses), table-to-dashboard lineage (shows how tables feed into dashboards), dashboard-to-dashboard lineage (maps data flow between visualizations).

𝗙𝗶𝗲𝗹𝗱 𝗟𝗶𝗻𝗲𝗮𝗴𝗲: Offers a granular view of transformations happening to individual data points (columns) within tables and visualization tools.

  • Computation Methods: SQL parsing, API integration, pattern recognition.

  • Benefits: Enables detailed analysis of data transformations, promotes data integrity in visualizations, ensures compliance with regulations.

  • Examples: Column-to-column lineage (tracks specific transformations within tables), column-to-field lineage (maps column data to visualization tool fields), field-to-field lineage (traces data flow between fields in different visualizations).
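
As a small illustration of the SQL-parsing computation method mentioned above, the sketch below uses the open-source sqlglot parser (an assumption; the article does not prescribe a tool) to extract the tables referenced by a transformation, which is the raw input for table-level lineage.

```python
import sqlglot
from sqlglot import exp

sql = """
CREATE TABLE mart.daily_revenue AS
SELECT o.order_date, SUM(o.amount) AS revenue
FROM staging.orders AS o
JOIN staging.currencies AS c ON o.currency = c.code
GROUP BY o.order_date
"""

parsed = sqlglot.parse_one(sql)

# All tables referenced anywhere in the statement.
tables = {
    f"{t.db}.{t.name}" if t.db else t.name
    for t in parsed.find_all(exp.Table)
}
# Note: the CREATE target is returned as well; a real lineage tool would separate
# targets from sources, e.g. by inspecting the CREATE/INSERT node explicitly.
print(tables)  # e.g. {'mart.daily_revenue', 'staging.orders', 'staging.currencies'}
```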

𝗨𝗻𝗱𝗲𝗿𝘀𝘁𝗮𝗻𝗱𝗶𝗻𝗴 𝗯𝗼𝘁𝗵 𝗱𝗮𝘁𝗮𝘀𝗲𝘁 𝗮𝗻𝗱 𝗳𝗶𝗲𝗹𝗱 𝗹𝗶𝗻𝗲𝗮𝗴𝗲 𝗶𝘀 𝗰𝗿𝗶𝘁𝗶𝗰𝗮𝗹 𝗳𝗼𝗿 𝗲𝗳𝗳𝗲𝗰𝘁𝗶𝘃𝗲 𝗱𝗮𝘁𝗮 𝗴𝗼𝘃𝗲𝗿𝗻𝗮𝗻𝗰𝗲 𝗮𝗻𝗱 𝗺𝗮𝗻𝗮𝗴𝗲𝗺𝗲𝗻𝘁. By implementing these forms of lineage, organizations can:

- Improve data accuracy and consistency

- Reduce errors and troubleshooting time

- Facilitate impact analysis

- Ensure compliance with regulations

- Make informed decisions about data usage

Go to Article

Further Reading

Table-Level vs. Field-Level Data Lineage: What’s the Difference?

Building End-to-End Field Level Lineage for Modern Data Systems

The Ultimate Guide To Data Lineage


𝗠𝗼𝗻𝗶𝘁𝗼𝗿𝗶𝗻𝗴 𝗠𝗼𝗱𝗲𝗹 𝗗𝗿𝗶𝗳𝘁: 𝗔 𝗚𝘂𝗶𝗱𝗲 𝗳𝗼𝗿 𝗗𝗮𝘁𝗮 𝗦𝗰𝗶𝗲𝗻𝘁𝗶𝘀𝘁𝘀

... for maintaining model performance in production environments

Model drift occurs when the relationship between input features and target outcomes changes over time. It can significantly impact model performance and must be actively monitored. This article by Kavita Rana (NannyML) explores the challenges of detecting and measuring concept drift and introduces the Reverse Concept Drift algorithm as a potential solution.

𝗖𝗵𝗮𝗹𝗹𝗲𝗻𝗴𝗲𝘀 𝗮𝗻𝗱 𝗥𝗲𝗰𝗼𝗺𝗺𝗲𝗻𝗱𝗮𝘁𝗶𝗼𝗻𝘀:

  • 𝗗𝗲𝘁𝗲𝗰𝘁𝗶𝗻𝗴 𝗖𝗼𝗻𝗰𝗲𝗽𝘁 𝗗𝗿𝗶𝗳𝘁: Residual analysis and resampling residuals can provide insights into concept drift, but they have limitations.

  • 𝗤𝘂𝗮𝗻𝘁𝗶𝗳𝘆𝗶𝗻𝗴 𝗖𝗼𝗻𝗰𝗲𝗽𝘁 𝗗𝗿𝗶𝗳𝘁: The Kolmogorov-Smirnov test can help quantify distributional differences between reference and monitored data (a minimal sketch follows this list).

  • 𝗔𝗱𝗱𝗿𝗲𝘀𝘀𝗶𝗻𝗴 𝗖𝗼𝘃𝗮𝗿𝗶𝗮𝘁𝗲 𝗦𝗵𝗶𝗳𝘁: The Reverse Concept Drift algorithm aims to control for covariate shift and accurately measure concept drift.
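
A minimal sketch of the Kolmogorov-Smirnov check described above, using scipy; the residuals are simulated here and would normally come from your model-monitoring store, and the significance threshold is an arbitrary choice you would tune to your false-alarm tolerance.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Residuals (actual - predicted) from the reference period vs. a monitored window.
reference_residuals = rng.normal(loc=0.0, scale=1.0, size=5_000)
monitored_residuals = rng.normal(loc=0.4, scale=1.2, size=1_000)  # drifted on purpose

statistic, p_value = ks_2samp(reference_residuals, monitored_residuals)

ALPHA = 0.01  # illustrative threshold
if p_value < ALPHA:
    print(f"Possible concept drift: KS={statistic:.3f}, p={p_value:.2e}")
else:
    print(f"No significant shift detected: KS={statistic:.3f}, p={p_value:.2e}")
```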

𝗖𝗼𝗻𝗰𝗹𝘂𝘀𝗶𝗼𝗻: Monitoring model drift is essential for maintaining model performance in production environments. By leveraging techniques like residual analysis, resampling, and the Reverse Concept Drift algorithm, data scientists can effectively detect and address concept drift, ensuring their models remain relevant and accurate over time.

Go to Article

Further Reading

What is concept drift in ML, and how to detect and address it

Using Concept Drift as a Model Retraining Trigger

The 10 most popular Concept Drift Algorithms: An overview and optimization potentials

What Is Concept Drift and How to Detect It


𝗕𝘂𝗶𝗹𝗱𝗶𝗻𝗴 𝗮 𝗥𝗼𝗯𝘂𝘀𝘁 𝗗𝗮𝘁𝗮 𝗤𝘂𝗮𝗹𝗶𝘁𝘆 𝗣𝗶𝗽𝗲𝗹𝗶𝗻𝗲 𝘄𝗶𝘁𝗵 𝗱𝗯𝘁 𝗶𝗻 𝗗𝗮𝘁𝗮𝗯𝗿𝗶𝗰𝗸𝘀

This article by Tobi O. (Databricks) explores best practices for implementing data quality checks within your Databricks data pipelines using dbt. Ensuring data integrity throughout your workflow is crucial for reliable analytics. The article covers a range of tests, including anomaly detection, unit testing, and data contracts, which help maintain high-quality data from source to output.

𝗥𝗲𝗰𝗼𝗺𝗺𝗲𝗻𝗱𝗮𝘁𝗶𝗼𝗻𝘀:

  • 𝗢𝗽𝘁𝗶𝗺𝗶𝘇𝗲 𝗗𝗮𝘁𝗮𝗯𝗿𝗶𝗰𝗸𝘀 𝗖𝗼𝗺𝗽𝘂𝘁𝗲 𝗥𝗲𝘀𝗼𝘂𝗿𝗰𝗲𝘀: Adjust the "threads" configuration in your dbt project's profiles.yml file to leverage concurrency effectively.

  • 𝗜𝗺𝗽𝗹𝗲𝗺𝗲𝗻𝘁 𝗖𝗼𝗺𝗽𝗿𝗲𝗵𝗲𝗻𝘀𝗶𝘃𝗲 𝗧𝗲𝘀𝘁𝗶𝗻𝗴: Conduct source data validation, ingested data checks, and data transformation and output validation.

  • 𝗟𝗲𝘃𝗲𝗿𝗮𝗴𝗲 𝗱𝗯𝘁 𝗙𝗲𝗮𝘁𝘂𝗿𝗲𝘀: Utilize dbt's built-in generic tests and consider creating custom tests as needed.

  • 𝗘𝗻𝗳𝗼𝗿𝗰𝗲 𝗗𝗮𝘁𝗮 𝗤𝘂𝗮𝗹𝗶𝘁𝘆 𝗦𝘁𝗮𝗻𝗱𝗮𝗿𝗱𝘀: Employ constraints and data contracts to prevent invalid data from entering your models.
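
As a hedged sketch of wiring such checks into a pipeline step, the snippet below shells out to the dbt CLI (assuming dbt and a Databricks profile are already configured, and that the script runs from the dbt project root) and inspects dbt's run_results.json artifact. The selector and failure policy are illustrative choices, not the article's recipe.

```python
import json
import subprocess
from pathlib import Path

def run_dbt_tests(select: str = "source:*") -> None:
    """Run dbt tests for the selected resources and fail loudly on bad data."""
    # --select lets you validate sources before running downstream transformations.
    completed = subprocess.run(
        ["dbt", "test", "--select", select],
        capture_output=True,
        text=True,
    )
    print(completed.stdout)

    # dbt writes structured results to target/run_results.json after each invocation.
    results = json.loads(Path("target/run_results.json").read_text())
    failed = [r["unique_id"] for r in results["results"] if r["status"] in ("fail", "error")]
    if failed:
        raise RuntimeError(f"{len(failed)} dbt tests failed: {failed}")

if __name__ == "__main__":
    run_dbt_tests()
```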

𝗖𝗼𝗻𝗰𝗹𝘂𝘀𝗶𝗼𝗻: Effective data quality testing is essential for building reliable and trustworthy data pipelines. By following the recommendations described in this article, you can set up a robust data quality framework with dbt in Databricks.

Go to Article

Further Reading

dbt and Databricks

Databricks and dbt Partnership: What It Means for Your Data Engineering Pipeline

Delivering cost-effective data in real time with dbt and Databricks

New dbt Labs and Databricks Report Highlights The Evolving Roles of Data Practitioners in the AI Era


𝗦𝗺𝗮𝗹𝗹 𝗟𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗠𝗼𝗱𝗲𝗹𝘀 (𝗦𝗟𝗠): 𝗕𝗶𝗴 𝗣𝗼𝘁𝗲𝗻𝘁𝗶𝗮𝗹 𝗳𝗼𝗿 𝗘𝗱𝗴𝗲 𝗖𝗼𝗺𝗽𝘂𝘁𝗶𝗻𝗴

... bringing AI capabilities closer to where data is generated

Large language models (LLMs) offer powerful capabilities but can be expensive and require significant computing power. As Aili McConnon points out, small language models (SLMs) represent a growing trend, offering lower cost, better privacy, and the ability to run on edge devices like smartphones and factory machines. This unlocks new use cases in areas like manufacturing and agriculture, where real-time data analysis is crucial.

𝗖𝗵𝗮𝗹𝗹𝗲𝗻𝗴𝗲𝘀 𝗮𝗻𝗱 𝗖𝗼𝗻𝘀𝗶𝗱𝗲𝗿𝗮𝘁𝗶𝗼𝗻𝘀:

  • 𝗧𝗿𝗮𝗱𝗲𝗼𝗳𝗳 𝗯𝗲𝘁𝘄𝗲𝗲𝗻 𝗔𝗰𝗰𝘂𝗿𝗮𝗰𝘆 𝗮𝗻𝗱 𝗟𝗮𝘁𝗲𝗻𝗰𝘆: While cost-effective and efficient, SLMs may sacrifice some accuracy compared to LLMs.

  • 𝗟𝗶𝗺𝗶𝘁𝗲𝗱 𝗙𝘂𝗻𝗰𝘁𝗶𝗼𝗻𝗮𝗹𝗶𝘁𝘆: SLMs require careful tailoring for specific tasks and may not be suitable for complex applications.
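
To show what running an SLM on-device can look like, here is a minimal sketch using the llama-cpp-python bindings with a locally downloaded quantized model; the package choice, model file path, and parameters are assumptions for illustration rather than anything prescribed in the article.

```python
from llama_cpp import Llama  # assumes the llama-cpp-python package is installed

# Hypothetical path to a quantized small model downloaded to the edge device.
llm = Llama(
    model_path="models/llama-3.2-1b-instruct-q4.gguf",
    n_ctx=2048,    # context window
    n_threads=4,   # tune to the device's CPU
)

prompt = (
    "Summarize in one sentence: sensor 14 reported temperatures above the "
    "95C threshold three times in the last hour."
)
result = llm(prompt, max_tokens=64, temperature=0.2)
print(result["choices"][0]["text"].strip())
```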

𝗖𝗼𝗻𝗰𝗹𝘂𝘀𝗶𝗼𝗻: SLMs represent a significant step forward for edge computing, bringing AI capabilities closer to where data is generated. They offer new opportunities for industries like manufacturing and agriculture to improve efficiency and decision-making.

Go to Article

Further Reading

The Beginner’s Guide to Small Language Models

Tiny but mighty: The Phi-3 small language models with big potential

Small Language Models: SLM vs LLM

Tiny Titans: How Small Language Models Outperform LLMs for Less


𝗨𝗻𝗹𝗼𝗰𝗸𝗶𝗻𝗴 𝘁𝗵𝗲 𝗣𝗼𝘄𝗲𝗿 𝗼𝗳 𝗖𝗼𝗺𝗽𝗼𝘂𝗻𝗱 𝗔𝗜 𝗦𝘆𝘀𝘁𝗲𝗺𝘀

Compound AI systems, which combine multiple components to tackle complex tasks, are becoming increasingly prevalent. This article by Sean Sheng and Sherlock Xu looks at the benefits and challenges of building compound AI systems and emphasizes the importance of effective system design, integration, and scaling.

𝗥𝗲𝗰𝗼𝗺𝗺𝗲𝗻𝗱𝗮𝘁𝗶𝗼𝗻𝘀:

  • 𝗦𝘆𝘀𝘁𝗲𝗺 𝗗𝗲𝘀𝗶𝗴𝗻: Carefully consider the components and interactions within your compound AI system to ensure optimal performance and efficiency.

  • 𝗜𝗻𝘁𝗲𝗴𝗿𝗮𝘁𝗶𝗼𝗻: Integrate your compound AI system with existing tools and data sources to leverage their capabilities and enhance functionality.

  • 𝗦𝗰𝗮𝗹𝗶𝗻𝗴: Implement strategies for scaling your system to handle increasing workloads while maintaining performance and cost-efficiency.

  • 𝗢𝗯𝘀𝗲𝗿𝘃𝗮𝗯𝗶𝗹𝗶𝘁𝘆: Monitor and analyze your system's performance to identify bottlenecks and optimize its operation.
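
To illustrate the composition idea behind these recommendations without tying it to a specific platform, here is a minimal, framework-agnostic Python sketch in which retrieval, generation, and moderation are independent components; all names and stand-in callables are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

# A compound AI system chains specialized components; each stage can be scaled,
# swapped, or monitored independently. The callables are placeholders for real
# services (a vector store, an LLM endpoint, a policy checker).

@dataclass
class CompoundQA:
    retrieve: Callable[[str], List[str]]       # e.g. vector-store lookup
    generate: Callable[[str, List[str]], str]  # e.g. hosted or local LLM
    moderate: Callable[[str], bool]            # e.g. safety / PII filter

    def answer(self, question: str) -> str:
        context = self.retrieve(question)
        draft = self.generate(question, context)
        # Observability hook: log inputs, outputs, and per-component latency here.
        return draft if self.moderate(draft) else "Response withheld by policy check."

# Wiring with trivial stand-ins to show the flow end to end.
system = CompoundQA(
    retrieve=lambda q: ["Orders table is refreshed hourly."],
    generate=lambda q, ctx: f"Based on: {ctx[0]} -> answer to '{q}'",
    moderate=lambda text: "password" not in text.lower(),
)
print(system.answer("How fresh is the orders data?"))
```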

𝗖𝗼𝗻𝗰𝗹𝘂𝘀𝗶𝗼𝗻: By understanding the benefits and challenges of compound AI systems and focusing on sound system design, integration, and scaling, AI teams can build and deploy solutions that deliver exceptional results.

Go to Article

Further Reading

What Are Compound AI Systems?

The Shift from Models to Compound AI Systems

From Generalists to Specialists: The Evolution of AI Systems toward Compound AI

Compound AI: Thinking in Systems, Not Models


𝗘𝗨 𝗔𝗜 𝗔𝗰𝘁 𝗖𝗼𝗺𝗽𝗹𝗶𝗮𝗻𝗰𝗲 𝗧𝗵𝗿𝗼𝘂𝗴𝗵 𝗗𝗮𝘁𝗮 𝗚𝗼𝘃𝗲𝗿𝗻𝗮𝗻𝗰𝗲 𝗮𝗻𝗱 𝗖𝗲𝗻𝘁𝗿𝗮𝗹𝗶𝘇𝗲𝗱 𝗣𝗹𝗮𝘁𝗳𝗼𝗿𝗺𝘀

The EU AI Act brings challenges related to data governance and compliance, particularly for high-risk AI systems. These systems face stringent requirements around transparency, bias mitigation, and high-quality data, with the risk of exclusion from the EU market for non-compliance. Informatica points out that the complexity of managing governance obligations and ensuring data integrity remains a significant hurdle.

𝗥𝗲𝗰𝗼𝗺𝗺𝗲𝗻𝗱𝗮𝘁𝗶𝗼𝗻𝘀:

  • 𝗗𝗮𝘁𝗮 𝗚𝗼𝘃𝗲𝗿𝗻𝗮𝗻𝗰𝗲: Prioritizing robust data governance ensures the availability, integrity, and security of data, which is critical for making compliant decisions and deploying trustworthy AI systems that avoid biases and inaccuracies.

  • 𝗦𝗶𝗻𝗴𝗹𝗲 𝗦𝗼𝘂𝗿𝗰𝗲 𝗼𝗳 𝗧𝗿𝘂𝘁𝗵: Establishing a single source of truth for all data enables traceability, simplifying compliance with the EU AI Act's mandates on data transparency, security, and responsibility across the organization.

  • 𝗖𝗲𝗻𝘁𝗿𝗮𝗹𝗶𝘇𝗲𝗱 𝗚𝗼𝘃𝗲𝗿𝗻𝗮𝗻𝗰𝗲: A centralized data governance platform consolidates compliance efforts, offering clear oversight and enabling businesses to streamline workflows, ensuring alignment with the EU AI Act’s evolving regulatory landscape.

𝗖𝗼𝗻𝗰𝗹𝘂𝘀𝗶𝗼𝗻: Implementing a single source of truth and centralized governance platform is vital for ensuring compliance with the EU AI Act, enhancing transparency, and maintaining operational efficiency.

Go to Article

Further Reading

Data governance for new EU AI Act compliance

AI Act Explained: Navigating Data Governance in the Age of Artificial Intelligence

Data Mesh & EU AI Act: A Call for more Centralized Governance

The EU AI Act: Best Practices for Data and Governance as a Strategic Opportunity!


𝗘𝗻𝗵𝗮𝗻𝗰𝗶𝗻𝗴 𝗗𝗮𝘁𝗮 𝗖𝗼𝗻𝘀𝗶𝘀𝘁𝗲𝗻𝗰𝘆 𝗮𝗻𝗱 𝗘𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝗰𝘆 𝘄𝗶𝘁𝗵 𝘁𝗵𝗲 𝗦𝗵𝗶𝗳𝘁 𝗟𝗲𝗳𝘁 𝗔𝗿𝗰𝗵𝗶𝘁𝗲𝗰𝘁𝘂𝗿𝗲

The Shift Left Architecture addresses the persistent challenges of data integration, such as high compute costs, inconsistent data, and slow updates in traditional ETL processes. Kai Waehner points out that by unifying transactional and analytical workloads through a data mesh approach, this model leverages Apache Kafka, Flink, and Iceberg for real-time data streaming.

𝗖𝗵𝗮𝗹𝗹𝗲𝗻𝗴𝗲𝘀:

  • 𝗗𝗮𝘁𝗮 𝗜𝗻𝗰𝗼𝗻𝘀𝗶𝘀𝘁𝗲𝗻𝗰𝘆: Traditional ETL pipelines can cause data discrepancies, especially between batch and real-time applications.

  • 𝗖𝗼𝗺𝗽𝘂𝘁𝗲 𝗖𝗼𝘀𝘁𝘀: High compute demands in data warehouses increase costs; Shift Left Architecture optimizes compute efficiency.

  • 𝗥𝗲𝗱𝘂𝗻𝗱𝗮𝗻𝘁 𝗗𝗲𝘃𝗲𝗹𝗼𝗽𝗺𝗲𝗻𝘁: Multiple business units often duplicate processing steps, slowing time-to-market and adding expense.

𝗔𝗱𝘃𝗮𝗻𝘁𝗮𝗴𝗲𝘀:

  • 𝗥𝗲𝗮𝗹-𝗧𝗶𝗺𝗲 𝗗𝗮𝘁𝗮 𝗖𝗼𝗻𝘀𝗶𝘀𝘁𝗲𝗻𝗰𝘆: Streaming supports reliable, up-to-date data across applications, ensuring seamless data quality and consistency for both analytics and operations.

  • 𝗥𝗲𝗱𝘂𝗰𝗲𝗱 𝗖𝗼𝗺𝗽𝘂𝘁𝗲 𝗖𝗼𝘀𝘁𝘀: Streamlined data processing lowers compute requirements, driving cost efficiency across multiple platforms while improving scalability.

  • 𝗔𝗰𝗰𝗲𝗹𝗲𝗿𝗮𝘁𝗲𝗱 𝗧𝗶𝗺𝗲-𝘁𝗼-𝗠𝗮𝗿𝗸𝗲𝘁: Reduced ETL complexity accelerates delivery of new business logic, enhancing speed and agility across different business units.

  • 𝗘𝗻𝗵𝗮𝗻𝗰𝗲𝗱 𝗙𝗹𝗲𝘅𝗶𝗯𝗶𝗹𝗶𝘁𝘆 𝗔𝗰𝗿𝗼𝘀𝘀 𝗣𝗹𝗮𝘁𝗳𝗼𝗿𝗺𝘀: Supports optimal technology choices for each use case, fostering adaptability and enabling diverse tools in data and analytics environments.

Shift Left unifies operational and analytical workloads in one architecture, achieving real-time data consistency across platforms.
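
As a rough sketch of what "shifting left" can mean in code, the example below consumes raw events, applies a data-quality rule once in the stream, and republishes curated events using the confluent-kafka client; broker addresses, topic names, and the validation rule are illustrative assumptions.

```python
import json
from confluent_kafka import Consumer, Producer  # assumes confluent-kafka is installed

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # hypothetical broker
    "group.id": "orders-curation",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["orders.raw"])

def is_valid(order: dict) -> bool:
    # Data quality enforced once, in the stream, instead of in every downstream ETL job.
    return order.get("amount", 0) > 0 and "currency" in order

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        order = json.loads(msg.value())
        if is_valid(order):
            # Curated events serve operational consumers and the lakehouse alike.
            producer.produce("orders.curated", json.dumps(order).encode("utf-8"))
        producer.poll(0)  # serve delivery callbacks
finally:
    producer.flush()
    consumer.close()
```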

𝗖𝗼𝗻𝗰𝗹𝘂𝘀𝗶𝗼𝗻: Shift Left Architecture offers a streamlined approach to data consistency and cost efficiency, equipping organizations with a robust foundation for future data-driven initiatives.

Go to Article

Further Reading

What is Shift Left?

Embracing Shift-Left in Data Architecture

Databricks and the “Shift Left Architecture”

Shift left vs shift right: A DevOps mystery solved


𝗕𝗿𝗶𝗱𝗴𝗶𝗻𝗴 𝘁𝗵𝗲 𝗚𝗮𝗽: 𝗗𝗮𝘁𝗮 𝗠𝗲𝘀𝗵 𝗮𝗻𝗱 𝗗𝗮𝘁𝗮 𝗙𝗮𝗯𝗿𝗶𝗰

Rethinking Data Management: A Hybrid Paradigm

Data Mesh and Data Fabric are two distinct approaches to data management, each with its own strengths and weaknesses. Jay Piscioneri (Eckerson Group) points out that Data Mesh emphasizes decentralization and domain ownership, while Data Fabric prioritizes centralization and integration.

𝗞𝗲𝘆 𝗖𝗼𝗻𝘀𝗶𝗱𝗲𝗿𝗮𝘁𝗶𝗼𝗻𝘀:

  • 𝗕𝗮𝗹𝗮𝗻𝗰𝗶𝗻𝗴 𝗖𝗲𝗻𝘁𝗿𝗮𝗹𝗶𝘇𝗮𝘁𝗶𝗼𝗻 𝗮𝗻𝗱 𝗗𝗲𝗰𝗲𝗻𝘁𝗿𝗮𝗹𝗶𝘇𝗮𝘁𝗶𝗼𝗻: Organizations should find the right balance between central control and domain autonomy to optimize data management.

  • 𝗗𝗮𝘁𝗮 𝗣𝗿𝗼𝗱𝘂𝗰𝘁𝘀: Data products can serve as a bridge between Data Mesh and Data Fabric, enabling data sharing and collaboration while maintaining data quality and governance.

  • 𝗗𝗮𝘁𝗮 𝗚𝗼𝘃𝗲𝗿𝗻𝗮𝗻𝗰𝗲: A robust data governance framework is essential to ensure data quality, security, and compliance.

  • 𝗗𝗮𝘁𝗮 𝗖𝘂𝗹𝘁𝘂𝗿𝗲: Fostering a data-driven culture is crucial for successful data management and adoption.

𝗖𝗼𝗻𝗰𝗹𝘂𝘀𝗶𝗼𝗻: By combining the best aspects of Data Mesh and Data Fabric, organizations can develop a hybrid approach that balances flexibility, control, and scalability. This approach enables data teams to deliver value while ensuring data quality, security, and compliance.

Go to Article

Further Reading

Combining Data Mesh and Data Fabric

Combining Data Mesh and Data Fabric Capabilities

Data Mesh vs Data Fabric: Which Approach Fits Your Needs in 2024?

Combining Data Mesh and Data Fabric for Next-Gen Data Security


Takeaways

Here are the key takeaways from this month's edition, providing you with essential strategies and insights to excel in data engineering:

Engineering with Empathy: As a data engineer, you should build empathy into your approach by understanding the different needs of users and collaborating beyond code to develop accessible and impactful products. This user-centered mindset improves data quality, product relevance and overall inclusivity in data engineering.

Innovation and AI: As an engineer, you should use AI to analyze customer needs, personalize experiences, and predict market trends. This approach supports continuous product improvement, ensures adaptation to market requirements, and promotes sustainable, innovation-driven growth.

Impactful Reporting: Focus on building reporting systems that prioritize decision-driven insights over pure data delivery. Leverage automation, integrate advanced analytics and design user-friendly dashboards to provide finance teams with actionable, strategic insights that enable informed decision making.

Time-Sensitive Data Pipelines: Try to manage time-dependent data by implementing versioning, temporal modeling and time-dependent joins. Balancing storage, query complexity and processing time ensures data integrity, consistency and accurate historical analysis and enables reliable pipelines for complex data ecosystems.

Dataset- and Field-Lineage: Implement both dataset and field lineage to gain a comprehensive view of data flow, from high-level relationships to granular transformations. This approach strengthens data governance, ensures integrity, and supports informed decision-making across complex data ecosystems.

Reverse Concept Drift: Actively monitor machine learning models post-deployment to address concept drift. Use techniques like residual analysis and Reverse Concept Drift detection to isolate shifts, quantify impacts, and adapt models, ensuring consistent performance as data and user behaviors evolve.

dbt on Databricks: With dbt on Databricks, you can ensure robust data pipelines by implementing unit tests, constraints and freshness checks, while leveraging Unity Catalog to manage metadata and improve pipeline efficiency with optimized compute resources.

Small Language Models (SLM): Utilize Small Language Models (SLMs) to develop cost-effective, privacy-friendly AI applications for edge devices that improve real-time data processing and enable localized, energy-efficient solutions in industries such as manufacturing and healthcare.

Compound AI Systems: Utilize compound AI systems by integrating multiple models, APIs, and proprietary data to improve efficiency, scalability, and control. Leverage platforms like BentoML to streamline development, optimize performance, and ensure data privacy for complex, dynamic AI-driven workflows.

EU AI Act Compliance: Focus on building transparent, robust, and traceable AI systems by implementing strong data governance practices to meet EU AI Act compliance, ensuring ethical AI development while leveraging reliable, high-quality data for innovation and business impact.

Shift Left Architecture: The Shift Left Architecture is ideal for unifying transactional and analytical data through real-time streaming, ensuring consistency, reducing costs and accelerating innovation by leveraging technologies such as Kafka, Flink and Iceberg for efficient, scalable and flexible data solutions.

Blending Data Mesh and Data Fabric: Combine Data Mesh and Data Fabric to balance autonomy and centralization. Leverage data products and marketplaces to ensure scalability, governance and collaboration to create adaptable, efficient data ecosystems tailored to your organization's needs and evolving business challenges.


Conclusion

In this issue, we've looked at the key data engineering strategies that drive innovation and efficiency. By taking a user-centric approach, leveraging AI for personalization and market forecasting, and focusing on meaningful reporting, you can optimize decision-making and continuous improvement. Managing time-sensitive data with advanced techniques ensures data integrity, while implementing comprehensive dataset and field lineage strengthens governance and decision-making. Monitoring models for concept drift after deployment and optimizing data pipelines with dbt on Databricks ensures consistency and performance. Small language models and compound AI systems improve real-time processing and enable privacy-friendly solutions. Compliance with the EU AI Act ensures ethical AI development, while the Shift Left Architecture and the combination of Data Mesh with Data Fabric enable scalable, flexible, and collaborative data ecosystems. These insights are essential for the further development of data engineering practices and enable well-founded, innovation-driven growth.

Stay tuned for our next issue, where we’ll explore the latest breakthroughs in data technology and best practices.

See you next month ...


#DataScience #DataEngineering #DataAnalytics #DataStrategy #DataOps #DataDriven #DataPlatform #DataLake #DataWarehouse #DataLakehouse #DataMesh #DataFabric #DataGovernance #DataManagement #DataQuality #DataPrivacy #DataVisualization #MachineLearning #AI #ArtificialIntelligence #Innovation #BusinessStrategy #DigitalTransformation #BigData #Analytics #ApacheSpark #ETL #RealTimeData #ApacheKafka #ApacheFlink #DataIntegration #CloudComputing #Leadership #Management #Strategy #GenerativeAI #NaturalLanguageProcessing #SmallLanguageModels #MLops #DevOps #DataLineage #ConceptDrift #RiskManagement #EthicalAI #AICompliance #EUAIACT #DataArchitecture #DataStreaming #CompoundAI #Technology

Isha Taneja

Specialist in Data Engineering, Analytics, AI & Cloud Services || CEO@Complere Infosystem || Editor @The Executive Outlook || Chair @TIE Women Chd

1d

Excellent post! Data engineering continues to evolve with exciting trends and challenges, especially around data pipelines and governance. Time-sensitive data pipelines and managing version consistency are critical for ensuring the accuracy and reliability of data. Furthermore, robust data governance remains essential to maintain data integrity and security as we scale AI and machine learning solutions. Combining the best practices from Data Mesh and Data Fabric, alongside strong governance frameworks, will help organizations build scalable, flexible data architectures while ensuring compliance and high-quality insights.

syeda fatima batool

Software Engineer at none

2d

Axel, your insights on the evolving landscape of data engineering are truly enlightening. The emphasis on human-centered AI and effective data governance resonates deeply, and I believe these concepts will drive substantial progress in our field. Thank you for sharing your expertise!
