Selected Data Engineering Posts . . . August 2024

The most popular of my posts on data engineering in August 2024 ... with additional references ...


Welcome to the latest edition of "Selected Data Engineering Posts". This month we look at key trends that are reshaping data processing. Learn how separating business logic from data management can streamline workflows and reduce errors, and how the transition from traditional ETL to the more advanced EtLT architecture enables real-time processing and the integration of diverse sources. See how the EU AI Act is raising the bar for data quality, with tools such as Databricks Lakehouse and Unity Catalog proving essential for compliance. We also explore the importance of choosing the right semantic layer, using brainwriting for innovation, and ensuring data observability for reliable AI. Finally, we discuss the benefits of structured data tagging for improved governance and utilization. Dive in to expand your knowledge and stay at the forefront of advances in data engineering.

Each post is accompanied by references to further reading so that you can deepen your knowledge of these informative topics.

Subscribe now to stay updated with our monthly issues to realize the full potential of data engineering and make impactful business decisions. Expand your data engineering expertise today!

This issue:

Data Engineering, Redefined: The main problem with data engineering today is the inappropriate mixing of data management and business logic. Data engineers are overburdened with implementing business rules that should be the responsibility of application developers, resulting in inefficient and error-prone data pipelines.

Beyond Traditional ETL: The evolution of data integration has progressed from traditional ETL to the modern EtLT architecture, which integrates real-time data processing and hybrid batch-stream integration. Embrace EtLT for enhanced real-time capabilities and support for diverse data sources while preparing for future innovations like DataFabric and automated governance.

EU AI Act as a Catalyst: The EU AI Act serves as a catalyst for raising standards for data engineering by enforcing strict requirements for data quality and management. Tools such as Databricks Lakehouse and Unity Catalog play a critical role in meeting these standards, ensuring compliance and improving data management.

Understanding Semantic Layers: In the modern data landscape, "Semantic Layer" refers to two distinct types: the Metrics layer, which centralizes analytics and simplifies query access, and the Semantic Layer, which provides contextual understanding and interoperability across data. Choosing the appropriate layer is crucial for building a robust, adaptable enterprise infrastructure.

Data Trustworthiness: A flexible approach to data trustworthiness should be adopted to fully leverage data assets, with recognition that not all data requires top-tier certification. This shift enhances usability, reduces costs, and fosters innovation while necessitating updates to governance roles and policies.

Brainwriting: Brainwriting fosters innovation by allowing all team members, including quieter ones, to contribute ideas independently, reducing groupthink and dominant voices. This method enhances idea generation, promotes diverse perspectives, and improves focus, making it valuable for complex data engineering challenges.

Data Quality for AI: High-quality training data is crucial for effective LLMs and AI models. Poor data introduces noise and bias, leading to unreliable predictions. Prioritizing data accuracy, completeness, and validity is essential for achieving accurate, trustworthy AI outcomes and meaningful insights.

Data Observability: To ensure trustworthy and effective AI, data observability is essential. It allows for continuous monitoring, quality checks, and error correction of data, which enhances AI model performance, transparency, and compliance. Implementing data observability practices fosters collaboration and builds trust in AI systems.

Data Tagging: Implementing a structured tagging strategy enhances data governance by improving classification, lifecycle management, and compliance, while automating and integrating tagging with Unity Catalog boosts efficiency and utility.

We look forward to sharing these insights with you and supporting your journey towards data excellence.

Enjoy reading!


𝗥𝗲𝗱𝗲𝗳𝗶𝗻𝗶𝗻𝗴 𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴 𝗳𝗼𝗿 𝗠𝗼𝗱𝗲𝗿𝗻 𝗡𝗲𝗲𝗱𝘀

𝐟𝐨𝐜𝐮𝐬𝐢𝐧𝐠 𝐨𝐧 𝐝𝐚𝐭𝐚 𝐩𝐫𝐨𝐜𝐞𝐬𝐬𝐢𝐧𝐠 𝐚𝐧𝐝 𝐦𝐚𝐧𝐚𝐠𝐞𝐦𝐞𝐧𝐭, 𝐰𝐢𝐭𝐡𝐨𝐮𝐭 𝐢𝐧𝐜𝐥𝐮𝐝𝐢𝐧𝐠 𝐛𝐮𝐬𝐢𝐧𝐞𝐬𝐬 𝐥𝐨𝐠𝐢𝐜.

In this article, Bernd Wessely points out that data engineering is often defined as developing, implementing, and maintaining systems that transform raw data into high-quality information for uses like analysis and machine learning. However, this approach has significant issues.

Transforming raw data into meaningful information involves applying the correct logic, usually handled by applications developed by software engineers. Currently, data engineers are tasked with implementing business logic, leading to inconsistent and hidden logic within brittle data pipelines.

A new definition of data engineering is proposed: 𝗳𝗼𝗰𝘂𝘀𝗶𝗻𝗴 𝘀𝗼𝗹𝗲𝗹𝘆 𝗼𝗻 𝗱𝗮𝘁𝗮 𝗺𝗼𝘃𝗲𝗺𝗲𝗻𝘁, 𝗺𝗮𝗻𝗶𝗽𝘂𝗹𝗮𝘁𝗶𝗼𝗻, 𝗮𝗻𝗱 𝗺𝗮𝗻𝗮𝗴𝗲𝗺𝗲𝗻𝘁, 𝘄𝗶𝘁𝗵𝗼𝘂𝘁 𝗶𝗻𝗰𝗹𝘂𝗱𝗶𝗻𝗴 𝗯𝘂𝘀𝗶𝗻𝗲𝘀𝘀 𝗹𝗼𝗴𝗶𝗰. Data engineers should provide tools and platforms for application developers, who handle business logic.

𝗖𝗵𝗮𝗹𝗹𝗲𝗻𝗴𝗲𝘀 𝗶𝗻 𝗖𝘂𝗿𝗿𝗲𝗻𝘁 𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴:

  • Mixing data transformation with business logic.

  • Creating brittle data pipelines with inconsistent logic.

  • Encroaching on roles traditionally handled by application developers.

𝗣𝗿𝗼𝗽𝗼𝘀𝗲𝗱 𝗥𝗲𝗱𝗲𝗳𝗶𝗻𝗶𝘁𝗶𝗼𝗻:

  • Data engineering should be about data movement, manipulation, and management.

  • Business logic should be handled by application developers.

  • Technical manipulations allowed for data engineers include partitioning, reformatting, and indexing, but not adding new business information.

𝗖𝗼𝗻𝗰𝗹𝘂𝘀𝗶𝗼𝗻:

Data engineering should focus on providing a robust data infrastructure and tools and leave the business logic to the application developers. This separation ensures clear responsibilities, improves data quality and supports scalable, maintainable systems.
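To make the proposed separation concrete, here is a minimal PySpark sketch (not from the article; the paths, column names, and discount rule are illustrative assumptions). The ingestion function limits itself to technical manipulation such as type reformatting and partitioning, while the business rule lives in a function owned by the application team and is merely hosted on the platform that data engineering provides.

```python
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-ingestion").getOrCreate()

def ingest_orders(raw_path: str) -> DataFrame:
    """Data engineering scope: move and reshape data without adding business meaning."""
    return (
        spark.read.json(raw_path)                              # extract raw events
        .withColumn("order_ts", F.to_timestamp("order_ts"))    # technical reformatting
        .withColumn("order_date", F.to_date("order_ts"))       # derive a partition key
        .repartition("order_date")                             # technical partitioning
    )

def apply_discount_policy(orders: DataFrame) -> DataFrame:
    """Business logic scope: a rule owned by the application team, only hosted
    on the platform that data engineering provides."""
    return orders.withColumn(
        "discounted_total",
        F.when(F.col("customer_tier") == "gold", F.col("total") * 0.9)
         .otherwise(F.col("total")),
    )

# The pipeline wires both steps together while keeping ownership boundaries explicit.
orders = ingest_orders("/raw/orders/2024-08/")                 # hypothetical path
curated = apply_discount_policy(orders)
curated.write.mode("overwrite").partitionBy("order_date").parquet("/curated/orders/")
```

The point of the sketch is not the tooling but the boundary: the data engineer owns `ingest_orders` and the platform it runs on, while the discount rule is versioned, tested, and changed by the application team.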

Go to Article

Additional References - including controversial ones

Data engineering: a role redefined by business needs and interpersonal skills

Dear Data Engineer — Get to know your Stakeholders

The Future of Data Engineering as a Data Engineer

The Synergy of Algorithms: Generative AI Redefining Data Engineering

Data Engineering in the Age of Generative AI


𝗧𝗵𝗲 𝗙𝘂𝘁𝘂𝗿𝗲 𝗼𝗳 𝗗𝗮𝘁𝗮 𝗜𝗻𝘁𝗲𝗴𝗿𝗮𝘁𝗶𝗼𝗻: 𝗠𝗼𝘃𝗶𝗻𝗴 𝗕𝗲𝘆𝗼𝗻𝗱 𝗘𝗧𝗟

𝐄𝐭𝐋𝐓 𝐚𝐫𝐜𝐡𝐢𝐭𝐞𝐜𝐭𝐮𝐫𝐞 𝐢𝐬 𝐛𝐞𝐜𝐨𝐦𝐢𝐧𝐠 𝐚 𝐠𝐥𝐨𝐛𝐚𝐥 𝐬𝐭𝐚𝐧𝐝𝐚𝐫𝐝 𝐢𝐧 𝐝𝐚𝐭𝐚 𝐢𝐧𝐭𝐞𝐠𝐫𝐚𝐭𝐢𝐨𝐧

In this article, Dr. RVS Praveen Ph.D points out that data integration is evolving beyond traditional ETL (Extract, Transform, Load) to more advanced architectures such as ELT (Extract, Load, Transform) and 𝗘𝘁𝗟𝗧 (Extract, transform, Load, Transform).

𝗘𝗧𝗟 𝗔𝗿𝗰𝗵𝗶𝘁𝗲𝗰𝘁𝘂𝗿𝗲:

  • 𝗔𝗱𝘃𝗮𝗻𝘁𝗮𝗴𝗲𝘀: Ensures data consistency and quality, integrates complex data sources, provides clear technical architecture, and facilitates business rule implementation.

  • 𝗗𝗶𝘀𝗮𝗱𝘃𝗮𝗻𝘁𝗮𝗴𝗲𝘀: Lacks real-time processing, incurs high hardware costs, offers limited flexibility, has expensive maintenance, and poorly handles unstructured data.

𝗘𝗟𝗧 𝗔𝗿𝗰𝗵𝗶𝘁𝗲𝗰𝘁𝘂𝗿𝗲:

  • 𝗔𝗱𝘃𝗮𝗻𝘁𝗮𝗴𝗲𝘀: Efficiently handles large data volumes, enhances development and operational efficiency, is cost-effective, flexible, and scalable, and integrates seamlessly with new technologies.

  • 𝗗𝗶𝘀𝗮𝗱𝘃𝗮𝗻𝘁𝗮𝗴𝗲𝘀: Offers limited real-time processing support, incurs high data storage costs, may have data quality issues, and depends on target system capabilities.

𝗘𝘁𝗟𝗧 𝗔𝗿𝗰𝗵𝗶𝘁𝗲𝗰𝘁𝘂𝗿𝗲:

  • 𝗔𝗱𝘃𝗮𝗻𝘁𝗮𝗴𝗲𝘀: Supports real-time data processing, complex data sources, cost reduction, enhanced flexibility and scalability, optimized performance, large model handling, and improved data quality and governance.

  • 𝗗𝗶𝘀𝗮𝗱𝘃𝗮𝗻𝘁𝗮𝗴𝗲𝘀: Involves increased technical complexity, depends on target system capabilities, faces management and monitoring challenges, has complex data change management, and depends on specific tools and platforms.

𝗞𝗲𝘆 𝗧𝗿𝗲𝗻𝗱𝘀 𝗶𝗻 𝗗𝗮𝘁𝗮 𝗜𝗻𝘁𝗲𝗴𝗿𝗮𝘁𝗶𝗼𝗻:

  1. Shift from batch processing to real-time data capture and hybrid batch-stream integration.

  2. Evolution of data transformation capabilities to handle complex transformations within data warehouses.

  3. Expansion of data source support to include emerging sources, unstructured data systems, and cloud databases.

  4. Emphasis on core capabilities like data source diversity, accuracy, and ease of troubleshooting.

𝗖𝗼𝗻𝗰𝗹𝘂𝘀𝗶𝗼𝗻: EtLT architecture is becoming a global standard in data integration, addressing the limitations of traditional ETL and ELT by enabling real-time data processing, flexibility, and scalability.
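The article stays at the architectural level; purely as an illustration of the EtLT flow, the sketch below applies only a light technical transform (type normalization and deduplication) on the way in, loads the result, and leaves the business-aware transformation to run inside the target system. The Kafka broker, topic, table names, and paths are assumptions, not taken from the article.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etlt-sketch").getOrCreate()

# E + t: extract in real time and apply only a light, technical transform.
schema = "user_id STRING, url STRING, ts STRING"
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")          # hypothetical broker
    .option("subscribe", "clickstream")                        # hypothetical topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
    .withColumn("ts", F.to_timestamp("ts"))                    # normalize types
    .dropDuplicates(["user_id", "ts"])                         # technical deduplication only
)

# L: load the lightly transformed stream into the lakehouse / warehouse.
(
    events.writeStream.format("delta")
    .option("checkpointLocation", "/chk/clickstream")          # hypothetical path
    .toTable("raw.clickstream_events")                         # hypothetical table
)

# T: the heavier, business-aware transformation runs later inside the target system,
# e.g. as a scheduled job once data has landed.
spark.sql("""
    CREATE OR REPLACE TABLE analytics.daily_sessions AS
    SELECT user_id, DATE(ts) AS day, COUNT(*) AS events
    FROM raw.clickstream_events
    GROUP BY user_id, DATE(ts)
""")
```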

Go to Article

Additional References

How Data Integration Is Evolving Beyond ETL

Future Trends in Data Integration

ETL vs. ELT: Dive deeper into two data processing approaches

The convergence of ETL and ELT: The future of unified data management


𝗧𝗵𝗲 𝗘𝗨 𝗔𝗜 𝗔𝗰𝘁 𝗮𝘀 𝗮 𝗖𝗮𝘁𝗮𝗹𝘆𝘀𝘁 𝘁𝗼 𝗥𝗮𝗶𝘀𝗲 𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴 𝗦𝘁𝗮𝗻𝗱𝗮𝗿𝗱𝘀

Effective data management and data governance are essential for maximizing the value of corporate data, ensuring its quality, transparency, and regulatory compliance. 𝗔𝗿𝘁𝗶𝗰𝗹𝗲 𝟭𝟬 𝗼𝗳 𝘁𝗵𝗲 𝗘𝗨 𝗔𝗜 𝗔𝗰𝘁 𝘂𝗻𝗱𝗲𝗿𝘀𝗰𝗼𝗿𝗲𝘀 𝘁𝗵𝗲 𝗶𝗺𝗽𝗼𝗿𝘁𝗮𝗻𝗰𝗲 𝗼𝗳 𝘂𝘀𝗶𝗻𝗴 𝗵𝗶𝗴𝗵-𝗾𝘂𝗮𝗹𝗶𝘁𝘆 𝗱𝗮𝘁𝗮 𝗳𝗼𝗿 𝗵𝗶𝗴𝗵-𝗿𝗶𝘀𝗸 𝗔𝗜 𝘀𝘆𝘀𝘁𝗲𝗺𝘀. It mandates that data for training, validation, and testing must be carefully managed to be relevant, accurate, and free from bias.

𝗧𝗵𝗲 𝗘𝗨 𝗔𝗜 𝗔𝗰𝘁 𝗽𝗿𝗼𝘃𝗶𝗱𝗲𝘀 𝗮 𝗴𝗿𝗲𝗮𝘁 𝗼𝗽𝗽𝗼𝗿𝘁𝘂𝗻𝗶𝘁𝘆 𝗳𝗼𝗿 𝗼𝗿𝗴𝗮𝗻𝗶𝘇𝗮𝘁𝗶𝗼𝗻𝘀 𝘁𝗼 𝗶𝗺𝗽𝗿𝗼𝘃𝗲 𝘁𝗵𝗲𝗶𝗿 𝗱𝗮𝘁𝗮 𝗲𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴 𝗽𝗿𝗮𝗰𝘁𝗶𝗰𝗲𝘀.  By adhering to the Act’s requirements, organizations are encouraged to adopt advanced data platforms and governance frameworks, such as Databricks Lakehouse and Unity Catalog. These tools facilitate the integration, management, and governance of data, ensuring it meets stringent quality and compliance standards.

Implementing best practices for data engineering not only supports regulatory compliance, but also 𝘀𝘁𝗿𝗲𝗻𝗴𝘁𝗵𝗲𝗻𝘀 𝘁𝗵𝗲 𝗿𝗲𝗹𝗶𝗮𝗯𝗶𝗹𝗶𝘁𝘆 𝗮𝗻𝗱 𝘁𝗿𝘂𝘀𝘁𝘄𝗼𝗿𝘁𝗵𝗶𝗻𝗲𝘀𝘀 𝗼𝗳 𝗔𝗜 𝘀𝘆𝘀𝘁𝗲𝗺𝘀.

The EU AI Act can thus act as a catalyst for improving data management processes, enhancing data quality and promoting responsible AI development.

Go to Article

Additional References

How to prepare for the EU AI Act

Navigating the EU AI Act: Technical Key Priorities for Businesses

What does the AI Act mean for the use of AI in companies?

Compliance under the EU AI Act: Summary and Key Issues


𝗨𝗻𝗱𝗲𝗿𝘀𝘁𝗮𝗻𝗱𝗶𝗻𝗴 𝗦𝗲𝗺𝗮𝗻𝘁𝗶𝗰 𝗟𝗮𝘆𝗲𝗿𝘀: 𝗠𝗲𝘁𝗿𝗶𝗰𝘀 𝘃𝘀. 𝗧𝗿𝘂𝗲 𝗦𝗲𝗺𝗮𝗻𝘁𝗶𝗰𝘀

𝐂𝐮𝐬𝐭𝐨𝐦𝐢𝐳𝐞𝐝 𝐬𝐨𝐥𝐮𝐭𝐢𝐨𝐧𝐬 𝐰𝐢𝐥𝐥 𝐛𝐞𝐭𝐭𝐞𝐫 𝐬𝐞𝐫𝐯𝐞 𝐜𝐨𝐦𝐩𝐥𝐞𝐱 𝐧𝐞𝐞𝐝𝐬

In his article, Helyx Chase Scearce Horwitz points out that in data management, the term "Semantic Layer" can refer to 𝘁𝘄𝗼 𝗱𝗶𝘀𝘁𝗶𝗻𝗰𝘁 𝘁𝘆𝗽𝗲𝘀 𝗼𝗳 𝗹𝗮𝘆𝗲𝗿𝘀:

𝗠𝗲𝘁𝗿𝗶𝗰𝘀 𝗟𝗮𝘆𝗲𝗿: Originating from the 1990s, this layer simplifies querying by allowing users to access databases without SQL knowledge. It focuses on standardizing metrics and analytics to ensure consistent interpretations across an organization.

𝗦𝗲𝗺𝗮𝗻𝘁𝗶𝗰 𝗟𝗮𝘆𝗲𝗿: This layer provides a broader context by linking various data and information sources. It emphasizes interoperability and extensibility, utilizing frameworks akin to the semantic web to make implicit relationships explicit and facilitate comprehensive understanding.

𝗞𝗲𝘆 𝗙𝘂𝗻𝗰𝘁𝗶𝗼𝗻𝘀:

  • 𝗠𝗲𝘁𝗿𝗶𝗰𝘀 𝗟𝗮𝘆𝗲𝗿: Standardizes and simplifies analytics for consistent, self-serve access.

  • 𝗦𝗲𝗺𝗮𝗻𝘁𝗶𝗰 𝗟𝗮𝘆𝗲𝗿: Connects disparate data sources, enhancing understanding and enabling complex insights through a graph data model and ontologies.

While AIO (All-In-One) Semantic Layer tools offer integrated analytics, they often fall short of the extensive capabilities of a true Semantic Layer. A genuine Semantic Layer not only provides consistent metrics but also integrates diverse information sources, supports advanced analytics, and ensures a unified source of truth for the entire enterprise.

𝗖𝗼𝗻𝗰𝗹𝘂𝘀𝗶𝗼𝗻: For transformative insights and comprehensive data integration, 𝗲𝗻𝘁𝗲𝗿𝗽𝗿𝗶𝘀𝗲𝘀 𝗻𝗲𝗲𝗱 𝗮 𝗦𝗲𝗺𝗮𝗻𝘁𝗶𝗰 𝗟𝗮𝘆𝗲𝗿 𝘁𝗵𝗮𝘁 𝗴𝗼𝗲𝘀 𝗯𝗲𝘆𝗼𝗻𝗱 𝗺𝗲𝘁𝗿𝗶𝗰𝘀 𝗮𝗻𝗱 𝗶𝗻𝘁𝗲𝗴𝗿𝗮𝘁𝗲𝘀 𝗼𝗿𝗴𝗮𝗻𝗶𝘇𝗮𝘁𝗶𝗼𝗻𝗮𝗹 𝗸𝗻𝗼𝘄𝗹𝗲𝗱𝗴𝗲 𝗱𝗲𝗲𝗽𝗹𝘆. Customized solutions with semantic depth will better serve complex needs and future-proof enterprise data strategies.
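As a toy illustration of the two meanings (entirely illustrative, not from the article): a metrics layer centralizes metric definitions that tools compile into SQL, while a semantic layer records relationships between concepts so that context can be traversed and reused.

```python
# Metrics layer (illustrative): one agreed definition per metric, compiled to SQL
# so that every dashboard reports the same number.
METRICS = {
    "monthly_active_users": {
        "table": "events",
        "measure": "COUNT(DISTINCT user_id)",
        "time_grain": "month",
        "filters": ["event_type = 'login'"],
    },
}

def compile_metric(name: str) -> str:
    m = METRICS[name]
    where = " AND ".join(m["filters"]) or "TRUE"
    return (
        f"SELECT DATE_TRUNC('{m['time_grain']}', ts) AS period, "
        f"{m['measure']} AS {name} "
        f"FROM {m['table']} WHERE {where} GROUP BY 1"
    )

# Semantic layer (illustrative): explicit relationships between concepts that tools
# can traverse, capturing context no single metric definition encodes.
TRIPLES = [
    ("Customer", "places", "Order"),
    ("Order", "contains", "Product"),
    ("Product", "belongs_to", "Category"),
]

def related(entity):
    """Return the outgoing relationships of an entity in the toy graph."""
    return [(predicate, obj) for subj, predicate, obj in TRIPLES if subj == entity]

print(compile_metric("monthly_active_users"))
print(related("Order"))   # [('contains', 'Product')]
```

Even in this toy form the difference in intent is visible: the metrics layer answers "what is the agreed number?", while the semantic layer answers "how do these things relate?".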

Go to Article

Additional References

Industry Panel: Different Applications of a Semantic Layer — Takeaways Blog

Semantic layer vs metric layer, or a hybrid solution. Which is right for me?

What is a Metrics Layer?

Semantic Layer — One Layer to Serve Them All

Semantic layer 101: Why your data team should focus on metrics over data


𝗥𝗲𝘁𝗵𝗶𝗻𝗸𝗶𝗻𝗴 𝗗𝗮𝘁𝗮 𝗧𝗿𝘂𝘀𝘁𝘄𝗼𝗿𝘁𝗵𝗶𝗻𝗲𝘀𝘀: 𝗕𝗲𝘆𝗼𝗻𝗱 𝗖𝗲𝗿𝘁𝗶𝗳𝗶𝗰𝗮𝘁𝗶𝗼𝗻

Flexible trustworthiness of data for more value

In his article, Francesco De Cassai points out that the "Data as a Product" concept has become a key focus in data management, defined by traits such as Discoverable, Addressable, Trustworthy, Secure, Interoperable, and Self-Describing (DATSIS). However, the term "Trustworthy" is often misconceived as requiring high certification standards, which can limit the potential of new data solutions.

𝗞𝗲𝘆 𝗧𝗮𝗸𝗲𝗮𝘄𝗮𝘆𝘀:

  • 𝗗𝗮𝘁𝗮 𝗣𝗿𝗼𝗱𝘂𝗰𝘁𝘀: A data product comprises data assets and metadata that describe its features and access methods.

  • 𝗣𝗿𝗼𝗱𝘂𝗰𝘁 𝗧𝘆𝗽𝗲𝘀: Data products can be general-purpose or specialized, similar to physical products that serve different needs.

  • 𝗧𝗿𝘂𝘀𝘁 𝗟𝗲𝘃𝗲𝗹𝘀: The concept of "Trustworthy" should not be restricted to highly certified data. Instead, a "good enough" approach should be adopted, akin to different categories of physical products.

  • 𝗢𝗽𝗲𝗿𝗮𝘁𝗶𝗼𝗻𝗮𝗹 𝗖𝗼𝘀𝘁𝘀: Limiting interoperability to only highly certified data increases operational costs and restricts access to broader data assets.

  • 𝗖𝘂𝗹𝘁𝘂𝗿𝗮𝗹 𝗦𝗵𝗶𝗳𝘁: Moving beyond the need for perfect data requires a cultural and procedural shift, encouraging the use of "silver" and "bronze" data alongside "gold" data.

  • 𝗗𝗮𝘁𝗮 𝗣𝗼𝗹𝗶𝗰𝗶𝗲𝘀: Effective data policy management and metadata availability at all levels are essential for managing different trustworthiness levels.

𝗖𝗼𝗻𝗰𝗹𝘂𝘀𝗶𝗼𝗻: A flexible approach to data trustworthiness can improve the usability and value of data products while reducing costs. This approach promotes a more comprehensive data ecosystem that supports innovation and efficiency.

Go to Article

Additional References

The 3 Data Product components and how a Data-First Stack enables each

Data Trustworthiness

What is data trust?

4 Ways to Achieve Trustworthy Data


𝗜𝗱𝗲𝗮 𝗚𝗲𝗻𝗲𝗿𝗮𝘁𝗶𝗼𝗻: 𝗧𝗵𝗲 𝗣𝗼𝘄𝗲𝗿 𝗼𝗳 𝗕𝗿𝗮𝗶𝗻𝘄𝗿𝗶𝘁𝗶𝗻𝗴

... 𝗮𝗻𝗱 𝗵𝗼𝘄 𝗶𝘁 𝗲𝗻𝗵𝗮𝗻𝗰𝗲𝘀 𝗰𝗿𝗲𝗮𝘁𝗶𝘃𝗶𝘁𝘆 𝗮𝗻𝗱 𝗮𝗰𝗰𝗲𝗹𝗲𝗿𝗮𝘁𝗲𝘀 𝗽𝗿𝗼𝗯𝗹𝗲𝗺-𝘀𝗼𝗹𝘃𝗶𝗻𝗴

Brainstorming, while a common method for generating ideas, often falls short due to inherent group dynamics and behaviors. Challenges such as dominant personalities, focus drift, and self-censorship can hinder innovation. Brainwriting offers a more effective alternative by allowing all participants to contribute ideas silently, thereby addressing these issues and fostering a more inclusive environment.

𝗖𝗵𝗮𝗹𝗹𝗲𝗻𝗴𝗲𝘀:

  • 𝗚𝗿𝗼𝘂𝗽 𝗗𝘆𝗻𝗮𝗺𝗶𝗰𝘀: Traditional brainstorming often suffers from dominance by vocal participants and difficulty maintaining focus.

  • 𝗦𝗲𝗹𝗳-𝗖𝗲𝗻𝘀𝗼𝗿𝘀𝗵𝗶𝗽: Participants may withhold ideas based on perceived negative feedback or fear of judgment.

  • 𝗜𝗻𝗲𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝗰𝘆: Large groups can struggle with time management and maintaining engagement.

𝗥𝗲𝗰𝗼𝗺𝗺𝗲𝗻𝗱𝗮𝘁𝗶𝗼𝗻𝘀:

  • 𝗗𝗲𝗳𝗶𝗻𝗲 𝘁𝗵𝗲 𝗣𝗿𝗼𝗯𝗹𝗲𝗺 𝗖𝗹𝗲𝗮𝗿𝗹𝘆: Ensure that the issue being addressed is specific and actionable.

  • 𝗘𝘀𝘁𝗮𝗯𝗹𝗶𝘀𝗵 𝗚𝗿𝗼𝘂𝗻𝗱 𝗥𝘂𝗹𝗲𝘀: Set clear guidelines for participation and idea generation to maintain structure.

  • 𝗥𝗼𝘁𝗮𝘁𝗲 𝗮𝗻𝗱 𝗕𝘂𝗶𝗹𝗱 𝗼𝗻 𝗜𝗱𝗲𝗮𝘀: Allow participants to enhance or modify ideas from others, promoting collaboration and refinement.

  • 𝗖𝗼𝗻𝘀𝗼𝗹𝗶𝗱𝗮𝘁𝗲 𝗮𝗻𝗱 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗲: Gather and assess ideas to identify the most impactful solutions.

Brainwriting effectively mitigates the common pitfalls of traditional brainstorming, offering a structured approach that promotes equal participation and reduces the influence of dominant personalities.

By incorporating brainwriting, teams can enhance creativity and accelerate problem-solving, paving the way for innovative breakthroughs and more effective solutions.

Go to Article

Additional References

Brainwriting 101: how to unlock new, innovative ideas

How to use brainwriting to generate ideas

Brainwriting: A 3-Step Approach to Generating Innovative Ideas


𝗗𝗮𝘁𝗮 𝗤𝘂𝗮𝗹𝗶𝘁𝘆: 𝗧𝗵𝗲 𝗖𝗼𝗿𝗻𝗲𝗿𝘀𝘁𝗼𝗻𝗲 𝗼𝗳 𝗘𝗳𝗳𝗲𝗰𝘁𝗶𝘃𝗲 𝗔𝗜 𝗠𝗼𝗱𝗲𝗹𝘀

Ensuring AI Effectiveness through High-Quality Data Management

In his blog post, Tejasvi Addagada points out that Large Language Models (LLMs) derive their capabilities from vast datasets collected from diverse sources. The quality of this data is crucial, as it enables LLMs to learn language patterns and generate accurate responses. However, poor data quality introduces noise, leading to incorrect embeddings and reduced model effectiveness. 𝗣𝗼𝗼𝗿 𝗱𝗮𝘁𝗮 𝗾𝘂𝗮𝗹𝗶𝘁𝘆 𝗮𝗳𝗳𝗲𝗰𝘁𝘀 𝗔𝗜 𝗼𝘂𝘁𝗰𝗼𝗺𝗲𝘀 𝗶𝗻 𝘀𝗲𝘃𝗲𝗿𝗮𝗹 𝘄𝗮𝘆𝘀:

𝗜𝗻𝗮𝗰𝗰𝘂𝗿𝗮𝘁𝗲 𝗣𝗿𝗲𝗱𝗶𝗰𝘁𝗶𝗼𝗻𝘀: Errors in training data cause unreliable or incorrect model predictions.

𝗕𝗶𝗮𝘀𝗲𝗱 𝗢𝘂𝘁𝗽𝘂𝘁𝘀: Biased training data perpetuates biases in AI-generated results.

𝗡𝗼𝗻-𝘂𝘀𝗮𝗯𝗹𝗲 𝗖𝗼𝗻𝘁𝗲𝗻𝘁: Incomplete or inconsistent data confuses models, resulting in nonsensical outputs.

𝗠𝗶𝘀𝗹𝗲𝗮𝗱𝗶𝗻𝗴 𝗜𝗻𝗳𝗼𝗿𝗺𝗮𝘁𝗶𝗼𝗻: Erroneous data can produce misleading information, detrimental to decision-making processes.

𝗜𝗻𝗮𝗰𝗰𝘂𝗿𝗮𝘁𝗲 𝗱𝗮𝘁𝗮 𝗰𝗮𝗻 𝗹𝗲𝗮𝗱 𝘁𝗼 𝗳𝗶𝗻𝗮𝗻𝗰𝗶𝗮𝗹 𝗹𝗼𝘀𝘀𝗲𝘀 𝗮𝗻𝗱 𝗱𝗲𝗰𝗿𝗲𝗮𝘀𝗲𝗱 𝗰𝗼𝗻𝗳𝗶𝗱𝗲𝗻𝗰𝗲 𝗶𝗻 𝗼𝗿𝗴𝗮𝗻𝗶𝘇𝗮𝘁𝗶𝗼𝗻𝗮𝗹 𝗱𝗲𝗰𝗶𝘀𝗶𝗼𝗻𝘀. For Generative AI, data quality is critical to producing accurate insights. Data scientists spend a significant amount of time preparing data, highlighting the challenge of maintaining high-quality data.

𝗣𝗼𝗼𝗿 𝗱𝗮𝘁𝗮 𝗾𝘂𝗮𝗹𝗶𝘁𝘆 𝗮𝗹𝘀𝗼 𝗶𝗺𝗽𝗮𝗰𝘁𝘀 𝗰𝘂𝘀𝘁𝗼𝗺𝗲𝗿 𝘀𝗮𝘁𝗶𝘀𝗳𝗮𝗰𝘁𝗶𝗼𝗻 𝗮𝗻𝗱 𝗹𝗼𝘆𝗮𝗹𝘁𝘆. Inaccurate predictions can lead to wrong decisions, decreasing customer trust and satisfaction. Implementing systematic quality control and verification can mitigate these issues, much like quality checks in a production line.

In conclusion, 𝗱𝗮𝘁𝗮 𝗾𝘂𝗮𝗹𝗶𝘁𝘆 𝗶𝘀 𝗮 𝗯𝘂𝘀𝗶𝗻𝗲𝘀𝘀 𝗶𝗺𝗽𝗲𝗿𝗮𝘁𝗶𝘃𝗲 that impacts financial outcomes, customer satisfaction, and AI effectiveness. High-quality data ensures that AI models are accurate and reliable, driving growth and value in data-driven projects.
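As a minimal sketch of what prioritizing accuracy, completeness, and validity can look like before a training run (the column names, label set, file path, and thresholds are assumptions, not taken from the post):

```python
import pandas as pd

# Hypothetical labeled training corpus with a text column and a label column.
df = pd.read_parquet("training_samples.parquet")            # assumed path

ALLOWED_LABELS = {"positive", "negative", "neutral"}        # assumed label set

report = {
    # Completeness: share of rows with neither text nor label missing.
    "completeness": float(df[["text", "label"]].notna().all(axis=1).mean()),
    # Validity: share of labels drawn from the allowed set.
    "validity": float(df["label"].isin(ALLOWED_LABELS).mean()),
    # Duplicates add noise and skew the learned representations.
    "duplicate_rate": float(df["text"].duplicated().mean()),
}

# Gate the training run on the report instead of discovering problems downstream.
assert report["completeness"] > 0.99, f"too many incomplete rows: {report}"
assert report["validity"] > 0.99, f"unexpected labels: {report}"
print(report)
```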

Go to Article

Additional References

The risks of poor data quality in AI systems

Data Quality in AI: Challenges, Importance & Best Practices in '24

Data Quality For Good AI Outcomes

Why a Data Governance Strategy Is Crucial to Harness the Capabilities of Artificial Intelligence

Why Data Quality Matters in the Age of Generative AI


𝗧𝗵𝗲 𝗘𝘀𝘀𝗲𝗻𝘁𝗶𝗮𝗹 𝗥𝗼𝗹𝗲 𝗼𝗳 𝗗𝗮𝘁𝗮 𝗢𝗯𝘀𝗲𝗿𝘃𝗮𝗯𝗶𝗹𝗶𝘁𝘆 𝗶𝗻 𝗛𝗶𝗴𝗵-𝗤𝘂𝗮𝗹𝗶𝘁𝘆 𝗔𝗜

𝐭𝐨 𝐬𝐭𝐫𝐞𝐚𝐦𝐥𝐢𝐧𝐞 𝐜𝐨𝐥𝐥𝐚𝐛𝐨𝐫𝐚𝐭𝐢𝐨𝐧 𝐚𝐧𝐝 𝐞𝐧𝐡𝐚𝐧𝐜𝐞 𝐦𝐨𝐝𝐞𝐥 𝐩𝐞𝐫𝐟𝐨𝐫𝐦𝐚𝐧𝐜𝐞

In this article, Jatin Solanki points out that 𝗗𝗮𝘁𝗮 𝗼𝗯𝘀𝗲𝗿𝘃𝗮𝗯𝗶𝗹𝗶𝘁𝘆 𝗶𝘀 𝗰𝗿𝘂𝗰𝗶𝗮𝗹 𝗳𝗼𝗿 𝗲𝗻𝘀𝘂𝗿𝗶𝗻𝗴 𝘁𝗵𝗮𝘁 𝗔𝗜 𝘀𝘆𝘀𝘁𝗲𝗺𝘀 𝗼𝗽𝗲𝗿𝗮𝘁𝗲 𝗲𝗳𝗳𝗲𝗰𝘁𝗶𝘃𝗲𝗹𝘆 𝗮𝗻𝗱 𝘁𝗿𝗮𝗻𝘀𝗽𝗮𝗿𝗲𝗻𝘁𝗹𝘆. It provides visibility into data pipelines, allowing organizations to monitor, validate, and govern data throughout its lifecycle. This proactive approach helps maintain data quality, which is essential for the reliability and performance of AI systems.

𝗞𝗲𝘆 𝗰𝗼𝗺𝗽𝗼𝗻𝗲𝗻𝘁𝘀 𝗼𝗳 𝗱𝗮𝘁𝗮 𝗼𝗯𝘀𝗲𝗿𝘃𝗮𝗯𝗶𝗹𝗶𝘁𝘆 include:

  • 𝗗𝗮𝘁𝗮 𝗤𝘂𝗮𝗹𝗶𝘁𝘆 𝗠𝗼𝗻𝗶𝘁𝗼𝗿𝗶𝗻𝗴: Continuously tracking data metrics to ensure accuracy and timeliness, preventing errors before they impact AI models.

  • 𝗗𝗮𝘁𝗮 𝗟𝗶𝗻𝗲𝗮𝗴𝗲 𝗧𝗿𝗮𝗰𝗸𝗶𝗻𝗴: Documenting the data's journey to trace errors and understand transformations, which aids in compliance and impact analysis.

  • 𝗗𝗮𝘁𝗮 𝗩𝗮𝗹𝗶𝗱𝗮𝘁𝗶𝗼𝗻: Applying rules to verify data integrity and consistency, reducing the risk of incorporating faulty data into AI models.

By integrating data observability into MLOps workflows, teams can streamline collaboration and enhance model performance. Real-world examples, such as Uber, PayPal, and Airbnb, demonstrate the benefits of this approach, including improved accuracy, transparency, and operational efficiency.
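Purely as an illustration of such workflow integration (the monitored table, baseline, and alerting hook are assumptions), a minimal observability check might verify freshness and volume before downstream models consume the data:

```python
from datetime import datetime, timedelta, timezone

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("observability-checks").getOrCreate()
TABLE = "analytics.daily_sessions"                          # hypothetical monitored table

stats = spark.table(TABLE).agg(
    F.count("*").alias("row_count"),
    F.max("day").alias("latest_day"),
).first()

checks = {
    # Freshness: the newest partition should lag no more than a day.
    "fresh": stats["latest_day"] >= (datetime.now(timezone.utc) - timedelta(days=1)).date(),
    # Volume: a sudden drop in row count usually signals an upstream failure.
    "volume_ok": stats["row_count"] > 10_000,               # assumed baseline
}

if not all(checks.values()):
    # In a real setup this would notify the owning team or open an incident.
    print(f"Data observability alert for {TABLE}: {checks}")
```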

𝗗𝗮𝘁𝗮 𝗼𝗯𝘀𝗲𝗿𝘃𝗮𝗯𝗶𝗹𝗶𝘁𝘆 𝗶𝘀 𝗶𝗻𝗱𝗶𝘀𝗽𝗲𝗻𝘀𝗮𝗯𝗹𝗲 𝗳𝗼𝗿 𝗯𝘂𝗶𝗹𝗱𝗶𝗻𝗴 𝘁𝗿𝘂𝘀𝘁𝘄𝗼𝗿𝘁𝗵𝘆 𝗮𝗻𝗱 𝗲𝗳𝗳𝗲𝗰𝘁𝗶𝘃𝗲 𝗔𝗜 𝘀𝘆𝘀𝘁𝗲𝗺𝘀. It not only enhances data quality and governance but also promotes greater transparency and compliance. As AI continues to evolve, prioritizing data observability will be essential for developing reliable and accountable AI solutions.

Go to Article

Additional References

What is Data Observability? 5 Key Pillars To Know

What is data observability?

What is Data Observability?

Data Observability for Data Engineers: What, Why & How?

How to Calculate the ROI of Data Observability


𝗘𝗻𝗵𝗮𝗻𝗰𝗶𝗻𝗴 𝗗𝗮𝘁𝗮 𝗚𝗼𝘃𝗲𝗿𝗻𝗮𝗻𝗰𝗲 𝗮𝗻𝗱 𝗦𝗲𝗰𝘂𝗿𝗶𝘁𝘆 𝘄𝗶𝘁𝗵 𝗗𝗮𝘁𝗮𝗯𝗿𝗶𝗰𝗸𝘀 𝗨𝗻𝗶𝘁𝘆 𝗖𝗮𝘁𝗮𝗹𝗼𝗴

... 𝐋𝐞𝐯𝐞𝐫𝐚𝐠𝐢𝐧𝐠 𝐓𝐚𝐠𝐠𝐢𝐧𝐠 𝐭𝐨 𝐄𝐧𝐡𝐚𝐧𝐜𝐞 𝐃𝐚𝐭𝐚 𝐌𝐚𝐧𝐚𝐠𝐞𝐦𝐞𝐧𝐭 𝐚𝐧𝐝 𝐂𝐨𝐦𝐩𝐥𝐢𝐚𝐧𝐜𝐞

Databricks Unity Catalog centralizes security and management for data and AI assets across the lakehouse. It provides fine-grained access control for databases, tables, files, and models, improving governance and reducing workload.

In this article, David Callaghan points out that tags, structured as key-value pairs, can be attached to any asset in the lakehouse. This strategy improves data classification, regulatory compliance, and data lifecycle management. Key steps include identifying a use case as a Proof of Value and securing stakeholder buy-in.

𝗖𝗼𝗺𝗺𝗼𝗻 𝘁𝗮𝗴𝗴𝗶𝗻𝗴 𝘂𝘀𝗲 𝗰𝗮𝘀𝗲𝘀:

  • 𝗗𝗮𝘁𝗮 𝗖𝗹𝗮𝘀𝘀𝗶𝗳𝗶𝗰𝗮𝘁𝗶𝗼𝗻 𝗮𝗻𝗱 𝗦𝗲𝗰𝘂𝗿𝗶𝘁𝘆: Tagging data as PII integrates with access controls for better security.

  • 𝗗𝗮𝘁𝗮 𝗟𝗶𝗳𝗲𝗰𝘆𝗰𝗹𝗲 𝗠𝗮𝗻𝗮𝗴𝗲𝗺𝗲𝗻𝘁: Tags identify data stages to enforce policies and manage transitions.

  • 𝗗𝗮𝘁𝗮 𝗖𝗮𝘁𝗮𝗹𝗼𝗴𝗶𝗻𝗴 𝗮𝗻𝗱 𝗗𝗶𝘀𝗰𝗼𝘃𝗲𝗿𝘆: Descriptive tags improve data searchability and usability for analysts.

  • 𝗖𝗼𝗺𝗽𝗹𝗶𝗮𝗻𝗰𝗲 𝗮𝗻𝗱 𝗥𝗲𝗴𝘂𝗹𝗮𝘁𝗶𝗼𝗻: Tags like 'GDPR' simplify audits and regulatory compliance efforts.

  • 𝗣𝗿𝗼𝗷𝗲𝗰𝘁 𝗠𝗮𝗻𝗮𝗴𝗲𝗺𝗲𝗻𝘁 𝗮𝗻𝗱 𝗖𝗼𝗹𝗹𝗮𝗯𝗼𝗿𝗮𝘁𝗶𝗼𝗻: Tags organize assets by project or department, aiding collaboration and tracking.

Databricks Runtime supports tag management through SQL commands, which are preferred for their ease of use. Tags can be applied manually or through automated PySpark scripts.
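As a rough sketch of the automated route (the catalog, schema, table, column, and tag names are placeholders, and the exact SET TAGS syntax should be verified against the Databricks runtime you use):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Columns flagged as PII, e.g. by an upstream classification scan (assumed input).
pii_columns = {"main.crm.customers": ["email", "phone_number"]}

for table, columns in pii_columns.items():
    # Table-level tags drive lifecycle and compliance policies for the whole asset.
    spark.sql(
        f"ALTER TABLE {table} SET TAGS ('classification' = 'pii', 'retention' = '7y')"
    )
    for column in columns:
        # Column-level tags can then back fine-grained masking or access rules.
        spark.sql(
            f"ALTER TABLE {table} ALTER COLUMN {column} SET TAGS ('pii_type' = 'contact')"
        )
```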

𝗖𝗼𝗻𝗰𝗹𝘂𝘀𝗶𝗼𝗻: Utilizing Unity Catalog and a robust tagging strategy enhances governance, security, and utility of the data lakehouse, facilitating broader enterprise adoption and compliance.

Go to Article

Additional References

Apply tags to Unity Catalog securable objects

Identifying and Tagging PII data with Unity Catalog

Data tagging for Databricks


Takeaways

Here are the key takeaways from this month's edition, providing you with essential strategies and insights to excel in data engineering:

Data Engineering, Redefined: As a data engineer, focus on data movement and management while leaving the implementation of business logic to application developers to enhance efficiency and accuracy in your data processes.

Beyond Traditional ETL: Prioritize EtLT architecture to leverage real-time data processing and effectively integrate multiple data sources. Stay ahead of the curve by preparing for future advances in automated governance and data virtualization to ensure your solutions remain cutting-edge and scalable.

EU AI Act as a Catalyst: Advanced platforms such as Databricks Lakehouse and Unity Catalog should be used to meet the stringent requirements of the EU AI Act and improve data quality and management. This approach ensures compliance and increases the reliability of high-risk AI systems.

Understanding Semantic Layers: To achieve a transformative impact, organizations should choose a semantic layer that goes beyond basic analytics and provides comprehensive insights and interoperability across all data and knowledge assets, rather than relying solely on metrics-driven solutions.

Data Trustworthiness: Embrace a flexible approach to data trustworthiness with varying certification levels. Implement robust data policy management and adjust governance roles to enhance data usability, reduce operational costs, and drive innovation across data-driven projects.

Brainwriting: To address complex data engineering challenges, hold brainwriting sessions that ensure equal participation, reduce groupthink, and encourage idea generation. This approach brings in diverse perspectives and streamlines the ideation process, leading to more innovative and effective solutions.

Data Quality for AI: Prioritize data quality by emphasizing accuracy, completeness, and validity. Reliable AI outcomes hinge on high-quality data, which prevents biases, errors, and misleading results. Implement robust data governance practices to ensure data integrity and enhance the effectiveness of AI models and predictions.

Data Observability: To build trustworthy and reliable AI systems, organizations should prioritize data observability. By continuously monitoring data quality, tracking data lineage, and validating data integrity, teams can enhance AI performance, ensure compliance, and foster greater transparency and trust in their AI solutions.

Data Tagging: Complement Databricks Unity Catalog with a comprehensive tagging strategy to enhance data governance, improve searchability, and streamline lifecycle management. Use automation for consistent tagging across assets, ensuring better regulatory compliance and efficient management of your data lakehouse.


Conclusion

To wrap up, this issue highlights key advancements in data engineering. Focusing data engineering on data movement and management while delegating business logic to application developers improves efficiency and accuracy. The EtLT architecture ensures effective real-time processing and integration. The use of platforms such as Databricks Lakehouse and Unity Catalog is in line with the EU AI Act and improves data quality and compliance. Choosing the right semantic layer provides deeper insights and better interoperability. Adopting flexible approaches to data trustworthiness and robust governance drives innovation and reduces costs. Brainwriting encourages diverse, innovative solutions. Prioritizing data quality and observability is essential for reliable AI outcomes, while structured data tagging improves governance and management. Stay up to date with these insights to drive effective data engineering practices.

Stay tuned for our next issue, where we’ll dive into the latest advancements and discoveries in data technology.

See you next month ...


#DataEngineering #DataManagement #DataGovernance #MachineLearning #BigData #DataQuality #DataIntegration #AI #CloudComputing #DataTransformation #DataOps #DataArchitecture #DataPipelines #DataSecurity #BusinessIntelligence #AICompliance #UnityCatalog #PrivacyCompliance #DataStrategy #DataLakes #DataScience #MLOps #RealTimeData #RegulatoryCompliance #Databricks #Analytics #DataMesh #DataLifecycle #DataInfrastructure #SelfServiceAnalytics #DataClassification #DataObservability #DataLineage #AIRegulation #DataProducts #DataWarehouse #AITransparency #Automation #InnovationInData #GenerativeAI #CustomerSatisfaction
