Downstream data teams feel the pain of upstream data quality issues immensely... but are scared of addressing them... "This is just how things are, we can't fix this." "The upstream engineering team doesn't care." "They would never let us add CI/CD tests." Yet something interesting happens once we get an upstream engineer in the room to talk about data contracts. "Wait... the data team isn't already doing this? We can put this into our existing CI/CD pipeline? Notifications happen directly in my GitHub pull request? We should have been doing this yesterday." Despite the challenges of data quality, upstream engineers and downstream data teams are far more aligned than most people think. It's the silos between transactional and analytical databases that make communicating that alignment so hard. #data #dataengineering ----- 📌 Want to learn more? Check out our article "OLTP Vs. OLAP: How Professional POVs Cause Data Problems" https://lnkd.in/g_8cHS7h
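To make the "put this into our existing CI/CD pipeline" idea concrete, here is a minimal sketch of what such a data-contract check might look like: a script the upstream team runs as a pipeline step, comparing a service's proposed schema against a contract file the data team agreed to. The file paths, JSON layout, and function names are illustrative assumptions, not a specific tool from the post.

```python
# Minimal sketch of a data-contract check run as a CI step.
# Contract/schema file names and their JSON layout are hypothetical.
import json
import sys


def load_fields(path: str) -> dict[str, str]:
    """Return {column_name: type} from a simple JSON schema file."""
    with open(path) as f:
        return {col["name"]: col["type"] for col in json.load(f)["columns"]}


def check_contract(contract_path: str, schema_path: str) -> list[str]:
    """Compare the proposed schema against the agreed contract.

    Flags columns the downstream team depends on that were removed or
    changed type. Purely additive changes (new columns) are allowed.
    """
    contract = load_fields(contract_path)
    schema = load_fields(schema_path)
    problems = []
    for name, expected_type in contract.items():
        if name not in schema:
            problems.append(f"breaking change: column '{name}' was removed")
        elif schema[name] != expected_type:
            problems.append(
                f"breaking change: column '{name}' changed type "
                f"{expected_type} -> {schema[name]}"
            )
    return problems


if __name__ == "__main__":
    issues = check_contract("contracts/orders.json", "schemas/orders.json")
    for issue in issues:
        print(issue)
    # A nonzero exit code fails the CI job, which shows up directly in the
    # pull request as a failed check -- the "notification" the engineer sees.
    sys.exit(1 if issues else 0)
```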
Another option that gets you thrown out the window is data modeling 😭
IME the "upstream engineer" usually works at another company...
LOL who would have thought!
This is one of the great challenges of machine learning and AI. The data that is available or provided is frequently not the _data_ that is actually needed. I encountered this while working on an AI tool for a predictive maintenance solution. Our data engineers frequently pushed for data that was unavailable, probably inaccurate, or someone else's IP. If we had had the staff, budget, and executive support to investigate and solve these problems, we could have driven to a solution, but no one wanted to hear about such real-world problems.
Imagine a world where the authoring systems had data validation checks on input
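As a small illustration of that idea, the sketch below validates records at the point of entry in the authoring system, before they ever reach downstream teams. The Order type and its rules are made-up assumptions for the example.

```python
# Hypothetical input validation at write time in the authoring system.
from dataclasses import dataclass


@dataclass
class Order:
    order_id: str
    quantity: int
    currency: str

    def __post_init__(self):
        # Reject bad records on input instead of letting analysts find them later.
        if not self.order_id:
            raise ValueError("order_id must not be empty")
        if self.quantity <= 0:
            raise ValueError("quantity must be positive")
        if self.currency not in {"USD", "EUR", "GBP"}:
            raise ValueError(f"unsupported currency: {self.currency}")


# Order(order_id="A-100", quantity=0, currency="USD")  # raises ValueError at input
```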
Geoffrey Johnson this is the part of the Venn diagram where both of our jobs meet 😂
So relevant and burning
If I had a penny for every time I have suggested this