So you're thinking of getting into data diffing? You might want to read this first. In our newest blog post, Insung Ko and Elliot G. break down data diffing best practices in Datafold. Learn how data engineers are using Datafold's:
- Sampling
- Filtering
- Monitors as code
- Efficient hashing algorithm
to manage data quality at scale. https://lnkd.in/gskhZcmQ
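The post doesn't spell out Datafold's internals, so here is only a rough sketch of the general hashing idea behind efficient diffing: hash each row, fold the row hashes into per-bucket digests, and only drill into buckets whose digests disagree. All table names and data below are made up.

```python
import hashlib
from collections import defaultdict

def bucket_digests(rows, key_index=0, buckets=16):
    """Hash every row, then fold row hashes into per-bucket digests.

    Two tables agree on a bucket iff the digests match, so only
    mismatched buckets need a row-level comparison.
    """
    digests = defaultdict(int)
    for row in rows:
        row_hash = hashlib.sha256("|".join(map(str, row)).encode()).digest()
        key_hash = hashlib.sha256(str(row[key_index]).encode()).digest()
        bucket = int.from_bytes(key_hash[:4], "big") % buckets
        # XOR is order-independent, so row order doesn't matter.
        digests[bucket] ^= int.from_bytes(row_hash[:8], "big")
    return digests

def diff_buckets(rows_a, rows_b, buckets=16):
    da = bucket_digests(rows_a, buckets=buckets)
    db = bucket_digests(rows_b, buckets=buckets)
    return [b for b in range(buckets) if da[b] != db[b]]

prod = [(1, "alice", 10.0), (2, "bob", 20.0), (3, "carol", 30.0)]
dev = [(1, "alice", 10.0), (2, "bob", 21.5), (3, "carol", 30.0)]
print(diff_buckets(prod, dev))  # only the bucket holding key 2 differs
```

Sampling and filtering then reduce how many rows ever get hashed, which is what keeps the comparison cheap at scale.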
More Relevant Posts
-
Ever felt lost trying to track changes in large datasets? The intricacies of data management can be a challenge, but understanding the Delta Lake transaction log can significantly simplify your experience. Delta Lake’s transaction log acts like a centralized diary for every change made to your table, ensuring you always have a consistent view of data, regardless of concurrent modifications. This empowers data engineers and analysts alike to focus on deriving insights while maintaining a solid foundation for their data operations. With features like time travel and ACID transactions, Delta Lake revolutionizes how we interact with big data at scale. What's your experience with data versioning? Share your thoughts below! 👇 #DeltaLake #DataManagement #BigData #Analytics #DataInsights https://lnkd.in/g3CfTsTJ
Diving Into Delta Lake: Unpacking The Transaction Log
databricks.com
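To make the log's guarantees concrete, here is a minimal PySpark sketch of Delta time travel; the table path is hypothetical, and it assumes the delta-spark package is on the classpath.

```python
from pyspark.sql import SparkSession

# Standard Delta Lake session configuration.
spark = (SparkSession.builder
         .appName("delta-time-travel")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

path = "/tmp/events"  # hypothetical Delta table

# Latest committed snapshot, as reconstructed from the transaction log.
current = spark.read.format("delta").load(path)

# The same table as of an earlier log version: time travel.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

# The commit history recorded in the _delta_log directory.
spark.sql(f"DESCRIBE HISTORY delta.`{path}`").show(truncate=False)
```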
-
Have you ever wondered about the differences between Datafold's Data Diff and #dbt tests? Do data teams really need both? In this post, Elliot G. and Leo Folsom explore:
- Why dbt tests prevent some data quality issues, but not all
- How the two tests answer fundamentally different data quality questions
- And why having both in your CI pipeline is essential for complete data quality coverage
https://lnkd.in/gsdjcmF8
Three key differences between Datafold tests and dbt tests | Datafold
datafold.com
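As a toy illustration of that distinction (not dbt's or Datafold's actual code): an assertion-style test validates a declared property of one table in isolation, while a diff compares values across two versions of the table, so one can pass while the other flags a change. All data below is made up.

```python
prod = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 20.0}]
dev = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 25.0}]

def assert_not_null(rows, column):
    """dbt-style schema test: passes or fails on one table in isolation."""
    return all(row[column] is not None for row in rows)

def value_diff(old, new, key="id"):
    """Diff-style test: compares two versions of the table row by row."""
    old_by_key = {row[key]: row for row in old}
    return [(row[key], old_by_key.get(row[key]), row)
            for row in new if old_by_key.get(row[key]) != row]

print(assert_not_null(dev, "amount"))  # True: dev passes the dbt-style test...
print(value_diff(prod, dev))           # ...yet the diff still flags id=2.
```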
-
Discover key steps for transforming unstructured data into LLM-ready formats for use in RAG systems – understand what matters for LLM ingestion and preprocessing. Read more about how Unstructured is leading the way in data preprocessing below! ⤵ https://lnkd.in/eiEnENY8
Understanding What Matters for LLM Ingestion and Preprocessing – Unstructured
unstructured.io
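As a minimal sketch of that kind of preprocessing using the open-source unstructured library (the input file is hypothetical): partition a document into typed elements, then chunk them for embedding.

```python
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

# partition() infers the file type and returns typed elements
# (Title, NarrativeText, Table, ...). The file name is made up.
elements = partition(filename="quarterly_report.pdf")

# Group elements into section-aware chunks sized for an embedding model.
chunks = chunk_by_title(elements, max_characters=1000)

for chunk in chunks[:3]:
    print(type(chunk).__name__, chunk.text[:80])
```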
-
Great article about some of the challenges behind the Iceberg hype and how to take advantage of it. Needless to say, not everything is as rosy as many people make it out to be: architects and data engineers need to be deliberate about how they use Iceberg without exposing their teams to more complexity and inconsistency in managing data.
Iceberg Is An Implementation Detail | dbt Developer Blog
docs.getdbt.com
-
To explore the Parquet file format, the following blog from Chandan Nandy, along with Vu Trinh's excellent write-up, is a good starting point for data engineers. Vu's write-up: https://lnkd.in/dK5MZ2_i Chandan's write-up:
Deep Dive into Parquet
medium.com
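To get hands-on alongside the write-ups, here is a small pyarrow sketch (file name made up) inspecting the footer metadata those deep dives discuss: row groups and per-column min/max statistics that let readers prune data.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a tiny table split into two row groups.
table = pa.table({"user_id": [1, 2, 3, 4], "amount": [9.5, 12.0, 3.25, 40.0]})
pq.write_table(table, "sample.parquet", row_group_size=2)

meta = pq.ParquetFile("sample.parquet").metadata
print(meta.num_row_groups, "row groups,", meta.num_rows, "rows")

# Min/max statistics per row group let readers skip data entirely.
for rg in range(meta.num_row_groups):
    col = meta.row_group(rg).column(1)  # the "amount" column chunk
    print(rg, col.statistics.min, col.statistics.max)
```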
-
There is a Part II now, focusing on Bloom filters and more. Here it goes. #Spark #parquet #Databricks #OLAP #BloomFilter https://lnkd.in/gnxvv47J
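Not Parquet's implementation; a toy Python sketch of the Bloom filter idea Part II covers: a probabilistic set with no false negatives, which lets a reader skip row groups that definitely don't contain a value.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array packed into one integer

    def _positions(self, item):
        # Derive several hash positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all(self.bits & (1 << pos) for pos in self._positions(item))

bf = BloomFilter()
bf.add("user_123")
print(bf.might_contain("user_123"))  # True, guaranteed
print(bf.might_contain("user_999"))  # False (or a rare false positive)
```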
-
Check out my new article on how data engineers can run initial data quality tests on massive datasets using the Central Limit Theorem, saving both cost and time.
Leveraging the Central Limit Theorem for Cost-Effective Big Data Quality Assurance
medium.datadriveninvestor.com
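The article's exact method isn't reproduced here; as a rough sketch of the general idea, estimate a quality metric such as a column's null rate from a random sample and attach a CLT-based confidence interval, rather than scanning every row.

```python
import math
import random

def sample_null_rate(population, n=1_000, z=1.96):
    """Estimate the share of null values with a 95% confidence interval."""
    sample = random.sample(population, n)
    p = sum(v is None for v in sample) / n
    # CLT: the sample proportion is ~normal with stderr sqrt(p(1-p)/n).
    stderr = math.sqrt(p * (1 - p) / n)
    return p, (p - z * stderr, p + z * stderr)

# Hypothetical column with a true 2% null rate across 1M rows.
column = [None if random.random() < 0.02 else 1 for _ in range(1_000_000)]
p, (lo, hi) = sample_null_rate(column)
print(f"estimated null rate {p:.3%}, 95% CI [{lo:.3%}, {hi:.3%}]")
```

The sample size, not the table size, drives the interval width, which is where the cost savings come from.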
-
Struggling with backfilling data in Apache Pinot? Dive into my journey of tackling this challenge head-on in my latest Medium blog post! https://lnkd.in/gVpUu-bZ The blog explains in detail how I backfilled a derived column by extracting a deeply nested value from a JSON-based column in the realtime table. Hoping you find it both helpful and a time-saver, sparing you the need to reinvent the wheel! #apachepinot
Backfilling Derived Column in Pinot
medium.com
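The post's exact backfill steps aren't reproduced here; as an illustrative sketch only, this is the shape of the Pinot ingestion-transform config such a derived column typically relies on, expressed as a Python dict. All column and field names are hypothetical.

```python
import json

# Illustrative fragment of a Pinot table config: derive a column from a
# nested JSON field with jsonExtractScalar at ingestion time.
derived_column_transform = {
    "ingestionConfig": {
        "transformConfigs": [
            {
                "columnName": "device_type",
                # Pull a deeply nested value out of the raw JSON column.
                "transformFunction": "jsonExtractScalar(payload_json, '$.device.info.type', 'STRING')",
            }
        ]
    }
}

print(json.dumps(derived_column_transform, indent=2))
```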
-
🚀 Unlocking Metadata Discovery with AI-Assisted Data Catalogs DataHub's latest innovation, as detailed by Saeed Rahman, leverages Large Language Models (LLMs) and Neo4j knowledge graphs to transform metadata discovery in investment firms. By representing metadata as graphs, this approach enables efficient and accurate answers to complex data queries, linking alternative datasets through advanced entity resolution. This AI-driven solution enhances data discoverability, speeds up insights, and unlocks new opportunities in data management. A game-changer for data professionals. #AI #DataManagement #KnowledgeGraphs #Metadata #DataDiscovery #LLM #DataHub
✍ NEW Blog: AI-Assisted Data Catalogs: An LLM Powered by Knowledge Graphs for Metadata Discovery by Saeed Rahman! 🚀
AI-Assisted Data Catalogs: An LLM Powered by Knowledge Graphs for Metadata Discovery
blog.datahubproject.io
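Not DataHub's implementation; a minimal sketch of the pattern described, with metadata modeled as a graph and queried via Cypher through the official neo4j Python driver. An LLM would generate the Cypher from the user's question; the connection details and graph schema here are made up.

```python
from neo4j import GraphDatabase

# Hypothetical local graph holding dataset metadata and lineage edges.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Cypher answering "what depends on raw.trades?" over the metadata graph.
question_as_cypher = """
MATCH (d:Dataset)-[:DOWNSTREAM_OF]->(src:Dataset {name: $source})
RETURN d.name AS dependent
"""

with driver.session() as session:
    for record in session.run(question_as_cypher, source="raw.trades"):
        print(record["dependent"])

driver.close()
```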
-
The modern data stack problem 1/3: Skill

dbt isn't just simple SQL. To truly participate in the data model evolution with dbt, you must adopt an engineering mindset, a mindset that business analysts don't necessarily have. And for good reason! Their mission is to serve the business.

Equipping analysts with the power of dbt often starts with excitement. Data teams up-skill analysts to become engineers, integrating dbt into their workflows. But what happens to the analysts' original expertise? The expertise that drives the most value to the business, leveraging data to deliver business insights quickly, gets sidelined.

Requests from business stakeholders start piling up, but analysts are caught up in engineering work, shifting their focus away from rapid business insight generation. This leads to a speed problem, which I'll dive into in tomorrow's post. Be sure to follow!