So you're thinking of getting into data diffing? You might want to read this first. In our newest blog post, Insung Ko and Elliot G. break down data diffing best practices in Datafold. Learn how data engineers are using Datafold's:
- Sampling
- Filtering
- Monitors as code
- Efficient hashing algorithm
to manage data quality at scale. https://lnkd.in/gskhZcmQ
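The post doesn't spell out Datafold's internals, so here is only a rough sketch of the general hashing idea behind efficient diffing: hash each row, fold the row hashes into per-bucket digests, and only drill into buckets whose digests disagree. All table names and data below are made up.

```python
import hashlib
from collections import defaultdict

def bucket_digests(rows, key_index=0, buckets=16):
    """Hash every row, then fold row hashes into per-bucket digests.

    Two tables agree on a bucket iff the digests match, so only
    mismatched buckets need a row-level comparison.
    """
    digests = defaultdict(int)
    for row in rows:
        row_hash = hashlib.sha256("|".join(map(str, row)).encode()).digest()
        key_hash = hashlib.sha256(str(row[key_index]).encode()).digest()
        bucket = int.from_bytes(key_hash[:4], "big") % buckets
        # XOR is order-independent, so row order doesn't matter.
        digests[bucket] ^= int.from_bytes(row_hash[:8], "big")
    return digests

def diff_buckets(rows_a, rows_b, buckets=16):
    da = bucket_digests(rows_a, buckets=buckets)
    db = bucket_digests(rows_b, buckets=buckets)
    return [b for b in range(buckets) if da[b] != db[b]]

prod = [(1, "alice", 10.0), (2, "bob", 20.0), (3, "carol", 30.0)]
dev = [(1, "alice", 10.0), (2, "bob", 21.5), (3, "carol", 30.0)]
print(diff_buckets(prod, dev))  # only the bucket holding key 2 differs
```

Sampling and filtering then reduce how many rows ever get hashed, which is what keeps the comparison cheap at scale.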
More Relevant Posts
-
Ever felt lost trying to track changes in large datasets? The intricacies of data management can be a challenge, but understanding the Delta Lake transaction log can significantly simplify your experience. Delta Lake’s transaction log acts like a centralized diary for every change made to your table, ensuring you always have a consistent view of data, regardless of concurrent modifications. This empowers data engineers and analysts alike to focus on deriving insights while maintaining a solid foundation for their data operations. With features like time travel and ACID transactions, Delta Lake revolutionizes how we interact with big data at scale. What's your experience with data versioning? Share your thoughts below! 👇 #DeltaLake #DataManagement #BigData #Analytics #DataInsights https://lnkd.in/g3CfTsTJ
Diving Into Delta Lake: Unpacking The Transaction Log
databricks.com
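To make the log's guarantees concrete, here is a minimal PySpark sketch of Delta time travel; the table path is hypothetical, and it assumes the delta-spark package is on the classpath.

```python
from pyspark.sql import SparkSession

# Standard Delta Lake session configuration.
spark = (SparkSession.builder
         .appName("delta-time-travel")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

path = "/tmp/events"  # hypothetical Delta table

# Latest committed snapshot, as reconstructed from the transaction log.
current = spark.read.format("delta").load(path)

# The same table as of an earlier log version: time travel.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

# The commit history recorded in the _delta_log directory.
spark.sql(f"DESCRIBE HISTORY delta.`{path}`").show(truncate=False)
```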
-
Have you ever wondered about the differences between Datafold's Data Diff and #dbt tests? Do data teams really need both? In this post, Elliot G. and Leo Folsom explore:
- Why dbt tests prevent some data quality issues, but not all
- How the two tests answer fundamentally different data quality questions
- And why having both in your CI pipeline is essential for complete data quality coverage
https://lnkd.in/gsdjcmF8
Three key differences between Datafold tests and dbt tests | Datafold
datafold.com
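As a toy illustration of that distinction (not dbt's or Datafold's actual code): an assertion-style test validates a declared property of one table in isolation, while a diff compares values across two versions of the table, so one can pass while the other flags a change. All data below is made up.

```python
prod = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 20.0}]
dev = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 25.0}]

def assert_not_null(rows, column):
    """dbt-style schema test: passes or fails on one table in isolation."""
    return all(row[column] is not None for row in rows)

def value_diff(old, new, key="id"):
    """Diff-style test: compares two versions of the table row by row."""
    old_by_key = {row[key]: row for row in old}
    return [(row[key], old_by_key.get(row[key]), row)
            for row in new if old_by_key.get(row[key]) != row]

print(assert_not_null(dev, "amount"))  # True: dev passes the dbt-style test...
print(value_diff(prod, dev))           # ...yet the diff still flags id=2.
```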
-
Discover key steps for transforming unstructured data into LLM-ready formats for use in RAG systems – understand what matters for LLM ingestion and preprocessing. Read more about how Unstructured is leading the way in data preprocessing below! ⤵ https://lnkd.in/eiEnENY8
Understanding What Matters for LLM Ingestion and Preprocessing – Unstructured
unstructured.io
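As a minimal sketch of that kind of preprocessing using the open-source unstructured library (the input file is hypothetical): partition a document into typed elements, then chunk them for embedding.

```python
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

# partition() infers the file type and returns typed elements
# (Title, NarrativeText, Table, ...). The file name is made up.
elements = partition(filename="quarterly_report.pdf")

# Group elements into section-aware chunks sized for an embedding model.
chunks = chunk_by_title(elements, max_characters=1000)

for chunk in chunks[:3]:
    print(type(chunk).__name__, chunk.text[:80])
```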
-
Great article about some of the challenges behind the Iceberg hype and how to take advantage of it. Needless to say, not everything is as rosy as many people make it out to be: architects and data engineers need to be deliberate about how they use Iceberg without exposing their teams to more complexity and inconsistency in managing data.
Iceberg Is An Implementation Detail | dbt Developer Blog
docs.getdbt.com
-
To explore the Parquet file format, the following blog from Chandan Nandy, along with Vu Trinh's excellent write-up, is a good starting point for data engineers. Vu's write-up: https://lnkd.in/dK5MZ2_i Chandan's write-up:
Deep Dive into Parquet
medium.com
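To get hands-on alongside the write-ups, here is a small pyarrow sketch (file name made up) inspecting the footer metadata those deep dives discuss: row groups and per-column min/max statistics that let readers prune data.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a tiny table split into two row groups.
table = pa.table({"user_id": [1, 2, 3, 4], "amount": [9.5, 12.0, 3.25, 40.0]})
pq.write_table(table, "sample.parquet", row_group_size=2)

meta = pq.ParquetFile("sample.parquet").metadata
print(meta.num_row_groups, "row groups,", meta.num_rows, "rows")

# Min/max statistics per row group let readers skip data entirely.
for rg in range(meta.num_row_groups):
    col = meta.row_group(rg).column(1)  # the "amount" column chunk
    print(rg, col.statistics.min, col.statistics.max)
```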
-
There is a Part II now, focusing on Bloom filters and more. Here it goes. #Spark #parquet #Databricks #OLAP #BloomFilter https://lnkd.in/gnxvv47J
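Not Parquet's implementation; a toy Python sketch of the Bloom filter idea Part II covers: a probabilistic set with no false negatives, which lets a reader skip row groups that definitely don't contain a value.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array packed into one integer

    def _positions(self, item):
        # Derive several hash positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all(self.bits & (1 << pos) for pos in self._positions(item))

bf = BloomFilter()
bf.add("user_123")
print(bf.might_contain("user_123"))  # True, guaranteed
print(bf.might_contain("user_999"))  # False (or a rare false positive)
```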
-
Check out my new article on how data engineers can run initial data quality tests on massive datasets using the Central Limit Theorem, saving both cost and time.
Leveraging the Central Limit Theorem for Cost-Effective Big Data Quality Assurance
medium.datadriveninvestor.com
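The article's exact method isn't reproduced here; as a rough sketch of the general idea, estimate a quality metric such as a column's null rate from a random sample and attach a CLT-based confidence interval, rather than scanning every row.

```python
import math
import random

def sample_null_rate(population, n=1_000, z=1.96):
    """Estimate the share of null values with a 95% confidence interval."""
    sample = random.sample(population, n)
    p = sum(v is None for v in sample) / n
    # CLT: the sample proportion is ~normal with stderr sqrt(p(1-p)/n).
    stderr = math.sqrt(p * (1 - p) / n)
    return p, (p - z * stderr, p + z * stderr)

# Hypothetical column with a true 2% null rate across 1M rows.
column = [None if random.random() < 0.02 else 1 for _ in range(1_000_000)]
p, (lo, hi) = sample_null_rate(column)
print(f"estimated null rate {p:.3%}, 95% CI [{lo:.3%}, {hi:.3%}]")
```

The sample size, not the table size, drives the interval width, which is where the cost savings come from.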
-
Struggling with backfilling data in Apache Pinot? Dive into my journey of tackling this challenge head-on in my latest Medium blog post! https://lnkd.in/gVpUu-bZ The blog explains in detail how I backfilled a derived column by extracting a deeply nested value from a JSON-based column in the realtime table. Hoping you find it both helpful and a time-saver, sparing you the need to reinvent the wheel! #apachepinot
Backfilling Derived Column in Pinot
medium.com
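The post's exact backfill steps aren't reproduced here; as an illustrative sketch only, this is the shape of the Pinot ingestion-transform config such a derived column typically relies on, expressed as a Python dict. All column and field names are hypothetical.

```python
import json

# Illustrative fragment of a Pinot table config: derive a column from a
# nested JSON field with jsonExtractScalar at ingestion time.
derived_column_transform = {
    "ingestionConfig": {
        "transformConfigs": [
            {
                "columnName": "device_type",
                # Pull a deeply nested value out of the raw JSON column.
                "transformFunction": "jsonExtractScalar(payload_json, '$.device.info.type', 'STRING')",
            }
        ]
    }
}

print(json.dumps(derived_column_transform, indent=2))
```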
-
🚀 Unlocking Metadata Discovery with AI-Assisted Data Catalogs DataHub's latest innovation, as detailed by Saeed Rahman, leverages Large Language Models (LLMs) and Neo4j knowledge graphs to transform metadata discovery in investment firms. By representing metadata as graphs, this approach enables efficient and accurate answers to complex data queries, linking alternative datasets through advanced entity resolution. This AI-driven solution enhances data discoverability, speeds up insights, and unlocks new opportunities in data management. A game-changer for data professionals. #AI #DataManagement #KnowledgeGraphs #Metadata #DataDiscovery #LLM #DataHub
✍ NEW Blog: AI-Assisted Data Catalogs: An LLM Powered by Knowledge Graphs for Metadata Discovery by Saeed Rahman! 🚀
AI-Assisted Data Catalogs: An LLM Powered by Knowledge Graphs for Metadata Discovery
blog.datahubproject.io
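Not DataHub's implementation; a minimal sketch of the pattern described, with metadata modeled as a graph and queried via Cypher through the official neo4j Python driver. An LLM would generate the Cypher from the user's question; the connection details and graph schema here are made up.

```python
from neo4j import GraphDatabase

# Hypothetical local graph holding dataset metadata and lineage edges.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Cypher answering "what depends on raw.trades?" over the metadata graph.
question_as_cypher = """
MATCH (d:Dataset)-[:DOWNSTREAM_OF]->(src:Dataset {name: $source})
RETURN d.name AS dependent
"""

with driver.session() as session:
    for record in session.run(question_as_cypher, source="raw.trades"):
        print(record["dependent"])

driver.close()
```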
-
The modern data stack problem 1/3: Skill

dbt isn't just simple SQL. To truly participate in the data model evolution with dbt, you must adopt an engineering mindset, a mindset that business analysts don't necessarily have. And for good reason! Their mission is to serve the business.

Equipping analysts with the power of dbt often starts with excitement. Data teams up-skill analysts to become engineers, integrating dbt into their workflows. But what happens to the analysts' original expertise? The expertise that drives the most value to the business, leveraging data to deliver business insights quickly, gets sidelined.

Requests from business stakeholders start piling up, but analysts are caught up in engineering work, shifting their focus away from rapid business insight generation. This leads to a speed problem, which I'll dive into in tomorrow's post. Be sure to follow!