Dipankar Mazumdar, M.Sc 🥑’s Post

Staff Data Engineer Advocate @Onehouse.ai | Apache Hudi, Iceberg Contributor | Author of "Engineering Lakehouses"

Metadata Indexing in Apache Hudi for faster query performance.

Data skipping is one of the common techniques used with large volumes of data to achieve better query performance. The idea is simple - read as little data as possible! In practice, this means a compute engine paired with a lakehouse platform like Apache Hudi should read only the data files from storage that are needed to satisfy the query. Data skipping not only reduces the volume of data that needs to be scanned & processed, it can also lead to substantial improvements in execution time.

This is made possible by the metadata provided by file formats like #Parquet. Each Parquet file's footer contains the min/max values of each column, along with other useful info such as the number of NULL values. These min/max values are the 'column statistics'.

Now, although we could leverage these stats directly for data skipping, that can still hurt query performance because the engine has to open each file to read its footer. ❌ This process can be very time-consuming, especially with large data volumes.

What's Hudi's approach?

✅ Hudi adds a next level of pruning here: it takes all of these column statistics & collates them into an INDEX.
✅ Indexes such as this (& more) are incorporated into Hudi's internal metadata table, so engines can look up exactly which files hold the data.

Therefore, instead of reading individual Parquet footers, compute engines can go directly to the metadata table's index & fetch only the required files. Way faster.

For example, in this image there are 2 Parquet files, each with well-defined value ranges for its columns:
- File1.parquet contains 'Salary' values in the range $10,000-40,000
- File2.parquet contains 'Salary' values in the range $45,000-90,000

Now, if we run a query to fetch all records where Salary = 30000, the engine (Spark) can fetch records from File1 only, since the value falls within its range, & skip the other file. Imagine doing this at scale (with a large number of files)!

#dataengineering #softwareengineering

[Image: two Parquet files with per-column min/max ('Salary') value ranges, illustrating data skipping via column statistics]
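To make the idea concrete, here is a minimal PySpark sketch of writing and reading a Hudi table with the metadata table's column stats index and data skipping enabled. The table name, column names, record key, and path are illustrative (not from the post), and the config keys follow the Hudi 0.12.x documentation - verify them against the Hudi and Spark versions you use.

```python
# Minimal sketch: Hudi column stats index + data skipping with PySpark.
# Assumes the hudi-spark bundle is on the classpath; names/paths are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-column-stats-demo")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.extensions",
            "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .getOrCreate()
)

base_path = "/tmp/hudi/employees"  # illustrative path

# Write options: enable the metadata table and its column-stats index so the
# per-file min/max values are collated into Hudi's internal metadata table.
hudi_write_opts = {
    "hoodie.table.name": "employees",
    "hoodie.datasource.write.recordkey.field": "emp_id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.index.column.stats.enable": "true",
    "hoodie.metadata.index.column.stats.column.list": "salary",
}

df = spark.createDataFrame(
    [(1, 30000, 1), (2, 60000, 1)],
    ["emp_id", "salary", "ts"],
)
df.write.format("hudi").options(**hudi_write_opts).mode("overwrite").save(base_path)

# Read options: with data skipping on, the engine consults the column-stats
# index in the metadata table instead of opening every Parquet footer, and
# prunes files whose [min, max] salary range cannot contain 30000.
read_df = (
    spark.read.format("hudi")
    .option("hoodie.metadata.enable", "true")
    .option("hoodie.enable.data.skipping", "true")
    .load(base_path)
    .where("salary = 30000")
)
read_df.show()
```

With data skipping enabled, Spark prunes file groups during file listing using the index, so only the files whose salary range can satisfy the predicate are scanned.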
Lakshmi Shiva Ganesh Sontenam

Data Engineering - Retail Anly. | Visual Illustrator | Medium✍️


Dipankar Mazumdar, M.Sc 🥑 - are there any shortcomings of Hudi compared to Iceberg? I am planning a PoC where I want to ingest daily updates and need a robust catalog, metadata management, ACID compliance, ease of incremental updates with schema evolution, SCD-type changes as needed, and backfills. Any suggestions on how to get started?
