Metadata Indexing in Apache Hudi for faster query performance.

Data skipping is one of the common techniques used with large volumes of data to achieve better query performance. The idea is simple - read as little data as possible! In practice, a compute engine paired with a lakehouse platform like Apache Hudi should read from storage only the data files needed to satisfy the query.

Data skipping not only reduces the volume of data that has to be scanned & processed, it can also lead to substantial improvements in execution time. Of course, this is made possible by the metadata provided by file formats like #Parquet. Each Parquet file stores the min/max values of each column, along with other useful info such as the number of NULL values. These min/max values are the 'column statistics'.

Now, although we could leverage these stats directly for data skipping, doing so can hurt query performance because the engine still has to open every file and read its footer. ❌ This process can be very time-consuming, especially with large data volumes.

What's Hudi's approach?

✅ Hudi adds a next level of pruning here. It takes all these column statistics & collates them into an INDEX.
✅ Indexes like this (& more) are incorporated into Hudi's internal metadata table, so engines can directly look up the files where the relevant data is stored.

Therefore, instead of reading individual Parquet footers, compute engines can go straight to the metadata table's index & fetch only the required files. Way faster.

For example, in this image there are 2 Parquet files, each with well-defined value ranges for its columns:
- File1.parquet contains 'Salary' values in the range $10,000-40,000
- File2.parquet contains 'Salary' values in the range $45,000-90,000

Now, if we run a query to fetch all records where Salary = 30000, the engine (Spark) can fetch records from File1 alone, since only its range covers that value, & skip the other file. Imagine doing this at scale, with a large number of files! (See the sketches below.)

#dataengineering #softwareengineering
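To make the footer-reading step concrete, here is a minimal Python sketch (pyarrow is my assumption, and "File1.parquet" is just the hypothetical file from the example above) that dumps the min/max & null-count statistics stored in a Parquet footer - the stats an engine would otherwise have to read file by file:

```python
import pyarrow.parquet as pq

# Inspect the footer of one Parquet file: each column chunk carries
# min/max and null-count statistics per row group.
meta = pq.ParquetFile("File1.parquet").metadata  # hypothetical file name
for rg in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        chunk = meta.row_group(rg).column(col)
        stats = chunk.statistics
        if stats is not None:
            # e.g. column 'Salary' -> min=10000, max=40000, nulls=0
            print(chunk.path_in_schema, stats.min, stats.max, stats.null_count)
```

And here is a hedged PySpark sketch of the Hudi side: it enables the metadata table plus its column-stats index on write, then turns on data skipping on read so the filter on salary is pruned via the index instead of per-file footers. The bundle version, table path, and column names are illustrative assumptions, not something from the original post.

```python
from pyspark.sql import SparkSession

# Spark session with a Hudi bundle on the classpath (bundle coordinates/version are assumptions).
spark = (
    SparkSession.builder
    .appName("hudi-column-stats-demo")
    .config("spark.jars.packages", "org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .getOrCreate()
)

df = spark.createDataFrame(
    [(1, "emp1", "eng", 15000), (2, "emp2", "eng", 38000),
     (3, "emp3", "hr", 52000), (4, "emp4", "hr", 88000)],
    ["id", "name", "dept", "salary"],
)

base_path = "/tmp/hudi/employees"  # hypothetical table path

# Write a Hudi table with the metadata table & its column-stats index enabled,
# so per-file min/max stats are collated into the index.
(
    df.write.format("hudi")
    .option("hoodie.table.name", "employees")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.partitionpath.field", "dept")
    .option("hoodie.datasource.write.precombine.field", "salary")
    .option("hoodie.metadata.enable", "true")
    .option("hoodie.metadata.index.column.stats.enable", "true")
    .mode("overwrite")
    .save(base_path)
)

# Read with data skipping: the filter on salary is matched against the
# column-stats index, so files whose [min, max] range cannot contain 30000
# (File2 in the example above) are skipped entirely.
result = (
    spark.read.format("hudi")
    .option("hoodie.metadata.enable", "true")
    .option("hoodie.enable.data.skipping", "true")
    .load(base_path)
    .where("salary = 30000")
)
result.show()
```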
Dipankar Mazumdar, M.Sc 🥑 - are there any shortcomings of Hudi compared to Iceberg? I am planning to do a POC where I want to ingest daily updates and need a robust catalogue, metadata management and ACID compliance, plus ease of incremental updates with schema evolution, performing SCD-type changes as needed, and backfills.. any suggestions on how to get started?
Staff Data Engineer Advocate @Onehouse.ai | Apache Hudi, Iceberg Contributor | Author of "Engineering Lakehouses"
https://hudi.apache.org/docs/next/metadata/