🐬 Dremio users deal with a lot of different data file and table formats. ⚒️ So Dipankar Mazumdar, M.Sc 🥑 of Onehouse and Alex Merced of Dremio teamed up on a new blog post. 🕵️ In it, they show how you can use Dremio to run lakehouse analytics smoothly on Apache Hudi and Apache Iceberg data with the help of Apache XTable (Incubating). 0️⃣1️⃣ Check it out - and borrow and revise the code as needed. https://lnkd.in/g-jBvJ55 #apachehudi #XTable #datalakehouse #dataengineering #onehouse #opensource #nolockin
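If you want a feel for the XTable step before reading the post, here is a minimal sketch of a metadata sync from Hudi to Iceberg. It is only a sketch: the bucket, table name, and jar version are placeholders, and the config keys are the ones shown in the Apache XTable (Incubating) docs, so double-check them against your build.

```python
# Hedged sketch: translate a Hudi table's metadata into Iceberg with Apache XTable,
# so an Iceberg-aware engine such as Dremio can query the same data files.
# The bucket/path, table name, and jar version below are placeholders.
import subprocess
import textwrap

config = textwrap.dedent("""\
    sourceFormat: HUDI
    targetFormats:
      - ICEBERG
    datasets:
      - tableBasePath: s3://my-bucket/warehouse/trips_hudi   # placeholder path
        tableName: trips
""")

with open("xtable_config.yaml", "w") as f:
    f.write(config)

# The bundled utilities jar name/version is an assumption; adjust to your build.
subprocess.run(
    ["java", "-jar", "xtable-utilities-0.1.0-bundled.jar",
     "--datasetConfig", "xtable_config.yaml"],
    check=True,
)
```

After the sync, the same Parquet files carry Iceberg metadata alongside the Hudi timeline, so Dremio (or any Iceberg-aware engine) can query the table without copying data.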
-
Multi-Modal Indexing System in Apache Hudi.

One of the core technical design choices that separates Apache Hudi from other open table formats is 𝐈𝐧𝐝𝐞𝐱𝐢𝐧𝐠. From its inception, Hudi has been heavily optimized to handle mutable change streams with varying write patterns. Indexing plays a critical role in handling these writes and enabling faster updates: indexes speed up upserts/deletes by quickly locating the records that need to be updated.

𝐖𝐡𝐚𝐭 𝐢𝐬 𝐇𝐮𝐝𝐢'𝐬 𝐌𝐮𝐥𝐭𝐢-𝐌𝐨𝐝𝐚𝐥 𝐢𝐧𝐝𝐞𝐱𝐢𝐧𝐠?
Hudi's multi-modal indexing system redefines the design of an indexing subsystem for data lakes.

𝐇𝐨𝐰 𝐢𝐬 𝐢𝐭 𝐢𝐦𝐩𝐥𝐞𝐦𝐞𝐧𝐭𝐞𝐝?
- By using Hudi's metadata table
- Hudi's metadata table is itself a single internal Hudi Merge-on-Read table
- It holds different types of indexes (column stats, bloom filter, files, record-level, etc.) as individual '𝘱𝘢𝘳𝘵𝘪𝘵𝘪𝘰𝘯𝘴'

𝐂𝐨𝐫𝐞 𝐃𝐞𝐬𝐢𝐠𝐧 𝐏𝐫𝐢𝐧𝐜𝐢𝐩𝐥𝐞𝐬
- Scalable metadata: the metadata table grows as updates get larger, so it needs to scale to TBs of metadata
- ACID updates: index & table metadata must always be up-to-date & in sync with the data table, with no partial writes
- Fast lookups: index sizes can be huge, so efficient queries must avoid scanning entire indexes

𝐇𝐨𝐰 𝐝𝐨𝐞𝐬 𝐢𝐭 𝐢𝐦𝐩𝐫𝐨𝐯𝐞 𝐩𝐞𝐫𝐟𝐨𝐫𝐦𝐚𝐧𝐜𝐞?
✅ File Listings: for large datasets, listing files is a huge bottleneck. With the files index, listing time drops drastically vs. direct S3 listing
✅ Data Skipping: on the read side, the column_stats index (min/max values) can be used to skip irrelevant data files. This serves as a second level of pruning vs. reading each #Parquet footer directly
✅ Fast Upserts: the bloom_filter index stores the bloom filters of all data files, avoiding the need to scan each data file's footer, and record-level indexes help locate records faster

Having worked on other open #lakehouse table formats, I understand the specific pain points around write latency and how a multi-modal indexing system can be extremely beneficial. Detailed read in comments.

#dataengineering #softwareengineering
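For a concrete sense of how this shows up in practice, here is a minimal PySpark sketch that enables the metadata table and a couple of its index 'partitions' on write, then uses data skipping on read. The table name, key fields, and path are placeholders, and the option names follow the Hudi 0.14.x docs, so verify them against your Hudi version.

```python
# Hedged sketch: enable Hudi's metadata table plus column-stats / record-level
# indexes on write, then lean on data skipping at read time.
# Assumes the hudi-spark bundle is on the classpath; keys/paths are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hudi-multimodal-index")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

df = spark.createDataFrame(
    [(1, "rider-A", 27.7, "2024-06-01 10:00:00", "2024-06-01")],
    ["trip_id", "rider", "fare", "ts", "dt"],
)

hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.partitionpath.field": "dt",
    "hoodie.datasource.write.operation": "upsert",
    # Multi-modal indexing lives inside the internal MoR metadata table:
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.index.column.stats.enable": "true",   # column_stats partition
    "hoodie.metadata.record.index.enable": "true",          # record-level index
}
df.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/trips_hudi")

# Read side: let the planner use column stats to skip irrelevant files.
trips = (
    spark.read.format("hudi")
    .option("hoodie.metadata.enable", "true")
    .option("hoodie.enable.data.skipping", "true")
    .load("/tmp/trips_hudi")
)
trips.filter("fare > 20").show()
```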
-
Why Do We Need Open Table Formats (Apache Hudi, Apache Iceberg, and Delta Lake)? 🤔

As data architecture moves towards the data lakehouse model (decoupling storage and compute), we need data warehouse-like functionality on our tables: think ACID transactions, better performance, and consistency. Open Table Formats (OTFs) also eliminate the shortcomings of the Hive table format:

❌ Hive lacks support for UPDATE, DELETE, and UPSERT.
✅ OTFs enable full CRUD operations.
❌ Hive has no ACID guarantees, leading to reliability issues with concurrent writes.
✅ OTFs provide ACID transactions, ensuring safe writes.
❌ Hive struggles with performance and scalability.
✅ OTFs offer better performance, optimized metadata, and scalability.

In addition, OTFs bring more features:
• 🔄 𝗣𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻 𝗘𝘃𝗼𝗹𝘂𝘁𝗶𝗼𝗻: Change table partitioning without rewriting data.
• 📝 𝗦𝗰𝗵𝗲𝗺𝗮 𝗘𝘃𝗼𝗹𝘂𝘁𝗶𝗼𝗻: Easily add, remove, or rename columns.
• ⏳ 𝗧𝗶𝗺𝗲 𝗧𝗿𝗮𝘃𝗲𝗹: Query past versions of your data.
• 🔙 𝗗𝗮𝘁𝗮 𝗩𝗲𝗿𝘀𝗶𝗼𝗻𝗶𝗻𝗴: Immutable data files allow rollback to any previous table version.

These features are powered by a METADATA LAYER introduced on top of the data layer (the data files) that stores the following information:
• 📜 𝗦𝗰𝗵𝗲𝗺𝗮 & 𝗣𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻 𝗜𝗻𝗳𝗼: Table structure and partition details for quick query execution.
• 📊 𝗦𝘁𝗮𝘁𝗶𝘀𝘁𝗶𝗰𝗮𝗹 𝗜𝗻𝗳𝗼: Row count, min/max values, and null counts for better query optimization.
• 🕒 𝗖𝗼𝗺𝗺𝗶𝘁 𝗛𝗶𝘀𝘁𝗼𝗿𝘆: Track changes for time travel and version rollback.
• 📂 𝗗𝗮𝘁𝗮 𝗙𝗶𝗹𝗲 𝗣𝗮𝘁𝗵𝘀: Locations of the data files, including partition details.

If you want to dig deeper, there's an in-depth blog by Dipankar Mazumdar, M.Sc 🥑 that explains the evolution of data architectures and how we got to where we are today (on OTFs). Link in the comments 👇

#DataEngineering #OpenTableFormat #ApacheHudi #ApacheIceberg #DeltaLake #DataLakeHouse #BigData
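To make the CRUD and time-travel points concrete, here is a small PySpark sketch using Delta Lake as one example OTF (Hudi and Iceberg expose equivalent features through their own options). It assumes the delta-spark package is available; the path and column names are placeholders.

```python
# Hedged sketch: ACID upsert (MERGE) and time travel on one example OTF, Delta Lake.
# Assumes delta-spark is on the classpath; path and columns are placeholders.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("otf-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "/tmp/customers_delta"
spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"]) \
    .write.format("delta").mode("overwrite").save(path)

# ACID UPSERT -- something the classic Hive table format cannot do in place.
updates = spark.createDataFrame([(2, "bobby"), (3, "carol")], ["id", "name"])
(DeltaTable.forPath(spark, path).alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Time travel: read the table as of the first commit (version 0).
spark.read.format("delta").option("versionAsOf", 0).load(path).show()
```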
-
Check out this article from CRN about our fully managed “Icehouse” data analytics platform!
If Starburst Data + Apache Iceberg were to release new lyrics for what this Icehouse architecture symbolizes, it would start with: "Flow like a river, data streams so clean, Starburst's SQL, it's a mean machine, Iceberg tables, organized and lean, Query optimization, gonna be like you've never seen."
Starburst Debuts “Icehouse” Managed Data Lake Service
crn.com
-
The open data lakehouse approach offers a powerful foundation for data management. However, choosing the optimal data format for various workloads can be a challenge. Popular options like Delta Lake, Apache Iceberg, and Apache Hudi all offer distinct advantages. Standardizing on a single format can seem daunting, leading to "decision fatigue" and missed opportunities. 🌟 This is where 𝐃𝐞𝐥𝐭𝐚 𝐔𝐧𝐢𝐅𝐨𝐫𝐦 by Databricks steps in. It provides a simple and elegant solution for achieving interoperability across these open formats. 🚀 https://lnkd.in/gaKKJyQ8 #Databricks #DeltaUniform #DataHack #LTIMindtreeXDatabricksCoE
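As a rough illustration of how little is involved, the sketch below creates a Delta table with UniForm's Iceberg metadata generation turned on. The table name is a placeholder, and the table properties follow the Delta Lake 3.x / Databricks docs, so verify them for your runtime.

```python
# Hedged sketch: create a Delta table that also produces Iceberg metadata via
# UniForm. Table name is a placeholder; property names follow the Delta 3.x docs.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("uniform-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_uniform (
        order_id BIGINT,
        amount   DOUBLE,
        order_ts TIMESTAMP
    )
    USING DELTA
    TBLPROPERTIES (
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
# Writers keep using Delta as usual; Iceberg metadata is generated alongside,
# so Iceberg-compatible engines can read the same table without copying data.
```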
Delta UniForm: a universal format for lakehouse interoperability
databricks.com
-
Are you trying to choose a metastore or a catalog, but not sure what to pick? Read our latest comprehensive comparison article where Kyle Weller tears down some of the top catalogs on the market: Unity Catalog, Apache Polaris (Incubating), DataHub, Glue, Apache Gravitino, and Atlan.

In this deep dive you will find head-to-head comparisons of features ranging from access controls and data quality to data discovery and much more. The article breaks down the difference between a metastore and a business catalog, and it describes the intricate relationships between catalogs and the lakehouse open table formats Apache Hudi, Delta Lake, and Apache Iceberg.

Sneak peek at the rankings:

Data Discovery and Exploration
✅ Best = Atlan
❌ Worst = Apache Polaris

Data Connectors
✅ Best = DataHub
❌ Worst = Apache Polaris

Access Control
✅ Best = Apache Polaris
❌ Worst = DataHub

Compliance
✅ Best = DataHub
❌ Worst = Apache Gravitino

Data Lineage
✅ Best = Databricks Unity Catalog
❌ Worst = Apache Polaris

Data Quality
✅ Best = Glue
❌ Worst = Unity Catalog OSS

Read the blog for the full descriptions, links to docs, and the metastore rankings: https://lnkd.in/gBvA6YiK

#datacatalog #unitycatalog #datahub #apachepolaris #atlan #awsglue #apachegravitino #dataengineering #apachehudi #apacheiceberg #deltalake
Comprehensive Data Catalog Comparison
onehouse.ai
-
🔗 New read on Medium! 📘

I've just published an article exploring the integration of Apache Spark with DuckDB to optimize data analytics operations. This setup proves to be a game-changer for handling vast datasets with incredible efficiency. 🚀

Highlights include:
- Step-by-step guide on combining Apache Spark with DuckDB
- Insights into the performance benefits of this integration

Ideal for data professionals looking to enhance their toolkit. Check it out and let me know your thoughts or experiences with these technologies!

👉 Read the full article here: Enhancing Data Analytics: Connecting Apache Spark and DuckDB for Optimal Performance - (https://lnkd.in/d3cuhWpx)

#DataEngineering #BigData #TechInnovation #ApacheSpark #DuckDB
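One common pattern in this space, sketched very roughly below, is to let Spark do the heavy distributed writes and then point DuckDB at the resulting Parquet for fast local analysis. This is only an illustration of the idea, not the article's exact setup; the paths are placeholders and it assumes the pyspark and duckdb packages are installed.

```python
# Hedged sketch: Spark writes Parquet, DuckDB queries it in-process.
# Paths are placeholders; assumes pyspark and duckdb are installed.
import duckdb
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-duckdb").getOrCreate()

# Heavy lifting in Spark: write a Parquet dataset.
events = spark.createDataFrame(
    [("2024-06-01", "click", 3), ("2024-06-01", "view", 9)],
    ["day", "event_type", "cnt"],
)
events.write.mode("overwrite").parquet("/tmp/events_parquet")

# Fast local analytics in DuckDB over the same files, no cluster needed.
result = duckdb.sql("""
    SELECT event_type, SUM(cnt) AS total
    FROM read_parquet('/tmp/events_parquet/*.parquet')
    GROUP BY event_type
    ORDER BY total DESC
""").df()
print(result)
```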
Enhancing Data Analytics: Connecting Apache Spark and DuckDB for Optimal Performance
medium.com
-
🚀 New Blog Post Alert! 📄

In this latest post, I dive deep into Parquet file structure and metadata, uncovering how a Parquet file stores data internally and how metadata plays a crucial role in speeding up data retrieval. If you, like me, are curious to understand the internals of the Apache Parquet format, this blog is for you!

https://lnkd.in/gtzXaZYG

#ApacheParquet #BigData #DataEngineering #Curious
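If you want to poke at that footer metadata yourself, a small pyarrow sketch is below. The file path is a placeholder and it assumes pyarrow is installed; it simply prints the row-group and column statistics that query engines use for pruning.

```python
# Hedged sketch: inspect Parquet file / row-group / column metadata with pyarrow.
# The file path is a placeholder.
import pyarrow.parquet as pq

pf = pq.ParquetFile("/tmp/data/part-00000.parquet")

meta = pf.metadata
print("rows:", meta.num_rows, "row groups:", meta.num_row_groups)
print("schema:", pf.schema_arrow)

# Per-column statistics (min/max/null count) live in each row group's footer --
# exactly what engines use for predicate pushdown and data skipping.
rg = meta.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    stats = col.statistics
    if stats is not None:
        print(col.path_in_schema, "min:", stats.min, "max:", stats.max,
              "nulls:", stats.null_count)
```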
Apache Parquet - File structure and metadata - CONNECTING DOTS
https://learn2infiniti.com
-
UniForm, aka Universal Format, for unifying the open table formats.
Delta UniForm and Apache XTable (Incubating) are actively collaborating to help organizations build architectures that are not constrained to any single ecosystem. These tools take advantage of the fact that #DeltaLake, #Iceberg, and #Hudi all consist of a metadata layer built on Apache Parquet data files. XTable translates metadata between a source and target format, maintaining a single copy of the data files.

In this blog, we cover:
⭐ Why users are choosing to interoperate
⭐ Building a format-agnostic lakehouse with UniForm and XTable
⭐ How the Delta Lake and XTable communities collaborate

Learn more ➡ https://lnkd.in/echrzTV3

cc Jonathan Brito, Kyle Weller

#oss #lfaidata #xtable #lakehouse #opentable
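As a rough sketch of what "one copy of data, many formats" looks like from an engine's point of view, the snippet below reads the same table directory as Delta and as Iceberg after its metadata has been translated (via UniForm or XTable). It assumes both the delta-spark and iceberg-spark-runtime jars are on the classpath and that the Iceberg metadata sits alongside the data as a Hadoop-style table; the path is a placeholder.

```python
# Hedged sketch: one set of Parquet data files, read through two metadata layers.
# Assumes delta-spark and iceberg-spark-runtime are on the classpath and that
# Iceberg metadata was generated next to the data (UniForm or XTable).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("format-agnostic-read")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension,"
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .getOrCreate()
)

table_path = "s3://my-bucket/warehouse/orders"  # placeholder

# Same directory, Delta metadata layer...
delta_df = spark.read.format("delta").load(table_path)

# ...and Iceberg metadata layer (path-based Hadoop table read).
iceberg_df = spark.read.format("iceberg").load(table_path)

print(delta_df.count(), iceberg_df.count())  # two views of one copy of the data
```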
Unifying the open table formats with Delta Lake Universal Format (UniForm) and Apache XTable
delta.io
-
🚀 Lakehouse Feature Comparison: Apache Hudi vs Delta Lake vs Apache Iceberg

Just finished reading an insightful blog that dives deep into a detailed comparison of three leading data lakehouse engines. The blog goes beyond the usual surface-level differences, providing a comprehensive analysis of key features and TPC-DS benchmarks, and highlighting how these engines perform across popular industry use cases. It’s a great resource for anyone working with modern data architectures and looking to make informed decisions about the right tool for their needs.

🔍 Key takeaways include:
- In-depth feature comparison for transaction management, ACID compliance, schema evolution, and time travel.
- Benchmarks for performance and scalability.
- Real-world use cases in large-scale industry scenarios.

💡 If you are in the data engineering space or exploring data lakes, I highly recommend checking it out!

Read the full comparison here: https://lnkd.in/g2R_m_yg

At Unacademy, we rely on Apache Hudi for our transactional data lake needs and appreciate the robustness it offers in terms of incremental processing, ACID guarantees, and scalable data management.

#DataEngineering #Lakehouse #ApacheHudi #DeltaLake #ApacheIceberg #BigData #Unacademy
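Since incremental processing is called out as the draw here, below is a hedged sketch of a Hudi incremental read: pulling only the records that changed after a given commit instant instead of rescanning the whole table. The path and instant time are placeholders, and the option names follow the Hudi Spark datasource docs, so check them against your version.

```python
# Hedged sketch: Hudi incremental query -- read only records written after a
# given commit instant. Path and begin instant are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-incremental").getOrCreate()

changes = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20240601000000000")
    .load("/tmp/trips_hudi")
)
changes.show()
```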
Apache Hudi vs Delta Lake vs Apache Iceberg - Data Lakehouse Feature Comparison
onehouse.ai
Comment (7 mo): Oh man, cannot miss seeing you folks, Dipankar Mazumdar, M.Sc 🥑 and Alex Merced, in action together... it's like Batman and Superman coming together for a common cause to save the #data world. 😎😀