🐬 Dremio users deal with a lot of different data file and table formats. ⚒️ So Dipankar Mazumdar, M.Sc 🥑 of Onehouse and Alex Merced of Dremio teamed up on a new blog post. 🕵️ In it, they show how you can use Dremio to run lakehouse analytics smoothly on Apache Hudi and Apache Iceberg data with the help of Apache XTable (Incubating). 0️⃣1️⃣ Check it out - and borrow and revise the code as needed. https://lnkd.in/g-jBvJ55 #apachehudi #XTable #datalakehouse #dataengineering #onehouse #opensource #nolockin
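If you want a feel for the XTable step before reading the post, here is a minimal sketch of a metadata sync from Hudi to Iceberg. It is only a sketch: the bucket, table name, and jar version are placeholders, and the config keys are the ones shown in the Apache XTable (Incubating) docs, so double-check them against your build.

```python
# Hedged sketch: translate a Hudi table's metadata into Iceberg with Apache XTable,
# so an Iceberg-aware engine such as Dremio can query the same data files.
# The bucket/path, table name, and jar version below are placeholders.
import subprocess
import textwrap

config = textwrap.dedent("""\
    sourceFormat: HUDI
    targetFormats:
      - ICEBERG
    datasets:
      - tableBasePath: s3://my-bucket/warehouse/trips_hudi   # placeholder path
        tableName: trips
""")

with open("xtable_config.yaml", "w") as f:
    f.write(config)

# The bundled utilities jar name/version is an assumption; adjust to your build.
subprocess.run(
    ["java", "-jar", "xtable-utilities-0.1.0-bundled.jar",
     "--datasetConfig", "xtable_config.yaml"],
    check=True,
)
```

After the sync, the same Parquet files carry Iceberg metadata alongside the Hudi timeline, so Dremio (or any Iceberg-aware engine) can query the table without copying data.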
-
Multi-Modal Indexing System in Apache Hudi.

One of the core technical design choices that separates Apache Hudi from other open table formats is 𝐈𝐧𝐝𝐞𝐱𝐢𝐧𝐠. From its inception, Hudi has been heavily optimized to handle mutable change streams with varying write patterns. Indexing plays a critical role in handling these writes and enabling faster updates: indexes speed up upserts/deletes by quickly locating the records that need to be updated.

𝐖𝐡𝐚𝐭 𝐢𝐬 𝐇𝐮𝐝𝐢'𝐬 𝐌𝐮𝐥𝐭𝐢-𝐌𝐨𝐝𝐚𝐥 𝐢𝐧𝐝𝐞𝐱𝐢𝐧𝐠?
Hudi's multi-modal indexing system redefines the design of an indexing subsystem for data lakes.

𝐇𝐨𝐰 𝐢𝐬 𝐢𝐭 𝐢𝐦𝐩𝐥𝐞𝐦𝐞𝐧𝐭𝐞𝐝?
- By using Hudi's metadata table
- Hudi's metadata table is itself a single internal Hudi Merge-on-Read table
- It holds different types of indexes (column stats, bloom filter, files, record-level, etc.) as individual '𝘱𝘢𝘳𝘵𝘪𝘵𝘪𝘰𝘯𝘴'

𝐂𝐨𝐫𝐞 𝐃𝐞𝐬𝐢𝐠𝐧 𝐏𝐫𝐢𝐧𝐜𝐢𝐩𝐥𝐞𝐬
- Scalable metadata: the metadata table grows as updates get larger, so it needs to scale to TBs of metadata
- ACID updates: index & table metadata must always be up-to-date & in sync with the data table, with no partial writes
- Fast lookups: index sizes can be huge, so efficient queries must avoid scanning entire indexes

𝐇𝐨𝐰 𝐝𝐨𝐞𝐬 𝐢𝐭 𝐢𝐦𝐩𝐫𝐨𝐯𝐞 𝐩𝐞𝐫𝐟𝐨𝐫𝐦𝐚𝐧𝐜𝐞?
✅ File Listings: for large datasets, listing files is a huge bottleneck. With the files index, listing time drops drastically vs. direct S3 listing
✅ Data Skipping: on the read side, the column_stats index (min/max values) can be used to skip irrelevant data files. This serves as a second level of pruning vs. reading each #Parquet footer directly
✅ Fast Upserts: the bloom_filter index stores the bloom filters of all data files, avoiding the need to scan each data file's footer, and record-level indexes help locate records faster

Having worked on other open #lakehouse table formats, I understand the specific pain points around write latency and how a multi-modal indexing system can be extremely beneficial. Detailed read in comments.

#dataengineering #softwareengineering
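For a concrete sense of how this shows up in practice, here is a minimal PySpark sketch that enables the metadata table and a couple of its index 'partitions' on write, then uses data skipping on read. The table name, key fields, and path are placeholders, and the option names follow the Hudi 0.14.x docs, so verify them against your Hudi version.

```python
# Hedged sketch: enable Hudi's metadata table plus column-stats / record-level
# indexes on write, then lean on data skipping at read time.
# Assumes the hudi-spark bundle is on the classpath; keys/paths are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hudi-multimodal-index")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

df = spark.createDataFrame(
    [(1, "rider-A", 27.7, "2024-06-01 10:00:00", "2024-06-01")],
    ["trip_id", "rider", "fare", "ts", "dt"],
)

hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.partitionpath.field": "dt",
    "hoodie.datasource.write.operation": "upsert",
    # Multi-modal indexing lives inside the internal MoR metadata table:
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.index.column.stats.enable": "true",   # column_stats partition
    "hoodie.metadata.record.index.enable": "true",          # record-level index
}
df.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/trips_hudi")

# Read side: let the planner use column stats to skip irrelevant files.
trips = (
    spark.read.format("hudi")
    .option("hoodie.metadata.enable", "true")
    .option("hoodie.enable.data.skipping", "true")
    .load("/tmp/trips_hudi")
)
trips.filter("fare > 20").show()
```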
-
Why Do We Need Open Table Formats (Apache Hudi, Apache Iceberg, and Delta Lake)? 🤔

As data architecture moves towards the data lakehouse model (decoupling storage and compute), we need data warehouse-like functionality on our tables: think ACID transactions, better performance, and consistency. Open Table Formats (OTFs) also eliminate the shortcomings of the Hive table format:

❌ Hive lacks support for UPDATE, DELETE, and UPSERT.
✅ OTFs enable full CRUD operations.
❌ Hive has no ACID guarantees, leading to reliability issues with concurrent writes.
✅ OTFs provide ACID transactions, ensuring safe writes.
❌ Hive struggles with performance and scalability.
✅ OTFs offer better performance, optimized metadata, and scalability.

In addition, OTFs bring more features:
• 🔄 𝗣𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻 𝗘𝘃𝗼𝗹𝘂𝘁𝗶𝗼𝗻: Change table partitioning without rewriting data.
• 📝 𝗦𝗰𝗵𝗲𝗺𝗮 𝗘𝘃𝗼𝗹𝘂𝘁𝗶𝗼𝗻: Easily add, remove, or rename columns.
• ⏳ 𝗧𝗶𝗺𝗲 𝗧𝗿𝗮𝘃𝗲𝗹: Query past versions of your data.
• 🔙 𝗗𝗮𝘁𝗮 𝗩𝗲𝗿𝘀𝗶𝗼𝗻𝗶𝗻𝗴: Immutable data files allow rollback to any previous table version.

These features are powered by a METADATA LAYER introduced on top of the data layer (the data files) that stores the following information:
• 📜 𝗦𝗰𝗵𝗲𝗺𝗮 & 𝗣𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻 𝗜𝗻𝗳𝗼: Table structure and partition details for quick query execution.
• 📊 𝗦𝘁𝗮𝘁𝗶𝘀𝘁𝗶𝗰𝗮𝗹 𝗜𝗻𝗳𝗼: Row count, min/max values, and null counts for better query optimization.
• 🕒 𝗖𝗼𝗺𝗺𝗶𝘁 𝗛𝗶𝘀𝘁𝗼𝗿𝘆: Track changes for time travel and version rollback.
• 📂 𝗗𝗮𝘁𝗮 𝗙𝗶𝗹𝗲 𝗣𝗮𝘁𝗵𝘀: Locations of the data files, including partition details.

If you want to dig deeper, there's an in-depth blog by Dipankar Mazumdar, M.Sc 🥑 that explains the evolution of data architectures and how we got to where we are today (on OTFs). Link in the comments 👇

#DataEngineering #OpenTableFormat #ApacheHudi #ApacheIceberg #DeltaLake #DataLakeHouse #BigData
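To make the CRUD and time-travel points concrete, here is a small PySpark sketch using Delta Lake as one example OTF (Hudi and Iceberg expose equivalent features through their own options). It assumes the delta-spark package is available; the path and column names are placeholders.

```python
# Hedged sketch: ACID upsert (MERGE) and time travel on one example OTF, Delta Lake.
# Assumes delta-spark is on the classpath; path and columns are placeholders.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("otf-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "/tmp/customers_delta"
spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"]) \
    .write.format("delta").mode("overwrite").save(path)

# ACID UPSERT -- something the classic Hive table format cannot do in place.
updates = spark.createDataFrame([(2, "bobby"), (3, "carol")], ["id", "name"])
(DeltaTable.forPath(spark, path).alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Time travel: read the table as of the first commit (version 0).
spark.read.format("delta").option("versionAsOf", 0).load(path).show()
```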
-
Check out this article from CRN about our fully managed “Icehouse” data analytics platform!
If Starburst Data + Apache Iceberg were to release new lyrics for what this Icehouse architecture symbolizes, it would start with: "Flow like a river, data streams so clean, Starburst's SQL, it's a mean machine, Iceberg tables, organized and lean, Query optimization, gonna be like you've never seen."
Starburst Debuts “Icehouse” Managed Data Lake Service
crn.com
-
The open data lakehouse approach offers a powerful foundation for data management. However, choosing the optimal data format for various workloads can be a challenge. Popular options like Delta Lake, Apache Iceberg, and Apache Hudi all offer distinct advantages. Standardizing on a single format can seem daunting, leading to "decision fatigue" and missed opportunities. 🌟 This is where 𝐃𝐞𝐥𝐭𝐚 𝐔𝐧𝐢𝐅𝐨𝐫𝐦 by Databricks steps in. It provides a simple and elegant solution for achieving interoperability across these open formats. 🚀 https://lnkd.in/gaKKJyQ8 #Databricks #DeltaUniform #DataHack #LTIMindtreeXDatabricksCoE
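As a rough illustration of how little is involved, the sketch below creates a Delta table with UniForm's Iceberg metadata generation turned on. The table name is a placeholder, and the table properties follow the Delta Lake 3.x / Databricks docs, so verify them for your runtime.

```python
# Hedged sketch: create a Delta table that also produces Iceberg metadata via
# UniForm. Table name is a placeholder; property names follow the Delta 3.x docs.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("uniform-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_uniform (
        order_id BIGINT,
        amount   DOUBLE,
        order_ts TIMESTAMP
    )
    USING DELTA
    TBLPROPERTIES (
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
# Writers keep using Delta as usual; Iceberg metadata is generated alongside,
# so Iceberg-compatible engines can read the same table without copying data.
```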
Delta UniForm: a universal format for lakehouse interoperability
databricks.com
-
Are you trying to choose a metastore or a catalog, but not sure what to pick? Read our latest comprehensive comparison article where Kyle Weller tears down some of the top catalogs on the market: Unity Catalog, Apache Polaris (Incubating), DataHub, Glue, Apache Gravitino, and Atlan.

In this deep dive you will find head-to-head comparisons of features ranging from access controls and data quality to data discovery and much more. The article breaks down the difference between a metastore and a business catalog, and it describes the intricate relationships between catalogs and the lakehouse open table formats Apache Hudi, Delta Lake, and Apache Iceberg.

Sneak peek at the rankings:

Data Discovery and Exploration
✅ Best = Atlan
❌ Worst = Apache Polaris

Data Connectors
✅ Best = DataHub
❌ Worst = Apache Polaris

Access Control
✅ Best = Apache Polaris
❌ Worst = DataHub

Compliance
✅ Best = DataHub
❌ Worst = Apache Gravitino

Data Lineage
✅ Best = Databricks Unity Catalog
❌ Worst = Apache Polaris

Data Quality
✅ Best = Glue
❌ Worst = Unity Catalog OSS

Read the blog for the full descriptions, links to docs, and the metastore rankings: https://lnkd.in/gBvA6YiK

#datacatalog #unitycatalog #datahub #apachepolaris #atlan #awsglue #apachegravitino #dataengineering #apachehudi #apacheiceberg #deltalake
Comprehensive Data Catalog Comparison
onehouse.ai
-
🔗 New read on Medium! 📘

I've just published an article exploring the integration of Apache Spark with DuckDB to optimize data analytics operations. This setup proves to be a game-changer for handling vast datasets with incredible efficiency. 🚀

Highlights include:
- Step-by-step guide on combining Apache Spark with DuckDB
- Insights into the performance benefits of this integration

Ideal for data professionals looking to enhance their toolkit. Check it out and let me know your thoughts or experiences with these technologies!

👉 Read the full article here: Enhancing Data Analytics: Connecting Apache Spark and DuckDB for Optimal Performance - (https://lnkd.in/d3cuhWpx)

#DataEngineering #BigData #TechInnovation #ApacheSpark #DuckDB
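One common pattern in this space, sketched very roughly below, is to let Spark do the heavy distributed writes and then point DuckDB at the resulting Parquet for fast local analysis. This is only an illustration of the idea, not the article's exact setup; the paths are placeholders and it assumes the pyspark and duckdb packages are installed.

```python
# Hedged sketch: Spark writes Parquet, DuckDB queries it in-process.
# Paths are placeholders; assumes pyspark and duckdb are installed.
import duckdb
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-duckdb").getOrCreate()

# Heavy lifting in Spark: write a Parquet dataset.
events = spark.createDataFrame(
    [("2024-06-01", "click", 3), ("2024-06-01", "view", 9)],
    ["day", "event_type", "cnt"],
)
events.write.mode("overwrite").parquet("/tmp/events_parquet")

# Fast local analytics in DuckDB over the same files, no cluster needed.
result = duckdb.sql("""
    SELECT event_type, SUM(cnt) AS total
    FROM read_parquet('/tmp/events_parquet/*.parquet')
    GROUP BY event_type
    ORDER BY total DESC
""").df()
print(result)
```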
Enhancing Data Analytics: Connecting Apache Spark and DuckDB for Optimal Performance
medium.com
-
🚀 New Blog Post Alert! 📄

In this latest post, I dive deep into Parquet file structure and metadata, uncovering how a Parquet file stores data internally and how metadata plays a crucial role in speeding up data retrieval. If you, like me, are curious to understand the internals of the Apache Parquet format, this blog is for you!

https://lnkd.in/gtzXaZYG

#ApacheParquet #BigData #DataEngineering #Curious
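If you want to poke at that footer metadata yourself, a small pyarrow sketch is below. The file path is a placeholder and it assumes pyarrow is installed; it simply prints the row-group and column statistics that query engines use for pruning.

```python
# Hedged sketch: inspect Parquet file / row-group / column metadata with pyarrow.
# The file path is a placeholder.
import pyarrow.parquet as pq

pf = pq.ParquetFile("/tmp/data/part-00000.parquet")

meta = pf.metadata
print("rows:", meta.num_rows, "row groups:", meta.num_row_groups)
print("schema:", pf.schema_arrow)

# Per-column statistics (min/max/null count) live in each row group's footer --
# exactly what engines use for predicate pushdown and data skipping.
rg = meta.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    stats = col.statistics
    if stats is not None:
        print(col.path_in_schema, "min:", stats.min, "max:", stats.max,
              "nulls:", stats.null_count)
```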
Apache Parquet - File structure and metadata - CONNECTING DOTS
https://learn2infiniti.com
-
UniForm, aka Universal Format, for unifying the open table formats.
Delta UniForm and Apache XTable (Incubating) are actively collaborating to help organizations build architectures that are not constrained to any single ecosystem. These tools take advantage of the fact that #DeltaLake, #Iceberg, and #Hudi all consist of a metadata layer built on Apache Parquet data files. XTable translates metadata between a source and target format, maintaining a single copy of the data files.

In this blog, we cover:
⭐ Why users are choosing to interoperate
⭐ Building a format-agnostic lakehouse with UniForm and XTable
⭐ How the Delta Lake and XTable communities collaborate

Learn more ➡ https://lnkd.in/echrzTV3

cc Jonathan Brito, Kyle Weller

#oss #lfaidata #xtable #lakehouse #opentable
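As a rough sketch of what "one copy of data, many formats" looks like from an engine's point of view, the snippet below reads the same table directory as Delta and as Iceberg after its metadata has been translated (via UniForm or XTable). It assumes both the delta-spark and iceberg-spark-runtime jars are on the classpath and that the Iceberg metadata sits alongside the data as a Hadoop-style table; the path is a placeholder.

```python
# Hedged sketch: one set of Parquet data files, read through two metadata layers.
# Assumes delta-spark and iceberg-spark-runtime are on the classpath and that
# Iceberg metadata was generated next to the data (UniForm or XTable).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("format-agnostic-read")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension,"
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .getOrCreate()
)

table_path = "s3://my-bucket/warehouse/orders"  # placeholder

# Same directory, Delta metadata layer...
delta_df = spark.read.format("delta").load(table_path)

# ...and Iceberg metadata layer (path-based Hadoop table read).
iceberg_df = spark.read.format("iceberg").load(table_path)

print(delta_df.count(), iceberg_df.count())  # two views of one copy of the data
```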
Unifying the open table formats with Delta Lake Universal Format (UniForm) and Apache XTable
delta.io
-
🚀 Lakehouse Feature Comparison: Apache Hudi vs Delta Lake vs Apache Iceberg

Just finished reading an insightful blog that dives deep into a detailed comparison of three leading data lakehouse engines. The blog goes beyond the usual surface-level differences, providing a comprehensive analysis of key features and TPC-DS benchmarks, and highlighting how these engines perform across popular industry use cases. It’s a great resource for anyone working with modern data architectures and looking to make informed decisions about the right tool for their needs.

🔍 Key takeaways include:
- In-depth feature comparison for transaction management, ACID compliance, schema evolution, and time travel.
- Benchmarks for performance and scalability.
- Real-world use cases in large-scale industry scenarios.

💡 If you are in the data engineering space or exploring data lakes, I highly recommend checking it out!

Read the full comparison here: https://lnkd.in/g2R_m_yg

At Unacademy, we rely on Apache Hudi for our transactional data lake needs and appreciate the robustness it offers in terms of incremental processing, ACID guarantees, and scalable data management.

#DataEngineering #Lakehouse #ApacheHudi #DeltaLake #ApacheIceberg #BigData #Unacademy
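Since incremental processing is called out as the draw here, below is a hedged sketch of a Hudi incremental read: pulling only the records that changed after a given commit instant instead of rescanning the whole table. The path and instant time are placeholders, and the option names follow the Hudi Spark datasource docs, so check them against your version.

```python
# Hedged sketch: Hudi incremental query -- read only records written after a
# given commit instant. Path and begin instant are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-incremental").getOrCreate()

changes = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20240601000000000")
    .load("/tmp/trips_hudi")
)
changes.show()
```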
Apache Hudi vs Delta Lake vs Apache Iceberg - Data Lakehouse Feature Comparison
onehouse.ai
Comment (7 mo): Oh man, cannot miss seeing you folks, Dipankar Mazumdar, M.Sc 🥑 and Alex Merced, in action together... it's like Batman and Superman coming together for a common cause to save the #data world. 😎😀