Itamar Syn-Hershko’s Post

Exactly 5 years before Databricks announced the acquisition of Tabular (now part of Databricks), and even before Tabular existed, I blogged about Apache Iceberg on our company blog (https://lnkd.in/dwrad_2m). This is what I wrote on June 12th, 2019:

"We believe Iceberg has huge potential of changing the way we do Data Warehousing and you should check it out!"

"While Apache Iceberg is still Work In Progress, you should definitely keep an eye out for it, and even learn more about it and get involved on this early stage. We believe Iceberg has huge potential of changing the way we do Data Warehousing, to a point where it's easier, faster, and more economical for all to use."

As with other technologies we identified early on at BigData Boutique as promising and sustainable, Apache Iceberg was, and remains, at the core of what we do. We have built plenty of Lakes using it, and we are about to complete a huge migration off BigQuery to an Iceberg-powered Data Warehouse.

The real intent behind this acquisition remains unknown. It's clear why Snowflake and Databricks had a bidding war (happy for you Ryan Blue!), but were Iceberg's founders acqui-hired to slowly suffocate that brilliant technology and let Delta thrive?

Iceberg is a truly remarkable innovation; I'm actually not even sure it can be called a technology. Ryan, the team, and the community did a stellar job in making it what it is today. It effectively killed Hudi and made Delta change course several times. The key to Iceberg is simplicity and a beautiful design built from the ground up. It would be such a shame to see it fade away just because someone had deep enough pockets to make the problem go away.
More Relevant Posts
-
Azure Databricks Performance: Optimize, Vacuum, and Z-Ordering in Databricks’ Delta Tables

Delta Lake is a powerful storage layer that brings ACID transactions to Apache Spark and big data workloads. It not only enhances reliability but also introduces various optimization techniques that can significantly boost performance and streamline data workflows. In this article, we will delve deeper into the inner workings of Delta Lake, shedding light on the mechanics that drive its functionality, and explore a selection of common optimization techniques applicable to Delta tables within the Databricks environment. Without further ado, let's dive right in.

Processing Tables and Data Files in Databricks:

First, let's create our table, run a SELECT query, and look at how the files are placed in the DBFS (Databricks File System) location. On this first run, only the _delta_log folder is created. Inside it there is a single JSON file describing the initial snapshot (the present state of the table).

Next, let's insert a few records. Each INSERT is treated as a separate operation, so three different data files are created, each containing only one record. Checking the _delta_log folder again, we find three more JSON files, one for each of the three insert operations that changed the table state. After three more inserts, we have a total of six data files and three further JSON files in _delta_log.

Now let's run another SELECT on the table and see how it works behind the scenes. Whenever we query a Delta table, Databricks first goes to the _delta_log folder, reads the latest JSON file (in our case 0000000000006.json) to determine the current snapshot of the table, that is, which data files it needs to read, and then produces the output from those files.

Next, a DELETE operation. A new JSON file appears in _delta_log, whereas the number of data files remains the same: six. The affected file is not removed immediately, for the simple reason of time travel, an important feature of Delta tables. Instead of removing the file, Databricks simply marks it "inactive", and all of this information is recorded in the latest log file (0000000000007.json).

Finally, an UPDATE operation. As expected, a new data file is added along with one additional JSON file: Delta keeps the previous file and writes the updated record into a new one. The latest log file (0000000000008.json) tells Databricks to ignore two files: one because its record was deleted, and the other because it holds the old value of the updated record.
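A minimal PySpark sketch of the sequence described above; the table name, columns, and values are illustrative assumptions rather than the ones from the original notebook, and the session is assumed to have Delta Lake available (as any Databricks cluster does).

```python
# Minimal sketch of the walkthrough above (illustrative names, not from the original post).
# Assumes Delta Lake is available; on Databricks it is preinstalled, elsewhere the delta-spark
# package must be on the classpath and the two configs below enabled.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-log-walkthrough")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# 1. Create the table: only the _delta_log folder with the first JSON snapshot appears.
spark.sql("CREATE TABLE IF NOT EXISTS employees (id INT, name STRING) USING DELTA")

# 2. Each INSERT is a separate transaction: one new data file and one new log entry per statement.
spark.sql("INSERT INTO employees VALUES (1, 'Alice')")
spark.sql("INSERT INTO employees VALUES (2, 'Bob')")
spark.sql("INSERT INTO employees VALUES (3, 'Carol')")

# 3. DELETE does not remove the old data file; it is only marked as removed in the newest
#    log entry, which is what makes time travel possible.
spark.sql("DELETE FROM employees WHERE id = 2")

# 4. UPDATE rewrites the affected record into a new file and marks the previous file as removed.
spark.sql("UPDATE employees SET name = 'Alicia' WHERE id = 1")

# Inspect the version chain that a SELECT resolves against, then read the current snapshot.
spark.sql("DESCRIBE HISTORY employees").show(truncate=False)
spark.sql("SELECT * FROM employees").show()
```

DESCRIBE HISTORY surfaces the same chain of versions that the JSON files in _delta_log record, which is also what time travel queries rely on.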
-
Redshift to Databricks
Our Journey from Redshift to Databricks at AnyClip
medium.com
-
🚀 Revolutionize Your Data Workflow! 🚀 Exciting news! MongoDB's Connector for BigQuery & Spark is here! Seamlessly integrate MongoDB with Google BigQuery & Apache Spark, unlocking effortless data analytics. Dive deeper with stored procedures for advanced analysis. Follow this blog tutorial to see how it works! #DataIntegration #MongoDB #BigQuery #Spark #TechInnovation
Spark Up Your MongoDB and BigQuery using BigQuery Spark Stored Procedures | MongoDB
mongodb.com
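Roughly the pattern the linked tutorial wires into a BigQuery Spark stored procedure: read from MongoDB with the MongoDB Spark connector, transform, and write to BigQuery with the Spark-BigQuery connector. A hedged sketch only; the connection URI, database, collection, and table names are placeholders, the option names assume the v10+ MongoDB connector and the current Spark-BigQuery connector, and both connector packages must be on the Spark classpath.

```python
# Hedged sketch: MongoDB -> Spark -> BigQuery. All names below are placeholders.
# Requires the MongoDB Spark connector (v10+) and the Spark-BigQuery connector jars.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mongo-to-bigquery").getOrCreate()

orders = (
    spark.read.format("mongodb")
    .option("connection.uri", "mongodb+srv://user:password@cluster.example.net")  # placeholder URI
    .option("database", "shop")
    .option("collection", "orders")
    .load()
)

# Light transformation before loading into the warehouse.
daily_totals = orders.groupBy("order_date").sum("amount")

(
    daily_totals.write.format("bigquery")
    .option("table", "my-project.analytics.daily_order_totals")  # placeholder table
    .option("writeMethod", "direct")  # direct write avoids needing a temporary GCS bucket
    .mode("overwrite")
    .save()
)
```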
-
Major update for Data Engineering folks (legends might already know this): Databricks and Snowflake have been the go-to players for managing Iceberg tables, thanks to their ability to handle complex maintenance like compaction. But now, #AWS has introduced something that could change the game: S3 #Tables.

The data in S3 Tables is stored in a new bucket type, a table bucket, which stores tables as subresources. Table buckets support storing tables in the Apache Iceberg format. Using standard SQL statements, you can query your tables with query engines that support Iceberg, such as Amazon Athena, Amazon Redshift, and Apache Spark (a sketch of the Spark setup follows below).

The best part? The cost is pretty reasonable: only about 15% more than standard S3 storage, and monitoring costs are minimal for most analytics workloads.

What excites me most is how this simplifies writing and maintaining Iceberg tables. Until now, this was a challenge reserved for big players due to the complexity. But with S3 Tables, AWS makes it much easier for smaller teams to build systems using Iceberg tables without worrying about heavy maintenance.

There's also potential for a big shift in how Iceberg catalogs work. Instead of running separate catalog services (and managing the overhead), object stores like S3 could handle this natively. Plus, authentication gets simpler: AWS IAM policies can manage both data access and catalog operations seamlessly.

It's worth noting that AWS named it S3 Tables, not "S3 Iceberg," which hints they might plan to support other table formats in the future. I wouldn't be surprised if other cloud providers follow AWS's lead soon. If you're building data systems on AWS or exploring Iceberg, S3 Tables are worth a close look.

Follow me (Avinash S. ) for more.

Image: a simplified reference architecture of a data lake based on Iceberg.
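A hedged sketch of the Spark setup mentioned above. The catalog implementation class and the table-bucket ARN are assumptions based on AWS's S3 Tables client catalog for Iceberg; check the official documentation for the exact artifact and class names.

```python
# Hedged sketch: querying an S3 Tables (Iceberg) table from Spark.
# The catalog class, required jars, and the ARN below are assumptions, not verified values.
from pyspark.sql import SparkSession

TABLE_BUCKET_ARN = "arn:aws:s3tables:us-east-1:123456789012:bucket/analytics-bucket"  # placeholder

spark = (
    SparkSession.builder
    .appName("s3-tables-iceberg")
    .config("spark.sql.catalog.s3tablesbucket", "org.apache.iceberg.spark.SparkCatalog")
    # AWS ships a catalog implementation that talks to the S3 Tables API; the class name below
    # is an assumption -- substitute the one from the official s3-tables-catalog artifact.
    .config("spark.sql.catalog.s3tablesbucket.catalog-impl", "software.amazon.s3tables.iceberg.S3TablesCatalog")
    .config("spark.sql.catalog.s3tablesbucket.warehouse", TABLE_BUCKET_ARN)
    .getOrCreate()
)

# Standard Iceberg SQL once the catalog is wired up; AWS handles compaction and snapshot cleanup.
spark.sql("CREATE NAMESPACE IF NOT EXISTS s3tablesbucket.web")
spark.sql("""
    CREATE TABLE IF NOT EXISTS s3tablesbucket.web.page_views (
        user_id BIGINT, url STRING, viewed_at TIMESTAMP
    ) USING iceberg
""")
spark.sql("SELECT url, COUNT(*) AS views FROM s3tablesbucket.web.page_views GROUP BY url").show()
```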
-
Considering a move from Redshift to Databricks for your data lakehouse? You're not alone. Many businesses are finding Databricks offers a powerful alternative, with:
➡ Unified Data Management: Simplify operations with a single platform for all your data needs, eliminating the siloed approach of Redshift.
➡ Enhanced Performance & Scalability: Handle complex workloads and growing data volumes with ease thanks to Databricks' Apache Spark architecture.
➡ Open Source Advantage: Leverage the flexibility and innovation of the open-source ecosystem for a future-proof data platform.
But is migrating the right move for you? Our blog post dives deep into the key considerations, helping you decide if Databricks is the key to unlocking your data's true potential.
#databricks #redshift #datalakehouse #dataanalysis #cloudmigration
P.S. Have questions about data migration or Databricks? Comment below and let's chat! https://lnkd.in/e8EyvpVK
The Case for Migrating to Databricks from Redshift
https://bentleyave.io
-
Snowflake, DataBricks and the Fight for Apache Iceberg Tables https://lnkd.in/gfTxqiYr

SAN FRANCISCO — Last week, Snowflake announced it had adopted Apache Iceberg tables as a native format. Now customers can put their Snowflake data lakes into Iceberg, create external tables on a cloud provider of their choice, and have Snowflake manage them. In addition, Snowflake released Polaris, a catalog for Iceberg tables that can be called by any data processing engine that can read the format (Spark, Dremio, Snowflake). With the catalog, using the engine of your choice, you can do joins across tables, gathering info heretofore much more difficult to obtain. Permissions, governing who can see what, are managed by the catalog itself. And shortly, you will be able to pull in metadata from other catalogs.

The company discussed these interoperability initiatives during its own user conference, the Snowflake Data Cloud Summit, held last week in San Francisco. But the company was not alone in its eager adoption of Iceberg. Also last week, chief Snowflake rival Databricks announced it had purchased Tabular, a company offering an Iceberg distribution, founded by Ryan Blue and Daniel Weeks, who created the technology, together with Jason Reid.

How did Apache Iceberg become the Belle of the Ball? Clearly, data lakes and data lakehouses are about to undergo a fundamental shift to open source.

Apache Iceberg Came from Netflix

Iceberg grew from the frustrations of Netflix engineers trying to scale their data operations, with existing file formats proving unreliable in distributed scenarios. Netflix open sourced the project in 2018 and donated it to the Apache Software Foundation. Since then, AirBnB, Amazon Web Services, Alibaba, Expedia, and others have contributed.

The advantage that Iceberg brings is that it allows data to be stored once — eliminating a whole mess of compliance and security issues around having data copies in multiple places — and queried by any one of a number of Iceberg-compliant engines. A large number of Iceberg distributions are available these days, from Celerdata, Clickhouse, Cloudera, Dremio, Starburst, and of course Tabular. Earlier this month, Microsoft announced that it would support Snowflake's Iceberg tables on its own Microsoft Fabric, an analytics service on Azure.

Customers are very, very sensitive about lock-in these days, said Ron Ortloff, Snowflake's senior product manager. "I think in this space, we have a classic customer who wants control of their solution," he said in an interview with The New Stack. "So we want to give those customers a choice."

Snowflake has traditionally been a company that manages a client's data from the cloud, relieving the customer of the considerable burden of managing it themselves. So why risk th...
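For context on the "any engine can call it" point: Polaris implements the Iceberg REST catalog API, so attaching an engine mostly means pointing an Iceberg REST catalog at it. A hedged Spark sketch; the endpoint URI, credential, warehouse, and table names are placeholders, and the option names follow the generic Iceberg REST catalog settings rather than anything Polaris-specific.

```python
# Hedged sketch: attaching Spark to an Iceberg REST catalog such as Polaris.
# Requires the iceberg-spark-runtime jar; all endpoint/credential/table values are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("polaris-rest-catalog")
    .config("spark.sql.catalog.polaris", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.polaris.type", "rest")
    .config("spark.sql.catalog.polaris.uri", "https://polaris.example.com/api/catalog")  # placeholder endpoint
    .config("spark.sql.catalog.polaris.credential", "client-id:client-secret")       # placeholder credential
    .config("spark.sql.catalog.polaris.warehouse", "analytics")                       # placeholder warehouse
    .getOrCreate()
)

# Any engine that talks to the same catalog sees the same tables, so cross-table joins
# no longer require copying data between platforms. Table names are illustrative.
spark.sql("""
    SELECT c.customer_id, SUM(o.amount) AS total_spend
    FROM polaris.sales.orders o
    JOIN polaris.sales.customers c ON o.customer_id = c.customer_id
    GROUP BY c.customer_id
""").show()
```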
-
🚀 Discover the potential of Apache Iceberg with Abhishek Sharma, Data Engineer at Velotio, in his latest blog! Learn how this powerful open-source table format revolutionizes large-scale data management. 🔍 Explore key features, advantages over Hive, and get a step-by-step guide to setting up Iceberg with Docker. The blog also highlights its advanced capabilities, including schema and partition evolution, hidden partitioning, and ACID compliance. 👉 Read the full blog here: https://bit.ly/4aBQ8Um #pyspark #aws #iceberg #datalake #upserts #metastore #local
Iceberg - Introduction and Setup (Part - 1)
velotio.com
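Not from the blog itself, but a small sketch of two of the capabilities called out above, schema evolution and partition evolution, as they look from Spark SQL; the catalog, warehouse path, and table names are made up for illustration.

```python
# Hedged sketch of Iceberg schema evolution and partition evolution from Spark SQL.
# Assumes the Iceberg runtime and SQL extensions are on the classpath; the catalog "demo"
# uses a local Hadoop warehouse purely for illustration.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-evolution-demo")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.events (
        event_id BIGINT, event_ts TIMESTAMP, country STRING
    ) USING iceberg
    PARTITIONED BY (days(event_ts))  -- hidden partitioning: queries filter on event_ts, not on a partition column
""")

# Schema evolution: a metadata-only change, no rewrite of existing data files.
spark.sql("ALTER TABLE demo.db.events ADD COLUMN device STRING")

# Partition evolution: new data uses the new spec, old files keep the old one.
spark.sql("ALTER TABLE demo.db.events ADD PARTITION FIELD country")

spark.sql("SELECT * FROM demo.db.events WHERE event_ts >= TIMESTAMP '2024-01-01 00:00:00'").show()
```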
-
Is it a Signal of the End for Databricks Delta Lake? 🚨 Yesterday, Amazon Web Services (AWS) dropped a bombshell announcement that will undoubtedly shake up the data industry. They've introduced 𝐀𝐦𝐚𝐳𝐨𝐧 𝐒3 𝐓𝐚𝐛𝐥𝐞𝐬 🔥, a new service that could change the landscape for data lakes, and possibly mark the demise of Delta Lake as the go-to open-source table format.

𝐒𝐨, 𝐰𝐡𝐚𝐭 𝐞𝐱𝐚𝐜𝐭𝐥𝐲 𝐝𝐢𝐝 𝐀𝐖𝐒 𝐮𝐧𝐯𝐞𝐢𝐥? They've introduced a new type of S3 bucket, the "𝐭𝐚𝐛𝐥𝐞 𝐛𝐮𝐜𝐤𝐞𝐭", designed specifically for optimized Parquet storage and Iceberg querying. Think of it as a "database in a bucket," where every file inside becomes a "table" (hence the name Amazon S3 Tables). This isn't just a simple storage upgrade; it comes with table-level permissions, metadata management, automatic file compaction, and other essential features needed to operationalize a data lake.

𝐖𝐡𝐲 𝐈𝐬 𝐓𝐡𝐢𝐬 𝐒𝐮𝐜𝐡 𝐚 𝐁𝐢𝐠 𝐃𝐞𝐚𝐥? Data lakes and open data formats have been trending for a while now, as companies look to store their data in a single cloud provider's storage and make it accessible across multiple services and query engines. By adding first-class support for Parquet and Iceberg, AWS is setting the stage for this movement to gain even more momentum. S3 Tables is a foundational building block that will likely become the backbone of many data platforms, including competitors like Snowflake and Databricks. If AWS embraces Iceberg so fully, it sends a clear signal to the market about the direction things are headed.

𝐖𝐡𝐚𝐭 𝐀𝐛𝐨𝐮𝐭 𝐃𝐞𝐥𝐭𝐚 𝐋𝐚𝐤𝐞? For those who don't know, Delta Lake is the open-source table format championed by Databricks, an alternative to Iceberg. Earlier this year, the debate between Iceberg and Delta Lake heated up, as both claimed to be the best choice for modern data lakes. Now, with AWS, the largest cloud provider, going all-in on Iceberg, it's hard to ignore the message: 𝐈𝐜𝐞𝐛𝐞𝐫𝐠 𝐡𝐚𝐬 𝐰𝐨𝐧. When a giant like AWS backs a format so strongly, it's a major signal to the industry.

So, when it comes to choosing the long-term data lake file format for your company, would you still bet on Delta Lake or Iceberg?

Follow Yash Bhawsar for more insights, tips, and resources to help you on your data engineering journey. 🚀 Let's build the future with data! 🌟 Keep Learning 😊

#AWS #S3Tables #Iceberg #DataLake #Databricks #DeltaLake #CloudComputing #DataEngineering
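For a feel of the "database in a bucket" idea, here is a hedged boto3 sketch of creating a table bucket, a namespace, and an Iceberg table. The s3tables client operations and parameter names below are my best recollection of the new API and should be treated as assumptions to verify against the boto3 / S3 Tables documentation.

```python
# Hedged sketch: creating a table bucket, a namespace, and an Iceberg table with boto3.
# The "s3tables" client and the parameter names are assumptions -- confirm against the docs.
import boto3

s3tables = boto3.client("s3tables", region_name="us-east-1")

# 1. The new bucket type: a table bucket ("database in a bucket"). Name is a placeholder.
bucket = s3tables.create_table_bucket(name="analytics-tables")
bucket_arn = bucket["arn"]

# 2. Namespaces group tables, much like schemas in a database.
s3tables.create_namespace(tableBucketARN=bucket_arn, namespace=["web"])

# 3. Tables are subresources of the bucket; Iceberg is the supported format,
#    and S3 handles compaction and snapshot cleanup behind the scenes.
s3tables.create_table(
    tableBucketARN=bucket_arn,
    namespace="web",
    name="page_views",
    format="ICEBERG",
)

print(s3tables.list_tables(tableBucketARN=bucket_arn, namespace="web"))
```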
-
ONE SUBSTACK A DAY: In this series, I'll be introducing one good Substack article per day - after I've personally read them. Today's Substack is about Snowflake vs Databricks, written by the amazing Tech Funds #substack, whose work is simply magical: https://lnkd.in/gDSEy4-3
AI-generated summary (200 words):
- Snowflake vs Databricks highlights the transition to cloud-based data scaling, free from the limitations of traditional systems.
- The article examines Snowflake's response to competition from major cloud providers and its strategy to capitalize on data sharing network effects.
- Snowflake's acquisition of Streamlit aims to boost its machine learning offerings within its ecosystem.
- Fiscal year 2025 projections indicate a slowdown in revenue growth to 22%, with a reduction in gross margins to 76%.
- The company's cautious forecast reflects adjustments to changing consumption patterns, aiming for a $10 billion product revenue target.
- Innovations include Cortex for Large Language Model (LLM) capabilities and Unistore, to evolve its data warehouse into a comprehensive database solution, despite potential latency issues compared to traditional SQL databases.
- The article posits optimism regarding Snowflake's growth rebound, supported by stronger performance obligations and a return to typical customer optimization levels.
- The comparison of SQL and NoSQL databases outlines the challenges of scaling and complex queries, positioning data warehouses as a solution for analytics.
Snowflake vs Databricks, the birth of unlimited and decoupled data scaling in the cloud, and Snowflake's outlook
techfund.one