How I’d Learn Apache Iceberg (if I Had To Start Over)

I first heard about Apache Iceberg in 2022. Back then, I didn’t quite understand the concept behind table formats, and I wasn’t in a position to appreciate Iceberg’s value proposition. I initially dismissed it as just another passing tech fad, similar to the Analytics Engineering wave that was sweeping through the industry. Still, I skimmed through some articles, picked up the basics, and filed them away in the back of my mind.

Fast forward to 2024, and I see Iceberg is everywhere. Major cloud providers and data platform vendors—including Google, Confluent, and Snowflake—have bundled Apache Iceberg support into their managed service offerings, making it an essential skill in every data professional's toolkit, whether you like it or not.

So I decided to revisit Iceberg fundamentals for a deeper understanding. This time, I'm taking an immersive approach to apply these skills in my future work. I've created a comprehensive 7-week study plan that balances theoretical concepts with hands-on practice. Though I'm still working through it, I wanted to share my learning roadmap—both to help others grasp the basics and to gather feedback for improvements.

My study is based on two books: Apache Iceberg: The Definitive Guide and Database Internals. While many good Iceberg resources exist online, I chose the Iceberg book as my primary source of truth and the guiding light. I regularly reference the Database Internals book to reinforce my understanding, as I view Iceberg and lakehouses as evolved, exploded versions of traditional databases. I will write more on that in a separate post. For now, let’s focus on Iceberg.

I will use these two books side by side during the study.

Week 1: Understanding the problem context

I will spend the first week studying what led to the creation of Apache Iceberg. By reading articles and books and watching videos, I'll build a mental model of Iceberg and understand why it exists.

  • Understand the strengths and weaknesses of traditional data management architectures—data warehouses and data lakes.

  • Pick an imaginary use case of storing Parquet files in a data lake and running analytics on them. Try to understand the challenges I’d face, such as the lack of ACID guarantees and performance issues.

  • Learn the fundamentals of the data lakehouse architecture and how it addresses the shortcomings of both data warehouses and data lakes.

  • Understand the concept behind open table formats, the role they play in a lakehouse, and how Iceberg fits into it.

Week 2: What is Iceberg?

I will spend the second week trying to understand the architecture of Apache Iceberg—what it is made of and how it works. A sketch of a table's on-disk layout follows the list below.

  • Understand what Iceberg is and isn't

  • Study the key components of the metadata layer, focusing on metadata, snapshots, and manifest files

  • Explore Iceberg's core features like partitioning, schema evolution, record-level operations, and time travel queries at a basic level

  • Understand Iceberg catalogs: their purpose and various implementations
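
To make the metadata layer concrete, here is roughly what a file-system-backed Iceberg table looks like on disk (the table name is a placeholder, and exact file names vary by catalog and Iceberg version):

    warehouse/nyc/taxis/
        metadata/
            v1.metadata.json                  <- table metadata: schema, partition spec, snapshot log
            snap-<snapshot-id>-1-<uuid>.avro  <- manifest list: one per snapshot, points to manifests
            <uuid>-m0.avro                    <- manifest file: tracks data files and their column stats
        data/
            <partition-dirs>/<uuid>.parquet   <- the actual data files

A read walks this chain top down: the catalog points to the current metadata file, which points to a snapshot's manifest list, which points to manifests, which point to the data files.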

Week 3: Getting hands-on

The third week is all about putting everything I’ve learned so far into practice. I will set up a local Iceberg environment where I can experiment with basic table-level operations; a minimal code sketch follows the list below.

  • Set up PyIceberg locally, using the local file system as the storage layer.

  • Write Python code to create an Iceberg table using PyIceberg’s SQL catalog (the equivalent of the JDBC catalog, backed by a local SQLite database).

  • Write some records to the table and read them back to trace the write and read paths. Observe the physical location where Iceberg stores the table’s data and metadata files.
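
Here is a minimal sketch of what this could look like, assuming PyIceberg installed with something like pip install "pyiceberg[sql-sqlite,pyarrow]"; the namespace, table name, and paths are placeholders of my own:

    import os
    import pyarrow as pa
    from pyiceberg.catalog.sql import SqlCatalog

    # SQL catalog backed by a local SQLite file; table data and metadata land in the warehouse dir.
    os.makedirs("/tmp/iceberg/warehouse", exist_ok=True)
    catalog = SqlCatalog(
        "local",
        uri="sqlite:////tmp/iceberg/catalog.db",
        warehouse="file:///tmp/iceberg/warehouse",
    )
    catalog.create_namespace("demo")

    # Create a table from a PyArrow schema and append a few records.
    rows = pa.Table.from_pylist([
        {"id": 1, "name": "alice"},
        {"id": 2, "name": "bob"},
    ])
    table = catalog.create_table("demo.people", schema=rows.schema)
    table.append(rows)

    # Read the records back and find where the current metadata file lives on disk.
    print(table.scan().to_arrow())
    print(table.metadata_location)

After each append, it is worth listing the table's metadata/ directory to see the new metadata file, manifest list, and manifest that the commit produced.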

Week 4: Working with Apache Spark, partitioning, and time travel

I will dedicate week 4 to exploring how query engines work with core Iceberg features, starting with Apache Spark. A rough code sketch of these steps follows the list below.

  • Download the NYC taxi dataset.

  • Set up an Apache Spark cluster with Docker Compose, and configure it to work with the Hadoop catalog.

  • Create an Iceberg table with Spark and ingest the taxi dataset into it.

  • Experiment with basic read and write queries on the table.

  • Learn about partition evolution and hidden partitioning. Try them on the table created above.

  • Get familiar with time travel queries.
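
Here is a rough PySpark sketch of those steps, assuming a local file-system ("hadoop") catalog; the runtime package version, warehouse path, dataset path, and column mapping are placeholders that depend on the actual Spark build and download:

    from pyspark.sql import SparkSession

    # Spark session with the Iceberg SQL extensions and a file-system ("hadoop") catalog named "local".
    spark = (
        SparkSession.builder
        .config("spark.jars.packages",
                "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1")  # match your Spark version
        .config("spark.sql.extensions",
                "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.local.type", "hadoop")
        .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
        .getOrCreate()
    )

    # Create a taxi table with hidden partitioning: the days() transform is derived from pickup_ts,
    # so queries filtering on pickup_ts prune files without referencing a partition column.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS local.nyc.taxis (
            vendor_id BIGINT,
            pickup_ts TIMESTAMP,
            fare      DOUBLE
        ) USING iceberg
        PARTITIONED BY (days(pickup_ts))
    """)

    # Ingest the downloaded taxi files (path and column mapping are placeholders for the real dataset).
    (spark.read.parquet("/data/nyc_taxi/*.parquet")
        .selectExpr("CAST(VendorID AS BIGINT) AS vendor_id",
                    "tpep_pickup_datetime AS pickup_ts",
                    "CAST(fare_amount AS DOUBLE) AS fare")
        .writeTo("local.nyc.taxis").append())

    # Partition evolution: add a partition field; existing data files keep the old spec.
    spark.sql("ALTER TABLE local.nyc.taxis ADD PARTITION FIELD vendor_id")

    # Time travel: list snapshots, then query the table as of a snapshot id or timestamp.
    spark.sql("SELECT snapshot_id, committed_at FROM local.nyc.taxis.snapshots").show()
    spark.sql("SELECT COUNT(*) FROM local.nyc.taxis "
              "TIMESTAMP AS OF '2025-01-01 00:00:00'").show()  # use a time after your first commit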

Week 5: Record-level operations and version control for tables

In the fifth week, I will further explore the core Iceberg features using a different query engine and catalog: Dremio and Nessie. A Spark SQL sketch of the same concepts follows the list below.

  • Set up Dremio, Nessie, and Minio with Docker Compose. I will follow this excellent tutorial by Alex Merced from Dremio.

  • Experiment with row-level updates on tables, including the copy-on-write (COW) and merge-on-read (MOR) strategies.

  • Try out version control on Iceberg tables, including snapshot rollback, branching, and tagging.
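
The hands-on work will happen in Dremio and Nessie, but the underlying knobs are plain Iceberg, so here is a hedged Spark SQL sketch of the same ideas, reusing the placeholder session and table from week 4: switching a table between copy-on-write and merge-on-read, and using Iceberg's table-level branches and tags. (Nessie adds catalog-level branches spanning many tables, which is not shown here.)

    # Switch the table to merge-on-read for row-level operations (needs format v2).
    spark.sql("""
        ALTER TABLE local.nyc.taxis SET TBLPROPERTIES (
            'format-version'    = '2',
            'write.delete.mode' = 'merge-on-read',
            'write.update.mode' = 'merge-on-read',
            'write.merge.mode'  = 'merge-on-read'
        )
    """)

    # Row-level operations: MOR writes delete files instead of rewriting whole data files (COW).
    spark.sql("UPDATE local.nyc.taxis SET fare = fare + 1.0 WHERE vendor_id = 1")
    spark.sql("DELETE FROM local.nyc.taxis WHERE fare < 0")

    # Table-level version control: branches and tags are named references to snapshots.
    spark.sql("ALTER TABLE local.nyc.taxis CREATE BRANCH experiments")
    spark.sql("ALTER TABLE local.nyc.taxis CREATE TAG end_of_week_5")

    # Rollback: pick a real snapshot id from the snapshots metadata table (this one is a placeholder).
    spark.sql("CALL local.system.rollback_to_snapshot('nyc.taxis', 1234567890123456789)")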

Week 6: Streaming with Apache Flink and schema evolution

Now that I understand Iceberg’s core capabilities, it’s time to explore how Iceberg bridges batch and real-time processing. I will experiment with Apache Flink; a rough sketch of the Kafka-to-Iceberg pipeline follows the list below.

  • Set up a Docker Compose project with Apache Kafka and Apache Flink clusters, plus a Hive catalog.

  • Experiment with basic batch operations with Flink.

  • Configure an Iceberg sink connector to append/upsert data from a Kafka topic into an Iceberg table.

  • Change the schema of the events in the Kafka topic to see how Iceberg supports schema evolution.
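
The Iceberg sink step is easier to reason about once written down. Here is a hedged PyFlink sketch, assuming the Kafka connector, Hive, and iceberg-flink-runtime jars are already on the Flink classpath; the topic name, schema, metastore URI, and warehouse path are placeholders of my own:

    from pyflink.table import EnvironmentSettings, TableEnvironment

    t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

    # Source: a JSON topic read from Kafka (registered in Flink's default catalog).
    t_env.execute_sql("""
        CREATE TABLE orders_kafka (
            order_id BIGINT,
            amount   DOUBLE,
            ts       TIMESTAMP(3)
        ) WITH (
            'connector' = 'kafka',
            'topic' = 'orders',
            'properties.bootstrap.servers' = 'kafka:9092',
            'scan.startup.mode' = 'earliest-offset',
            'format' = 'json'
        )
    """)

    # Iceberg catalog backed by the Hive Metastore from the Docker Compose setup.
    t_env.execute_sql("""
        CREATE CATALOG iceberg_catalog WITH (
            'type' = 'iceberg',
            'catalog-type' = 'hive',
            'uri' = 'thrift://hive-metastore:9083',
            'warehouse' = 'file:///tmp/warehouse'
        )
    """)
    t_env.execute_sql("CREATE DATABASE IF NOT EXISTS iceberg_catalog.db")

    # Sink: a v2 Iceberg table; upsert mode needs format version 2 and a primary key.
    t_env.execute_sql("""
        CREATE TABLE IF NOT EXISTS iceberg_catalog.db.orders (
            order_id BIGINT,
            amount   DOUBLE,
            ts       TIMESTAMP(3),
            PRIMARY KEY (order_id) NOT ENFORCED
        ) WITH (
            'format-version' = '2',
            'write.upsert.enabled' = 'true'
        )
    """)

    # Submits a continuous streaming job that appends/upserts Kafka records into Iceberg.
    t_env.execute_sql("INSERT INTO iceberg_catalog.db.orders SELECT * FROM orders_kafka")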

Week 7: Advanced concepts

I will wrap up my study in week 7, focusing on advanced Iceberg concepts. This includes reading up on the following topics (a small maintenance sketch follows the list):

  • How table compaction works

  • How data governance works in Iceberg, including data security controls at both storage and catalog levels

  • How Iceberg provides ACID guarantees

  • How Iceberg compares to the other major open table formats, Delta Lake and Apache Hudi
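
For the compaction item, it helps to see that compaction is a maintenance action you trigger yourself rather than something automatic. A hedged sketch using Iceberg's Spark maintenance procedures, reusing the placeholder local catalog and nyc.taxis table from week 4 (the cutoff timestamp is also a placeholder):

    # Bin-pack small data files into larger ones (the default rewrite strategy).
    spark.sql("CALL local.system.rewrite_data_files(table => 'nyc.taxis')")

    # Related maintenance: compact metadata and drop old snapshots to reclaim unreferenced files.
    spark.sql("CALL local.system.rewrite_manifests('nyc.taxis')")
    spark.sql("""
        CALL local.system.expire_snapshots(
            table => 'nyc.taxis',
            older_than => TIMESTAMP '2024-12-01 00:00:00'
        )
    """)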

Even after completing this 7-week schedule, I won’t feel fully confident until I apply this knowledge practically. Therefore, at the end, I plan to build a real-world data lakehouse project that incorporates batch and real-time data processing, a BI dashboard, and a machine learning use case.

I hope this learning plan is helpful. If you're an expert in this field, I welcome your feedback on any topics I may have missed.

Dipankar Mazumdar, M.Sc 🥑

Staff Data Engineer Advocate @Onehouse.ai | Apache Hudi, Iceberg Contributor | Author of "Engineering Lakehouses"

Thanks for including my blog. I am also glad to see that you are including Database internals in the mix. It just adds so much context, overall (big fan!). Pretty much this is what I am trying to do with my upcoming book (lakehouse internals + applications = Lakehouse engineering).

Alex Merced

Co-Author of “Apache Iceberg: The Definitive Guide” | Senior Tech Evangelist at Dremio | LinkedIn Learning Instructor | Tech Content Creator

If people like my tutorial that you mention in the article they can find a lot more on this list -> https://datalakehousehub.com/blog/2024-10-ultimate-directory-of-apache-iceberg-resources/
