Pritam Saha’s Post

Serving Notice Period | Azure Data Engineer @ TCS | Tech YouTuber | Pyspark | Azure Databricks | Azure Data Factory | Azure Synapse | dbt | SQL | Python | DSA | Kafka | Data Modeling | 4X MS Azure Certified

𝐒𝐩𝐚𝐫𝐤 𝗜𝗻𝘁𝗲𝗿𝘃𝗶𝗲𝘄 𝗤𝘂𝗲𝘀𝘁𝗶𝗼𝗻 🚀

🤵 𝐈𝐧𝐭𝐞𝐫𝐯𝐢𝐞𝐰𝐞𝐫: You’re reading a 10 GB file in Spark. How does Spark decide the number of partitions and tasks?

𝐌𝐞: Assuming a 128 MB split size (the default block/split size for file reads), we get 80 partitions (10 GB / 128 MB = 80). Each partition corresponds to one task, so the read stage runs 80 tasks, one per partition.

🤵 𝐈𝐧𝐭𝐞𝐫𝐯𝐢𝐞𝐰𝐞𝐫: Suppose spark.default.parallelism and spark.sql.shuffle.partitions are both set to 200. What happens next, especially when transformations are applied?

𝐌𝐞: Here's how it works:
🔸 spark.default.parallelism applies to the RDD API: it sets the default partition count for operations like sc.parallelize and RDD shuffles when none is specified. Since we're reading from a file, the initial partitioning comes from the file splits, which gave us 80 partitions.
🔸 When we reach a shuffle operation (like groupBy() or join()), Spark SQL uses spark.sql.shuffle.partitions, which is set to 200. After the shuffle, the data is redistributed into 200 partitions, so the next stage runs 200 tasks.

🤵 𝐈𝐧𝐭𝐞𝐫𝐯𝐢𝐞𝐰𝐞𝐫: Let’s say you apply 20 transformations, and some of them are shuffle operations. How many stages, and how many tasks per stage, will Spark create?

𝐌𝐞: If we have, say, 3 shuffle operations, Spark divides the job into 4 stages, because each shuffle is a stage boundary and the narrow transformations in between are pipelined within a stage. The first stage processes the 80 read partitions with 80 tasks; each subsequent stage (after a shuffle) uses 200 partitions and therefore runs 200 tasks.

🤵 𝐈𝐧𝐭𝐞𝐫𝐯𝐢𝐞𝐰𝐞𝐫: So, can the number of partitions ever change while reading a file?

𝐌𝐞: The read partition count is fixed by the configuration in effect at read time (split size, number of files), so it doesn't change during the read itself. Once the data is loaded, however, we can change it explicitly with .repartition() (full shuffle, up or down) or .coalesce() (merges partitions without a full shuffle).

🤵 𝐈𝐧𝐭𝐞𝐫𝐯𝐢𝐞𝐰𝐞𝐫: Sounds like you’ve got it. Any advice for optimizing partitions and tasks in Spark?

𝐌𝐞: Absolutely! The key is to tune spark.sql.shuffle.partitions (and spark.default.parallelism for RDD jobs) based on your data size and cluster resources, as in the sketches below. Under-partitioning (too few, very large partitions) limits parallelism and can cause memory pressure, spills, and long-running straggler tasks. Over-partitioning (too many tiny partitions) adds scheduling and task-management overhead that can slow the job down.

✅ Follow Pritam Saha for more Data Engineering posts.
******************
𝗕𝗶𝗴𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿 𝗜𝗻𝘁𝗲𝗿𝘃𝗶𝗲𝘄 𝗖𝗼𝗻𝗰𝗲𝗽𝘁𝘀: https://lnkd.in/gYFZacjF
******************
#Spark #pyspark #dataengineering #dataengineer #interviewquestions
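
A minimal PySpark sketch of the walkthrough above. The session settings, file path, and the "country" column are hypothetical, and the counts in the comments assume default split settings:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical session; shuffle partitions pinned to 200 to mirror the dialogue above.
spark = (
    SparkSession.builder
    .appName("partition-walkthrough")
    .config("spark.sql.shuffle.partitions", "200")
    .getOrCreate()
)

# Reading a ~10 GB file: Spark splits it by spark.sql.files.maxPartitionBytes
# (128 MB by default), so we expect roughly 10 GB / 128 MB = 80 input partitions.
df = spark.read.csv("/data/events_10gb.csv", header=True)  # hypothetical path
print("read partitions:", df.rdd.getNumPartitions())       # ~80 with default settings

# A wide transformation (groupBy) forces a shuffle; the post-shuffle stage uses
# spark.sql.shuffle.partitions tasks. Note: with AQE enabled (default in Spark 3.x),
# Spark may coalesce these 200 partitions into fewer at runtime.
agg = df.groupBy("country").agg(F.count("*").alias("cnt"))  # "country" is a made-up column
print("post-shuffle partitions:", agg.rdd.getNumPartitions())

# Once the data is loaded, the layout can be changed explicitly:
wider = df.repartition(400)   # full shuffle, can increase or decrease partitions
narrower = df.coalesce(40)    # merges partitions without a full shuffle (decrease only)
```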

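To put rough numbers on the tuning advice, here is a sizing sketch; the data volume, cluster shape, and the "~128 MB per partition, 2-3 tasks per core" rules of thumb are illustrative assumptions rather than Spark defaults:

```python
# Rule-of-thumb sizing sketch (the numbers are illustrative assumptions, not defaults):
# aim for shuffle partitions of roughly 100-200 MB each, and at least 2-3 tasks per core.

shuffle_data_gb = 10                   # assumed volume of data hitting the shuffle
target_partition_mb = 128              # assumed target size per shuffle partition
executors, cores_per_executor = 10, 4  # hypothetical cluster: 40 cores total

by_size = (shuffle_data_gb * 1024) // target_partition_mb  # 10240 / 128 = 80
by_cores = executors * cores_per_executor * 3              # 40 cores * 3 waves = 120

suggested = max(by_size, by_cores)
print(suggested)  # 120 -> e.g. spark.conf.set("spark.sql.shuffle.partitions", "120")
```

Whichever bound is larger usually wins: large shuffles end up sized by data volume, small ones by keeping every core busy for a few task waves.
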
Madhavi Bollimuntha

Data Engineer @ TCS | Spark | Scala | Azure

3mo

Can you give an example of fine-tuning spark.default.parallelism without under-partitioning or over-partitioning? What should the range be? Please give an example with numbers.
