𝐒𝐩𝐚𝐫𝐤 𝗜𝗻𝘁𝗲𝗿𝘃𝗶𝗲𝘄 𝗤𝘂𝗲𝘀𝘁𝗶𝗼𝗻 🚀

🤵 𝐈𝐧𝐭𝐞𝐫𝐯𝐢𝐞𝐰𝐞𝐫: You’re reading a 10 GB file in Spark. How does Spark decide the number of partitions and tasks?

𝐌𝐞: Assuming a block/split size of 128 MB, we get 80 partitions (10 GB / 128 MB = 80). Each partition corresponds to one task, so the read stage runs 80 tasks, one per partition.

🤵 𝐈𝐧𝐭𝐞𝐫𝐯𝐢𝐞𝐰𝐞𝐫: Suppose spark.default.parallelism and spark.sql.shuffle.partitions are both set to 200. What happens next, especially when transformations are applied?

𝐌𝐞: Here's how it works:
🔸 spark.default.parallelism sets the default number of partitions for RDD operations — e.g., parallelized collections and RDD shuffles — particularly for non-file sources (Kafka, RDD creation). Since we're reading from a file, the initial partition count comes from the file's split size, which gave us 80 partitions.
🔸 When we reach a shuffle operation (like groupBy() or join()) on a DataFrame, Spark uses spark.sql.shuffle.partitions, which is set to 200. After the shuffle, the data is redistributed into 200 partitions, creating 200 tasks for the next stage.

🤵 𝐈𝐧𝐭𝐞𝐫𝐯𝐢𝐞𝐰𝐞𝐫: Let’s say you apply 20 transformations, and some of them are shuffle operations. How many stages, and how many tasks per stage, will Spark create?

𝐌𝐞: If we have, say, 3 shuffle operations, Spark divides the job into 4 stages (the shuffles are the stage boundaries). The first stage processes the 80 read partitions with 80 tasks. Each subsequent stage (after a shuffle) uses 200 partitions and therefore runs 200 tasks.

🤵 𝐈𝐧𝐭𝐞𝐫𝐯𝐢𝐞𝐰𝐞𝐫: So, can the number of partitions ever change while reading a file?

𝐌𝐞: Not during the read itself — the read-time partition count is fixed by the split size (for the DataFrame reader this is governed by spark.sql.files.maxPartitionBytes). Once the data is loaded, we can explicitly change the partitioning with .repartition() or .coalesce().

🤵 𝐈𝐧𝐭𝐞𝐫𝐯𝐢𝐞𝐰𝐞𝐫: Sounds like you’ve got it. Any advice for optimizing partitions and tasks in Spark?

𝐌𝐞: Absolutely! The key is to tune spark.default.parallelism and spark.sql.shuffle.partitions based on your data size and cluster resources. Under-partitioning (too few partitions) leads to uneven task distribution and skewed workloads, so some tasks take much longer than others. Over-partitioning (too many partitions) creates scheduling overhead, because managing an excessive number of tiny tasks slows the job down.

✅ Follow Pritam Saha for more Data Engineering posts.
******************
𝗕𝗶𝗴𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿 𝗜𝗻𝘁𝗲𝗿𝘃𝗶𝗲𝘄 𝗖𝗼𝗻𝗰𝗲𝗽𝘁𝘀: https://lnkd.in/gYFZacjF
******************
#Spark #pyspark #dataengineering #dataengineer #interviewquestions
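A minimal PySpark sketch of the flow above, under assumptions — the file path, format, and column name are hypothetical, and the ~80-partition figure assumes a ~10 GB input with 128 MB splits:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("partition-count-demo")
    .config("spark.sql.shuffle.partitions", "200")  # post-shuffle partition count
    .getOrCreate()
)

# Read stage: partition count is driven by the split size
# (spark.sql.files.maxPartitionBytes, 128 MB by default),
# so a ~10 GB file yields roughly 80 input partitions.
df = spark.read.parquet("/data/events_10gb.parquet")  # hypothetical path
print("read partitions:", df.rdd.getNumPartitions())  # ~80

# Wide transformation: triggers a shuffle; the next stage uses
# spark.sql.shuffle.partitions (here 200) tasks. With AQE enabled
# (the default in recent Spark versions) small partitions may be coalesced at runtime.
agg = df.groupBy("event_type").agg(F.count("*").alias("cnt"))
print("post-shuffle partitions:", agg.rdd.getNumPartitions())

agg.show()  # action: triggers the job
```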
Jay Jagannath ❤️❤️

Today, let's discuss RDDs in Spark and some potential interview questions shared by Manish Kumar.

Q1. What is an RDD?
- An RDD (Resilient Distributed Dataset) is the fundamental data structure in Spark. It is a distributed collection of objects that allows Spark to handle large-scale data processing across a cluster in a fault-tolerant manner.
- It provides fault tolerance and parallel processing, making it suitable for distributed data processing.

Q2. When do we need an RDD?
We need RDDs in Spark when we want:
1. Fine-grained control: Custom partitioning or complex transformations not possible with DataFrames/Datasets.
2. Unstructured data: Handling raw data like logs or binary files.
3. Functional programming: Using low-level operations like map(), filter(), etc.
4. Fault tolerance: Lineage-based recovery from node failures.
5. Legacy APIs: Using older libraries that depend on RDDs.

Q3. Features of an RDD?
a. Immutability: Once created, RDDs cannot be changed. Every transformation on an RDD creates a new RDD, preserving immutability.
b. Distributed: Data is distributed across multiple nodes in a cluster, enabling parallel processing of large datasets.
c. Fault tolerance: RDDs automatically recover from node failures using lineage information, allowing Spark to recompute lost partitions.
d. Lazy evaluation: Transformations are not executed until an action (like collect() or count()) is called.

Q4. Why shouldn't we use an RDD?
We shouldn't use RDDs in most cases because DataFrames and Datasets offer significant advantages in performance, ease of use, and optimization. Here are the key reasons why RDDs are not recommended for general use:
a. Lack of optimizations - RDDs do not benefit from Spark's query optimizations like Catalyst (the query optimizer), which DataFrames and Datasets use. As a result, RDDs can be much less efficient for many workloads.
b. Higher memory and CPU overhead - Because RDDs work with raw objects, they are less memory-efficient than DataFrames, which store data in a compact binary format.
c. Complexity - The RDD API is lower level and requires more code for simple tasks. For example, filtering or aggregation involves more manual steps with RDDs, whereas DataFrames and Datasets expose higher-level APIs that are easier to use.
d. Less suitable for big data workloads - Big data processing needs performance optimizations that are harder to achieve with RDDs due to the lack of query planning and execution optimization.

In the next post, we'll discuss Jobs, Stages, and Tasks in Spark. Until then, ta-ta! Happy learning!

Munna Das Sahil Chavan

#dataengineering #dataanalytics #businessanalytics #powerbideveloper
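A minimal sketch contrasting the two APIs — the data is made up, and the point is only to show RDD laziness and lineage versus the Catalyst-optimized DataFrame route:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()
sc = spark.sparkContext

# RDD API: low-level and functional; no Catalyst/Tungsten optimization.
rdd = sc.parallelize([("alice", 34), ("bob", 41), ("cathy", 29)])
adults = rdd.filter(lambda kv: kv[1] >= 30)  # transformation: lazy, returns a new RDD
print(adults.count())                        # action: triggers the job -> 2
print(adults.toDebugString())                # lineage Spark uses for fault tolerance

# Same logic with the DataFrame API: declarative and optimized by Catalyst.
df = spark.createDataFrame([("alice", 34), ("bob", 41), ("cathy", 29)], ["name", "age"])
df.filter("age >= 30").show()
```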
𝐌𝐚𝐬𝐭𝐞𝐫𝐢𝐧𝐠 𝐒𝐩𝐚𝐫𝐤 𝐎𝐩𝐭𝐢𝐦𝐢𝐳𝐚𝐭𝐢𝐨𝐧: 𝐑𝐞𝐩𝐚𝐫𝐭𝐢𝐭𝐢𝐨𝐧, 𝐂𝐨𝐚𝐥𝐞𝐬𝐜𝐞, 𝐚𝐧𝐝 𝐒𝐡𝐮𝐟𝐟𝐥𝐞 𝐏𝐚𝐫𝐭𝐢𝐭𝐢𝐨𝐧𝐬

In continuation of my previous posts on optimizing Apache Spark, today I’m diving into another effective lever: managing repartition, coalesce, and shuffle partitions to boost performance.

When working with large datasets in Apache Spark, understanding how to manage partitions efficiently is essential for performance. Let’s break down how repartition, coalesce, and shuffle partitions work under the hood, and why using them wisely saves both time and resources.

𝐑𝐞𝐩𝐚𝐫𝐭𝐢𝐭𝐢𝐨𝐧 𝐯𝐬. 𝐂𝐨𝐚𝐥𝐞𝐬𝐜𝐞
Repartition is used when you need to increase (or otherwise change) the number of partitions. It triggers a full shuffle because the data is redistributed across all the new partitions, making it a computationally expensive operation.
Coalesce, on the other hand, is used to reduce the number of partitions without a full shuffle. It simply merges data from multiple partitions into fewer ones, avoiding the shuffle overhead and making it the more efficient choice when consolidating partitions.

𝐖𝐡𝐞𝐧 𝐚𝐧𝐝 𝐖𝐡𝐲 𝐒𝐡𝐮𝐟𝐟𝐥𝐞 𝐇𝐚𝐩𝐩𝐞𝐧𝐬
Shuffling in Spark is triggered by wide transformations such as `groupBy`, `reduceByKey`, or `join`, where data has to move between partitions. These transformations inherently involve a shuffle, which can be costly, especially on large datasets.

𝐎𝐩𝐭𝐢𝐦𝐢𝐳𝐚𝐭𝐢𝐨𝐧 𝐈𝐧𝐬𝐢𝐠𝐡𝐭: 𝐑𝐞𝐝𝐮𝐜𝐞 𝐭𝐡𝐞 𝐍𝐮𝐦𝐛𝐞𝐫 𝐨𝐟 𝐒𝐡𝐮𝐟𝐟𝐥𝐞𝐬
If you need to repartition your data, do it before a wide transformation. Here’s why: when you set the right number of partitions upfront using `spark.sql.shuffle.partitions`, the shuffle happens once, during the transformation itself. If you repartition or coalesce after a wide transformation, you force Spark to move the data around again, which means additional network I/O, higher resource consumption, and longer execution times.

𝐊𝐞𝐲 𝐓𝐚𝐤𝐞𝐚𝐰𝐚𝐲: 𝐎𝐩𝐭𝐢𝐦𝐢𝐳𝐞 𝐘𝐨𝐮𝐫 𝐒𝐡𝐮𝐟𝐟𝐥𝐞𝐬
When you know a wide transformation is coming, set your shuffle partitions appropriately before it. This minimizes the number of shuffles and gives you the best performance.

#ApacheSpark #DataEngineering #BigData #SparkOptimization #Shuffling #Partitions #WideTransformations
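A short sketch of these ideas under assumptions (the paths and column names are hypothetical). Note that coalesce() can only reduce the partition count, while repartition() can increase it at the cost of a full shuffle:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("repartition-vs-coalesce").getOrCreate()

# Hypothetical input; path and columns are illustrative only.
df = spark.read.parquet("/data/orders.parquet")

# Set the post-shuffle partition count BEFORE the wide transformation,
# so the unavoidable shuffle produces sensibly sized partitions once.
spark.conf.set("spark.sql.shuffle.partitions", "96")
daily = df.groupBy("order_date").agg(F.sum("amount").alias("revenue"))

# repartition(n): full shuffle; can increase partitions and even out skew.
wide = daily.repartition(200)

# coalesce(n): no full shuffle; can only reduce partitions -- useful before
# writing a small result to avoid lots of tiny output files.
compact = daily.coalesce(8)
compact.write.mode("overwrite").parquet("/data/daily_revenue")
```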
𝗗𝗔𝗚, 𝗹𝗶𝗻𝗲𝗮𝗴𝗲, 𝗼𝗽𝗲𝗿𝗮𝘁𝗶𝗼𝗻𝘀, 𝗽𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻𝘀, 𝘀𝘁𝗮𝗴𝗲, 𝘁𝗮𝘀𝗸 (𝗮𝗽𝗮𝗰𝗵𝗲 𝘀𝗽𝗮𝗿𝗸)

𝗹𝗶𝗻𝗲𝗮𝗴𝗲 & 𝗗𝗔𝗚
1) DAG
- represents the logical flow of operations within a Spark job (action)
- breaks the job down into a sequence of stages; each stage is a group of tasks that can run in parallel
- provides the broader view
- visible once an action is called
2) lineage
- represents the sequence of transformations applied to the input data to produce a new RDD, DataFrame (DF), or Dataset (DS)
- keeps track of the transformations to be executed once an action is called
- gives a detailed view of each transformation
- lineage is part of the DAG
- provides fault tolerance

𝗼𝗽𝗲𝗿𝗮𝘁𝗶𝗼𝗻𝘀 𝗶𝗻 𝗮𝗽𝗮𝗰𝗵𝗲 𝘀𝗽𝗮𝗿𝗸
1) 𝘁𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺𝗮𝘁𝗶𝗼𝗻𝘀 (lazy)
- operations on RDDs, DFs, or DSs that produce a new distributed dataset from an existing one
- lazy until an action is called, so Spark can optimize the execution plan
2) 𝗻𝗮𝗿𝗿𝗼𝘄 𝘁𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺𝗮𝘁𝗶𝗼𝗻
- works on the principle of data locality: processes data wherever it already sits
- NO shuffling
- each input partition contributes to only one output partition
- e.g., map, flatMap
3) 𝘄𝗶𝗱𝗲 𝘁𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺𝗮𝘁𝗶𝗼𝗻 (shuffling)
- costly and time-consuming
- a new stage is created when a wide transformation is used
- each input partition can contribute to multiple output partitions
- e.g., join, groupBy
4) 𝗮𝗰𝘁𝗶𝗼𝗻𝘀 (𝗲𝗮𝗴𝗲𝗿)
- operations that trigger the execution of transformations and either return a value to the driver program (as a local variable) or write data to storage
- initiate the computation
- ideally one JOB shows up in the Spark UI per action
- e.g., show, collect, count

𝗯𝗹𝗼𝗰𝗸𝘀, 𝗽𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻𝘀, 𝘁𝗮𝘀𝗸𝘀
1) one-to-one mapping between blocks (distributed on disk) and partitions (distributed in memory) when files are initially loaded
2) one-to-one mapping between a partition and a task
3) number of tasks = number of partitions in the RDD or Structured API (DF, DS)
4) the number of parallel tasks in each executor depends on the number of cores in that executor (one CPU core works on one partition at a time)

𝘀𝘁𝗮𝗴𝗲 & 𝘄𝗶𝗱𝗲 𝘁𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺𝗮𝘁𝗶𝗼𝗻𝘀
1) a job can have multiple stages
2) when a wide transformation is encountered, a new stage is created
3) a stage boundary indicates a shuffle
4) a shuffle sends data from one machine to another and rearranges it between partitions
5) each stage has its own tasks

𝗻𝘂𝗺𝗯𝗲𝗿 𝗼𝗳 𝘁𝗮𝘀𝗸𝘀 for 2 wide transformations & 1 action on a 2 GB file? (see the sketch after this post)
1) using RDDs
- default partition size = 128 MB
- 1 job = 1 action
- 2 wide transformations = 3 stages
- tasks = (file_size / partition_size) * stages = 16 * 3 = 48 tasks
- in most cases the number of partitions does not change after a shuffle
2) using DFs
- default number of shuffle partitions for Structured APIs = 200
- 1 job = 1 action
- 2 wide transformations = 3 stages
- tasks = 16 + 200 + 200 = 416

There can be exceptions to the observations above.

#apachespark #spark #pyspark
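A sketch of the task-count arithmetic and a matching DataFrame job — the file, columns, and sizes are assumptions, and actual counts can differ (for example, AQE may coalesce shuffle partitions at runtime):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stages-and-tasks").getOrCreate()

# Rough task-count arithmetic from the post (illustrative only).
file_size_mb, split_mb, shuffle_parts = 2 * 1024, 128, 200
read_parts = file_size_mb // split_mb                  # 16
rdd_tasks = read_parts * 3                             # 48 (partition count kept across shuffles)
df_tasks = read_parts + shuffle_parts + shuffle_parts  # 416
print(read_parts, rdd_tasks, df_tasks)

# A DataFrame job with 2 wide transformations + 1 action = 1 job, 3 stages.
df = spark.read.csv("/data/sales_2gb.csv", header=True, inferSchema=True)  # hypothetical file
by_region = df.groupBy("region").agg(F.sum("amount").alias("total"))       # wide transformation 1
ranked = by_region.orderBy(F.desc("total"))                                # wide transformation 2
ranked.show()                                                              # action
# Check the Spark UI: 1 job, 3 stages, roughly 16 + 200 + 200 tasks.
```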
🌟✨ Exciting Update: Week 11 of My Big Data Journey! - SPARK OPTIMIZATIONS & PERFORMANCE TUNING ✨🌟

🔶 Memory Management in Apache Spark: This week, I explored memory management in Apache Spark, learning how Spark allocates memory across its components (unified/Spark memory, user memory, reserved memory, and overhead). I also picked up techniques to optimize memory usage for efficient large-scale data processing.

🔶 Sort Aggregate vs Hash Aggregate: I studied the difference between Sort Aggregate and Hash Aggregate. Sort Aggregate performs better with pre-sorted data, while Hash Aggregate is more efficient for unsorted data — useful for optimizing Spark jobs based on data characteristics.

🔶 Various Plans in Apache Spark: Spark builds a sequence of plans to execute a query:
•Parsed Logical Plan: Initial representation of the query before optimization.
•Analyzed Logical Plan: Validates references, ensuring tables and columns exist.
•Optimized Logical Plan: Result of Spark’s Catalyst optimizer applying optimization rules.
•Physical Plan: The final plan actually executed on the cluster.

🔶 Catalyst Optimizer: I deepened my understanding of the Catalyst optimizer, Spark’s query optimization engine. Catalyst applies transformations and optimizations like predicate pushdown and combining multiple filters, making Spark SQL queries faster.

🔶 File Formats & Compression Techniques: I explored key file formats and compression techniques that are vital for data storage and performance optimization:
•Row-based formats: CSV and JSON, where data is stored row by row.
•Column-based formats: Parquet and ORC, offering efficient storage and fast analytical access.

🔶 Specialized File Formats: Formats that boost Spark performance:
•AVRO: A row-based format, ideal for schema evolution.
•ORC: A columnar format, known for its compression and optimization.
•PARQUET: A popular columnar format optimized for performance and storage.

🔶 Schema Evolution: I studied schema evolution, which lets formats like Parquet adapt to changing schemas over time without compromising data integrity.

🔶 Compression Techniques: I explored compression codecs that trade off storage and speed:
•SNAPPY: Fast compression with moderate ratios, ideal for large datasets.
•LZO: Fast decompression, suited for real-time data.
•GZIP: High compression ratio, but slower processing.
•BZIP2: High compression ratio, but slower than the others.

A big thank you to Sumit Mittal Sir and the TrendyTech - Big Data By Sumit Mittal team for their continued guidance and support! 🚀 Excited to apply these optimizations and continue this learning journey.

#BigDataJourney #SparkOptimization #MemoryManagement #CatalystOptimizer #FileFormats #CompressionTechniques #SchemaEvolution #SparkSQL #DataEngineering #TechCommunity #DataScience #PerformanceTuning #PARQUET #SNAPPY #BigData #SparkPerformance #DataEngineeringJourney #TrendyTech #LearningByDoing
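A small illustration of two of the ideas above: explain(True) prints all four plans (parsed, analyzed, optimized logical, physical), and the columnar format plus compression codec are chosen at write time. The data and output path are made up:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("plans-and-formats").getOrCreate()

# Hypothetical data; column names are illustrative.
df = spark.createDataFrame(
    [("IN", 120.0), ("US", 80.0), ("IN", 45.5)], ["country", "amount"]
)

# explain(True) shows the Parsed Logical, Analyzed Logical, Optimized Logical
# (note how Catalyst combines the two filters) and Physical plans.
result = (
    df.filter("amount > 50")
      .filter("country = 'IN'")
      .groupBy("country")
      .agg(F.sum("amount").alias("total"))
)
result.explain(True)

# Columnar format (Parquet) with an explicit compression codec.
result.write.mode("overwrite").option("compression", "snappy").parquet("/tmp/out_parquet")
```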
Executor Memory Management in Spark:

Suppose you ask the resource manager for 4 GB of memory for an executor. The resource manager creates a container of 4 GB plus roughly 10% overhead (about 400 MB), i.e., ~4.4 GB, and the executor runs inside it. Two points to keep in mind:
1) We cannot run anything that needs more than this 4 GB + ~400 MB on the executor; if we try, we hit an out-of-memory error (or the container is killed by the resource manager).
2) The 4 GB is used by the JVM process (the executor heap), while the ~400 MB overhead covers non-JVM usage such as PySpark worker processes and off-heap buffers. Exceeding either limit results in an out-of-memory failure.

So we have a 4 GB JVM heap (spark.executor.memory) and ~400 MB of spark.executor.memoryOverhead (by default, the larger of 384 MB or 10% of executor memory).

The 4 GB JVM heap is divided into three segments:

1) Reserved Memory: set aside by Spark itself and hardcoded to 300 MB. If the executor memory is less than 1.5 times the reserved memory (450 MB), Spark throws an error and asks for a larger heap. With 300 MB reserved, we are left with 4096 MB - 300 MB = 3796 MB.

2) User Memory: used for user-defined functions, data structures, and computations. This segment is not managed by Spark and typically takes 40% of the remaining memory.

3) Spark Memory: managed by Spark itself. It stores intermediate state during task execution (e.g., joins) and broadcast variables, plus all cached/persisted data, and takes 60% of the remaining memory. It is split into two pools:
a) Storage Memory: holds broadcast variables and cached/persisted data.
b) Execution Memory: holds objects created during task execution (shuffles, joins, sorts, aggregations).

Memory calculation for a 4 GB executor:

Overhead (non-JVM) = 4 GB * 10% = ~400 MB
Reserved Memory = 300 MB
Remaining for user + Spark memory = 4096 - 300 = 3796 MB

User Memory = (EXECUTOR_MAIN_MEMORY - RESERVED_MEMORY) * USER_MEMORY_FRACTION
            = (4096 - 300) * 0.4 = 1518.4 MB

Spark Memory = (EXECUTOR_MAIN_MEMORY - RESERVED_MEMORY) * (1 - USER_MEMORY_FRACTION)
             = (4096 - 300) * 0.6 = 2277.6 MB

Storage Memory = Spark Memory * SPARK_MEMORY_STORAGE_FRACTION
               = 2277.6 * 0.5 = 1138.8 MB

Execution Memory = Spark Memory * (1 - SPARK_MEMORY_STORAGE_FRACTION)
                 = 2277.6 * 0.5 = 1138.8 MB

These percentages apply to the default fractions and a 4 GB executor; for a different configuration the numbers change. They are given for understanding purposes only (under unified memory management the storage/execution boundary is also soft — execution can borrow from storage and vice versa).

#spark #pyspark
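A tiny pure-Python helper that reproduces the arithmetic above using the default fractions (spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5). Note the default overhead is max(384 MB, 10% of executor memory), so for 4 GB it is closer to 410 MB than a flat 400 MB:

```python
def executor_memory_breakdown(executor_memory_mb: float,
                              memory_fraction: float = 0.6,   # spark.memory.fraction
                              storage_fraction: float = 0.5,  # spark.memory.storageFraction
                              reserved_mb: float = 300.0,
                              overhead_factor: float = 0.10) -> dict:
    """Approximate memory split for one executor with default Spark fractions."""
    overhead_mb = max(384.0, executor_memory_mb * overhead_factor)  # non-JVM / off-heap
    usable_mb = executor_memory_mb - reserved_mb                    # heap minus reserved 300 MB
    spark_mb = usable_mb * memory_fraction                          # unified storage + execution
    return {
        "container_mb": executor_memory_mb + overhead_mb,
        "overhead_mb": overhead_mb,
        "reserved_mb": reserved_mb,
        "user_mb": usable_mb * (1 - memory_fraction),
        "spark_unified_mb": spark_mb,
        "storage_mb": spark_mb * storage_fraction,        # soft boundary under unified memory
        "execution_mb": spark_mb * (1 - storage_fraction),
    }

print(executor_memory_breakdown(4096))
# roughly: container ~4505.6, overhead ~409.6, user ~1518.4,
#          unified ~2277.6, storage ~1138.8, execution ~1138.8
```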
Delta tables are a crucial feature of Delta Lake, which offers ACID transactions, scalable metadata handling, and unified streaming and batch data processing. Here's a brief overview of Delta tables.

What is Delta Lake?
Delta Lake is a storage layer that brings reliability to data lakes. It operates on top of existing data lakes, providing features for data reliability and consistency.

Key Features of Delta Tables:
1. ACID Transactions: Delta Lake guarantees data integrity with ACID transactions, ensuring that operations are atomic, consistent, isolated, and durable.
2. Scalable Metadata Handling: Unlike traditional data lakes, Delta Lake efficiently manages metadata for large numbers of files and partitions.
3. Schema Evolution: Delta Lake supports schema evolution and enforcement, enabling schemas to be updated and managed as data changes.
4. Time Travel: Delta Lake supports time travel, allowing you to query historical data and roll back to previous versions.
5. Unified Batch and Streaming: Delta Lake lets you process streaming and batch data against the same table, simplifying real-time updates and batch processing.

Basic Operations with Delta Tables:
Here’s how you can work with Delta tables using PySpark:

1. Creating a Delta Table:

```python
from pyspark.sql import SparkSession
from delta.tables import *

spark = SparkSession.builder \
    .appName("DeltaTableExample") \
    .getOrCreate()

# Define the DataFrame
df = spark.createDataFrame([
    (1, "Alice"),
    (2, "Bob"),
    (3, "Cathy")
], ["id", "name"])

# Write the DataFrame to a Delta table
df.write.format("delta").save("/path/to/delta/table")
```

2. Reading from a Delta Table:

```python
# Read the Delta table
delta_df = spark.read.format("delta").load("/path/to/delta/table")
delta_df.show()
```

3. Upserting Data:

```python
# Create a DeltaTable object
delta_table = DeltaTable.forPath(spark, "/path/to/delta/table")

# New data to upsert
new_data = spark.createDataFrame([
    (2, "Bob Updated"),
    (4, "David")
], ["id", "name"])

# Upsert (merge) data
delta_table.alias("old") \
    .merge(
        new_data.alias("new"),
        "old.id = new.id"
    ) \
    .whenMatchedUpdate(set={"name": "new.name"}) \
    .whenNotMatchedInsert(values={"id": "new.id", "name": "new.name"}) \
    .execute()
```

4. Time Travel:

```python
# Read data from a previous version
version_df = spark.read.format("delta").option("versionAsOf", 0).load("/path/to/delta/table")
version_df.show()
```

Delta Lake is fully integrated with Apache Spark, making it a powerful tool for handling big data reliably and efficiently.

Follow Swarnali S.

#DeltaLake #DataEngineering #AuditLogs #DataGovernance #BigData #Spark #DataManagement
𝗔𝗽𝗮𝗰𝗵𝗲 𝗦𝗽𝗮𝗿𝗸 𝗢𝗽𝘁𝗶𝗺𝗶𝘇𝗮𝘁𝗶𝗼𝗻 𝗤&𝗔 - 𝟭𝟱 𝗦𝗽𝗮𝗿𝗸 𝗘𝘀𝘀𝗲𝗻𝘁𝗶𝗮𝗹𝘀 𝗳𝗼𝗿 𝗕𝗶𝗴 𝗗𝗮𝘁𝗮 𝗣𝗿𝗼𝗰𝗲𝘀𝘀𝗶𝗻𝗴:

1. What is Apache Spark? A fast, open-source distributed system for big data processing, popular for its speed and support for advanced analytics.
2. How to improve Spark job performance? Use caching, broadcast variables, optimized shuffles, tuned partitions, and Dynamic Resource Allocation.
3. How does caching work in Spark? Caching stores data in memory for quick reuse — ideal for iterative tasks or repeated access to the same dataset.
4. Difference between cache() and persist()? cache() is shorthand for persist() with the default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames), while persist() lets you choose the storage level.
5. What is a shuffle operation, and how to reduce its impact? A shuffle redistributes data across nodes; reduce its cost by optimizing transformations and partition counts.
6. How to optimize Spark SQL queries? Cache reused results, use the DataFrame/Dataset API, partition tables, and tune shuffle partitions.
7. What is the Catalyst optimizer in Spark SQL? It automates query optimizations like predicate pushdown and join reordering for efficient execution.
8. What is the Tungsten project in Spark? Tungsten enhances performance by optimizing memory and CPU usage through cache-aware computation.
9. How to tune the number of partitions? Aim for roughly 2-3 times the total core count, or adjust based on data size and workload.
10. What are broadcast variables, and when to use them? Broadcast variables distribute read-only data to executors efficiently — useful when joining with small datasets.
11. How to handle data skew in Spark? Mitigate skew with salting, custom partitioning, and skew-aware join strategies.
12. What is Dynamic Resource Allocation? It adjusts the executor count based on workload, optimizing resource usage and scaling down when idle.
13. How to monitor Spark jobs? Use the Spark UI, logs, and metrics tools like Prometheus for real-time insight into job performance.
14. Benefits of using Parquet in Spark? Parquet's columnar format boosts query performance through compression, efficient storage, and schema support.
15. How to manage memory in Spark? Avoid OutOfMemory errors by tuning memory settings, managing shuffle sizes, and ensuring adequate resources.

A short PySpark sketch of a few of these ideas (caching, broadcast joins, shuffle partitions) follows this post.

Like and Save ✔
🤝 Join my 2K+ Data Engineering Community: https://lnkd.in/gy4R55Tj
📌 Follow Abhisek Sahu for a regular curated feed of Data Engineering insights and valuable content! Repost ✅ if you find it useful
#dataengineering #apachespark #databricks #sparkoptimization #pyspark
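A minimal sketch of a few of the answers above (caching, a broadcast join, and tuned shuffle partitions); the paths, table names, and columns are assumptions for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import broadcast

spark = (
    SparkSession.builder
    .appName("optimization-basics")
    .config("spark.sql.shuffle.partitions", "128")  # tune per data size / core count
    .getOrCreate()
)

# Hypothetical datasets: a large fact table and a small dimension table.
orders = spark.read.parquet("/data/orders")        # large
countries = spark.read.parquet("/data/countries")  # small lookup

# Caching: keep a reused intermediate result in memory.
recent = orders.filter(F.col("order_date") >= "2024-01-01").cache()
recent.count()  # action that materializes the cache

# Broadcast join: ship the small table to every executor,
# avoiding a shuffle of the large side.
joined = recent.join(broadcast(countries), on="country_code", how="left")

# Columnar output (Parquet) for efficient downstream reads.
joined.write.mode("overwrite").parquet("/data/orders_enriched")
```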
In the previous post, we explored Data Immutability in Spark and how it works hand in hand with Lazy Evaluation to enhance performance. In this post, we'll explore the concept of 𝗟𝗮𝘇𝘆 Evaluation in Apache Spark and why it's important.

First, we will discuss what Lazy Evaluation is, how it works, and why it matters; finally, we will demonstrate it with code.

What is Lazy Evaluation
- Lazy Evaluation means that Spark waits until the very last moment to execute the graph of computation instructions.
- When you specify a transformation, Spark doesn't execute it immediately. Instead, it builds up a plan of all the transformations to be applied.

How Lazy Evaluation Works
- Instead of modifying the data immediately when you specify an operation, you build up a plan of transformations that you would like to apply to your source data.
- Spark only performs the computation when an Action is called. Actions such as count, collect, or save tell Spark that we actually need the output of our transformations.

Why Lazy Evaluation is Important
- Lazy evaluation allows Spark to optimize the entire data processing pipeline.
- By waiting until the last moment to execute the code, Spark compiles the plan from your raw DataFrame transformations into a streamlined physical plan that runs as efficiently as possible across the cluster.
- This provides immense benefits because Spark can optimize the data flow from end to end.

Let's demonstrate the concept of Lazy Evaluation with a simple code example in PySpark.

First, we create a simple DataFrame:

df = spark.range(1000).toDF("number")

Perform simple transformations on the DataFrame:

df = df.select(df["number"] + 10).withColumnRenamed("(number + 10)", "number")
df = df.select(df["number"] - 50).withColumnRenamed("(number - 50)", "number")
df = df.select(df["number"] + 95).withColumnRenamed("(number + 95)", "number")

Now look at the DAG for the execution plan:

df.explain()

== Physical Plan ==
*(1) Project [(((id#0L + 10) - 50) + 95) AS number#14L]
+- *(1) Range (0, 1000, step=1, splits=8)

All the transformations specified in MULTIPLE steps can be performed at once, thanks to Lazy Evaluation.

Let's try another transformation:

df = df.where("number % 2 = 0")
df.explain()

== Physical Plan ==
*(1) Project [(((id#0L + 10) - 50) + 95) AS number#14L]
+- *(1) 𝗙𝗶𝗹𝘁𝗲𝗿 (((((id#0L + 10) - 50) + 95) % 2) = 0)
   +- *(1) Range (0, 1000, step=1, splits=8)

The Catalyst optimizer has chosen to 𝗳𝗶𝗿𝘀𝘁 𝗳𝗶𝗹𝘁𝗲𝗿 𝘁𝗵𝗲 𝗱𝗮𝘁𝗮, then perform the projection (predicate pushdown)!

Now, we will try a transformation that empties the DataFrame completely:

df = df.where("false")
df.explain()

== Physical Plan ==
𝗟𝗼𝗰𝗮𝗹𝗧𝗮𝗯𝗹𝗲𝗦𝗰𝗮𝗻 <𝗲𝗺𝗽𝘁𝘆>, [𝗻𝘂𝗺𝗯𝗲𝗿#𝟭𝟰𝗟]

As we can see, the Catalyst optimizer has chosen 𝗻𝗼𝘁 𝘁𝗼 𝗲𝘃𝗲𝗻 𝗹𝗼𝗮𝗱 𝘁𝗵𝗲 𝗗𝗮𝘁𝗮!
Comment (Data Engineer @ TCS | Spark | Scala | Azure), 3 months ago:
Can you give an example of fine-tuning spark.default.parallelism without under-partitioning or over-partitioning? What range should it fall in? Please give an example with numbers.