TrendyTech’s Post

Sumit Mittal

Founder & CEO of Trendytech | Big Data Trainer | Ex-Cisco | Ex-VMware | MCA @ NIT Trichy | #SumitTeaches | New Batch Starting November 16th, 2024

Internal working of Apache Spark - one of my most liked writeups.

Let's say you have a 20-node Spark cluster. Each node is of size 16 CPU cores / 64 GB RAM, and each node has 3 executors, with each executor of size 5 CPU cores / 21 GB RAM.

=> 1. What's the total capacity of the cluster?

We will have 20 * 3 = 60 executors.
Total CPU capacity: 60 * 5 = 300 CPU cores.
Total memory capacity: 60 * 21 = 1260 GB RAM.

=> 2. How many parallel tasks can run on this cluster?

We have 300 CPU cores, so we can run 300 parallel tasks on this cluster.

=> 3. Let's say you requested 4 executors; how many parallel tasks can run then?

The capacity we get is 4 * 5 = 20 CPU cores and 4 * 21 = 84 GB RAM, so a total of 20 parallel tasks can run.

=> 4. Let's say we read a 10.1 GB CSV file stored in a data lake and have to do some filtering of the data; how many tasks will run?

If we create a dataframe out of the 10.1 GB file, we get 81 partitions (I will cover how the number of partitions is calculated in my next post). So we have 81 partitions of 128 MB each, with the last partition a bit smaller, and our job will have 81 total tasks but only 20 CPU cores.

Let's say each task takes around 10 seconds to process 128 MB of data. The first 20 tasks run in parallel; once they are done, the next 20 tasks are executed, and so on, for a total of 5 cycles in the most ideal scenario: 10 sec + 10 sec + 10 sec + 10 sec + 8 sec. The first 4 cycles process 80 tasks of 128 MB each; the last 8 seconds process just one task of around 100 MB, which takes a little less time, but 19 CPU cores sit idle during that final cycle.

=> 5. Is there a possibility of an out-of-memory error in the above scenario?

Each executor has 5 CPU cores and 21 GB RAM. This 21 GB RAM is divided into various parts:
- 300 MB reserved memory
- 40% user memory, to store user-defined variables/data structures (for example, a hashmap)
- 60% Spark memory, which is divided 50:50 between storage memory and execution memory

So we are really looking at execution memory, which comes to roughly 28-30% of the total memory allotted: consider around 6 GB of the 21 GB as execution memory. Per CPU core we have 6 GB / 5 cores = 1.2 GB of execution memory. That means a task can roughly handle around 1.2 GB of data, and since we are handling only 128 MB, we are well within this range.

I hope you liked the explanation :)

Do mention in the comments what you want me to bring in my next post!

If you want to experience learning like never before and want to make a career in big data, then DM me. The new batch is starting tomorrow.

#bigdata #career #dataengineering #apachespark
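To make the capacity and cycle math in points 1-4 concrete, here is a plain-Python back-of-the-envelope sketch. All numbers come from the post itself; the 128 MB partition size assumes Spark's default spark.sql.files.maxPartitionBytes, and the 10-seconds-per-task figure is the post's illustrative assumption, not a measured value.

import math

# Cluster shape from the post: 20 nodes, 3 executors per node,
# each executor with 5 cores and 21 GB RAM
nodes = 20
executors_per_node = 3
cores_per_executor = 5
mem_per_executor_gb = 21

total_executors = nodes * executors_per_node            # 60
total_cores = total_executors * cores_per_executor      # 300 -> max parallel tasks
total_mem_gb = total_executors * mem_per_executor_gb    # 1260 GB

# Point 3: requesting only 4 executors
requested_cores = 4 * cores_per_executor                # 20 parallel tasks

# Point 4: a 10.1 GB file split into ~128 MB partitions
file_mb = 10.1 * 1024                                   # 10342.4 MB
partition_mb = 128
num_tasks = math.ceil(file_mb / partition_mb)           # ceil(80.8) = 81

# Tasks run in "cycles" of 20 on the requested cores, ~10 s per full task
cycles = math.ceil(num_tasks / requested_cores)         # 5
last_partition_mb = file_mb - (num_tasks - 1) * partition_mb            # ~102 MB
est_runtime_s = (num_tasks // requested_cores) * 10 \
    + 10 * (last_partition_mb / partition_mb)           # 40 + ~8 s

print(total_executors, total_cores, total_mem_gb)       # 60 300 1260
print(num_tasks, cycles, round(est_runtime_s))          # 81 5 48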
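For point 5, here is a quick check of the executor memory split, assuming the unified memory manager defaults (a fixed 300 MB reserved memory, spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5); the exact on-heap figures can differ slightly depending on Spark version and overhead settings.

# Executor memory breakdown for one 21 GB / 5-core executor
heap_mb = 21 * 1024                         # 21504 MB executor heap
reserved_mb = 300                           # fixed reserved memory
usable_mb = heap_mb - reserved_mb           # 21204 MB

spark_mem_mb = 0.6 * usable_mb              # spark.memory.fraction -> ~12.4 GB
user_mem_mb = 0.4 * usable_mb               # user data structures -> ~8.3 GB
execution_mb = 0.5 * spark_mem_mb           # storage/execution split -> ~6.2 GB
per_core_mb = execution_mb / 5              # ~1.27 GB per task slot

print(round(execution_mb), round(per_core_mb))   # 6361 1272

Note that the 50:50 storage/execution boundary is soft in the unified memory manager: execution can borrow unused storage memory and evict cached blocks down to the storage threshold, so ~1.2 GB per core is a floor rather than a hard ceiling.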

Branesh P

Data Engineer | Python, SQL, ETL, Airflow, Apache Spark | Open to Opportunities

2w

To my knowledge, by default 1 block = 128 MB, which means 1 block can hold 1 partition of size 128 MB. 10.1 GB = 10.1 x 1024 MB = 10342.4 MB. To find the number of partitions: 10342.4 / 128 = 80.8, i.e. 81 blocks, or 81 partitions. Correct me if I'm wrong.
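For anyone who wants to verify this, a minimal PySpark check (the file path here is hypothetical, and the exact count can vary slightly because Spark also factors in spark.sql.files.openCostInBytes and the cluster's default parallelism when packing file splits):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-check").getOrCreate()

# File-source split size, 128 MB by default
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

df = spark.read.option("header", True).csv("/path/to/10gb_file.csv")  # hypothetical path
print(df.rdd.getNumPartitions())  # expect ~81 for a 10.1 GB file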

Anurag Singh

Serving Notice Period | 3 years exp | Data Engineer | SQL | Python | ETL Developer |Snowflake Certified | Azure 204 DP203 | Big Data | AWS | GCP

2w

It was so detailed and explained beautifully.

Hi Sumit Mittal, I am planning to change my career into big data, but my background is SAP. Will it be difficult or easy to learn in my case?

Azhar Akther K

Big Data Enthusiast - Payments | Big Data | Pyspark | Hive | HDFS | Azure Databricks |ADLS gen2| Azure Data Factory |Python| SQL|Azure Synapse Analytics|Unity Catalog| Spark Structured Streaming|Autoloader| Apache Kafka

2w

One of the most excellent and useful practical scenarios covered in this post. One should know these internals to plan resources for data crunching.

Joseph Vishal Vincent

Lead Azure Big Data Engineer | Technical Lead | Big Data Architect at Tata Consultancy Services

2w

Sir, as usual it is an awesome explanation of the internals.

Nagarjuna Mandalapu

Technical Lead @Nokia || SQL || PySpark || Databricks || ADLS Gen2 || ADF || AWS || Glue || EMR || Redshift || Athena || Kafka || VIT'22

2w

Very helpful Sumit sir. It's the most interesting topic in the course.

Very helpful, Sumit Mittal, and a clear explanation in a simple way.

Abdullah Najeeb

Big Data Engineer at Data Doers

2w

Very informative

Karthik Mani (PMI- PMP, ACP, RMP. SCDM- CCDA. SAS- VA.)

Manager - Clinical Data Management | Data | Analytics | SAS | SQL | Viz | Payroll Processing | Testing

2w

Sumit Mittal sir, simple and clear explanations. Thanks sir.
