In Spark, it usually makes little performance difference whether you write your code in Scala, Python, or R, but that changes when you create a UDF. A UDF written in Scala runs inside the JVM, where the Spark engine itself runs, so there is no extra overhead. A Python UDF is different: it runs in a separate Python interpreter outside the JVM, so every time it is called, data has to be serialized from the JVM to the Python worker and the results serialized back. This inter-process communication adds substantial overhead, which is why Python UDFs are generally slower than Scala UDFs.

Conclusion: whenever possible, write your UDFs in Scala or Java. The small amount of extra time it takes to write the function in Scala usually pays off in significant speed-ups, and you can still call the Scala function from Python by packaging it into a jar and registering it. That's the beauty of Spark. #Spark #UDF #Python #PySpark
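A minimal PySpark sketch of the two paths described above. The jar path and the Scala class name (com.example.udfs.SquareUDF) are illustrative placeholders, not artifacts from the post; registerJavaFunction is the standard PySpark hook for loading a Java/Scala UDF from a jar.

```python
# Sketch contrasting a Python UDF with a Scala/Java UDF loaded from a jar.
# Assumes Spark 3.x; the jar path and class name below are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import LongType

spark = (
    SparkSession.builder
    .appName("udf-overhead-demo")
    # .config("spark.jars", "/path/to/your-udfs.jar")  # placeholder jar containing the Scala UDF
    .getOrCreate()
)

df = spark.range(1_000_000)

# 1) Python UDF: rows are serialized out to a Python worker and back,
#    which is where the inter-process overhead comes from.
square_py = udf(lambda x: x * x, LongType())
df.select(square_py(col("id")).alias("squared")).count()

# 2) Scala/Java UDF from the jar: executes entirely inside the JVM, so no
#    Python serialization round-trip. The class name is a placeholder for a
#    class implementing org.apache.spark.sql.api.java.UDF1[Long, Long].
# spark.udf.registerJavaFunction("square_scala", "com.example.udfs.SquareUDF", LongType())
# spark.sql("SELECT square_scala(id) AS squared FROM range(1000000)").count()
```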
More Relevant Posts
-
This post is for you if you are transitioning to or learning Python. Sooner or later you will come across terms like CPython, Jython, and Cython. If you want to understand these nuances, keep reading.

- Python: the programming language itself.
- CPython: the default and most widely used implementation of Python, written in C. When people refer to Python, they usually mean CPython. It compiles Python code to bytecode, which is then interpreted by the CPython virtual machine. Despite the name, CPython does not let you write C code; the "C" only refers to what the interpreter itself is written in.
- Cython: a superset of Python designed to give C-like performance with code that is written mostly in Python. It lets you write Python-like code that gets compiled to C, enabling significant performance improvements, especially for CPU-bound tasks.
- Jython: an implementation of Python that runs on the Java platform. Your Python code is first compiled to Java bytecode, which the JVM (Java Virtual Machine) then translates into platform-specific operations.

There are more implementations of Python out there, such as PyPy and IronPython. If you are interested in how each of them runs Python code, please check the first comment. One thing to note: irrespective of which implementation you use, the result of your code should be the same, although deviations can occur. A quick way to check which implementation you are running is shown below. #Python #ComputerScience #DSA #Algorithms #Programming
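As a small illustration of the point above, the standard library can report which implementation is executing your code. This uses only the platform and sys modules, so it assumes nothing beyond a working Python install.

```python
# Report which Python implementation is running this code.
# platform.python_implementation() returns e.g. 'CPython', 'PyPy',
# 'Jython' or 'IronPython'; sys.implementation carries the same
# information plus version details.
import platform
import sys

print(platform.python_implementation())   # e.g. 'CPython'
print(sys.implementation.name)            # e.g. 'cpython'
print(platform.python_version())          # e.g. '3.12.1'
```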
-
Diving Deep into Python's While Loop: Technical Insights

Let's unravel the intricacies of Python's while loop, a powerful construct for executing code repeatedly based on a condition. Here's a technical breakdown (a short runnable example follows below).

Core syntax: the while loop iterates as long as a specified condition is true. Its structure is:

while condition:
    # code block to execute

Conditional execution: the loop keeps iterating while the condition remains true. Once the condition evaluates to false, the loop terminates and control moves to the next statement.

Control flow: unlike for loops, which iterate over a predefined sequence, while loops suit scenarios where the number of iterations is not known beforehand, or where the loop needs to continue until a specific condition is met.

Dynamic iteration: while loops are well suited to user input validation, iterative algorithms, and any situation where the number of iterations depends on runtime conditions.

Avoiding infinite loops: take care that the condition eventually becomes false. Otherwise you get an infinite loop, which can cause the program to hang or crash.

Iteration control: within the loop body, developers often use break to exit the loop early, or continue to skip the rest of the current iteration and proceed to the next one.

Understanding the nuanced behavior and control mechanisms of the while loop empowers Python developers to create robust and efficient algorithms for various problem domains. What's your favorite application of the while loop in Python programming?

#Python #Programming #WhileLoop #TechnicalInsights #ForLoop #TechnicalSyntax #IterationInPython #LinkedInLearning #sqlqueries #sqldeveloper #scala #dataengineerjobs #dataengineer #dataengineering #sql #sqlserver #hdfs #mapreduce #hadoop #sqoop
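A minimal runnable sketch of the ideas above, combining an iteration count that is not known in advance with break and continue. The dice-rolling scenario is just an illustration, not taken from the original post.

```python
# Keep rolling a die until a 5 appears; the number of iterations is
# unknown ahead of time, which is exactly when a while loop fits.
import random

attempts = 0
while True:                       # condition re-checked on every pass
    attempts += 1
    roll = random.randint(1, 6)   # runtime-dependent value
    if roll % 2 == 0:
        continue                  # skip the rest of this iteration for even rolls
    if roll == 5:
        break                     # exit the loop once the target value appears

print(f"Rolled a 5 after {attempts} attempt(s)")
```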
-
YouTube Data Harvesting (using Python, MySQL, MongoDB, and Streamlit). For the full source code, see my GitHub: https://lnkd.in/gBDxt7x2 #python #DataScience #YouTubeDataHarvesting #MySQL #MongoDB
-
🔍 Did you know Zen, Actian's amazing NoSQL (Not Only SQL) embedded database, has support for Python? This blog shares a journey from old-school programming to seamlessly connecting Actian Zen with Python. Discover how easy it can be! #Actian #ActianZen #PervasiveSQL #PervasivePSQL #Btrieve #Python #Database
-
Very interesting tests that demonstrate the use of Python libraries such as Polars and DuckDB, and their efficiency compared to Pandas. I can't wait to see their impact in the near future; the question is "when?". I definitely need to try this for myself now!
🐍🦆 Processing 100 million rows with Python

In this project, inspired by the Java-based One Billion Row Challenge, I processed a massive 100-million-line data file (~1.57 GB) using Python. The goal was to compute statistics involving aggregation and sorting: read a file with two columns, station name and measurement, and calculate the minimum, average (rounded to one decimal place), and maximum temperature for each station, displaying the results in a sorted table.

The approaches included pure Python as well as libraries such as Pandas, Dask, Polars, and DuckDB. Execution times for processing the 100-million-line file with each approach are shown in the attached results. They highlight the effectiveness of Python libraries: Dask, Polars, and DuckDB were exceptionally efficient and required fewer lines of code, thanks to their ability to process the data in streaming batches. DuckDB performed best, achieving the shortest execution time due to its execution and data-processing strategy.

These findings underscore the importance of selecting the right tool for large-scale data analysis and show that Python, with the right libraries, is a powerful choice for big-data challenges. #python #duckdb #pandas
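For readers who want a concrete starting point, here is a minimal sketch of the DuckDB approach described above. The file name 'measurements.txt' and its semicolon-separated, headerless layout are assumptions for the example, not details taken from the post.

```python
# Compute min / mean (1 decimal) / max temperature per station with DuckDB,
# reading the file directly so it never has to fit in memory as a Python list.
# Assumed input: 'measurements.txt', lines like "StationName;12.3", no header.
import duckdb

duckdb.sql("""
    SELECT
        station,
        MIN(temperature)           AS min_temp,
        ROUND(AVG(temperature), 1) AS mean_temp,
        MAX(temperature)           AS max_temp
    FROM read_csv('measurements.txt',
                  delim=';',
                  header=false,
                  columns={'station': 'VARCHAR', 'temperature': 'DOUBLE'})
    GROUP BY station
    ORDER BY station
""").show()   # .df() would return a pandas DataFrame instead
```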
-
The power of Python in data analysis is truly mind-boggling.
-
Python core to advanced:
- Python installation
- Python datatypes
- Important methods, operators, and flow control
- List, tuple, set, dictionary
- Best practices for exception handling in Python
- Python lambdas
- Multithreading
- Multiprocessing (process-based parallelism)
- asyncio in Python
- The re module
- Namedtuple
- Python dataclasses
- Dunder (magic) methods in Python
- Python decorators (aspect-oriented programming), see the small sketch after this list
- Connect Python to Oracle DB using the cx_Oracle module
- Connect Python to SQLite
- NumPy and Pandas
- Natural Language Toolkit (NLTK)
- Stemming, lemmatization, and POS tagging
- Data science and ML
- Logging in Python
- Matplotlib: visualization with Python
- Python web applications using Flask and Jinja2
- SQLAlchemy: Python SQL toolkit and ORM
- FastAPI: a fast (high-performance) web framework for building APIs
- Comparison of FastAPI with Django and Flask
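As a small illustration of one item on this list (decorators), here is a minimal timing decorator. The function names are invented for the example and are not part of the original post.

```python
# A minimal timing decorator: wraps a function and prints how long it took.
import functools
import time

def timed(func):
    @functools.wraps(func)          # preserve the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.4f}s")
        return result
    return wrapper

@timed
def slow_sum(n):
    return sum(range(n))

if __name__ == "__main__":
    slow_sum(1_000_000)
```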