Sreedhar Reddy’s Post

Senior Data Engineer at Persistent | Ex EY | Azure 2X certified | Databricks 2X certified | Pyspark | SQL

In Spark, it usually makes no performance difference whether you write your code in Scala, Python, or R, because the DataFrame API compiles down to the same execution plan in every language. UDFs are the exception. When you write a UDF in Scala, it runs inside the JVM, where the Spark engine itself runs, so there is no additional overhead. A Python UDF is quite different: it runs in a separate Python interpreter outside the JVM, so every time it is called, rows have to be serialized from the JVM to the Python worker process and the results serialized back. That inter-process communication adds substantial overhead, which is why Python UDFs are generally slower than Scala UDFs.

𝐂𝐨𝐧𝐜𝐥𝐮𝐬𝐢𝐨𝐧: Write your UDFs in Scala or Java. The small amount of extra time it takes to write the function in Scala usually pays for itself in speed-ups, and you can still call the Scala function from Python by packaging it in a JAR and registering it. That's the beauty of Spark. #Spark #UDF #Python #Pyspark
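To make that concrete, here is a minimal sketch (assuming Spark 3.x; the package, class, and function names are hypothetical placeholders, not from the post). The class implements the Java UDF1 interface so the same JVM function can later be registered from PySpark without a rewrite:

```scala
// Minimal sketch, assuming Spark 3.x. Package and class names are
// hypothetical examples for illustration.
package com.example.udfs

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.api.java.UDF1
import org.apache.spark.sql.types.StringType

// Implementing the Java UDF1 interface keeps the function callable from
// PySpark (via spark.udf.registerJavaFunction) once this class is packaged
// into a JAR and shipped with --jars.
class NormalizeName extends UDF1[String, String] {
  override def call(s: String): String =
    if (s == null) null else s.trim.toLowerCase
}

object RegisterUdfs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("udf-demo").getOrCreate()

    // Runs entirely inside the JVM: no serialization round trip to a
    // separate Python worker process.
    spark.udf.register("normalize_name", new NormalizeName(), StringType)

    spark.sql("SELECT normalize_name('  Alice  ') AS name").show()
    spark.stop()
  }
}
```

From PySpark, the same class can then be registered without rewriting it, e.g. spark.udf.registerJavaFunction("normalize_name", "com.example.udfs.NormalizeName", StringType()), and the call still executes inside the JVM.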
