Sergey Senigov’s Post

View profile for Sergey Senigov, graphic

Data Engineer SQL, Apache Spark, Spark Streaming, Flink, Python, Hive, DWH 🔥 Follow my Apache Spark Blog here!

There are so called «narrow» and «wide» transformations in Apache Spark.  Briefly, wide transformations are those in which Spark needs to collect all data belonging to one group to one partition, and so make data exchange aka shuffle. Any grouping, windowing transformations are wide because to get group summary Spark needs to move all group data in one partition. On the contrary narrow transformations don’t need data from other partitions, for example «filter», «map». The key for this topic is – narrow transformations preserve existing partitions during execution – they don’t shuffle data. But there is a special case – the «union» transformation. It doesn’t produce shuffle but it «stacks» source DataFrames. So the result DataFrame is comprised of sources’ partitions and its partition number equals sum of the sources’ partition numbers. This effect may be important for further transformations execution performance. Also it directly affects files number written to downstream storage. https://lnkd.in/g8VRq8qD

  • Apache Spark SQL partitions count after "union" transformation

To view or add a comment, sign in

Explore topics