Window Aggregation Analytics (Spark)

The Window Aggregation analytics plugin for Spark is available in the Hub.

Plugin version: 1.1.0

Specify a window over which functions should be applied. Supported functions: Rank, Dense Rank, Percent Rank, N tile, Row Number, Median, Continuous Percentile, Lead, Lag, First, Last, Cumulative distribution, Accumulate.

The plugin is used when you want to calculate some basic aggregations in your data similar to what you could do with a window function in SQL.

BigQuery ELT Transformation Pushdown (6.9.0+)

Window aggregation stages are now eligible to execute in BigQuery when BigQuery ELT Transformation Pushdown is enabled in a pipeline. Window aggregation stages will be executed in BigQuery when a preceding stage has already been executed in BigQuery (such as a Join operation or another aggregation stage) or if the source is BigQuery. All the above mentioned functions are supported in BigQuery.

Configuration

Property

Macro Enabled?

Description

Property

Macro Enabled?

Description

Partition fields

Yes

Required. Specifies a list of fields, comma-separated, to partition the data by. At least 1 field must be provided. Records with the same value for all these fields will be grouped together.

Order

Yes

Optional. Specifies key-value pairs containing the ordering field, and the order (ascending or descending). All data types are allowed, except when Frame Type is RANGE and Unbounded preceding or Unbounded following is set to false, order must be single expression and data type must be one of: Int, Long, Double, Float For example: value:Ascending,id:Descending.

Frame Type

Yes

Optional. Selects the type of window frame to create within each partition. Options can be ROW, RANGE or NONE. 

Default is None.

Unbounded preceding

Yes

Optional. Whether to use an unbounded start boundary for a frame. 

Default is No.

Unbounded following

N

Yes

Optional. Whether to use an unbounded end boundary for a frame.

Default is No.

Preceding

Yes

Optional. Specifies the number of preceding rows in the window frame. When Frame Type is ROW, this is a number relative to the current row. E.g. -2 to begin the frame 2 rows before the current row. When Frame Type is RANGE, specifies the value to be subtracted from the value of the current row to get the start boundary.

Following

Yes

Optional. Specifies the number of following rows to include in the window frame. When Frame Type is ROW, this is a number relative to the current row. E.g. 3 to end the frame 3 rows after the current row. When Frame Type is RANGE, specifies the value to be added to the value of the current row to get the end boundary.

Aggregates

Yes

Required. Specifies a list of functions to run on the selected window. Supported aggregate functions are Rank, Dense Rank, Percent Rank, N tile, Row Number, Median, Continuous Percentile, Lead, Lag, First, Last, Cumulative distribution, Accumulate. Aggregates are specified using syntax: .alias:function(field,encoded(arguments),ignoreNulls)[\n other functions]. For example, nextValue:lead(value,1,false)\npreviousValue:lag(value,1,false) will calculate two aggregates. The first will create a field called nextValue that is the next value of current row in the group. The second will create a field called previousValue that is the previous value of current row in the group.

Number of partitions

Yes

Optional. Number of partitions to use when grouping fields. If not specified, the execution framework will decide on the number to use.

Clause Constraints

Function

Partition fields

Order

Frame Type

Function

Partition fields

Order

Frame Type

rank

Required

Required

Not supported

dense_rank

Required

Required

Not supported

percent_rank

Required

Required

Not supported

n_tile

Required

Required

Not supported

row_number

Required

Required

Not supported

continous_percentile

Required

Not supported

Not supported

lead

Required

Required

Not supported

lag

Required

Required

Not supported

first

Required

Required

Optional

last

Required

Required

Optional

cumulative_distribution

Required

Required

Not supported

accumulate

Required

Optional

Optional

Functions with Arguments

There are few functions which require the field and argument as per the syntax alias:function(field,encoded(arguments),ignoreNulls). If the function doesn't require the field or the argument, then it's ignored.

Function

field

argument

Function

field

argument

rank

 

 

dense_rank

 

 

percent_rank

 

 

n_tile

 

Required : an integer greater than 0

row_number

 

 

continous_percentile

Required

Required : a numeric between 0 and 1 (both inclusive)

lead

Required

Required : a non-negative integer

lag

Required

Required : a non-negative integer

first

Required

 

last

Required

 

cumulative_distribution

 

 

accumulate

Required

 

Sample Pipeline

Input Records

name

age

location

name

age

location

peter

20

US

foo

22

US

rajeev

24

US

john

28

US

alex

30

US

ravi

20

INDIA

kenny

30

INDIA

Window Aggregations Configuration

Partition fields : location

Order : age:ascending

Frame Type : None

Aggregates :

my_rank: rank(,,true)

next_value:lead(age,1,false)

Output Records

name

age

location

my_rank

next_value

name

age

location

my_rank

next_value

peter

20

US

1

22

foo

22

US

2

24

rajeev

24

US

3

28

john

28

US

4

30

alex

30

US

5

 

ravi

20

INDIA

1

30

kenny

30

INDIA

2

 



Created in 2020 by Google Inc.