
I would like to know the most efficient way to generate a column index that uniquely identifies a record within each group of label:

+-------+-------+-------+
| label | value | index |
+-------+-------+-------+
| a     | v1    | 0     |
| a     | v2    | 1     |
| a     | v3    | 2     |
| a     | v4    | 3     |
| b     | v5    | 0     |
| b     | v6    | 1     |
+-------+-------+-------+

My actual data is very large, and each group of label has the same number of records. The index column will be used for a pivot. I could do the usual approach of sorting, incrementing a counter in a for-loop, and resetting the index whenever the current label differs from the previous one, but a faster, more efficient way is always welcome.
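For reference, a small DataFrame matching the table above could be built like this (a minimal sketch that assumes an existing SparkSession named spark; the index column is what we want to generate):

# Toy data matching the example; only label and value exist up front
data = [("a", "v1"), ("a", "v2"), ("a", "v3"), ("a", "v4"),
        ("b", "v5"), ("b", "v6")]
df = spark.createDataFrame(data, ["label", "value"])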

EDIT: I got my answer from the suggested question:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# row_number() assigns a 1-based position within each label partition,
# ordered by value
df = df.withColumn("index",
                   F.row_number().over(
                       Window.partitionBy("label").orderBy("value"))
                  )
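Since the index is destined for a pivot, a rough sketch of that step could look like the following (the groupBy/pivot/agg combination and the first aggregate are my assumptions, not part of the original post):

# Spread each group's values across columns keyed by the generated index
pivoted = (df.groupBy("label")
             .pivot("index")
             .agg(F.first("value")))
pivoted.show()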

Thank you for all your help!

Tri

1 Answer


You can use window functions to create a rank-based column while partitioning on the label column. However, this requires an ordering, in this case on value:

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

window = Window.partitionBy(df['label']).orderBy(df['value'])
df.withColumn('index', row_number().over(window))

This will give a new column index with values starting from 1 (to start from 0, simply subtract 1 from the expression above). The rows are numbered in the order given by the value column.
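For completeness, the 0-based variant mentioned above is a one-line change (a sketch reusing the window defined in the snippet above):

# Subtract 1 so the numbering starts at 0 instead of 1
df.withColumn('index', row_number().over(window) - 1)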

Shaido