
I would like to know the most efficient way to generate a column index that uniquely identifies a record within each group of label:

+-------+-------+-------+
| label | value | index |
+-------+-------+-------+
| a     | v1    | 0     |
| a     | v2    | 1     |
| a     | v3    | 2     |
| a     | v4    | 3     |
| b     | v5    | 0     |
| b     | v6    | 1     |
+-------+-------+-------+

My actual data is very large, and each group of label has the same number of records. The index column will be used for a pivot. I could do the usual approach of sorting, incrementing a counter in a for-loop, and resetting the index whenever the current label differs from the previous one, but a faster, more efficient way is always welcome.
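For reference, a small DataFrame matching the table above could be built like this (a minimal sketch that assumes an existing SparkSession named spark; the index column is what we want to generate):

# Toy data matching the example; only label and value exist up front
data = [("a", "v1"), ("a", "v2"), ("a", "v3"), ("a", "v4"),
        ("b", "v5"), ("b", "v6")]
df = spark.createDataFrame(data, ["label", "value"])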

EDIT: I got my answer from the suggested question:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# row_number() assigns a 1-based position within each label partition,
# ordered by value
df = df.withColumn("index",
                   F.row_number().over(
                       Window.partitionBy("label").orderBy("value"))
                  )
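Since the index is destined for a pivot, a rough sketch of that step could look like the following (the groupBy/pivot/agg combination and the first aggregate are my assumptions, not part of the original post):

# Spread each group's values across columns keyed by the generated index
pivoted = (df.groupBy("label")
             .pivot("index")
             .agg(F.first("value")))
pivoted.show()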

Thank you for all your help!

Tri

1 Answer


You can use window functions to create a rank-based column while partitioning on the label column. However, this requires an ordering, in this case on value:

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

window = Window.partitionBy(df['label']).orderBy(df['value'])
df.withColumn('index', row_number().over(window))

This will give a new column index with values starting from 1 (to start from 0, simply subtract 1 from the expression above). The rows are numbered in the order given by the value column.
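For completeness, the 0-based variant mentioned above is a one-line change (a sketch reusing the window defined in the snippet above):

# Subtract 1 so the numbering starts at 0 instead of 1
df.withColumn('index', row_number().over(window) - 1)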

Shaido