I'm trying to add a column to my Spark DataFrame using withColumn and a udf that takes no arguments. This only seems to work if I use a lambda to encapsulate my original function.
Here's an MWE:
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import udf
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([Row(number=i) for i in range(10)])
def foo():
    return 'bar'
udfoo = udf(foo())
df = df.withColumn('word', udfoo())
# Fails with TypeError: _create_udf() missing 1 required positional argument: 'f'
udfoo = udf(lambda: foo())
df = df.withColumn('word', udfoo())
# Works
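For context, here is a pure-Python sketch of what I suspect is happening inside pyspark's udf (the helper names below are illustrative, not the actual Spark source): udf accepts either a function or a return type as its first argument, so udf(foo()) is really udf('bar'), which gets treated as a return type and yields a partial still waiting for a function.

```python
import functools

def _create_udf(f, returnType):
    # Stand-in for the real wrapper that builds a column expression.
    return lambda: f()

def udf(f=None, returnType='string'):
    # Mimics the dual-use signature: if the first argument is not
    # callable, assume it is a return type and defer via a partial.
    if f is None or isinstance(f, str):
        return functools.partial(_create_udf, returnType=f or returnType)
    return _create_udf(f, returnType=returnType)

def foo():
    return 'bar'

udfoo = udf(foo())   # foo() evaluates to 'bar', taken as a returnType!
try:
    udfoo()          # the partial is called with no function argument
except TypeError as e:
    print(e)         # missing 1 required positional argument: 'f'

udfoo = udf(lambda: foo())   # a callable is passed, so wrapping proceeds
print(udfoo())
```

Calling the partial with no arguments never supplies `f`, which would explain the exact error message above, while the lambda version passes a callable and takes the normal path.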
I've managed to achieve the behaviour I want, so a "solution" is not exactly what I'm looking for (even though I welcome any suggestions for a better/more idiomatic way to implement this kind of thing). If anyone lands here looking for a "how to do it" answer, this other question might help.
What I'm really after is an explanation: why does the first approach fail and the second work?
I'm using Spark 2.4.0 and Python 3.7.3 on Ubuntu 18.04.2.