
I'm trying to add a column to my Spark DataFrame using withColumn and a udf that takes no arguments. This only seems to work if I wrap my original function in a lambda.

Here's a MWE:

from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([Row(number=i) for i in range(10)])

def foo():
    return 'bar'

udfoo = udf(foo())
df = df.withColumn('word', udfoo())
# Fails with TypeError: _create_udf() missing 1 required positional argument: 'f'

udfoo = udf(lambda: foo())
df = df.withColumn('word', udfoo())
# Works

I've managed to achieve the behaviour I want, so a "solution" is not exactly what I'm looking for (even though I welcome any suggestions for a better/more idiomatic way to implement this kind of thing). If anyone lands here looking for a "how to do it" answer, this other question might help.

What I'm really after is an explanation: why does the first version fail and the second work?

I'm using spark 2.4.0 and python 3.7.3 on Ubuntu 18.04.2

kadu

1 Answer


udf expects a function to be passed to it, but when you call foo() it evaluates immediately to a string.
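To make the distinction concrete without Spark, here is a minimal plain-Python sketch. The names `passed_without_call` and `passed_with_call` are only illustrative; `foo` is the function from the question.

```python
def foo():
    return 'bar'

# foo is the function object itself; foo() is whatever the function returns.
passed_without_call = foo
passed_with_call = foo()

print(callable(passed_without_call))     # the function can still be called later
print(callable(passed_with_call))        # the string cannot
print(type(passed_with_call).__name__)   # the type of what udf(foo()) receives
```

So `udf(foo())` is handed the string 'bar' rather than anything it could call once per row.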

You'll see the behavior you're expecting if you use udf(foo) instead of udf(foo()).

i.e.

udfoo = udf(foo)
df = df.withColumn('word', udfoo())
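As for the exact error message: as far as I can tell, in Spark 2.4 `udf` also works as a decorator, so a string (or nothing) in the first position is read as a return type, and `udf` returns a partially applied `_create_udf` that is still waiting for its function `f`. The sketch below imitates that pattern in plain Python. It is a simplified stand-in, not PySpark's source, and `toy_udf` and `_create_toy_udf` are hypothetical names.

```python
import functools

def _create_toy_udf(f, returnType):
    # Hypothetical stand-in for the real UDF factory: wraps f and tags it.
    def wrapper(*args):
        return f(*args)
    wrapper.returnType = returnType
    return wrapper

def toy_udf(f=None, returnType='string'):
    # A string (or nothing) in first position is read as a return type,
    # so the factory is handed back still waiting for its function.
    if f is None or isinstance(f, str):
        return functools.partial(_create_toy_udf, returnType=f or returnType)
    return _create_toy_udf(f, returnType)

def foo():
    return 'bar'

broken = toy_udf(foo())   # same as toy_udf('bar'): no error raised yet
try:
    broken()              # the factory is now called without a function
    error_message = None
except TypeError as exc:
    error_message = str(exc)

working = toy_udf(foo)    # the function itself is passed
result = working()

print(error_message)
print(result)
```

Under that assumption, the failure would surface on the `udfoo()` call inside withColumn rather than on the `udf(foo())` line, which matches the complaint about a missing argument 'f'.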

In case it helps, if you are trying to get a column that is just a constant value, you can use pyspark.sql.functions.lit, like:

from pyspark.sql import functions as F

df.withColumn('word', F.lit('bar'))
Patrick
  • This makes perfect sense, the question had been gnawing at me! And thanks for the explanation about `lit`, this MWE is not exactly what I'm trying to do, but it may help future readers. – kadu Apr 23 '19 at 22:58