Strange behavior in PySpark

Date: 2018-01-10 10:37:54

Tags: python date pyspark nonetype

I have come across some strange behavior in PySpark. Maybe one of you knows what is going on. If I do this:

import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def create_my_date(mydate):
    try:
        return mydate.strftime('%Y%m')
    except:
        return None

df = df.withColumn(
    "date_string",
    F.udf(create_my_date, StringType())(df.mydate)
)

df.filter(~df.mydate.isNotNull()).count()
df.filter(df.mydate.isNotNull()).count()

this outputs:

0
10

which means there are no null values in the column df.mydate.

But if I change the create_my_date function and remove the try/except:

def create_my_date(mydate):
    return mydate.strftime('%Y%m')


df = df.withColumn(
    "date_string", 
    F.udf(create_my_date, StringType())(df.mydate)
)

df.filter(~df.mydate.isNotNull()).count()
df.filter(df.mydate.isNotNull()).count()

the JVM breaks and says:

Py4JJavaError: An error occurred while calling o7058.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 22 in stage 997.0 failed 4 times, most recent failure: Lost task 22.3 in stage 997.0 (TID 335940, 126.102.230.110, executor 29): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 174, in main
    process()
  File "/home/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 169, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 106, in <lambda>
    func = lambda _, it: map(mapper, it)
  File "/home/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 92, in <lambda>
    mapper = lambda a: udf(*a)
  File "/home/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 70, in <lambda>
    return lambda *a: f(*a)
  File "<ipython-input-109-422e4b5e07cf>", line 2, in create_my_date
AttributeError: 'NoneType' object has no attribute 'strftime'

Does anyone have an explanation for me?

Thanks!

1 answer:

Answer 0 (score: 2)

You get the AttributeError because you are calling strftime on None. The traceback shows the error is raised inside 'create_my_date': the UDF runs in plain Python and is handed the Python representation of each column value, so a SQL NULL arrives as None. So basically it is doing this:

>>> None.strftime("%Y%m")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'strftime'
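
A quick way to see this for yourself, assuming the df and mydate column from the question, is a small probe UDF that reports the Python type it receives (probe is just an illustrative name):

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# For each row, report the Python type the UDF is handed.
probe = F.udf(lambda v: type(v).__name__, StringType())
df.select(probe(df.mydate).alias("python_type")).show()
# rows with a date print 'date'; NULL rows print 'NoneType'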

Instead, you can accomplish what you want with built-in DataFrame functions, which are faster than a UDF and need no try/except block:

from pyspark.sql.functions import date_format
from datetime import datetime
df = spark.createDataFrame([[datetime(2018, 3, 2).date()], [None]], ["mydate"])

df = df.withColumn("date_string", date_format("mydate", "YMM"))
df.show()

The resulting DataFrame:

    +----------+-----------+
    |    mydate|date_string|
    +----------+-----------+
    |2018-03-02|     201803|
    |      null|       null|
    +----------+-----------+

Then your counts:

df.filter(df["mydate"].isNotNull()).count()
df.filter(df["mydate"].isNull()).count()

return, as expected:

1
1
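
If you ever do need a Python UDF here (say, for formatting that date_format cannot express), a minimal null-safe sketch would check for None explicitly instead of using a bare except, which also swallows unrelated bugs:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def create_my_date(mydate):
    # SQL NULLs reach the UDF as Python None; pass them through.
    if mydate is None:
        return None
    return mydate.strftime('%Y%m')

df = df.withColumn(
    "date_string",
    F.udf(create_my_date, StringType())(df.mydate)
)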