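Strange behavior of PySpark to_timestamp()

Asked: 2019-02-16 14:02:19

Tags: python apache-spark pyspark timestamp unix-timestamp

I've noticed some odd behavior in PySpark's (and possibly Spark's) to_timestamp function. It appears to convert some strings to timestamps correctly, while converting other strings in exactly the same format to null. Consider the following example I put together: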

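from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# `spark` is the SparkSession provided by the pyspark shell.
times = [['2030-03-10 02:56:07'], ['2030-03-11 02:56:07']]

df_test = spark.createDataFrame(times, schema=StructType([
    StructField("time_string", StringType())
]))
# Parse the strings using an explicit pattern.
df_test = df_test.withColumn('timestamp',
                             F.to_timestamp('time_string',
                                            format='yyyy-MM-dd HH:mm:ss'))
df_test.show(2, False)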

This is what I get:

+-------------------+-------------------+
|time_string        |timestamp          |
+-------------------+-------------------+
|2030-03-10 02:56:07|null               |
|2030-03-11 02:56:07|2030-03-11 02:56:07|
+-------------------+-------------------+

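Why is the second string converted correctly but not the first? I also tried the unix_timestamp() function, and the result is the same.

Even stranger, if I omit the format parameter I no longer get null, but the hour of the timestamp is shifted forward by one.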

df_test2 = df_test.withColumn('timestamp', F.to_timestamp('time_string'))
df_test2.show(2, False)

Result:

+-------------------+-------------------+
|time_string        |timestamp          |
+-------------------+-------------------+
|2030-03-10 02:56:07|2030-03-10 03:56:07|
|2030-03-11 02:56:07|2030-03-11 02:56:07|
+-------------------+-------------------+

Any idea what is going on?

Update:

I also tried this in Scala via spark-shell, and the result is the same:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions

val times = Seq(Row("2030-03-10 02:56:07"), Row("2030-03-11 02:56:07"))
val schema=List((StructField("time_string", StringType)))
val df = spark.createDataFrame(spark.sparkContext.parallelize(times), 
                               StructType(schema))
val df_test = df.withColumn("timestamp", 
                      functions.to_timestamp(functions.col("time_string"), 
                                             fmt="yyyy-MM-dd HH:mm:ss"))

df_test.show()

Result:

+-------------------+-------------------+
|        time_string|          timestamp|
+-------------------+-------------------+
|2030-03-10 02:56:07|               null|
|2030-03-11 02:56:07|2030-03-11 02:56:07|
+-------------------+-------------------+
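
A possible lead, not confirmed: the question does not say which time zone the Spark session uses, but 2030-03-10 is the second Sunday of March, the day US daylight saving time begins, so in a US zone the wall-clock time 02:56:07 does not exist on that date. The sketch below checks this outside Spark with Python's standard zoneinfo module (Python 3.9+). The zone America/New_York and the helper name exists_in_zone are assumptions made for illustration only.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def exists_in_zone(text, zone_name, fmt="%Y-%m-%d %H:%M:%S"):
    """Return True if the wall-clock time `text` occurs in `zone_name`.

    Hypothetical helper: it only illustrates the DST-gap check and says
    nothing about how Spark parses timestamps internally.
    """
    zone = ZoneInfo(zone_name)
    naive = datetime.strptime(text, fmt)
    local = naive.replace(tzinfo=zone)
    # Round-trip through UTC. A wall-clock time that falls inside a DST gap
    # does not survive the round trip unchanged.
    round_trip = local.astimezone(timezone.utc).astimezone(zone)
    return round_trip.replace(tzinfo=None) == naive


# Assumed zone; substitute the zone the Spark session actually runs in.
for value in ("2030-03-10 02:56:07", "2030-03-11 02:56:07"):
    print(value, exists_in_zone(value, "America/New_York"))
```

If the first string is reported as non-existent in the session's zone, that would be consistent with both the null and the one-hour shift. Re-running the example with the `spark.sql.session.timeZone` setting changed to UTC would be one way to test this guess.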

0 answers:

No answers yet