Pyspark: unix_timestamp error

Date: 2018-06-20 21:00:30

Tags: pyspark unix-timestamp

Pyspark 2.1:

I created a dataframe with a timestamp column, which I convert to a unix timestamp. However, the column derived from the unix timestamp is incorrect: as the timestamp increases, the unix_timestamp should also increase, but that is not the case. You can see an example in the code below. Note that sorting by the timestamp column and sorting by the unix_ts column produce different orderings.

from pyspark.sql import functions as F

df = sqlContext.createDataFrame([
        ("a", "1", "2018-01-08 23:03:23.325359"),
        ("a", "2", "2018-01-09 00:03:23.325359"),
        ("a", "3", "2018-01-09 00:03:25.025240"),
        ("a", "4", "2018-01-09 00:03:27.025240"),
        ("a", "5", "2018-01-09 00:08:27.021240"),
        ("a", "6", "2018-01-09 03:03:27.025240"),
        ("a", "7", "2018-01-09 05:03:27.025240"),


], ["person_id", "session_id", "timestamp"])

df = df.withColumn("unix_ts",F.unix_timestamp(F.col("timestamp"), "yyyy-MM-dd HH:mm:ss.SSSSSS"))

df.orderBy("timestamp").show(10,False)
df.orderBy("unix_ts").show(10,False)

Output:

+---------+----------+--------------------------+----------+
|person_id|session_id|timestamp                 |unix_ts   |
+---------+----------+--------------------------+----------+
|a        |1         |2018-01-08 23:03:23.325359|1515474528|
|a        |2         |2018-01-09 00:03:23.325359|1515478128|
|a        |3         |2018-01-09 00:03:25.025240|1515477830|
|a        |4         |2018-01-09 00:03:27.025240|1515477832|
|a        |5         |2018-01-09 00:08:27.021240|1515478128|
|a        |6         |2018-01-09 03:03:27.025240|1515488632|
|a        |7         |2018-01-09 05:03:27.025240|1515495832|
+---------+----------+--------------------------+----------+

+---------+----------+--------------------------+----------+
|person_id|session_id|timestamp                 |unix_ts   |
+---------+----------+--------------------------+----------+
|a        |1         |2018-01-08 23:03:23.325359|1515474528|
|a        |3         |2018-01-09 00:03:25.025240|1515477830|
|a        |4         |2018-01-09 00:03:27.025240|1515477832|
|a        |5         |2018-01-09 00:08:27.021240|1515478128|
|a        |2         |2018-01-09 00:03:23.325359|1515478128|
|a        |6         |2018-01-09 03:03:27.025240|1515488632|
|a        |7         |2018-01-09 05:03:27.025240|1515495832|
+---------+----------+--------------------------+----------+

Is this a bug, or am I doing/implementing something incorrectly?

Also, you can see that 2018-01-09 00:03:23.325359 and 2018-01-09 00:08:27.021240 produce the same unix_timestamp of 1515478128.

1 answer:

Answer 0 (score: 0)

The problem seems to be that Spark's unix_timestamp uses Java's SimpleDateFormat internally to parse dates, and SimpleDateFormat does not support microseconds (see, for example, here). With a pattern like "yyyy-MM-dd HH:mm:ss.SSSSSS", the "S" field is interpreted as milliseconds, so a fraction such as ".325359" is read as 325359 ms (about 325 seconds) and added onto the parsed time, which is why the derived values fall out of order. Furthermore, unix_timestamp returns a long, so its granularity is only seconds.
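
To see this concretely, here is a small sketch (assuming a SparkSession named spark, the same session timezone as the question, and a Spark version where the fractional pattern parses at all; per the side note at the end, Spark 2.2 returns null instead) that parses one of the timestamps with and without its fractional part. The two results should differ by roughly 325 seconds:

from pyspark.sql import functions as f

# Compare parsing with and without the fractional part. Under SimpleDateFormat
# the "S" field means milliseconds, so ".325359" shifts the parsed value
# forward by about 325 seconds.
spark.createDataFrame([("2018-01-09 00:03:23.325359",)], ["ts"]) \
    .select(
        f.unix_timestamp("ts", "yyyy-MM-dd HH:mm:ss.SSSSSS").alias("with_fraction"),
        f.unix_timestamp(f.substring("ts", 1, 19), "yyyy-MM-dd HH:mm:ss").alias("seconds_only"),
    ).show()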

One workaround is to parse the timestamp without the microsecond portion and then add the microseconds back in separately:

from pyspark.sql import functions as f

df = spark.createDataFrame([
        ("a", "1", "2018-01-08 23:03:23.325359"),
        ("a", "2", "2018-01-09 00:03:23.325359"),
        ("a", "3", "2018-01-09 00:03:25.025240"),
        ("a", "4", "2018-01-09 00:03:27.025240"),
        ("a", "5", "2018-01-09 00:08:27.021240"),
        ("a", "6", "2018-01-09 03:03:27.025240"),
        ("a", "7", "2018-01-09 05:03:27.025240"),
], ["person_id", "session_id", "timestamp"])

# parse the timestamp only up to the seconds place
df = df.withColumn("unix_ts_sec", f.unix_timestamp(f.substring(f.col("timestamp"), 1, 19), "yyyy-MM-dd HH:mm:ss"))
# extract the microseconds (characters 21-26 of the string) as an integer
df = df.withColumn("microsec", f.substring(f.col("timestamp"), 21, 6).cast('int'))
# add them back to get the full epoch time, accurate to the microsecond
df = df.withColumn("unix_ts", f.col("unix_ts_sec") + 1e-6 * f.col("microsec"))

Side note: I don't have easy access to Spark 2.1, but with Spark 2.2 the unix_ts values come out null with the code as originally written, so you seem to be hitting some Spark 2.1 bug that produces these nonsense timestamps instead.