Why does Spark (Python) swallow my milliseconds?

Time: 2018-10-12 17:34:08

Tags: python apache-spark pyspark

I have timestamps with millisecond precision and need to convert them from system time to UTC. Anyway... when the conversion happens, Spark swallows my milliseconds and shows them as zeros.

A short example:

from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import to_timestamp, date_format

spark = SparkSession.builder.getOrCreate()

# Parse the string into a timestamp column, then format it back to a string.
test = spark.createDataFrame([Row(timestamp="2018-03-24 14:37:12,133")])
test_2 = test.withColumn('timestamp_2', to_timestamp('timestamp', 'yyyy-MM-dd HH:mm:ss,SSS'))
test_3 = test_2.withColumn('timestamp_3', date_format('timestamp_2', 'yyyy-MM-dd HH:mm:ss,SSS'))
test_3.write.option('header', True).csv('something')

This produces:

timestamp,timestamp_2,timestamp_3
"2018-03-24 14:37:12,133",2018-03-24T14:37:12.000+01:00,"2018-03-24 14:37:12,000"

Is there a way to keep the milliseconds?

I am using Python 3.6.4 and Spark 2.3.2.

1 Answer:

Answer 0 (score: 1):

Managed to get it working now. Since Spark does not seem to handle the milliseconds correctly, I defined a UDF that uses the pytz and datetime packages to convert the string to a datetime, change the timezone, and then print it as a string again.

import pytz
from datetime import datetime
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

def convert_to_utc(timestamp):
    # Parse the string (including milliseconds), attach the local timezone,
    # convert to UTC, and format it back to a string with milliseconds.
    local = pytz.timezone("Arctic/Longyearbyen")
    naive = datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S,%f')
    local_dt = local.localize(naive, is_dst=None)
    utc_dt = local_dt.astimezone(pytz.utc)
    return utc_dt.strftime('%Y-%m-%d %H:%M:%S,%f')[:-3]

convert_to_utc_udf = udf(lambda timestamp: convert_to_utc(timestamp), StringType())

test = spark.createDataFrame([Row(timestamp="2018-03-24 14:37:12,133")])
test_2 = test.withColumn('timestamp_2', convert_to_utc_udf('timestamp'))
test_2.write.option('header', True).csv('something')

#Output:
#timestamp,timestamp_2
#"2018-03-24 14:37:12,133","2018-03-24 13:37:12,133"

Inspired by:

How to convert a string column with milliseconds to a timestamp with milliseconds in Spark 2.1 using Scala?

and:

How do I convert local time to UTC in Python?
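
For reference, the first linked question describes a way to keep the milliseconds without a Python UDF: parse the whole seconds with unix_timestamp and add the millisecond fraction back before casting to a timestamp. Below is a rough PySpark sketch of that idea, not the answer author's code; it assumes the milliseconds are always the last three characters of the string (after a comma), and the df / timestamp_ms names are only for illustration. With Spark 2.x's SimpleDateFormat-based parsing, the trailing ",133" is simply ignored by the 'yyyy-MM-dd HH:mm:ss' pattern.

from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import substring, unix_timestamp

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([Row(timestamp="2018-03-24 14:37:12,133")])

# Parse the whole-second part of the string, then add the millisecond
# fraction back before casting to TimestampType, so it is not truncated.
df_ms = df.withColumn(
    'timestamp_ms',
    (unix_timestamp('timestamp', 'yyyy-MM-dd HH:mm:ss') +
     substring('timestamp', -3, 3).cast('double') / 1000).cast('timestamp')
)
df_ms.show(truncate=False)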
