Why does Spark (Python) swallow my milliseconds?

Time: 2018-10-12 17:34:08

Tags: python apache-spark pyspark

I have timestamps with millisecond precision and need to convert them from system time to UTC. Anyway... when the conversion happens, Spark swallows my milliseconds and shows them as zeros.

A short example:

from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import to_timestamp, date_format

spark = SparkSession.builder.getOrCreate()

# Parse the string into a timestamp column, then format it back to a string.
test = spark.createDataFrame([Row(timestamp="2018-03-24 14:37:12,133")])
test_2 = test.withColumn('timestamp_2', to_timestamp('timestamp', 'yyyy-MM-dd HH:mm:ss,SSS'))
test_3 = test_2.withColumn('timestamp_3', date_format('timestamp_2', 'yyyy-MM-dd HH:mm:ss,SSS'))
test_3.write.option('header', True).csv('something')

This produces:

timestamp,timestamp_2,timestamp_3
"2018-03-24 14:37:12,133",2018-03-24T14:37:12.000+01:00,"2018-03-24 14:37:12,000"

Is there a way to keep the milliseconds?

I am using Python 3.6.4 and Spark 2.3.2.

1 Answer:

Answer 0 (score: 1):

Managed to get it working now. Since Spark does not seem to handle the milliseconds correctly, I defined a UDF that uses the pytz and datetime packages to convert the string to a datetime, change the timezone, and then print it as a string again.

import pytz
from datetime import datetime
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

def convert_to_utc(timestamp):
    # Parse the string (including milliseconds), attach the local timezone,
    # convert to UTC, and format it back to a string with milliseconds.
    local = pytz.timezone("Arctic/Longyearbyen")
    naive = datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S,%f')
    local_dt = local.localize(naive, is_dst=None)
    utc_dt = local_dt.astimezone(pytz.utc)
    return utc_dt.strftime('%Y-%m-%d %H:%M:%S,%f')[:-3]

convert_to_utc_udf = udf(lambda timestamp: convert_to_utc(timestamp), StringType())

test = spark.createDataFrame([Row(timestamp="2018-03-24 14:37:12,133")])
test_2 = test.withColumn('timestamp_2', convert_to_utc_udf('timestamp'))
test_2.write.option('header', True).csv('something')

#Output:
#timestamp,timestamp_2
#"2018-03-24 14:37:12,133","2018-03-24 13:37:12,133"

Inspired by:

How to convert a string column with milliseconds to a timestamp with milliseconds in Spark 2.1 using Scala?

and:

How do I convert local time to UTC in Python?
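
For reference, the first linked question describes a way to keep the milliseconds without a Python UDF: parse the whole seconds with unix_timestamp and add the millisecond fraction back before casting to a timestamp. Below is a rough PySpark sketch of that idea, not the answer author's code; it assumes the milliseconds are always the last three characters of the string (after a comma), and the df / timestamp_ms names are only for illustration. With Spark 2.x's SimpleDateFormat-based parsing, the trailing ",133" is simply ignored by the 'yyyy-MM-dd HH:mm:ss' pattern.

from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import substring, unix_timestamp

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([Row(timestamp="2018-03-24 14:37:12,133")])

# Parse the whole-second part of the string, then add the millisecond
# fraction back before casting to TimestampType, so it is not truncated.
df_ms = df.withColumn(
    'timestamp_ms',
    (unix_timestamp('timestamp', 'yyyy-MM-dd HH:mm:ss') +
     substring('timestamp', -3, 3).cast('double') / 1000).cast('timestamp')
)
df_ms.show(truncate=False)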
