I have timestamps with millisecond precision and need to convert them from system time to UTC. Anyway... when I do the conversion, Spark swallows my milliseconds and shows them as zeros.
Short example:
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import to_timestamp, date_format
# createDataFrame lives on SparkSession, not SparkContext
spark = SparkSession.builder.getOrCreate()
test = spark.createDataFrame([Row(timestamp="2018-03-24 14:37:12,133")])
test_2 = test.withColumn('timestamp_2', to_timestamp('timestamp', 'yyyy-MM-dd HH:mm:ss,SSS'))
test_3 = test_2.withColumn('timestamp_3', date_format('timestamp_2', 'yyyy-MM-dd HH:mm:ss,SSS'))
test_3.write.option('header', True).csv('something')
This produces:
timestamp,timestamp_2,timestamp_3
"2018-03-24 14:37:12,133",2018-03-24T14:37:12.000+01:00,"2018-03-24 14:37:12,000"
Is there any way to keep the milliseconds?
I am using Python 3.6.4 and Spark 2.3.2.
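For comparison, plain Python's datetime parses and re-prints the comma-separated millisecond format losslessly, so the precision loss happens inside Spark's timestamp handling, not in the string format itself. A minimal check:

```python
from datetime import datetime

# %f accepts the fractional seconds after the comma.
dt = datetime.strptime("2018-03-24 14:37:12,133", "%Y-%m-%d %H:%M:%S,%f")
print(dt.microsecond)  # 133000

# %f prints microseconds, so trim the last 3 digits to get milliseconds back.
print(dt.strftime("%Y-%m-%d %H:%M:%S,%f")[:-3])  # 2018-03-24 14:37:12,133
```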
Answer 0 (score: 1)
Managed to get it working now. Since Spark doesn't seem to handle milliseconds correctly, I defined a UDF that uses the pytz and datetime packages to convert the string to a datetime, change the timezone, and then print the string again.
import pytz
from datetime import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from pyspark.sql import Row, SparkSession
# createDataFrame lives on SparkSession, not SparkContext
spark = SparkSession.builder.getOrCreate()
def convert_to_utc(timestamp):
    local = pytz.timezone("Arctic/Longyearbyen")
    naive = datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S,%f')
    local_dt = local.localize(naive, is_dst=None)
    utc_dt = local_dt.astimezone(pytz.utc)
    # %f prints microseconds; trim to milliseconds
    return utc_dt.strftime('%Y-%m-%d %H:%M:%S,%f')[:-3]
convert_to_utc_udf = udf(lambda timestamp: convert_to_utc(timestamp), StringType())
test = spark.createDataFrame([Row(timestamp = "2018-03-24 14:37:12,133")])
test_2 = test.withColumn('timestamp_2', convert_to_utc_udf('timestamp'))
test_2.write.option('header', True).csv('something')
#Output:
#timestamp,timestamp_2
#"2018-03-24 14:37:12,133","2018-03-24 13:37:12,133"
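The timezone math itself doesn't strictly require pytz. A stdlib-only sketch of the same steps (the name convert_to_utc_stdlib and the fixed UTC+1 offset are my assumptions; Arctic/Longyearbyen observes CET, UTC+1, on 2018-03-24):

```python
from datetime import datetime, timedelta, timezone

def convert_to_utc_stdlib(ts, offset_hours=1):
    # Parse the comma-separated millisecond format from the question.
    naive = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S,%f")
    # Attach a fixed local offset (UTC+1 here, unlike pytz this ignores DST rules).
    local_dt = naive.replace(tzinfo=timezone(timedelta(hours=offset_hours)))
    # Convert to UTC; %f prints microseconds, so trim the last 3 digits.
    return local_dt.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S,%f")[:-3]

print(convert_to_utc_stdlib("2018-03-24 14:37:12,133"))  # 2018-03-24 13:37:12,133
```

The trade-off is that a fixed offset does not follow DST transitions, which is exactly what pytz's localize(..., is_dst=None) handles (raising an error on ambiguous times), so the pytz version above is safer for timestamps spread across the year.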
Inspired by:
and: