How do I get the first day of the week from a date column in PySpark?

Asked: 2019-02-04 16:39:07

Tags: pyspark

My PySpark DataFrame has a regular timestamp column. I want to add a new column that holds the start date of the week each timestamp falls in.

1 answer:

Answer 0: (score: 1)

For Spark <= 2.2.0

Use this:

from pyspark.sql.functions import weekofyear, year, to_date, concat, lit, col
from pyspark.sql.session import SparkSession
from pyspark.sql.types import TimestampType

spark = SparkSession.builder.getOrCreate()

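# Build "<week number>/<year>" and parse it back into a date with the "w/yyyy" pattern.
# Note: which weekday the parsed date lands on depends on the JVM locale settings.
spark.createDataFrame([['2020-10-03 05:00:00']], schema=['timestamp']) \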
    .withColumn('timestamp', col('timestamp').astype(TimestampType())) \
    .withColumn('week', weekofyear('timestamp')) \
    .withColumn('year', year('timestamp')) \
    .withColumn('date_of_the_week', to_date(concat('week', lit('/'), 'year'), "w/yyyy")) \
    .show(truncate=False)

+-------------------+----+----+----------------+
|timestamp          |week|year|date_of_the_week|
+-------------------+----+----+----------------+
|2020-10-03 05:00:00|40  |2020|2020-09-27      |
+-------------------+----+----+----------------+

For Spark > 2.2.0

from pyspark.sql.functions import date_trunc, col
from pyspark.sql.session import SparkSession
from pyspark.sql.types import TimestampType

spark = SparkSession.builder.getOrCreate()

# date_trunc with format='week' returns the Monday of that week at midnight, as a timestamp.
spark.createDataFrame([['2020-10-03 05:00:00']], schema=['timestamp']) \
    .withColumn('timestamp', col('timestamp').astype(TimestampType())) \
    .withColumn('date_of_the_week', date_trunc(timestamp='timestamp', format='week')) \
    .show(truncate=False)

+-------------------+-------------------+
|timestamp          |date_of_the_week   |
+-------------------+-------------------+
|2020-10-03 05:00:00|2020-09-28 00:00:00|
+-------------------+-------------------+
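
The two snippets return different dates for the same input: 2020-09-27 from the first and 2020-09-28 from the second. The likely reason is that the `w/yyyy` parse used a Sunday-based week (a locale setting), while `date_trunc` with `week` truncates to Monday. The following plain-Python sketch, which uses only the standard library and a hypothetical helper named `week_start`, reproduces both results so the arithmetic can be checked without a Spark session.

```python
from datetime import date, timedelta


def week_start(d: date, first_weekday: int = 0) -> date:
    """Return the first day of the week that contains d.

    first_weekday follows date.weekday() numbering:
    0 = Monday, ..., 6 = Sunday.
    """
    # Number of days to step back to reach the chosen first weekday.
    offset = (d.weekday() - first_weekday) % 7
    return d - timedelta(days=offset)


sample = date(2020, 10, 3)  # a Saturday

monday_start = week_start(sample)     # week starting on Monday
sunday_start = week_start(sample, 6)  # week starting on Sunday

print(monday_start, sunday_start)
```

Which convention is correct depends on how weeks are defined in your reporting; it is worth choosing one explicitly rather than relying on locale defaults.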