PySpark - filling nulls using earlier non-null values

Asked: 2018-04-06 12:43:55

Tags: pyspark spark-dataframe

I have data like this:

from pyspark.sql.types import StructType, StructField, StringType

PeopleCountTestSchema = StructType([
    StructField("building", StringType(), True),
    StructField("date_created", StringType(), True),
    StructField("hour", StringType(), True),
    StructField("wirelesscount", StringType(), True),
    StructField("rundate", StringType(), True)])

df = spark.read.csv("wasb://reftest@refdev.blob.core.windows.net/Praneeth/HVAC/PeopleCount_test/",
                    schema=PeopleCountTestSchema, sep=",")
df.createOrReplaceTempView('Test')

|building|date_created|hour|wirelesscount|
+--------+------------+----+-------------+
|36      |2017-01-02  |0   |35           |
|36      |2017-01-03  |0   |46           |
|36      |2017-01-04  |0   |32           |
|36      |2017-01-05  |0   |90           |
|36      |2017-01-06  |0   |33           |
|36      |2017-01-07  |0   |22           |
|36      |2017-01-08  |0   |11           |
|36      |2017-01-09  |0   |null         |
|36      |2017-01-10  |0   |null         |
|36      |2017-01-11  |0   |null         |
|36      |2017-01-12  |0   |null         |
|36      |2017-01-13  |0   |null         |

This needs to be transformed into:

|building|date_created|hour|wirelesscount|
+--------+------------+----+-------------+
|36      |2017-01-02  |0   |35           |
|36      |2017-01-03  |0   |46           |
|36      |2017-01-04  |0   |32           |
|36      |2017-01-05  |0   |90           |
|36      |2017-01-06  |0   |33           |
|36      |2017-01-07  |0   |22           |
|36      |2017-01-08  |0   |11           |
|36      |2017-01-09  |0   |35           |
|36      |2017-01-10  |0   |46           |
|36      |2017-01-11  |0   |32           |
|36      |2017-01-12  |0   |90           |
|36      |2017-01-13  |0   |33           |

Each null value needs to be replaced with the value from 7 rows earlier.

I tried using:

import sys
from pyspark.sql.functions import last
from pyspark.sql.window import Window

Test2 = df.withColumn("wirelesscount2", last('wirelesscount', True).over(
    Window.partitionBy('building', 'hour').orderBy('hour').rowsBetween(-sys.maxsize, -7)))

The resulting output is:

|building|date_created|hour|wirelesscount|rundate   |wirelesscount2|
+--------+------------+----+-------------+----------+--------------+
|36      |2017-01-02  |0   |35           |2017-04-01|null          |
|36      |2017-01-03  |0   |46           |2017-04-01|null          |
|36      |2017-01-04  |0   |32           |2017-04-01|null          |
|36      |2017-01-05  |0   |90           |2017-04-01|null          |
|36      |2017-01-06  |0   |33           |2017-04-01|null          |
|36      |2017-01-07  |0   |22           |2017-04-01|null          |
|36      |2017-01-08  |0   |11           |2017-04-01|null          |
|36      |2017-01-09  |0   |null         |2017-04-01|35            |
|36      |2017-01-10  |0   |null         |2017-04-01|46            |
|36      |2017-01-11  |0   |null         |2017-04-01|32            |
|36      |2017-01-12  |0   |null         |2017-04-01|90            |
|36      |2017-01-13  |0   |null         |2017-04-01|33            |

The nulls are now filled with the value from 7 rows back, but the first 7 rows, which previously had values, have become null.

Please let me know how this can be handled.

Thanks in advance!

1 Answer:

Answer 0 (score: 0):

You can accomplish this with coalesce: keep the original wirelesscount wherever it is non-null, and fall back to the window-derived wirelesscount2 otherwise.

from pyspark.sql.functions import coalesce

# Cast both columns from string to integer so the values are numeric.
Test2 = Test2.withColumn('wirelesscount', Test2.wirelesscount.cast('integer'))
Test2 = Test2.withColumn('wirelesscount2', Test2.wirelesscount2.cast('integer'))

# coalesce returns the first non-null argument: the original count where
# present, otherwise the value pulled from 7 rows back.
test3 = Test2.withColumn('wirelesscount3', coalesce(Test2.wirelesscount, Test2.wirelesscount2))
test3.show()
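
With the sample data above, wirelesscount3 keeps the original values for 2017-01-02 through 2017-01-08 and picks up the window-derived values 35, 46, 32, 90 and 33 for 2017-01-09 through 2017-01-13, which matches the desired output.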
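
For reference, the same fill can be written as one expression using lag instead of last over an unbounded frame. This is a minimal sketch, assuming rows within each building/hour group should be ordered by date_created; the names w, test4 and wirelesscount_filled are illustrative, not from the original post:

from pyspark.sql.functions import coalesce, col, lag
from pyspark.sql.window import Window

# Order rows by date within each building/hour group.
w = Window.partitionBy('building', 'hour').orderBy('date_created')

test4 = (df
         .withColumn('wirelesscount', col('wirelesscount').cast('integer'))
         # lag(..., 7) looks exactly 7 rows back; coalesce keeps the
         # original value whenever it is non-null.
         .withColumn('wirelesscount_filled',
                     coalesce(col('wirelesscount'),
                              lag('wirelesscount', 7).over(w))))
test4.show()

One caveat: if a run of nulls is ever longer than 7 rows, the lagged value can itself be null, which is where the last(..., True) approach from the question has an edge.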