Efficient way to compute a column iteratively

Asked: 2017-09-12 18:14:56

Tags: pyspark apache-spark-sql spark-dataframe window-functions pyspark-sql

I have the following code, which produces the df shown below:

import datetime

import pyspark.sql.functions as psf
from pyspark.sql import Window
from pyspark.sql.functions import unix_timestamp
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

# Columns: customid, procid, speed, wait, timestamp
l = [('CM1', 'aa1', 3.0, None, datetime.datetime(2017, 5, 30, 20, 0, 1)),
     ('CM1', 'aa1', None, 0.1, datetime.datetime(2017, 5, 30, 20, 0, 4)),
     ('CM1', 'aa1', None, 0.2, datetime.datetime(2017, 5, 30, 20, 0, 8)),
     ('CM1', 'aa1', None, 0.3, datetime.datetime(2017, 5, 30, 20, 0, 12)),
     ('CM1', 'aa1', None, 0.4, datetime.datetime(2017, 5, 30, 20, 0, 30)),
     ('CM1', 'aa1', None, 0.0, datetime.datetime(2017, 5, 30, 20, 0, 33)),
     ('CM1', 'aa1', 2.0, None, datetime.datetime(2017, 5, 30, 20, 0, 37)),
     ('CM1', 'aa1', None, 0.1, datetime.datetime(2017, 5, 30, 20, 0, 39)),
     ('CM1', 'aa1', None, 0.0, datetime.datetime(2017, 5, 30, 20, 0, 39)),
     ('CM1', 'aa1', None, 0.2, datetime.datetime(2017, 5, 30, 20, 0, 49)),
     ('CM1', 'aa1', None, 0.8, datetime.datetime(2017, 5, 30, 20, 0, 55)),
     ('CM1', 'aa1', 4.0, None, datetime.datetime(2017, 5, 30, 20, 0, 59))
     ]

schema = StructType([StructField('customid', StringType(), True),
                     StructField('procid', StringType(), True),
                     StructField('speed', DoubleType(), True),
                     StructField('wait', DoubleType(), True),
                     StructField('timestamp', TimestampType(), True)]
                     )

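# sc and sqlContext are assumed to be the objects predefined by the pyspark shell.
rdd = sc.parallelize(l)

df = sqlContext.createDataFrame(rdd, schema)

# Timestamp as seconds since the epoch.
df = df.withColumn('u_ts', unix_timestamp(df.timestamp))

# One window per procid, ordered by time.
w = Window.partitionBy(df['procid']).orderBy(df['timestamp'].asc())

# Seconds elapsed since the previous row of the same procid.
df = df.withColumn('delay', df.u_ts - psf.lag(df.u_ts, 1).over(w))

df.show()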

+--------+------+-----+----+-------------------+----------+-----+
|customid|procid|speed|wait|          timestamp|      u_ts|delay|
+--------+------+-----+----+-------------------+----------+-----+
|     CM1|   aa1|  3.0|null|2017-05-30 20:00:01|1496167201| null|
|     CM1|   aa1| null| 0.1|2017-05-30 20:00:04|1496167204|    3|
|     CM1|   aa1| null| 0.2|2017-05-30 20:00:08|1496167208|    4|
|     CM1|   aa1| null| 0.3|2017-05-30 20:00:12|1496167212|    4|
|     CM1|   aa1| null| 0.4|2017-05-30 20:00:30|1496167230|   18|
|     CM1|   aa1| null| 0.0|2017-05-30 20:00:33|1496167233|    3|
|     CM1|   aa1|  2.0|null|2017-05-30 20:00:37|1496167237|    4|
|     CM1|   aa1| null| 0.1|2017-05-30 20:00:39|1496167239|    2|
|     CM1|   aa1| null| 0.0|2017-05-30 20:00:39|1496167239|    0|
|     CM1|   aa1| null| 0.2|2017-05-30 20:00:49|1496167249|   10|
|     CM1|   aa1| null| 0.8|2017-05-30 20:00:55|1496167255|    6|
|     CM1|   aa1|  4.0|null|2017-05-30 20:00:59|1496167259|    4|
+--------+------+-----+----+-------------------+----------+-----+

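The goal is to compute and fill in every speed entry that is null, according to the following rule (s, w and d refer to the speed, wait and delay columns, indexed by row):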

+--------+------+-----------------+----+-------------------+----------+-----+
|customid|procid|            speed|wait|          timestamp|      u_ts|delay|
+--------+------+-----------------+----+-------------------+----------+-----+
|     CM1|   aa1|              3.0|null|2017-05-30 20:00:01|1496167201| null|
|     CM1|   aa1|   s[0]+w[1]*d[1]| 0.1|2017-05-30 20:00:04|1496167204|    3|
|     CM1|   aa1|   s[1]+w[2]*d[2]| 0.2|2017-05-30 20:00:08|1496167208|    4|
|     CM1|   aa1|   s[2]+w[3]*d[3]| 0.3|2017-05-30 20:00:12|1496167212|    4|
|     CM1|   aa1|   s[3]+w[4]*d[4]| 0.4|2017-05-30 20:00:30|1496167230|   18|
|     CM1|   aa1|   s[4]+w[5]*d[5]| 0.0|2017-05-30 20:00:33|1496167233|    3|
|     CM1|   aa1|              2.0|null|2017-05-30 20:00:37|1496167237|    4|
|     CM1|   aa1|   s[6]+w[7]*d[7]| 0.1|2017-05-30 20:00:39|1496167239|    2|
|     CM1|   aa1|   s[7]+w[8]*d[8]| 0.0|2017-05-30 20:00:39|1496167239|    0|
|     CM1|   aa1|   s[8]+w[9]*d[9]| 0.2|2017-05-30 20:00:49|1496167249|   10|
|     CM1|   aa1| s[9]+w[10]*d[10]| 0.8|2017-05-30 20:00:55|1496167255|    6|
|     CM1|   aa1|              4.0|null|2017-05-30 20:00:59|1496167259|    4|
+--------+------+-----------------+----+-------------------+----------+-----+
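
To make the rule concrete, here is a plain-Python sketch (no Spark involved) that applies s[i] = s[i-1] + w[i]*d[i] to the sample rows. The three lists are copied from the table above; `fill_speed` is just an illustrative helper name, and the sketch assumes the first row of a group always has a measured speed.

```python
# Sample columns copied from the table above (None marks a null).
speed = [3.0, None, None, None, None, None, 2.0, None, None, None, None, 4.0]
wait = [None, 0.1, 0.2, 0.3, 0.4, 0.0, None, 0.1, 0.0, 0.2, 0.8, None]
delay = [None, 3, 4, 4, 18, 3, 4, 2, 0, 10, 6, 4]


def fill_speed(speed, wait, delay):
    """Fill null speeds with s[i] = s[i-1] + w[i] * d[i], in row order."""
    filled = list(speed)
    for i, s in enumerate(filled):
        if s is None:
            # Assumption: row 0 is never null, so filled[i - 1] is always known.
            filled[i] = filled[i - 1] + wait[i] * delay[i]
    return filled


filled = fill_speed(speed, wait, delay)
# Rounded to one decimal to hide floating-point noise.
print([round(s, 1) for s in filled])
```

Each filled value depends on the one just before it, which is why a single `lag` over the original column cannot fill a whole run of nulls at once.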

I implemented the solution as follows:

# Repeat the fill: each pass can only build on speeds known after the previous pass.
for i in range(5):
    df = df.withColumn(
        'speed',
        psf.when(df.speed.isNull(),
                 df.wait * df.delay + psf.lag(df.speed, 1).over(w))
           .otherwise(df.speed))

df.show()

The result is fine:

+--------+------+-----+----+-------------------+----------+-----+
|customid|procid|speed|wait|          timestamp|      u_ts|delay|
+--------+------+-----+----+-------------------+----------+-----+
|     CM1|   aa1|  3.0|null|2017-05-30 20:00:01|1496167201| null|
|     CM1|   aa1|  3.3| 0.1|2017-05-30 20:00:04|1496167204|    3|
|     CM1|   aa1|  4.1| 0.2|2017-05-30 20:00:08|1496167208|    4|
|     CM1|   aa1|  5.3| 0.3|2017-05-30 20:00:12|1496167212|    4|
|     CM1|   aa1| 12.5| 0.4|2017-05-30 20:00:30|1496167230|   18|
|     CM1|   aa1| 12.5| 0.0|2017-05-30 20:00:33|1496167233|    3|
|     CM1|   aa1|  2.0|null|2017-05-30 20:00:37|1496167237|    4|
|     CM1|   aa1|  2.2| 0.1|2017-05-30 20:00:39|1496167239|    2|
|     CM1|   aa1|  2.2| 0.0|2017-05-30 20:00:39|1496167239|    0|
|     CM1|   aa1|  4.2| 0.2|2017-05-30 20:00:49|1496167249|   10|
|     CM1|   aa1|  9.0| 0.8|2017-05-30 20:00:55|1496167255|    6|
|     CM1|   aa1|  4.0|null|2017-05-30 20:00:59|1496167259|    4|
+--------+------+-----+----+-------------------+----------+-----+
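
Why `range(5)` happens to be enough here can be checked with a small plain-Python model of one loop iteration. This is only a sketch of the `when`/`lag` logic, not Spark itself: `one_pass` is a hypothetical helper, and the lists are copied from the tables above.

```python
# Sample columns copied from the tables above (None marks a null).
speed = [3.0, None, None, None, None, None, 2.0, None, None, None, None, 4.0]
wait = [None, 0.1, 0.2, 0.3, 0.4, 0.0, None, 0.1, 0.0, 0.2, 0.8, None]
delay = [None, 3, 4, 4, 18, 3, 4, 2, 0, 10, 6, 4]


def one_pass(old, wait, delay):
    """Model of one loop iteration: lag(speed, 1) reads the column as it was
    before this pass, so a null is filled only if the previous row is known."""
    new = list(old)
    for i in range(1, len(old)):
        if old[i] is None and old[i - 1] is not None:
            new[i] = old[i - 1] + wait[i] * delay[i]
    return new


passes = 0
current = speed
while None in current:
    current = one_pass(current, wait, delay)
    passes += 1

# Number of passes needed equals the longest run of consecutive nulls.
print(passes)
```

In this model every pass fills only the first remaining null of each run, so the iteration count has to match the longest run of nulls in any procid group (5 in the sample). With longer runs in real data, a fixed `range(5)` would leave nulls behind, and each extra iteration adds another window evaluation to the query plan.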

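It does run on a few hundred procid groups, but processing is very slow. Is this the right way to implement the solution, or does it waste computing power?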

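I'm not sure what happens with the loop and the if statement: does it operate only on the window, or is every row of the df affected as a whole by the withColumn / case expression?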

0 answers:

No answers