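PySpark: adding a new column after aggregation

Time: 2019-04-11 14:30:41

Tags: python apache-spark dataframe pyspark

I am trying to group by several columns and take the minimum of one column, then use that minimum to compute the difference from a fixed date. However, when I use the minimum date column, I get the following error: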

raise AnalysisException(s.split(': ', 1)[1], stackTrace)
 pyspark.sql.utils.AnalysisException: u"grouping expressions sequence is empty, and 'table.`lbrnm`' is not an aggregate function. Wrap '(datediff(CAST(CAST('2019-02-28 01:00:00' AS TIMESTAMP) AS DATE), CAST(CAST(concat(CASE WHEN (CAST((CAST(`min_date` AS DECIMAL(7,0)) / CAST(CAST(1000000 AS DECIMAL(7,0)) AS DECIMAL(7,0))) AS DOUBLE) = CAST('1' AS DOUBLE)) THEN '20' ELSE '' END, CASE WHEN (CAST((CAST(`min_date` AS DECIMAL(7,0)) / CAST(CAST(1000000 AS DECIMAL(7,0)) AS DECIMAL(7,0))) AS DOUBLE) = CAST('0' AS DOUBLE)) THEN '19' ELSE '' END, substring(substring(CAST(min(table.lspf.`lsdte`) AS STRING), 0, 3), -2, 2),

Here is my code:

    j = (lspf_ret
         .groupBy(col("lsbrnm"), col("lsdlp"), col("lsdlr"))
         .agg(min(col('lsdte'))))

    (j.select('lsbrnm', 'lsdlp', 'lsdlr', col('min(lsdte)').alias('min_date'))
      .select('lsbrnm',
              'lsdlp',
              'lsdlr',
              'min_date',
              datediff(lit('2019-02-28 01:00:00').cast(TimestampType()),
                       concat(when(col("min_date")/1000000 == '1', '20').otherwise(''),
                              when(col("min_date")/1000000 == '0', '19').otherwise(''),
                              right(left(min(lspf.lsdte).cast(StringType()), 3), 2),
                              lit('-'),
                              left(right(min(lspf.lsdte).cast(StringType()), 4), 2),
                              lit('-'),
                              right(min(lspf.lsdte).cast(StringType()), 2),
                              lit(' 00:00:00')).cast(TimestampType()))))

Here is the aggregated output:

    |lsbrnm|lsdlp|        lsdlr|min(lsdte)|
    +------+-----+-------------+----------+
    |  2266|  EF4| 171001370957|   1190201|
    |  2266|  EF4| 131201027045|   1171130|
    |  2266|  EF4| 140901072492|   1170301|
    |  2266|  EF4| 160901268734|   1180925|
    |  2266|  EF4| 161101289209|   1170929|
    |  2266|  EA4| 18501424940R|   1190220|

Here is the desired output:

    |lsbrnm|lsdlp|        lsdlr|  min_date|difference|
    +------+-----+-------------+----------+----------+
    |  2266|  EF4| 171001370957|   1190201|        27|
    |  2266|  EF4| 131201027045|   1171130|       275|
    |  2266|  EF4| 140901072492|   1170301|         1|
    |  2266|  EF4| 160901268734|   1180925|       209|
    |  2266|  EF4| 161101289209|   1170929|       213|
    |  2266|  EA4| 18501424940R|   1190220|         8|

Sample of lspf.lsdte:

      lsdte
      +------+
      1190201
      1171130
      1170301
      1180925
      1170929
      1190220
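For reference, the values above look like seven-digit CYYMMDD integers. This is an assumption read off the `concat` logic in the code (a leading 1 maps to '20', a leading 0 to '19'), not something the post states outright. The plain-Python sketch below decodes that assumed format and computes the day difference from 2019-02-28, which makes it easy to check the intended arithmetic outside Spark. The helper name `cyymmdd_to_date` is hypothetical. It uses integer division to isolate the century digit, since a plain division such as 1190201 / 1000000 gives 1.190201, which would not compare equal to 1.

```python
from datetime import date


def cyymmdd_to_date(value: int) -> date:
    """Decode a 7-digit CYYMMDD integer (assumed format) into a date."""
    # Split off the century digit: 1190201 -> (1, 190201)
    century_flag, rest = divmod(value, 1_000_000)
    if century_flag not in (0, 1):
        raise ValueError(f"unexpected century digit in {value}")
    # Split the remainder into year, month, day: 190201 -> 19, 2, 1
    yy, mmdd = divmod(rest, 10_000)
    mm, dd = divmod(mmdd, 100)
    # Assumption taken from the question's code: 0 -> 19xx, 1 -> 20xx
    return date(1900 + 100 * century_flag + yy, mm, dd)


reference = date(2019, 2, 28)
for raw in (1190201, 1190220):
    decoded = cyymmdd_to_date(raw)
    print(raw, decoded, (reference - decoded).days)
# prints:
# 1190201 2019-02-01 27
# 1190220 2019-02-20 8
```

These two results agree with the 27 and 8 shown in the desired output. Separately, the error message names `min(table.lspf.lsdte)` inside the `datediff` expression, so building the date in the second `select` from the already aggregated `min_date` column, rather than calling `min` again there, may be enough to avoid the "grouping expressions sequence is empty" error.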

Where am I going wrong?

0 answers:

No answers