My input:
+-------------------+------------------+-----------------+
| TransactionDate|Product Major Code|Gross Trade Sales|
+-------------------+------------------+-----------------+
|2017-09-30 00:00:00| A| 100.0|
|2017-06-30 00:00:00| B| 200.0|
|2017-06-30 00:00:00| C| 300.0|
+-------------------+------------------+-----------------+
My code:
df.registerTempTable("tmp")
df2=spark.sql("SELECT TransactionDate,'Product Major Code', sum('Gross Trade Sales') FROM tmp GROUP BY TransactionDate,'Product Major Code'")
spark.catalog.dropTempView('tmp')
My output:
+-------------------+------------------+--------------------------------------+
| TransactionDate|Product Major Code|sum(CAST(Gross Trade Sales AS DOUBLE))|
+-------------------+------------------+--------------------------------------+
|2017-09-30 00:00:00|Product Major Code| null|
|2017-06-30 00:00:00|Product Major Code| null|
+-------------------+------------------+--------------------------------------+
Does anyone know why Product Major Code and Gross Trade Sales are not being aggregated correctly?
Update:
In the end I went with pault's answer below, since it is more elegant and you don't have to worry about backticks:
import pyspark.sql.functions as f
trydf.groupBy(f.col("TransactionDate"), f.col("Product Major Code")).agg(f.sum(f.col("Gross Trade Sales"))).show()
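For reference, if you want to stay with the SQL route, here is a minimal sketch of the corrected query with the space-containing column names wrapped in backticks (assuming the same spark session and the same df/tmp table as in the question; the total_sales alias is only added here for readability):
df.registerTempTable("tmp")
# backticks quote identifiers, so the names are treated as columns, not string literals
df2 = spark.sql("""
    SELECT TransactionDate,
           `Product Major Code`,
           sum(`Gross Trade Sales`) AS total_sales
    FROM tmp
    GROUP BY TransactionDate, `Product Major Code`
""")
spark.catalog.dropTempView("tmp")
df2.show()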
Answer 0 (score: -1)
As people have noted in the comments, you need to wrap column names that contain spaces in backticks. In your query the single quotes turn 'Product Major Code' and 'Gross Trade Sales' into string literals, so every row is grouped under the same constant and sum() tries to cast the literal text to a double, which is why you see the column name repeated and null for the sum.
Working example:
>>> df = sqlContext.createDataFrame([("A",100.0 ), ("A",200.0 ), ("B",500.0 ), ("C", 1000.0)], ["agg_key","value to sum"])
>>> df.registerTempTable("example")
>>> sqlContext.sql("SELECT agg_key, sum(`value to sum`) as sum_val FROM example GROUP BY agg_key").show()
+-------+-------+
|agg_key|sum_val|
+-------+-------+
| A| 300.0|
| B| 500.0|
| C| 1000.0|
+-------+-------+
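If you prefer the DataFrame API instead (as in the update above), the same aggregation needs no backticks at all, since f.col and plain string column names handle spaces directly. A sketch against the same example DataFrame, which should yield the same sums as the SQL version (the sum_val alias just mirrors the query above):
>>> import pyspark.sql.functions as f
>>> df.groupBy("agg_key").agg(f.sum(f.col("value to sum")).alias("sum_val")).show()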