Optimizing a cumulative sum over a large number of columns in pyspark

Time: 2018-12-18 18:56:31

Tags: pyspark pyspark-sql

I have a DataFrame with 752 columns (id, date, and 750 feature columns) and about 1.5 million rows, and I need to apply a cumulative sum to all 750 feature columns, partitioned by id and ordered by date.

Here is the approach I'm currently using:


I run into errors when running this current approach:

# required imports
from pyspark.sql.window import Window
from pyspark.sql.functions import col, sum  # pyspark's sum shadows the builtin here

# putting all 750 feature columns in a list
required_columns = ['ts_1', 'ts_2', ..., 'ts_750']

# defining window
sumwindow = Window.partitionBy('id').orderBy('date')

# Applying the window to calculate the cumulative sum of each individual feature column
for current_col in required_columns:
    new_col_name = "sum_{0}".format(current_col)
    df = df.withColumn(new_col_name, sum(col(current_col)).over(sumwindow))

# Saving the result into a parquet file
df.write.format('parquet').save(output_path)

Please let me know of alternative solutions. Cumulative sums seem to get a bit tricky with this much data. Please suggest any alternative approach, or any Spark configuration I could tune, to make this work.

1 Answer:

Answer 0 (score: 1)

I expect you are running into the problem of too large a lineage. Take a look at your explain plan after you reassign the DataFrame that many times.
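To see how large the plan has grown, you can print it after the loop; a minimal sketch, assuming the df from the question above:

# Print the physical plan; after hundreds of withColumn calls it will be very long
df.explain()

# Extended output additionally shows the parsed, analyzed, and optimized logical plans
df.explain(True)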

The standard solution for this is to checkpoint the DataFrame regularly to truncate the explain plan. This is somewhat like caching, but for the plan rather than the data, and it is often needed for iterative algorithms that keep modifying a DataFrame.

Here is a good pyspark explanation of caching and checkpointing.

I would suggest starting with df.checkpoint() every 5 to 10 modifications.
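A minimal sketch of what that could look like in the loop from the question, assuming an active SparkSession named spark; the checkpoint directory path is only an example and should point at durable storage (e.g. HDFS) in practice:

from pyspark.sql.window import Window
from pyspark.sql.functions import col, sum

# DataFrame.checkpoint() requires a checkpoint directory to be set first
spark.sparkContext.setCheckpointDir('/tmp/spark-checkpoints')

sumwindow = Window.partitionBy('id').orderBy('date')

for i, current_col in enumerate(required_columns, start=1):
    new_col_name = "sum_{0}".format(current_col)
    df = df.withColumn(new_col_name, sum(col(current_col)).over(sumwindow))
    if i % 10 == 0:
        # materialize the data and truncate the lineage every 10 columns
        df = df.checkpoint()

df.write.format('parquet').save(output_path)

The checkpoint interval is a trade-off: checkpointing more often costs extra writes, while checkpointing less often lets the plan grow larger between truncations.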

Let us know how it goes.