I have a DataFrame with 752 columns (id, date, and 750 feature columns) and around 1.5 million rows, and I need to apply a cumulative sum to all 750 feature columns, partitioned by id and ordered by date.
Here is the approach I am currently using; when I run it, I run into errors:
# imports needed for the window spec and the aggregate functions
# (sum is imported under an alias so it does not shadow Python's built-in sum)
from pyspark.sql import Window
from pyspark.sql.functions import col, sum as sum_

# putting all 750 feature columns in a list (elided here as in the original)
required_columns = ['ts_1', 'ts_2', ..., 'ts_750']

# defining the window: partition by id, ordered by date
sumwindow = Window.partitionBy('id').orderBy('date')

# applying the window to calculate the cumulative sum of each individual feature column
for current_col in required_columns:
    new_col_name = "sum_{0}".format(current_col)
    df = df.withColumn(new_col_name, sum_(col(current_col)).over(sumwindow))

# saving the result into a parquet file
df.write.format('parquet').save(output_path)
Please let me know of any alternative solution. It seems cumulative sums are a bit tricky on this much data. Please suggest any alternative approach, or any Spark configuration I could tune, to make this work.
Answer 0 (score: 1)
I expect you are running into the problem of an overly large lineage. Take a look at your explain plan after you have reassigned the dataframe that many times.
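For example, you can print the full query plan after the loop with something along these lines (a minimal sketch; df is the dataframe from the question, and passing True simply asks for the extended output):

# prints the parsed, analyzed, optimized and physical plans; after 750
# withColumn calls this output becomes enormous, which is the symptom of
# an oversized lineage
df.explain(True)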
The standard solution for this is to checkpoint your dataframe regularly to truncate the explain plan. This is somewhat like caching, but for the plan rather than the data, and it is often needed for iterative algorithms that keep modifying a dataframe.
Here is a good PySpark explanation of caching and checkpointing.
I would suggest starting with df.checkpoint() every 5 to 10 modifications.
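A rough sketch of how that could fit into the loop from the question (assumptions on my part: the SparkSession named spark, the checkpoint directory path, and the interval of 5 are all placeholders to adjust for your environment):

from pyspark.sql import Window
from pyspark.sql.functions import col, sum as sum_

# checkpointing materializes the dataframe to reliable storage, so a
# checkpoint directory has to be configured first (example path only)
spark.sparkContext.setCheckpointDir('/tmp/spark-checkpoints')

sumwindow = Window.partitionBy('id').orderBy('date')

for i, current_col in enumerate(required_columns, start=1):
    df = df.withColumn('sum_' + current_col, sum_(col(current_col)).over(sumwindow))
    # every 5 columns, cut the lineage: checkpoint() writes the current
    # result out and returns a dataframe with a fresh, short plan
    if i % 5 == 0:
        df = df.checkpoint()

df.write.format('parquet').save(output_path)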
Let us know how it goes.