Efficient way of merging multiple large DataFrames

Asked: 2018-06-16 08:28:29

Tags: python pandas dataframe merge out-of-memory

Say I have 4 small DataFrames

df1, df2, df3 and df4

import pandas as pd
from functools import reduce
import numpy as np

df1 = pd.DataFrame([['a', 1, 10], ['a', 2, 20], ['b', 1, 4], ['c', 1, 2], ['e', 2, 10]])
df2 = pd.DataFrame([['a', 1, 15], ['a', 2, 20], ['c', 1, 2]])
df3 = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 1]])  
df4 = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 15]])   


df1.columns = ['name', 'id', 'price']
df2.columns = ['name', 'id', 'price']
df3.columns = ['name', 'id', 'price']    
df4.columns = ['name', 'id', 'price']   

df1 = df1.rename(columns={'price':'pricepart1'})
df2 = df2.rename(columns={'price':'pricepart2'})
df3 = df3.rename(columns={'price':'pricepart3'})
df4 = df4.rename(columns={'price':'pricepart4'})

The above creates the 4 DataFrames that I want to merge with the code below.

# Merge dataframes
df = pd.merge(df1, df2, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
df = pd.merge(df , df3, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
df = pd.merge(df , df4, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')

# Fill na values with 'missing'
df = df.fillna('missing')

So I've achieved this for 4 DataFrames that don't have many rows and columns.

Essentially, I want to extend the outer-merge solution above to MULTIPLE (48) DataFrames of size 62245 × 3:

So I came up with this solution by building on another StackOverflow answer that used a lambda reduce:

from functools import reduce
import pandas as pd
import numpy as np
dfList = []

#To create the 48 DataFrames of size 62245 X 3
for i in range(48):

    dfList.append(pd.DataFrame(np.random.randint(0,100,size=(62245, 3)), columns=['name',  'id',  'pricepart' + str(i + 1)]))


#The solution I came up with to extend the solution to more than 3 DataFrames
df_merged = reduce(lambda  left, right: pd.merge(left, right, left_on=['name', 'id'], right_on=['name', 'id'], how='outer'), dfList).fillna('missing')

This results in a MemoryError.
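Part of what makes this reproduction blow up is worth spelling out: name and id are drawn from randint(0, 100), so each (name, id) pair repeats roughly six times per frame, and a merge emits the cartesian product of the matching rows, multiplying the row count with every frame merged in. A tiny toy illustration of that blow-up:

left = pd.DataFrame({'name': ['a'] * 3, 'id': [1] * 3, 'pricepart1': [10, 20, 30]})
right = pd.DataFrame({'name': ['a'] * 3, 'id': [1] * 3, 'pricepart2': [1, 2, 3]})

# 3 matching rows on each side -> 3 * 3 = 9 rows after the merge
print(len(pd.merge(left, right, on=['name', 'id'], how='outer')))  # prints 9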

I don't know how to keep the kernel from dying... I've been stuck on this for two days. Some code for the EXACT merge operation I'm performing that does not cause a MemoryError, or something that gives the same result, would be really appreciated.

Also, the 3 columns in the actual DataFrames (not the reproducible 48 DataFrames in the example) are of type int64, int64 and float64, and I'd prefer they stay that way, because of the integers and floats they represent.
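Since the EDIT below ends up changing dtypes anyway, here is a sketch of how such a shrink might look on the reproducible frames (this trades against the preference just stated, and picking the price column by position is an assumption about the example frames):

# Shrink numeric columns before merging, so the intermediates are smaller.
# pd.to_numeric(downcast=...) stops at float32/int8; float16 needs an explicit astype.
for d in dfList:
    d['name'] = pd.to_numeric(d['name'], downcast='integer')
    d['id'] = pd.to_numeric(d['id'], downcast='integer')
    d[d.columns[-1]] = d[d.columns[-1]].astype('float16')  # the 'pricepartN' column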

EDIT:

Instead of iteratively running the merge operations (or using the reduce lambda function), I have done it in groups of 2! Also, I've changed the datatype of some columns; some did not need to be float64, so I brought those down to float16. It gets very far, but still ends up throwing a MemoryError.

intermediatedfList = dfList
tempdfList = []

#Until I have merged all the 48 frames two at a time and the list becomes size 2
while(len(intermediatedfList) != 2):

    #If there is an even number of DataFrames
    if len(intermediatedfList) % 2 == 0:
        tempdfList = []  #Reset the auxiliary list each pass (without this, the loop never shrinks)

        #Go in steps of two
        for i in range(0, len(intermediatedfList), 2):

            #Merge the DataFrames at index i, i + 1
            df1 = pd.merge(intermediatedfList[i], intermediatedfList[i + 1], left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
            print(df1.info(memory_usage='deep'))

            #Append it to this list
            tempdfList.append(df1)

        #After merging the DataFrames in intermediatedfList two at a time via the auxiliary list tempdfList,
        #set intermediatedfList equal to tempdfList so the while loop can continue.
        intermediatedfList = tempdfList

    else:
        #If there is an odd number of DataFrames, keep the first DataFrame out
        tempdfList = [intermediatedfList[0]]

        #Go in steps of two starting from 1 instead of 0
        for i in range(1, len(intermediatedfList), 2):

            #Merge the DataFrames at index i, i + 1
            df1 = pd.merge(intermediatedfList[i], intermediatedfList[i + 1], left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
            print(df1.info(memory_usage='deep'))
            tempdfList.append(df1)

        #After merging the DataFrames in intermediatedfList two at a time via the auxiliary list tempdfList,
        #set intermediatedfList equal to tempdfList so the while loop can continue.
        intermediatedfList = tempdfList

Is there any way I can optimize my code to avoid the MemoryError? I even used an AWS instance with 192 GB of RAM (I now owe them $7 that I could've given one of y'all); it gets farther than my machine did, but still ends up throwing a MemoryError after reducing the list of 28 DataFrames down to 4.
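For what it's worth, a variation of the same pairwise idea that drops each pair of inputs as soon as it has been merged, so earlier intermediates can actually be garbage-collected (a sketch under the same data assumptions, not a guaranteed fix):

import gc

frames = dfList  # reuse the list itself; keeping other references around defeats the point
while len(frames) > 1:
    merged = []
    while len(frames) >= 2:
        a = frames.pop()
        b = frames.pop()
        merged.append(pd.merge(a, b, on=['name', 'id'], how='outer'))
        del a, b         # drop the last references so the inputs can be freed
        gc.collect()
    merged.extend(frames)  # carry an odd leftover frame over unchanged
    frames = merged

df_final = frames[0]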

4 Answers:

Answer 0 (score: 3)

You may get some mileage out of performing an index-aligned concatenation with pd.concat. This should be faster and more memory-efficient than an outer merge.

df_list = [df1, df2, ...]
for df in df_list:
    df.set_index(['name', 'id'], inplace=True)

df = pd.concat(df_list, axis=1)  # pass join='inner' to keep only keys present in every frame
df.reset_index(inplace=True)

Alternatively, you can replace the concat (the second step) with an iterative join:

from functools import reduce
df = reduce(lambda x, y: x.join(y), df_list)

This may or may not be better than the merge.
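One caveat worth flagging: DataFrame.join defaults to how='left', so to mirror the outer-merge behaviour from the question you would pass how='outer' explicitly (a minimal sketch):

from functools import reduce
# join on the shared ('name', 'id') index, keeping keys that appear in any frame
df = reduce(lambda x, y: x.join(y, how='outer'), df_list)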

Answer 1 (score: 1)

You can try a simple for loop. The only memory optimization I have applied is downcasting to the most optimal int type via pd.to_numeric.

I am also using a dictionary to store the DataFrames. This is good practice for holding a variable number of variables.

import pandas as pd

dfs = {}
dfs[1] = pd.DataFrame([['a', 1, 10], ['a', 2, 20], ['b', 1, 4], ['c', 1, 2], ['e', 2, 10]])
dfs[2] = pd.DataFrame([['a', 1, 15], ['a', 2, 20], ['c', 1, 2]])
dfs[3] = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 1]])  
dfs[4] = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 15]])   

df = dfs[1].copy()

for i in range(2, max(dfs)+1):
    df = pd.merge(df, dfs[i].rename(columns={2: i+1}),
                  left_on=[0, 1], right_on=[0, 1], how='outer').fillna(-1)
    df.iloc[:, 2:] = df.iloc[:, 2:].apply(pd.to_numeric, downcast='integer')

print(df)

   0  1   2   3   4   5
0  a  1  10  15  -1  -1
1  a  2  20  20  -1  -1
2  b  1   4  -1  -1  -1
3  c  1   2   2  -1  -1
4  e  2  10  -1  20  20
5  d  1  -1  -1  10  10
6  f  1  -1  -1   1  15

You should generally not combine strings such as 'missing' with numeric types, as this will turn your entire series into an object-dtype series. We use -1 here, but you may wish to use NaN with a float dtype instead.
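A quick way to see that effect for yourself (a small sketch):

import numpy as np
import pandas as pd

s = pd.Series([1.5, np.nan])
print(s.dtype)                     # float64
print(s.fillna('missing').dtype)   # object -- every value is now a boxed Python object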

Answer 2 (score: 1)

This seems like part of what dask DataFrames were designed for (out-of-memory operations on DataFrames). See Best way to join two large datasets in Pandas for example code. Sorry for not copying and pasting, but I don't want to seem like I'm trying to take credit from the answerer in the linked entry.
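For a flavour of what that could look like, here is a minimal sketch (assuming dask is installed; the partition count is an arbitrary choice, and chaining 48 lazy merges still builds a sizeable task graph):

import dask.dataframe as dd

# Wrap each pandas frame in a partitioned dask frame, then merge lazily
ddfs = [dd.from_pandas(df, npartitions=8) for df in dfList]

merged = ddfs[0]
for right in ddfs[1:]:
    merged = dd.merge(merged, right, on=['name', 'id'], how='outer')

result = merged.compute()  # nothing is materialized until this call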

Answer 3 (score: 0)

So you have 48 dfs, each with 3 columns: name, id, and a column that is different in each df.

You don't have to use merge at all....

Instead, if you concat all the dfs:

df = pd.concat([df1,df2,df3,df4])

You will get:

Out[3]: 
   id name  pricepart1  pricepart2  pricepart3  pricepart4
0   1    a        10.0         NaN         NaN         NaN
1   2    a        20.0         NaN         NaN         NaN
2   1    b         4.0         NaN         NaN         NaN
3   1    c         2.0         NaN         NaN         NaN
4   2    e        10.0         NaN         NaN         NaN
0   1    a         NaN        15.0         NaN         NaN
1   2    a         NaN        20.0         NaN         NaN
2   1    c         NaN         2.0         NaN         NaN
0   1    d         NaN         NaN        10.0         NaN
1   2    e         NaN         NaN        20.0         NaN
2   1    f         NaN         NaN         1.0         NaN
0   1    d         NaN         NaN         NaN        10.0
1   2    e         NaN         NaN         NaN        20.0
2   1    f         NaN         NaN         NaN        15.0

Now you can group by name and id and take the sum:

df.groupby(['name','id']).sum().fillna('missing').reset_index()

If you try it with the 48 dfs, you will see that it solves the MemoryError:

dfList = []
#To create the 48 DataFrames of size 62245 X 3
for i in range(48):
    dfList.append(pd.DataFrame(np.random.randint(0,100,size=(62245, 3)), columns=['name',  'id',  'pricepart' + str(i + 1)]))

df = pd.concat(dfList)
df.groupby(['name','id']).sum().fillna('missing').reset_index()
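One caveat (a hedged note): by default GroupBy.sum treats an all-NaN group as 0, so the fillna('missing') above may never actually fire; if you would rather keep NaN for a price part that has no values in a group, pandas 0.22+ accepts min_count:

# Keep NaN for (name, id) groups where a pricepart column had no values at all
df.groupby(['name', 'id']).sum(min_count=1).reset_index()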