I have a data frame with 90 billion transaction records. The data frame looks like -
id marital_status age new_class_desc is_child
1 Married 35 kids_sec 0
2 Single 28 Other 1
3 Married 32 Other 1
5 Married 42 kids_sec 0
2 Single 28 Other 1
7 Single 27 kids_sec 0
I want the data frame to look like -
id marital_status age is_child new_class_desc new_is_child
1 Married 35 0 kids_sec 1
2 Single 28 0 Other 0
3 Married 32 1 Other 1
5 Married 42 0 kids_sec 1
2 Single 28 1 Other 1
7 Single 27 0 kids_sec 0
I have already written the code, but the dataset is so large that the kernel dies every time:
test_df = pd.read_csv('data.csv')

def new_is_child(var1, var2, var3, var4):
    if (var1 == 'Married') and (var2 == 'kids_sec') and (var3 >= 33):
        new_var = 1
    else:
        # fall back to the row's own is_child value
        new_var = var4
    return new_var

test_df['new_is_child'] = test_df.apply(
    lambda row: new_is_child(row['marital_status'], row['new_class_desc'],
                             row['age'], row['is_child']),
    axis=1)
Is there a better way to solve this?
Answer 0: (score: 3)
For a large DataFrame, the fastest solution is numpy.where with the boolean mask built from the underlying NumPy arrays:
m = ((df['marital_status'].values == 'Married') &
     (df['new_class_desc'].values == 'kids_sec') &
     (df['age'].values >= 33))
df['new_is_child'] = np.where(m, 1, df['is_child'])
print (df)
id marital_status age new_class_desc is_child new_is_child
0 1 Married 35 kids_sec 0 1
1 2 Single 28 Other 0 0
2 3 Married 32 Other 1 1
3 5 Married 42 kids_sec 0 1
4 2 Single 28 Other 1 1
5 7 Single 27 kids_sec 0 0
Performance:
np.random.seed(2019)
N = 1000000
df = pd.DataFrame({'marital_status': np.random.choice(['Married', 'Single'], N),
                   'age': np.random.randint(20, 80, N),
                   'new_class_desc': np.random.choice(['kids_sec', 'Other'], N),
                   'is_child': np.random.choice([0, 1], N)})
In [301]: %%timeit
...: m = (df['marital_status'].values == 'Married') & (df['new_class_desc'].values == 'kids_sec') & (df['age'].values >=33)
...: df['new_is_child'] = np.where(m, 1, df['is_child'])
...:
55.4 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [300]: %%timeit
...: cond = (df['marital_status'] == 'Married') & (df['new_class_desc'] == 'kids_sec') & (df['age'] >= 33)
...: df.loc[cond, 'new_is_child'] = 1
...: df['new_is_child'] = df['new_is_child'].fillna(df['is_child'])
...:
148 ms ± 503 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [301]: %%timeit
...: condition = ~((df['marital_status'] == 'Married') &\
...: (df['new_class_desc'] == 'kids_sec') &\
...: (df['age'] >= 33))
...:
...: df['new_col'] = df.loc[:, 'is_child']
...:
...: df.loc[:, 'new_col'] = df.where(condition, 1)
...:
926 ms ± 7.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
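The speedup in the first variant comes from comparing on `.values`: the comparisons then run on plain NumPy arrays and skip pandas' per-operation index alignment. A minimal sketch of the same mask-building on made-up toy rows (not the benchmark data above):

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({'marital_status': ['Married', 'Single', 'Married'],
                   'new_class_desc': ['kids_sec', 'Other', 'kids_sec'],
                   'age': [35, 28, 42],
                   'is_child': [0, 0, 0]})

# Comparisons on .values yield plain NumPy boolean arrays.
m = ((df['marital_status'].values == 'Married') &
     (df['new_class_desc'].values == 'kids_sec') &
     (df['age'].values >= 33))

# Rows matching the mask get 1; all others keep their is_child value.
df['new_is_child'] = np.where(m, 1, df['is_child'])
print(df['new_is_child'].tolist())  # [1, 0, 1]
```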
Answer 1: (score: 0)
Could you try the following?
cond = (test_df['marital_status'] == 'Married') & (
test_df['new_class_desc'] == 'kids_sec') & (test_df['age'] >= 33)
test_df.loc[cond, 'new_is_child'] = 1
test_df['new_is_child'] = test_df['new_is_child'].fillna(test_df['is_child'])
Output:
id marital_status age new_class_desc is_child new_is_child
0 1 Married 35 kids_sec 0 1
1 2 Single 28 Other 1 1
2 3 Married 32 Other 1 1
3 5 Married 42 kids_sec 0 1
4 2 Single 28 Other 1 1
5 7 Single 27 kids_sec 0 0
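To see why the fillna step is needed: the `.loc` assignment creates the new column with NaN wherever the condition is False, and fillna then copies is_child into those gaps. A small sketch on made-up rows (note the result dtype becomes float because of the intermediate NaN):

```python
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({'marital_status': ['Married', 'Single'],
                   'new_class_desc': ['kids_sec', 'Other'],
                   'age': [35, 28],
                   'is_child': [0, 1]})

cond = ((df['marital_status'] == 'Married') &
        (df['new_class_desc'] == 'kids_sec') &
        (df['age'] >= 33))

# Assigning through .loc creates 'new_is_child' with NaN
# for every row where cond is False.
df.loc[cond, 'new_is_child'] = 1

# fillna copies the row's is_child value into those NaN slots.
df['new_is_child'] = df['new_is_child'].fillna(df['is_child'])
print(df['new_is_child'].tolist())  # [1.0, 1.0]
```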
Answer 2: (score: 0)
With df.where you can modify the data based on a single condition: wherever the condition fails, the value is replaced with the argument you pass in; otherwise the data is left unchanged.
I think you should modify the is_child data directly instead of creating a new column, since this returns a new DataFrame and the original data stays unchanged, but I suppose that depends on your use case.
df = pd.read_csv('file.txt')
print(df)
# id marital_status age new_class_desc is_child
# 0 1 Married 35 kids_sec 0
# 1 2 Single 28 Other 1
# 2 3 Married 32 Other 1
# 3 5 Married 42 kids_sec 0
# 4 2 Single 28 Other 1
# 5 7 Single 27 kids_sec 0
condition = ~((df['marital_status'] == 'Married') &\
(df['new_class_desc'] == 'kids_sec') &\
(df['age'] >= 33))
# Creating the new column, duping your original is_child.
df['new_col'] = df.loc[:, 'is_child']
# Applying your condition using df.where.
df.loc[:, 'new_col'] = df.where(condition, 1)
print(df)
# id marital_status age new_class_desc is_child new_col
# 0 1 Married 35 kids_sec 0 1
# 1 2 Single 28 Other 1 1
# 2 3 Married 32 Other 1 1
# 3 5 Married 42 kids_sec 0 1
# 4 2 Single 28 Other 1 1
# 5 7 Single 27 kids_sec 0 0
Answer 3: (score: 0)
You need the test_df['is_child'].where(~(test_df['marital_status'] == 'Married' & ...other conditions...), 1) method.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.where.html
Series.where(cond, other)
Note the negation up front. Where cond is True, the series value is returned; otherwise the other value is returned.
Answer 4: (score: 0)
One approach could be to read the csv in batches, appending each processed chunk to a new df with the required schema.
The point is to reduce the load on the kernel by processing the file in chunks, so finding a suitable csize (i.e. the chunk size) is important here. Think this would work well.
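A sketch of what this chunked approach might look like, combining pd.read_csv's chunksize parameter with the np.where logic from answer 0. The in-memory csv, the tiny csize value, and the sample rows are made up for illustration; in practice you would pass the real file path and a csize in the millions:

```python
import io
import numpy as np
import pandas as pd

# Stand-in for 'data.csv'; invented rows for illustration only.
csv_data = io.StringIO(
    "id,marital_status,age,new_class_desc,is_child\n"
    "1,Married,35,kids_sec,0\n"
    "2,Single,28,Other,0\n"
    "3,Married,42,kids_sec,0\n"
)

csize = 2  # chunk size: tune to what fits in memory
parts = []
# In practice: pd.read_csv('data.csv', chunksize=csize)
for chunk in pd.read_csv(csv_data, chunksize=csize):
    m = ((chunk['marital_status'].values == 'Married') &
         (chunk['new_class_desc'].values == 'kids_sec') &
         (chunk['age'].values >= 33))
    chunk['new_is_child'] = np.where(m, 1, chunk['is_child'])
    parts.append(chunk)

result = pd.concat(parts, ignore_index=True)
print(result['new_is_child'].tolist())  # [1, 0, 1]
```

Each chunk is processed and released independently, so peak memory stays roughly proportional to csize instead of the full file size.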